Editor’s Note: In this episode, the participants discuss the performance of four commercially available large language models. This discussion and the testing of these models do not constitute an endorsement by the U.S. Army War College, the U.S. Army, or the Department of War. All products were obtained commercially, and no company or government agency received special consideration.
In February 2026, U.S. Army War College faculty conducted a groundbreaking experiment, administering rigorous oral comprehensive exams to four prominent AI models instead of students. By testing ChatGPT, Google Gemini, Anthropic’s Claude, and xAI’s Grok, the team benchmarked how advanced artificial intelligence handles complex strategic thinking.
Kevin Boyce and John Nagl join host Tom Spahr to discuss their methodology and remarkable findings. While all four commercial models passed, one comfortably stood out. However, the researchers discovered a critical flaw: all of these digital “students” degraded during extended questioning due to technical computing limits, with their responses growing repetitive and lazy.
This project highlights that while AI is a powerful tool for historical recall, human judgment remains indispensable for high-pressure decisions. Senior leaders should treat AI like a capable but imperfect staff officer—asking the right questions while carefully verifying the output.
You can read the article Can AI Pass the U.S. Army War College? by Kevin Boyce, John Nagl and Kris Wheaton here.
You can find the manuscript Responsibly Pursuing Generative Artificial Intelligence (GenAI) for the War Fighter by Blair Wilcox and Anthony Pfaff here.
Professor Kris Wheaton said about two years ago… that AI is a mediocre staff officer. I would argue now that it’s a good staff officer, but still hallucinations, probabilistic thinking, it’s gonna make mistakes.
Kevin Boyce is Director of the Futures Lab and Assistant Professor of Futures and Emerging Technology at the U.S. Army War College. A retired Marine Corps Aviation Command and Control Officer with 25 years of service, his research focuses on emerging technologies, AI benchmarking, and their application in senior military education.
John Nagl is Professor of Warfighting Studies at the U.S. Army War College. He is the author of Learning to Eat Soup with a Knife: Counterinsurgency Lessons from Malaya and Vietnam.
Thomas W. Spahr is the DeSerio Chair of Strategic and Theater Intelligence at the U.S. Army War College. He is a retired colonel in the U.S. Army and holds a Ph.D. in History from The Ohio State University. He teaches courses at the Army War College on Military Campaigning and Intelligence.
The views expressed in this presentation are those of the speakers and do not necessarily reflect those of the U.S. Army War College, U.S. Army, or Department of War.
Photo Description: (L-R) Dr. Alexandra Meise, Dr. Jadwiga Biskupska, Mr. Kevin Boyce and Dr. Kris Wheaton administering oral comprehensive exams to multiple large language models.
Photo Credit: Courtesy of the U.S. Army War College
Given that the AIs, as “staff officers,” would seem able to answer a commander’s questions immediately, fairly reliably, and without, for example, 10 months of U.S. Army War College schooling, how can the AI not be seen as the better staff officer compared with the human staff officer, who may need days or even weeks of research time to answer the commander’s questions with equal quality, reliability, and detail?
As to the suggestion I make above (that an AI “staff officer” may be able to answer a commander’s questions much more quickly than a human staff officer, as reliably, and with at least equal quality and detail), note that, from that perspective, and sometime in the near future, a single AI “staff officer” might be able to replace ALL the human staff officers.
Thus, in a single AI “staff officer,” you might be able to obtain some of the knowledge, abilities, capabilities, and assistance that you currently require, and depend upon, from (a) your manpower or personnel staff officer, (b) your intelligence, security, and information operations staff officer, (c) your operations staff officer, (d) your logistics staff officer, and so on.
Reliance on a single AI staff officer could, for example, (a) massively reduce the size of your “headquarters” and thus (b) massively improve how fast you can move, reposition, and/or hide that headquarters.
BC’s observations are rather insightful. They compel consideration.