Editor’s Note: In this episode, the participants discuss the performance of four commercially available large language models. This discussion and the testing of these models do not constitute an endorsement by the U.S. Army War College, the U.S. Army, or the Department of War. All products were obtained commercially, and no company or government agency received special consideration.
In February 2026, U.S. Army War College faculty conducted a groundbreaking experiment, administering rigorous oral comprehensive exams to four prominent AI models instead of students. By testing ChatGPT, Google Gemini, Anthropic’s Claude, and xAI’s Grok, the team benchmarked how advanced artificial intelligence handles complex strategic thinking.
Kevin Boyce and John Nagl join host Tom Spahr to discuss their methodology and remarkable findings. While all four commercial models passed, one comfortably stood out. However, the researchers discovered a critical flaw: all of these digital “students” degraded during extended questioning due to technical computing limits, with their responses growing repetitive and lazy.
This project highlights that while AI is a powerful tool for historical recall, human judgment remains indispensable for high-pressure decisions. Senior leaders should treat AI like a capable but imperfect staff officer—asking the right questions while carefully verifying the output.
You can read the article Can AI Pass the U.S. Army War College? by Kevin Boyce, John Nagl and Kris Wheaton here.
You can find the manuscript Responsibly Pursuing Generative Artificial Intelligence (GenAI) for the War Fighter by Blair Wilcox and Anthony Pfaff here.
Professor Kris Wheaton said about two years ago… that AI is a mediocre staff officer. I would argue now that it’s a good staff officer, but still hallucinations, probabilistic thinking, it’s gonna make mistakes.
Kevin Boyce is Director of the Futures Lab and Assistant Professor of Futures and Emerging Technology at the U.S. Army War College. A retired Marine Corps Aviation Command and Control Officer with 25 years of service, his research focuses on emerging technologies, AI benchmarking, and their application in senior military education.
John Nagl is Professor of Warfighting Studies at the U.S. Army War College. He is the author of Learning to Eat Soup with a Knife: Counterinsurgency Lessons from Malaya and Vietnam.
Thomas W. Spahr is the DeSerio Chair of Strategic and Theater Intelligence at the U.S. Army War College. He is a retired colonel in the U.S. Army and holds a Ph.D. in History from The Ohio State University. He teaches courses at the Army War College on Military Campaigning and Intelligence.
The views expressed in this presentation are those of the speakers and do not necessarily reflect those of the U.S. Army War College, U.S. Army, or Department of War.
Photo Description: (L-R) Dr. Alexandra Meise, Dr. Jadwiga Biskupska, Mr. Kevin Boyce and Dr. Kris Wheaton administering oral comprehensive exams to multiple large language models.
Photo Credit: Courtesy of the U.S. Army War College

