Researchers at MIT and Harvard have introduced a new approach to improving AI models, using a modified version of “Battleship” to teach them to ask sharper questions and to show that smaller models can outperform larger systems at far lower cost.
The study, conducted by MIT CSAIL and Harvard SEAS researchers, used the game as a testing ground for a broader AI challenge, examining whether language models can investigate uncertain situations by asking useful questions rather than simply responding to prompts.
Smaller AI models improve by asking sharper questions
The researchers created a “Collaborative Battleship” game in which one participant, called the captain, asks natural-language questions about where hidden ships may be located, while a second participant, the spotter, answers in real time.
The team first had more than 40 humans play the game, using their questions and yes-or-no answers to build a dataset called BattleshipQA. They then tested large language models, including GPT-5, and smaller systems such as Llama 4 Scout.
Without additional training, top models were able to finish the game in fewer turns than human players, but smaller AI models struggled to ask rational and useful questions.
The researchers found that the weakness was not only about model size, but also about how effectively the systems explored possible answers.
Llama 4 Scout jumps from 8% to 82%
To improve performance, the researchers gave the models a Monte Carlo inference strategy, a method that helps weigh different possible locations for hidden ships as new answers come in.
The change produced one of the study’s most striking results. Llama 4 Scout, a smaller language model, initially beat humans only 8% of the time. After the inference strategy was added, its win rate rose to 82%, and it outperformed GPT-5 in the game while operating at about 1% of GPT-5’s cost.
“Today’s language models are primarily optimized to answer complex queries, but it’s less clear whether they learn to ask good questions for themselves,” said Gabriel Grand, an MIT PhD student and CSAIL researcher who led the work.
“Our work shows that asking informative questions depends on the ability to predict and simulate the world,” he added.
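To make the idea concrete, the Monte Carlo strategy can be sketched roughly as follows: sample many hypothetical boards that agree with everything revealed so far, then prefer the yes-or-no question whose answer is most evenly split across those samples. This is a minimal illustration, not the researchers' implementation. The board size, ship lengths, candidate questions and every function name below are assumptions made for the example.

```python
import math
import random

SIZE = 5          # assumed board size, for illustration only
SHIPS = (3, 2)    # assumed ship lengths, for illustration only


def random_board(rng):
    """Place each ship at a random non-overlapping spot; return the occupied cells."""
    occupied = set()
    for length in SHIPS:
        while True:
            if rng.random() < 0.5:  # horizontal placement
                row = rng.randrange(SIZE)
                col = rng.randrange(SIZE - length + 1)
                cells = {(row, col + i) for i in range(length)}
            else:                   # vertical placement
                row = rng.randrange(SIZE - length + 1)
                col = rng.randrange(SIZE)
                cells = {(row + i, col) for i in range(length)}
            if not cells & occupied:
                occupied |= cells
                break
    return frozenset(occupied)


def sample_consistent_boards(observations, n_samples, rng):
    """Rejection sampling: keep random boards that agree with every revealed cell."""
    boards, attempts = [], 0
    while len(boards) < n_samples and attempts < n_samples * 200:
        attempts += 1
        board = random_board(rng)
        if all((cell in board) == is_ship for cell, is_ship in observations.items()):
            boards.append(board)
    return boards


def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_question(questions, boards):
    """Pick the question whose answer is most uncertain across the sampled boards."""
    if not boards:
        raise ValueError("no sampled boards are consistent with the observations")
    scored = []
    for text, predicate in questions.items():
        p_yes = sum(predicate(b) for b in boards) / len(boards)
        scored.append((binary_entropy(p_yes), text))
    return max(scored)


# Hypothetical candidate questions, each paired with a check against a board.
questions = {
    "Is any ship in the top row?": lambda b: any(r == 0 for r, _ in b),
    "Is any ship in the left two columns?": lambda b: any(c < 2 for _, c in b),
    "Is there a ship at (4, 4)?": lambda b: (4, 4) in b,
}

rng = random.Random(0)
observations = {(0, 0): False}  # one revealed cell so far: a miss at (0, 0)
boards = sample_consistent_boards(observations, 500, rng)
gain, question = best_question(questions, boards)
print(f"Ask: {question!r} (estimated information: {gain:.2f} bits)")
```

For a noise-free yes-or-no answer, the entropy of the answer equals the expected information gained by asking, which is why a question that splits the sampled boards close to 50/50 scores highest in this sketch.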
Code helps AI answer more accurately
The researchers also improved how AI systems handled the spotter role by converting natural-language questions into Python commands, giving the models clearer instructions for checking whether a ship was located in a specific area.
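A rough sketch of what that translation might look like: each question becomes a small Python function that is run against the true board, so the answer is computed rather than guessed. The board layout, the question wording and the generated programs below are all hypothetical. In a real system a language model would write the program text, whereas here it is hard-coded as a stand-in.

```python
# Hypothetical board visible to the spotter: 0 is water, 1 and 2 are ship ids.
BOARD = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 2],
]

# Stand-in for model output: a real system would have a language model
# generate these programs from the question text.
GENERATED_PROGRAMS = {
    "Is any part of a ship in row 0?": """
def answer(board):
    return any(cell > 0 for cell in board[0])
""",
    "Is any part of a ship in row 1?": """
def answer(board):
    return any(cell > 0 for cell in board[1])
""",
    "Is ship 2 placed vertically?": """
def answer(board):
    cols = {c for row in board for c, cell in enumerate(row) if cell == 2}
    return len(cols) == 1
""",
}

# Only these built-ins are exposed to generated code. This limits accidents
# but is NOT a security sandbox; untrusted code needs real isolation.
ALLOWED_BUILTINS = {"any": any, "len": len, "enumerate": enumerate}


def run_program(source, board):
    """Execute a generated program and return the result of its answer() function."""
    namespace = {"__builtins__": ALLOWED_BUILTINS}
    exec(source, namespace)
    return bool(namespace["answer"](board))


def spotter_answer(question, board=BOARD):
    """Answer a captain's question by running its translated program on the board."""
    return run_program(GENERATED_PROGRAMS[question], board)


for q in GENERATED_PROGRAMS:
    print(q, "->", "yes" if spotter_answer(q) else "no")
```

The appeal of this design is that the model only has to get the translation right. Once a question is expressed as code, checking it against the board is deterministic.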
The method lifted answer accuracy by 15% on average, with GPT-4o-mini posting a nearly 30% performance boost and Claude 4 Opus improving by about eight points.
The team later tested the method on “Guess Who?”, where Llama 4 Scout’s success rate climbed from 30% to more than 72%, while GPT-4o rose from 62% to 90%.
Findings point to cheaper path for stronger AI agents
The findings suggest that improving how AI agents search for information could make them more useful in research-heavy tasks, including scientific discovery, medical diagnosis, coding and mathematics.
The researchers cautioned that the game remains a simplified setting, but said the results point to a cheaper path for building capable AI agents by improving reasoning and exploration instead of relying only on larger models.
