AI Language Models Are Struggling to “Get” Math

If computers are good at anything, they are good at math. So it may come as a surprise that after much struggling, top machine learning researchers have recently made breakthroughs in teaching computers math.

Over the past year, researchers from the University of California, Berkeley, OpenAI, and Google have made rapid progress in teaching language models (algorithms similar to GPT-3 and DALL-E 2) basic math concepts. However, until very recently, language models regularly failed to solve even simple word problems, such as this one: Alice has five more balls than Bob, who has two balls after he gives four to Charlie. How many balls does Alice have?
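For illustration, the arithmetic behind that sample problem can be written out in a few lines of Python. This is only a sketch: it assumes the comparison is with Bob's current count of two balls, which is the reading the present tense "has" suggests, and the variable names are invented here. The four balls given to Charlie turn out to be a distractor.

```python
# Worked version of the sample word problem.
# Assumption: "five more balls than Bob" refers to Bob's count
# AFTER he gives four balls to Charlie.

bob_after_giving = 2      # stated directly in the problem
given_to_charlie = 4      # extra detail, not needed for the answer
alice_extra = 5           # Alice has five more balls than Bob

# Bob's earlier count can be recovered, but the question never asks for it.
bob_before_giving = bob_after_giving + given_to_charlie

alice = bob_after_giving + alice_extra
print(f"Alice has {alice} balls")
```

The difficulty for a language model lies less in the addition than in deciding which numbers matter.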

“When we say computers are very good at math, they’re very good at things that are quite specific,” says Guy Gur-Ari, a machine learning expert at Google. Computers are good at arithmetic—plugging numbers in and calculating is child’s play. But outside of formal structures, computers struggle.

Solving word problems, a task known as “quantitative reasoning,” is deceptively tricky because it requires a robustness and rigor that many other problems don’t. If any step during the process goes wrong, the answer will be wrong. While DALL-E’s impressive images may leave out fingers or create strange eyes, mistakes are more glaring when it comes to math. “When multiplying really large numbers together … they’ll forget to carry somewhere and be off by one,” says Vineet Kosaraju, a machine learning expert at OpenAI. Other mistakes made by language models are less human, such as reading 10 as a 1 and a 0 rather than as ten.

“We work on math because we find it independently very interesting,” says Karl Cobbe, a machine learning expert at OpenAI. But as Gur-Ari puts it, if a model is good at math, “it’s probably also good at solving many other useful problems.”

As machine learning models are trained on larger samples of data, they tend to grow more robust, making fewer mistakes. But scaling up only seems to go so far with quantitative reasoning—researchers realized the mistakes language models make seemed to require a more targeted approach.

Last year, two different teams of researchers, at UC Berkeley and OpenAI, released two datasets, MATH and GSM8K, respectively, which contain thousands of math problems across geometry, algebra, precalculus and more. “We basically wanted to see if it was a problem with datasets,” says Steven Basart, a researcher at the Center for AI Safety who worked on MATH. Language models were known to be bad at word problems, but how bad were they, and could they be fixed by introducing better-formatted, bigger datasets? The MATH group found just how challenging quantitative reasoning is for top-of-the-line language models, which scored less than 7 percent. (A human grad student scored 40 percent, while a math olympiad champ scored 90 percent.)

Models attacking GSM8K, which contains easier, grade-school-level problems, reached about 20 percent accuracy. The OpenAI researchers used two main techniques: fine-tuning and verification. In fine-tuning, researchers take a pre-trained language model that includes irrelevant information (Wikipedia articles on zambonis, the dictionary entry for “gusto,” etc.) and then show the model, Clockwork Orange-style, only the relevant information (math problems). Verification, on the other hand, is more of a review session. “The model gets to see a lot of examples of its own mistakes, which is really valuable,” Cobbe says.
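One way a verifier can be put to work is to sample several candidate solutions, score each one, and keep the highest-scoring candidate. The Python sketch below illustrates only that selection step. The `toy_verifier` function is a hypothetical, rule-based stand-in invented for this example: it checks the arithmetic in steps written like "2 + 5 = 7", whereas a real verifier would be a trained model that has learned from examples of correct and incorrect solutions.

```python
import re
from typing import Callable, List

# Matches simple integer steps such as "6 - 4 = 2" (operators + - * only).
STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def toy_verifier(solution: str) -> float:
    """Hypothetical stand-in for a trained verifier.

    Returns the fraction of arithmetic steps in the text that check out,
    or 0.0 if the text contains no recognizable steps.
    """
    steps = STEP.findall(solution)
    if not steps:
        return 0.0
    correct = sum(OPS[op](int(a), int(b)) == int(c) for a, op, b, c in steps)
    return correct / len(steps)


def pick_best(candidates: List[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate solution that the verifier scores highest."""
    return max(candidates, key=verifier)


# Two made-up candidate solutions to the same word problem.
candidates = [
    "Bob has 2 balls. Alice has 2 + 5 = 8 balls.",
    "Bob has 2 balls. Alice has 2 + 5 = 7 balls.",
]
best = pick_best(candidates, toy_verifier)
print(best)
```

Because the scoring function is passed in as an argument, the same selection code would work unchanged with a learned verifier.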

At the time, OpenAI predicted a model would need to be trained on 100 times more data to reach 80 percent accuracy on GSM8K. But in June, Google announced that its Minerva model had reached 78 percent accuracy with minimal scaling upwards. “It’s ahead of any of the trends that we were expecting,” Cobbe says. Basart agrees. “That’s shocking. I thought it would take longer,” he says.

Minerva uses Google’s own language model, Pathways Language Model (PaLM), fine-tuned on scientific papers from the arXiv online preprint server and other sources with formatted math. Two other strategies helped Minerva. In “chain of thought prompting,” Minerva was required to break down larger problems into more palatable chunks. The model also used majority voting—instead of being asked for one answer, it was asked to solve the problem 100 times. Of those answers, Minerva picked the most common answer.
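Majority voting over step-by-step answers can be sketched in a few lines of Python. This is a minimal illustration, not Minerva's implementation: the sample outputs are made up, and the "Final answer:" marker used to pull the answer out of each worked solution is an assumed convention.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed output convention: each sampled solution ends its reasoning
# with a line such as "Final answer: 7".
FINAL = re.compile(r"Final answer:\s*(\S+)")


def extract_answer(text: str) -> Optional[str]:
    """Pull the final answer out of one step-by-step solution, if present."""
    match = FINAL.search(text)
    return match.group(1).rstrip(".") if match else None


def majority_vote(samples: List[str]) -> Optional[str]:
    """Return the most common final answer across all sampled solutions."""
    answers = [a for a in map(extract_answer, samples) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Made-up samples standing in for repeated attempts at one problem.
samples = [
    "Bob has 2 balls. Alice has 2 + 5 = 7. Final answer: 7",
    "Bob had 6, gave away 4, so he has 2. 2 + 5 = 7. Final answer: 7",
    "Bob has 4 balls. 4 + 5 = 9. Final answer: 9",
    "I am not sure how to proceed.",
]
winner = majority_vote(samples)
print(winner)
```

Solutions that never state a final answer are simply left out of the count, so a single derailed attempt cannot outvote several attempts that agree.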

The gains from these new strategies were enormous. Minerva shot up to 50 percent accuracy on MATH and nearly 80 percent accuracy on GSM8K, as well as the MMLU, a more general set of STEM questions that included chemistry and biology. When Minerva was asked to redo a random sample of slightly tweaked questions, it performed just as well, suggesting that its capabilities were not from mere memorization.

What Minerva knows, or doesn’t know, about math is fuzzier. Unlike proof assistants, which come with built-in structure, Minerva and other language models have no formal structure. They can have strange, messy reasoning and still arrive at the right answer. As numbers grow larger, the language models’ accuracy falters, something that would never happen on a TI-84.

“Just how smart is it? Or isn’t it?” asks Cobbe. Though models like Minerva might arrive at the same answer as a human, the actual process they’re following could be wildly different. On the other hand, “chain of thought prompting” is familiar to any human student who’s been asked to “show your work.”

“I think there’s this notion humans doing math have some rigid reasoning system that there’s a sharp distinction between knowing something and not knowing something,” says Ethan Dyer, a machine learning expert at Google. But humans give inconsistent answers, make errors, and fail to apply core concepts too. The borders, at this frontier of machine learning, are blurred.