DeepMind pairs an LLM with a verifier AI model to efficiently solve challenging mathematical problems

2023-12-15

DeepMind's artificial intelligence research division claims to have cracked a seemingly unsolvable mathematical problem using a chatbot built on a large language model (LLM). The chatbot is paired with an automated fact-checker that filters out useless outputs.

By using the filter, DeepMind researchers say the LLM can generate millions of responses but only submit those that can be verified as accurate.

This is a milestone because previous DeepMind breakthroughs have typically relied on AI models purpose-built for a single task, such as forecasting the weather or designing new protein structures. Those models are trained on highly accurate, domain-specific datasets, which sets them apart from LLMs like OpenAI's GPT-4 or Google's Gemini.

LLMs, by contrast, are trained on extensive and diverse datasets, enabling them to perform a wide range of tasks and discuss almost any topic. But this breadth carries a risk: LLMs are prone to "hallucinations," the term for confidently generated false outputs.

Hallucinations are a major problem for LLMs. Gemini, the model Google released this month and bills as its most capable to date, has already shown this vulnerability by answering some fairly simple questions incorrectly, such as who won this year's Oscars.

Researchers believe hallucinations can be mitigated by adding a layer on top of the AI model that verifies the accuracy of its outputs before passing them on to users. Building such a safety net is challenging, however, when an LLM is trained to discuss such a wide range of topics.

At DeepMind, Alhussein Fawzi and his team have built a system called FunSearch around an LLM based on Google's PaLM 2 model, adding a fact-checking layer they call the "evaluator." FunSearch is restricted to mathematical and computer science problems, which it tackles by generating computer code. According to DeepMind, this restriction makes the fact-checking layer practical, because the generated code's outputs can be quickly verified.

Although the underlying model is still prone to hallucinations and to producing inaccurate or misleading results, the evaluator filters these out, ensuring that users receive only reliable outputs.

Fawzi said: "We believe that perhaps 90% of what the LLM outputs is useless. Given a candidate solution, it is easy to tell whether it is actually correct and to evaluate it, but coming up with a solution in the first place is hard. That is why mathematics and computer science are a particularly good fit."

According to Fawzi, FunSearch is capable of generating new scientific knowledge and ideas, which is a new milestone for LLMs.

Researchers tested its capabilities by giving it a problem along with very basic source code for an initial solution. The model then generated new candidate solutions, which were stored in a database and checked for accuracy by the evaluator. The most reliable solutions were fed back into the LLM along with a prompt asking it to improve on them. According to Fawzi, FunSearch produced millions of potential solutions through this loop, ultimately converging on the most effective results.
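The loop described here can be sketched in a few lines of Python. The `propose` and `evaluate` functions below are illustrative stand-ins, not DeepMind's components: in the real system an LLM proposes whole programs and the evaluator scores them by running them, whereas here a simple numeric perturbation and a toy scoring target take their place.

```python
import random

def propose(parent):
    # Stand-in for the LLM: generate a variation of an existing candidate.
    return parent + random.uniform(-1.0, 1.0)

def evaluate(candidate):
    # Stand-in for the evaluator: higher scores are better.
    # The target value 3.0 is purely hypothetical.
    return -abs(candidate - 3.0)

def search(seed=0.0, rounds=200, pool_size=5):
    pool = [seed]
    for _ in range(rounds):
        parent = random.choice(pool)   # pick a surviving candidate
        pool.append(propose(parent))   # ask the "LLM" for a new variation
        # keep only the best-scoring candidates and feed them back in
        pool.sort(key=evaluate, reverse=True)
        pool = pool[:pool_size]
    return pool[0]

random.seed(0)
best = search()
```

Even this toy version shows why the loop converges: each round, only candidates the evaluator scores highly survive to seed the next generation.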

When tasked with a mathematical problem, FunSearch writes computer code to find the solution instead of directly attempting to solve it.

Fawzi and his team had FunSearch tackle the cap set problem, which asks for the largest set of points in a grid such that no three of them lie on a straight line. As the number of dimensions grows, the problem becomes extremely complex.
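The no-three-on-a-line condition is easy to check directly in the standard setting of this problem, where points have coordinates in {0, 1, 2} and three distinct points are collinear exactly when their coordinates sum to a multiple of 3 in every dimension. The sketch below is a minimal brute-force checker for illustration, not DeepMind's evaluator:

```python
from itertools import combinations

def is_cap_set(points):
    # Points are tuples with coordinates in {0, 1, 2}. Three distinct
    # points a, b, c lie on a common line exactly when, in every
    # dimension, their coordinates sum to 0 mod 3.
    for a, b, c in combinations(points, 3):
        if all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c)):
            return False
    return True

# A valid 4-point example in 2 dimensions...
good = [(0, 0), (0, 1), (1, 0), (1, 1)]
# ...and a set ruined by the line (0,0), (1,1), (2,2).
bad = [(0, 0), (1, 1), (2, 2)]
```

Checking a candidate set is fast; the hard part, which the search loop takes on, is constructing a large set that passes the check.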

FunSearch, however, found a cap set of 512 points in eight dimensions, larger than any construction previously known to human mathematicians. The results of this experiment were published in the journal Nature.

While most people may never encounter the cap set problem, let alone attempt to solve it, this is an important achievement. Even the most accomplished mathematicians do not agree on the best way to approach it. Professor Terence Tao of the University of California, Los Angeles, has described the cap set problem as his "favorite open question," and he believes FunSearch is a very "promising paradigm" that may be applicable to many other mathematical problems.

FunSearch proved that point when assigned the bin packing problem, finding a heuristic that outperformed well-established algorithms designed specifically for the task. Its results could have significant implications for industries such as transportation and logistics.
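For context, a classic baseline for this kind of packing task is the first-fit heuristic, one of the standard rules evolved heuristics are measured against. The sketch below is a textbook version of first-fit, not DeepMind's code; the example items and capacity are invented for illustration.

```python
def first_fit(items, capacity):
    # Place each item in the first open bin with enough room,
    # opening a new bin when none fits. Returns the bin count.
    bins = []  # remaining capacity of each open bin
    for item in items:
        for i, space in enumerate(bins):
            if item <= space:
                bins[i] = space - item
                break
        else:  # no existing bin could hold the item
            bins.append(capacity - item)
    return len(bins)

# Hypothetical instance with integer sizes and bin capacity 10:
bins_used = first_fit([5, 7, 5, 2, 4, 2, 5, 1, 6], capacity=10)  # → 5
# An optimal packing needs only 4 bins: [7,2,1], [6,4], [5,5], [5,2].
```

The gap between the heuristic's 5 bins and the optimal 4 on even this tiny instance is exactly the kind of slack a better-evolved packing rule can recover.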

It is also worth noting that, unlike most LLMs, FunSearch lets users see how it arrives at its outputs, in the form of the generated code, so they can learn from it. This sets it apart from other LLMs, which operate more like "black box" AI.