Improving LLM Accuracy with Self-Verification: DeepMind's GenRM

2024-09-04

Large language models (LLMs) are prone to factual and logical errors on complex reasoning tasks. To address this, researchers often use verifiers or reward models to evaluate an LLM's candidate outputs and select the most accurate one.


In a recent paper, researchers from Google DeepMind, the University of Toronto, Mila, and UCLA introduced a novel approach called GenRM, which leverages the generative capabilities of LLMs to build more effective verifiers. GenRM can serve as a practical tool for LLM applications where current verification methods fall short.


Limitations of classical verifiers and reward models


One common way to improve the accuracy of LLMs is to have them generate multiple candidate answers and then use a separate component to select the best one. This approach depends on having a reliable verifier or reward model.
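To make the pattern concrete, here is a minimal best-of-N sketch in Python; `generate_fn` and `verifier_score` are hypothetical stand-ins for a candidate generator and a verifier, not APIs from the paper.

```python
from typing import Callable, List, Tuple

def best_of_n(
    problem: str,
    generate_fn: Callable[[str, int], List[str]],   # hypothetical: returns n candidate solutions
    verifier_score: Callable[[str, str], float],    # hypothetical: returns a correctness score in [0, 1]
    n: int = 8,
) -> Tuple[str, float]:
    """Generate n candidate solutions and return the one the verifier rates highest."""
    candidates = generate_fn(problem, n)
    scored = [(sol, verifier_score(problem, sol)) for sol in candidates]
    # The verifier, not the generator, decides which candidate is returned.
    return max(scored, key=lambda pair: pair[1])
```

The whole scheme is only as good as `verifier_score`, which is the gap GenRM aims to close.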


For reasoning tasks, LLM-based verifiers are typically trained as discriminative reward models (RMs), which assign a numerical score to each candidate solution and classify it as correct or incorrect based on that score. However, these RMs do not take full advantage of the text-generation abilities that LLMs are built around.


"Although classical reward models (RMs)/validators are trained by fine-tuning LLMs, they do not leverage the text generation capabilities that LLMs are inherently designed for," said Rishabh Agarwal, Senior Research Scientist at DeepMind, to VentureBeat.


Another popular technique is LLM-as-a-Judge, which uses prompting to have a model evaluate responses. While flexible, LLM-as-a-Judge cannot absorb domain knowledge through training the way reward models can.


Generative reward models


DeepMind's GenRM takes a different approach: it trains verifiers via next-token prediction, thereby leveraging the text-generation capabilities of LLMs.


"Training RMs by predicting the next token allows them to harness the numerous advantages of generative LLMs," said Agarwal. "We demonstrate how the same model can both validate and generate solutions by 'thinking more' before validation through the use of chains of thought and increasing computational cost during testing to improve accuracy."


In GenRM, verification decisions are represented as tokens. To produce a numerical score for a solution, the verifier is prompted with a question such as "Is the answer correct?" and the score is read off as the probability of a single text token ("Yes" or "No") given the context and prompt.
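A minimal sketch of that scoring step, assuming access to the verifier's next-token logits (the function and the toy values below are illustrative only):

```python
import math

def yes_probability(next_token_logits: dict) -> float:
    """Turn raw next-token logits into a correctness score.

    `next_token_logits` maps token strings to logits; the score is the softmax
    weight of "Yes" relative to "No" as the next token after the prompt.
    """
    yes, no = next_token_logits["Yes"], next_token_logits["No"]
    return math.exp(yes) / (math.exp(yes) + math.exp(no))

# Toy logits for illustration only.
print(yes_probability({"Yes": 2.3, "No": -0.4}))  # ~0.94 -> solution judged likely correct
```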


Since verification often involves complex reasoning, generative verifiers naturally benefit from prompting techniques such as chain-of-thought (CoT) reasoning, in which the model lays out its thought process before giving an answer.


"Specifically, we can generate intermediate reasoning steps or critiques (CoT) before making a decision on the correctness of a solution, which may uncover subtle reasoning errors missed by direct validators," the researchers wrote.





The chain-of-thought rationales used to train GenRM can be written by humans or generated by another LLM. At inference time, GenRM first generates a CoT rationale and then assigns a correctness score based on the probability of the "Yes" token.
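A sketch of that two-step inference flow, with `generate_rationale` and `yes_probability` standing in for calls to the verifier model (both names are hypothetical):

```python
from typing import Callable

def verify_with_cot(
    problem: str,
    solution: str,
    generate_rationale: Callable[[str], str],   # hypothetical: samples a CoT critique from the verifier
    yes_probability: Callable[[str], float],    # hypothetical: P("Yes") given the full context
) -> float:
    """Two-step GenRM-style verification: think first, then score."""
    context = (
        f"Problem: {problem}\n"
        f"Proposed solution: {solution}\n"
        "Let's verify step by step."
    )
    rationale = generate_rationale(context)                               # step 1: write a critique
    decision_prompt = f"{context}\n{rationale}\nIs the answer correct (Yes/No)?"
    return yes_probability(decision_prompt)                               # step 2: score = P("Yes")
```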


The researchers further improved the accuracy of the CoT verifier through majority voting: they sample multiple CoT rationales and average the "Yes" score across the samples, putting additional test-time compute to effective use.
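A sketch of that voting scheme, assuming a hypothetical `sample_rationale_and_score` helper that draws one CoT critique and returns its "Yes" probability:

```python
from statistics import mean
from typing import Callable

def cot_majority_vote(
    problem: str,
    solution: str,
    sample_rationale_and_score: Callable[[str, str], float],  # hypothetical: one CoT sample -> P("Yes")
    num_samples: int = 32,
) -> float:
    """Average the "Yes" probability over several independently sampled CoT critiques.

    Spending more test-time compute (more samples) yields a lower-variance
    estimate of the solution's correctness.
    """
    scores = [sample_rationale_and_score(problem, solution) for _ in range(num_samples)]
    return mean(scores)
```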


"GenRM can be seen as combining 'LLM as a referee' with classical validators: it corresponds to 'LLM as a referee' trained on specific domain validation data," said Agarwal. "Therefore, GenRM is applicable to any domain where existing prompted LLMs fall short."


Practical applications of GenRM


To evaluate GenRM, the DeepMind researchers applied it to several reasoning tasks, including last-letter concatenation, word sorting, and math word problems. They compared GenRM against standard baselines: discriminative reward models, LLM-as-a-Judge, and self-consistency, in which the model generates multiple answers and picks the most common one as the final response.
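For reference, the self-consistency baseline reduces to a majority vote over sampled final answers, roughly as follows:

```python
from collections import Counter
from typing import List

def self_consistency(answers: List[str]) -> str:
    """Pick the most frequent final answer among sampled solutions.

    Unlike GenRM, no verifier is involved: agreement between samples is the
    only signal of correctness.
    """
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(["84", "84", "74", "84"]))  # -> "84"
```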


Across all tasks, GenRM with CoT outperformed the other methods by several percentage points, including discriminative reward models trained specifically for those tasks. On the GSM8K math reasoning benchmark, a Gemma-9B model trained with GenRM solved 92.8% of the problems, surpassing GPT-4 and Gemini 1.5 Pro.





"By unifying solution generation and validation, as GenRM does by training validators to predict the next token, validation performance consistently improved across all tasks," the researchers wrote. "This improvement was observed in both direct and CoT-based generative validators, indicating that teaching validators to mimic correct solutions is often helpful."


The experiments also showed that GenRM's performance scales with dataset size and model capacity. Letting GenRM sample more verification rationales improves performance further, giving LLM developers more flexibility in trading off accuracy against compute cost.


"GenRM (trained by jointly generating and validating) can still outperform traditional validators using the same data, and the training of GenRM is just standard fine-tuning," said Agarwal. "However, to fully leverage the capabilities of GenRM, we need critical/validation reasoning that explains the reward labels. For high-quality data, this can be done by humans, but a more scalable option is to use synthetic reasoning generated by LLMs."


Possible future directions for GenRM include extending synthetic verification rationales to open-ended generation tasks, integrating GenRM into reinforcement learning pipelines, and drawing on advanced LLM capabilities such as few-shot learning, retrieval-augmented generation, ReAct, and code generation and execution to strengthen verification.