Improving LLM Reasoning Performance: DeepMind and UC Berkeley Research

2024-08-27

Given the high cost and slow pace of training large language models (LLMs), researchers have been asking whether LLM performance can be improved by spending more computation at inference time, without any retraining.


In a new study, researchers from DeepMind and the University of California, Berkeley explore methods to improve the performance of LLMs by strategically allocating computational resources during inference. In their paper, they demonstrate that substantial performance improvements can be achieved by optimizing how computation is used at inference time, without increasing the model size or doing additional pretraining.


The trade-off between inference-time and pretraining computation

The main methods to improve LLM performance involve increasing the model size and pretraining computation. However, this approach has its limitations: larger models are more expensive to train and require more resources to run, which can make them impractical to deploy in some environments, such as resource-constrained devices.

Another option is to use more computation during inference to improve the accuracy of LLMs on challenging prompts. This approach allows for the deployment of smaller LLMs while still achieving performance comparable to larger, more computationally expensive models.

The question is: if an LLM is given a fixed budget of inference-time computation, how should that budget be spent across different inference methods to achieve the best performance, and how does the result compare to simply using a larger pretrained model?

The most popular method for scaling test-time computation is best-of-N sampling, where the model generates N outputs in parallel and a verifier selects the highest-scoring one as the final answer. However, there are other ways to leverage inference-time computation to improve LLMs. For example, instead of generating multiple responses in parallel, the model can revise and correct its response over multiple consecutive steps. Parallel and sequential sampling can also be combined with various verification strategies and search algorithms, yielding a richer landscape of inference-time optimization strategies.
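To make the contrast concrete, here is a minimal sketch of a best-of-N loop in Python. The `generate_response` and `score_response` functions are hypothetical stand-ins for an LLM sampling call and a verifier score; they are not part of the paper's code.

```python
import random

def generate_response(prompt: str) -> str:
    """Hypothetical stand-in for sampling one answer from an LLM."""
    return f"candidate answer {random.randint(0, 9999)} for: {prompt}"

def score_response(response: str) -> float:
    """Hypothetical stand-in for a verifier/reward-model score in [0, 1]."""
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    """Sample N answers in parallel and keep the highest-scoring one."""
    candidates = [generate_response(prompt) for _ in range(n)]
    return max(candidates, key=score_response)

print(best_of_n("What is 17 * 24?", n=8))
```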

To determine the optimal inference-time strategy, the researchers define the "test-time compute-optimal scaling strategy" as "the strategy that chooses hyperparameters corresponding to a given test-time strategy for maximal performance benefits on a given prompt at test time."

The researchers write, "Ideally, test-time computation should modify the distribution to generate better outputs than directly sampling from the LLM itself."

Utilizing inference-time computation in different ways


The researchers explore two main strategies for improving LLM performance with inference-time computation. The first strategy focuses on modifying the proposal distribution, that is, the process by which the LLM generates responses. This can be achieved by fine-tuning the LLM to iteratively revise and refine its answers in complex reasoning settings.
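As a rough illustration of sequential revision (not the paper's fine-tuned revision model), the sketch below chains answers so that each attempt conditions on the previous one; `initial_answer` and `revise_response` are hypothetical placeholders for model calls.

```python
def initial_answer(prompt: str) -> str:
    """Hypothetical stand-in for the model's first attempt."""
    return f"first attempt at: {prompt}"

def revise_response(prompt: str, previous: str) -> str:
    """Hypothetical stand-in for asking a revision-tuned model to
    improve its previous attempt, conditioned on the prompt."""
    return previous + " (revised)"

def sequential_revisions(prompt: str, steps: int = 4) -> list[str]:
    """Generate a chain of answers, each conditioned on the one before."""
    answers = [initial_answer(prompt)]
    for _ in range(steps - 1):
        answers.append(revise_response(prompt, answers[-1]))
    return answers  # a verifier can then pick the best answer in the chain

for attempt in sequential_revisions("Prove that the sum of two even numbers is even."):
    print(attempt)
```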

The second strategy involves optimizing the verifier, the mechanism used to select the best answer from the generated responses. This can be accomplished by training a process-based reward model that evaluates the correctness of the individual steps of an answer.
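The sketch below shows one way a process-based reward model could be used to pick among candidate solutions. The per-step scorer and the choice of aggregating step scores by their minimum are illustrative assumptions, not details taken from the paper.

```python
import random

def score_step(prompt: str, steps_so_far: list[str]) -> float:
    """Hypothetical stand-in for a process reward model (PRM) that scores
    the correctness of the latest step, given the prompt and prior steps."""
    return random.random()

def solution_score(prompt: str, steps: list[str]) -> float:
    """Aggregate per-step PRM scores into a single solution-level score.
    Taking the minimum step score is one illustrative aggregation choice."""
    return min(score_step(prompt, steps[: i + 1]) for i in range(len(steps)))

def pick_best_solution(prompt: str, solutions: list[list[str]]) -> list[str]:
    """Select the candidate whose step-wise scores are strongest overall."""
    return max(solutions, key=lambda steps: solution_score(prompt, steps))

candidates = [
    ["compute 17 * 20 = 340", "compute 17 * 4 = 68", "add: 340 + 68 = 408"],
    ["compute 17 * 24 = 398"],  # a flawed one-step attempt
]
print(pick_best_solution("What is 17 * 24?", candidates))
```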

To evaluate their methods, the researchers conducted experiments with the PaLM-2 model on the challenging MATH benchmark.

The researchers write, "We find that the effectiveness of specific test-time computation strategies depends heavily on the nature of the specific problem at hand and the underlying LLM being used."

For easier problems where the underlying LLM is already capable of producing reasonable responses, allowing the model to iteratively refine its initial answer proves more effective than generating multiple samples in parallel. For more challenging problems that require exploring different solution strategies, they found that sampling multiple responses in parallel or using process-based reward models to guide tree search is more effective.

"This finding underscores the necessity of deploying adaptive 'computation-optimal' strategies to extend test-time computation, selecting specific ways of utilizing test-time computation based on prompts to make the most of the additional computation," the researchers write.

By appropriately allocating test-time computation, the researchers were able to significantly improve performance, surpassing the best-of-N baseline while using only about a quarter of the computation.

Balancing test-time computation and pretraining computation


The researchers also investigated the extent to which test-time computation can substitute for additional pretraining. They compared a smaller model augmented with additional test-time computation against a model roughly 14 times larger that had undergone more pretraining.

For easier and moderately difficult problems, the performance of the smaller model with additional test-time computation is comparable to that of the larger pretrained model.
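For intuition about what such a comparison involves, the back-of-the-envelope sketch below uses the common approximations of roughly 6ND FLOPs for pretraining a model with N parameters on D tokens and about 2N FLOPs per token generated at inference; the model sizes and token counts are made-up illustrative numbers, not figures from the paper.

```python
def pretraining_flops(params: float, tokens: float) -> float:
    """Rough approximation: ~6 * N * D FLOPs to pretrain N params on D tokens."""
    return 6 * params * tokens

def inference_flops(params: float, tokens_generated: float) -> float:
    """Rough approximation: ~2 * N FLOPs per token generated at inference."""
    return 2 * params * tokens_generated

# Illustrative (made-up) numbers: a 1B-parameter model vs. a 14x larger 14B model,
# both pretrained on the same number of tokens.
small, large, pretrain_tokens = 1e9, 14e9, 1e12
extra_pretrain = pretraining_flops(large, pretrain_tokens) - pretraining_flops(small, pretrain_tokens)

# How many extra inference tokens could the small model generate with that budget?
extra_tokens = extra_pretrain / (2 * small)
print(f"The 14x model's extra pretraining budget ~= {extra_tokens:.2e} inference tokens for the small model")
```

Whether spending that budget at inference time pays off depends, among other things, on how many queries the deployed model ultimately has to serve.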

"This finding suggests that, instead of purely focusing on scaling up pretraining, in certain settings, it is more desirable to pretrain smaller models using less computation and then apply test-time computation to improve model outputs more effectively," the researchers write.

However, for the most challenging problems, additional pretraining computation proves more effective. This indicates that the current methods of extending test-time computation may not be a perfect replacement for scaling up pretraining in all scenarios.

The researchers suggest several future research directions, including exploring more complex strategies that combine different revision and search techniques, as well as developing more efficient methods to estimate problem difficulty.

"Overall, [our research] demonstrates that even with fairly simple methodological choices, increasing test-time computation is already more desirable than increasing pretraining, and as test-time strategies mature, further improvements can be expected," the researchers write. "In the long run, this suggests a future direction of reducing floating-point operation counts (FLOPs) during pretraining and increasing the use of FLOPs during inference."