OREO: Offline Reasoning Optimization to Enhance Multi-Step Inference in Large Language Models

2024-12-24

Large language models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks, but they still struggle with multi-step reasoning. This limitation is particularly evident in complex settings such as mathematical problem-solving, agent control, and web navigation. Online reinforcement learning (RL) methods like Proximal Policy Optimization (PPO) have been applied to the problem, but their high computational and data costs limit their practicality. Methods like Direct Preference Optimization (DPO) can align models with human preferences effectively, yet they fall short on multi-step reasoning tasks: DPO relies on paired preference data and treats entire responses uniformly, which weakens its ability to assign credit when rewards are sparse. These obstacles highlight the need for more targeted and efficient ways to strengthen LLMs' reasoning capabilities.

In response to these limitations, researchers from the University of California, San Diego, Tsinghua University, Salesforce Research, and Northwestern University developed OREO (Offline Reasoning Optimization), an offline RL method tailored to the multi-step reasoning challenges faced by LLMs. Built on insights from maximum entropy reinforcement learning, OREO optimizes the soft Bellman equation to train the policy model and a value function jointly. This removes the need for paired preference data and allows training on unpaired, sparsely rewarded datasets. OREO also provides fine-grained credit assignment across reasoning trajectories, which is crucial when success hinges on a few key steps. The framework extends naturally to iterative exploration settings, and the learned value function can be combined with tree search to improve reasoning at test time.
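To make the joint policy-and-value training idea concrete, here is a minimal PyTorch-style sketch of a soft-Bellman consistency loss in the spirit described above. The function name, tensor layout, the KL coefficient `beta`, and the terminal-only reward are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def soft_bellman_consistency_loss(logp_policy, logp_ref, values, final_reward, beta=0.1):
    """Sketch of a soft-Bellman consistency loss for one reasoning trajectory.

    logp_policy, logp_ref: (T,) log-probabilities of each reasoning step under
        the policy being trained and a frozen reference model.
    values: (T + 1,) value estimates V(s_t), including the terminal state.
    final_reward: scalar sparse reward observed only at the end (e.g. 1.0 if
        the final answer is correct, 0.0 otherwise).
    """
    T = logp_policy.shape[0]
    rewards = torch.zeros(T)
    rewards[-1] = final_reward                 # sparse, outcome-level reward

    # At the optimum of KL-regularized (maximum entropy) RL:
    #   beta * log(pi / pi_ref) = r_t + V(s_{t+1}) - V(s_t)
    kl_term = beta * (logp_policy - logp_ref)
    bellman_target = rewards + values[1:] - values[:-1]

    # Penalize violations of this identity to update policy and value model
    # together; in practice each update would likely detach the other
    # network's output (an assumption of this sketch, not stated above).
    return F.mse_loss(kl_term, bellman_target)
```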

The core innovation of OREO lies in optimizing the soft Bellman equation to train the policy and value models together, which yields finer-grained credit assignment over reasoning steps and overcomes the limitations of methods like DPO. OREO offers both step-level and response-level objectives, giving flexibility across different granularities of reasoning tasks. At test time, the learned value function supports search techniques such as beam search to improve accuracy. Compared to baselines like supervised fine-tuning or rejection sampling, OREO can also learn from failed trajectories, which improves robustness and adaptability and makes it especially valuable in iterative, multi-step reasoning settings.
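As a rough illustration of how the learned value function can steer test-time search, the sketch below performs step-level beam search over reasoning traces. The helpers `generate_step` (proposing candidate next steps from the policy) and `value_fn` (the trained value model's score for a partial trace) are hypothetical names introduced only for this example.

```python
def value_guided_beam_search(generate_step, value_fn, prompt, beam_width=4, max_steps=8):
    """Keep the beam_width partial reasoning traces the value function rates
    highest at each depth, rather than greedily committing to one continuation.
    """
    beams = [prompt]
    for _ in range(max_steps):
        candidates = []
        for prefix in beams:
            # Propose several candidate next reasoning steps for this prefix.
            for step in generate_step(prefix, beam_width):
                candidates.append(prefix + step)
        # Rank all expanded traces with the learned value function.
        candidates.sort(key=value_fn, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # highest-value trace found within the step budget
```

Greedy decoding corresponds to always taking the single most likely continuation; ranking partial traces with a value estimate instead is what lets the search back away from locally plausible but ultimately wrong steps.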

OREO has been evaluated on mathematical reasoning benchmarks such as GSM8K and MATH, as well as the agent control benchmark ALFWorld. On GSM8K, a 1.5-billion-parameter model trained with OREO improved accuracy by 5.2% over SFT; on MATH it improved by 10.5%, reaching 52.5% accuracy without using augmented question sets. In ALFWorld, OREO delivered a 17.7% relative improvement in unseen environments, highlighting its generalization ability. Iterative training amplifies these gains, with accuracy continuing to rise over successive rounds, whereas methods like rejection sampling show diminishing returns; OREO keeps improving because it can learn from failed attempts. Finally, test-time search guided by the OREO value function raised MATH accuracy by up to 17.9% over greedy decoding, further underlining its impact on reasoning quality.
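The iterative setting mentioned above can be pictured as a simple collect-and-retrain loop. The sketch below is illustrative only; the helper names `sample_trajectories` and `train_oreo` are assumptions introduced for this example. The point it encodes is that failed trajectories stay in the dataset rather than being discarded, unlike rejection sampling.

```python
def iterative_training(policy, value_fn, problems, sample_trajectories, train_oreo, rounds=3):
    """Sketch of iterative offline training: collect, aggregate, retrain."""
    dataset = []
    for _ in range(rounds):
        # Sample new reasoning trajectories with the current policy.
        # Both successful and failed attempts are kept, since failures
        # still carry training signal for the offline RL objective.
        dataset += sample_trajectories(policy, problems)
        # Re-run OREO-style training of the policy and value function
        # on the accumulated offline dataset.
        policy, value_fn = train_oreo(policy, value_fn, dataset)
    return policy, value_fn
```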

In summary, OREO provides a practical and effective approach through offline RL, significantly enhancing the multi-step reasoning capabilities of LLMs. It addresses the limitations of existing methods and offers a scalable solution to improve reasoning. The integration of credit assignment, iterative training, and test-time search makes OREO a versatile tool for tackling complex reasoning challenges. The research results demonstrate OREO's potential applications in various fields requiring complex problem-solving, making a significant contribution to the evolution of AI systems capable of deeper reasoning.