AIMachine LearningTechnology

Meet OREO (Offline REasoning Optimization): An Offline Reinforcement Learning Method for Enhancing LLM Multi-Step Reasoning

By capernaum
Last updated: 2024-12-24 06:30

Large Language Models (LLMs) have demonstrated impressive proficiency in numerous tasks, but their ability to perform multi-step reasoning remains a significant challenge. This limitation becomes particularly evident in complex scenarios such as mathematical problem-solving, embodied agent control, and web navigation. Traditional Reinforcement Learning (RL) methods, like Proximal Policy Optimization (PPO), have been applied to address this issue but often come with high computational and data costs, making them less practical. Likewise, methods such as Direct Preference Optimization (DPO), while effective for aligning models with human preferences, struggle with multi-step reasoning tasks. DPO’s reliance on pairwise preference data and uniform token treatment undermines its capacity to assign credit effectively in situations with sparse rewards. These obstacles highlight the need for more targeted and efficient solutions to enhance LLM reasoning capabilities.

Introducing OREO: Offline Reasoning Optimization

OREO (Offline REasoning Optimization) is an offline RL approach specifically designed to address the shortcomings of existing methods in improving multi-step reasoning for LLMs. Developed collaboratively by researchers from UC San Diego, Tsinghua University, Salesforce Research, and Northwestern University, OREO builds on insights from maximum-entropy reinforcement learning. It trains a policy model and a value function concurrently by optimizing the soft Bellman equation. This removes the dependency on pairwise preference data, making it possible to use unpaired datasets with sparse rewards. Furthermore, OREO enables precise credit assignment across reasoning trajectories, which is especially beneficial when success depends on a few critical steps. The framework can also be extended to iterative exploration setups, and its learned value function can guide tree search to improve inference at test time.
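
For reference, the soft Bellman consistency from maximum-entropy RL that OREO builds on can be written in its standard textbook form (generic notation, not taken verbatim from the paper):

\[
Q_{\text{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[ V_{\text{soft}}(s_{t+1}) \right],
\qquad
V_{\text{soft}}(s_t) = \beta \log \sum_{a} \exp\!\left( \frac{Q_{\text{soft}}(s_t, a)}{\beta} \right)
\]

with the optimal policy \( \pi^{*}(a \mid s) = \exp\!\big( ( Q_{\text{soft}}(s, a) - V_{\text{soft}}(s) ) / \beta \big) \). Jointly training the policy and value function then amounts to penalizing violations of this consistency on offline trajectories, which is the mechanism behind the per-step credit assignment described above.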

Technical Details and Benefits

OREO’s core innovation lies in optimizing the soft Bellman equation to train the policy and value models simultaneously. This strategy ensures accurate credit assignment across reasoning steps, addressing the limitations of methods like DPO. Additionally, OREO offers both step-level and response-level objectives, providing flexibility for different granularities of reasoning tasks. At test time, the learned value function supports search techniques such as beam search, improving accuracy (a sketch of this appears below). Unlike baselines such as supervised fine-tuning (SFT) or rejection sampling, OREO can leverage failed trajectories to improve model robustness and adaptability, a capacity that makes it particularly valuable for iterative multi-step reasoning tasks.
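
To make the test-time idea concrete, here is a minimal Python sketch of value-guided beam search over reasoning steps. It assumes stand-in functions propose_steps (the policy LLM proposing candidate next steps) and value_of (the learned value function scoring a partial trajectory); neither name comes from the paper, and the stubs exist only so the sketch runs end to end.

from dataclasses import dataclass, field

@dataclass
class Candidate:
    steps: list = field(default_factory=list)   # reasoning steps generated so far
    score: float = 0.0                          # value estimate of the partial trajectory

def propose_steps(steps, k):
    # Stub for the policy LLM: would return k candidate next reasoning steps.
    return [f"step{len(steps)}-{i}" for i in range(k)]

def value_of(steps):
    # Stub for the learned value function V(s_t): would score the partial trajectory.
    return float(-len(steps))  # placeholder so the script runs

def value_guided_beam_search(beam_width=4, expand_k=4, max_depth=6):
    beam = [Candidate()]
    for _ in range(max_depth):
        expanded = []
        for cand in beam:
            for step in propose_steps(cand.steps, expand_k):
                new_steps = cand.steps + [step]
                # Rank continuations by the learned value, not just token likelihood.
                expanded.append(Candidate(new_steps, value_of(new_steps)))
        beam = sorted(expanded, key=lambda c: c.score, reverse=True)[:beam_width]
    return beam[0].steps

if __name__ == "__main__":
    print(value_guided_beam_search())

The design point is that candidates are kept or pruned by the value model's estimate of where the partial reasoning is heading, rather than by the policy's own token probabilities alone.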

Results and Insights

OREO’s performance has been rigorously evaluated on benchmarks such as GSM8K and MATH for mathematical reasoning, and ALFWorld for embodied agent control. Key findings include:

  • On GSM8K, OREO delivered a 5.2% relative improvement in accuracy with a 1.5B-parameter model compared to SFT, alongside a 10.5% relative improvement on MATH.
  • The same 1.5B model reached 52.5% accuracy on MATH without using an augmented problem set.
  • In ALFWorld, OREO achieved a 17.7% relative improvement in unseen environments, underscoring its ability to generalize beyond training data.

Iterative training further amplified OREO’s effectiveness, showing consistent accuracy gains over multiple iterations. While approaches like rejection sampling exhibited diminishing returns, OREO continued to improve by incorporating insights from failed attempts. Test-time search using OREO’s value function resulted in up to a 17.9% relative improvement over greedy decoding on the MATH dataset, highlighting its impact on inference quality.
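
The iterative setup described above can be summarized in a short, hedged sketch (all names here, such as sample_trajectory and oreo_update, are hypothetical stand-ins, not APIs from the paper): each round collects fresh trajectories with the current policy, labels them with the sparse task reward, keeps failures in the dataset, and re-optimizes policy and value on the accumulated data.

import random

def sample_trajectory(policy, task):
    # Stand-in: the policy LLM would generate a multi-step reasoning trajectory.
    return {"task": task, "steps": [policy(task)]}

def evaluate(task, trajectory):
    # Placeholder sparse 0/1 task reward (e.g. whether the final answer is correct).
    return random.random() < 0.5

def oreo_update(policy, value_fn, dataset):
    # Placeholder: a real round would minimize the soft Bellman consistency
    # loss on `dataset`, updating the policy and value function jointly.
    return policy, value_fn

def iterative_training(policy, value_fn, tasks, rounds=3):
    dataset = []
    for _ in range(rounds):
        for task in tasks:
            traj = sample_trajectory(policy, task)
            traj["reward"] = evaluate(task, traj)
            dataset.append(traj)   # failed trajectories are kept, not discarded
        policy, value_fn = oreo_update(policy, value_fn, dataset)
    return policy, value_fn

policy, value_fn = iterative_training(lambda t: f"draft({t})", lambda s: 0.0, ["q1", "q2"])

Retaining failed attempts is what distinguishes this loop from rejection sampling, which only re-trains on successes and, per the results above, shows diminishing returns across rounds.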

Conclusion

OREO provides a practical and effective solution for enhancing multi-step reasoning in LLMs through offline RL. By addressing the limitations of existing approaches, it offers a scalable method for improving reasoning capabilities. Its combination of fine-grained credit assignment, iterative training, and test-time search makes it a versatile tool for complex reasoning challenges. The results demonstrate OREO’s potential across a range of domains requiring sophisticated problem-solving, contributing to the evolution of AI systems capable of deeper reasoning.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post Meet OREO (Offline REasoning Optimization): An Offline Reinforcement Learning Method for Enhancing LLM Multi-Step Reasoning appeared first on MarkTechPost.
