AIMachine LearningTechnology

Meet OREO (Offline REasoning Optimization): An Offline Reinforcement Learning Method for Enhancing LLM Multi-Step Reasoning

By capernaum
Last updated: 2024-12-24 06:30

Large Language Models (LLMs) have demonstrated impressive proficiency in numerous tasks, but their ability to perform multi-step reasoning remains a significant challenge. This limitation becomes particularly evident in complex scenarios such as mathematical problem-solving, embodied agent control, and web navigation. Traditional Reinforcement Learning (RL) methods, like Proximal Policy Optimization (PPO), have been applied to address this issue but often come with high computational and data costs, making them less practical. Likewise, methods such as Direct Preference Optimization (DPO), while effective for aligning models with human preferences, struggle with multi-step reasoning tasks. DPO’s reliance on pairwise preference data and uniform token treatment undermines its capacity to assign credit effectively in situations with sparse rewards. These obstacles highlight the need for more targeted and efficient solutions to enhance LLM reasoning capabilities.

Introducing OREO: Offline Reasoning Optimization

OREO (Offline REasoning Optimization) is an offline RL approach specifically designed to address the shortcomings of existing methods in improving multi-step reasoning for LLMs. Developed collaboratively by researchers from UC San Diego, Tsinghua University, Salesforce Research, and Northwestern University, OREO builds on insights from maximum-entropy reinforcement learning. It trains a policy model and a value function concurrently by optimizing the soft Bellman equation. This removes the dependency on pairwise preference data, making it possible to use unpaired datasets with sparse rewards. Furthermore, OREO enables precise credit assignment across reasoning trajectories, which is especially beneficial when success depends on a few critical steps. The framework can also be extended to iterative exploration setups, and its learned value function can guide tree search to improve inference at test time.
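
For reference, the soft Bellman consistency from maximum-entropy RL that OREO builds on can be written in its standard textbook form (generic notation, not taken verbatim from the paper):

\[
Q_{\text{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[ V_{\text{soft}}(s_{t+1}) \right],
\qquad
V_{\text{soft}}(s_t) = \beta \log \sum_{a} \exp\!\left( \frac{Q_{\text{soft}}(s_t, a)}{\beta} \right)
\]

with the optimal policy \( \pi^{*}(a \mid s) = \exp\!\big( ( Q_{\text{soft}}(s, a) - V_{\text{soft}}(s) ) / \beta \big) \). Jointly training the policy and value function then amounts to penalizing violations of this consistency on offline trajectories, which is the mechanism behind the per-step credit assignment described above.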

Technical Details and Benefits

OREO’s core innovation lies in optimizing the soft Bellman equation to train the policy and value models simultaneously. This strategy ensures accurate credit assignment across reasoning steps, addressing the limitations of methods like DPO. Additionally, OREO offers both step-level and response-level objectives, providing flexibility for different granularities of reasoning tasks. At test time, the learned value function supports search techniques such as beam search, improving accuracy (a sketch of this appears below). Unlike baselines such as supervised fine-tuning (SFT) or rejection sampling, OREO can leverage failed trajectories to improve model robustness and adaptability, a capacity that makes it particularly valuable for iterative multi-step reasoning tasks.
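
To make the test-time idea concrete, here is a minimal Python sketch of value-guided beam search over reasoning steps. It assumes stand-in functions propose_steps (the policy LLM proposing candidate next steps) and value_of (the learned value function scoring a partial trajectory); neither name comes from the paper, and the stubs exist only so the sketch runs end to end.

from dataclasses import dataclass, field

@dataclass
class Candidate:
    steps: list = field(default_factory=list)   # reasoning steps generated so far
    score: float = 0.0                          # value estimate of the partial trajectory

def propose_steps(steps, k):
    # Stub for the policy LLM: would return k candidate next reasoning steps.
    return [f"step{len(steps)}-{i}" for i in range(k)]

def value_of(steps):
    # Stub for the learned value function V(s_t): would score the partial trajectory.
    return float(-len(steps))  # placeholder so the script runs

def value_guided_beam_search(beam_width=4, expand_k=4, max_depth=6):
    beam = [Candidate()]
    for _ in range(max_depth):
        expanded = []
        for cand in beam:
            for step in propose_steps(cand.steps, expand_k):
                new_steps = cand.steps + [step]
                # Rank continuations by the learned value, not just token likelihood.
                expanded.append(Candidate(new_steps, value_of(new_steps)))
        beam = sorted(expanded, key=lambda c: c.score, reverse=True)[:beam_width]
    return beam[0].steps

if __name__ == "__main__":
    print(value_guided_beam_search())

The design point is that candidates are kept or pruned by the value model's estimate of where the partial reasoning is heading, rather than by the policy's own token probabilities alone.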

Results and Insights

OREO’s performance has been rigorously evaluated on benchmarks such as GSM8K and MATH for mathematical reasoning, and ALFWorld for embodied agent control. Key findings include:

  • On GSM8K, OREO delivered a 5.2% relative improvement in accuracy with a 1.5B-parameter model compared to SFT, alongside a 10.5% relative improvement on MATH.
  • The same 1.5B model reached 52.5% accuracy on MATH without using an augmented problem set.
  • In ALFWorld, OREO achieved a 17.7% relative improvement in unseen environments, underscoring its ability to generalize beyond training data.

Iterative training further amplified OREO’s effectiveness, showing consistent accuracy gains over multiple iterations. While approaches like rejection sampling exhibited diminishing returns, OREO continued to improve by incorporating insights from failed attempts. Test-time search using OREO’s value function resulted in up to a 17.9% relative improvement over greedy decoding on the MATH dataset, highlighting its impact on inference quality.
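
The iterative setup described above can be summarized in a short, hedged sketch (all names here, such as sample_trajectory and oreo_update, are hypothetical stand-ins, not APIs from the paper): each round collects fresh trajectories with the current policy, labels them with the sparse task reward, keeps failures in the dataset, and re-optimizes policy and value on the accumulated data.

import random

def sample_trajectory(policy, task):
    # Stand-in: the policy LLM would generate a multi-step reasoning trajectory.
    return {"task": task, "steps": [policy(task)]}

def evaluate(task, trajectory):
    # Placeholder sparse 0/1 task reward (e.g. whether the final answer is correct).
    return random.random() < 0.5

def oreo_update(policy, value_fn, dataset):
    # Placeholder: a real round would minimize the soft Bellman consistency
    # loss on `dataset`, updating the policy and value function jointly.
    return policy, value_fn

def iterative_training(policy, value_fn, tasks, rounds=3):
    dataset = []
    for _ in range(rounds):
        for task in tasks:
            traj = sample_trajectory(policy, task)
            traj["reward"] = evaluate(task, traj)
            dataset.append(traj)   # failed trajectories are kept, not discarded
        policy, value_fn = oreo_update(policy, value_fn, dataset)
    return policy, value_fn

policy, value_fn = iterative_training(lambda t: f"draft({t})", lambda s: 0.0, ["q1", "q2"])

Retaining failed attempts is what distinguishes this loop from rejection sampling, which only re-trains on successes and, per the results above, shows diminishing returns across rounds.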

Conclusion

OREO provides a practical and effective solution for enhancing multi-step reasoning in LLMs through offline RL. By addressing the limitations of existing approaches, it offers a scalable method for improving reasoning capabilities. Its combination of fine-grained credit assignment, iterative training, and test-time search makes it a versatile tool for complex reasoning challenges. The results demonstrate OREO’s potential across a range of domains requiring sophisticated problem-solving, contributing to the evolution of AI systems capable of deeper reasoning.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post Meet OREO (Offline REasoning Optimization): An Offline Reinforcement Learning Method for Enhancing LLM Multi-Step Reasoning appeared first on MarkTechPost.
