
ThinkPRM: A Generative Process Reward Model for Scalable Reasoning Verification

By capernaum
Last updated: 2025-04-29 19:40

Reasoning with LLMs can benefit from additional test-time compute, but exploiting it depends on high-quality process reward models (PRMs) to select promising paths for search or ranking. PRMs score problem-solution pairs to indicate whether a solution is correct, and have typically been implemented as discriminative classifiers. However, these models require extensive resources, including human annotation, gold step-by-step solutions, or computationally intensive rollouts. LLM-as-a-judge approaches offer advantages in data efficiency and interpretability, but they perform poorly compared to specialized reward models on complex reasoning tasks, often failing to recognize incorrect reasoning. The challenge is to retain the data-efficiency and interpretability advantages of generative verification while matching the performance of discriminative PRMs.
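To make the contrast concrete, a discriminative PRM is essentially a classifier head on top of an LLM backbone that emits one correctness probability per reasoning step. Below is a minimal sketch of that interface, assuming a PyTorch/Transformers setup; the backbone name, step delimiter, and scoring convention are illustrative assumptions rather than any specific paper's design.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class DiscriminativePRM(nn.Module):
    """Classifier-style PRM: one correctness probability per reasoning step."""

    def __init__(self, backbone_name: str = "Qwen/Qwen2.5-1.5B"):  # illustrative backbone
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(backbone_name)
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    @torch.no_grad()
    def score_steps(self, problem: str, steps: list[str]) -> list[float]:
        """Score each step conditioned on the problem and all earlier steps."""
        scores = []
        for i in range(len(steps)):
            prefix = problem + "\n" + "\n".join(steps[: i + 1])
            inputs = self.tokenizer(prefix, return_tensors="pt", truncation=True)
            hidden = self.backbone(**inputs).last_hidden_state   # (1, seq_len, hidden)
            logit = self.score_head(hidden[:, -1, :])             # read score at final token
            scores.append(torch.sigmoid(logit).item())
        return scores
```

Training such a head requires a correct/incorrect label for every step, which is exactly the annotation burden described above.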

Research on process verification has followed three main paths. Discriminative PRMs function as classifiers that predict a numerical correctness score for each reasoning step, requiring extensive step-level annotations. Generative PRMs frame verification as a language-generation task, producing correctness decisions as natural-language tokens accompanied by a verification chain-of-thought (CoT); correctness scores are computed from conditional token probabilities such as P("correct"), which makes them inherently interpretable and scalable. Finally, test-time scaling techniques such as best-of-N selection and tree-based search improve reasoning performance by spending additional inference-time compute, and their effectiveness depends heavily on verifier quality for scoring solutions.
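The P("correct") scoring idea can be illustrated with a short sketch: the verifier first generates its verification chain-of-thought, and the solution score is then read from the probability of a "correct" versus "incorrect" verdict token. The model name, prompt template, and verdict wording below are assumptions for illustration, not the released THINKPRM checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # illustrative long-CoT base model
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL)
lm.eval()


@torch.no_grad()
def p_correct(problem: str, solution: str) -> float:
    """Generative PRM score: P("correct") vs. P("incorrect") after the verifier's own CoT."""
    prompt = (
        f"Problem:\n{problem}\n\nProposed solution:\n{solution}\n\n"
        "Check the solution step by step.\nVerification:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    # Step 1: let the verifier write its verification chain-of-thought.
    cot = lm.generate(ids, max_new_tokens=512, do_sample=False)
    # Step 2: force a verdict prefix and read the next-token distribution.
    verdict = tok("\nFinal verdict: the solution is", add_special_tokens=False,
                  return_tensors="pt").input_ids
    logits = lm(torch.cat([cot, verdict], dim=-1)).logits[0, -1]
    probs = torch.softmax(logits.float(), dim=-1)
    # Assumes each verdict word starts with a distinct single token in this tokenizer.
    c = probs[tok(" correct", add_special_tokens=False).input_ids[0]]
    i = probs[tok(" incorrect", add_special_tokens=False).input_ids[0]]
    return (c / (c + i)).item()  # normalized score in [0, 1]
```

Because the verdict is just a token in a generated text, the same model also exposes its reasoning about why a step is wrong, which is the interpretability advantage noted above.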

Researchers from the University of Michigan, Mila, LG AI Research, and the University of Illinois Urbana-Champaign have proposed THINKPRM, a long-CoT verifier fine-tuned on far fewer process labels than discriminative PRMs require. It leverages the inherent reasoning abilities of long-CoT models to outperform both LLM-as-a-judge and discriminative verifiers across several challenging benchmarks while using only 1% of the process labels in PRM800K. Under equal token budgets, THINKPRM scales verification compute more effectively than LLM-as-a-judge, outperforming it by 7.2% on a ProcessBench subset and highlighting the value of generative, long-CoT PRMs for scaling test-time verification with minimal supervision.

THINKPRM is evaluated against DiscPRM, the same base model fine-tuned with binary cross-entropy on the full PRM800K dataset, which contains 712K process labels from 98K problem-solution pairs. Additional baselines include unweighted majority voting and verifier-weighted majority voting in the best-of-N experiments, as sketched below. Results are reported on two math reasoning tasks, 100 problems from MATH-500 covering all difficulty levels and problems from the 2024 American Invitational Mathematics Examination (AIME), as well as on out-of-domain tasks: physics problems from GPQA-Diamond and a 200-problem subset of LiveCodeBench v5. For MATH-500, the researchers used THINKPRM-1.5B and THINKPRM-14B with two different generator models.
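The selection rules behind these baselines are simple to state in code. The sketch below shows unweighted majority voting, verifier-weighted majority voting, and plain best-of-N selection; `score` stands in for any PRM scoring function, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable


def majority_vote(answers: list[str]) -> str:
    """Unweighted majority: the most frequent final answer among the N samples."""
    return Counter(answers).most_common(1)[0][0]


def weighted_majority_vote(answers: list[str], scores: list[float]) -> str:
    """Verifier-weighted majority: sum the PRM score of every sample per distinct answer."""
    totals: dict[str, float] = defaultdict(float)
    for ans, s in zip(answers, scores):
        totals[ans] += s
    return max(totals, key=totals.get)


def best_of_n(solutions: list[str], answers: list[str],
              score: Callable[[str], float]) -> str:
    """Best-of-N: return the final answer of the single highest-scoring solution."""
    scores = [score(sol) for sol in solutions]
    return answers[max(range(len(scores)), key=scores.__getitem__)]
```

Weighted majority rewards answers that many high-scoring solutions agree on, whereas best-of-N trusts the single highest-scoring solution; both depend directly on verifier quality.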

On best-of-N selection with MATH-500, THINKPRM achieves reasoning accuracy higher than or comparable to DiscPRM across all sampling budgets. Under verifier-guided search on MATH-500, THINKPRM-1.5B outperforms DiscPRM by approximately 5 percentage points and surpasses LLM-as-a-judge using the same base model (R1-Qwen-1.5B). Compared with strong off-the-shelf PRMs such as RLHFFlow-Deepseek-PRM and MATH-Shepherd-PRM, THINKPRM-1.5B's scaling curve exceeds all baselines, outperforming RLHFFlow-Deepseek-PRM by over 7% at 16 beams. On out-of-domain evaluation, THINKPRM scales better than DiscPRM on GPQA-physics, outperforming it by 8%, and surpasses DiscPRM by 4.5% on LiveCodeBench.
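Verifier-guided search of the kind referenced here (e.g., 16 beams) keeps only the highest-scoring partial solutions at each step and expands those. A rough sketch under assumed interfaces follows: `expand_step` proposes candidate next steps from the generator and `score_prefix` is the PRM score of a partial solution; both are hypothetical placeholders.

```python
from typing import Callable


def verifier_guided_search(problem: str,
                           expand_step: Callable[[str, str], list[str]],
                           score_prefix: Callable[[str, str], float],
                           beams: int = 16,
                           max_steps: int = 20) -> str:
    """Keep the `beams` highest-scoring partial solutions and grow them one step at a time."""
    frontier = [""]  # partial solutions, one reasoning step per line
    for _ in range(max_steps):
        candidates = []
        for prefix in frontier:
            for step in expand_step(problem, prefix):          # generator proposes next steps
                candidates.append(prefix + step + "\n")
        if not candidates:
            break
        # The PRM scores every candidate prefix; only the top `beams` survive.
        candidates.sort(key=lambda c: score_prefix(problem, c), reverse=True)
        frontier = candidates[:beams]
        if all("Final answer:" in c for c in frontier):         # simple stopping convention
            break
    return max(frontier, key=lambda c: score_prefix(problem, c))
```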

In conclusion, the researchers introduced THINKPRM, a generative process reward model trained with minimal supervision on synthetic data, enabling efficient and scalable verification of step-by-step reasoning. They show that lightweight fine-tuning of generative PRMs on as few as 8K process labels can improve upon zero-shot LLM-as-a-judge baselines. THINKPRM also surpasses discriminative PRMs trained with orders of magnitude more process labels, highlighting the advantages of a generative language-modeling objective for interpretability, scalability, and data efficiency. The results underscore the potential of generative PRMs to scale test-time verification compute effectively, benefiting challenging domains such as mathematical and scientific reasoning.


Check out the Paper.

The post ThinkPRM: A Generative Process Reward Model for Scalable Reasoning Verification appeared first on MarkTechPost.
