This AI Paper Introduces Inference-Time Scaling Techniques: Microsoft’s Deep Evaluation of Reasoning Models on Complex Tasks

capernaum
Last updated: 2025-04-08 06:23

Large language models are often praised for their linguistic fluency, but a growing area of focus is enhancing their reasoning ability, especially where complex problem-solving is required. Such tasks include mathematical problems as well as spatial logic, pathfinding, and structured planning. In these domains, solutions are not immediately obvious, and models must work through them step by step, much as a person would. Because this reasoning unfolds while the model generates its answer, inference-time behavior has become an important subject of study in machine learning research.

Despite the progress in model architecture and training datasets, many language models still falter when presented with multi-step or high-difficulty reasoning tasks. The challenge is that even if a model can access vast information, it might not know how to use it effectively across multiple steps. Tasks like selecting meeting times with constraints or solving NP-hard problems require sustained logical sequencing, which standard models find difficult. Adding more parameters or memory has helped in some areas, but such brute-force solutions often lead to diminishing returns when task complexity increases.

To handle these limitations, researchers have explored tools like chain-of-thought prompting and post-training fine-tuning to better align models with complex tasks. Some methods involve generating multiple independent answers and then using heuristics or voting mechanisms to pick the most likely correct one. Others experiment with self-refinement, in which the model critiques its answers and revises them accordingly. These approaches have been implemented with varying success in conventional models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Pro, but results still vary by benchmark. In some instances, longer output did not translate into better accuracy, and token efficiency remained inconsistent.

Researchers at Microsoft introduced a rigorous evaluation framework for inference-time scaling that covers nine models and eight complex task benchmarks. The evaluation compared conventional models against reasoning-optimized ones such as DeepSeek R1, O1, and O3-mini. Their method involved parallel scaling, where multiple outputs are generated and aggregated, and sequential scaling, where the model is iteratively prompted to revise its output based on structured feedback. Benchmarks were sourced from domains like calendar planning, math Olympiads, and spatial reasoning, and the team introduced two new datasets for NP-hard problems: 3SAT and TSP.
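The sequential-scaling idea can be illustrated with a toy loop in which a critic checks each attempt and the feedback drives the next one. The sketch below is only an illustration and is not the researchers' code: the 3SAT instance, the `stub_model` function standing in for a language model, and the `sequential_scaling` helper are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Clause = Tuple[int, int, int]      # literal +i means x_i, -i means NOT x_i
Assignment = Dict[int, bool]

# Hypothetical toy instance, not taken from the paper's 3SAT dataset.
CLAUSES: List[Clause] = [(1, 2, 3), (2, -1, 3), (-1, -2, -3)]


def unsatisfied_clauses(clauses: List[Clause], assignment: Assignment) -> List[Clause]:
    """Critic: return every clause that the assignment fails to satisfy."""
    def literal_true(lit: int) -> bool:
        value = assignment[abs(lit)]
        return value if lit > 0 else not value
    return [c for c in clauses if not any(literal_true(lit) for lit in c)]


def stub_model(previous: Optional[Assignment],
               feedback: Optional[List[Clause]]) -> Assignment:
    """Stand-in for a model call (assumption): revise the previous attempt
    by flipping the first variable of the first clause the critic rejected."""
    if previous is None or not feedback:
        return {1: False, 2: False, 3: False}   # first attempt, no feedback yet
    revised = dict(previous)
    var = abs(feedback[0][0])
    revised[var] = not revised[var]
    return revised


def sequential_scaling(propose: Callable[[Optional[Assignment], Optional[List[Clause]]], Assignment],
                       clauses: List[Clause],
                       max_attempts: int) -> Tuple[Optional[Assignment], int]:
    """Ask for an answer, check it, feed the critique back, and repeat.
    Returns (solution or None, number of attempts used)."""
    attempt: Optional[Assignment] = None
    feedback: Optional[List[Clause]] = None
    for step in range(1, max_attempts + 1):
        attempt = propose(attempt, feedback)
        feedback = unsatisfied_clauses(clauses, attempt)
        if not feedback:            # critic found nothing wrong
            return attempt, step
    return None, max_attempts


solution, attempts_used = sequential_scaling(stub_model, CLAUSES, max_attempts=5)
```

With this particular instance the stub needs three attempts, while a budget of two attempts ends without a solution. That is the kind of gap between current performance and potential under a larger inference budget that such an evaluation tries to measure.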

The methodology relied on two core strategies: sampling multiple generations to evaluate result variability and using critics to simulate feedback-enhanced reasoning. In parallel scaling, the model outputs several answers that are evaluated using aggregators such as majority vote or best-of-n. In sequential scaling, the model receives feedback after each attempt and is prompted to try again. This allowed researchers to estimate current performance and the potential ceiling for improvement if computational resources were scaled up. Aggregators like average and worst-of-n helped identify where models consistently failed or succeeded. This dual approach provided insight into how models use additional inference steps and whether feedback mechanisms improve answer quality.
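As a rough illustration of how the aggregators named above differ, the following sketch scores a set of sampled answers to one question against a reference answer. The function names and the example answers are assumptions made for this example. In particular, best-of-n is treated here as "at least one sample is correct", which presumes a perfect verifier and should be read as an upper bound rather than as the paper's exact definition.

```python
from collections import Counter
from typing import Dict, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (ties go to the one seen first)."""
    return Counter(answers).most_common(1)[0][0]


def aggregate(answers: Sequence[str], reference: str) -> Dict[str, float]:
    """Score n sampled answers for one question under four aggregators."""
    correct = [a == reference for a in answers]
    return {
        "average": sum(correct) / len(correct),        # expected single-call accuracy
        "best_of_n": float(any(correct)),              # solved if any sample is right
        "worst_of_n": float(all(correct)),             # solved only if every sample is right
        "majority_vote": float(majority_vote(answers) == reference),
    }


# Hypothetical samples for a single question; not data from the study.
samples = ["42", "41", "42", "42", "17"]
scores = aggregate(samples, reference="42")
```

The spread between worst-of-n and best-of-n for the same question gives a simple view of how reliable a model is: a wide gap means the model can solve the problem but does not do so consistently.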

The performance analysis showed significant differences between models and task types. On the GPQA benchmark, the top-performing model, O1, reached 90.9% accuracy, while GPT-4o reached 77.7%. On the TSP dataset, O1 maintained accuracy above 80% across most levels, while GPT-4o’s performance peaked only when scaled up to more than 20 inference calls. In BA Calendar, DeepSeek R1 achieved 88.5% accuracy, outperforming Claude 3.7 Sonnet and Gemini 2.0 Pro. However, results also revealed that increased token usage did not guarantee higher accuracy. For example, DeepSeek R1 consumed significantly more tokens than Claude 3.7 Sonnet but only marginally outperformed it in some math tasks. Even within a single model, repeated attempts on the same question showed high variation in token counts, raising concerns about cost predictability for real-world applications.
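One simple way to quantify the cost-predictability concern is to look at the spread of token counts across repeated runs of the same prompt. The sketch below uses made-up token counts purely for illustration; they are not measurements from the study, and `token_spread` is a hypothetical helper name.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def token_spread(token_counts: Sequence[int]) -> Dict[str, float]:
    """Summarize how much token usage varies across repeated runs."""
    avg = mean(token_counts)
    sd = stdev(token_counts)            # sample standard deviation
    return {
        "mean": avg,
        "stdev": sd,
        "cv": sd / avg,                 # coefficient of variation (unitless)
        "min": float(min(token_counts)),
        "max": float(max(token_counts)),
    }


# Hypothetical token counts for five runs of one question (assumption).
runs = [1200, 3400, 800, 5100, 2500]
summary = token_spread(runs)
```

A coefficient of variation near zero would mean each call costs roughly the same. A large value, as in this invented example, means the bill for a single question can differ several-fold from one run to the next.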

This study underscores the gap between traditional and reasoning-enhanced models and highlights that intelligent scaling, not just more tokens, can improve complex task performance. The researchers showed that feedback loops and strong verifiers offer substantial gains in model accuracy, even in difficult domains. Their findings suggest that reasoning models still have headroom for improvement, especially when guided by structured inference strategies and cost-efficient token management.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


The post This AI Paper Introduces Inference-Time Scaling Techniques: Microsoft’s Deep Evaluation of Reasoning Models on Complex Tasks appeared first on MarkTechPost.
