
Model Performance Begins with Data: Researchers from Ai2 Release DataDecide—A Benchmark Suite to Understand Pretraining Data Impact Across 30K LLM Checkpoints

capernaum
Last updated: 2025-04-17 08:22

The Challenge of Data Selection in LLM Pretraining

Developing large language models entails substantial computational investment, especially when experimenting with alternative pretraining corpora. Comparing datasets at full scale (on the order of billions of parameters and hundreds of billions of tokens) can consume hundreds of thousands of GPU hours per run. Consequently, practitioners resort to smaller-scale experiments as proxies for large-model behavior. Yet these “pilot” studies are rarely published, producing a fragmented landscape in which each laboratory repeats similar small-scale tests without shared benchmarks or methodologies. This opacity impedes reproducibility, underutilizes collective insights, and obscures the true trade-offs between development compute and final model performance.

DataDecide

To address these limitations, the Allen Institute for AI (Ai2), in collaboration with the University of Washington and the University of Pennsylvania, today releases DataDecide, a comprehensive suite of controlled pretraining experiments spanning 25 distinct corpora and 14 model sizes from 4 million to 1 billion parameters. DataDecide’s datasets include well-known sources such as Dolma, DCLM, RefinedWeb, C4, and FineWeb, alongside variations produced by domain ablation, deduplication, quality filtering, and source mixing. Each model is trained at a fixed token-to-parameter ratio of 100 (100 tokens per parameter), reflecting the “overtraining” regime that optimizes inference efficiency. In total, 1,050 models and more than 30,000 checkpoints, each evaluated across ten downstream tasks, are released to the public.
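The fixed ratio makes the training budget at each scale easy to work out. The sketch below is illustrative only: it uses the nominal sizes named in this article (4 M, 150 M, 1 B) rather than the full 14-size ladder, and the function and variable names are hypothetical, not taken from the DataDecide code.

```python
# Illustrative arithmetic only; names here are hypothetical, not DataDecide's.
TOKENS_PER_PARAM = 100  # fixed token-to-parameter ratio stated in the article


def token_budget(n_params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Training tokens implied by a fixed token-to-parameter ratio."""
    return n_params * ratio


# Only sizes mentioned in the article, as nominal parameter counts
# (an assumption: real models rarely hit these round numbers exactly).
nominal_sizes = {"4M": 4_000_000, "150M": 150_000_000, "1B": 1_000_000_000}
budgets = {name: token_budget(n) for name, n in nominal_sizes.items()}

for name, tokens in budgets.items():
    print(f"{name}: {tokens / 1e9:g}B tokens")
```

Under these assumptions the smallest model sees roughly 0.4 billion tokens and the 1 B target roughly 100 billion, which is why small-scale proxies are so much cheaper to run.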

Technical Structure and Pragmatic Benefits

DataDecide orchestrates experiments along three axes:

    • Data Recipes: Twenty-five well-documented pretraining corpora, each embodying different curation strategies (see Table 1 in the paper for full recipe specifications).
    • Model Scale: Fourteen parameter configurations (4 M–1 B), programmatically derived via the OLMo model ladder to ensure consistent training hyperparameters across scales. Each non-target scale includes two “early-stop” seed runs, while the 1 B-parameter models feature three complete seed reruns to quantify variability.
    • Evaluation Suite: The OLMES benchmark of ten multiple-choice tasks (e.g., MMLU, ARC Easy/Challenge, HellaSwag), alongside the code-generation tasks MBPP and HumanEval, provides a multifaceted view of language understanding, commonsense reasoning, and code generation performance.
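The three axes above imply a simple experiment grid. The following sketch enumerates it under one reading of the text: three seeds per (recipe, size) pair, with only one seed running to completion below the target scale. The index-based names and the choice of seed 0 as the full run are assumptions made for illustration.

```python
from itertools import product

N_RECIPES = 25   # data recipes (from the article)
N_SIZES = 14     # model sizes (from the article)
N_SEEDS = 3      # assumption: three seeds per (recipe, size) pair

TARGET_SIZE = N_SIZES - 1  # index of the largest (1 B) configuration

# Every (recipe, size, seed) combination is one trained model.
runs = list(product(range(N_RECIPES), range(N_SIZES), range(N_SEEDS)))


def runs_to_completion(size_idx: int, seed: int) -> bool:
    """Assumed rule: all target-scale seeds finish; elsewhere only seed 0 does."""
    return size_idx == TARGET_SIZE or seed == 0


full_runs = sum(runs_to_completion(size, seed) for _, size, seed in runs)
early_stopped = len(runs) - full_runs

print(len(runs), full_runs, early_stopped)
```

The grid size of 25 × 14 × 3 matches the 1,050 models cited in the article; the split between full and early-stopped runs depends on the assumed rule.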

By releasing both pretraining datasets and corresponding models, DataDecide enables researchers to:

    • Reuse checkpoints for new evaluations without retraining.
    • Experiment with novel prediction methods (e.g., advanced scaling‑law fits, smoothing techniques).
    • Investigate benchmark sensitivity to training data and model scale.

Key Findings and Quantitative Insights

DataDecide’s systematic analysis yields four practical guidelines:

    • Single-Scale Baseline Robustness: Ranking corpora by downstream accuracy at a single, small scale (e.g., 150 M parameters) achieves ~80 percent decision accuracy for predicting the best dataset at the 1 B-parameter target scale. In contrast, eight baseline scaling-law extrapolations do not surpass this simple heuristic, underscoring its cost-effectiveness.
    • Task-Dependent Compute Sensitivity: The compute budget required for reliable decisions varies markedly by task. Benchmarks like MMLU and ARC Easy become predictable with less than 0.01 percent of the target compute, whereas HellaSwag and SocialIQA demand orders of magnitude more FLOPs to achieve similar decision accuracy.
    • Proxy Metric Selection: Continuous likelihood metrics, specifically the character-normalized average probability of correct continuations (CORRECT PROB) and total probability (TOTAL PROB), outperform discrete accuracy measures at small scales. This is most pronounced on code tasks (MBPP, HumanEval), where decision accuracy jumps from near-random to over 80 percent with CORRECT PROB as the proxy.
    • Variance and Spread Considerations: High decision accuracy correlates with low run-to-run variance (noise) and ample performance spread across datasets. Proxy metrics that reduce noise or amplify spread thus directly enhance prediction reliability.
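To make two of the terms above concrete, here is a minimal sketch. It assumes decision accuracy is the fraction of dataset pairs whose ordering at the small scale matches their ordering at the target scale, and it treats CORRECT PROB as the mean per-character (length-normalized) probability of the correct continuation. Both are plausible readings rather than the paper's exact definitions, the function names are hypothetical, and the scores are made-up toy values.

```python
from itertools import combinations
from math import exp
from statistics import mean


def correct_prob(examples):
    """Mean per-character probability of the correct continuation.

    Each example is (log_prob_of_correct_continuation, n_chars).
    exp(log_prob / n_chars) is the geometric-mean probability per character.
    """
    return mean(exp(log_prob / n_chars) for log_prob, n_chars in examples)


def decision_accuracy(small, target):
    """Fraction of recipe pairs ranked the same way at both scales.

    `small` and `target` map recipe name -> score (higher is better).
    Ties are compared with strict '>' on both sides.
    """
    pairs = list(combinations(sorted(small), 2))
    agree = sum(
        (small[a] > small[b]) == (target[a] > target[b]) for a, b in pairs
    )
    return agree / len(pairs)


# Toy scores for four hypothetical recipes (not real DataDecide results).
small_scale = {"A": 0.31, "B": 0.28, "C": 0.35, "D": 0.30}
target_scale = {"A": 0.52, "B": 0.47, "C": 0.50, "D": 0.49}

acc = decision_accuracy(small_scale, target_scale)
print(f"decision accuracy: {acc:.3f}")
```

In this toy case only the A-versus-C comparison flips between scales, so five of the six pairs agree. Swapping the small-scale scores for a continuous proxy such as `correct_prob` changes the input to `decision_accuracy` but not the comparison itself.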

Concluding Perspective

DataDecide transforms pretraining data selection from an ad hoc art into a transparent, data-driven science. By open-sourcing all 25 corpora, 1,050 models, 30,000+ checkpoints, and evaluation scripts on Hugging Face and GitHub, Ai2 invites the community to reproduce findings, extend evaluations to new benchmarks, and innovate on decision-making methods. As LLM development continues to demand ever-greater compute resources, DataDecide offers a principled framework for minimizing wasted experiments and maximizing insight, paving the way toward more efficient, reproducible, and collaborative AI research.



This article first appeared on MarkTechPost.
