AIMachine LearningTechnology

Alibaba Released Babel: An Open Multilingual Large Language Model (LLM) Serving Over 90% of Global Speakers

By capernaum
Last updated: 2025-03-06 18:47

Most existing LLMs prioritize languages with abundant training resources, such as English, French, and German, while widely spoken but underrepresented languages like Hindi, Bengali, and Urdu receive comparatively less attention. This imbalance limits the accessibility of AI-driven language tools for many global populations, leaving billions without high-quality language processing solutions. Addressing this challenge requires innovative approaches to training and optimizing multilingual LLMs to deliver consistent performance across languages with varying resource availability.

A critical challenge in multilingual NLP is the uneven distribution of linguistic resources. High-resource languages benefit from extensive corpora, while languages spoken in developing regions often lack sufficient training data. As a result, multilingual models tend to be accurate in well-documented languages and struggle with underrepresented ones, so expanding language coverage without sacrificing model efficiency remains an open problem.

Several multilingual LLMs have attempted to address this challenge, including Bloom, GLM-4, and Qwen2.5. These models support multiple languages, but their effectiveness depends on the availability of training data. They prioritize languages with extensive textual resources while offering suboptimal performance in languages with scarce data. For example, existing models excel in English, Chinese, and Spanish but face difficulties when processing Swahili, Javanese, or Burmese. Also, many of these models rely on traditional pretraining methods, which fail to accommodate language diversity without increasing computational requirements. Without structured approaches to improving language inclusivity, these models remain inadequate for truly global NLP applications.

Researchers from DAMO Academy at Alibaba Group introduced Babel to bridge this gap: a multilingual LLM designed to serve over 90% of global speakers by covering the 25 most widely spoken languages. Babel employs a unique layer extension technique to expand its model capacity without compromising performance. The research team introduced two model variants: Babel-9B, optimized for efficiency in inference and fine-tuning, and Babel-83B, which establishes a new benchmark in multilingual NLP. Unlike previous models, Babel includes widely spoken but often overlooked languages such as Bengali, Urdu, Swahili, and Javanese. The researchers focused on optimizing data quality by implementing a rigorous pipeline that curates high-quality training datasets from multiple sources.

Babel’s architecture differs from conventional multilingual LLMs by employing a structured layer extension approach. Rather than relying on continuous pretraining, which requires extensive computational resources, the research team increased the model’s parameter count through controlled expansion. Additional layers were integrated strategically to maximize performance while preserving computational efficiency. For instance, Babel-9B was designed to balance speed and multilingual comprehension, making it suitable for research and localized deployment, whereas Babel-83B extends its capabilities to match commercial models. The model’s training process incorporated extensive data-cleaning techniques, using an LLM-based quality classifier to filter and refine training content. The dataset was sourced from diverse origins, including Wikipedia, news articles, textbooks, and structured multilingual corpora such as MADLAD-400 and CulturaX.
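
The paper's exact insertion points and initialization scheme are not detailed here, but the general idea of growing depth by duplicating existing blocks can be sketched as follows (a minimal, framework-free illustration; in practice each "block" would be a transformer layer, e.g. a PyTorch `nn.Module`):

```python
import copy

def extend_layers(layers, insert_every=4):
    """Return a new layer stack in which every `insert_every`-th block is
    followed by a copy of itself. The pretrained computation path is kept
    intact, and the duplicated blocks add extra trainable capacity that
    later training can specialize."""
    extended = []
    for i, layer in enumerate(layers, start=1):
        extended.append(layer)
        if i % insert_every == 0:
            extended.append(copy.deepcopy(layer))  # independent copy, not a shared reference
    return extended

# Toy stand-in for a transformer's block stack: 8 blocks grow to 10.
blocks = [{"name": f"block{i}"} for i in range(8)]
grown = extend_layers(blocks, insert_every=4)
print(len(grown))  # prints 10
```

The copies sit directly after their originals, so the extended model starts from a function close to the original one, which is what makes this cheaper than continued pretraining from scratch.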

Evaluation metrics demonstrated Babel’s superiority over existing multilingual LLMs. Babel-9B achieved an average score of 63.4 across multiple multilingual benchmarks, outperforming competitors such as GLM4-9B (59.2) and Gemma2-9B (59.5). The model excelled in reasoning tasks like MGSM, scoring 43.4, and in translation tasks such as Flores-200, achieving 55.1. Meanwhile, Babel-83B set a new standard in multilingual performance, reaching an average score of 73.2, surpassing Qwen2.5-72B (69.8) and Llama3.1-70B (66.9). The model’s ability to handle low-resource languages was particularly notable, showing 5-10% improvements over previous multilingual LLMs. Also, Babel’s supervised fine-tuning (SFT) models, trained on a dataset of over 1 million conversations, achieved performance comparable to commercial AI models such as GPT-4o.
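
For a quick side-by-side view, the average scores quoted above can be tabulated and ranked in a few lines (the numbers are exactly those reported in this article):

```python
# Average multilingual benchmark scores quoted above.
reported = {
    "Babel-83B": 73.2, "Qwen2.5-72B": 69.8, "Llama3.1-70B": 66.9,
    "Babel-9B": 63.4, "Gemma2-9B": 59.5, "GLM4-9B": 59.2,
}

# Sort models by score, best first.
ranked = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)
for model, score in ranked:
    print(f"{model:>13}  {score:.1f}")
```

Both Babel variants top their respective size classes: Babel-9B leads the ~9B group by 3.9 points, and Babel-83B leads the 70B+ group by 3.4 points.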

Key takeaways from the research on Babel include:

  1. Babel supports 25 of the world’s most widely spoken languages, reaching over 90% of global speakers. Many languages, such as Swahili, Javanese, and Burmese, were previously underrepresented in open-source LLMs.
  2. Instead of relying on traditional pretraining, Babel increases its parameter count using a structured layer extension technique, enhancing scalability without excessive computational demands.
  3. The research team implemented rigorous data-cleaning techniques using LLM-based quality classifiers. The training corpus includes Wikipedia, CC-News, CulturaX, and MADLAD-400, ensuring high linguistic accuracy.
  4. Babel-9B outperformed similar-sized models, achieving an average score of 63.4, while Babel-83B set a new benchmark at 73.2. These models demonstrated state-of-the-art performance in reasoning, translation, and multilingual understanding tasks.
  5. Babel significantly improves accuracy for languages with limited training data, achieving up to 10% better performance in underrepresented languages compared to existing multilingual LLMs.
  6. Babel-83B-Chat reached an overall score of 74.4, closely trailing GPT-4o (75.1) while outperforming other leading open-source models.
  7. The supervised fine-tuning (SFT) dataset comprises 1 million conversations, allowing Babel-9B-Chat and Babel-83B-Chat to rival commercial AI models in multilingual discussions and problem-solving.
  8. The research team emphasizes that additional alignment and preference tuning could further elevate Babel’s capabilities, making it an even stronger multilingual AI tool.
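
The LLM-based quality filtering mentioned above reduces, at its core, to scoring each document and keeping those that clear a threshold. The sketch below shows that shape; `toy_score` is a hypothetical stand-in (the real pipeline would query an LLM classifier, whose prompt and threshold are not described in this article):

```python
def filter_corpus(docs, score_fn, threshold=0.7):
    """Keep only documents whose quality score clears the threshold."""
    return [doc for doc in docs if score_fn(doc) >= threshold]

def toy_score(doc: str) -> float:
    # Hypothetical stand-in: a real pipeline would ask an LLM classifier
    # to rate fluency and informativeness. Here, longer texts simply
    # score higher, capped at 1.0.
    return min(len(doc.split()) / 20.0, 1.0)

corpus = [
    "spam spam",                                       # fragment: dropped
    " ".join(["A clean, well-formed sentence."] * 5),  # long enough: kept
]
kept = filter_corpus(corpus, toy_score)
print(len(kept))  # prints 1
```

Swapping `toy_score` for a call to a quality classifier turns this into the kind of corpus-curation step the researchers describe for sources like Wikipedia, CC-News, CulturaX, and MADLAD-400.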

Check out the Paper, GitHub Page, Model on HF and Project Page. All credit for this research goes to the researchers of this project.


The post Alibaba Released Babel: An Open Multilingual Large Language Model LLM Serving Over 90% of Global Speakers appeared first on MarkTechPost.
