AI, Technology

MinMo: A Multimodal Large Language Model with Approximately 8B Parameters for Seamless Voice Interaction

By capernaum
Last updated: 2025-01-15 20:38

Advances in large language and multimodal speech-text models have laid a foundation for seamless, real-time, natural, and human-like voice interactions. Achieving this requires systems to process speech content, emotional tones, and audio cues while producing accurate and coherent responses. However, challenges remain: the mismatch between speech and text sequence lengths, limited pre-training for speech tasks, and the difficulty of preserving the underlying language model's knowledge. Current systems also fall short on functions such as speech translation, emotion recognition, and simultaneous processing during a conversation.

Currently, voice interaction systems are divided into native and aligned multimodal models. Native multimodal models integrate speech and text understanding and generation in a single model. However, speech token sequences are much longer than the corresponding text sequences, making these models inefficient as they grow in size. They also struggle with limited speech data, which leads to problems such as catastrophic forgetting. Aligned multimodal models instead attach voice capabilities to pre-trained text models, but they are typically trained on small datasets and give little attention to complex speech tasks such as emotion recognition or speaker analysis. Moreover, they have not been properly evaluated on varied speaking styles or full-duplex conversation, both essential for seamless voice interaction.

To mitigate the issues with current multimodal models, researchers from Tongyi Lab and Alibaba Group proposed MinMo, a new multimodal large language model designed to improve voice comprehension and generation. The researchers trained the model on over 1.4 million hours of speech data across various tasks, including Speech-to-Text, Text-to-Speech, and Speech-to-Speech. This extensive training allows MinMo to achieve state-of-the-art performance on multiple benchmarks while preventing catastrophic forgetting of the text LLM's capabilities. Unlike previous models, MinMo integrates speech and text seamlessly without losing performance on text tasks and enhances voice interaction capabilities such as emotion recognition, speaker analysis, and multilingual speech recognition.
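The staged training described above can be illustrated with a minimal training-loop skeleton. The stage names follow the tasks listed in the article, but every function, split, and data structure below is a hypothetical placeholder, not actual MinMo code.

```python
# Hypothetical sketch of MinMo-style multi-stage modality alignment.
# The stage order mirrors the tasks named in the article; the even data
# split across stages is an illustrative assumption.

STAGES = [
    "speech-to-text",    # align voice-encoder outputs with the text LLM
    "text-to-speech",    # align LLM outputs with the voice decoder
    "speech-to-speech",  # joint end-to-end alignment
    "duplex",            # full-duplex turn-taking behavior
]

def run_stage(stage: str, hours_of_data: float) -> dict:
    """Placeholder for training one alignment stage."""
    return {"stage": stage, "hours": hours_of_data}

def train_minmo(total_hours: float = 1_400_000) -> list[dict]:
    # Distribute the ~1.4M hours of speech data across the stages.
    per_stage = total_hours / len(STAGES)
    return [run_stage(s, per_stage) for s in STAGES]

logs = train_minmo()
print([log["stage"] for log in logs])
```

The point of the multi-stage schedule is that each stage aligns one new capability while earlier alignments (and the frozen text LLM's knowledge) are preserved, which is how the article says catastrophic forgetting is avoided.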

Researchers designed MinMo with a multi-stage training approach to align the speech and text modalities, enabling speech-to-text, text-to-speech, speech-to-speech, and duplex interactions. The model builds on a pre-trained text LLM and includes core components such as the SenseVoice-large voice encoder for multilingual speech and emotion recognition, the Qwen2.5-7B-instruct LLM for text processing, and CosyVoice 2 for efficient audio generation. MinMo also introduces an autoregressive (AR) streaming Transformer voice decoder, which improves performance and reduces latency. With around 8 billion parameters, the model delivers real-time responses and full-duplex interaction with a latency of about 600 ms.
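The component roles above suggest a simple three-stage pipeline: encode speech, generate a response with the text LLM, then stream audio out. The sketch below mirrors that structure with trivial stand-in math; the class and method names are illustrative placeholders, not the real MinMo API.

```python
class MinMoSketch:
    """Illustrative pipeline mirroring MinMo's modular design (hypothetical names).

    Component roles as described in the article:
      encode_speech   ~ SenseVoice-large (audio -> speech embeddings)
      llm_respond     ~ Qwen2.5-7B-instruct (embeddings -> response tokens)
      decode_to_audio ~ AR streaming Transformer + CosyVoice 2 (tokens -> audio)
    """

    def encode_speech(self, waveform: list[float]) -> list[float]:
        # Placeholder: a real encoder maps waveforms to speech embeddings.
        return [x * 0.5 for x in waveform]

    def llm_respond(self, embeddings: list[float]) -> list[int]:
        # Placeholder: a real LLM autoregressively generates text tokens.
        return [int(abs(e) * 2) % 100 for e in embeddings]

    def decode_to_audio(self, tokens: list[int]):
        # Streaming AR decode: yielding audio chunk by chunk is what keeps
        # first-chunk latency low (~600 ms end to end, per the article).
        for t in tokens:
            yield float(t)

    def respond(self, waveform: list[float]) -> list[float]:
        emb = self.encode_speech(waveform)
        tokens = self.llm_respond(emb)
        return list(self.decode_to_audio(tokens))
```

Because the decoder is a generator, a caller can begin playback as soon as the first chunk arrives rather than waiting for the full utterance, which is the design choice the streaming voice decoder reflects.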

The researchers tested MinMo across various benchmarks, including multilingual speech recognition, speech-to-text enhancement, and voice generation. The results showed that MinMo outperformed most models, including Whisper Large v3, particularly in multilingual speech recognition, and achieved state-of-the-art performance in multilingual speech translation. It also excelled in speech-to-text enhancement, speech emotion recognition (SER), and audio event understanding. MinMo achieved 85.3% accuracy in language identification on the Fleurs dataset, surpassing all previous models. In tasks such as gender detection, age estimation, and punctuation insertion, MinMo demonstrated strong performance, outpacing models like Qwen2.5-7B and SenseVoice-L. It also showed superior performance in dialect and role-playing tasks in voice generation, with an accuracy of 98.4%, compared to GLM-4-Voice's 63.1%. Although performance declined on the more complex speech-to-speech tasks, MinMo performed well in conversational tasks and logical reasoning. The model achieved high sensitivity in turn-taking, with around 99% prediction performance, and a response latency of about 600 ms in full-duplex interactions.

In conclusion, the proposed MinMo model advances voice interaction systems by addressing challenges such as sequence-length discrepancies and catastrophic forgetting. Its multi-stage alignment strategy and streaming voice decoder enable strong multilingual speech and emotion recognition. MinMo sets a new benchmark for natural voice interactions and can serve as a baseline for future research, with potential improvements in instruction-following and end-to-end audio generation. Future work could focus on refining pronunciation handling and developing fully integrated systems.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.


The post MinMo: A Multimodal Large Language Model with Approximately 8B Parameters for Seamless Voice Interaction appeared first on MarkTechPost.
