Speech-to-Speech Foundation Models Pave the Way for Seamless Multilingual Interactions

capernaum
Last updated: 2025-03-18 07:25

At NVIDIA GTC25, Gnani.ai experts unveiled groundbreaking advancements in voice AI, focusing on the development and deployment of Speech-to-Speech Foundation Models. This innovative approach promises to overcome the limitations of traditional cascaded voice AI architectures, ushering in an era of seamless, multilingual, and emotionally aware voice interactions.

The Limitations of Cascaded Architectures

The current state-of-the-art architecture for voice agents is a three-stage pipeline: Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). While effective, this cascaded architecture has two major drawbacks: latency and error propagation. Each block in the pipeline adds its own latency, and the cumulative delay across the stages can range from 2.5 to 3 seconds, leading to a poor user experience. Errors introduced in the STT stage also propagate through the rest of the pipeline, compounding inaccuracies. Finally, because everything passes through text, the architecture loses critical paralinguistic features such as sentiment, emotion, and tone, resulting in monotonous and emotionally flat responses.
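To make the two drawbacks concrete, here is a minimal sketch of how latency adds up and accuracy multiplies down across a cascade. The per-stage numbers are illustrative assumptions, not figures from the talk; they were chosen only so that the total lands inside the 2.5 to 3 second range quoted above. Treating stage errors as independent is also a simplification.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_ms: float  # hypothetical per-stage latency
    accuracy: float    # hypothetical chance the stage output is correct


def cascade_totals(stages):
    """Return (total latency, end-to-end accuracy) for a serial pipeline."""
    total_latency = sum(s.latency_ms for s in stages)
    accuracy = 1.0
    for s in stages:
        # An error in any stage is passed on to the next one, so under
        # an independence assumption the accuracies multiply.
        accuracy *= s.accuracy
    return total_latency, accuracy


# Assumed values for illustration only.
PIPELINE = [
    Stage("STT", 800.0, 0.92),
    Stage("LLM", 1200.0, 0.95),
    Stage("TTS", 700.0, 0.99),
]

total_ms, end_to_end_accuracy = cascade_totals(PIPELINE)
print(f"total latency: {total_ms:.0f} ms")
print(f"end-to-end accuracy: {end_to_end_accuracy:.3f}")
```

Even with each stage above 90 percent in this toy setup, the end-to-end figure is lower than any single stage, which is the compounding effect described above.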

Introducing Speech-to-Speech Foundation Models

To address these limitations, Gnani.ai presented a novel Speech-to-Speech Foundation Model. The model processes and generates audio directly, eliminating the need for intermediate text representations. The key innovation lies in training a massive audio encoder on 1.5 million hours of labeled data across 14 languages, capturing nuances of emotion, empathy, and tonality. The model employs a nested XL encoder, retrained with comprehensive data, and an input audio projector layer that maps audio features into textual embeddings. For real-time streaming, audio and text features are interleaved, while non-streaming use cases use an embedding merge layer. The LLM layer, initially based on Llama 8B, was expanded to cover 14 languages, which required rebuilding the tokenizers. An output projector model generates mel spectrograms, enabling the creation of hyper-personalized voices.
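The difference between the streaming and non-streaming paths can be sketched as follows. This is only an illustration of the general idea of interleaving versus merging: the function names, the fixed ratio of two audio frames per text token, and the use of plain concatenation for the merge step are all assumptions, not details of Gnani.ai's implementation.

```python
def interleave_streaming(audio, text, audio_per_text=2):
    """Interleave audio frames and text tokens into one tagged sequence.

    After every `audio_per_text` audio frames, one text token is emitted,
    so the sequence can be consumed while audio is still arriving.
    The ratio is a hypothetical parameter.
    """
    if audio_per_text < 1:
        raise ValueError("audio_per_text must be at least 1")
    merged = []
    a = t = 0
    while a < len(audio) or t < len(text):
        for frame in audio[a:a + audio_per_text]:
            merged.append(("audio", frame))
        a += audio_per_text
        if t < len(text):
            merged.append(("text", text[t]))
            t += 1
    return merged


def merge_non_streaming(audio, text):
    """Stand-in for a merge step: all audio features, then all text features."""
    return [("audio", f) for f in audio] + [("text", tok) for tok in text]


audio_frames = ["a0", "a1", "a2", "a3", "a4"]
text_tokens = ["t0", "t1"]

streamed = interleave_streaming(audio_frames, text_tokens)
merged = merge_non_streaming(audio_frames, text_tokens)
print(streamed)
print(merged)
```

In the interleaved form the first text token appears after only two audio frames, whereas the merged form needs the whole utterance before any text position is reached, which is why the former suits real-time use.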

Key Benefits and Technical Hurdles

The Speech-to-Speech model offers several significant benefits. Firstly, it reduces latency, moving from 2 seconds to approximately 850-900 milliseconds for the first token output. Secondly, it enhances accuracy by fusing ASR with the LLM layer, improving performance, especially for short and long utterances. Thirdly, the model achieves emotional awareness by capturing and modeling tonality, stress, and rate of speech. Fourthly, it enables improved interruption handling through contextual awareness, facilitating more natural interactions. Finally, the model is designed to handle low-bandwidth audio effectively, which is crucial for telephony networks.

Building this model presented several challenges, notably the massive data requirements. The team created a crowd-sourced system with 4 million users to generate emotionally rich conversational data. They also leveraged foundation models for synthetic data generation and trained on 13.5 million hours of publicly available data. The final model has roughly 9 billion parameters: 636 million for the audio input, 8 billion for the LLM, and 300 million for the TTS system.
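The figures in this section can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable and function names are illustrative.

```python
# Parameter counts per component, as quoted in the article.
PARAMS = {
    "audio_input": 636_000_000,
    "llm": 8_000_000_000,
    "tts": 300_000_000,
}

total_params = sum(PARAMS.values())
print(f"total parameters: {total_params / 1e9:.3f} billion")


def latency_reduction(before_ms, after_ms):
    """Fractional reduction in latency when going from before_ms to after_ms."""
    return 1.0 - after_ms / before_ms


# First-token latency: 2 seconds before, 850-900 ms after.
best_case = latency_reduction(2000, 850)
worst_case = latency_reduction(2000, 900)
print(f"latency reduction: {worst_case:.1%} to {best_case:.1%}")
```

The components sum to 8.936 billion parameters, which the article rounds to 9 billion, and the quoted first-token latencies correspond to a reduction of roughly 55 to 57.5 percent.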

NVIDIA’s Role in Development

Development of the model relied heavily on the NVIDIA stack. NVIDIA NeMo was used for training encoder-decoder models, and NeMo Curator facilitated synthetic text data generation. NVIDIA EVA was employed to generate audio pairs, combining proprietary information with synthetic data.

Use Cases

Gnani.ai showcased two primary use cases: real-time language translation and customer support. The real-time language translation demo featured an AI engine facilitating a conversation between an English-speaking agent and a French-speaking customer. The customer support demo highlighted the model’s ability to handle cross-lingual conversations, interruptions, and emotional nuances. 

Speech-to-Speech Foundation Model

The Speech-to-Speech Foundation Model represents a significant leap forward in voice AI. By eliminating the limitations of traditional architectures, this model enables more natural, efficient, and emotionally aware voice interactions. As the technology continues to evolve, it promises to transform various industries, from customer service to global communication.

The post Speech-to-Speech Foundation Models Pave the Way for Seamless Multilingual Interactions appeared first on MarkTechPost.
