Monday, 12 May 2025
  • My Feed
  • My Interests
  • My Saves
  • History
  • Blog
Subscribe
Capernaum
  • Finance
    • Cryptocurrency
    • Stock Market
    • Real Estate
  • Lifestyle
    • Travel
    • Fashion
    • Cook
  • Technology
    • AI
    • Data Science
    • Machine Learning
  • Health
    HealthShow More
    Skincare as You Age Infographic
    Skincare as You Age Infographic

    When I dove into the scientific research for my book How Not…

    By capernaum
    Treating Fatty Liver Disease with Diet 
    Treating Fatty Liver Disease with Diet 

    What are the three sources of liver fat in fatty liver disease,…

    By capernaum
    Bird Flu: Emergence, Dangers, and Preventive Measures

    In the United States in January 2025 alone, approximately 20 million commercially-raised…

    By capernaum
    Inhospitable Hospital Food 
    Inhospitable Hospital Food 

    What do hospitals have to say for themselves about serving meals that…

    By capernaum
    Gaming the System: Cardiologists, Heart Stents, and Upcoding 
    Gaming the System: Cardiologists, Heart Stents, and Upcoding 

    Cardiologists can criminally game the system by telling patients they have much…

    By capernaum
  • Sport
  • 🔥
  • Cryptocurrency
  • Data Science
  • Travel
  • Real Estate
  • AI
  • Technology
  • Machine Learning
  • Stock Market
  • Finance
  • Fashion
Font ResizerAa
CapernaumCapernaum
  • My Saves
  • My Interests
  • My Feed
  • History
  • Travel
  • Health
  • Technology
Search
  • Pages
    • Home
    • Blog Index
    • Contact Us
    • Search Page
    • 404 Page
  • Personalized
    • My Feed
    • My Saves
    • My Interests
    • History
  • Categories
    • Technology
    • Travel
    • Health
Have an existing account? Sign In
Follow US
© 2022 Foxiz News Network. Ruby Design Company. All Rights Reserved.
Home » Blog » Microsoft Research Introduces MMInference to Accelerate Pre-filling for Long-Context Vision-Language Models
AITechnology

Microsoft Research Introduces MMInference to Accelerate Pre-filling for Long-Context Vision-Language Models

capernaum
Last updated: 2025-04-25 08:23
capernaum
Share
Microsoft Research Introduces MMInference to Accelerate Pre-filling for Long-Context Vision-Language Models
SHARE

Integrating long-context capabilities with visual understanding significantly enhances the potential of VLMs, particularly in domains such as robotics, autonomous driving, and healthcare. Expanding the context size enables VLMs to process extended video and text sequences, thereby enhancing temporal resolution and performance in complex tasks, such as video comprehension. However, one major limitation is the quadratic complexity of attention mechanisms during the pre-fill phase, which results in high latency before autoregressive decoding begins. This delay, known as Time-to-First-Token, makes real-world deployment of long-context VLMs challenging. Various sparse attention methods, such as Sparse Transformer, Swin Transformer, and StreamingLLM, overlook the specific sparse patterns found in VLMs with mixed modalities, thereby limiting their efficiency and effectiveness.

Unlike text-only inputs, visual and video data in VLMs demonstrate unique spatiotemporal attention structures, forming grid-like patterns due to local correlations. In mixed-modality scenarios, clear boundaries exist between different modalities, leading to distinct attention behaviors that general sparse methods fail to capture. Recent advancements, such as MInference and dynamic sparse attention approaches, aim to improve inference efficiency by adapting attention patterns online. Yet, these techniques often fall short in handling the intricacies of mixed-modality inputs. While vision token compression and RNN-Transformer hybrids have been explored to reduce computational load, most of these methods focus on long-video and short-text pairings, neglecting the more complex dynamics of multiturn, mixed-modality interactions, which are increasingly important in practical applications.

Researchers from the University of Surrey and Microsoft have introduced MMInference, a dynamic, sparse attention method designed to accelerate the pre-filling stage of long-context VLMs. By identifying grid-like sparsity patterns in video inputs and distinct modality boundaries, MMInference applies permutation-based strategies to optimize attention computation. It dynamically constructs sparse distributions for each input and utilizes custom GPU kernels for enhanced efficiency, all without requiring modifications to existing models. Tested on benchmarks like Video QA, Captioning, and Vision-NIAH, MMInference achieved up to 8.3× speedup at 1M tokens, outperforming previous methods while maintaining high accuracy across multiple state-of-the-art VLMs.

MMInference is a framework designed to speed up the pre-filling phase of long-context vision-language models by leveraging modality-aware sparse attention. It integrates three key components: (1) intra-modality sparse patterns like Grid, A-shape, and Vertical-Slash attention; (2) cross-modality patterns such as Q-Boundary and 2D-Boundary; and (3) a modality-aware sparse attention search algorithm. Instead of dense computation, it uses dynamic sparse attention with optimized GPU kernels and efficient tensor handling. The framework dynamically identifies attention patterns and permutes tensors based on modality, enabling efficient handling of multi-modal inputs and reducing computational overhead while maintaining strong performance.

The study evaluates MMInference’s performance and efficiency on long-video tasks, including captioning, question answering, and retrieval in both unimodal and mixed-modality settings. Experiments were conducted using state-of-the-art models, such as Llava-Video and LongVILA, with comparisons against several sparse attention baselines. Results show that MMInference achieves near full-attention performance while being more computationally efficient. It performs particularly well in the newly introduced Mixed-Modality Needle in a Haystack (MM-NIAH) task by leveraging inter-modality sparse patterns. Additionally, MMInference demonstrates significant speedups in end-to-end latency and maintains robustness across varying context lengths and input types.

In conclusion, MMInference is a modality-aware sparse attention technique designed to accelerate long-context VLMs without compromising accuracy. It employs a permutation-based grid attention pattern tailored for the spatial-temporal locality of video inputs, along with specialized handling for mixed-modality boundaries. A search algorithm identifies optimal sparse patterns per attention head, dynamically adapting to the input. The method integrates directly into current VLM pipelines without requiring model changes or fine-tuning. With optimized GPU kernels, MMInference achieves up to 8.3× acceleration during the pre-filling stage at 1M tokens across various tasks, including video QA, captioning, and mixed-modality benchmarks, while retaining full-attention performance.


Check out the Paper and Code. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

🔥 [Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post Microsoft Research Introduces MMInference to Accelerate Pre-filling for Long-Context Vision-Language Models appeared first on MarkTechPost.

Share This Article
Twitter Email Copy Link Print
Previous Article NVIDIA AI Releases OpenMath-Nemotron-32B and 14B-Kaggle: Advanced AI Models for Mathematical Reasoning that Secured First Place in the AIMO-2 Competition and Set New Benchmark Records NVIDIA AI Releases OpenMath-Nemotron-32B and 14B-Kaggle: Advanced AI Models for Mathematical Reasoning that Secured First Place in the AIMO-2 Competition and Set New Benchmark Records
Next Article Polygon and Pyse to Launch Tokenized EV Fleet in Dubai Polygon and Pyse to Launch Tokenized EV Fleet in Dubai
Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Your Trusted Source for Accurate and Timely Updates!

Our commitment to accuracy, impartiality, and delivering breaking news as it happens has earned us the trust of a vast audience. Using RSS feeds, we aggregate news from trusted sources to ensure real-time updates on the latest events and trends. Stay ahead with timely, curated information designed to keep you informed and engaged.
TwitterFollow
TelegramFollow
LinkedInFollow
- Advertisement -
Ad imageAd image

You Might Also Like

NVIDIA AI Introduces Audio-SDS: A Unified Diffusion-Based Framework for Prompt-Guided Audio Synthesis and Source Separation without Specialized Datasets
AITechnology

NVIDIA AI Introduces Audio-SDS: A Unified Diffusion-Based Framework for Prompt-Guided Audio Synthesis and Source Separation without Specialized Datasets

By capernaum
This AI Paper Introduces Effective State-Size (ESS): A Metric to Quantify Memory Utilization in Sequence Models for Performance Optimization
AIMachine LearningTechnology

This AI Paper Introduces Effective State-Size (ESS): A Metric to Quantify Memory Utilization in Sequence Models for Performance Optimization

By capernaum
LightOn AI Released GTE-ModernColBERT-v1: A Scalable Token-Level Semantic Search Model for Long-Document Retrieval and Benchmark-Leading Performance
AIMachine LearningTechnology

LightOn AI Released GTE-ModernColBERT-v1: A Scalable Token-Level Semantic Search Model for Long-Document Retrieval and Benchmark-Leading Performance

By capernaum

A Coding Implementation of Accelerating Active Learning Annotation with Adala and Google Gemini

By capernaum
Capernaum
Facebook Twitter Youtube Rss Medium

Capernaum :  Your instant connection to breaking news & stories . Stay informed with real-time coverage across  AI ,Data Science , Finance, Fashion , Travel, Health. Your trusted source for 24/7 insights and updates.

© Capernaum 2024. All Rights Reserved.

CapernaumCapernaum
Welcome Back!

Sign in to your account

Lost your password?