AIMachine LearningTechnology

NVIDIA AI Research Unveils ‘Star Attention’: A Novel AI Algorithm for Efficient LLM Long-Context Inference

capernaum
Last updated: 2024-11-28 09:24

Transformer-based Large Language Models (LLMs) struggle to process long sequences efficiently because of the quadratic complexity of the self-attention mechanism. Computational and memory demands grow quadratically with sequence length, so scaling these models to realistic applications such as multi-document summarization, retrieval-based reasoning, or fine-grained repository-level code analysis quickly becomes impractical. Current approaches cannot handle sequences extending to millions of tokens without considerable computational overhead or loss of accuracy, a major obstacle to effective deployment across diverse use cases.
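To ground the complexity claim, here is a minimal NumPy sketch of dense single-head self-attention (no masking or batching). The intermediate score matrix has shape (n, n), so doubling the sequence length quadruples its compute and memory cost:

```python
import numpy as np

def self_attention(q, k, v):
    """Naive single-head self-attention. The (n, n) score matrix is what
    makes compute and memory grow quadratically with sequence length n."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # shape (n, n): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # shape (n, d)

n, d = 512, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = self_attention(q, k, v)
print(out.shape)   # (512, 64); the intermediate score matrix was (512, 512)
```

At n = 1 million tokens the score matrix alone would hold 10^12 entries, which is why dense attention cannot simply be scaled up.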

Various strategies have been proposed to address these inefficiencies. Sparse attention mechanisms reduce computational intensity but often fail to preserve the most critical global dependencies, degrading task performance. Memory-efficiency methods, such as key-value cache compression and low-rank approximations, reduce resource usage at the cost of scalability and accuracy. Distributed systems such as Ring Attention improve scalability by spreading computation across several devices, but they incur significant communication overhead that limits their effectiveness on extremely long sequences. These limitations point to the urgent need for a mechanism that balances efficiency, scalability, and accuracy.

Researchers from NVIDIA introduced Star Attention, an innovative block-sparse attention mechanism designed to address these challenges. Star Attention breaks the input sequence into smaller blocks, each preceded by an "anchor block" that carries global information. The blocks are then processed independently across multiple hosts, significantly reducing computational complexity while still capturing global patterns. During inference, the attention scores from each block are combined using a distributed softmax algorithm, enabling efficient global attention while minimizing data transmission. The mechanism integrates non-intrusively with existing Transformer-based frameworks and requires no fine-tuning, making it a practical solution for handling long sequences in real-world deployments.

Technically, Star Attention operates in two phases. In the first phase, context encoding, each input block is augmented with an anchor block so that the model captures global attention patterns; after processing, the key-value caches of the anchor blocks are discarded to conserve memory. In the second phase, query encoding and token generation, attention scores are computed locally on each host and combined via a distributed softmax, allowing the model to maintain computational efficiency and scalability.
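The distributed-softmax merge in the second phase can be sketched in NumPy. This is an illustrative reconstruction, not NVIDIA's implementation: each host returns its locally normalized attention output together with the log-sum-exp of its local scores, and combining them reproduces exactly what dense attention over the full concatenated key-value cache would give:

```python
import numpy as np

def local_attention(q, kv):
    """One host's contribution: its locally normalized attention output plus
    the log-sum-exp of its local scores, needed for the global merge."""
    k, v = kv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (1, block_len)
    m = scores.max(axis=-1, keepdims=True)
    w = np.exp(scores - m)
    s = w.sum(axis=-1, keepdims=True)
    return (w / s) @ v, m + np.log(s)                # local output, local LSE

def distributed_softmax_merge(q, kv_per_host):
    """Weight each host's local output by the softmax of the local
    log-sum-exps. Only small (output, LSE) pairs cross host boundaries."""
    outs, lses = zip(*(local_attention(q, kv) for kv in kv_per_host))
    lses = np.concatenate(lses, axis=-1)             # (1, num_hosts)
    w = np.exp(lses - lses.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return sum(wi * oi for wi, oi in zip(w.ravel(), outs))

# Sanity check: the merged result matches dense attention over the full KV.
rng = np.random.default_rng(1)
d = 32
q = rng.standard_normal((1, d))
blocks = [(rng.standard_normal((100, d)), rng.standard_normal((100, d)))
          for _ in range(4)]                         # 4 hosts, 100 tokens each
merged = distributed_softmax_merge(q, blocks)

k_full = np.concatenate([k for k, _ in blocks])
v_full = np.concatenate([v for _, v in blocks])
scores = q @ k_full.T / np.sqrt(d)
p = np.exp(scores - scores.max())
dense = (p / p.sum()) @ v_full
print(np.allclose(merged, dense))  # True
```

Because each host's log-sum-exp already accounts for its share of the global normalizer, the merge is exact rather than approximate; the block-sparsity (and any approximation) comes from the context-encoding phase, where each block attends only to itself and its anchor.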

Star Attention was evaluated on benchmarks such as RULER, which includes retrieval and reasoning tasks, and BABILong, which tests long-context reasoning. The models tested, Llama-3.1-8B and Llama-3.1-70B, were run on sequences ranging from 16,000 to 1 million tokens using HuggingFace Transformers on A100 GPUs, with bfloat16 precision for maximum speed.

Star Attention delivers significant advances in both speed and accuracy. It achieves up to 11 times faster inference than the baseline while maintaining 95-100% accuracy across tasks. On the RULER benchmark it shines in retrieval tasks, and its accuracy degrades by only 1-3% in more complex multi-hop reasoning scenarios. On the BABILong benchmark, which tests reasoning over long contexts, results stay within 0-3% of the baseline. It also scales to sequence lengths of 1 million tokens, making it a strong, flexible candidate for highly sequence-dependent applications.

Star Attention establishes a transformative framework for efficient inference in Transformer-based LLMs, addressing key limitations in processing long sequences. Block-sparse attention combined with anchor blocks strikes the right balance between computational efficiency and accuracy, enabling speedups while largely preserving performance. This advance brings scalable, practical solutions to a wide range of AI applications, including reasoning, retrieval, and summarization. Future work will refine the anchor mechanism and improve performance in tasks bottlenecked by inter-block communication.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post NVIDIA AI Research Unveils ‘Star Attention’: A Novel AI Algorithm for Efficient LLM Long-Context Inference appeared first on MarkTechPost.
