VisualWebInstruct: A Large-Scale Multimodal Reasoning Dataset for Enhancing Vision-Language Models

By capernaum
Last updated: 2025-03-18 06:01

Vision-language models (VLMs) have shown notable progress in perception-driven tasks such as visual question answering (VQA) and document-based visual reasoning. However, their effectiveness in reasoning-intensive tasks remains limited by the scarcity of high-quality, diverse training datasets. Existing multimodal reasoning datasets have several shortcomings: some focus too narrowly on specific scientific imagery, others rely on synthetic data that generalizes poorly to real-world settings, and many are too small or simplistic to develop robust reasoning capabilities. Due to these constraints, VLMs struggle with multi-step reasoning tasks such as those evaluated in the MMMU, MathVista, and MEGABench benchmarks. Given the challenges of manual annotation at scale, researchers have explored automated data mining approaches. Inspired by WebInstruct, a method for retrieving reasoning-focused text from the internet, efforts have been made to extend this approach to multimodal reasoning. However, the absence of large-scale multimodal datasets and the limitations of current retrieval models have hindered its feasibility.

Researchers have explored various strategies to advance multimodal reasoning, including neural symbolic reasoning, optimized visual encoding, plan-based prompting, and structured reasoning frameworks. While proprietary models like GPT-4o and Gemini demonstrate state-of-the-art performance, their restricted access has led to the development of open-source alternatives such as LLaVA, MiniGPT-4, and Deepseek-VL. Many of these models use lightweight connector-based architectures to integrate visual and textual representations. A key technique that has significantly improved reasoning in large language models (LLMs) is chain-of-thought (CoT) prompting, which breaks down complex queries into sequential reasoning steps, enhancing logical inference. Models such as Prism and MSG have built upon this structured reasoning approach, refining perception-reasoning pipelines and optimizing prompt-based methodologies. Despite these advances, the limited availability of large-scale supervised datasets for multimodal reasoning remains a major bottleneck, impeding further improvements in VLM capabilities.
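To make the chain-of-thought idea concrete, the short sketch below shows how a multimodal query might be wrapped in a prompt that asks for step-by-step reasoning before a final answer. The prompt wording and the `query_vlm` helper are hypothetical placeholders for illustration, not the API of any particular model.

```python
# Minimal sketch of chain-of-thought (CoT) prompting for a vision-language query.
# `query_vlm` is a hypothetical stand-in for whatever VLM inference call is used.

COT_TEMPLATE = (
    "You are given an image and a question.\n"
    "Question: {question}\n"
    "Think through the problem step by step, then state the final answer "
    "on a new line prefixed with 'Answer:'."
)

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a real VLM call (e.g. an open-source LLaVA-style model)."""
    return "Step 1: ...\nStep 2: ...\nAnswer: 42"

def answer_with_cot(image_path: str, question: str) -> str:
    prompt = COT_TEMPLATE.format(question=question)
    response = query_vlm(image_path, prompt)
    # Keep only the final answer line; the intermediate steps form the CoT trace.
    return response.splitlines()[-1].removeprefix("Answer:").strip()

print(answer_with_cot("triangle.png", "What is the area of the shaded triangle?"))
```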

Researchers from the University of Waterloo, the University of Toronto, UC Santa Barbara, Carnegie Mellon University (CMU), the National University of Singapore (NUS), and Netmind.ai have introduced VisualWebInstruct, a large-scale multimodal reasoning dataset to enhance VLMs. Using Google Image Search, they collected 30,000 seed images from disciplines like math, physics, and finance, retrieving 700K+ web pages to extract 900K question-answer pairs (40% visual). Fine-tuning MAmmoTH-VL2 on this dataset led to state-of-the-art performance on benchmarks like MMMU-Pro-std and Dyna-Math, demonstrating its effectiveness in improving complex reasoning tasks for VLMs.
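For a sense of what a mined example might look like, the record below sketches one plausible layout for a question-answer pair; the field names and values are assumptions for illustration, not the released VisualWebInstruct schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAPair:
    """Illustrative record layout; field names are assumptions, not the released schema."""
    question: str
    answer: str
    image_path: Optional[str]   # roughly 40% of pairs carry an image
    subject: str                # e.g. "math", "physics", "finance"
    source_url: str             # web page the pair was mined from

example = QAPair(
    question="A projectile is launched at 30 m/s at 45 degrees. What is its range?",
    answer="Approximately 91.7 m (R = v^2 * sin(2*theta) / g).",
    image_path="images/projectile_diagram.png",
    subject="physics",
    source_url="https://example.com/physics-problem",
)
```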

The data mining pipeline extracts image-rich QA pairs from the internet, starting with 30K scientific images across various disciplines. Using Google Image Search, it gathers 758,490 unique URLs, filtering out non-educational sources. Accessibility trees are constructed to extract relevant text and images. The Gemini 1.5 Flash model identifies and filters QA pairs based on quality criteria. Further refinement with GPT-4o ensures answer consistency, generating multiple responses and validating them against the original web sources. The final dataset, VisualWebInstruct, contains 1.04 million QA pairs, with 38% including images, covering subjects such as mathematics (62.5%), physics (14.5%), and finance (7.25%).
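The staged pipeline described above (image search, accessibility-tree extraction, QA extraction with Gemini 1.5 Flash, and consistency checking with GPT-4o) can be summarized in the hedged outline below. Every helper here is a stub standing in for the corresponding step; the real prompts, models, and filters are not reproduced.

```python
# Hedged outline of the mining pipeline; each helper is a stub for the real step.

def search_similar_pages(seed_image: str) -> list[str]:
    """Stand-in for Google Image Search retrieval of candidate URLs."""
    return ["https://example.com/problem-set"]

def build_accessibility_tree(url: str) -> str:
    """Stand-in for rendering the page and extracting its text and image nodes."""
    return "page content with an embedded question and answer ..."

def extract_qa_pairs(page_text: str) -> list[dict]:
    """Stand-in for the Gemini 1.5 Flash extraction and quality-filter step."""
    return [{"question": "...", "answer": "...", "source": page_text[:40]}]

def answers_consistent(pair: dict, n_samples: int = 3) -> bool:
    """Stand-in for the GPT-4o consistency check: regenerate the answer several
    times and keep the pair only if the answers agree with the web source."""
    return True

def mine(seed_images: list[str]) -> list[dict]:
    dataset = []
    for image in seed_images:
        for url in search_similar_pages(image):
            tree = build_accessibility_tree(url)
            for pair in extract_qa_pairs(tree):
                if answers_consistent(pair):
                    dataset.append(pair)
    return dataset

print(len(mine(["seed_images/math_0001.png"])))
```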

The study fine-tuned MAmmoTH-VL on the VisualWebInstruct dataset, resulting in MAmmoTH-VL2. The architecture combines a Qwen2.5-7B-Instruct language model, a SigLIP vision encoder, and a projector module. Training used supervised fine-tuning with a batch size of 256 and distinct learning rates for different components. The model was evaluated on seven multimodal reasoning benchmarks, where it outperformed comparable open-source models, particularly in mathematical reasoning. An ablation study showed that integrating VisualWebInstruct with LLaVA-CoT produced the best results, highlighting the dataset’s effectiveness in improving multimodal reasoning performance across diverse tasks.
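The connector-based design and the use of distinct learning rates per component can be illustrated with the minimal PyTorch sketch below. The hidden sizes, the dummy encoder and language-model stand-ins, and the learning rates are assumptions for illustration, not the actual MAmmoTH-VL2 configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the connector-style architecture: a vision encoder feeds a
# small projector MLP whose outputs are consumed as "visual tokens" by the LM.
VISION_DIM, LM_DIM = 1152, 3584  # assumed SigLIP / Qwen2.5-7B hidden sizes

class Projector(nn.Module):
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(vision_features)

vision_encoder = nn.Linear(3 * 384 * 384, VISION_DIM)  # stand-in for SigLIP
language_model = nn.Linear(LM_DIM, LM_DIM)              # stand-in for the LM
projector = Projector(VISION_DIM, LM_DIM)

# "Distinct learning rates for different components" maps naturally onto
# optimizer parameter groups; the rates here are placeholders.
optimizer = torch.optim.AdamW([
    {"params": vision_encoder.parameters(), "lr": 2e-6},
    {"params": projector.parameters(),      "lr": 1e-5},
    {"params": language_model.parameters(), "lr": 1e-5},
])

# Tiny forward pass to show the shapes flowing through the connector.
dummy_image = torch.randn(1, 3 * 384 * 384)
visual_tokens = projector(vision_encoder(dummy_image))
print(visual_tokens.shape)  # torch.Size([1, 3584])
```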

In conclusion, the study explores building large-scale multimodal reasoning datasets without human annotation. It is the first to use Google Image Search to mine high-quality visual reasoning data, achieving state-of-the-art results on five benchmarks. The proposed method, VisualWebInstruct, leverages search engines to create a diverse dataset across multiple disciplines. Processing over 700K URLs generates around 900K question-answer pairs. Models fine-tuned on this dataset show notable performance improvements, with MAmmoTH-VL2 achieving leading results among 10B-parameter models. These findings demonstrate the dataset’s effectiveness in enhancing vision-language models for complex reasoning tasks.


Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project.
