Damo Academy has introduced VideoLLaMA 3, a video-language model built around an image-centric design that has achieved remarkable results in video understanding. Despite having only 7 billion parameters, VideoLLaMA 3 excels in core evaluation dimensions such as general video comprehension, temporal reasoning, and long-video understanding, surpassing most baseline models.
This image-centric design philosophy permeates the entire model architecture and training process. By leveraging high-quality image-text data, VideoLLaMA 3 establishes a robust foundation for video understanding: even with only about 3 million video-text samples, it outperforms open-source models of the same parameter scale across the board. In addition, an edge-optimized version with 2 billion parameters delivers strong image understanding, performing well on multiple benchmark tests.
On the HuggingFace platform, VideoLLaMA 3 provides demonstrations for both image and video understanding. For instance, when presented with the painting "Mona Lisa," VideoLLaMA 3 can accurately discuss its historical impact and significance in the art world. In video understanding demos, it can precisely identify unusual elements in videos, such as a bear eating sushi on a table.
The success of VideoLLaMA 3 is primarily attributed to its image-centric training paradigm, which encompasses four stages: vision encoder adaptation, vision-language alignment, multi-task fine-tuning, and video-centric fine-tuning. By enabling the vision encoder to process dynamic-resolution images, leveraging rich image-text data to strengthen multimodal understanding, and combining image-text question-answering data with video caption data during fine-tuning, VideoLLaMA 3 achieves deep comprehension of video content.
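To make the staged recipe easier to picture, here is a minimal sketch that lays the four stages out as a plain Python configuration. The data mixes, trainable-module choices, and stage goals listed are illustrative assumptions about what such a schedule could look like, not the team's exact recipe.

```python
# Hypothetical sketch of an image-centric, four-stage training schedule.
# Which modules are unfrozen and which data each stage uses are assumptions
# for illustration; only the stage names come from the description above.
TRAINING_STAGES = [
    {
        "name": "vision_encoder_adaptation",
        "data": ["scene images", "document images"],
        "trainable": ["vision_encoder", "projector"],
        "goal": "adapt the encoder to dynamic-resolution inputs",
    },
    {
        "name": "vision_language_alignment",
        "data": ["image-caption pairs", "interleaved image-text"],
        "trainable": ["vision_encoder", "projector", "llm"],
        "goal": "build multimodal understanding from image-text data",
    },
    {
        "name": "multi_task_fine_tuning",
        "data": ["image QA", "video captions"],
        "trainable": ["vision_encoder", "projector", "llm"],
        "goal": "instruction following plus an initial notion of video",
    },
    {
        "name": "video_centric_fine_tuning",
        "data": ["video QA", "temporal reasoning tasks"],
        "trainable": ["projector", "llm"],
        "goal": "strengthen temporal and long-video understanding",
    },
]

for stage in TRAINING_STAGES:
    print(f"{stage['name']}: train {stage['trainable']} on {stage['data']}")
```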
The framework design of VideoLLaMA 3 is also innovative. It adopts Any-resolution Vision Tokenization (AVT), breaking through traditional fixed-resolution limitations so that the vision encoder can handle images and videos of varying resolutions. The Differential Frame Pruner (DiffFP) addresses redundancy in video data by dropping visual content that changes little between consecutive frames, improving video processing efficiency.
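A minimal NumPy sketch of the frame-pruning idea follows: consecutive frames are compared patch by patch, and patches whose mean absolute pixel difference from the previous frame falls below a threshold are discarded as redundant. The patch size, threshold, and function name are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def prune_redundant_patches(frames, patch=16, threshold=0.1):
    """Drop video patches that barely change between consecutive frames.

    frames: array of shape (T, H, W, C) with values in [0, 1].
    Returns (t, row, col) indices of kept patches; frame 0 is kept in full.
    """
    t_total, h, w, _ = frames.shape
    rows, cols = h // patch, w // patch
    kept = [(0, r, c) for r in range(rows) for c in range(cols)]
    for t in range(1, t_total):
        diff = np.abs(frames[t] - frames[t - 1])
        for r in range(rows):
            for c in range(cols):
                block = diff[r * patch:(r + 1) * patch,
                             c * patch:(c + 1) * patch]
                # mean absolute pixel difference as a cheap redundancy measure
                if block.mean() > threshold:
                    kept.append((t, r, c))
    return kept

# Toy usage: 8 frames of 64x64 RGB noise. A mostly static video would keep
# far fewer patches than the 8 * 4 * 4 = 128 available here.
video = np.random.rand(8, 64, 64, 3)
print(len(prune_redundant_patches(video)), "patches kept out of 128")
```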
In terms of data, the training of VideoLLaMA 3 relies on high-quality datasets. The Damo Academy team constructed the VL3Syn7M dataset, which contains 7 million image-caption pairs. Processes such as aspect-ratio filtering, aesthetic-score filtering, text-image similarity calculation, visual feature clustering, and image re-captioning ensure the quality and diversity of the data, providing a solid foundation for training VideoLLaMA 3.
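The cleaning steps above can be pictured as a chain of filters over candidate image-caption pairs. The sketch below wires the first three filters together with stand-in scoring callables and thresholds; these placeholders (aesthetic_score, text_image_sim, the cutoff values) are hypothetical, not the team's actual tooling, and the clustering and re-captioning steps are only noted in a comment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class Pair:
    width: int
    height: int
    caption: str

def clean_pairs(
    pairs: Iterable[Pair],
    aesthetic_score: Callable[[Pair], float],   # e.g. an aesthetic predictor
    text_image_sim: Callable[[Pair], float],    # e.g. a CLIP-style similarity
    min_aesthetic: float = 5.0,
    min_similarity: float = 0.25,
    ratio_range: Tuple[float, float] = (0.5, 2.0),
) -> List[Pair]:
    """Apply aspect-ratio, aesthetic, and text-image-similarity filters in order."""
    kept = []
    for p in pairs:
        ratio = p.width / p.height
        if not (ratio_range[0] <= ratio <= ratio_range[1]):
            continue                              # aspect-ratio filtering
        if aesthetic_score(p) < min_aesthetic:
            continue                              # aesthetic-score filtering
        if text_image_sim(p) < min_similarity:
            continue                              # text-image similarity filtering
        kept.append(p)
    # Visual-feature clustering and re-captioning would follow here,
    # deduplicating near-identical images and rewriting weak captions.
    return kept

# Toy usage with dummy scorers; a real pipeline would plug in model-based scores.
sample = [Pair(640, 480, "a cat on a sofa"), Pair(2000, 300, "banner ad")]
print(clean_pairs(sample, aesthetic_score=lambda p: 6.0, text_image_sim=lambda p: 0.3))
```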