Damo Academy has introduced VideoLLaMA 3, a video-language model built around an image-centric design that has achieved remarkable results in video understanding. Despite having only 7 billion parameters, VideoLLaMA 3 excels in core evaluation dimensions such as general video comprehension, temporal reasoning, and long-video understanding, surpassing most baseline models.
This image-centric design philosophy permeates the entire model architecture and training process. By leveraging high-quality image-text data, VideoLLaMA 3 establishes a robust foundation for video understanding: even with only about 3 million video-text samples, it outperforms open-source models of the same parameter scale across the board. In addition, an edge-optimized version with 2 billion parameters delivers strong image understanding, performing well on multiple benchmark tests.
On the HuggingFace platform, VideoLLaMA 3 provides demonstrations for both image and video understanding. For instance, when presented with the painting "Mona Lisa," VideoLLaMA 3 can accurately discuss its historical impact and significance in the art world. In video understanding demos, it can precisely identify unusual elements in videos, such as a bear eating sushi on a table.
The success of VideoLLaMA 3 is primarily attributed to its image-centric training paradigm, which encompasses four stages: vision encoder adaptation, vision-language alignment, multi-task fine-tuning, and video-centric fine-tuning. By enabling the vision encoder to process dynamic-resolution images, leveraging rich image-text data to strengthen multimodal understanding, and combining image-text question-answering data with video caption data during fine-tuning, VideoLLaMA 3 achieves deep comprehension of video content.
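To make the staged recipe easier to picture, here is a minimal sketch that lays the four stages out as a plain Python configuration. The data mixes, trainable-module choices, and stage goals listed are illustrative assumptions about what such a schedule could look like, not the team's exact recipe.

```python
# Hypothetical sketch of an image-centric, four-stage training schedule.
# Which modules are unfrozen and which data each stage uses are assumptions
# for illustration; only the stage names come from the description above.
TRAINING_STAGES = [
    {
        "name": "vision_encoder_adaptation",
        "data": ["scene images", "document images"],
        "trainable": ["vision_encoder", "projector"],
        "goal": "adapt the encoder to dynamic-resolution inputs",
    },
    {
        "name": "vision_language_alignment",
        "data": ["image-caption pairs", "interleaved image-text"],
        "trainable": ["vision_encoder", "projector", "llm"],
        "goal": "build multimodal understanding from image-text data",
    },
    {
        "name": "multi_task_fine_tuning",
        "data": ["image QA", "video captions"],
        "trainable": ["vision_encoder", "projector", "llm"],
        "goal": "instruction following plus an initial notion of video",
    },
    {
        "name": "video_centric_fine_tuning",
        "data": ["video QA", "temporal reasoning tasks"],
        "trainable": ["projector", "llm"],
        "goal": "strengthen temporal and long-video understanding",
    },
]

for stage in TRAINING_STAGES:
    print(f"{stage['name']}: train {stage['trainable']} on {stage['data']}")
```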
The framework design of VideoLLaMA 3 is also innovative. It adopts Any-resolution Vision Tokenization (AVT), breaking through traditional fixed-resolution limitations so that the vision encoder can handle images and videos of varying resolutions. The Differential Frame Pruner (DiffFP) addresses redundancy in video data by dropping visual content that changes little between consecutive frames, improving video processing efficiency.
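A minimal NumPy sketch of the frame-pruning idea follows: consecutive frames are compared patch by patch, and patches whose mean absolute pixel difference from the previous frame falls below a threshold are discarded as redundant. The patch size, threshold, and function name are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def prune_redundant_patches(frames, patch=16, threshold=0.1):
    """Drop video patches that barely change between consecutive frames.

    frames: array of shape (T, H, W, C) with values in [0, 1].
    Returns (t, row, col) indices of kept patches; frame 0 is kept in full.
    """
    t_total, h, w, _ = frames.shape
    rows, cols = h // patch, w // patch
    kept = [(0, r, c) for r in range(rows) for c in range(cols)]
    for t in range(1, t_total):
        diff = np.abs(frames[t] - frames[t - 1])
        for r in range(rows):
            for c in range(cols):
                block = diff[r * patch:(r + 1) * patch,
                             c * patch:(c + 1) * patch]
                # mean absolute pixel difference as a cheap redundancy measure
                if block.mean() > threshold:
                    kept.append((t, r, c))
    return kept

# Toy usage: 8 frames of 64x64 RGB noise. A mostly static video would keep
# far fewer patches than the 8 * 4 * 4 = 128 available here.
video = np.random.rand(8, 64, 64, 3)
print(len(prune_redundant_patches(video)), "patches kept out of 128")
```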
In terms of data, the training of VideoLLaMA 3 relies on high-quality datasets. The Damo Academy team constructed the VL3Syn7M dataset, which contains 7 million image-caption pairs. Processes such as aspect-ratio filtering, aesthetic-score filtering, text-image similarity calculation, visual feature clustering, and image re-captioning ensure the quality and diversity of the data, providing a solid foundation for training VideoLLaMA 3.
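The cleaning steps above can be pictured as a chain of filters over candidate image-caption pairs. The sketch below wires the first three filters together with stand-in scoring callables and thresholds; these placeholders (aesthetic_score, text_image_sim, the cutoff values) are hypothetical, not the team's actual tooling, and the clustering and re-captioning steps are only noted in a comment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class Pair:
    width: int
    height: int
    caption: str

def clean_pairs(
    pairs: Iterable[Pair],
    aesthetic_score: Callable[[Pair], float],   # e.g. an aesthetic predictor
    text_image_sim: Callable[[Pair], float],    # e.g. a CLIP-style similarity
    min_aesthetic: float = 5.0,
    min_similarity: float = 0.25,
    ratio_range: Tuple[float, float] = (0.5, 2.0),
) -> List[Pair]:
    """Apply aspect-ratio, aesthetic, and text-image-similarity filters in order."""
    kept = []
    for p in pairs:
        ratio = p.width / p.height
        if not (ratio_range[0] <= ratio <= ratio_range[1]):
            continue                              # aspect-ratio filtering
        if aesthetic_score(p) < min_aesthetic:
            continue                              # aesthetic-score filtering
        if text_image_sim(p) < min_similarity:
            continue                              # text-image similarity filtering
        kept.append(p)
    # Visual-feature clustering and re-captioning would follow here,
    # deduplicating near-identical images and rewriting weak captions.
    return kept

# Toy usage with dummy scorers; a real pipeline would plug in model-based scores.
sample = [Pair(640, 480, "a cat on a sofa"), Pair(2000, 300, "banner ad")]
print(clean_pairs(sample, aesthetic_score=lambda p: 6.0, text_image_sim=lambda p: 0.3))
```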