Meta AI and Stanford University Jointly Launch Apollo Video Multimodal Model

2024-12-17

While Large Multimodal Models (LMMs) have made significant strides in text and image tasks, progress on video models has been comparatively slow. Videos span intricate spatial and temporal dimensions and demand far more computational resources to process. Current approaches typically apply image-based techniques directly or rely on uniform frame sampling, both of which struggle to capture motion and temporal patterns effectively. Additionally, training large-scale video models is computationally expensive, making it difficult to explore design choices efficiently.

To address these challenges, Meta AI collaborated with researchers from Stanford University to develop the Apollo video multimodal model. Apollo is designed to advance the boundaries of video understanding through thoughtful design decisions and improved efficiency, setting new benchmarks for tasks such as temporal reasoning and video-based question answering.

The Apollo model series introduced by Meta AI focuses on handling videos up to an hour in length and demonstrates exceptional performance in key video-language tasks. Apollo is available in three sizes: 1.5B, 3B, and 7B parameters, catering to various computational constraints and practical requirements.

Apollo's key innovations include:

  1. Scale Consistency: Design choices that effectively transfer from smaller to larger models, reducing the need for extensive large-scale experiments.
  2. Frames Per Second (fps) Sampling: Samples frames at a fixed rate per second of video rather than taking a fixed number of frames spread uniformly across it, preserving temporal consistency and outperforming uniform frame sampling (a minimal sketch of the difference follows this list).
  3. Dual Visual Encoders: Combines the spatial understanding capabilities of SigLIP with the enhanced temporal reasoning of InternVideo2 to achieve balanced video data representation.
  4. ApolloBench Benchmark Suite: Carefully curated to minimize redundancy in evaluations, providing detailed insights into model performance.
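
To make the fps-sampling point concrete, here is a minimal, self-contained sketch contrasting the two sampling strategies. The function names and parameters are illustrative only and do not reflect Apollo's actual implementation; the point is that fps sampling keeps the time gap between sampled frames constant, while uniform sampling stretches or compresses that gap with video length.

```python
# Illustrative comparison of uniform frame sampling vs. fps-based sampling.
# Function names and parameters here are hypothetical, not Apollo's actual API.

def uniform_sample(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices spread evenly across the whole video.
    The time between sampled frames changes with video length, so motion
    appears faster or slower depending on duration."""
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

def fps_sample(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick one frame every (native_fps / target_fps) frames.
    The time between sampled frames is constant regardless of video length,
    which preserves the speed and ordering of actions."""
    stride = native_fps / target_fps
    return [int(i * stride) for i in range(int(total_frames / stride))]

# A 30 fps clip that is 60 seconds long (1800 frames):
print(uniform_sample(total_frames=1800, num_frames=32))            # ~1.9 s between frames
print(fps_sample(total_frames=1800, native_fps=30, target_fps=2))  # 0.5 s between frames
```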

Apollo models boast several technical highlights and advantages:

  • FPS Sampling: Maintains consistent temporal flow, enabling Apollo to better comprehend actions, speeds, and the sequence of events within videos.
  • Scale Consistency: Experiments show that design choices from medium-sized models generalize well to larger models, reducing computational costs while maintaining performance improvements.
  • Dual Visual Encoders: The integration of two complementary encoders yields more accurate video representations (a minimal fusion sketch appears after this list).
  • Token Resampling: With the Perceiver Resampler, Apollo efficiently reduces the number of video tokens without significant information loss, allowing long-duration videos to be processed without excessive computational overhead (a resampler sketch appears after this list).
  • Optimized Training: Implements a three-stage training process to ensure stable and effective learning.
  • Multi-Turn Dialogues: Supports interactive multi-turn conversations based on video content, suitable for applications like video chat systems or content analysis.
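
The dual-encoder bullet describes combining a spatial (image) encoder such as SigLIP with a temporal (video) encoder such as InternVideo2. The PyTorch sketch below shows one plausible way to fuse the two feature streams by projecting each into a shared space and concatenating the tokens; the module names, dimensions, and fusion strategy are assumptions for illustration, not Apollo's published configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of a dual-encoder fusion step, assuming per-frame spatial
# features from one encoder and clip-level spatio-temporal features from
# another. Dimensions and module names are illustrative placeholders.

class DualEncoderFusion(nn.Module):
    def __init__(self, spatial_dim=1152, temporal_dim=768, hidden_dim=1024):
        super().__init__()
        # Project both feature streams into a shared hidden space.
        self.spatial_proj = nn.Linear(spatial_dim, hidden_dim)
        self.temporal_proj = nn.Linear(temporal_dim, hidden_dim)

    def forward(self, spatial_tokens, temporal_tokens):
        # spatial_tokens:  (batch, frames * patches, spatial_dim)  e.g. from an image encoder
        # temporal_tokens: (batch, clip_tokens, temporal_dim)      e.g. from a video encoder
        s = self.spatial_proj(spatial_tokens)
        t = self.temporal_proj(temporal_tokens)
        # Concatenate along the token dimension so the language model sees
        # both fine spatial detail and motion information.
        return torch.cat([s, t], dim=1)

fusion = DualEncoderFusion()
spatial = torch.randn(1, 8 * 196, 1152)   # 8 frames x 196 patches each
temporal = torch.randn(1, 256, 768)       # clip-level tokens
fused = fusion(spatial, temporal)
print(fused.shape)                        # torch.Size([1, 1824, 1024])
```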
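The Perceiver Resampler mentioned in the token-resampling bullet compresses a long, variable-length sequence of video tokens into a short, fixed-length one by letting a small set of learned latent queries cross-attend into the full sequence. Below is a simplified PyTorch sketch of that idea; the layer sizes and number of latents are hypothetical, and the actual module contains further details.

```python
import torch
import torch.nn as nn

# Simplified Perceiver-style resampler: a fixed number of learned query
# tokens cross-attend into the full sequence of video tokens, so the output
# length no longer grows with video duration. Hyperparameters are illustrative.

class PerceiverResampler(nn.Module):
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens):
        # video_tokens: (batch, num_video_tokens, dim) -- can be very long
        batch = video_tokens.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Each latent query summarizes information from all video tokens.
        pooled, _ = self.cross_attn(queries, video_tokens, video_tokens)
        return pooled + self.ffn(pooled)   # (batch, num_latents, dim)

resampler = PerceiverResampler()
long_video = torch.randn(1, 20_000, 1024)  # tokens from a long video
compressed = resampler(long_video)
print(compressed.shape)                    # torch.Size([1, 64, 1024])
```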

Apollo's performance has been validated across multiple benchmarks, often surpassing larger models. For instance, Apollo-1.5B outperforms models such as Phi-3.5-Vision (4.2B) and LongVA-7B; Apollo-3B matches or exceeds many 7B models; and Apollo-7B rivals or even surpasses models with over 30B parameters, such as Oryx-34B and VILA1.5-40B.

The Apollo series provides practical solutions for real-world applications, ranging from video-based question answering to content analysis and interactive systems. Meta AI's ApolloBench benchmark suite offers a more streamlined and effective method for evaluating video LMMs, paving the way for future research.

In summary, Apollo represents a significant advancement in the development of video LMMs. By addressing key challenges like efficient video sampling and model scalability, Apollo delivers practical and robust solutions for understanding video content. Its ability to outperform larger models underscores the importance of in-depth research into design and training strategies.