Meta AI and Stanford University Jointly Launch Apollo Video Multimodal Model

2024-12-17

While Large Multimodal Models (LMMs) have made significant strides in text and image tasks, progress on video models has been comparatively slow. Videos span intricate spatial and temporal dimensions and demand far more computational resources to process. Current approaches typically apply image-based techniques directly or rely on uniform frame sampling, both of which struggle to capture motion and temporal patterns effectively. Additionally, training large-scale video models is computationally expensive, making it difficult to explore design choices efficiently.

To address these challenges, Meta AI collaborated with researchers from Stanford University to develop the Apollo video multimodal model. Apollo is designed to advance the boundaries of video understanding through thoughtful design decisions and improved efficiency, setting new benchmarks for tasks such as temporal reasoning and video-based question answering.

The Apollo model series introduced by Meta AI focuses on handling videos up to an hour in length and demonstrates exceptional performance in key video-language tasks. Apollo is available in three sizes: 1.5B, 3B, and 7B parameters, catering to various computational constraints and practical requirements.

Apollo's key innovations include:

  1. Scale Consistency: Design choices that effectively transfer from smaller to larger models, reducing the need for extensive large-scale experiments.
  2. Frames Per Second (fps) Sampling: Samples frames at a fixed rate per second of video rather than taking a fixed number of frames spread uniformly across it, preserving temporal consistency and outperforming uniform frame sampling (a minimal sketch of the difference follows this list).
  3. Dual Visual Encoders: Combines the spatial understanding capabilities of SigLIP with the enhanced temporal reasoning of InternVideo2 to achieve balanced video data representation.
  4. ApolloBench Benchmark Suite: Carefully curated to minimize redundancy in evaluations, providing detailed insights into model performance.
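
To make the fps-sampling point concrete, here is a minimal, self-contained sketch contrasting the two sampling strategies. The function names and parameters are illustrative only and do not reflect Apollo's actual implementation; the point is that fps sampling keeps the time gap between sampled frames constant, while uniform sampling stretches or compresses that gap with video length.

```python
# Illustrative comparison of uniform frame sampling vs. fps-based sampling.
# Function names and parameters here are hypothetical, not Apollo's actual API.

def uniform_sample(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices spread evenly across the whole video.
    The time between sampled frames changes with video length, so motion
    appears faster or slower depending on duration."""
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

def fps_sample(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick one frame every (native_fps / target_fps) frames.
    The time between sampled frames is constant regardless of video length,
    which preserves the speed and ordering of actions."""
    stride = native_fps / target_fps
    return [int(i * stride) for i in range(int(total_frames / stride))]

# A 30 fps clip that is 60 seconds long (1800 frames):
print(uniform_sample(total_frames=1800, num_frames=32))            # ~1.9 s between frames
print(fps_sample(total_frames=1800, native_fps=30, target_fps=2))  # 0.5 s between frames
```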

Apollo models boast several technical highlights and advantages:

  • FPS Sampling: Maintains consistent temporal flow, enabling Apollo to better comprehend actions, speeds, and the sequence of events within videos.
  • Scale Consistency: Experiments show that design choices from medium-sized models generalize well to larger models, reducing computational costs while maintaining performance improvements.
  • Dual Visual Encoders: The integration of two complementary encoders yields more accurate video representations (a minimal fusion sketch appears after this list).
  • Token Resampling: With the Perceiver Resampler, Apollo efficiently reduces the number of video tokens without significant information loss, allowing long-duration videos to be processed without excessive computational overhead (a resampler sketch appears after this list).
  • Optimized Training: Implements a three-stage training process to ensure stable and effective learning.
  • Multi-Turn Dialogues: Supports interactive multi-turn conversations based on video content, suitable for applications like video chat systems or content analysis.
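
The dual-encoder bullet describes combining a spatial (image) encoder such as SigLIP with a temporal (video) encoder such as InternVideo2. The PyTorch sketch below shows one plausible way to fuse the two feature streams by projecting each into a shared space and concatenating the tokens; the module names, dimensions, and fusion strategy are assumptions for illustration, not Apollo's published configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of a dual-encoder fusion step, assuming per-frame spatial
# features from one encoder and clip-level spatio-temporal features from
# another. Dimensions and module names are illustrative placeholders.

class DualEncoderFusion(nn.Module):
    def __init__(self, spatial_dim=1152, temporal_dim=768, hidden_dim=1024):
        super().__init__()
        # Project both feature streams into a shared hidden space.
        self.spatial_proj = nn.Linear(spatial_dim, hidden_dim)
        self.temporal_proj = nn.Linear(temporal_dim, hidden_dim)

    def forward(self, spatial_tokens, temporal_tokens):
        # spatial_tokens:  (batch, frames * patches, spatial_dim)  e.g. from an image encoder
        # temporal_tokens: (batch, clip_tokens, temporal_dim)      e.g. from a video encoder
        s = self.spatial_proj(spatial_tokens)
        t = self.temporal_proj(temporal_tokens)
        # Concatenate along the token dimension so the language model sees
        # both fine spatial detail and motion information.
        return torch.cat([s, t], dim=1)

fusion = DualEncoderFusion()
spatial = torch.randn(1, 8 * 196, 1152)   # 8 frames x 196 patches each
temporal = torch.randn(1, 256, 768)       # clip-level tokens
fused = fusion(spatial, temporal)
print(fused.shape)                        # torch.Size([1, 1824, 1024])
```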
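The Perceiver Resampler mentioned in the token-resampling bullet compresses a long, variable-length sequence of video tokens into a short, fixed-length one by letting a small set of learned latent queries cross-attend into the full sequence. Below is a simplified PyTorch sketch of that idea; the layer sizes and number of latents are hypothetical, and the actual module contains further details.

```python
import torch
import torch.nn as nn

# Simplified Perceiver-style resampler: a fixed number of learned query
# tokens cross-attend into the full sequence of video tokens, so the output
# length no longer grows with video duration. Hyperparameters are illustrative.

class PerceiverResampler(nn.Module):
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens):
        # video_tokens: (batch, num_video_tokens, dim) -- can be very long
        batch = video_tokens.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Each latent query summarizes information from all video tokens.
        pooled, _ = self.cross_attn(queries, video_tokens, video_tokens)
        return pooled + self.ffn(pooled)   # (batch, num_latents, dim)

resampler = PerceiverResampler()
long_video = torch.randn(1, 20_000, 1024)  # tokens from a long video
compressed = resampler(long_video)
print(compressed.shape)                    # torch.Size([1, 64, 1024])
```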

Apollo's performance has been validated across multiple benchmarks, often surpassing larger models. For instance, Apollo-1.5B outperforms models such as Phi-3.5-Vision (4.2B) and LongVA-7B; Apollo-3B matches or exceeds many 7B models; and Apollo-7B rivals or even surpasses models with over 30B parameters, such as Oryx-34B and VILA1.5-40B.

The Apollo series provides practical solutions for real-world applications, ranging from video-based question answering to content analysis and interactive systems. Meta AI's ApolloBench benchmark suite offers a more streamlined and effective method for evaluating video LMMs, paving the way for future research.

In summary, Apollo represents a significant advancement in the development of video LMMs. By addressing key challenges like efficient video sampling and model scalability, Apollo delivers practical and robust solutions for understanding video content. Its ability to outperform larger models underscores the importance of in-depth research into design and training strategies.