Alibaba Open Sources Qwen2-VL: Understanding Videos Longer Than 20 Minutes

2024-08-30

Alibaba Cloud, the cloud computing arm of the e-commerce giant Alibaba, recently released its latest vision-language model, Qwen2-VL. The model is designed to strengthen visual understanding, video analysis, and multilingual text-image processing.


In third-party benchmarks, Qwen2-VL performs on par with other leading models such as Meta's Llama 3.1, OpenAI's GPT-4o, Anthropic's Claude 3 Haiku, and Google's Gemini-1.5 Flash. Users can try the model's inference capabilities on the Hugging Face platform.
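
For readers who want to try the model locally rather than through the hosted demo, the snippet below is a minimal image-description sketch following the usage pattern published on the Qwen2-VL Hugging Face model card. It assumes a recent transformers release with Qwen2-VL support plus the qwen-vl-utils helper package; the image URL is only a placeholder.

```python
# Minimal image-description sketch for Qwen2-VL via Hugging Face transformers.
# Assumes: pip install transformers qwen-vl-utils (with a transformers release
# that ships Qwen2VLForConditionalGeneration). The image URL is a placeholder.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and collect the visual inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate, then strip the prompt tokens from the output before decoding.
output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```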

Highlights of the model:

  • Powerful visual and video analysis capabilities: Qwen2-VL can recognize and analyze multilingual handwritten content, and it can identify, describe, and distinguish multiple objects in static images. It can also analyze video content in near real time, providing summaries or feedback, which could eventually support live scenarios such as technical support.
  • Video content understanding: the model can summarize video content, answer questions about it, and maintain coherence in real-time conversations, acting much like a personal assistant that extracts insights and information directly from video (see the video-summarization sketch after this list).
  • Multiple version options: Qwen2-VL comes in three parameter scales: Qwen2-VL-72B with 72 billion parameters, Qwen2-VL-7B with 7 billion, and Qwen2-VL-2B with 2 billion. The two smaller versions are open-sourced under the Apache 2.0 license, allowing enterprises to use them commercially.
  • Function invocation and visual perception: Qwen2-VL can integrate with third-party software, applications, and tools, extracting and understanding information from these external sources, such as flight status, weather forecasts, or package tracking, and interacting in a way that approximates human perception of the world (a generic tool-calling sketch appears after this list).
  • Architecture optimization: the model adopts several architectural improvements, including Naive Dynamic Resolution, which lets it process images at varying resolutions, and Multimodal Rotary Position Embedding (M-RoPE), which lets the model jointly capture and integrate positional information from text, images, and video (the video-summarization sketch below also shows how the resolution budget can be bounded in practice).
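
The video-understanding and dynamic-resolution points above can be combined in one short sketch. It follows the same Hugging Face usage pattern as the earlier example, adds a video message, and caps the dynamic-resolution pixel budget through the min_pixels/max_pixels settings documented on the model card; the video path, frame rate, and pixel limits are placeholder choices.

```python
# Video-summarization sketch for Qwen2-VL, bounding the dynamic-resolution budget.
# Assumes transformers with Qwen2-VL support and qwen-vl-utils, as in the earlier sketch.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Naive Dynamic Resolution maps each image or frame to a variable number of visual
# tokens; min_pixels/max_pixels cap that range to keep long videos within memory.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1024 * 28 * 28,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder local video; fps controls how densely frames are sampled.
            {"type": "video", "video": "file:///path/to/meeting_recording.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the key points of this video."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```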

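As for function invocation, the official Qwen tooling ships its own agent framework, so the snippet below is only a generic illustration of the tool-calling loop rather than Qwen's API: the model is prompted to answer with a JSON tool call, the application executes the named function, and the result is fed back to the model for the final reply. The flight-status function and the JSON format here are hypothetical.

```python
import json

# Hypothetical external tool; a real integration would call an airline or tracking API.
def get_flight_status(flight_number: str) -> dict:
    return {"flight": flight_number, "status": "on time", "gate": "B12"}

TOOLS = {"get_flight_status": get_flight_status}

def run_tool_call(model_output: str) -> str:
    """Execute a JSON tool call emitted by the model.

    Assumes the model was prompted to reply with an object such as
    {"tool": "get_flight_status", "arguments": {"flight_number": "CA123"}}.
    The result would normally be sent back to the model as a new message
    so it can phrase the final answer for the user.
    """
    call = json.loads(model_output)
    result = TOOLS[call["tool"]](**call["arguments"])
    return json.dumps(result)

# Example: a tool call the model might emit after being asked about flight CA123.
print(run_tool_call('{"tool": "get_flight_status", "arguments": {"flight_number": "CA123"}}'))
```
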
The Qwen2-VL models are now available to developers and researchers, and the team encourages the community to explore the potential of these cutting-edge tools.