Shanghai AI Lab Unveils Next-Generation "ShuSheng" Visual Model, Pioneering Open Source for Core Visual Tasks

2024-01-30

Shanghai Artificial Intelligence Laboratory, in collaboration with Tsinghua University, the Chinese University of Hong Kong, and SenseTime, has officially open-sourced the new-generation Scholar vision large model, InternVL.

This vision encoder model, named InternVL-6B, has up to 6 billion parameters, marking a significant breakthrough in the field of visual large models. It is trained with a progressive alignment strategy that combines contrastive and generative learning, achieving fine-grained alignment between vision and language models on internet-scale data. As a result, InternVL-6B can not only capture subtle visual details in complex images but also perform image-to-text tasks.
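The contrastive half of such an alignment is commonly formulated as a symmetric InfoNCE objective over paired image and text embeddings, in the style of CLIP. The sketch below is a generic NumPy illustration under that assumption; the function name, temperature value, and all details are illustrative, not InternVL's actual training code:

```python
import numpy as np

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    A generic CLIP-style sketch: matched image/text pairs sit on the
    diagonal of the similarity matrix and are pulled together, while
    mismatched pairs are pushed apart.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B); matched pairs on the diagonal
    labels = np.arange(logits.shape[0])

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned embeddings yield a near-zero loss, while mispaired batches are penalized; in full training pipelines of this kind, the contrastive term is typically complemented by a generative (captioning) objective.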

Even more remarkably, InternVL-6B can also interpret complex page content and even solve mathematical problems, capabilities that further broaden its practical applications across a wide range of fields.

Shanghai AI Laboratory has long been at the forefront of visual large model research. In 2021, it released Scholar 1.0, the first large model in China to cover a broad range of visual tasks: with a single base model, it handles four core tasks, namely classification, object detection, semantic segmentation, and depth estimation.

In 2022, the laboratory followed up with the visual large model InternImage. Built on dynamic sparse convolution, InternImage establishes a new architecture for visual large models, opening up an approach beyond Transformers, and has demonstrated outstanding performance across 12 visual tasks.

As visual large models continue to develop and see wider application, their potential across many fields will be further unleashed.