Microsoft Releases LLM2CLIP, Ushering in a New Era of Cross-Modal Representation

2024-11-15

In today's rapidly evolving technological landscape, CLIP stands as one of the most important multimodal foundation models, spearheading the integration of images and text. By applying simple contrastive learning to large-scale image-text pairs, CLIP maps visual and textual signals into a shared feature space and demonstrates strong cross-modal representation capabilities.
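To make the training objective concrete, the sketch below shows the kind of symmetric contrastive loss that CLIP-style training applies to a batch of paired image and text embeddings; the tensor names and temperature value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the two encoders.
    Matching pairs sit on the diagonal of the similarity matrix.
    """
    # Project both modalities onto the unit sphere so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```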

As a robust retrieval system, CLIP supports a wide range of tasks, including zero-shot classification, detection, segmentation, and image-text retrieval. As a feature extractor, it also holds a leading position in cross-modal representation tasks such as image understanding, video comprehension, and text-to-image/video generation. Its distinguishing strength is that it links images to natural language, giving visual encoders a new perspective by grounding them in the detailed textual descriptions that carry human knowledge.
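As an illustration of CLIP used as a zero-shot classifier, the following sketch scores an image against a handful of text prompts through the Hugging Face `transformers` CLIP interface; the checkpoint name, image path, and prompts are placeholder choices rather than anything prescribed by the article.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP-compatible checkpoint is used the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image file
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; the highest one is the zero-shot prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs.squeeze().tolist())))
```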

With the rapid development of large language models (LLMs), the boundaries of language comprehension and generation are continually expanding. The powerful text processing capabilities of LLMs open new opportunities for CLIP, especially in handling long and complex captions, compensating for one of CLIP's original limitations. In addition, the extensive knowledge embedded in LLMs can make training more efficient. However, despite their superior understanding abilities, the text features that LLMs produce are often not discriminative enough to serve directly as CLIP text embeddings, and integrating LLMs with CLIP raises a number of further challenges.

To overcome these challenges, researchers from Tongji University and Microsoft conducted extensive studies and proposed the LLM2CLIP approach. The method enhances visual representation learning by incorporating a large language model: the original CLIP text encoder is replaced outright, and the LLM's extensive knowledge is used to strengthen CLIP's visual encoder. Alongside this bold encoder swap, the work introduces cost-effective fine-tuning strategies to address the difficulties the swap creates.
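To make the architecture concrete, here is a schematic sketch, based on our reading of the description above, of how such an encoder swap can be wired: the fine-tuned LLM is kept frozen as the text tower, a lightweight adapter projects its sentence embeddings into the joint space, and the pretrained CLIP visual encoder is trained against them. All module and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn

class LLM2CLIPStyleModel(nn.Module):
    """Hypothetical wiring: frozen LLM text tower + trainable adapter + CLIP vision tower."""

    def __init__(self, llm_text_encoder, clip_visual_encoder, llm_dim, clip_dim):
        super().__init__()
        self.text_encoder = llm_text_encoder            # fine-tuned LLM, kept frozen
        self.visual_encoder = clip_visual_encoder       # pretrained CLIP vision tower, fine-tuned
        self.text_adapter = nn.Linear(llm_dim, clip_dim)  # lightweight projection into the joint space

        for p in self.text_encoder.parameters():
            p.requires_grad = False                     # only the adapter and vision tower are trained

    def forward(self, images, token_ids, attention_mask):
        image_emb = self.visual_encoder(images)
        with torch.no_grad():
            # Assumes the LLM wrapper returns a pooled sentence embedding per caption.
            text_feat = self.text_encoder(token_ids, attention_mask)
        text_emb = self.text_adapter(text_feat)
        return image_emb, text_emb
```

The returned embedding pair can be fed to the same symmetric contrastive loss sketched earlier, so only the adapter and the vision tower receive gradient updates.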

The LLM2CLIP method overcomes the difficulties of using an LLM as the CLIP text encoder by introducing caption contrastive fine-tuning, which substantially improves the LLM's ability to discriminate between captions. Experiments show that this approach performs exceptionally well on tasks such as image-text matching, even surpassing existing state-of-the-art models. By combining the enhanced LLM with the pretrained CLIP visual encoder, the LLM2CLIP framework yields a powerful cross-modal model that remains computationally efficient, with only a minimal increase in cost.
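The idea behind caption contrastive fine-tuning can be illustrated with a small sketch: embeddings of captions that describe the same image are pulled together, while captions of other images in the batch act as in-batch negatives. The pairing scheme and temperature below are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """Contrastive objective over caption embeddings.

    anchor_emb[i] and positive_emb[i] are embeddings of two captions of the
    same image (the positive pair); all other rows serve as negatives.
    """
    anchor_emb = F.normalize(anchor_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)

    logits = anchor_emb @ positive_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```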

Throughout the experiments, the researchers fine-tuned the model on datasets of varying sizes to improve image-text matching performance. The results show that models trained with LLM2CLIP outperform the standard CLIP and EVA models on tasks such as image-to-text and text-to-image retrieval, highlighting the advantages of combining large language models with image-text models.
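Retrieval results of this kind are typically reported as Recall@K over precomputed embeddings; the following minimal sketch shows one common way to compute it for image-to-text retrieval, with illustrative variable names.

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_emb, text_emb, k=1):
    """Fraction of images whose matching caption (same index) appears among the top-k retrieved texts."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    sims = image_emb @ text_emb.t()                    # (num_images, num_texts) cosine similarities
    topk = sims.topk(k, dim=-1).indices                # indices of the k most similar captions per image
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```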

It is noteworthy that LLM2CLIP not only improves performance on both long- and short-text retrieval tasks, but also turns the CLIP model, originally trained only on English data, into a state-of-the-art multilingual model. When used in multimodal training with models such as LLaVA 1.5, LLM2CLIP outperforms CLIP on nearly all benchmarks, demonstrating substantial overall performance gains.

The researchers plan to train LLM2CLIP from scratch on larger datasets to push performance further. This work not only offers new insights for CLIP training but also lays a solid foundation for the broader application of cross-modal representations, signaling that a new chapter in cross-modal technology is about to begin.