HyperLLaVA: Enhancing Multimodal Language Models with Dynamic Visual and Linguistic Experts

2024-03-27

Large language models (LLMs) have demonstrated remarkable versatility in language-centric applications. To handle a wider range of input modalities, however, multimodal large language models (MLLMs) have attracted increasing attention. These models are central to building flexible assistants that can understand and process information in various forms, such as text, images, video, and audio.

Popular MLLMs such as LLaVA typically follow a two-stage training paradigm: first, vision-language alignment, where a static projector maps visual features into the word-embedding space of the language model so that the LLM can understand visual content; then, multimodal instruction tuning, where the LLM is fine-tuned to respond to diverse user requests that involve visual content.
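For intuition, here is a minimal PyTorch sketch (not the authors' code) of what such a static projector looks like: a small MLP that maps visual-encoder features into the LLM's word-embedding space. The dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StaticProjector(nn.Module):
    """LLaVA-style static projector sketch: a small MLP that maps visual
    encoder features into the LLM's word-embedding space.
    Dimensions are illustrative assumptions, not the paper's values."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        return self.mlp(visual_feats)  # (batch, num_patches, llm_dim)
```

Once trained, this projector's parameters are the same for every input, which is exactly the property HyperLLaVA revisits.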

Although both stages are crucial, the structure of the projector and the tuning strategy for the LLM remain under-explored. Existing work mostly focuses on scaling up pre-training data, instruction-following data, visual encoders, or language models. However, a model that relies only on static, input-independent parameters may be limited in how well it handles diverse multimodal tasks.

To overcome this limitation, the researchers propose HyperLLaVA, a dynamic version of LLaVA. Drawing inspiration from the expert-module design of HyperNetworks, it generates parameters conditioned on the input, allowing the model to adaptively adjust the projector and the LLM layers and strengthening its reasoning across diverse multimodal tasks.
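The idea borrowed from HyperNetworks is that a small auxiliary network predicts the parameters of another layer from a context embedding. A minimal, hypothetical sketch of that mechanism (names and sizes are assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Minimal hypernetwork sketch: predicts the weight and bias of a
    linear layer from a per-sample context embedding. Illustrative only."""

    def __init__(self, context_dim: int, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, in_dim * out_dim + out_dim),
        )

    def forward(self, context: torch.Tensor):
        # context: (batch, context_dim) -> per-sample weight and bias
        params = self.net(context)
        weight = params[:, : self.in_dim * self.out_dim].reshape(-1, self.out_dim, self.in_dim)
        bias = params[:, self.in_dim * self.out_dim :]
        return weight, bias

def dynamic_linear(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq, in_dim); apply each sample's generated linear layer
    return torch.einsum("bsi,boi->bso", x, weight) + bias.unsqueeze(1)
```

Because the generated parameter count grows with in_dim × out_dim, practical designs keep the dynamically generated component lightweight, which is the role of the visual and language experts described below.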

The training of HyperLLaVA consists of two key steps:

First, in the vision-language alignment stage, the projector is decomposed into a static layer (the original MLP in LLaVA) and a dynamic layer (the visual expert). The static layer's parameters stay fixed, while the dynamic layer's parameters are generated on the fly from the visual input. Assisted by a HyperNetwork, the visual expert specializes the static projector for each visual input, modeling features adaptively so that the projector delivers adaptive visual cues into the language semantic space.
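A minimal sketch of how such a visual expert could wrap the static projector, assuming pooled visual features as the conditioning signal and a low-rank dynamic adjustment for efficiency (both are assumptions made to keep the example small; the paper describes its own parameterization):

```python
import torch
import torch.nn as nn

class VisualExpert(nn.Module):
    """Sketch of a stage-1 visual expert: a hypernetwork that, given pooled
    visual features, generates a per-sample low-rank adjustment applied on
    top of the frozen static projector's output. The pooling and low-rank
    form are assumptions, not HyperLLaVA's exact design."""

    def __init__(self, vision_dim=1024, llm_dim=4096, rank=8, hidden=256):
        super().__init__()
        self.llm_dim, self.rank = llm_dim, rank
        self.hyper = nn.Sequential(
            nn.Linear(vision_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * llm_dim * rank),
        )

    def forward(self, visual_feats: torch.Tensor, static_out: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, patches, vision_dim)
        # static_out:   (batch, patches, llm_dim), from the frozen static MLP
        context = visual_feats.mean(dim=1)            # pooled visual guidance
        a, b = self.hyper(context).chunk(2, dim=-1)
        a = a.reshape(-1, self.llm_dim, self.rank)    # (batch, llm_dim, rank)
        b = b.reshape(-1, self.rank, self.llm_dim)    # (batch, rank, llm_dim)
        # Per-sample low-rank correction added to the static projection.
        delta = torch.einsum("bpd,bdr,brk->bpk", static_out, a, b)
        return static_out + delta
```

The key point of the design is that the static path preserves what ordinary LLaVA alignment learns, while the generated low-rank path adapts the projection to each visual input at little extra cost.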

Second, in the multimodal instruction-tuning stage, the LLM is equipped with a language expert that models dynamic parameters for the LLM blocks. The LLM's intermediate outputs serve as language guidance, steering the language expert to better understand user requests and provide targeted responses. By generating sample-specific parameters, the MLLM gains flexibility, exploiting similarities across datasets while avoiding potential interference among samples within the same dataset. The language expert thus acts as a parameter-efficient fine-tuning method for MLLMs that matches the performance of the original LLaVA while improving the model's ability to handle diverse multimodal tasks.
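Below is a hedged sketch of how a language expert could wrap a single LLM block, assuming the block's incoming hidden states are pooled into language guidance and the generated parameters take a low-rank form; the actual placement and parameterization in HyperLLaVA may differ.

```python
import torch
import torch.nn as nn

class LanguageExpertBlock(nn.Module):
    """Sketch of a stage-2 wrapper: a frozen transformer block plus a
    language expert that turns pooled intermediate hidden states (the
    language guidance) into a per-sample, low-rank adjustment of the
    block's output. Wrapping and low-rank form are assumptions."""

    def __init__(self, block: nn.Module, hidden_dim=4096, rank=8, hyper_hidden=256):
        super().__init__()
        self.block = block                    # original (frozen) LLM block,
                                              # assumed to map tensor -> tensor
        self.hidden_dim, self.rank = hidden_dim, rank
        self.hyper = nn.Sequential(
            nn.Linear(hidden_dim, hyper_hidden),
            nn.ReLU(),
            nn.Linear(hyper_hidden, 2 * hidden_dim * rank),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden_dim)
        out = self.block(hidden_states)
        guidance = hidden_states.mean(dim=1)            # pooled language guidance
        a, b = self.hyper(guidance).chunk(2, dim=-1)
        a = a.reshape(-1, self.hidden_dim, self.rank)
        b = b.reshape(-1, self.rank, self.hidden_dim)
        delta = torch.einsum("bsd,bdr,brk->bsk", out, a, b)
        return out + delta                              # instruction-adaptive output
```

Only the small hypernetwork is trained here, which is why this kind of language expert can serve as a parameter-efficient alternative to updating the full LLM.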

In experiments, the researchers evaluated HyperLLaVA on twelve benchmarks: five VQA datasets (VQA-v2, GQA, VizWiz, SQA-I, and VQA-T) and seven benchmark toolkits (POPE, MME, MMB, MMB-CN, SEED, LLaVA-W, and MM-Vet). The results show that HyperLLaVA outperforms existing state-of-the-art methods on almost all of these multimodal benchmarks, including larger MLLMs with billions of trainable parameters. With the carefully designed lightweight visual and language experts, the static projector and LLM gain stronger capabilities across different multimodal tasks, surpassing the original LLaVA on 11 of the 12 benchmarks.

In summary, HyperLLaVA's dynamic tuning strategy opens new avenues for multimodal learning systems. By adaptively adjusting the projector and LLM parameters with dynamic visual and language experts, the researchers offer a parameter-efficient approach that surpasses existing benchmark performance. This perspective of personalized, dynamic adjustment points toward better performance on multimodal tasks and may unlock new pathways for seamlessly understanding and integrating multimodal information.