Large language models (LLMs) have been applied successfully across subfields of artificial intelligence (AI), including natural language processing (NLP), natural language generation (NLG), and computer vision. With LLMs, it is possible to build vision-language models that perform complex reasoning about images, answer questions about images, and describe images in natural language. However, it remains unclear whether LLMs can perform localization tasks such as word localization or referring localization.
To address this challenge, a research team from Google Research and the University of California, San Diego introduced PixelLLM, a vision-language model that enables fine-grained localization and vision-language alignment. The approach is inspired by how people naturally behave, in particular how infants describe their visual environment through gestures, pointing, and naming. The team states that the goal is to understand how LLMs can derive spatial understanding and reasoning from visual input.
PixelLLM densely aligns each output word of the language model to a pixel position. To achieve this, a small multi-layer perceptron (MLP) is added on top of the word features and regresses the pixel position of each word. Low-rank adaptation (LoRA) is used for fine-tuning, so the language model's weights can be updated or kept frozen. The model can also receive text or location prompts and produce outputs conditioned on them.
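The per-word regression can be pictured as a lightweight head over the LLM's hidden states. Below is a minimal PyTorch sketch, assuming illustrative layer sizes and a sigmoid to normalize coordinates; the name `LocalizationHead` and all dimensions are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


class LocalizationHead(nn.Module):
    """Small MLP that maps each word's feature vector to a 2-D pixel position.

    Hypothetical sketch: layer sizes and normalization are assumptions,
    not taken from the PixelLLM paper.
    """

    def __init__(self, hidden_dim: int = 4096, mlp_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, 2),  # one (x, y) pair per token
        )

    def forward(self, word_features: torch.Tensor) -> torch.Tensor:
        # word_features: (batch, num_tokens, hidden_dim) hidden states from the LLM
        # returns: (batch, num_tokens, 2) coordinates normalized to [0, 1]
        return torch.sigmoid(self.mlp(word_features))


# Usage: regress a pixel position for every generated token.
features = torch.randn(1, 16, 4096)       # dummy word features
positions = LocalizationHead()(features)  # shape (1, 16, 2)
```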
The architecture includes an image encoder, a prompt encoder, and a prompt feature extractor. The language model is fed the prompt-conditioned image features together with an optional text prompt, and outputs a caption along with a pixel position for each word. Because it accepts different combinations of language and location as input or output, the architecture adapts to a wide range of vision-language tasks, making it highly flexible.
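The overall data flow can be sketched end to end with toy stand-ins for each component. Everything below (the dimensions, a linear layer standing in for the image encoder, a GRU standing in for the LLM decoder) is a hypothetical illustration of the wiring described above, not the actual PixelLLM implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins: real PixelLLM uses a vision backbone, a prompt encoder for
# points/boxes, a prompt feature extractor, and a LoRA-fine-tuned LLM.
DIM = 256

image_encoder = nn.Linear(3 * 32 * 32, DIM)           # stand-in for the image encoder
prompt_encoder = nn.Linear(4, DIM)                    # encodes a box prompt (x1, y1, x2, y2)
feature_extractor = nn.Linear(2 * DIM, DIM)           # fuses image and prompt features
language_model = nn.GRU(DIM, DIM, batch_first=True)   # stand-in for the LLM decoder
localization_head = nn.Sequential(                    # small MLP regressing (x, y) per word
    nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, 2)
)


def forward(image, location_prompt, num_words=8):
    img = image_encoder(image.flatten(1))                     # (B, DIM) image features
    prm = prompt_encoder(location_prompt)                     # (B, DIM) prompt features
    cond = feature_extractor(torch.cat([img, prm], dim=-1))   # prompt-conditioned features
    # The decoder consumes the conditioned features and yields one feature per word;
    # here the feature is simply repeated to form a toy token sequence.
    word_features, _ = language_model(cond.unsqueeze(1).repeat(1, num_words, 1))
    # Each word feature is regressed to a normalized pixel position.
    return torch.sigmoid(localization_head(word_features))    # (B, num_words, 2)


image = torch.randn(2, 3, 32, 32)
box = torch.tensor([[0.1, 0.2, 0.6, 0.8], [0.0, 0.0, 1.0, 1.0]])
print(forward(image, box).shape)  # torch.Size([2, 8, 2])
```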
The team evaluated the model on well-known vision tasks such as dense object captioning, location-conditioned captioning, and referring localization. PixelLLM achieves state-of-the-art results across these benchmarks, including 89.8 P@0.5 on RefCOCO referring localization, 19.9 CIDEr on Visual Genome conditioned captioning, and 17.0 mAP on dense object captioning. An ablation study on RefCOCO shows that the dense per-pixel localization formulation is crucial, improving results by 3.7 points over other localization formulations. PixelLLM thus proves effective at precise vision-language alignment and localization.
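For reference, P@0.5 counts a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The short sketch below illustrates that metric; the helper names are ours, and the paper's exact evaluation protocol may differ.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def precision_at_50(predictions, ground_truths):
    """Fraction of predicted boxes whose IoU with the ground truth is >= 0.5."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)


# Example: one hit and one miss -> P@0.5 = 0.5
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [(12, 12, 52, 52), (60, 60, 90, 90)]
print(precision_at_50(preds, gts))  # 0.5
```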
The team summarizes their main contributions as follows:
- Introducing PixelLLM, a new vision-language model that produces a pixel localization for each word while generating image captions.
- Supporting optional text or location prompts in addition to image input.
- Training the per-word localization on the Localized Narratives dataset.
- Adapting to various vision-language tasks, including segmentation, location-conditioned captioning, referring localization, and dense captioning.
- Demonstrating excellent results on location-conditioned captioning, dense captioning, referring localization, and segmentation.