Google Launches PixelLLM: An Innovative Vision-Language Localization and Generation Model

2023-12-19

Large language models (LLMs) have driven progress across several subfields of artificial intelligence (AI), including natural language processing (NLP), natural language generation (NLG), and computer vision. With LLMs, it is possible to build vision-language models that can perform complex reasoning about images, answer questions about them, and describe them in natural language. However, it remains unclear whether LLMs can handle localization tasks such as word grounding or referring localization.

To address this challenge, a research team from Google Research and the University of California, San Diego introduced PixelLLM, a model that enables fine-grained localization and vision-language alignment. The approach is inspired by how people naturally behave, in particular how infants describe their visual environment through gestures, pointing, and naming. The team states that the goal is to understand how LLMs can derive spatial understanding and reasoning from visual input.

PixelLLM densely aligns each output word of the language model to a pixel location. To achieve this, a small multi-layer perceptron (MLP) is added on top of the word features and regresses the pixel position of each word. Low-rank adaptation (LoRA) is used, so the language model's weights can either be kept frozen or updated. The model can also receive text or location prompts and produce outputs conditioned on them.
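For illustration, the sketch below shows how such a per-word localization head could be attached to the language model's word features. The module name `PixelHead` and the dimensions are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of a per-word localization head (illustrative assumptions,
# not the paper's actual implementation).
import torch
import torch.nn as nn

class PixelHead(nn.Module):
    """Small MLP that regresses a 2D pixel location for each word feature."""
    def __init__(self, d_model: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # (x, y) coordinates
        )

    def forward(self, word_features: torch.Tensor) -> torch.Tensor:
        # word_features: (batch, seq_len, d_model) hidden states from the LLM
        # returns: (batch, seq_len, 2), one normalized pixel coordinate per word
        return self.mlp(word_features).sigmoid()

# Example: attach the head to (possibly frozen) LLM word features.
features = torch.randn(1, 12, 768)   # 12 generated words
coords = PixelHead()(features)       # -> torch.Size([1, 12, 2])
```

Because the head sits on top of the word features, the language model itself can stay frozen (or be adapted cheaply with LoRA) while the MLP learns the word-to-pixel mapping.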

The architecture comprises an image encoder, a prompt encoder, and a prompt feature extractor. The language model receives prompt-conditioned image features together with an optional text prompt, and outputs a caption along with the localization of each word. Because it can accept different combinations of language or location as input or output, the architecture adapts readily to a wide range of vision-language tasks, as sketched below.
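The following sketch shows, under stated assumptions, how these components could be wired together: an image encoder, a location-prompt encoder, a prompt feature extractor, and a language model that emits both caption logits and per-word pixel coordinates. All module choices (`PixelLLMSketch`, the stub encoders, the shapes) are hypothetical stand-ins rather than the paper's architecture.

```python
# Illustrative wiring of the components described above; every module here is
# a simplified stand-in, not the authors' actual design.
import torch
import torch.nn as nn

class PixelLLMSketch(nn.Module):
    def __init__(self, d_model: int = 768, vocab: int = 32000):
        super().__init__()
        self.image_encoder = nn.Linear(1024, d_model)    # stand-in for a ViT backbone
        self.prompt_encoder = nn.Embedding(2, d_model)   # encodes location (box/point) prompts
        self.feature_extractor = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.llm = nn.TransformerEncoder(                # stand-in for the language model
            nn.TransformerEncoderLayer(d_model, 8, batch_first=True), num_layers=2)
        self.word_head = nn.Linear(d_model, vocab)       # caption (next-word) logits
        self.pixel_head = nn.Linear(d_model, 2)          # per-word (x, y) regression

    def forward(self, image_feats, loc_prompt_ids, text_embeds):
        img = self.image_encoder(image_feats)             # (B, N, d) image tokens
        loc = self.prompt_encoder(loc_prompt_ids)         # (B, P, d) location prompt tokens
        cond, _ = self.feature_extractor(loc, img, img)   # prompt-conditioned image features
        h = self.llm(torch.cat([cond, text_embeds], dim=1))  # fuse with optional text prompt
        return self.word_head(h), self.pixel_head(h).sigmoid()

# Example forward pass with dummy inputs.
model = PixelLLMSketch()
logits, coords = model(torch.randn(1, 16, 1024),
                       torch.zeros(1, 4, dtype=torch.long),
                       torch.randn(1, 8, 768))
```

The key design point is that captioning and localization share the same word features, so language and location can flexibly appear on either the input or the output side.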

The team evaluated the model on well-known vision tasks such as dense object captioning, location-conditioned captioning, and referring localization. PixelLLM achieves state-of-the-art results across these benchmarks, including 89.8 P@0.5 on RefCOCO referring localization, 19.9 CIDEr on Visual Genome location-conditioned captioning, and 17.0 mAP on dense object captioning. An ablation study on RefCOCO shows that the dense per-pixel localization formulation is crucial, improving results by 3.7 points over alternative localization formulations. PixelLLM thus proves effective at precise vision-language alignment and localization.

The team summarizes their main contributions as follows:

  • Introducing PixelLLM, a new vision-language model that produces a pixel location for each output word while generating image captions.
  • In addition to image input, the model accepts optional text or location prompts.
  • Word localization is trained on the Localized Narratives dataset.
  • The model adapts to a variety of vision-language tasks, including segmentation, location-conditioned captioning, referring localization, and dense captioning.
  • The model achieves strong results on location-conditioned captioning, dense captioning, referring localization, and segmentation.