Google Unveils General-Purpose Text Embedding Model Gecko

2024-04-03

Google has announced Gecko, a compact and versatile text embedding model that distills rich world knowledge from large language models (LLMs).

Gecko is trained on FRet, a synthetic dataset generated by LLMs, in which the positive and negative passages are also ranked by LLMs.

A text embedding model represents natural language as dense vectors, placing semantically similar texts close to each other in the embedding space. In simple terms, text embedding models act as translators for computers: they take in text and convert it into numbers that computers can work with.

These numerical representations, known as embeddings, capture the semantic content of the words or sentences in the text. Because they let computers operate on natural language, embeddings power a range of downstream tasks, including document retrieval, sentence similarity, classification, and clustering.
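To make this concrete, the sketch below shows how embeddings are typically compared with cosine similarity. The `embed()` function here is a hypothetical stand-in for any embedding model (Gecko included) and returns random vectors purely to keep the example self-contained; with a real model, the relevant document would score clearly higher than the unrelated one.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for semantically similar texts."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# embed() is a placeholder for a text embedding model: it maps a string to a
# fixed-size dense vector, e.g. 768 floats. Random vectors are used here only
# so the snippet runs on its own.
rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    return rng.standard_normal(768)

query = "how to reset a password"
doc_relevant = "Steps for changing the password on your account."
doc_unrelated = "Spring weather in Paris is usually mild."

# With a real embedding model, the first score would be noticeably higher.
print(cosine_similarity(embed(query), embed(doc_relevant)))
print(cosine_similarity(embed(query), embed(doc_unrelated)))
```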

Rather than building a separate embedding model for each downstream task, the trend is to build a single model that supports many tasks. Such a general-purpose text embedding model, however, requires a large amount of training data to cover the necessary domains and skills, which is why Google turns to LLMs in this research.

"LLMs contain rich knowledge across various domains and are excellent few-shot learners." Google's approach utilizes insights from knowledge distillation to create Gecko, an embedding model driven by LLMs, which consists of two steps.

"Our two-step distillation process starts with generating diverse synthetic paired data from LLMs. Next, we further refine the data quality by retrieving a set of candidate paragraphs for each query and re-labeling positive and hard-to-distinguish negative paragraphs using the same LLM."

Essentially, the research team starts with a large corpus of unlabeled passages and uses a few-shot-prompted LLM to generate a relevant task and query for each passage. They then embed the concatenated task and query with a pre-trained embedding model to retrieve the nearest-neighbor passages, re-rank those passages with the LLM, and derive positive and hard negative passages from the LLM scores. This approach is what gives Gecko its strong retrieval performance.
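The second step could be sketched as follows, assuming the corpus embeddings are unit-normalized and that `llm_score(query, passage)` wraps the LLM's relevance judgment; the function name and signature are illustrative, not taken from the paper.

```python
import numpy as np

def mine_positive_and_negative(query_text: str, query_vec: np.ndarray,
                               corpus: list[str], corpus_vecs: np.ndarray,
                               llm_score, k: int = 20) -> tuple[str, str]:
    """Retrieve the k nearest passages for a generated query, re-rank them with
    an LLM relevance score, and pick a positive and a hard negative.

    corpus_vecs holds unit-normalized embeddings from a pre-trained embedding
    model, so a dot product equals cosine similarity; llm_score(query, passage)
    stands in for the LLM-based relevance judgment.
    """
    sims = corpus_vecs @ query_vec              # similarity to every passage
    candidates = np.argsort(-sims)[:k]          # indices of the top-k neighbours

    # LLM re-ranking: the best-scored passage becomes the positive (it may
    # differ from the passage the query was generated from), while a
    # low-scored neighbour is kept as a hard negative.
    ranked = sorted(candidates,
                    key=lambda i: llm_score(query_text, corpus[i]),
                    reverse=True)
    return corpus[ranked[0]], corpus[ranked[-1]]
```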

The research shows that training Gecko on FRet alone, with its LLM-ranked positives and negatives, brings significant improvements and sets a strong baseline for zero-shot embedding models on the Massive Text Embedding Benchmark (MTEB).

"By combining this LLM-generated and LLM-ranked data with manually annotated data, our model Gecko-1B (with 768-dimensional embeddings) performs best among models with compatible embedding dimensions and model sizes in the popular MTEB benchmark test. It achieves an average score of 66.31, competing with models seven times larger and embeddings five times higher in dimension." mentioned in the research.