Google has announced Gecko, a compact and versatile text embedding model that distills rich world knowledge from large language models (LLMs).
Gecko is trained on FRet, a synthetic dataset generated by LLMs in which positive and negative passages are ranked by the LLMs themselves.
A text embedding model represents natural language as dense vectors, placing semantically similar texts close to each other in the embedding space. In simple terms, text embedding models act as translators for computers: they take in text and convert it into numbers that computers can work with.
These numerical representations, known as embeddings, capture the semantic information of the words or sentences in the text. Because they let computers process natural language numerically, embeddings are used for a wide range of downstream tasks, including document retrieval, sentence similarity, classification, and clustering.
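As a rough illustration of this idea (not of Gecko itself), the sketch below embeds a few sentences with an off-the-shelf open-source model and compares them by cosine similarity. The library and model name are illustrative assumptions, not part of the research.

```python
# Minimal sketch: turning text into embeddings and comparing them.
# Assumes the sentence-transformers package and the "all-MiniLM-L6-v2"
# checkpoint, which stand in here for any text embedding model (not Gecko).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "Steps to recover account access",
    "Best hiking trails near Zurich",
]

# Each sentence becomes a dense vector; similar meanings end up close together.
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity of normalized vectors reduces to a dot product.
similarity = embeddings @ embeddings.T
print(np.round(similarity, 2))
# The first two sentences should score noticeably higher with each other
# than either does with the third.
```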
Instead of building a separate embedding model for each downstream task, the trend is to create a single model that can support many tasks. However, such a universal text embedding model requires a large amount of training data to cover the required domains and skills comprehensively, which is why Google leverages LLMs in this research.
"LLMs contain rich knowledge across various domains and are excellent few-shot learners." Google's approach utilizes insights from knowledge distillation to create Gecko, an embedding model driven by LLMs, which consists of two steps.
"Our two-step distillation process starts with generating diverse synthetic paired data from LLMs. Next, we further refine the data quality by retrieving a set of candidate paragraphs for each query and re-labeling positive and hard-to-distinguish negative paragraphs using the same LLM."
Essentially, the research team starts with a large corpus of unlabeled passages and uses a few-shot-prompted LLM to generate a relevant task and query for each passage. They then embed the concatenated task and query with a pre-trained embedding model to retrieve the nearest-neighbor passages, re-rank those candidates with the LLM, and derive positive and hard negative passages from the LLM scores. This mining approach is what gives Gecko its strong retrieval performance, as the sketch below illustrates.
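The following is a hypothetical rendering of that mining loop, not the published Gecko code. The helpers `embed`, `generate_task_and_query`, and `score_relevance` are placeholders for the pre-trained embedding model and the LLM calls, and the retrieval step uses a simple dot-product search for illustration.

```python
# Hypothetical sketch of an FRet-style mining loop, under the assumptions
# stated above. The callables passed in are placeholders, not real APIs.
import numpy as np

def mine_training_triples(passages, embed, generate_task_and_query,
                          score_relevance, k=30):
    """Turn unlabeled passages into (query, positive, hard negative) triples."""
    passage_vecs = np.stack([embed(p) for p in passages])  # pre-embed the corpus

    triples = []
    for seed_passage in passages:
        # Step 1: a few-shot-prompted LLM invents a task description and a
        # query that the seed passage could answer.
        task, query = generate_task_and_query(seed_passage)

        # Step 2: embed the concatenated task and query, then retrieve the
        # k nearest-neighbor passages from the corpus.
        query_vec = embed(f"{task} {query}")
        scores = passage_vecs @ query_vec
        neighbor_ids = np.argsort(-scores)[:k]
        candidates = [passages[i] for i in neighbor_ids]

        # Step 3: the same LLM scores each candidate for the query; the
        # top-scored passage becomes the positive (it may differ from the
        # seed), and a low-scored retrieved passage serves as a hard negative.
        llm_scores = np.asarray([score_relevance(query, c) for c in candidates])
        order = np.argsort(-llm_scores)
        positive = candidates[order[0]]
        hard_negative = candidates[order[-1]]

        triples.append((query, positive, hard_negative))
    return triples
```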
The research shows that training Gecko on FRet, the synthetic dataset of LLM-ranked positive and negative passages, brings significant improvements and sets a strong baseline for zero-shot embedding models on the Massive Text Embedding Benchmark (MTEB).
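For readers who want to run that kind of comparison with their own models, the sketch below shows roughly how an embedding model can be scored on MTEB tasks, assuming the open-source `mteb` package's classic interface. The model plugged in here is an open-source stand-in rather than Gecko, and the task list and output folder are arbitrary choices.

```python
# Rough sketch of benchmarking an embedding model on MTEB tasks, assuming the
# `mteb` and `sentence-transformers` packages. The model is a stand-in, not Gecko.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Any model exposing an `encode(list_of_texts) -> vectors` method works here.
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```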
"By combining this LLM-generated and LLM-ranked data with manually annotated data, our model Gecko-1B (with 768-dimensional embeddings) performs best among models with compatible embedding dimensions and model sizes in the popular MTEB benchmark test. It achieves an average score of 66.31, competing with models seven times larger and embeddings five times higher in dimension." mentioned in the research.