Ideally, you should be able to create custom embedding models for your applications. However, training embedding models is difficult and costly, which is why developers often fall back on general-purpose pre-trained models.
Microsoft researchers have recently proposed a technique that significantly reduces the cost and complexity of training custom embedding models. The technique uses an open-source large language model (LLM) instead of a BERT-style encoder to reduce the required training steps. It also uses a proprietary LLM to automatically generate labeled training data. Such research can open up new LLM applications, allowing organizations to create embedding models customized for their needs.
Challenges in Training Embedding Models
Embedding models create numerical representations that capture the main features of their input data. For example, word embeddings capture the semantic meaning of words, sentence embeddings capture the meaning and relationships of the words in a sentence, and image embeddings represent the visual features of images. Embeddings are useful for many tasks, such as measuring the similarity of two words, sentences, or documents.
One important application of embeddings is Retrieval-Augmented Generation (RAG), used in combination with LLMs. In RAG, embeddings help find and retrieve documents that are relevant to a user's prompt. The content of the retrieved documents is inserted into the prompt, guiding the LLM to generate a response based on those documents. RAG helps LLMs avoid hallucinations and tackle tasks that involve information beyond their training data.
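As a simple illustration, the sketch below compares two sentences with an off-the-shelf embedding model. It uses the open-source sentence-transformers library and the all-MiniLM-L6-v2 model purely as stand-ins; they are not the models discussed in this article.

from sentence_transformers import SentenceTransformer, util

# Load a small general-purpose embedding model (stand-in choice for illustration)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline was resting on the rug.",
]

# Encode both sentences into dense vectors
embeddings = model.encode(sentences)

# A cosine similarity close to 1.0 means the sentences are semantically similar
score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))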
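A bare-bones version of the retrieval step might look like the following sketch. It again uses sentence-transformers as a stand-in embedding model, and the documents and prompt template are invented for illustration.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The warranty covers manufacturing defects for two years.",
    "Shipping is free for orders above $50.",
]
doc_embeddings = model.encode(documents)

query = "How long do I have to return a product?"
query_embedding = model.encode(query)

# Rank documents by cosine similarity to the query and keep the best match
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_doc = documents[int(scores.argmax())]

# Insert the retrieved document into the prompt that is sent to the LLM
prompt = (
    "Answer the question using the context below.\n\n"
    f"Context: {best_doc}\n\n"
    f"Question: {query}"
)
print(prompt)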
The quality of RAG depends heavily on the quality of the embedding model. If the embeddings fail to capture the right features of the documents and match them to user prompts, the RAG pipeline will not be able to retrieve relevant documents.
Training embedding models on custom data is one way to improve their quality for specific applications. However, popular embedding models currently use a multi-stage training process. First, the model is trained with contrastive learning on a large-scale dataset of weakly supervised text pairs. Then, it is fine-tuned on a small but high-quality dataset of carefully labeled examples.
The problem with this approach is that it requires considerable engineering effort to curate relevant text pairs. It also relies on manually collected datasets that often cover only a few tasks and languages. That's why, in most cases, developers use general-purpose embedding models that may not be well suited to their applications.
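The first stage typically optimizes a contrastive objective over text pairs. The sketch below shows the general idea with in-batch negatives, written as a simplified InfoNCE-style loss in PyTorch; it is not the exact objective or code of any particular model.

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: (batch_size, dim) embeddings of paired texts
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)

    # Similarity of every query against every document in the batch;
    # the matching document is the positive, the rest act as in-batch negatives
    logits = query_emb @ doc_emb.T / temperature
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Random embeddings standing in for model outputs
q = torch.randn(8, 768)
d = torch.randn(8, 768)
print(contrastive_loss(q, d))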
LLM as an Embedding Model
The new technique proposed by Microsoft trains the embedding model in a single stage instead of the two-stage approach used by other models. To do this, the researchers rely on proprietary LLMs such as GPT-4 to generate synthetic data for a diverse range of embedding tasks.
First, they prompt GPT-4 to brainstorm a small pool of candidate tasks. Then, they feed the candidate tasks back into the model and guide it to generate training examples for each one.
I tested a simplified prompt pipeline on ForeFront, a platform that allows you to create no-code workflows on top of GPT-4 and Claude. In the first stage, I obtained a list of abstract task descriptions:
[
"Retrieve case law that supports or opposes a specific legal argument.",
"Find recipes based on a list of ingredients provided in the query.",
"Search for news articles related to a particular event or phenomenon.",
"Locate scholarly articles that discuss the efficacy of a certain medical treatment.",
"Identify historical documents that mention a specific figure or event.",
"Gather financial reports related to a particular industry or company.",
… (redacted)
]
In the next stage, I provided an example task to GPT-4 and asked it to generate a JSON structure containing a specific user query along with 150-word positive and hard negative documents. The results were almost perfect, except for the last sentence of the hard negative example, which spells out why the document does not match the query and could undermine the training signal:
{
"user_query": "Looking for a vegetarian chili recipe that uses black beans, quinoa, and sweet potatoes",
"positive_document": "This hearty vegetarian chili is the perfect way to warm up on a chilly day. Begin by sautéing one diced onion and three cloves of minced garlic in olive oil until translucent. Add one diced bell pepper and two diced sweet potatoes and cook until slightly soft. Stir in two cups of vegetable broth, one can of black beans (rinsed and drained), one cup of cooked quinoa, one can of diced tomatoes, and one can of tomato paste. Season with two tablespoons of chili powder, one teaspoon of cumin, and salt and pepper to taste. Let the chili simmer for 30 minutes, stirring occasionally. Serve hot, garnished with avocado, cilantro, and a squeeze of lime juice.",
"hard_negative_document": "This delicious sweet potato and black bean burrito bowl is a great vegetarian meal. Start by roasting two diced sweet potatoes in the oven with olive oil, salt, and pepper until golden and tender. Meanwhile, cook one cup of quinoa as per package instructions. In a pan, cook one can of black beans with one diced onion, two cloves of garlic, and one tablespoon of cumin until heated through. Assemble your bowls by placing a scoop of quinoa at the bottom, followed by the roasted sweet potatoes, and the black bean mixture. Top with sliced avocado, fresh cilantro, and a dollop of sour cream. While this recipe shares ingredients with the user's query, it is not a chili recipe."
}
The researchers have not released any source code or data for their experiments, but you can see a highly simplified version of the pipeline in this Python notebook I created. Naturally, this is a very flexible process, and you can easily customize the prompt templates to your needs.
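For reference, the core of such a two-step pipeline can be sketched in a few lines of Python. The prompts below are simplified paraphrases written for illustration, not the templates from the paper, and the code assumes the official openai client with an OPENAI_API_KEY environment variable.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(prompt):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Stage 1: brainstorm a pool of candidate retrieval tasks
tasks = chat(
    "Brainstorm a list of 10 text retrieval tasks. "
    "Return them as a JSON array of short task descriptions."
)

# Stage 2: for one task, generate a (query, positive, hard negative) example.
# This assumes the model returns plain JSON; a real pipeline should validate the output.
task = json.loads(tasks)[0]
example = chat(
    f"Task: {task}\n"
    "Generate a JSON object with the keys user_query, positive_document, "
    "and hard_negative_document. Each document should be about 150 words."
)
print(example)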
To increase the diversity of the dataset, the researchers designed multiple prompt templates and combined them. Overall, they generated 500,000 examples, including 150,000 unique instructions, using GPT-3.5 and GPT-4 through the Azure OpenAI Service. Their total token consumption was approximately 180 million, which could cost around $5,000.
Interestingly, the researchers used their training data to fine-tune an open-source autoregressive model rather than a bidirectional encoder like BERT, which is the conventional choice for embedding models. The premise is that since these models have already been pretrained on very large datasets, they can be adapted to embedding tasks at very low cost.
They tested their approach by fine-tuning the model on the synthetic data along with 13 public datasets. Using techniques like LoRA (low-rank adaptation), they reduced the training costs. The resulting model achieved state-of-the-art results on popular embedding benchmarks, surpassing OpenAI's Ada-002 and Cohere's embedding models on RAG and embedding quality benchmarks.
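In practice, this means using a hidden state of the fine-tuned decoder-only model as the text embedding. Here is a minimal sketch with the Hugging Face transformers library, using GPT-2 as a small stand-in for a model like Mistral-7B; the last-token pooling choice is an illustration rather than the paper's exact recipe.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # small stand-in for a decoder-only LLM such as Mistral-7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "Find recipes based on a list of ingredients provided in the query."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the hidden state of the last token as the text embedding
embedding = outputs.last_hidden_state[0, -1]
print(embedding.shape)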
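For a sense of what LoRA looks like in code, here is a minimal sketch with the peft library; the hyperparameters and target modules are placeholder values for a small GPT-2 stand-in, not the settings from the paper.

from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base_model = AutoModel.from_pretrained("gpt2")  # stand-in for a larger LLM

# Attach small trainable low-rank matrices instead of updating all weights
lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layers in GPT-2
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable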
LLMs and Embeddings
The main finding of the paper is that training autoregressive models like Mistral-7B on embedding tasks does not require an expensive contrastive pretraining stage.
"Extensive autoregressive pretraining allows LLMs to obtain good text representations that can be transformed into effective embedding models with minimal fine-tuning," they wrote.
Their findings also suggest that LLMs should be able to generate suitable training data for fine-tuning embedding models at very low cost. This has important implications for future LLM applications, allowing organizations to create custom embedding models for their own use cases.
"We believe that generative language modeling and text embedding are two sides of the same coin, with both tasks requiring models to have a deep understanding of natural language," the researchers wrote. "Given a task definition for embedding, a truly robust LLM should be able to generate its own training data and then transform into an embedding model through lightweight fine-tuning. Our experiments reveal the potential of this direction and call for further research to fully explore it."