Large language models (LLMs) such as GPT-4 and Claude can learn new tasks through well-designed prompts. However, longer prompts raise the cost of using these models and also slow down inference.
LLMLingua, a new technique developed by Microsoft, compresses prompts by eliminating irrelevant parts. Notably, LLMLingua can compress prompts by up to 20 times with little loss in the quality of the model's response. Used properly, it can cut the cost of working with advanced LLMs and make them accessible to more users and applications.
The Cost of Prompt Engineering
Prompt engineering is the cornerstone of building practical applications with LLMs. Techniques such as chain-of-thought prompting, in-context learning, and incorporating relevant documents or conversation history are crucial for improving a model's performance on specific tasks. However, these methods often require longer prompts, sometimes reaching thousands of tokens. This significantly increases the cost of using advanced models, especially expensive LLMs like GPT-4.
There are several approaches to optimizing models and reducing costs. One line of research exploits the inherent redundancy of natural language to compress prompts; another reduces the number of tokens needed at inference time by learning special tokens through prompt tuning.
However, these methods are often task-specific and may require fine-tuning the entire model, limiting their usage and making them incompatible with API-based models like ChatGPT.
Other techniques use LLMs themselves to summarize conversations and build compressed representations of memory and knowledge. However, these methods typically require multiple calls to costly LLMs.
A notable approach is Selective Context, which uses a smaller language model to evaluate the information content of text snippets and discards less informative content to compress the prompt. Microsoft's latest technique builds upon this approach and enhances it.
LLMLingua
LLMLingua is an innovative technique that compresses prompts in a coarse-to-fine manner. The approach consists of several components.
The first component is the "budget controller," which dynamically assigns different compression ratios to different elements of the original prompt, such as instructions, examples, and questions. The underlying principle here is that instructions and questions usually have a more direct impact on the generated results as they contain the basic knowledge required for the LLM to produce answers. Conversely, when a prompt contains multiple examples, the information may be redundant. Therefore, the budget controller allocates a larger budget to instructions and questions, meaning a smaller compression ratio, and a smaller budget to examples.
LLMLingua uses a smaller language model, such as GPT-2 or LLaMA, to manage this allocation. The small model computes the perplexity of each example, which serves as a measure of how informative the text is: tokens the model finds hard to predict carry more information. LLMLingua prioritizes the examples with the highest perplexity and adds them to the prompt until the token budget for examples is exhausted. The remaining budget is used to refine the instructions and questions.
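To make this concrete, here is a minimal sketch of how such perplexity-based example selection could look in practice, using GPT-2 via Hugging Face Transformers as the small model. It is an illustrative approximation rather than the LLMLingua implementation: the budget handling is simplified, and the helper names and budget value are assumptions.

```python
# Illustrative sketch of perplexity-based example selection under a token
# budget (not the exact LLMLingua budget controller). Assumes GPT-2 as the
# small scoring model; the budget value and helper names are made up.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small model (higher = harder to predict)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return torch.exp(loss).item()

def select_examples(examples: list[str], token_budget: int) -> list[str]:
    """Keep the highest-perplexity examples until the token budget is exhausted."""
    kept, used = [], 0
    for ex in sorted(examples, key=perplexity, reverse=True):
        n_tokens = len(tokenizer(ex).input_ids)
        if used + n_tokens > token_budget:
            break
        kept.append(ex)
        used += n_tokens
    return kept
```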
The second component of LLMLingua is the Iterative Token-level Prompt Compression (ITPC) algorithm, which enables finer-grained compression. ITPC first splits the prompt into segments, then uses the small model to compute a perplexity distribution over the tokens in each segment. It builds the compressed prompt by retaining high-perplexity tokens, processing the segments iteratively so that conditional dependencies between tokens are taken into account and key information is preserved.
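The sketch below illustrates the core idea in simplified form: the prompt is processed segment by segment, each segment's tokens are scored by the small model conditioned on the text kept so far, and only tokens that are hard to predict (high negative log-likelihood) survive. The segmentation, threshold, and scoring details here are assumptions for illustration; the actual ITPC algorithm is more sophisticated.

```python
# Simplified sketch of iterative token-level compression (not the exact ITPC
# algorithm). Each segment is scored conditioned on the tokens kept so far,
# and only "surprising" tokens (high negative log-likelihood) are retained.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def compress_segment(kept_text: str, segment: str, threshold: float = 2.0) -> str:
    """Keep tokens of `segment` whose NLL given `kept_text` exceeds `threshold`."""
    prefix = kept_text if kept_text else tokenizer.bos_token
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    seg_ids = tokenizer(segment, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, seg_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Negative log-likelihood of every token given everything that precedes it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    nll = -log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
    seg_nll = nll[prefix_ids.shape[1] - 1:]  # scores for the segment's tokens only
    keep = [t for t, s in zip(seg_ids[0].tolist(), seg_nll.tolist()) if s > threshold]
    return tokenizer.decode(keep)

def iterative_compress(segments: list[str]) -> str:
    """Process segments in order, conditioning each on the compressed text so far."""
    compressed = ""
    for seg in segments:
        compressed += compress_segment(compressed, seg)
    return compressed
```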
The third component involves instruction-based alignment, which synchronizes the distribution patterns of large and small language models. This process starts with a pre-trained small language model and then fine-tunes it using data generated by a larger LLM. Through instruction-based alignment, the behavior of the small model aligns more closely with that of the large model, enhancing the entire compression process.
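As a rough illustration, alignment of this kind amounts to standard causal-language-model fine-tuning of the small model on text produced by the larger LLM. The sketch below assumes GPT-2 as the small model and a placeholder dataset of instruction/response pairs; the data format and hyperparameters are illustrative assumptions, not those used by the LLMLingua authors.

```python
# Conceptual sketch of instruction-based alignment: fine-tune the small model
# on instruction/response text generated by a larger LLM so that its token
# statistics better match the large model's. Data and hyperparameters are
# placeholders for illustration.
from datasets import Dataset
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical alignment data: instruction/response pairs written by the large LLM.
pairs = [{"text": "Instruction: Summarize the passage.\nResponse: <large-LLM output>"}]

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]  # standard LM objective
    return enc

dataset = Dataset.from_list(pairs).map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aligned-small-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()  # after training, the aligned small model drives compression
```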
Testing LLMLingua
In their experiments, the researchers used GPT-3.5 Turbo and Claude 1.3 as the target LLMs and performed compression with Alpaca-7B or GPT2-Alpaca as the small model. They evaluated LLMLingua on several benchmarks, including GSM8K and BBH for reasoning and in-context learning, and ShareGPT and Arxiv-March23 for conversational context understanding and summarization.
"Our proposed method consistently outperformed previous methods in almost all experiments," the researchers reported.
On the GSM8K and BBH reasoning and in-context learning benchmarks, LLMLingua not only achieved higher scores than the full-shot approach but did so at compression ratios of 5 times and 3 times, respectively.
"This clearly indicates that our compressed prompts effectively preserve the reasoning information contained in the original prompts," the researchers wrote.
On the ShareGPT and Arxiv-March23 context-understanding benchmarks, LLMLingua achieved compression ratios of 9 times and 3.3 times, respectively, indicating that it preserves the semantic integrity of the original prompts during compression. LLMLingua also outperformed other prompt compression methods in both accuracy and compression level, in some cases compressing prompts by as much as 20 times.
Despite involving multiple steps and two models, LLMLingua also achieved speed improvements ranging from 1.7 times to 5.7 times, with minimal computational overhead.
"Our method has substantial practical significance as it not only reduces computational costs but also provides a potential solution for accommodating longer contexts in LLMs," the researchers asserted.
To facilitate wider adoption, Microsoft has released LLMLingua as an easy-to-use open-source library that developers can integrate into their own applications.
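For example, compressing a few-shot prompt can be as simple as the snippet below (installed with pip install llmlingua). The class and method names follow the project's published examples, but parameter names and defaults may differ between library versions, so the call shown here should be treated as a sketch rather than a guaranteed interface.

```python
# Minimal usage sketch of the open-source llmlingua package.
# Exact parameter names and defaults may vary across library versions.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a default small model on first use

result = compressor.compress_prompt(
    context=["<few-shot example 1>", "<few-shot example 2>"],  # demonstrations
    instruction="Answer the question step by step.",
    question="If a train travels 60 km in 45 minutes, what is its average speed in km/h?",
    target_token=200,  # rough token budget for the compressed prompt
)

print(result["compressed_prompt"])  # pass this to GPT-4 or Claude instead of the full prompt
```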