Google LLM Breakthrough: Achieving Unlimited Text Processing Capability

2024-04-16

Google researchers have published a paper claiming that large language models (LLMs) can now handle text of unlimited length. The paper introduces a technique called "Infini-attention" that modifies the model's attention layer to extend its "context window" while keeping memory and compute requirements bounded.

The context window is the number of tokens a model can process at once. When a conversation with ChatGPT exceeds the context window, performance drops significantly, and the model may forget what was said at the beginning of the conversation. Many organizations are now customizing their LLM applications by inserting their own documents and knowledge into prompts, so context length has become a key factor in model performance and competitive advantage.

Experiments by the Google research team show that models using Infini-attention maintain their quality on inputs of more than one million tokens without requiring additional memory. In theory, the same approach extends to even longer inputs.

So, what is Infini-attention? The paper describes it as combining "long-term compressive memory and local causal attention" to model both long-range and short-range contextual dependencies efficiently. Specifically, Infini-attention keeps the classic attention mechanism in the Transformer block and adds a "compressive memory" module to handle longer inputs. When the input exceeds the model's context length, the model stores older attention states in the compressive memory, which holds a fixed number of parameters regardless of input length and so keeps computation efficient. Finally, Infini-attention combines the content retrieved from the compressive memory with the local attention context to produce the output.

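The mechanism described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation, which has not been released. The class and function names are hypothetical, and the details (an ELU+1 feature map, a running normalization vector `z`, and a single sigmoid gate mixing the two outputs) are assumptions borrowed from linear-attention-style memories rather than anything stated in this article. The sketch processes the input one segment at a time: each segment reads from the memory, runs ordinary causal attention locally, then writes its own keys and values into the memory.

```python
import math

def feature_map(x):
    # ELU(x) + 1: keeps features positive (an assumption of this sketch).
    return x + 1.0 if x > 0 else math.exp(x)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def local_causal_attention(Q, K, V):
    # Standard scaled dot-product attention; token t sees tokens 0..t only.
    d = len(Q[0])
    out = []
    for t, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        out.append([sum(w[s] * V[s][j] for s in range(t + 1))
                    for j in range(len(V[0]))])
    return out

class CompressiveMemory:
    """Fixed-size associative memory: a d_k x d_v matrix plus a d_k vector."""

    def __init__(self, d_k, d_v):
        self.M = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def retrieve(self, Q):
        out = []
        for q in Q:
            sq = [feature_map(x) for x in q]
            denom = sum(a * b for a, b in zip(sq, self.z)) + 1e-8
            out.append([sum(sq[i] * self.M[i][j] for i in range(len(sq))) / denom
                        for j in range(len(self.M[0]))])
        return out

    def update(self, K, V):
        # Accumulate key-value outer products; the memory size never changes.
        for k, v in zip(K, V):
            sk = [feature_map(x) for x in k]
            for i, ski in enumerate(sk):
                self.z[i] += ski
                for j, vj in enumerate(v):
                    self.M[i][j] += ski * vj

def infini_attention_segment(Q, K, V, memory, gate=0.0):
    # gate is a scalar that would be learned in a real model (hypothetical here).
    g = 1.0 / (1.0 + math.exp(-gate))
    from_memory = memory.retrieve(Q)          # long-range context
    local = local_causal_attention(Q, K, V)   # short-range context
    memory.update(K, V)                       # store this segment for later
    return [[g * m + (1.0 - g) * l for m, l in zip(mrow, lrow)]
            for mrow, lrow in zip(from_memory, local)]

# Toy usage: two segments of two tokens each, with d_k = d_v = 2.
memory = CompressiveMemory(d_k=2, d_v=2)
segments = [
    ([[0.1, 0.2], [0.0, 0.5]], [[0.3, -0.4], [0.2, 0.1]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[0.3, -0.4], [0.2, 0.1]], [[0.5, 0.5], [-0.1, 0.2]], [[5.0, 6.0], [7.0, 8.0]]),
]
outputs = [infini_attention_segment(Q, K, V, memory) for Q, K, V in segments]
```

However many segments are processed, `memory.M` and `memory.z` keep the same shape. This is the property the article refers to when it says longer inputs need no additional memory, whereas a conventional key-value cache grows with every token.
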
The researchers wrote, "By making subtle but crucial modifications to the Transformer attention layer and combining continued pre-training and fine-tuning, existing LLMs can naturally scale to handle infinitely long contexts."

So, how does Infini-attention perform in practice? The researchers tested Transformers with Infini-attention on benchmarks for long input sequences. In long-context language modeling tasks, Infini-attention achieves lower perplexity (a measure of how well the model predicts the next token, where lower is better) while using 114 times less memory than other long-context Transformer models. In the "passkey retrieval" test, Infini-attention successfully retrieves a random number hidden in a long text of up to one million tokens. It also outperforms other long-context techniques on summarization tasks with inputs of up to 500,000 tokens.

According to the paper, these tests were run on LLMs with 1 billion and 8 billion parameters. However, Google has not released the models or the code, so other researchers cannot yet verify the results. Nevertheless, the reported results are similar to those reported for Gemini, whose context length also reaches millions of tokens.

Long-context LLMs have broad application prospects, and the leading AI labs are competing in this area. For example, Anthropic's Claude 3 supports up to 200,000 tokens, while OpenAI's GPT-4 has a context window of 128,000 tokens.

One significant advantage of LLMs with infinite context is that they make it easier to build customized applications. Today, customizing an LLM for a specific application requires techniques such as fine-tuning or retrieval-augmented generation (RAG). These techniques work, but they often require complex engineering.

In theory, an LLM with infinite context would let you insert all of your documents into the prompt and leave it to the model to pick out the parts most relevant to each query. You could also customize the model by supplying a long series of examples, improving its performance on a specific task without any fine-tuning.

However, this does not mean that infinite context will completely replace other techniques. Rather, it will lower the barrier to application development, allowing developers and organizations to prototype applications quickly without significant engineering effort. Eventually, organizations will still need to optimize their LLM pipelines to reduce costs and improve speed and accuracy.
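
As a rough illustration of the "put everything in the prompt" approach described above, the following sketch assembles a single prompt from a full document set, a list of worked examples, and the user's query. The function name and the prompt layout are hypothetical, chosen only for this example. No particular model or API is assumed, and the resulting string would still have to fit within the target model's context window.

```python
def build_long_context_prompt(documents, examples, query):
    """Concatenate all documents and few-shot examples ahead of the query."""
    parts = []
    # Include every document instead of retrieving a relevant subset (as RAG would).
    for i, doc in enumerate(documents, start=1):
        parts.append(f"[Document {i}]\n{doc}")
    # In-context examples stand in for fine-tuning on the task.
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    # Leave the final answer open for the model to complete.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

prompt = build_long_context_prompt(
    documents=["Refund policy: 30 days.", "Shipping policy: 5 business days."],
    examples=[("How long does shipping take?", "5 business days.")],
    query="How long do I have to request a refund?",
)
```

The trade-off noted in the article still applies: this is quick to prototype, but sending every document with every query costs more per request than a pipeline that selects only the relevant passages.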