Streaming LLM Technology: Expanding the Context Limits of Large Language Models

2023-11-28

Large language models (LLMs) are increasingly asked to handle long text sequences. However, when dealing with lengthy articles, books, or continuous chat conversations, these models quickly run into their context limits.

Extending a model's context to longer sequences is challenging. Current methods for addressing the problem either impose heavy computational and memory demands or sacrifice output quality.

A breakthrough solution is StreamingLLM, developed by a team of researchers from Meta AI, MIT, and Carnegie Mellon University. This innovative technology allows the context of an LLM to be expanded to millions of tokens without requiring significant computational and memory resources, while maintaining high-quality performance. StreamingLLM is set to become a valuable tool for applications that deal with long sequence texts.

LLMs and Context Windows

LLMs are designed with fixed context lengths, determined by their architecture and training procedure. For example, the popular LLM Llama-2 has a context window of 4,096 tokens, equivalent to roughly 3,000 words. As long as the interaction with the model stays within this limit, it maintains its high-quality performance. However, the limited sequence length restricts its broader applications.
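To make the limit concrete, here is a minimal sketch of how a prompt's token count can be checked against a model's context window with the Hugging Face tokenizer; the model name and the 4,096-token figure are assumptions taken from Llama-2's published configuration.

from transformers import AutoTokenizer

CONTEXT_LIMIT = 4096  # Llama-2's context window (check the model card for other models)

# The model name is illustrative; any causal LM tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "Summarize the following book chapter: ..."
n_tokens = len(tokenizer(prompt)["input_ids"])
print(f"{n_tokens} tokens used, {CONTEXT_LIMIT - n_tokens} tokens left for generation")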

One potential solution is to build a model with a longer context length. However, this requires modifying the model's architecture and retraining it, which is costly and impractical for many organizations. Moreover, because self-attention compares every token with every other token, extending the context length increases costs quadratically: doubling the context of an LLM roughly quadruples its memory and computational costs.

Another approach is to use a sliding context window. In this case, if a model has a context of 4,096 tokens, it is always fed only the most recent 4,096 - x tokens, where x is the number of tokens it is expected to generate.
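As a rough sketch of the truncation step (the numbers are placeholders, not part of the researchers' method):

def sliding_window(input_ids, context_limit=4096, max_new_tokens=256):
    # Keep only the most recent tokens so that the prompt plus the tokens
    # still to be generated fit inside the model's fixed context window.
    budget = context_limit - max_new_tokens
    return input_ids[-budget:]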

Although this technique seems intuitive, it has significant drawbacks in practical applications.

Autoregressive LLMs use a mechanism called the "KV cache" to improve efficiency. It stores the key and value vectors computed for previous tokens so they do not have to be recomputed for each new token. Because the attention computed for each new token depends on the keys and values of all preceding tokens, shifting the context window invalidates the cache: the entire KV cache must be recomputed for the new prefix, which significantly reduces the model's throughput.
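The sketch below, using GPT-2 purely as a small stand-in model, illustrates both sides of this: the cache can be reused while the prompt only grows, but once the window slides and the oldest tokens are dropped, the truncated prefix has to be re-encoded from scratch.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Streaming language models", return_tensors="pt").input_ids

with torch.no_grad():
    # First pass: encode the prompt and keep the key/value cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values

    # Incremental step: only the newest token is fed; the cache supplies the rest.
    next_token = out.logits[:, -1:].argmax(-1)
    out = model(next_token, past_key_values=past, use_cache=True)

    # Sliding the window: after evicting the oldest token, the cached keys and
    # values no longer match the new prefix, so everything is recomputed.
    window_ids = input_ids[:, 1:]
    out = model(window_ids, use_cache=True)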

Another option is to keep the cached entries for the tokens shared between the old and new contexts while sliding the window. Although this provides some speedup, it is not without flaws: as soon as the sequence grows beyond the cache and the earliest tokens are evicted, the model's quality rapidly deteriorates.

Attention Sinks

In their paper, the researchers highlight an interesting characteristic of autoregressive LLMs such as Llama-2: a disproportionate share of attention is allocated to the initial tokens, regardless of their relevance to the language modeling task. They refer to these tokens as "attention sinks."

Interestingly, they observed that when the text length exceeds the cache size, the model's perplexity significantly increases, primarily due to the exclusion of these initial tokens. (Perplexity measures the uncertainty of the model in its predictions, with lower values indicating higher accuracy.) This finding suggests that these attention sinks play a crucial role in maintaining LLM stability, regardless of their distance from the predicted tokens.
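Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to the target tokens; a minimal sketch in PyTorch:

import torch
import torch.nn.functional as F

def perplexity(logits, target_ids):
    # Perplexity = exp(mean negative log-likelihood of the target tokens).
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    return torch.exp(nll)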

The reason behind this phenomenon is straightforward. Given the autoregressive nature of language modeling, the initial tokens are visible to almost all subsequent tokens, making them preferred candidates for attention sinks. In contrast, the later tokens are only visible to a limited set of subsequent tokens. Therefore, the initial tokens are more likely to be trained as attention sinks, occupying a disproportionate amount of attention.

Thus, when the initial few tokens are dropped from the context, the attention they would normally absorb gets redistributed across the remaining tokens, destabilizing the attention distribution and causing the model's performance to decline. Preserving these attention sinks is the fundamental premise of StreamingLLM, offering a promising solution to the limitations of current LLMs.

How StreamingLLM Works

StreamingLLM is a framework that allows large language models to process effectively unbounded texts without fine-tuning. It preserves the attention sinks to keep the distribution of attention scores close to normal. When the conversation with an LLM exceeds the model's context length, StreamingLLM retains the KV cache entries of the attention sink tokens (the first four tokens of the sequence) and evicts the oldest non-sink tokens to make room for the most recent ones. This lets the model handle much longer sequences with stable performance, without recomputing the entire KV cache.
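A minimal sketch of that eviction policy, written over a simple list of cache entries rather than the per-layer key/value tensors the real implementation operates on (the window size is a placeholder):

def evict(cache_entries, sink_size=4, window_size=1020):
    # Keep the first `sink_size` entries (the attention sinks) plus the most
    # recent `window_size` entries, and drop everything in between.
    if len(cache_entries) <= sink_size + window_size:
        return cache_entries
    return cache_entries[:sink_size] + cache_entries[-window_size:]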

"Introducing four initial tokens as attention sinks is sufficient to recover the performance of LLM," the researchers wrote. "In contrast, adding only one or two is not enough for a complete recovery. We believe this pattern arises because these models do not include consistent starting tokens in all input samples during pretraining."

In the StreamingLLM framework, the KV cache consists of attention sinks and a rolling KV cache that retains the most recent tokens crucial for language modeling. The researchers emphasize the versatility of StreamingLLM, stating, "The design of StreamingLLM is generic and can be seamlessly integrated into any autoregressive language model that uses relative positional encoding."
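One consequence of this design, described in the paper, is that positional information is assigned according to a token's position inside the cache rather than its position in the original text, which keeps the positions the model sees within the range it was trained on. A toy illustration (the sizes are made up):

def cache_positions(num_sinks, num_recent):
    # Positions are assigned within the cache, not within the full stream, so
    # a token that is millions of tokens into the text still receives a small,
    # in-range position index.
    return list(range(num_sinks + num_recent))

print(cache_positions(4, 8))  # [0, 1, 2, ..., 11]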

The researchers state that LLMs such as Llama-2 (7 to 70 billion parameters), Falcon (7 to 40 billion parameters), and Pythia (2.9 to 12 billion parameters) can reliably model up to 4 million tokens or more using the StreamingLLM framework. The technique addresses the shortcomings of the other methods, providing fast inference, high quality, and low memory requirements.

"StreamingLLM first breaks the connection between the pretraining window size of LLM and its actual text generation length, paving the way for the streaming deployment of LLM," the researchers wrote.

Using Attention Sinks for Pretrained Language Models

The researchers emphasize that one significant reason for models excessively attending to multiple initial tokens is the lack of a dedicated sink token to absorb excessive attention scores. Therefore, the model unintentionally designates globally visible tokens, primarily the initial tokens, as attention sinks.

"A potential remedy could be intentionally including a globally trainable attention sink token, referred to as 'sink token,' which serves as a repository for unnecessary attention scores," they propose.

With this insight, future language models can be pretrained to require only a single attention sink token for streaming deployment. The only prerequisite is to include an extra learnable token at the beginning of all training samples to serve as the attention sink.
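As a hedged sketch of what adding a learnable sink token can look like in practice (GPT-2 and the token name "<sink>" are illustrative stand-ins, not the researchers' actual training setup):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register one extra special token and give it a learnable embedding.
tokenizer.add_special_tokens({"additional_special_tokens": ["<sink>"]})
model.resize_token_embeddings(len(tokenizer))

def preprocess(text):
    # Prepend the sink token to every training sample.
    return tokenizer("<sink>" + text, return_tensors="pt")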

To validate this approach, the researchers trained several language models with 160 million parameters from scratch, including a single attention sink token at the beginning of the training examples. Their experiments showed that adding this single sink token effectively maintained the model's performance in streaming scenarios.

"This contrasts with regular models, which require reintroducing multiple initial tokens as attention sinks to achieve the same performance level," the researchers noted.

Furthermore, they found that including a sink token during pretraining did not negatively impact the model's convergence or subsequent performance on various natural language processing (NLP) benchmarks.

Practical Applications of StreamingLLM

The authors of this research paper have made the code for StreamingLLM publicly available on GitHub. This Python library is compatible with Llama-2, MPT, Falcon, and Pythia models.

In addition, there is an open-source implementation of StreamingLLM that acts as a plug-and-play drop-in for the Hugging Face Transformers library, compatible with a range of models on the Hugging Face Hub.
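As a rough illustration of how such a drop-in replacement is typically used, the sketch below mirrors the standard Transformers loading pattern; the import path and the attention_sink_* keyword arguments are assumptions rather than confirmed API, so the project's README should be treated as authoritative.

# The import path and keyword arguments below are assumptions, not confirmed
# API; check the library's documentation for the exact names.
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attention_sink_size=4,            # number of attention-sink tokens to keep
    attention_sink_window_size=1020,  # size of the rolling window of recent tokens
)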

Hugging Face is closely monitoring the development of StreamingLLM and considering its integration into their Transformers library. This progress is expected to provide enhanced tools for the use of StreamingLLM in various applications, marking a significant advancement in the field of language modeling.