Diff Transformer: A New Architecture for Enhancing Information Retrieval in Large Language Models

2024-10-17

Improving how large language models (LLMs) retrieve information from their context has become a prominent research focus, with direct implications for applications such as Retrieval-Augmented Generation (RAG) and in-context learning (ICL). Recently, researchers from Microsoft Research and Tsinghua University introduced a new LLM architecture called the Diff Transformer, which boosts performance by increasing attention to relevant context while filtering out noise.

The Transformer Architecture and the "Lost in the Middle" Phenomenon

The Transformer architecture underpins modern LLMs, using attention mechanisms to weigh the importance of different parts of the input sequence when generating output. However, studies have shown that Transformers struggle to retrieve crucial information from lengthy texts, a challenge known as the "lost in the middle" phenomenon: LLMs often fail to robustly use information from extended input contexts, and performance drops sharply when the relevant information sits in the middle of a long text.

Researchers have also found that some hallucinations in LLMs, cases where the model produces incorrect output even though the relevant information is present in the context, are linked to flawed attention patterns. The softmax function in the Transformer's attention mechanism assigns a nonzero score to every token, including tokens irrelevant to the task, which dilutes the model's focus on the most important parts of a long text.
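To see why, note that softmax can only down-weight irrelevant tokens, never zero them out. The toy sketch below (NumPy, with made-up logits) illustrates that every token always receives strictly positive attention, so some attention mass inevitably leaks to noise:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical attention logits for one query over five tokens;
# only the second token is actually relevant.
logits = np.array([1.0, 4.0, 0.5, 0.8, 1.2])
print(softmax(logits))  # roughly [0.04 0.85 0.03 0.03 0.05] -- all entries > 0
```

Even the clearly irrelevant tokens keep a small but nonzero share of the attention, and across thousands of tokens this leaked mass adds up.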

Diff Transformer

To address these challenges, researchers developed the Diff Transformer, a new foundational architecture for LLMs. The central concept is the use of a "differential attention" mechanism to eliminate noise and enhance focus on the most relevant parts of the input.

Differential attention splits the query and key vectors into two groups and computes two separate softmax attention maps; the difference between the two maps is then used as the attention scores. Because noise that is common to both maps cancels in the subtraction, the model is pushed to concentrate its attention on task-relevant information.
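A minimal NumPy sketch of the idea is shown below. It assumes a single head, omits the normalization and multi-head machinery of the real architecture, and fixes the weight on the second map (lam) as a constant, whereas the paper treats the corresponding coefficient as learnable; all shapes and weights are illustrative only:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def differential_attention(x, Wq, Wk, Wv, lam=0.5):
    """Single-head differential attention sketch.

    x:      (seq_len, d_model) input token embeddings
    Wq, Wk: (d_model, 2 * d_head) projections, split into two groups
    Wv:     (d_model, d_head) value projection
    lam:    weight on the second attention map (fixed here; learnable in the paper)
    """
    d_head = Wv.shape[1]
    q1, q2 = np.split(x @ Wq, 2, axis=-1)   # two query groups
    k1, k2 = np.split(x @ Wk, 2, axis=-1)   # two key groups
    v = x @ Wv
    scale = 1.0 / np.sqrt(d_head)
    a1 = softmax(q1 @ k1.T * scale)          # first softmax attention map
    a2 = softmax(q2 @ k2.T * scale)          # second softmax attention map
    # Differential attention: subtracting the maps cancels noise common to both.
    return (a1 - lam * a2) @ v

# Tiny usage example with random weights (illustrative only).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 8
x = rng.standard_normal((seq_len, d_model))
out = differential_attention(
    x,
    rng.standard_normal((d_model, 2 * d_head)),
    rng.standard_normal((d_model, 2 * d_head)),
    rng.standard_normal((d_model, d_head)),
)
print(out.shape)  # (8, 8)
```

In the sketch, attention scores that both maps assign to the same tokens cancel in the subtraction, which is the noise-cancellation effect described above.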

Although the Diff Transformer computes two attention maps and a subtraction where the classical Transformer computes a single map, it remains efficient thanks to parallelization and existing attention optimization techniques.

Experimental Evaluation

The Diff Transformer was evaluated across a range of language modeling settings, scaling the model size (from 3 billion to 13 billion parameters), the number of training tokens, and the context length (up to 64,000 tokens).

The experimental results show that the Diff Transformer consistently outperforms the classical Transformer architecture across these benchmarks. For example, a 3-billion-parameter Diff Transformer trained on 1 trillion tokens achieved consistent improvements of several percentage points over a Transformer of the same size.

Further experiments confirmed the scalability of the Diff Transformer. The study also found that the Diff Transformer is particularly effective at leveraging increasingly long contexts and exhibits significant improvements in key information retrieval, hallucination mitigation, and in-context learning.

Future Outlook

Despite the encouraging initial results, there is still room for improvement. The research team is working on scaling the Diff Transformer to larger model sizes and training datasets, with plans to extend it to other modalities, including images, audio, video, and multimodal data.

The researchers have released code for the Diff Transformer, with implementations covering several attention and optimization mechanisms. They believe the architecture will help improve the performance of a wide range of LLM applications.