Google Unveils Titans Architecture to Enhance Long-Sequence Processing in Large Language Models

2025-01-17

Google researchers have introduced a novel neural network architecture named Titans, designed to tackle a significant challenge faced by large language models (LLMs) when processing lengthy sequences: how to enhance their memory capabilities during inference without substantially increasing memory and computational costs.

The Titans architecture combines the attention mechanisms of traditional LLMs with "neural memory" layers, enabling models to handle both short-term and long-term memory tasks efficiently. According to the researchers, LLMs equipped with neural long-term memory can process millions of tokens more effectively than classic LLMs and alternatives such as Mamba, while using fewer parameters.

Traditional LLMs are typically built on the Transformer architecture, using self-attention mechanisms to compute relationships between tokens, which is an effective technique for learning complex patterns in token sequences. However, as sequence lengths increase, the computational and storage costs for attention grow quadratically.
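To see where the quadratic cost comes from, consider a minimal sketch of scaled dot-product self-attention (illustrative PyTorch, not Titans code): the score matrix has one entry per pair of tokens, so doubling the sequence length quadruples its size.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (n, d) token embeddings; single-head projections for illustration
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(q.size(-1))   # (n, n): grows quadratically with n
    return torch.softmax(scores, dim=-1) @ v   # (n, d)

n, d = 1024, 64
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
```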

To address this issue, alternative architectures with linear complexity have recently been proposed, allowing models to scale without a dramatic rise in memory and compute costs. Nevertheless, the Google researchers point out that these linear models fall short of classic Transformers because they compress contextual data and can miss important details.

The ideal architecture should feature distinct memory components that work together to leverage existing knowledge, memorize new facts, and learn abstract concepts from context. Researchers propose that an effective learning paradigm should resemble the human brain, consisting of different yet interconnected modules, each responsible for key parts of the learning process.

To address these gaps, the researchers developed a "neural long-term memory" module that can learn new information during inference without the inefficiencies of full attention. The module learns a function that memorizes new facts at inference time and dynamically adapts how it memorizes based on the data it encounters, avoiding the generalization problems found in other neural network architectures.
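As a rough illustration of the idea (a simplified sketch, not the released Titans code; the layer sizes, projections, and learning rate are assumptions), the memory can be pictured as a small MLP whose weights are nudged by a gradient step on an associative recall loss each time new data arrives at inference time:

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Toy long-term memory: an MLP updated by gradient steps at inference time."""
    def __init__(self, dim, hidden=128, lr=0.01):
        super().__init__()
        self.memory = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        self.w_k = nn.Linear(dim, dim, bias=False)  # key projection
        self.w_v = nn.Linear(dim, dim, bias=False)  # value projection
        self.lr = lr

    @torch.no_grad()
    def read(self, x):
        # Retrieve what the memory currently associates with these tokens.
        return self.memory(self.w_k(x))

    def write(self, x):
        # "Learning during inference": one gradient step on a recall loss.
        k, v = self.w_k(x).detach(), self.w_v(x).detach()
        with torch.enable_grad():
            loss = ((self.memory(k) - v) ** 2).mean()
            grads = torch.autograd.grad(loss, list(self.memory.parameters()))
        with torch.no_grad():
            for p, g in zip(self.memory.parameters(), grads):
                p -= self.lr * g
        return loss.detach()
```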

The neural memory module uses the concept of "surprise" to decide which information is worth storing: the more a token sequence differs from what is already captured in the model's weights and existing memory, the more surprising it is, and thus the more worth remembering. This lets the module use its limited capacity efficiently by storing only useful segments of the data.
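Continuing the toy sketch above (an illustration, not the paper's exact formulation), one way to express this is to score incoming tokens by the size of the gradient of the memory's recall loss: content the memory already predicts well produces a small gradient and a low surprise score.

```python
import torch

def surprise_score(mem: NeuralMemory, x):
    # Large recall error => large gradient => "surprising" => worth storing.
    k, v = mem.w_k(x).detach(), mem.w_v(x).detach()
    with torch.enable_grad():
        loss = ((mem.memory(k) - v) ** 2).mean()
        grads = torch.autograd.grad(loss, list(mem.memory.parameters()))
    return torch.sqrt(sum((g ** 2).sum() for g in grads))  # gradient norm as surprise
```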

To manage very long data sequences, the neural memory module features an adaptive forgetting mechanism that removes unnecessary information, thereby managing the limited capacity of memory.
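Putting the two mechanisms together (again a sketch under the same assumptions, using fixed scalar gates where the paper describes data-dependent ones), a write step can accumulate surprise with momentum while a decay gate gradually erases stale memories:

```python
import torch

def write_with_forgetting(mem: NeuralMemory, x, state, momentum=0.9, step=0.01, forget=0.05):
    # state: one tensor per memory parameter, initialized to zeros, carrying "past surprise".
    k, v = mem.w_k(x).detach(), mem.w_v(x).detach()
    with torch.enable_grad():
        loss = ((mem.memory(k) - v) ** 2).mean()
        grads = torch.autograd.grad(loss, list(mem.memory.parameters()))
    new_state = []
    with torch.no_grad():
        for p, g, s in zip(mem.memory.parameters(), grads, state):
            s = momentum * s - step * g      # past surprise plus momentary surprise
            p.mul_(1.0 - forget).add_(s)     # forgetting gate decays old content
            new_state.append(s)
    return new_state

# state = [torch.zeros_like(p) for p in mem.memory.parameters()]
```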

The Titans architecture is a family of models that combine existing Transformer modules with the neural memory module. It has three key components: a "core" module that serves as short-term memory, using classical attention to focus on the current segment of input tokens being processed; a "long-term memory" module that uses the neural memory design to store information beyond the current context; and a "persistent memory" module whose learnable parameters are fixed after training and store time-independent knowledge about the task.

Researchers proposed various ways to connect these three components. Overall, the architecture’s main advantage lies in allowing attention and memory modules to complement each other. For instance, attention layers can use historical and current contexts to decide which parts of the current context window should be stored in long-term memory. Simultaneously, long-term memory provides historical knowledge not present in the current attention context.
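A rough end-to-end sketch of one such wiring (a simplified "memory as context" layout; the class and parameter names are illustrative, and it reuses the NeuralMemory toy above) has attention read from persistent tokens, retrieved long-term memory, and the current segment, with its output then written back into long-term memory:

```python
import torch
import torch.nn as nn

class TitansBlockSketch(nn.Module):
    def __init__(self, dim=64, n_heads=4, n_persistent=16):
        super().__init__()
        self.persistent = nn.Parameter(torch.randn(n_persistent, dim))  # fixed knowledge after training
        self.long_term = NeuralMemory(dim)                              # toy module from the earlier sketch
        self.core = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, segment):                       # segment: (seq_len, dim)
        retrieved = self.long_term.read(segment)      # historical context beyond the window
        context = torch.cat([self.persistent, retrieved, segment], dim=0)
        q, kv = segment.unsqueeze(0), context.unsqueeze(0)
        out, _ = self.core(q, kv, kv)                 # short-term attention over all three sources
        out = out.squeeze(0)
        self.long_term.write(out.detach())            # attention output decides what to memorize
        return out

# Example: out = TitansBlockSketch()(torch.randn(512, 64))
```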

The researchers ran small-scale tests on Titans models ranging from 170 million to 760 million parameters, covering language modeling and long-sequence language tasks. They compared Titans against various Transformer-based models, linear models such as Mamba, and hybrid models such as Samba. Results showed that Titans outperformed similarly sized Transformer and linear models on language modeling.

In tasks involving long sequences, such as "needle in a haystack" (where the model must retrieve information from extremely long sequences) and BABILong (where the model must perform cross-fact reasoning across extensive documents), Titans demonstrated particularly strong performance, surpassing models with significantly more parameters, including GPT-4, GPT-4o-mini, and Llama-3 enhanced with retrieval-augmented generation (RAG).

Moreover, researchers managed to extend Titans' context window to 2 million tokens while keeping memory costs low. Although further testing on larger models is needed, the results indicate that Titans' potential has not yet been fully realized.

For enterprise applications, given Google's leading position in long-context models, this technology could find its way into proprietary and open models such as Gemini and Gemma. Supporting longer context windows makes it possible to build applications that incorporate new knowledge directly through prompts, without relying on more complex techniques such as RAG, and prompt-based development cycles are faster than building RAG pipelines. Additionally, architectures like Titans can reduce inference costs for long sequences, allowing enterprises to deploy LLM applications across more use cases. Google plans to release the PyTorch and JAX code for training and evaluating Titans models.