The World Needs Models Superior to Transformers

2024-05-10

It is fair to say that modern artificial intelligence, or at least generative AI, runs on Google's Transformer, the architecture built around the attention mechanism. Seven years after "Attention Is All You Need" was published, researchers are still striving to find a better architecture, yet despite many dissenting voices, the Transformer still dominates.


Challenging the Transformer, however, is nothing new for researchers. Sepp Hochreiter, the inventor of Long Short-Term Memory (LSTM), has unveiled a new LLM architecture in his latest paper: xLSTM, short for Extended Long Short-Term Memory. It addresses a major weakness of the original LSTM design, which is inherently sequential and cannot process all tokens at once.


Compared to the Transformer, the LSTM is limited by its storage capacity, its inability to revise storage decisions, and its lack of parallelizability caused by memory mixing. The Transformer, by contrast, operates on all tokens in parallel, which makes it significantly more efficient to train.


The main components of the new architecture are a matrix memory for the LSTM, which eliminates memory mixing and allows parallel computation, and exponential gating. These modifications allow the LSTM to revise its memory more effectively as new data arrives.
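

To make that concrete, here is a minimal sketch of a matrix-memory update with exponential gating, loosely following the equations described in the xLSTM paper; the stabilization details are omitted and the variable names are ours, so treat it as an illustration rather than the authors' implementation.

```python
import numpy as np

def mlstm_step(C, n, q, k, v, i_pre, f_pre, o_pre):
    """One matrix-memory (mLSTM-style) update as described in the xLSTM
    paper; the numerical stabilization used in the paper is omitted."""
    i = np.exp(i_pre)                        # exponential input gate
    f = 1.0 / (1.0 + np.exp(-f_pre))         # sigmoid forget gate
    o = 1.0 / (1.0 + np.exp(-o_pre))         # sigmoid output gate

    C = f * C + i * np.outer(v, k)           # matrix memory: rank-1 update
    n = f * n + i * k                        # normalizer state
    h_tilde = C @ q / max(abs(n @ q), 1.0)   # normalized read-out
    return C, n, o * h_tilde

# toy usage with d-dimensional keys, queries, and values
d = 4
C, n = np.zeros((d, d)), np.zeros(d)
rng = np.random.default_rng(0)
for _ in range(3):
    q, k, v = rng.normal(size=(3, d))
    C, n, h = mlstm_step(C, n, q, k, v, i_pre=0.1, f_pre=2.0, o_pre=0.0)
```

Because there is no memory mixing between hidden units in this update, it can be parallelized across the sequence, unlike the classic LSTM recurrence.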


What problems do Transformers have?


Last December, researchers Albert Gu and Tri Dao of Carnegie Mellon University and Together AI introduced Mamba, challenging the dominance of the Transformer.


Their work presents Mamba as a State Space Model (SSM) that delivers strong performance across several modalities, including language, audio, and genomics. In language modeling, for example, the Mamba-3B model outperformed Transformer-based models of the same size in both pretraining and downstream evaluation, and matched Transformers twice its size.


The researchers highlighted the efficiency of Mamba's selective SSM layer, which targets a major limitation of the Transformer: its computational inefficiency on long sequences running to millions of tokens.
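

For a sense of why this scales better, here is a simplified, single-channel sketch of a selective scan in the spirit of Mamba; the parameterization and names are our own placeholders, not the paper's implementation. The key point is that each token is processed against a fixed-size state, so the cost grows linearly with sequence length rather than quadratically.

```python
import numpy as np

def selective_ssm_scan(x, A, w_b, w_c, w_dt):
    """Minimal single-channel selective state-space scan: the step size and
    the B/C projections depend on the input, so the model can decide what to
    remember. A simplified illustration, not Mamba's actual parameterization."""
    d_state = A.shape[0]                     # A: (d_state,) diagonal transition
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                            # one step per token: linear in length
        dt = np.log1p(np.exp(w_dt * x_t))    # input-dependent step size (softplus)
        B = w_b * x_t                        # input-dependent input projection
        C = w_c * x_t                        # input-dependent output projection
        A_bar = np.exp(dt * A)               # discretized state transition
        h = A_bar * h + dt * B * x_t         # recurrence over a fixed-size state
        ys.append(float(C @ h))
    return np.array(ys)

# toy usage: a length-8 scalar sequence with a 4-dimensional state
rng = np.random.default_rng(0)
y = selective_ssm_scan(rng.normal(size=8), A=-np.ones(4),
                       w_b=rng.normal(size=4), w_c=rng.normal(size=4), w_dt=0.5)
```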


Another paper, from the Allen Institute for AI, "Faith and Fate: Limits of Transformers on Compositionality," explores the fundamental limitations of Transformer language models by focusing on compositional problems that require multi-step reasoning.


The study investigates three representative compositional tasks: long multiplication, logic grid puzzles (such as Einstein's puzzle), and a classic dynamic programming problem.


The authors argue that the autoregressive nature of Transformers poses a fundamental challenge to truly mastering such tasks. These findings underline the urgent need to improve both the Transformer architecture and its training methods.


A good starting point


According to Yann LeCun, Meta's chief AI scientist, "an autoregressive LLM is like a process that diverges exponentially from the correct answer."
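

LeCun's point can be made concrete with a back-of-the-envelope calculation, a deliberate simplification that assumes independent per-token errors: if each generated token has probability e of pushing the answer off track, the chance that an n-token answer is still correct is roughly (1 - e)^n, which decays exponentially with length. With e = 0.01, a 500-token answer stays on track only about 0.99^500 ≈ 0.7% of the time.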


This may be why Meta also introduced MEGALODON, a neural architecture for efficient sequence modeling with effectively unlimited context length. It is designed to address the limitations of the Transformer on long sequences, namely its quadratic computational complexity and its limited inductive bias for length generalization.
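

The quadratic complexity mentioned above is easy to see directly: standard self-attention computes one score per query-key pair, so an n-token sequence produces an n-by-n matrix. The toy snippet below is a generic illustration, not Meta's code.

```python
import numpy as np

def attention_scores(Q, K):
    """Why full attention is quadratic: every query attends to every key,
    so n tokens yield an n x n score matrix. Doubling the context roughly
    quadruples both the compute and the memory for this step."""
    return Q @ K.T / np.sqrt(Q.shape[-1])    # shape: (n, n)

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(2, n, d))
print(attention_scores(Q, K).shape)          # (4096, 4096) -- n**2 entries
```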


This is similar to Google's Feedback Attention Memory (FAM), a novel Transformer architecture that uses a feedback loop to let the network attend to its own latent representations, allowing it in principle to handle indefinitely long sequences.


In April of this year, Google also released RecurrentGemma 2B, a new open language model based on the Griffin architecture developed by Google DeepMind.


This architecture achieves fast inference on long sequences by replacing global attention with a mixture of local attention and linear recurrences.
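

As an illustration of what such a linear recurrence looks like, here is a minimal sketch of a gated recurrent update; the gating scheme and names are simplified placeholders rather than DeepMind's actual recurrent layer. Because each step only touches a fixed-size state, generation cost grows linearly with sequence length and the memory footprint stays constant.

```python
import numpy as np

def gated_linear_recurrence(x, a, W_gate):
    """Minimal sketch of a gated linear recurrence of the kind Griffin-style
    blocks combine with local attention; an illustrative placeholder."""
    seq_len, d = x.shape
    h = np.zeros(d)
    out = np.empty_like(x)
    for t in range(seq_len):                        # constant-size state per step
        g = 1.0 / (1.0 + np.exp(-(x[t] @ W_gate)))  # input-dependent gate in (0, 1)
        decay = a ** g                              # gated decay, elementwise
        h = decay * h + np.sqrt(1.0 - decay**2) * x[t]  # leaky integration of input
        out[t] = h
    return out

# toy usage: 16 tokens, model width 8, decay close to 1 for long memory
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
y = gated_linear_recurrence(x, a=np.full(8, 0.95), W_gate=rng.normal(size=(8, 8)))
```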


Speaking of mixtures, the Mixture of Experts (MoE) approach is also gaining ground. It is a neural network architecture that combines the strengths of multiple smaller sub-networks, called "experts," to produce a prediction or output. An MoE model is like a team of specialists in a hospital, with each expert covering a specific field such as cardiology, neurology, or orthopedics.


In a Transformer, MoE has two key elements: sparse MoE layers and a gating network. The sparse MoE layer holds the different "experts," each a small network capable of handling specific kinds of input. The gating network acts as a manager, deciding which tokens are routed to which experts.
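

The sketch below shows sparse top-k routing for a single token; the function names and the top-k-then-renormalize scheme are illustrative choices, not any particular library's API.

```python
import numpy as np

def moe_layer(x, W_router, experts, k=2):
    """Minimal sparse MoE routing for one token: the router scores every
    expert, only the k best are run, and their outputs are mixed by the
    renormalized router weights."""
    logits = x @ W_router                        # one score per expert
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                     # renormalized gate weights
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# toy usage: 4 experts, each a tiny linear "feed-forward" layer
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), rng.normal(size=(d, n_experts)), experts, k=2)
```

Only k experts run for any given token, which is what lets MoE models grow their parameter count without a proportional increase in compute per token.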


The end of Transformers?


Before Transformers became popular, Recurrent Neural Networks (RNNs) were the workhorse of deep learning for sequence data. By definition, however, RNNs process data one step at a time, which makes them hard to parallelize and poorly suited to training large text models.


But the Transformer itself began as a modification of RNN-based models: attention was first added on top of recurrent networks before the recurrence was dropped altogether. Replacing the Transformer with something else would be much the same kind of step.


At NVIDIA GTC 2024, Jensen Huang asked his panel of Transformer authors about the most important improvements to the basic Transformer design. Aidan Gomez, a co-author of the original paper and CEO of Cohere, replied that a great deal of work has gone into accelerating inference for these models. Even so, Gomez expressed dissatisfaction that everything today is still built on the Transformer.


"I still feel that we are so similar to the original form, which makes me uneasy. I think the world needs something better than Transformers," he said, adding that he hopes it can be replaced by a "new performance peak." "I think it's too similar to something from six or seven years ago."