Breakthrough in MoE Technology: DeepMind's PEER Architecture

2024-07-16

In recent years, Mixture-of-Experts (MoE) has become a key technique for Large Language Models (LLMs), greatly expanding model capacity without a matching increase in the compute spent on each input. Instead of applying every parameter of the model to every token, an MoE layer acts as an intelligent scheduler: a router hands each input to a small team of specialized experts, and only those experts do the work. This lets a model keep inference fast even as its parameter count surges. Mixtral, DBRX, and Grok all adopt this strategy, and GPT-4 is widely rumored to use it as well.

MoE has its own challenges, however, above all in scaling up the number of experts. To address this, Google DeepMind proposed the Parameter Efficient Expert Retrieval (PEER) architecture, which lets an MoE model support millions of experts and strikes a finer balance between performance and computational cost, pointing a way forward for MoE technology.

The growth of large language models has largely been a story of adding parameters to gain performance, but as parameter counts rise, compute and memory become the bottleneck. In a standard Transformer block, the attention layers model the relationships between tokens, while the feed-forward networks act as a vast knowledge store and account for a large share of the parameters. Because every parameter is applied to every token, computational cost grows linearly with model size.

MoE brings flexibility to this parameter race. It replaces the single massive feed-forward network with a group of smaller expert networks, each focused on its own niche, and a learned router distributes tokens among them based on their content. Even as the number of experts grows, the per-token computation stays roughly constant, so capacity increases without a matching increase in cost (a minimal sketch of such a router appears below).

Choosing the right number of experts is not straightforward, though: it depends on factors such as the amount of training data and the available compute. Studies suggest that increasing the number of experts, making each one smaller and more specialized, significantly improves model quality, especially when model size and training data grow together. A large pool of experts also lets a model absorb new knowledge more flexibly, since new experts can be added, with appropriate regularization, to keep up with an evolving data distribution.

Existing MoE methods hit their limits, however, once the number of experts reaches a certain scale: a traditional fixed router struggles to manage a very large team of experts efficiently. This is where DeepMind's PEER architecture comes in. It replaces the dense router with a learned index, enabling fast retrieval and precise activation across a massive pool of experts. For each input, PEER first narrows the field with a fast pre-screening step and then activates only the most relevant experts, so processing speed does not degrade as the expert count grows.
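To make the routing idea concrete, here is a minimal sketch of a conventional top-k MoE layer, the kind of dense router PEER moves away from. It is an illustrative toy, not code from any of the models mentioned above; the class name SimpleTopKMoE and all sizes are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTopKMoE(nn.Module):
    """Toy top-k MoE feed-forward layer; names and sizes are illustrative."""
    def __init__(self, d_model=256, d_hidden=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward network over the model dimension.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                               # x: (batch, d_model)
        scores = self.router(x)                          # (batch, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # normalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(4, 256)
print(SimpleTopKMoE()(x).shape)   # torch.Size([4, 256])
```

Only top_k of the num_experts networks run for each token, which is why adding experts grows capacity without growing per-token compute; the catch is that the router is a single dense layer whose cost and trainability degrade as the expert count becomes very large.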
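PEER replaces that single dense routing layer with product-key retrieval over a very large pool of tiny experts. The sketch below is a simplified reading of the paper's retrieval step, not DeepMind's code, and it anticipates the single-neuron experts and multi-head retrieval discussed next; names such as PEERLayer and n_sub are invented for the example, and details like normalization, activation choice, and initialization differ from the actual implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERLayer(nn.Module):
    """Simplified PEER-style layer: product-key retrieval over single-neuron experts."""
    def __init__(self, d_model=256, num_experts=1024, top_k=16, n_heads=4):
        super().__init__()
        self.n_sub = math.isqrt(num_experts)             # experts live on an n_sub x n_sub grid
        assert self.n_sub * self.n_sub == num_experts
        self.top_k, self.n_heads = top_k, n_heads
        d_key = d_model // 2
        # Two small sub-key tables replace one huge key table (product keys).
        self.sub_keys1 = nn.Parameter(torch.randn(self.n_sub, d_key) / math.sqrt(d_key))
        self.sub_keys2 = nn.Parameter(torch.randn(self.n_sub, d_key) / math.sqrt(d_key))
        self.query = nn.Linear(d_model, d_model * n_heads)
        # Each expert is a single hidden neuron: one down vector and one up vector.
        self.down = nn.Embedding(num_experts, d_model)
        self.up = nn.Embedding(num_experts, d_model)

    def forward(self, x):                                # x: (batch, d_model)
        q = self.query(x).view(x.size(0), self.n_heads, 2, -1)
        out = torch.zeros_like(x)
        for h in range(self.n_heads):                    # multi-head retrieval, contributions summed
            s1 = q[:, h, 0] @ self.sub_keys1.T           # (batch, n_sub) scores for each query half
            s2 = q[:, h, 1] @ self.sub_keys2.T
            v1, i1 = s1.topk(self.top_k, dim=-1)         # cheap pre-screening on each half
            v2, i2 = s2.topk(self.top_k, dim=-1)
            # Combine the two halves into top_k * top_k candidates, keep the best top_k.
            cand = (v1[:, :, None] + v2[:, None, :]).flatten(1)
            score, flat = cand.topk(self.top_k, dim=-1)
            idx = i1.gather(1, flat // self.top_k) * self.n_sub + i2.gather(1, flat % self.top_k)
            gate = F.softmax(score, dim=-1)              # router weights over the retrieved experts
            hidden = F.relu((self.down(idx) * x[:, None, :]).sum(-1))        # (batch, top_k)
            out = out + ((gate * hidden)[:, :, None] * self.up(idx)).sum(1)  # weighted expert outputs
        return out

x = torch.randn(2, 256)
print(PEERLayer()(x).shape)   # torch.Size([2, 256])
```

The key point is that the two top-k searches over sub-keys stand in for a search over the full expert set: scaling from the 1,024 experts in this toy to a million grows the sub-key scoring only with the square root of the expert count, while the number of experts actually evaluated per token stays at top_k per head.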
What is especially ingenious is PEER's expert design. Each expert contains only a single hidden neuron, so the expert pool is really a large, shared collection of tiny building blocks, which greatly improves parameter efficiency. Borrowing the idea behind multi-head attention, PEER retrieves experts with several query heads in parallel and sums their contributions, letting the model explore and combine expert capacity more thoroughly.

The introduction of PEER injects new vitality into MoE technology and points to a new direction for large language models. Whether used as an additional layer on top of a Transformer or as a direct replacement for its feed-forward networks, PEER delivers significant performance gains with high parameter efficiency. It also combines naturally with Parameter-Efficient Fine-Tuning (PEFT), allowing a model to adapt to new tasks quickly and at low cost; according to the paper, PEER could even select PEFT adapters at runtime, opening the possibility of dynamically adding new knowledge and capabilities to an LLM. There are also reports that PEER may be used in DeepMind's Gemini 1.5 models, which are said to adopt a "new Mixture-of-Experts (MoE) architecture", although this has not been confirmed.

In practice, the researchers evaluated PEER on several language-modeling benchmarks and compared it with Transformers using dense feed-forward layers as well as with other MoE architectures. The results show that PEER strikes a better balance between performance and computation, reaching lower perplexity at the same computational budget, and that adding more experts to a PEER model drives perplexity down further. This challenges the earlier view that MoE efficiency peaks at a modest number of experts, and it suggests that, with the right retrieval and routing mechanisms, MoE can scale to millions of experts, further reducing the cost and complexity of training and serving very large language models.
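For context on the metric, perplexity is simply the exponential of the average per-token cross-entropy loss, so a lower value at a fixed compute budget means the model predicts held-out text better. A quick, self-contained illustration with placeholder numbers (not data from the paper):

```python
import torch
import torch.nn.functional as F

# Perplexity = exp(mean per-token cross-entropy). The logits and targets here are
# random placeholders, not results from the PEER experiments.
logits = torch.randn(10, 32000)            # 10 tokens, 32k-entry vocabulary
targets = torch.randint(0, 32000, (10,))   # "true" next tokens
ppl = torch.exp(F.cross_entropy(logits, targets))
print(f"perplexity: {ppl.item():.1f}")     # random logits give a very high perplexity
```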