New Technique Dramatically Improves Neural Network Efficiency and Accelerates Language Processing

2023-11-27

Researchers at ETH Zurich (the Swiss Federal Institute of Technology in Zurich) have developed a technique that significantly speeds up neural network inference. They demonstrate that changing how inference is carried out can greatly reduce the computational requirements of these networks: in experiments on BERT, they reduced the computation by more than 99%. The technique can also be applied to the transformer models behind large language models such as GPT-3, opening up new possibilities for faster and more efficient language processing.

Fast Feed-Forward Networks

The neural network transformers that power large language models consist of several kinds of layers, including attention layers and feed-forward layers. The feed-forward layers account for a large share of the model's parameters and incur a heavy computational cost, because every input must be multiplied with every neuron across the full input dimension. The researchers' paper shows, however, that not every input needs all of the feed-forward neurons to be active during inference. They therefore propose a "fast feed-forward" (FFF) layer as an alternative to the traditional feed-forward layer.

FFF uses a mathematical operation called conditional matrix multiplication (CMM) in place of the dense matrix multiplication (DMM) used in conventional feed-forward networks. In DMM, every input is multiplied with every neuron in the layer, which is both computationally intensive and inefficient. In CMM, by contrast, the output of each neuron determines which neurons are needed next, so no single input ever engages more than a small fraction of the layer. By selecting only the right neurons for each computation, FFF drastically reduces the computational load, yielding faster and more efficient language models.

Application of Fast Feed-Forward Networks

To validate the technique, the researchers built FastBERT, a modified version of Google's BERT transformer model. FastBERT replaces the intermediate feed-forward layers with fast feed-forward layers, which arrange their neurons in a balanced binary tree and execute only a single root-to-leaf branch for any given input (a minimal sketch of this routing appears at the end of this section).

To evaluate FastBERT, the researchers fine-tuned several variants on tasks from the General Language Understanding Evaluation (GLUE) benchmark, a comprehensive collection of resources for training, evaluating, and analyzing natural language understanding systems. The results are impressive: FastBERT performs comparably to baseline BERT models of similar size and training regime. Variants trained for one day on a single A6000 GPU retain at least 96.0% of the original BERT model's performance, and, notably, the best-performing FastBERT model matches the original BERT model while using only 0.3% of its feed-forward neurons.

The researchers believe that incorporating fast feed-forward networks into large language models has tremendous acceleration potential. In GPT-3, for example, the feed-forward network in each transformer layer consists of 49,152 neurons. They point out: "If trainable, this network can be replaced by a fast feed-forward network with a maximum depth of 15, containing 65,536 neurons but using only 16 for inference. This corresponds to approximately 0.03% of GPT-3's neurons."
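To make the routing concrete, here is a minimal sketch of such a tree-routed layer in NumPy. It assumes a deliberately simplified formulation in which every tree node is a single ReLU neuron and the sign of its pre-activation selects the child to visit next; the class name FastFeedForwardSketch, the weight initialization, and the depth-3 demo dimensions are illustrative choices, not the authors' reference implementation.

```python
import numpy as np


class FastFeedForwardSketch:
    """Illustrative fast feed-forward (FFF) layer.

    Neurons are arranged in a balanced binary tree; each input activates only
    the neurons on one root-to-leaf path (conditional matrix multiplication)
    instead of every neuron in the layer (dense matrix multiplication).
    Simplified for exposition: each node is a single ReLU neuron whose
    pre-activation sign selects the child to visit next.
    """

    def __init__(self, input_dim: int, output_dim: int, depth: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.depth = depth
        n_nodes = 2 ** (depth + 1) - 1                     # every neuron in the tree
        self.w_in = rng.standard_normal((n_nodes, input_dim)) / np.sqrt(input_dim)
        self.w_out = rng.standard_normal((n_nodes, output_dim)) / np.sqrt(depth + 1)

    def forward(self, x: np.ndarray) -> np.ndarray:
        """Single-example inference: visits depth + 1 of the 2**(depth+1) - 1 neurons."""
        y = np.zeros(self.w_out.shape[1])
        node = 0                                           # start at the root
        for _ in range(self.depth + 1):
            pre = self.w_in[node] @ x                      # one dot product per visited neuron
            y += np.maximum(pre, 0.0) * self.w_out[node]   # accumulate the ReLU'd contribution
            node = 2 * node + 1 if pre < 0 else 2 * node + 2   # sign picks left or right child
        return y


# Scale of the savings described for GPT-3: a depth-15 tree holds
# 2**16 - 1 = 65,535 (roughly 65,536) neurons but touches only 16 per input,
# versus 49,152 neurons in a dense GPT-3 feed-forward layer.
layer = FastFeedForwardSketch(input_dim=768, output_dim=768, depth=3)
print(layer.forward(np.ones(768)).shape)   # (768,)
print(f"{16 / 49_152:.4%}")                # ~0.0326%, i.e. about 0.03%
```

The point of the sketch is the access pattern: for a tree of depth d, inference touches only d + 1 of the 2^(d+1) - 1 neurons, which is where the 16-out-of-65,536 figure for a hypothetical GPT-3-scale layer comes from.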
Room for Improvement

Decades of hardware and software optimization have gone into dense matrix multiplication, the operation used in traditional feed-forward networks. The researchers write: "Dense matrix multiplication is the most optimized mathematical operation in computing history, with significant efforts made to design memory, chips, instruction sets, and software routines that execute it as fast as possible. Many of these advancements are kept secret due to complexity or competitive advantage and are only exposed to end users through powerful but restrictive programming interfaces."

By contrast, there is currently no efficient native implementation of the conditional matrix multiplication used in fast feed-forward networks: popular deep learning frameworks offer no interface through which CMM could be implemented beyond high-level simulation. The researchers therefore wrote their own CPU- and GPU-level implementations of the CMM operation and obtained a 78-fold speedup at inference time. They believe, however, that better hardware and lower-level algorithmic support could push the inference speedup beyond 300-fold, which would go a long way toward addressing a major bottleneck of language models: the number of tokens generated per second. As the researchers put it, "At the scale of the BERT-base model, the theoretical acceleration promise is 341 times, and we hope our work will inspire efforts to implement conditional neural execution primitives in device programming interfaces." (The sketch at the end of this article contrasts the dense and conditional access patterns.)

This research is part of a broader effort to address the memory and computational bottlenecks of large language models, paving the way for more efficient and powerful artificial intelligence systems.
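As a rough illustration of why CMM sits awkwardly on today's software stack, the sketch below contrasts the two access patterns in NumPy. The dense path is a single large matrix product, exactly the kind of call that BLAS and cuBLAS accelerate; the conditional path gathers a different small set of weight rows for every input. The batch size, layer width, the choice of 16 neurons per input, and the random index selection (a stand-in for real tree routing) are all arbitrary, illustrative assumptions, not the researchers' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, n_neurons = 32, 768, 3072

x = rng.standard_normal((batch, d_in))
W_in = rng.standard_normal((n_neurons, d_in))
W_out = rng.standard_normal((n_neurons, d_in))

# Dense matrix multiplication (DMM): one fused matrix product touches every
# neuron for every input -- the operation decades of hardware and software
# optimization have been aimed at.
h_dense = np.maximum(x @ W_in.T, 0.0)                    # (batch, n_neurons)
y_dense = h_dense @ W_out                                # (batch, d_in)

# Conditional matrix multiplication (CMM), caricatured: each input selects its
# own small subset of neuron rows (here 16 per example, chosen arbitrarily).
# The arithmetic per example collapses, but the memory access pattern becomes a
# per-example gather of scattered rows plus many tiny dot products, for which
# mainstream deep learning frameworks expose no fast primitive.
k = 16
idx = rng.integers(0, n_neurons, size=(batch, k))        # stand-in for tree routing
W_in_sel = W_in[idx]                                     # (batch, k, d_in) gather
W_out_sel = W_out[idx]                                   # (batch, k, d_in) gather
h_cond = np.maximum(np.einsum("bd,bkd->bk", x, W_in_sel), 0.0)
y_cond = np.einsum("bk,bkd->bd", h_cond, W_out_sel)

print(y_dense.shape, y_cond.shape)                       # (32, 768) (32, 768)
print(n_neurons // k)                                    # 192x fewer neuron activations per input
```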