Meta Unveils Details of the GPU Clusters It Will Use to Train Llama 3

2024-03-13

Meta Platforms has announced two powerful graphics processing unit (GPU) clusters that will be used to train its next-generation generative artificial intelligence (AI) models, including the upcoming Llama 3. In a blog post, Meta engineers Kevin Lee, Adi Gangidi, and Mathew Oldham explained that these data center-scale clusters, each equipped with 24,576 GPUs, were built to support larger and more complex generative AI models than the company's previous releases, such as Llama 2, a popular open-source large language model that competes with OpenAI's ChatGPT and Google's Gemini. The engineers added that the clusters will also support future AI research and development.

Each cluster contains thousands of Nvidia Corp.'s most powerful H100 GPUs, surpassing the scale of Meta's previous large clusters, which held approximately 16,000 Nvidia A100 GPUs. Meta has reportedly been buying up thousands of Nvidia's latest chips, and a recent report by Omdia found that it has become one of the chipmaker's largest customers. Now we know why.

Meta said it will use the new clusters to fine-tune its existing AI systems and to train newer, more powerful ones, including Llama 3, the planned successor to Llama 2. The blog post is Meta's first confirmation that Llama 3 is in development, though the project had been widely rumored. The engineers said work on Llama 3 is currently "in progress" but did not say when it might be released.

In the long term, Meta aims to build artificial general intelligence (AGI) systems that come closer to human-level creativity than today's generative AI models, and it said the new clusters will help advance those ambitions. Meta also revealed that it is upgrading its PyTorch AI framework to support training across a much larger number of GPUs (a sketch of what multi-GPU training looks like in PyTorch appears below).

In terms of internal architecture, the two clusters have the same number of GPUs, interconnected through endpoints running at 400 gigabits per second, but they employ different network fabrics. One provides remote direct memory access (RDMA) over converged Ethernet, built on Arista Networks' Arista 7800 switches together with Wedge400 and Minipack2 OCP rack switches. The other uses Nvidia's own Quantum2 InfiniBand fabric.

Both clusters are built on Grand Teton, Meta's open-source GPU hardware platform designed for large-scale AI workloads. Grand Teton is said to provide four times the host-to-GPU bandwidth of its predecessor, the Zion-EX platform, along with twice the compute and data network bandwidth and twice the power envelope.

Meta said the clusters use its latest Open Rack power and rack infrastructure, which is designed to give data center designs greater flexibility. According to the engineers, Open Rack v3 allows power shelves to be installed anywhere in the rack rather than being bolted to the busbar, enabling more flexible configurations. The number of servers per rack is also customizable, making it possible to strike a more efficient balance of throughput capacity per server; Meta said this has somewhat reduced the total number of racks.
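To make the PyTorch point above concrete, here is a minimal, hypothetical sketch of multi-GPU data-parallel training with PyTorch's NCCL backend. None of it comes from Meta's post: the model, hyperparameters, and training loop are placeholders, and a real LLM run at this scale would layer heavier sharding (e.g. FSDP) on the same primitives.

```python
# Hypothetical sketch: data-parallel training on multiple GPUs with PyTorch.
# Assumes a launcher such as torchrun has set RANK, LOCAL_RANK, WORLD_SIZE, etc.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL is the backend PyTorch uses for GPU-to-GPU collectives; it rides on
    # whatever fabric the cluster provides (RoCE Ethernet or InfiniBand).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; stands in for a real (far larger) LLM.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):  # placeholder training loop with random data
        x = torch.randn(32, 4096, device=local_rank)
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across GPUs via NCCL
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with a command like `torchrun --nproc_per_node=8 train.py`, one such process runs per GPU, and NCCL carries the gradient synchronization over the cluster's network fabric.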
In terms of storage, the clusters use a Linux user-space file system (FUSE) API backed by Tectonic, Meta's distributed storage platform. Meta has also partnered with the startup Hammerspace to build a new parallel network file system for the clusters.

Finally, the engineers explained that the clusters are based on the YV3 Sierra Point server platform and are fitted with state-of-the-art E1.S solid-state drives. The team noted that it customized the clusters' network topology and routing and deployed Nvidia's Collective Communications Library (NCCL), a set of communication routines optimized for Nvidia GPUs (a minimal illustration of the kind of collective NCCL accelerates appears at the end of this article).

Meta wrote that it remains committed to open innovation across the AI hardware stack, reminding readers that it is a member of the recently announced AI Alliance, which aims to foster an open ecosystem that improves transparency and trust in AI development and ensures everyone benefits from innovation.

"Looking ahead, we recognize that methods that were effective yesterday or today may not meet tomorrow's needs," the engineers wrote. "That's why we continuously evaluate and improve every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond."

Meta also revealed that it will keep buying Nvidia H100 GPUs and plans to have more than 350,000 of them by the end of the year. These will go toward further building out its AI infrastructure, and even more powerful GPU clusters may appear in the near future.
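As a footnote to the NCCL mention above, the following is a minimal, hypothetical illustration of all-reduce, the collective operation at the heart of distributed gradient synchronization that NCCL optimizes. It assumes PyTorch with the NCCL backend on a multi-GPU host and is not taken from Meta's post.

```python
# Hypothetical demo of the all-reduce collective that NCCL accelerates.
# Run with e.g.: torchrun --nproc_per_node=8 allreduce_demo.py
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL handles the GPU collectives
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

# Each GPU contributes its own tensor; after all_reduce, every GPU holds the sum.
t = torch.full((4,), float(rank), device=rank)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {t.tolist()}")  # identical result on every rank

dist.destroy_process_group()
```

Every rank contributes its own tensor and ends up holding the element-wise sum across all ranks, which is what happens to gradients at each step of distributed training.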