NVIDIA Hopper Leads in MLPerf Results for Generative AI Inference

2024-03-28

NVIDIA has announced its results in the latest MLPerf inference benchmarks, further solidifying its leading position in generative AI. Leveraging TensorRT-LLM, a library designed to accelerate large language model (LLM) inference, NVIDIA's Hopper architecture GPUs delivered a 3x performance improvement on the GPT-J LLM test compared with the previous round six months earlier. Leading companies across the industry are already optimizing their models with TensorRT-LLM, and NVIDIA NIM (NVIDIA Inference Microservices) packages powerful engines such as TensorRT-LLM to drive this process further. This comprehensive approach simplifies deployment of NVIDIA's inference platform, giving enterprises greater efficiency and flexibility.

The latest MLPerf round also showcased a significant leap in generative AI capability. Running TensorRT-LLM on the H200 Tensor Core GPU, which features enhanced memory and made its MLPerf debut this round, NVIDIA achieved throughput of up to 31,000 tokens per second on the Llama 2 70B benchmark. The H200 results also highlight strides in thermal management, with custom cooling solutions contributing performance gains of up to 14%, and system builders' creative implementations of NVIDIA MGX designs further extend the performance of Hopper GPUs. NVIDIA has started shipping the H200, which will soon be available from nearly 20 system manufacturers and cloud service providers.

The GH200 superchip, with nearly 5 TB/s of memory bandwidth, delivered standout results on memory-intensive MLPerf tests such as the recommender-system benchmark. NVIDIA engineers also applied structured sparsity, a technique for reducing computational workload first introduced with the A100 Tensor Core GPU, achieving speedups of up to 33% on Llama 2 inference.

As LLMs continue to grow in scale, NVIDIA founder and CEO Jensen Huang announced at last week's GTC conference that the upcoming NVIDIA Blackwell architecture GPUs will deliver the higher performance required by trillion-parameter AI models.
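For context on what serving a model through TensorRT-LLM can look like, here is a minimal sketch assuming the high-level Python LLM API available in recent TensorRT-LLM releases; the model checkpoint and sampling settings are illustrative choices, not details from the announcement:

```python
# Minimal TensorRT-LLM inference sketch (illustrative; checkpoint and
# sampling settings are assumptions, not from the article).
from tensorrt_llm import LLM, SamplingParams

def main():
    # Builds (or loads a cached) TensorRT engine for the given checkpoint.
    llm = LLM(model="meta-llama/Llama-2-7b-hf")  # hypothetical model choice

    params = SamplingParams(max_tokens=64, temperature=0.8)
    outputs = llm.generate(["What does MLPerf measure?"], params)

    for out in outputs:
        print(out.outputs[0].text)

if __name__ == "__main__":
    main()
```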
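The structured sparsity mentioned above refers to the 2:4 pattern, in which two of every four contiguous weights are zeroed so that sparse Tensor Cores can skip the dropped operands. The toy magnitude-based pruning sketch below illustrates only the pattern itself; production pruning uses NVIDIA's tooling (such as TensorRT or the APEX ASP library), not a hand-rolled mask like this:

```python
# Toy illustration of the 2:4 structured sparsity pattern: zero the two
# smallest-magnitude weights in every contiguous group of four.
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in each group of 4 along the last dim."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the two largest-magnitude entries per group of four.
    keep = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(2, 8)
print(prune_2_to_4(w))  # exactly half the entries in each 4-wide group are zero
```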