Apple and NVIDIA Collaborate to Enhance Text Generation Performance of Large Language Models

2024-12-19

In a blog post published today, Apple's machine learning researchers detailed their collaboration with NVIDIA to further accelerate text generation with large language models (LLMs).

Earlier this year, Apple introduced and open-sourced Recurrent Drafter (ReDrafter), a speculative decoding technique for LLM text generation that the company reports achieves state-of-the-art speedups. ReDrafter uses a small recurrent draft model to propose candidate tokens ahead of the main model, combining beam search, which explores multiple candidate continuations in parallel, with dynamic tree attention, which merges the candidates' shared prefixes so they can be verified efficiently.
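To make the core idea concrete, here is a minimal, self-contained sketch of draft-then-verify speculative decoding under greedy decoding. It is not Apple's implementation: the toy `target_next` and `draft_next` functions stand in for the large target model and the small recurrent draft head, and the beam-search and tree-attention machinery is omitted. The key property it illustrates is that the output matches plain greedy decoding exactly, while several tokens can be confirmed per target-model verification pass.

```python
import random

random.seed(0)

def target_next(prefix):
    """Toy stand-in for the large target model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def draft_next(prefix):
    """Toy stand-in for the small draft head: usually agrees with the target."""
    guess = target_next(prefix)
    return guess if random.random() < 0.8 else (guess + 1) % 100

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then verify them against the target.

    Under greedy decoding, a drafted token is accepted only if the target
    would have produced the same token; the first mismatch is replaced by
    the target's own token, so the output is identical to pure greedy
    decoding, but up to k+1 tokens are settled per verification step.
    """
    # Drafting phase: propose k tokens with the cheap draft model.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # Verification phase: accept the longest prefix the target agrees with.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # target's correction ends the step
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when all drafts match
    return accepted

out = speculative_step([1, 2, 3])
print(out)  # between 1 and 5 tokens per single target verification step
```

In a real system the verification of all drafted candidates happens in one batched forward pass of the target model, which is where the latency win comes from.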

Although Apple's research has already demonstrated the power of ReDrafter, the company has not stopped there. Instead, it has partnered with NVIDIA to bring the technique into production. As part of this collaboration, ReDrafter has been integrated into NVIDIA's TensorRT-LLM, a framework for accelerating LLM inference on NVIDIA GPUs.

The collaboration has yielded impressive results. To integrate ReDrafter, NVIDIA added new operators to TensorRT-LLM and optimized existing ones, significantly improving the framework's ability to handle complex models and decoding methods. In benchmarks on production models with billions of parameters, running TensorRT-LLM with ReDrafter on NVIDIA GPUs, the teams measured a 2.7x increase in tokens generated per second under greedy decoding. These results highlight the technique's potential to significantly reduce user-perceived latency while using fewer GPUs and less power.
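Where a speedup like this comes from can be sketched with back-of-the-envelope arithmetic: the baseline pays one target-model forward pass per token, while speculative decoding pays one verification pass plus several cheap draft passes per step but emits multiple tokens. The numbers below are illustrative assumptions, not Apple's or NVIDIA's measurements.

```python
def expected_speedup(mean_accepted, draft_cost_ratio, k):
    """Rough wall-clock speedup of speculative vs. plain autoregressive decoding.

    mean_accepted:    average tokens emitted per verification step
                      (accepted drafts plus the target's own token)
    draft_cost_ratio: cost of one draft forward pass relative to one
                      target forward pass (assumed small)
    k:                tokens drafted per step

    Baseline cost is one target pass per token; speculative cost per step
    is one target verification pass plus k draft passes.
    """
    cost_per_step = 1.0 + k * draft_cost_ratio
    return mean_accepted / cost_per_step

# Hypothetical numbers: 3.5 tokens settled per step, draft head at ~5%
# of target cost, 4 drafted tokens per step.
print(round(expected_speedup(3.5, 0.05, 4), 2))  # → 2.92
```

This ignores kernel-launch and memory-bandwidth effects, which is precisely why the operator-level work in TensorRT-LLM matters for realizing the speedup in practice.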

Apple's machine learning researchers concluded, "As LLMs become more prevalent in production applications, improving inference efficiency is crucial for reducing computational costs and minimizing user latency. By integrating ReDrafter's novel speculative decoding method into the NVIDIA TensorRT-LLM framework, developers can now achieve faster token generation speeds for their production LLM applications on NVIDIA GPUs."