There is no simple speedometer that can directly measure the speed of a generative AI model, but the mainstream method is to measure how many tokens the model can process per second.
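As a rough illustration, here is a minimal sketch of how such a measurement might be taken against any callable text generator. The whitespace token count and the stand-in generator are simplifications of my own: real benchmarks count tokens with the model's own tokenizer and typically report time-to-first-token separately from output throughput.

```python
import time

def measure_tokens_per_second(generate, prompt: str) -> float:
    """Time one generation call and report output tokens per second.

    `generate` is any callable that takes a prompt and returns generated
    text. Real benchmarks count tokens with the model's own tokenizer;
    whitespace splitting is used here only to keep the sketch self-contained.
    """
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(output.split()) / elapsed  # crude token count / seconds

# Stand-in generator for demonstration; swap in a real API client to benchmark.
fake_generate = lambda p: "word " * 500
print(f"{measure_tokens_per_second(fake_generate, 'Hello'):.0f} tokens/sec")
```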
Recently, SambaNova Systems announced a new generative AI performance milestone: serving Meta's Llama 3 8B instruct model, its system reached an impressive 1000 tokens per second. Groq previously held the fastest Llama 3 benchmark score at 800 tokens per second. The 1000 tokens per second milestone has been independently verified by benchmarking firm Artificial Analysis. For businesses, the speed gain has multiple impacts, including faster response times, higher hardware utilization, and lower costs, which could translate into significant commercial benefits.
"We have witnessed an acceleration in the AI chip race that far exceeds the expectations of most people, and we are pleased to validate SambaNova's claims in independent benchmark tests that focus on real-world application performance," said George Cameron, co-founder of Artificial Analysis, to VentureBeat. "Now, AI developers have more hardware choices, which is particularly exciting for use cases that have strict speed requirements, such as AI agents and consumer AI applications that require low-latency response and document parsing."
How SambaNova combines hardware and software acceleration for Llama 3 and generative AI
SambaNova is a provider focused on enterprise-level generative AI with strong hardware and software capabilities.
In terms of hardware, the company has developed an AI chip called the Reconfigurable Dataflow Unit (RDU). Like Nvidia's AI accelerators, the RDU can be used for both model training and inference, but SambaNova optimizes it specifically for enterprise workloads and model fine-tuning. The company's latest chip is the SN40L, announced in September 2023.
On top of its silicon, SambaNova has built its own software stack, which includes the Samba-1 model, first released on February 28. Samba-1 is a 1-trillion-parameter model, also known as Samba-CoE (Composition of Experts). This approach lets businesses flexibly combine multiple models or use them individually, and fine-tune and train the models on enterprise data.
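Samba-CoE's internals are proprietary, but the general composition-of-experts pattern can be sketched as a router that forwards each request to the most relevant specialist model. In the sketch below, the expert names and the keyword heuristic are purely hypothetical; a production system would use a learned router rather than keyword matching.

```python
from typing import Callable

# Illustrative composition-of-experts routing: a router picks one
# specialist model per request instead of running a single monolith.
# The expert names and keyword heuristic are hypothetical.
EXPERTS: dict[str, Callable[[str], str]] = {
    "code": lambda p: f"[code expert] {p}",
    "legal": lambda p: f"[legal expert] {p}",
    "general": lambda p: f"[general expert] {p}",
}

def route(prompt: str) -> str:
    """Dispatch the prompt to one expert via a trivial keyword heuristic."""
    if "def " in prompt or "function" in prompt:
        return EXPERTS["code"](prompt)
    if "contract" in prompt or "clause" in prompt:
        return EXPERTS["legal"](prompt)
    return EXPERTS["general"](prompt)

print(route("Review this contract clause for risk."))
```

Part of the appeal of such a composition is that each request exercises only the routed expert's weights rather than the full parameter count.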
For the 1000 tokens per second result, SambaNova actually used its Samba-1 Turbo model, the API version on which the test was run; the company plans to roll this speed update into its mainstream enterprise models in the coming weeks. Cameron cautioned that Groq's 800 tokens per second figure was measured on its public, shared API endpoint, while SambaNova's was measured on a dedicated private endpoint, so he suggests the two numbers should not be compared directly.
Reconfigurable Dataflow Architecture enables iterative optimization
The key to SambaNova's performance lies in its Reconfigurable Dataflow Architecture, which is the core of the company's RDU silicon technology.
The Reconfigurable Dataflow Architecture allows SambaNova to optimize resource allocation for individual neural network layers and kernels through compiler mapping.
When Llama 3 was first released, the team of SambaNova CEO Rodrigo Liang ran the model and achieved an initial 330 tokens per second on Samba-1. Through a series of optimizations over the following months, Liang explained, that speed roughly tripled to the current high of 1000 tokens per second. He described optimization as a balancing act: allocating resources across kernels, avoiding bottlenecks, and maximizing the throughput of the entire neural network pipeline. The same fundamental approach underpins how SambaNova helps businesses optimize their fine-tuned models as part of its software stack.
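The balancing act Liang describes can be reduced to a back-of-the-envelope model: in a pipelined dataflow, end-to-end throughput is capped by the slowest stage, so shifting resources toward the bottleneck kernel lifts the whole pipeline. The stage names and numbers below are invented purely for illustration.

```python
# Toy model of pipeline balancing: a pipelined dataflow runs only as
# fast as its slowest stage. All figures are invented for illustration.
stages = {"attention": 1400, "mlp": 1100, "kv_cache": 950}  # tokens/sec

bottleneck = min(stages, key=stages.get)
print(f"Throughput: {stages[bottleneck]} tokens/sec (bottleneck: {bottleneck})")

# Reallocating resources to the bottleneck stage raises the floor.
stages["kv_cache"] = 1200
print(f"After rebalancing: {min(stages.values())} tokens/sec")
```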
Faster speeds with enterprise-level quality and customization
Liang emphasized that SambaNova reached its speed milestone at 16-bit precision, which delivers the level of quality that enterprises require. Dropping to 8-bit precision, he pointed out, is not a good trade-off for enterprise users.
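The quality trade-off can be seen in miniature by round-tripping the same weights through 16-bit and a naive 8-bit scheme and comparing the error. This is a generic illustration of precision loss, not a description of SambaNova's numerics.

```python
import numpy as np

# Generic illustration of precision loss (not SambaNova's numerics):
# round-trip identical weights through fp16 and a naive symmetric int8.
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=100_000).astype(np.float32)

fp16_err = np.abs(weights - weights.astype(np.float16).astype(np.float32))

scale = np.abs(weights).max() / 127  # one scale for the whole tensor
int8 = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
int8_err = np.abs(weights - int8.astype(np.float32) * scale)

print(f"fp16 mean abs error: {fp16_err.mean():.2e}")
print(f"int8 mean abs error: {int8_err.mean():.2e}")  # notably larger here
```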
Speed is particularly important to enterprise users for several reasons. As organizations increasingly shift toward agent-based AI workflows, where the output of one model flows into another, per-call speed compounds and becomes more critical than ever. Faster generation also carries a direct economic incentive: producing more tokens per second on the same hardware lowers the cost per token.
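Simple arithmetic shows why speed compounds in agentic pipelines: when a workflow chains several model calls, per-call generation time multiplies. The call counts and token figures below are illustrative, not measured.

```python
# Illustrative arithmetic: in an agentic workflow, generation time
# compounds across chained model calls. All figures are made up.
calls_per_task = 5       # chained model invocations per user request
tokens_per_call = 400    # average output tokens per call

for tps in (100, 800, 1000):  # serving speed in tokens/sec
    total = calls_per_task * tokens_per_call / tps
    print(f"{tps:>5} tokens/sec -> {total:.1f} s of end-to-end generation")
```

At 1000 tokens per second, the same five-step task finishes in a tenth of the time it would take at 100, the kind of gap that decides whether an interactive agent feels usable at all.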