Snowflake Claims Breakthrough Technology Reduces AI Inference Time by Up to 50%

2025-01-17

Snowflake announced today that it is integrating into its managed large language models (LLMs) a new technology designed to significantly reduce the cost and time of AI inference, the process of using a trained model to generate outputs from new input data.

This technology, known as SwiftKV, was developed by Snowflake's AI research team and has been open-sourced. It improves inference efficiency by reusing information called hidden states from earlier layers of the LLM to build the key-value caches of later layers, avoiding redundant computation in those layers.
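The mechanism can be pictured with a short sketch. The Python code below is an illustrative approximation only, not Snowflake's implementation: the toy layer structure, the `num_full_layers` cutoff, and the `k_proj`/`v_proj` names are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not Snowflake's code) of the hidden-state-reuse idea:
# during prefill, only the early layers run in full; later layers' key/value
# caches are projected directly from the hidden states at the cutoff layer.

class ToyDecoderLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.k_proj = nn.Linear(d, d)   # key projection
        self.v_proj = nn.Linear(d, d)   # value projection
        self.mlp = nn.Linear(d, d)      # stand-in for the rest of the layer's work

    def forward(self, hidden):
        k, v = self.k_proj(hidden), self.v_proj(hidden)
        return self.mlp(hidden), k, v   # (new hidden states, keys, values)

def prefill_with_reuse(layers, hidden, num_full_layers):
    """Fill every layer's KV cache while running only the first `num_full_layers` in full."""
    kv_cache = []
    for layer in layers[:num_full_layers]:      # early layers: full computation
        hidden, k, v = layer(hidden)
        kv_cache.append((k, v))
    for layer in layers[num_full_layers:]:      # later layers: reuse `hidden` as-is and
        k = layer.k_proj(hidden)                # apply only their key/value projections,
        v = layer.v_proj(hidden)                # skipping the rest of their prefill work
        kv_cache.append((k, v))
    return kv_cache                             # decoding then proceeds as usual

layers = nn.ModuleList(ToyDecoderLayer(16) for _ in range(8))
cache = prefill_with_reuse(layers, torch.randn(4, 16), num_full_layers=4)
```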

Key-value caches act like memory shortcuts for language models, storing crucial information about input texts so that the model doesn't need to recalculate everything when generating or processing additional text. This makes the model faster and more efficient.
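To make that idea concrete, here is a toy Python sketch of a key-value cache; the matrices and dimensions are made up purely for illustration and have nothing to do with any particular model.

```python
import torch

# Toy illustration of a key-value cache: keys and values for tokens already
# processed are stored once and reused, so each new token only requires its own
# projections plus one attention pass over the cached history.

d = 8                                   # toy head dimension
W_q = torch.randn(d, d)
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

cached_k, cached_v = [], []             # grows by one entry per processed token

def step(x_new: torch.Tensor) -> torch.Tensor:
    """Process one new token embedding, reusing the cache for all prior tokens."""
    q = x_new @ W_q
    cached_k.append(x_new @ W_k)        # compute K/V for the new token only
    cached_v.append(x_new @ W_v)
    K = torch.stack(cached_k)           # (seq_len, d) — old tokens are never recomputed
    V = torch.stack(cached_v)
    attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V

for _ in range(5):                      # five decoding steps; the cache grows each time
    out = step(torch.randn(d))
```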

Snowflake claims that SwiftKV can boost LLM inference throughput by up to 50% and cut inference costs by up to 75% for the open-source Llama 3.3 70B and Llama 3.1 405B models compared to runs without SwiftKV.

The company initially integrated the technology with vLLM, a separate open-source system for end-to-end LLM inference and serving, and made it available for both Llama models. The same optimizations will be added to other model families accessible via Snowflake Cortex AI, a feature within Snowflake's Data Cloud platform that enables businesses to build, deploy, and scale AI and machine learning models directly within Snowflake. However, Snowflake did not specify a timeline for supporting other models.

Reducing Overhead

By eliminating redundant calculations, SwiftKV decreases memory usage and computational overhead, making decoding faster and more efficient, especially in real-time AI applications built on autoregressive tasks. These tasks generate one token at a time—a word or part of a word—based on the tokens generated so far. Such processes are common in chatbots, real-time translation, and text generation, where speed is critical.
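A minimal decoding loop shows what "one token at a time" means in practice. The snippet below uses the Hugging Face Transformers interface purely for illustration; the model name is a placeholder (any causal LM checkpoint works) and is not something Snowflake prescribes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Autoregressive decoding: each new token is sampled from the model's output
# distribution and appended to the context before the next step. Thanks to the
# past_key_values cache, only the newest token is fed through the model after
# the initial prompt (prefill) pass.

model_name = "meta-llama/Llama-3.1-8B"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

generated = tokenizer("Summarize this contract:", return_tensors="pt").input_ids
past = None

for _ in range(32):                             # generate up to 32 tokens
    inputs = generated if past is None else generated[:, -1:]
    out = model(inputs, past_key_values=past, use_cache=True)
    past = out.past_key_values                  # reuse cached keys/values next step
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
    generated = torch.cat([generated, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```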

The company noted that SwiftKV's performance gains stem from the observation that most computing resources are consumed during the input, or prompt, phase. Many business tasks involve long prompts and short answers, so much of the computational effort goes into interpreting the prompt. Snowflake shared a distribution chart on its engineering blog showing that typical customer workloads contain roughly ten times more input tokens than output tokens.
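A back-of-the-envelope calculation, using made-up numbers rather than Snowflake's measurements, shows why a roughly 10:1 input-to-output ratio makes the prompt phase dominate.

```python
# Illustrative arithmetic only (not Snowflake's figures): with ~10x more input
# tokens than output tokens, most per-request forward-pass work happens during
# prefill, so trimming prefill computation removes a large share of total compute.

input_tokens = 2000          # assumed long prompt
output_tokens = 200          # assumed short answer (10:1 ratio)

# Treat the per-token forward-pass cost as roughly equal for prefill and decode.
prefill_cost = input_tokens
decode_cost = output_tokens
total = prefill_cost + decode_cost

print(f"prefill share of compute: {prefill_cost / total:.0%}")    # ~91%

# If, hypothetically, half of the prefill computation can be skipped:
saved = 0.5 * prefill_cost
print(f"approximate total compute saved: {saved / total:.0%}")    # ~45%
```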

"SwiftKV does not differentiate between inputs and outputs," said Yuxiong He, Head of AI Research at Snowflake and Distinguished Software Engineer. "When SwiftKV is enabled, model rewiring occurs during both input processing and output generation. We achieve computational reductions specifically during input processing, also referred to as prefill computation."

SwiftKV saves time by reusing work that has already been done rather than repeating the same computations, roughly halving the redundant steps with minimal loss of accuracy. It also employs a technique called "self-distillation" to preserve the necessary information so that answer quality does not meaningfully degrade; in benchmark tests, Snowflake reported less than a 1% drop in accuracy.
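Self-distillation generally means training the modified model to match the output distribution of the original model on the same inputs. The sketch below illustrates that idea with a standard KL-divergence loss; it is an assumption-laden illustration with dummy tensors, not Snowflake's actual training recipe.

```python
import torch
import torch.nn.functional as F

# Sketch of self-distillation: the rewired (student) model is trained so its
# token distribution matches the original (teacher) model on the same inputs,
# which is how quality loss is kept small.

def self_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher and student token distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Usage sketch: teacher logits come from the unmodified model (no gradients),
# student logits from the rewired model on the same batch. Dummy data here.
teacher_logits = torch.randn(2, 16, 1000)                       # (batch, seq, vocab)
student_logits = torch.randn(2, 16, 1000, requires_grad=True)
loss = self_distillation_loss(student_logits, teacher_logits)
loss.backward()
```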

"The quality gap between them is very small," He stated. "However, if customers have specific concerns in this area, they can opt to use the base Llama models within Cortex AI."

Snowflake indicated that this technology achieves performance optimizations across various use cases. It increases throughput for unstructured text processing tasks such as summarization, translation, and sentiment analysis. In latency-sensitive scenarios like chatbots or AI assistants, SwiftKV reduces the time to first token generation—the time it takes for the model to produce and return the first segment of output—by up to 50%.
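Time to first token is straightforward to measure with any streaming interface: record when the request is sent and when the first chunk of output arrives. The snippet below is a generic illustration using a stand-in generator; it is not tied to Cortex AI's API.

```python
import time

# Generic way to measure time to first token (TTFT) against a streaming source:
# note the time the request starts, then the time the first streamed chunk arrives.

def measure_ttft(stream):
    """`stream` is any iterator that yields output chunks as they are generated."""
    start = time.perf_counter()
    first_chunk = next(stream)              # blocks until the first token arrives
    return time.perf_counter() - start, first_chunk

# Example with a hypothetical streaming generator standing in for a real model:
def fake_stream():
    time.sleep(0.25)                        # stands in for prefill + first decode step
    yield "The"
    yield " summary"

ttft, first = measure_ttft(fake_stream())
print(f"time to first token: {ttft * 1000:.0f} ms, first chunk: {first!r}")
```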