Snowflake Claims Breakthrough Technology Reduces AI Inference Time by Up to 50%

2025-01-17

Snowflake announced today that it is integrating into its managed large language models (LLMs) a new technology designed to significantly reduce the cost and time of AI inference, the process of using a trained model to generate outputs from new input data.

This technology, known as SwiftKV, was developed by Snowflake's AI research team and has been open-sourced. It improves inference efficiency by reusing information called hidden states from earlier layers of the LLM to build the key-value caches of later layers, avoiding redundant computation in those layers.
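The mechanism can be pictured with a short sketch. The Python code below is an illustrative approximation only, not Snowflake's implementation: the toy layer structure, the `num_full_layers` cutoff, and the `k_proj`/`v_proj` names are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not Snowflake's code) of the hidden-state-reuse idea:
# during prefill, only the early layers run in full; later layers' key/value
# caches are projected directly from the hidden states at the cutoff layer.

class ToyDecoderLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.k_proj = nn.Linear(d, d)   # key projection
        self.v_proj = nn.Linear(d, d)   # value projection
        self.mlp = nn.Linear(d, d)      # stand-in for the rest of the layer's work

    def forward(self, hidden):
        k, v = self.k_proj(hidden), self.v_proj(hidden)
        return self.mlp(hidden), k, v   # (new hidden states, keys, values)

def prefill_with_reuse(layers, hidden, num_full_layers):
    """Fill every layer's KV cache while running only the first `num_full_layers` in full."""
    kv_cache = []
    for layer in layers[:num_full_layers]:      # early layers: full computation
        hidden, k, v = layer(hidden)
        kv_cache.append((k, v))
    for layer in layers[num_full_layers:]:      # later layers: reuse `hidden` as-is and
        k = layer.k_proj(hidden)                # apply only their key/value projections,
        v = layer.v_proj(hidden)                # skipping the rest of their prefill work
        kv_cache.append((k, v))
    return kv_cache                             # decoding then proceeds as usual

layers = nn.ModuleList(ToyDecoderLayer(16) for _ in range(8))
cache = prefill_with_reuse(layers, torch.randn(4, 16), num_full_layers=4)
```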

Key-value caches act like memory shortcuts for language models, storing crucial information about input texts so that the model doesn't need to recalculate everything when generating or processing additional text. This makes the model faster and more efficient.
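To make that idea concrete, here is a toy Python sketch of a key-value cache; the matrices and dimensions are made up purely for illustration and have nothing to do with any particular model.

```python
import torch

# Toy illustration of a key-value cache: keys and values for tokens already
# processed are stored once and reused, so each new token only requires its own
# projections plus one attention pass over the cached history.

d = 8                                   # toy head dimension
W_q = torch.randn(d, d)
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

cached_k, cached_v = [], []             # grows by one entry per processed token

def step(x_new: torch.Tensor) -> torch.Tensor:
    """Process one new token embedding, reusing the cache for all prior tokens."""
    q = x_new @ W_q
    cached_k.append(x_new @ W_k)        # compute K/V for the new token only
    cached_v.append(x_new @ W_v)
    K = torch.stack(cached_k)           # (seq_len, d) — old tokens are never recomputed
    V = torch.stack(cached_v)
    attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V

for _ in range(5):                      # five decoding steps; the cache grows each time
    out = step(torch.randn(d))
```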

Snowflake claims that SwiftKV can boost LLM inference throughput by up to 50% and cut inference costs by up to 75% for the open-source Llama 3.3 70B and Llama 3.1 405B models compared to runs without SwiftKV.

The company initially integrated the technology with vLLM, a separate open-source system for end-to-end LLM inference and serving, and made it available for both Llama models. The same optimizations will be added to other model families accessible via Snowflake Cortex AI, a feature within Snowflake's Data Cloud platform that enables businesses to build, deploy, and scale AI and machine learning models directly within Snowflake. However, Snowflake did not specify a timeline for supporting other models.

Reducing Overhead

By eliminating redundant calculations, SwiftKV decreases memory usage and computational overhead, making decoding faster and more efficient, especially in real-time AI applications built on autoregressive tasks. These tasks generate one token at a time—a word or part of a word—based on the tokens generated so far. Such processes are common in chatbots, real-time translation, and text generation, where speed is critical.
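A minimal decoding loop shows what "one token at a time" means in practice. The snippet below uses the Hugging Face Transformers interface purely for illustration; the model name is a placeholder (any causal LM checkpoint works) and is not something Snowflake prescribes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Autoregressive decoding: each new token is sampled from the model's output
# distribution and appended to the context before the next step. Thanks to the
# past_key_values cache, only the newest token is fed through the model after
# the initial prompt (prefill) pass.

model_name = "meta-llama/Llama-3.1-8B"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

generated = tokenizer("Summarize this contract:", return_tensors="pt").input_ids
past = None

for _ in range(32):                             # generate up to 32 tokens
    inputs = generated if past is None else generated[:, -1:]
    out = model(inputs, past_key_values=past, use_cache=True)
    past = out.past_key_values                  # reuse cached keys/values next step
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
    generated = torch.cat([generated, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```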

The company noted that SwiftKV's performance gains stem from the observation that most computing resources are consumed during the input, or prompt, phase. Many business tasks involve long prompts and short answers, so much of the computational effort goes into interpreting the prompt. Snowflake shared a distribution chart on its engineering blog showing that typical customer workloads contain roughly ten times more input tokens than output tokens.
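A back-of-the-envelope calculation, using made-up numbers rather than Snowflake's measurements, shows why a roughly 10:1 input-to-output ratio makes the prompt phase dominate.

```python
# Illustrative arithmetic only (not Snowflake's figures): with ~10x more input
# tokens than output tokens, most per-request forward-pass work happens during
# prefill, so trimming prefill computation removes a large share of total compute.

input_tokens = 2000          # assumed long prompt
output_tokens = 200          # assumed short answer (10:1 ratio)

# Treat the per-token forward-pass cost as roughly equal for prefill and decode.
prefill_cost = input_tokens
decode_cost = output_tokens
total = prefill_cost + decode_cost

print(f"prefill share of compute: {prefill_cost / total:.0%}")    # ~91%

# If, hypothetically, half of the prefill computation can be skipped:
saved = 0.5 * prefill_cost
print(f"approximate total compute saved: {saved / total:.0%}")    # ~45%
```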

"SwiftKV does not differentiate between inputs and outputs," said Yuxiong He, Head of AI Research at Snowflake and Distinguished Software Engineer. "When SwiftKV is enabled, model rewiring occurs during both input processing and output generation. We achieve computational reductions specifically during input processing, also referred to as prefill computation."

SwiftKV saves time by reusing work that has already been done rather than repeating the same computations, roughly halving the redundant steps with minimal loss of accuracy. It also employs a technique called "self-distillation" to preserve the necessary information so that answer quality does not meaningfully degrade; in benchmark tests, Snowflake reported less than a 1% drop in accuracy.
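Self-distillation generally means training the modified model to match the output distribution of the original model on the same inputs. The sketch below illustrates that idea with a standard KL-divergence loss; it is an assumption-laden illustration with dummy tensors, not Snowflake's actual training recipe.

```python
import torch
import torch.nn.functional as F

# Sketch of self-distillation: the rewired (student) model is trained so its
# token distribution matches the original (teacher) model on the same inputs,
# which is how quality loss is kept small.

def self_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher and student token distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Usage sketch: teacher logits come from the unmodified model (no gradients),
# student logits from the rewired model on the same batch. Dummy data here.
teacher_logits = torch.randn(2, 16, 1000)                       # (batch, seq, vocab)
student_logits = torch.randn(2, 16, 1000, requires_grad=True)
loss = self_distillation_loss(student_logits, teacher_logits)
loss.backward()
```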

"The quality gap between them is very small," He stated. "However, if customers have specific concerns in this area, they can opt to use the base Llama models within Cortex AI."

Snowflake indicated that this technology achieves performance optimizations across various use cases. It increases throughput for unstructured text processing tasks such as summarization, translation, and sentiment analysis. In latency-sensitive scenarios like chatbots or AI assistants, SwiftKV reduces the time to first token generation—the time it takes for the model to produce and return the first segment of output—by up to 50%.
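Time to first token is straightforward to measure with any streaming interface: record when the request is sent and when the first chunk of output arrives. The snippet below is a generic illustration using a stand-in generator; it is not tied to Cortex AI's API.

```python
import time

# Generic way to measure time to first token (TTFT) against a streaming source:
# note the time the request starts, then the time the first streamed chunk arrives.

def measure_ttft(stream):
    """`stream` is any iterator that yields output chunks as they are generated."""
    start = time.perf_counter()
    first_chunk = next(stream)              # blocks until the first token arrives
    return time.perf_counter() - start, first_chunk

# Example with a hypothetical streaming generator standing in for a real model:
def fake_stream():
    time.sleep(0.25)                        # stands in for prefill + first decode step
    yield "The"
    yield " summary"

ttft, first = measure_ttft(fake_stream())
print(f"time to first token: {ttft * 1000:.0f} ms, first chunk: {first!r}")
```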