Anthropic has introduced a prompt caching feature in its API that lets developers reuse contextual information across API calls instead of resending it with every request. The feature is currently available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, while support for the more powerful Claude 3 Opus model is still in the works.
A 2023 paper describes the technique behind prompt caching in detail: frequently needed background information is retained and reused within a session rather than reprocessed from scratch. Because the model can recall these cached prompts, users can supply rich background context without paying full input-token costs on every call. This is particularly useful when a large amount of contextual information must be included in a prompt and then referenced repeatedly across dialogue turns, giving developers and other users more flexibility in shaping model responses.
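As an illustration, the sketch below shows how a large, stable block of context might be marked for caching, assuming Anthropic's Python SDK; the model name, beta header, and `cache_control` field follow the public beta documentation at launch and may change, and the knowledge-base text is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LARGE_KNOWLEDGE_BASE = "..."  # placeholder: the large, reusable context to cache

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # Opt in to the public beta of prompt caching.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text", "text": "You answer questions about the product documentation."},
        {
            "type": "text",
            "text": LARGE_KNOWLEDGE_BASE,
            # Mark this block so it is written to (and later read from) the cache.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.content[0].text)
```

Subsequent calls that repeat the same cached blocks read them back at the reduced token rate rather than reprocessing them in full.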
Anthropic says early adopters have seen significant speed improvements and cost savings across a range of scenarios, whether embedding a complete knowledge base, hundreds of examples, or every turn of a conversation directly in the prompt.
On pricing, prompt caching shows a clear economic advantage: Anthropic notes that reading cached prompts costs far less than paying for base input tokens. For Claude 3.5 Sonnet, writing a prompt to the cache costs $3.75 per million tokens (MTok), while reading it back from the cache costs only $0.30 per million tokens. Against the base input price of $3 per million tokens, that means paying a small premium up front can yield up to a 10x saving on subsequent use.
For Claude 3 Haiku, writing to the cache costs $0.30 per million tokens and reading from it costs as little as $0.03 per million tokens. Claude 3 Opus does not yet support the feature, but Anthropic has already announced its pricing: $18.75 per million tokens to write prompts to the cache and $1.50 per million tokens to read them back.
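To make the arithmetic concrete, here is an illustrative calculation under an assumed scenario (a 100,000-token context reused across 50 Claude 3.5 Sonnet requests), using the prices quoted above.

```python
# Assumed scenario for illustration: a 100,000-token context reused in 50 requests.
CONTEXT_TOKENS = 100_000
REQUESTS = 50

BASE_INPUT = 3.00 / 1_000_000    # $ per base input token (Claude 3.5 Sonnet)
CACHE_WRITE = 3.75 / 1_000_000   # $ per token written to the cache
CACHE_READ = 0.30 / 1_000_000    # $ per token read back from the cache

# Without caching, the full context is billed at the base rate on every request.
without_cache = CONTEXT_TOKENS * REQUESTS * BASE_INPUT
# With caching, the context is written once, then read from the cache 49 times.
with_cache = CONTEXT_TOKENS * (CACHE_WRITE + (REQUESTS - 1) * CACHE_READ)

print(f"without caching: ${without_cache:.2f}")  # $15.00
print(f"with caching:    ${with_cache:.2f}")     # roughly $1.85; each cached read is 10x cheaper
```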
One caveat, as AI industry expert Simon Willison pointed out on social media, is that Anthropic's cache only lives for 5 minutes, with the lifetime refreshed each time the cached content is used.
This is not the first time Anthropic has competed on price. Before releasing the Claude 3 series, the company had already cut token prices, and it is now locked in fierce competition with rivals such as Google and OpenAI to offer low-cost options to third-party developers.
Prompt caching is not unique to Anthropic. Lamina, a large language model inference system, for example, uses key-value (KV) caching to reduce GPU costs, and the OpenAI developer community has also discussed ways to cache prompts. It is important, though, not to confuse prompt caching with the built-in memory features of large language models: while models like OpenAI's GPT-4, through ChatGPT's memory feature, can remember user preferences or details, they do not directly store the history of prompts and responses, which is a fundamentally different concept from prompt caching.
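For intuition about the difference, the toy sketch below (not Lamina's actual code, and greatly simplified) illustrates the idea behind KV/prefix caching: the expensive work of processing a shared prompt prefix is done once and reused, while only the new suffix is processed per request; a memory feature, by contrast, stores distilled facts rather than reusable computation.

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def process_prefix(prefix: str) -> list[int]:
    """Stand-in for the costly forward pass that builds attention
    key/value state for the prefix on a real model."""
    return [hash(token) for token in prefix.split()]

def run_request(prefix: str, suffix: str) -> int:
    cached_state = process_prefix(prefix)           # cache hit if the prefix is unchanged
    new_state = [hash(token) for token in suffix.split()]
    return len(cached_state) + len(new_state)       # total context actually processed

knowledge_base = "LARGE STABLE SYSTEM PROMPT SHARED BY EVERY REQUEST"  # placeholder
run_request(knowledge_base, "How do I reset my password?")
run_request(knowledge_base, "What is the refund policy?")  # prefix work is reused, not recomputed
```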