Google Introduces Implicit Caching to Lower Access Costs for Latest AI Models

Google is introducing a feature in its Gemini API that, according to the company, will reduce costs for third-party developers using its latest AI models. The feature, called “implicit caching,” is claimed to deliver up to 75% savings on “repeating context” sent to models via the Gemini API.

It is compatible with Google’s Gemini 2.5 Pro and 2.5 Flash models. This development is likely to be welcomed by developers, especially as the costs associated with using advanced models continue to rise.

Caching is a widely adopted strategy in the AI industry, allowing the reuse of commonly accessed or pre-computed data from models to minimize computational demands and expenses. For instance, caches can retain answers to frequently asked questions, thus sparing the model from repetitively generating responses to the same queries.
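To make the general idea concrete, a response cache can be as simple as a dictionary keyed by the exact prompt. The sketch below is a generic Python illustration, not Google’s implementation; the `call_model` placeholder is hypothetical.

```python
# A minimal, generic illustration of response caching -- not Google's
# implementation. Identical prompts are answered from memory instead of
# triggering a new, costly model call.
response_cache: dict[str, str] = {}

def call_model(prompt: str) -> str:
    """Placeholder for a real model API call (hypothetical)."""
    return f"(model answer to: {prompt})"

def answer(prompt: str) -> str:
    if prompt in response_cache:
        return response_cache[prompt]  # cache hit: no model call, no cost
    result = call_model(prompt)        # cache miss: pay for one model call
    response_cache[prompt] = result
    return result

answer("What are your opening hours?")  # miss: calls the model
answer("What are your opening hours?")  # hit: served from the cache
```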

Previously, Google offered explicit prompt caching, which required developers to identify their most frequently used prompts themselves. While savings were supposed to be guaranteed, explicit caching often involved substantial manual work.
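For comparison, explicit caching looks roughly like the sketch below, which is based on the google-genai Python SDK’s caches API. The model ID, TTL, system instruction, and the `LONG_PRODUCT_MANUAL` variable are illustrative assumptions, and parameter names may differ across SDK versions.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

LONG_PRODUCT_MANUAL = open("manual.txt").read()  # frequently reused context

# Explicitly create a cache for a prompt you expect to reuse often.
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a support agent for this product line.",
        contents=[LONG_PRODUCT_MANUAL],
        ttl="3600s",  # keep the cache alive for an hour
    ),
)

# Reference the cache explicitly on every subsequent request.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="How do I reset the device?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```

The manual steps visible here (creating the cache, tracking its name, passing it on each request) are exactly what implicit caching removes.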

Some developers expressed dissatisfaction with how Google’s explicit caching for Gemini 2.5 Pro operated, leading to unexpectedly high API bills. Complaints peaked last week, prompting the Gemini team to apologize and pledge improvements.

In contrast to explicit caching, implicit caching is automated and enabled by default for Gemini 2.5 models. It passes on cost savings when an API request to the model matches a cached request.

“When you send a request to one of the Gemini 2.5 models, if it shares the same prefix as previous requests, it qualifies for caching,” Google explained in its blog. “We’ll dynamically pass those savings onto you.”

The minimum prompt token count for implicit caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro, according to Google’s developer documentation. That is not a high bar, meaning it shouldn’t take much to trigger these automatic savings. Tokens are the raw bits of data models work with; 1,000 tokens is roughly equivalent to about 750 words.
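A developer could check whether a prompt clears those minimums before expecting a cache hit. The sketch below uses the google-genai SDK’s token-counting call; the helper name `meets_cache_minimum` is illustrative, not part of the SDK.

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Minimums for an implicit-cache hit, per Google's developer docs.
MIN_CACHE_TOKENS = {"gemini-2.5-flash": 1024, "gemini-2.5-pro": 2048}

def meets_cache_minimum(model: str, prompt: str) -> bool:
    """Check whether a prompt is long enough to qualify for implicit caching."""
    count = client.models.count_tokens(model=model, contents=prompt)
    return count.total_tokens >= MIN_CACHE_TOKENS[model]
```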

Given that Google’s previous cost-saving claims for caching fell short, there are several factors customers should consider regarding this new feature. For instance, Google advises developers to place repeating context at the beginning of requests to increase the chances of implicit caching hits, while context that may vary from request to request should be added at the end.
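In practice, that advice amounts to structuring prompts as a stable prefix plus a variable suffix. The following is a hedged sketch using the google-genai Python SDK; the `ask` helper, file name, and model ID are illustrative assumptions, not from the article.

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Large context that stays identical across requests (file name illustrative).
SHARED_CONTEXT = open("product_manual.txt").read()

def ask(question: str) -> str:
    # Stable context goes first so consecutive requests share the longest
    # possible prefix; only the final question varies between calls.
    prompt = f"{SHARED_CONTEXT}\n\nQuestion: {question}"
    response = client.models.generate_content(
        model="gemini-2.5-flash", contents=prompt
    )
    return response.text
```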

Additionally, Google has not provided any independent validation that the new implicit caching system will deliver the promised automatic savings, so early adopters will need to share their experiences.

[Source](https://techcrunch.com/2025/05/08/google-launches-implicit-caching-to-make-accessing-its-latest-ai-models-cheaper/)