Caching
Caching LLM responses can be useful for a variety of use cases. For example, if you’re creating a user-facing application, caching can help speed up responses and save money on common requests.
Or perhaps you’re building a complex multi-step LLM pipeline, where caching makes debugging much easier: you don’t need to send the same queries to the same LLMs over and over, burning both time and money.
Caching is currently only supported via the Python client (server-side caching coming soon). It can be turned on in the client constructor, like so:
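For example, something along these lines (a minimal sketch; the endpoint string is illustrative and the exact `cache` argument name is an assumption):

```python
import unify

# Assumed constructor flag: `cache=True` turns caching on for every request
# made by this client. The endpoint string is just an example.
client = unify.Unify("gpt-4o@openai", cache=True)

# The first call hits the LLM; an identical follow-up query is served from the cache.
response = client.generate(user_message="What is the capital of France?")
response_again = client.generate(user_message="What is the capital of France?")
```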
It can also be configured on a prompt-by-prompt basis:
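Again as a sketch, assuming the same `cache` flag is accepted per request:

```python
import unify

client = unify.Unify("gpt-4o@openai")

# Cache only this specific request (assumed per-request `cache` argument).
cached = client.generate(user_message="What is the capital of France?", cache=True)

# This request is not cached, and will always hit the LLM.
uncached = client.generate(user_message="Tell me a joke.")
```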
The examples above are very contrived, but with multi-step LLM applications this caching can be very helpful when debugging:
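A rough sketch of the idea (the function names and prompts here are purely illustrative):

```python
import unify

client = unify.Unify("gpt-4o@openai", cache=True)

def summarize(text: str) -> str:
    return client.generate(user_message=f"Summarize this in one sentence:\n{text}")

def extract_keywords(summary: str) -> str:
    return client.generate(user_message=f"List three keywords for:\n{summary}")

# While iterating on the keyword-extraction step, the summarization step is
# served from the cache, so each rerun is fast and avoids repeated LLM costs.
article = "LLM caching stores responses so identical queries are not re-sent."
summary = summarize(article)
keywords = extract_keywords(summary)
```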
By default, the `.cache.json` file will be created in your current working directory. The location can be configured by setting the environment variable `UNIFY_CACHE_DIR`.
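For example (assuming the variable is read when the client is constructed; the directory path is arbitrary):

```python
import os
import unify

# Point the cache at a custom directory (path is just an example).
os.environ["UNIFY_CACHE_DIR"] = "/tmp/unify_cache"

client = unify.Unify("gpt-4o@openai", cache=True)
```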
Deleting the `.cache.json` file will of course delete the entire cache, and queries will once again be made to the LLMs. You can also open up the `.cache.json` file in any text editor, and modify the cache on a query-by-query basis if needed.
It’s also worth noting that this cache implementation is very simple. For production-scale caching of thousands or millions of prompts, you should use server-side caching, which uses SQL under the hood (coming soon).
Thoughts + feedback always welcome on discord! 👾