Caching
Caching LLM responses is useful in a variety of situations. For example, if you're building a user-facing application, it can speed up responses and save money on common requests.
As another example, if you're building a complex multi-step LLM pipeline, caching makes debugging much easier: you avoid sending the same queries to the same LLMs over and over again, burning both time and money.
Caching is currently only supported via the Python client; it is not yet supported behind the API. If API-side caching would be useful to you, feel free to hop on a call or let us know on Discord!
Caching can be turned on in the client constructor, like so:
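(A minimal sketch; the cache flag, the generate method signature, and the gpt-4o@openai endpoint are assumptions, so the exact names may differ in your client version.)

```python
import unify

# Enable caching for every request made through this client
# (assumed cache flag on the constructor).
client = unify.Unify("gpt-4o@openai", cache=True)

# The first call is sent to the LLM; the second identical call
# is served from the local cache.
print(client.generate("What is the capital of Spain?"))
print(client.generate("What is the capital of Spain?"))
```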
It can also be configured on a prompt-by-prompt basis:
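(Again a sketch, assuming the same cache flag is accepted per request; the prompts are just illustrative.)

```python
import unify

client = unify.Unify("gpt-4o@openai")

# Only this request is cached; subsequent identical calls with
# cache=True are replayed from the cache.
cached = client.generate("Summarise the plot of Hamlet.", cache=True)

# This request goes straight to the LLM every time.
uncached = client.generate("Give me a random number.")
```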
The examples above are very contrived, but in multi-step LLM applications this kind of caching can be very helpful when debugging:
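(An illustrative sketch of a two-step pipeline; the helper names, prompts and cache flag are hypothetical, not part of the client API.)

```python
import unify

client = unify.Unify("gpt-4o@openai", cache=True)  # assumed cache flag

def extract_claims(document: str) -> str:
    # Step 1: pull the factual claims out of the document.
    return client.generate(f"List the factual claims in:\n{document}")

def check_claims(claims: str) -> str:
    # Step 2: assess each claim produced by step 1.
    return client.generate(f"For each claim, say whether it is plausible:\n{claims}")

document = "The Eiffel Tower was completed in 1889 and is 330 metres tall."

# While iterating on check_claims, the extract_claims responses are
# replayed from the cache, so each debug run is fast and free.
claims = extract_claims(document)
print(check_claims(claims))
```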
By default, the .cache.json file will be created in your current working directory. The location can be configured by setting the UNIFY_CACHE_DIR environment variable.
Deleting the .cache.json file will of course delete the entire cache, and queries will once again be sent to the LLMs. You can also open the .cache.json file in any text editor and modify the cache on a query-by-query basis if needed.
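For instance, here is a sketch of pointing the cache at a custom directory and inspecting its contents (the directory path is hypothetical, and the exact JSON layout of the cache file is an implementation detail that may differ between client versions):

```python
import json
import os

# Point the cache at a custom directory. This must be set before the
# client creates the cache file.
os.makedirs("/tmp/unify_cache", exist_ok=True)
os.environ["UNIFY_CACHE_DIR"] = "/tmp/unify_cache"

# ... run some cached queries here ...

# Inspect the cache file directly to see which queries have been stored.
cache_path = os.path.join(os.environ["UNIFY_CACHE_DIR"], ".cache.json")
with open(cache_path) as f:
    cache = json.load(f)
for key in cache:
    print(key)
```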
It’s also worth noting that this cache implementation is very simple. For production-scale caching of thousands or millions of prompts, a more efficient caching mechanism would be needed. If you need more sophisticated caching for your application, then Get in touch or ping us on Discord! 👾