Caching LLM responses is useful in a variety of situations. For example, if you’re building a user-facing application, it can speed up responses and save money on common requests.

Or perhaps you’re building a complex multi-step LLM pipeline, where caching makes debugging much easier: you avoid sending the same queries to the same LLMs over and over, burning both time and money.

Caching is currently only supported via the Python client (server-side caching coming soon). It can be turned on in the client constructor, like so:

import unify
client = unify.Unify("gpt-4o@openai", cache=True)
# queries LLM, saves query + response to .cache.json
client.generate("Hello.")
# does not query, loads from .cache.json
client.generate("Hello.")

It can also be configured on a prompt-by-prompt basis:

import unify
client = unify.Unify("gpt-4o@openai")
# queries LLM, does not save
client.generate("Hello.")
# queries LLM, saves to .cache.json
client.generate("Hello.", cache=True)
# queries LLM, neither saves to nor loads from cache
client.generate("Hello.")
# does not query, loads from .cache.json
client.generate("Hello.", cache=True)

The examples above are contrived, but in multi-step LLM applications this kind of caching can be very helpful when debugging:

import unify
from docs import pdf_text
from prompts import parse_doc, split, summarize
client = unify.Unify("gpt-4o@openai", cache=True)
# works fine
parsed = client.generate(
    pdf_text, system_message=parse_doc
)
# works fine
sections = client.generate(
    parsed, system_message=split
)
# debugging, trying different system messages each time
summaries = client.generate(
    sections, system_message=summarize
)

By default, the .cache.json file will be created in your current working directory. The location can be configured by setting the environment variable UNIFY_CACHE_DIR.
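For instance, here’s a minimal sketch of redirecting the cache from Python. The directory path is just an illustration, and it assumes the variable is read when the cache is first used, so it must be set before the first cached request:

import os
import unify
# point the cache at a custom directory (illustrative path only)
os.environ["UNIFY_CACHE_DIR"] = "/tmp/unify_cache"
client = unify.Unify("gpt-4o@openai", cache=True)
# .cache.json is now created under /tmp/unify_cache
client.generate("Hello.")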

Deleting the .cache.json file will of course delete the entire cache, and queries will once again be sent to the LLMs. You can also open the .cache.json file in any text editor and modify the cache on a query-by-query basis if needed.
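If you’d rather do this programmatically, something like the sketch below works. It assumes the cache is a flat JSON object keyed by the serialized request, which you should verify against your own .cache.json before relying on it:

import json
# load the cache and inspect which queries have been stored
with open(".cache.json", "r") as f:
    cache = json.load(f)
for key in list(cache)[:5]:
    print(key)
# remove a single stale entry so that query is sent to the LLM again
del cache[list(cache)[0]]
with open(".cache.json", "w") as f:
    json.dump(cache, f)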

It’s also worth noting that this cache implementation is very simple. For production-scale caching of thousands or millions of prompts, you should use server-side caching, which uses SQL under the hood (coming soon).

Thoughts + feedback always welcome on Discord! 👾