Philosophy
Unity tests use real LLM calls, never mocked. Responses are cached after the first run, so subsequent runs replay instantly. This gives you the confidence of real inference with the speed of unit tests.
Quick start
Symbolic vs. eval tests
Tests fall on a spectrum between two paradigms:

Symbolic tests use the LLM as a stub. The LLM receives minimal instructions designed to trigger specific code paths. Focus is on infrastructure: async tool loops, steering, state mutations. The LLM’s “intelligence” is irrelevant. Failures indicate regressions in programmatic logic.

Eval tests exercise the system end-to-end. We ask a question or give a directive, then verify the outcome. Internal tool calls don’t matter. Failures may indicate prompt issues, tool design problems, or capability gaps.

Most tests sit somewhere between these extremes. Mark eval-heavy files with:
Caching
When UNILLM_CACHE="true" (the default), all LLM responses are cached per unique input:
- Cache key = exact LLM input (prompts, tools, parameters)
- Cache hit = identical input seen before → cached response replayed
- Cache miss = new input → real LLM call, response stored
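
The key/hit/miss rules above can be sketched in a few lines of Python. This is a minimal illustration, not the actual cache implementation: the ResponseCache class, its make_key and complete methods, and the call_llm callable are all hypothetical names introduced here, and the in-memory dict is an assumption (a real cache would need to persist between runs). The stand-in function in the usage lines exists only to make the sketch runnable without network access.

```python
import hashlib
import json


class ResponseCache:
    """Hypothetical in-memory cache keyed on the exact LLM input."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(prompts, tools, parameters):
        # Serialize the full input deterministically, then hash it.
        # Any change to prompts, tools, or parameters yields a new key.
        payload = json.dumps(
            {"prompts": prompts, "tools": tools, "parameters": parameters},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, call_llm, prompts, tools, parameters):
        key = self.make_key(prompts, tools, parameters)
        if key in self._store:
            # Cache hit: identical input seen before, replay the response.
            self.hits += 1
            return self._store[key]
        # Cache miss: new input, make the call and store the response.
        self.misses += 1
        response = call_llm(prompts, tools, parameters)
        self._store[key] = response
        return response


if __name__ == "__main__":
    # Stand-in for a real LLM call, used only to demonstrate the flow.
    def call_llm(prompts, tools, parameters):
        return "response to " + prompts[-1]

    cache = ResponseCache()
    cache.complete(call_llm, ["What is 2 + 2?"], [], {"temperature": 0})
    cache.complete(call_llm, ["What is 2 + 2?"], [], {"temperature": 0})
    print(cache.hits, cache.misses)
```

In this sketch the second identical call is served from the cache, while changing even one character of the prompt produces a different key and therefore a miss. Serializing with sort_keys=True is a design choice of the sketch so that dictionary ordering alone does not cause spurious misses.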
