So, let’s say we want to build an AI-powered teaching assistant, specifically designed to help students with their maths problems.

Great, so… where do we start? Should we just take an off-the-shelf LLM and throw it into production? Probably not, right?

Should we create an evaluation set with 1000s of hypothetical failure modes before putting it in front of any users at all? Also probably not, right?

Let’s refer back to our “Data Flywheel” on the previous page:

  1. Add unit tests (evals) ❓
  2. See if they pass ✅
  3. Iterate on system message, in-context example, available tools etc. until they do 🔁
  4. Beta test with users and find failure modes from production traffic 🚦
  5. Convert these into unit tests 🗂️
  6. Repeat! 🔝

So, the first step is to add unit tests. While it might feel a bit early to be adding unit tests, how else are you going to express what you want the LLM to actually do?

In the spirit of test-driven development, adding unit tests as step 1 is a good way to define the bare-minimum requirements we expect our teaching assistant to handle. We can then create a bare-minimum solution, and start the data flywheel spinning! 🎡

Iteration 1

Step 1

So, what are some good unit tests to begin with? How about the following?

qs = ["3 - 1", "4 + 7", "6 + 2", "9 - 3", "7 + 9"]

Seems pretty reasonable. Now we need an LLM solution to the problem. Let’s just use a simple zero-shot LLM to begin with:

import unify
client = unify.Unify("gpt-4o@openai", cache=True)

We’ve also added caching such that repeated queries of the same input to the same model read from the cache, rather than re-querying the model. This enables us to keep the LLM calls in place, and continually re-run the code without worrying about accumulating costs and wait times.
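To make the caching idea concrete, here is a minimal sketch of the same principle using only the standard library. It is not how the `unify` client implements its cache; `fake_llm` and `cached_generate` are hypothetical names made up for this illustration, and the "model" simply echoes its prompt.

```python
from functools import lru_cache

call_count = 0  # counts how many times the stand-in model is actually queried


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call (not part of any library)."""
    global call_count
    call_count += 1
    return f"echo: {prompt}"


@lru_cache(maxsize=None)
def cached_generate(prompt: str) -> str:
    # Identical prompts are served from the in-memory cache after the first call.
    return fake_llm(prompt)


first = cached_generate("3 - 1")
second = cached_generate("3 - 1")  # cache hit: fake_llm is not called again
third = cached_generate("4 + 7")   # new prompt: fake_llm is called

print(call_count)
```

Only two real calls are made for three requests, which is why re-running an unchanged eval script against a cache costs almost nothing.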

So, we now need a function to determine whether or not the LLM got the answer correct. The following should do the job:

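def evaluate_response(question: str, response: str) -> float:
    # The questions are arithmetic expressions we wrote ourselves, so eval is safe here
    correct_answer = eval(question)
    try:
        # Keep only the digits of the final space-separated token of the response
        response_int = int(
            "".join([c for c in response.split(" ")[-1] if c.isdigit()]),
        )
        return float(correct_answer == response_int)
    except ValueError:
        # The final token contained no digits, so count the response as incorrect
        return 0.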
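
To see what the parsing step does with different response styles, the sketch below isolates it in a small helper. `last_token_digits` is a hypothetical name introduced here for illustration, and the example responses are made up rather than real model outputs.

```python
def last_token_digits(response: str) -> str:
    """Hypothetical helper mirroring the digit extraction used by the scorer."""
    # Take the final space-separated token and keep only its digit characters.
    return "".join(c for c in response.split(" ")[-1] if c.isdigit())


# Made-up responses, chosen to cover a few formatting styles
examples = ["2", "The answer is 2.", "3 - 1 = 2", "I am not sure"]
parsed = {r: last_token_digits(r) for r in examples}

for response, digits in parsed.items():
    print(repr(response), "->", repr(digits))
```

A response whose final token has no digits yields an empty string, and `int("")` raises `ValueError`, which is the case the `except` branch of the scorer turns into a score of `0.`.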

Step 2

We can then run our evals as follows:

scores = [evaluate_response(q, client.generate(q)) for q in qs]
[1.0, 1.0, 1.0, 1.0, 1.0]

Okay great, we’ve now done steps 1 and 2, and step 3 is not necessary as they’re all passing. We could deploy this and get some user feedback, but we can probably expand the evals a bit more before we get to step 4.

Iteration 2

Step 1

Let’s go back to step 1, and expand the dataset to verify that we weren’t just lucky with these five examples:

from random import randint, choice, seed
qs += [f"{randint(0, 100)} {choice(['+', '-'])} {randint(0, 100)}" for i in range(25)]
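
Since the questions are random, each run produces a different set unless the generator is seeded. If reproducible evals are wanted, one option is sketched below; `make_questions` and its `rng_seed` parameter are hypothetical names, and the seed value is arbitrary.

```python
from random import Random
from typing import List


def make_questions(n: int, rng_seed: int) -> List[str]:
    """Generate n random addition/subtraction questions, reproducibly."""
    rng = Random(rng_seed)  # a private generator, so global random state is untouched
    return [
        f"{rng.randint(0, 100)} {rng.choice(['+', '-'])} {rng.randint(0, 100)}"
        for _ in range(n)
    ]


batch_a = make_questions(25, rng_seed=0)
batch_b = make_questions(25, rng_seed=0)

print(batch_a == batch_b)  # the same seed gives the same questions
```

Fixing the question set also means that a cached client sees the same prompts on every run, so the cache is actually hit.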

Step 2

Re-running the evals:

scores = [evaluate_response(q, client.generate(q)) for q in qs]
print(f"{int(sum(scores))}/{len(scores)} correct")
25/30 correct.

The above code takes roughly 30s to run. Things are this slow because each LLM query is being run serially. The simplest way to speed up our evals is to use a map utility which uses multi-threading under the hood to run each evaluation in parallel. We first need to define the evaluation function we’d like to map our arguments across:

def evaluate(q: str):
    response = client.generate(q)
    return evaluate_response(q, response)

Let’s also quickly turn off caching, to verify that any speed improvements are due to the parallel queries, and not just to cached LLM responses.


We can then map the arguments across this evaluation function as follows.

scores =, qs)

This takes ~3.5s, which is roughly 10x faster than the serial version. Re-adding caching makes it even faster (roughly 0.005s, or 6000x faster), with each parallel thread no longer needing to wait for the LLM response.
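
The effect of running queries in parallel can be reproduced without any LLM at all. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor` as a stand-in for a multi-threaded map; it is not necessarily what happens under the hood in the tutorial's code, and `slow_evaluate` is a hypothetical function that just sleeps to imitate network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_evaluate(q: str) -> float:
    """Hypothetical stand-in for an evaluation that waits on a network call."""
    time.sleep(0.05)  # simulated latency
    return float(eval(q))


questions = ["3 - 1", "4 + 7", "6 + 2", "9 - 3", "7 + 9", "1 + 1", "8 - 5", "2 + 2"]

# Serial: each call waits for the previous one to finish
start = time.perf_counter()
serial_results = [slow_evaluate(q) for q in questions]
serial_time = time.perf_counter() - start

# Parallel: the waits overlap, and pool.map keeps results in input order
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(questions)) as pool:
    parallel_results = list(pool.map(slow_evaluate, questions))
parallel_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f}s, parallel: {parallel_time:.2f}s")
```

Threads work well here because the time is spent waiting on I/O rather than computing, so the waits can overlap.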

Getting back to our evals, we see that only 25 of the 30 tests are passing:

print(f"{int(sum(scores))}/{len(scores)} correct")
25/30 correct.

Let’s dive in and try to work out what’s going wrong 🔍
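
A quick local check, before any logging is set up, is to pair each question with its score and list the ones that failed. This is only a sketch: `failing_cases` is a hypothetical helper, and the questions and scores below are made up for illustration rather than taken from the run above.

```python
from typing import List


def failing_cases(questions: List[str], scores: List[float]) -> List[str]:
    """Return the questions whose score is 0."""
    return [q for q, s in zip(questions, scores) if s == 0.0]


# Made-up data for illustration only
example_qs = ["3 - 1", "55 + 21", "4 + 7"]
example_scores = [1.0, 0.0, 1.0]

failures = failing_cases(example_qs, example_scores)
print(failures)
```

This shows which questions fail but not what the model actually replied, which is what the logging in the next step captures.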

To make our lives easier, we should start to properly log everything into the console. We first need to create a project to log our results into:


We can then modify our evaluate function to log our results as follows:

def evaluate(q: str):
    response = client.generate(q)
    score = evaluate_response(q, response)
    return unify.log(

Let’s re-run the evals with the logging included:

logs =, qs)

We can then go into our console, select our “maths_assistant” project, and see our logged results: