The first step of the flywheel is to create our agent. Let’s add something simple to begin with, and we can add more complexity in future iterations 🔁

The full script for running this iteration can be found here.

🤖 Create Agent

We can start with a simple zero-shot LLM.

import os
import textwrap

import unify
import wget

agent = unify.Unify("o3-mini@openai", traced=True, cache="read-only")

Let’s also download a .cache.json file which was previously generated whilst running through this demo. This avoids the need to make any real LLM calls, saving time and money, and it also ensures the walkthrough is fully deterministic.

If you’d rather go down your own unique iteration journey, then you should skip the cell below, and either remove cache="read-only" (to turn off caching) or replace it with cache=True (to create your own local cache) in the agent constructor above; a sketch of both options is shown after the download cell. However, this would mean that many parts of the remaining walkthrough might not directly apply in your case, as the specific failure modes, and the order in which they appear, are likely to be different.

if os.path.exists(".cache.json"):
    os.remove(".cache.json")
wget.download(
    "https://raw.githubusercontent.com/"
    "unifyai/demos/refs/heads/main/"
    "marking_assistant/.cache.json",
)
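
If you did decide to skip the cache download, the two alternative constructors mentioned above might look something like the sketch below (same model id and traced flag as before; shown for reference only, it is not part of the cached walkthrough):

# option 1: no caching, every call hits the real LLM
agent = unify.Unify("o3-mini@openai", traced=True)

# option 2: build up your own local .cache.json as you go
agent = unify.Unify("o3-mini@openai", traced=True, cache=True)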

The agent needs to mark student answers to questions, out of a possible maximum number of marks. Let’s give it a sensible system message template to begin with:

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The question is:

{question}


Their answer to this question is:

{answer}


As the very final part of your response, simply provide the number of marks on a *new line*, without any additional formatting. For example:

3
"""

In our test set, the student answer is stored as a dictionary with keys for each sub-question. When no sub-questions are present, a dummy "_" key is used instead (to preserve the dict type for all answers). Rather than unpacking the dictionary directly into the system message, let’s add a simple “pretty print” function to format it more elegantly.

def pretty_print_dict(d, indent=0):
    # recursively format a (possibly nested) answer dict, skipping the dummy "_" key
    output = ""
    for key, value in d.items():
        if key != "_":
            output += " " * indent + str(key) + ":\n"
        if isinstance(value, dict):
            # only increase the indent when a real sub-question key was printed
            output += pretty_print_dict(value, indent=indent + (4 * int(key != "_")))
        else:
            for line in str(value).splitlines():
                output += " " * (indent + 4) + line + "\n"
    return output
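
As a quick illustration, a hypothetical answer dict with two sub-questions would be formatted as follows (the example values here are made up, not taken from the dataset):

example_answer = {
    "a": "The gradient is 2.",
    "b": "y = 2x + 1",
}
print(pretty_print_dict(example_answer, indent=4))
#     a:
#         The gradient is 2.
#     b:
#         y = 2x + 1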

Let’s now wrap our system message template in a simple function which populates the template variables with the specific data involved, before querying the agent for a response. We also make use of pretty_print_dict to format the answer, as explained above.

@unify.traced
def call_agent(system_msg, question, answer, available_marks_total):

    # copy agent, so each thread uniquely sets system message
    local_agent = agent.copy()

    # set each of the placeholder values
    local_agent.set_system_message(
        system_msg.replace(
            "{question}",
            textwrap.indent(question, " " * 4),
        )
        .replace(
            "{answer}",
            pretty_print_dict(answer, indent=4),
        )
        .replace(
            "{available_marks_total}",
            str(available_marks_total),
        ),
    )

    # return prediction (no user message, only system message)
    return local_agent.generate()
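
As a quick sanity check, the function can be called directly with a made-up question and answer (hypothetical values, so the response itself isn’t meaningful for our dataset):

response = call_agent(
    system_message,
    "Calculate the gradient of the line y = 2x + 1.",  # hypothetical question
    {"_": "The gradient is 2."},  # hypothetical answer (no sub-questions)
    1,  # available marks
)
print(response)  # the rationale, with the awarded marks on the final line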

🗂️ Add Tests

Great, we now have our agent implemented. So, what are some good unit tests to begin with? Rather than using all 321 examples for our first iteration, let’s use the smallest subset of 10 examples, which we created in the previous section.

test_set_10 = unify.download_dataset("TestSet10")
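
If you’d like to sanity check what was downloaded, each datum exposes its fields via .entries (the same attribute we unpack when mapping over the dataset below), which should include the question, student answer and marks used by our evaluation function:

for datum in test_set_10:
    print(datum.entries.keys())
    break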

🧪 Run Tests

Let’s add an evaluation function, and include all other arguments that we would like to log as part of the evaluation. All input arguments, intermediate variables, and return values without a leading "_" in their name (i.e. all “non-private” arguments, intermediate variables and returns) will automatically be logged when the function is called, thanks to the unify.log decorator.

@unify.log
def evaluate(
    question,
    student_answer,
    available_marks_total,
    correct_marks_total,
    _system_message,
):
    pred_marks = call_agent(
        _system_message,
        question,
        student_answer,
        available_marks_total,
    )
    _pred_marks_split = pred_marks.split("\n")
    pred_marks_total, diff_total, error_total = None, None, None
    # walk backwards through the response lines until one containing digits is found
    for _substr in reversed(_pred_marks_split):
        _extracted = "".join([c for c in _substr if c.isdigit()])
        if _extracted != "":
            pred_marks_total = int(_extracted)
            diff_total = correct_marks_total - pred_marks_total
            error_total = abs(diff_total)
            break
    # repack the prediction, so the extracted marks and the full rationale are both logged
    pred_marks = {"_": {"marks": pred_marks_total, "rationale": pred_marks}}
    return error_total
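
To make the parsing step concrete, here is the same backwards scan applied to a hypothetical response string, outside of the evaluation function:

example_response = "The student correctly identifies the gradient.\n\n2"
for line in reversed(example_response.split("\n")):
    extracted = "".join([c for c in line if c.isdigit()])
    if extracted != "":
        print(int(extracted))  # prints: 2
        break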

We can then run our evaluation, with the logging included, like so:

with unify.Experiment("simple_agent", overwrite=True), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )

The unify.Experiment() term creates an "experiment" parameter in the context, with value "simple_agent" in this case.

The overwrite=True argument will remove all prior logs with experiment parameter equal to "simple_agent". This is useful if you would like to re-run an experiment (clearing any previous runs).

If you would like to accumulate data for a specific experiment, then the overwrite flag should not be set, and new logs will simply be added to the experiment without deleting any prior logs.

The unify.Params() term sets the parameters which are held constant throughout the experiment.
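
For instance, accumulating further runs onto the same experiment (rather than overwriting it) would just drop the overwrite flag, with everything else unchanged:

with unify.Experiment("simple_agent"), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )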

By looking in our interface, we can see that we have some failures, with a mean error of 0.8 across the ten examples.

Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and that each LLM call has its template variables populated as expected.

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅

So, for the next iteration, we’ll need to dive in and understand why the agent is failing to make the correct prediction in some cases 🔁