The full script for running this iteration can be found here.

🔍 Still Ignoring Mark Types

Let’s see what effect our new output format had on the nature of the agent’s responses, if any.

Consider Example 207: the agent still failed to award SC1 for the student’s answer 1/3, 0.34, 3.5%, despite the markscheme explicitly stating SC1 for 1/3, 0.34, 3.5%. The agent’s explicit thoughts about SC1 were:

🤖 No special case credit is applicable here since the order is incorrect and no alternative acceptable method is demonstrated.

This is a pretty fluffy and empty statement. Despite o3-mini being a multi-step reasoning model, perhaps we’re still asking the agent to consider too many things at once.

Requiring the agent to consider one mark at a time might rectify this lack of attention to detail.

Example 132 is even more difficult: the agent not only needs to consider each mark, but also has six different sub-questions to reason about, each with its own set of available marks and mark types.
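To make the shape of the problem concrete, here is a rough sketch of how such an example is structured, keyed by sub-question. The values and mark labels below are invented for illustration, and the "total" key in available_marks is inferred from how it is looked up in the code later on.

# Invented illustration of the per-example structure (not the real Example 132).
# Keys are sub-question identifiers; the "total" entry in `available_marks` is assumed.
example_entries = {
    "sub_questions": {"a": "...", "b": "...", "c": "..."},
    "student_answer": {"a": "...", "b": "...", "c": "..."},
    "markscheme": {
        "a": "B1 for ...",
        "b": "M1 for ...\nA1 for ...",
        "c": "B2 for ...\nSC1 for ...",
    },
    "available_marks": {"a": 1, "b": 2, "c": 2, "total": 5},
    "correct_marks": {"a": {"marks": 1}, "b": {"marks": 1}, "c": {"marks": 2}},
}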

Let’s see if using a separate LLM call per sub-question improves the performance on Example 132.

🔀 Queries per Subquestion

Firstly, let’s create a new system prompt for our agent, which will now reason about one sub-question at a time.

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}


The question you need to mark is:

{question}


Their answer is:

{answer}


The markscheme is:

{markscheme}


{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines,
)
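Note that {general_guidelines} is substituted as soon as the template is defined, since the guidelines are the same for every question, while the remaining placeholders are filled in per sub-question inside call_agent below.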

Given the changes, we can also remove the output_response_explanations dict and replace it with a single output_response_explanation string variable, since the agent no longer needs to output responses for multiple sub-questions in a single response.

output_response_explanation = "You should populate the `reasoning` field with your general thoughts on each individual mark identified in the markscheme, and also a decision as to whether each of these marks should be awarded. These marks are not necessarily cumulative with regards to the marks to award, and some may be irrelevant given the student's approach or answer, in which case just respond `False` for the `should_award` field. Finally, you should put the total number of marks to award in the `marks` field."
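To give a sense of what each per-sub-question call should return, the parsed output looks roughly like the sketch below. The exact schema is whatever create_marks_and_reasoning_format produces (defined in an earlier iteration), so the nesting and field names of the per-mark reasoning are assumptions here; the only field the evaluation code relies on is marks.

# Assumed shape of one parsed per-sub-question response; only the `marks`
# field is used downstream, and the exact nesting of the per-mark reasoning
# depends on `create_marks_and_reasoning_format` (field names are illustrative).
example_parsed_response = {
    "reasoning": {
        "B1": {"reasoning": "...", "should_award": True},
        "SC1": {"reasoning": "...", "should_award": False},
    },
    "marks": 1,
}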

Let’s update call_agent to map each subquestion to a unique LLM call.

@unify.traced
def call_agent(
    example_id,
    system_msg,
    sub_questions,
    markscheme,
    answer,
    available_marks,
):
    # one agent copy per sub-question
    agents = {k: agent.copy() for k in markscheme.keys()}
    # build a structured response format per sub-question, based on the
    # marks present in that sub-question's markscheme
    response_formats = {
        k: create_marks_and_reasoning_format(
            [
                itm[0]
                for itm in parse_marks_from_markscheme(f"_{k}" if k != "_" else "", v)
            ],
        )
        for k, v in markscheme.items()
    }
    [
        agnt.set_response_format(rf)
        for agnt, rf in zip(
            agents.values(),
            response_formats.values(),
        )
    ]
    # pre-process each sub-markscheme, as in the previous iterations
    markscheme = {
        k: update_markscheme(f"_{k}" if k != "_" else "", v)
        for k, v in markscheme.items()
    }
    # populate the system message template for each sub-question
    for k in markscheme.keys():
        agents[k].set_system_message(
            system_msg.replace(
                "{question}",
                textwrap.indent(sub_questions[k], " " * 4),
            )
            .replace(
                "{markscheme}",
                textwrap.indent(markscheme[k], " " * 4),
            )
            .replace(
                "{answer}",
                textwrap.indent(answer[k], " " * 4),
            )
            .replace(
                "{available_marks}",
                str(available_marks[k.replace("_", "total")]),
            )
            .replace(
                "{output_response_explanation}",
                output_response_explanation,
            ),
        )
    # one LLM call per sub-question, each tagged with its sub-question key
    rets = unify.map(
        lambda k, a: a.generate(tags=[k]),
        list(agents.items()),
        name=f"Evals[{example_id}]->SubQAgent",
    )
    # strip markdown code fences from the raw responses, if present
    rets = [
        ret.split("```")[-2].lstrip("json") if "```" in ret else ret for ret in rets
    ]
    # parse each response into its structured format
    rets = {
        k: response_formats[k].model_validate_json(ret).model_dump()
        for k, ret in zip(markscheme.keys(), rets)
    }
    return rets
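Each sub-question now gets its own agent copy, response format and generation, and each generation is tagged with its sub-question key, which makes the individual calls easy to tell apart in the traces later on.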

Let’s also update evaluate to pass the updated parameters to call_agent.

@unify.log
def evaluate(
    example_id,
    sub_questions,
    student_answer,
    available_marks,
    markscheme,
    correct_marks,
    per_question_breakdown,
    _system_message,
):
    # one structured prediction per sub-question
    pred_marks = call_agent(
        example_id,
        _system_message,
        sub_questions,
        markscheme,
        student_answer,
        available_marks,
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    # signed difference and absolute error, per sub-question and in total
    diff = {
        k: vcor["marks"] - vpred["marks"]
        for (k, vcor), (_, vpred) in zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    # attach the predictions and diffs to the per-question breakdown
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d,
        }
        for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values(),
        )
    }
    return error_total

🧪 Rerun Tests

with unify.Experiment(
    "queries_per_subquestion",
    overwrite=True,
), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )

It seems we actually have a regression, with the mean error increasing from 0.3 to 0.5.

Firstly, let’s take a look at the traces, to ensure that the system message template has been implemented correctly and that each LLM call has its template variables populated correctly.
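As a quick local sanity check (independent of the traces themselves), we can also render the template for a single sub-question of one example and inspect it by eye. This assumes the test_set_10 entries expose the same fields that are passed to evaluate, and it skips the markscheme pre-processing since we only care about placeholder substitution here.

import textwrap

# render the system message for the first sub-question of the first example
entries = test_set_10[0].entries  # assumed to hold the same fields passed to `evaluate`
k = list(entries["markscheme"].keys())[0]
rendered = (
    system_message.replace(
        "{question}", textwrap.indent(entries["sub_questions"][k], " " * 4)
    )
    .replace("{markscheme}", textwrap.indent(entries["markscheme"][k], " " * 4))
    .replace("{answer}", textwrap.indent(entries["student_answer"][k], " " * 4))
    .replace(
        "{available_marks}",
        str(entries["available_marks"][k.replace("_", "total")]),
    )
    .replace("{output_response_explanation}", output_response_explanation)
)
print(rendered)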

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅

Again, let’s explore what’s going wrong in the next iteration, and try to understand why we’ve seen this regression 🔁