The full script for running this iteration can be found here.

🔍 Difficulty Debugging

Let’s explore the failure modes that still remain.

The change seems to have fixed Example 261, but the other two [Examples 207 and 132 (c)] are still failing.

It’s also becoming quite difficult to track the exact discrepancy between the correct marks and those predicted by the agent, as the agent’s response is a single block of text, unlike the ground truth data, which is formatted as a dictionary with each sub-question marked independently.

Adding structured output could both help the agent reason about each part of the question independently and make the response easier to parse, enabling us to present a diff at the sub-question level rather than just for the entire question. Let’s give it a try!

🔀 Add Structured Output

So, let’s go ahead and implement structured output! 🧑‍💻

Let’s first define the output we want for each sub-question:

from pydantic import BaseModel, create_model


class MarksAndReasoning(BaseModel):
    reasoning: str
    marks: int
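
As a quick illustration (not part of the script itself), an instance of this model serializes to exactly the small JSON object we want the agent to emit for each sub-question. The reasoning text and mark below are invented:

# Illustrative only: the reasoning text and mark are invented.
example = MarksAndReasoning(
    reasoning="States the correct formula but substitutes the wrong value.",
    marks=1,
)
print(example.model_dump_json())
# {"reasoning":"States the correct formula but substitutes the wrong value.","marks":1}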

Let’s now write a simple function to build the desired Pydantic output model dynamically, based on the sub-questions present.

@unify.traced
def create_response_format(response_keys):
    if response_keys:
        # Sub-questions present: one `MarksAndReasoning` field per sub-question key.
        response_fields = dict(
            zip(response_keys, [(MarksAndReasoning, ...)] * len(response_keys)),
        )
        return create_model("Response", **response_fields)
    else:
        # No sub-questions: return the flat `MarksAndReasoning` model directly.
        return MarksAndReasoning
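
To make the dynamic model concrete, here’s a hypothetical usage. The sub-question labels are invented (and assumed to be valid Python identifiers); the point is simply that the returned model has one MarksAndReasoning field per sub-question:

# Hypothetical sub-question labels, purely for illustration.
Response = create_response_format(["a", "b"])

parsed = Response.model_validate(
    {
        "a": {"reasoning": "Correct method, arithmetic slip.", "marks": 1},
        "b": {"reasoning": "Fully correct.", "marks": 2},
    },
)
print(parsed.model_dump())
# {'a': {'reasoning': 'Correct method, arithmetic slip.', 'marks': 1},
#  'b': {'reasoning': 'Fully correct.', 'marks': 2}}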

Let’s update our system prompt so the agent knows how to populate the structured response correctly. We only use the nested output structure when sub-questions are present, so we’ll want to populate the instructions dynamically, depending on the presence or absence of sub-questions. Let’s create two alternatives. First, when sub-questions are present:

“For each sub-question, you should populate the reasoning field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award for this sub-question in the marks field.”

When they are not present:

“You should populate the reasoning field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award in the marks field.”

Now let’s implement this in code. First, the general template:

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}

The question you need to mark is:

{question}


The markscheme for this specific question is:

{markscheme}


The student's answer to this question (which you need to mark) is:

{answer}


{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines,
)

Then the two excerpts:

output_response_explanations = dict()
output_response_explanations["with_subqs"] = (
    "For each sub-question {subquestions} you should populate the `reasoning` field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award for this sub-question in the `marks` field."
)
output_response_explanations["without_subqs"] = (
    "You should populate the `reasoning` field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award in the `marks` field."
)

Let’s update our call_agent method to set the output format dynamically:

@unify.traced
def call_agent(system_msg, question, markscheme, answer, available_marks_total):
    local_agent = agent.copy()
    with_subqs = len(markscheme) > 1
    response_format = create_response_format(
        list(markscheme.keys()) if with_subqs else None,
    )
    local_agent.set_response_format(response_format)
    if with_subqs:
        output_response_exp = output_response_explanations["with_subqs"]
        output_response_exp = output_response_exp.replace(
            "{subquestions}",
            ", ".join(list(markscheme.keys())),
        )
    else:
        output_response_exp = output_response_explanations["without_subqs"]
    local_agent.set_system_message(
        system_msg.replace(
            "{question}",
            textwrap.indent(question, " " * 4),
        )
        .replace(
            "{markscheme}",
            pretty_print_dict(markscheme, indent=4),
        )
        .replace(
            "{answer}",
            pretty_print_dict(answer, indent=4),
        )
        .replace(
            "{available_marks_total}",
            str(available_marks_total),
        )
        .replace(
            "{output_response_explanation}",
            output_response_exp,
        ),
    )
    ret = local_agent.generate()
    if "```" in ret:
        ret = ret.split("```")[-2].lstrip("json")
    ret = response_format.model_validate_json(ret).model_dump()
    if not with_subqs:
        return {"_": ret}
    return ret
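
To make the contract explicit: call_agent now always returns a dictionary mapping sub-question keys to reasoning/marks entries, falling back to a single "_" key when there are no sub-questions. A hypothetical invocation (the question, markscheme and sub-question labels below are invented) would look something like this:

# Hypothetical call -- the real question, markscheme and answer come from the
# dataset, and the sub-question labels here are invented.
pred = call_agent(
    system_message,
    question="...",
    markscheme={"a": "1 mark for ...", "b": "2 marks for ..."},
    answer={"a": "...", "b": "..."},
    available_marks_total=3,
)
# Expected shape (the reasoning and marks will of course vary):
# {"a": {"reasoning": "...", "marks": 1}, "b": {"reasoning": "...", "marks": 2}}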

Let’s also update our evaluate method to parse the returned JSON correctly, include a sub-question-level diff, and extend the per-question breakdown to include the sub-question-level predictions:

@unify.log
def evaluate(
    question,
    student_answer,
    available_marks_total,
    markscheme,
    correct_marks,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        _system_message,
        question,
        markscheme,
        student_answer,
        available_marks_total,
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"]
        for (k, vcor), (_, vpred) in zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d,
        }
        for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values(),
        )
    }
    return error_total
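
As a quick sanity check of the new diff logic, here’s a tiny worked example with invented marks: if the ground truth awards 2 and 1 marks for two sub-questions and the agent predicts 1 and 1, the signed diffs are 1 and 0, giving a total absolute error of 1:

# Worked example of the diff/error computation, with invented marks.
correct_marks = {"a": {"marks": 2}, "b": {"marks": 1}}
pred_marks = {"a": {"marks": 1}, "b": {"marks": 1}}
diff = {
    k: vcor["marks"] - vpred["marks"]
    for (k, vcor), (_, vpred) in zip(correct_marks.items(), pred_marks.items())
}
error_total = sum(abs(v) for v in diff.values())
print(diff, error_total)  # {'a': 1, 'b': 0} 1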

🧪 Rerun Tests

with unify.Experiment("add_structured_output", overwrite=True), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )

The mean error has actually gone back up from 0.2 to 0.4.

Let’s take a look at the traces, to ensure that the system message template has been implemented correctly and that the template variables in each LLM call’s system message are populated correctly.

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅

Let’s explore what’s going wrong in the next iteration 🔁