The full script for running this iteration can be found here.

🔍 Lack of Global Context

These updates have actually regressed the overall performance, leaving us with a mean error of 0.5.

Maybe the purely local reasoning has some shortcomings. Let’s focus on one of the new regressions to understand why our latest change has disrupted the agent where it was previously consistently correct.

Example 20 (b)

Clearly, the agent is now taking some liberties with what constitutes a “valid reason”. It’s unclear why the agent is only making these mistakes now.

Let’s look at some of the justifications for not awarding the mark from the previous runs.

Perhaps withholding the full question prevents the agent from using “common sense” and realizing how “silly” the proposed answer is, in light of the overall question and the information provided to the student.

Maybe strict adherence to the markscheme alone, without the full context, is too restrictive.

Let’s update our per-sub-question system prompts to also include the preceding sub-questions, their markschemes, and their answers in full. It’s unlikely that the context of a later sub-question will assist with the marking of an earlier one, and we still want to keep the agent as focused as possible on the relevant information.

🔀 Include Preceding Context

Let’s first update the system prompt, re-introducing the placeholder for the aligned sub-questions, markschemes and answers, this time calling it {prior_context}, which will only be populated when preceding sub-questions exist. Let’s also include the full question.

system_message = """
Your task is to award a suitable number of marks for a student's answer to question {subq}, from 0 up to a maximum of {available_marks} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}


The *overall* question is:

{question}

{prior_context}

The specific question you need to mark is:

{subquestion}


Their answer to this specific question is:

{answer}


The markscheme for this specific question is:

{markscheme}


{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines,
)
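
Note that {general_guidelines} is substituted once, at template definition time, while the remaining placeholders are filled per sub-question inside call_agent. If you'd like to double-check that nothing has been missed, a quick sanity check along the following lines can help (this assumes the guidelines text itself contains no curly-brace placeholders of its own):

import re

# Placeholders still present after substituting {general_guidelines};
# all of these must be filled inside call_agent.
remaining = set(re.findall(r"\{(\w+)\}", system_message))
assert remaining == {
    "subq",
    "question",
    "prior_context",
    "subquestion",
    "answer",
    "markscheme",
    "available_marks",
    "output_response_explanation",
}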

Let’s also add a general explanation for the prior context, in cases where it is included.

prior_context_exp = """
All of the *preceding* sub-questions, their specific markschemes and the student's answers are as follows:
"""

Let’s now update call_agent to pass in the required information.

@unify.traced
def call_agent(
    example_id,
    system_msg,
    question_num,
    question,
    sub_questions,
    markscheme,
    answer,
    available_marks,
):
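    # One agent instance per sub-question, each of which will get its own
    # system message and structured response format below.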
    agents = {k: agent.copy() for k in markscheme.keys()}
    with_subqs = len(markscheme) > 1
    response_formats = {
        k: create_marks_and_reasoning_format(
            [
                itm[0]
                for itm in parse_marks_from_markscheme(f"_{k}" if k != "_" else "", v)
            ],
        )
        for k, v in markscheme.items()
    }
    # Assign each sub-question agent its own structured response format.
    for agnt, rf in zip(agents.values(), response_formats.values()):
        agnt.set_response_format(rf)
    markscheme = {
        k: update_markscheme(f"_{k}" if k != "_" else "", v)
        for k, v in markscheme.items()
    }
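    # Build each sub-question's system message by filling the template
    # placeholders, including the prior sub-question context where relevant.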
    for i, k in enumerate(markscheme.keys()):
        agents[k].set_system_message(
            system_msg.replace(
                "{subq}",
                k.replace("_", str(question_num)),
            )
            .replace(
                "{question}",
                textwrap.indent(question, " " * 4),
            )
            .replace(
                "{subquestion}",
                textwrap.indent(sub_questions[k], " " * 4),
            )
            .replace(
                "{markscheme}",
                textwrap.indent(markscheme[k], " " * 4),
            )
            .replace(
                "{answer}",
                textwrap.indent(answer[k], " " * 4),
            )
            .replace(
                "{available_marks}",
                str(available_marks[k.replace("_", "total")]),
            )
            .replace(
                "{output_response_explanation}",
                output_response_explanation,
            )
            .replace(
                "{prior_context}",
                (
                    (
                        prior_context_exp
                        + pretty_print_dict(
                            {
                                prev_k: {
                                    "sub-question": sub_questions[prev_k],
                                    "markscheme": markscheme[prev_k],
                                    "answer": answer[prev_k],
                                }
                                for prev_k in list(sub_questions.keys())[0:i]
                            },
                            indent=4,
                        )
                    )
                    if with_subqs and i > 0
                    else ""
                ),
            ),
        )
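    # Query all sub-question agents in parallel, tagging each generation
    # with its sub-question key.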
    rets = unify.map(
        lambda k, a: a.generate(tags=[k]),
        list(agents.items()),
        name=f"Evals[{example_id}]->SubQAgent",
    )
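    # Strip any markdown code fences the model may have wrapped around the JSON.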
    rets = [
        ret.split("```")[-2].lstrip("json") if "```" in ret else ret for ret in rets
    ]
    rets = {
        k: response_formats[k].model_validate_json(ret).model_dump()
        for k, ret in zip(markscheme.keys(), rets)
    }
    return rets

Finally, let’s update evaluate accordingly.

@unify.log
def evaluate(
    example_id,
    question_num,
    question,
    sub_questions,
    student_answer,
    available_marks,
    markscheme,
    correct_marks,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        example_id,
        _system_message,
        question_num,
        question,
        sub_questions,
        markscheme,
        student_answer,
        available_marks,
    )
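    # Aggregate the totals, plus the signed and absolute per-sub-question errors.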
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"]
        for (k, vcor), (_, vpred) in zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d,
        }
        for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values(),
        )
    }
    return error_total
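
As a quick aside on the metrics: the signed differences can cancel out across sub-questions, which is why we return the total of the absolute errors. A toy illustration with hypothetical marks (not taken from the test set):

# Hypothetical marks, purely to illustrate the diff/error computation above.
toy_correct = {"a": {"marks": 3}, "b": {"marks": 1}}
toy_pred = {"a": {"marks": 2}, "b": {"marks": 2}}

toy_diff = {k: toy_correct[k]["marks"] - toy_pred[k]["marks"] for k in toy_correct}
toy_error = {k: abs(v) for k, v in toy_diff.items()}

print(toy_diff)                # {'a': 1, 'b': -1}
print(sum(toy_diff.values()))  # 0 -> signed differences cancel out
print(sum(toy_error.values())) # 2 -> absolute error still captures both mistakes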

🧪 Rerun Tests

with unify.Experiment(
    "with_preceeding_context",
    overwrite=True,
), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )

Great, so we’ve fixed the new regressions, but we’re back to the same three failures as before, failing for the same reason.

Let’s take a look at the traces, to ensure that the system message template has been implemented correctly and that each LLM call has its template variables populated as expected.

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅

Again, let’s explore what’s going wrong in the next iteration 🔁