The full script for running this iteration can be found here.

🔍 Still Ignoring Mark Types

As usual, let’s take a look and explore why the agent might be failing on the remaining examples 🕵️

Let’s try to force the agent to reason about each potential mark mentioned in the markscheme, by further refining our structured output.

Let’s expand the reasoning field for each sub-question to include a field for each mark type referenced in the sub-question’s markscheme, going from the following structure:

Prediction:
  a:
    reasoning: str
    marks: int
  b:
    reasoning: str
    marks: int
  ...

To this version which explicitly enforces reasoning about each potential mark type referenced in the markscheme:

Prediction:
  a:
    reasoning:
      B1:
        thoughts: str
        should_award: bool
      SC1:
        thoughts: str
        should_award: bool
      ...,
      overall_thoughts: str
    marks: int
  b:
    reasoning:
      M1:
        thoughts: str
        should_award: bool
      A1:
        thoughts: str
        should_award: bool
      ...,
      overall_thoughts: str
    marks: int
  ...

This way, the agent will be forced to reason about SC1 for Example 207, M1 for Example 261, and B1 for Example 132 (c).

🔀 Mark Type Reasoning

Let’s first define a function to dynamically construct the required pydantic type. For each parsed mark type, we want the model to give its thoughts and decide whether or not the mark should be awarded. Let’s create this pydantic type first:

from pydantic import BaseModel, create_model

class ThoughtsAndAwardDecision(BaseModel):
    thoughts: str
    should_award: bool

Let’s then create a function to dynamically construct a PerMarkReasoning pydantic type, with one ThoughtsAndAwardDecision field for each mark detected in the sub-question markscheme.

@unify.traced(name="create_per_mark_reasoning_format_{mark_types}")
def create_per_mark_reasoning_format(mark_types):
    response_fields = dict(
        zip(
            mark_types + ["overall_thoughts"],
            [(ThoughtsAndAwardDecision, ...)] * len(mark_types) + [(str, ...)],
        ),
    )
    return create_model("PerMarkReasoning", **response_fields)

Let’s then re-define MarksAndReasoning (previously this was statically defined, see the previous iteration) such that the reasoning field is no longer just a string, but is instead our newly created PerMarkReasoning (above).

@unify.traced(name="create_marks_and_reasoning_format_{mark_types}")
def create_marks_and_reasoning_format(mark_types):
    return create_model(
        "MarksAndReasoning",
        reasoning=(create_per_mark_reasoning_format(mark_types), ...),
        marks=(int, ...),
    )
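
Again purely for illustration (hypothetical mark types), the nested structure now looks like this:

MarksAndReasoning = create_marks_and_reasoning_format(["B1", "SC1"])
print(list(MarksAndReasoning.model_fields))
# ['reasoning', 'marks']
print(list(MarksAndReasoning.model_fields["reasoning"].annotation.model_fields))
# ['B1', 'SC1', 'overall_thoughts']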

Finally, let’s update the top-level function create_response_format so that it makes use of our newly defined create_marks_and_reasoning_format for each sub-question.

@unify.traced(name="create_response_format_{mark_types}")
def create_response_format(response_keys, mark_types):
    if response_keys:
        response_fields = dict(
            zip(
                response_keys,
                [
                    (create_marks_and_reasoning_format(mark_types[key]), ...)
                    for key in response_keys
                ],
            ),
        )
        return create_model("Response", **response_fields)
    else:
        return create_marks_and_reasoning_format(mark_types["_"])
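
For a question with hypothetical sub-questions a and b (keys and mark types invented for illustration), the top-level response format is assembled like so:

Response = create_response_format(
    ["a", "b"],
    {"a": ["B1", "SC1"], "b": ["M1", "A1"]},
)
print(list(Response.model_fields))
# ['a', 'b']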

We also need to write a function to parse the relevant marks from each sub-question markscheme.

We can take inspiration from update_markscheme defined in the previous iteration, which parses the markscheme in the same manner but for a different reason.

Let’s have the function extract the marks, and also the surrounding context.

@unify.traced(name="parse_marks_from_markscheme{subquestion}")
def parse_marks_from_markscheme(subquestion: str, markscheme: str):
    extracted_marks = re.findall(r"(?:SC|M|A|B)\d+", markscheme)
    if not extracted_marks:
        return []
    marks_n_context = list()
    for i, mark in enumerate(extracted_marks):
        index = markscheme.find(mark)
        chunk = markscheme[0:index]
        if i > 0:
            prev_mark = extracted_marks[i - 1]
            marks_n_context[i - 1][1] += chunk
        markscheme = markscheme[index:]
        marks_n_context.append([mark, chunk])
    marks_n_context[-1][1] += markscheme
    return marks_n_context
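
As a rough sanity check on a made-up markscheme (text invented here), the first element of each returned pair is the mark label, which is all we need below; the second accumulates the surrounding markscheme text:

marks_n_context = parse_marks_from_markscheme(
    "_a",
    "M1 for expanding the brackets\nA1 for the correct final answer",
)
print([mark for mark, _ in marks_n_context])
# ['M1', 'A1']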

Finally, we’ll also need to update call_agent such that we call parse_marks_from_markscheme on each sub-question markscheme, and then pass the detected mark types into our newly defined create_response_format.

@unify.traced
def call_agent(
    system_msg,
    question,
    sub_questions,
    markscheme,
    answer,
    available_marks_total,
):
    local_agent = agent.copy()
    with_subqs = len(markscheme) > 1
    response_format = create_response_format(
        list(markscheme.keys()) if with_subqs else None,
        {
            k: [
                itm[0]
                for itm in parse_marks_from_markscheme(f"_{k}" if k != "_" else "", v)
            ]
            for k, v in markscheme.items()
        },
    )
    local_agent.set_response_format(response_format)
    if with_subqs:
        output_response_exp = output_response_explanations["with_subqs"]
        output_response_exp = output_response_exp.replace(
            "{subquestions}",
            ", ".join(list(markscheme.keys())),
        )
    else:
        output_response_exp = output_response_explanations["without_subqs"]
    markscheme = {
        k: update_markscheme(f"_{k}" if k != "_" else "", v)
        for k, v in markscheme.items()
    }
    local_agent.set_system_message(
        system_msg.replace(
            "{question}",
            textwrap.indent(question, " " * 4),
        )
        .replace(
            "{markscheme}",
            pretty_print_dict(markscheme, indent=4),
        )
        .replace(
            "{answer}",
            pretty_print_dict(answer, indent=4),
        )
        .replace(
            "{available_marks_total}",
            str(available_marks_total),
        )
        .replace(
            "{questions_markscheme_and_answers}",
            pretty_print_dict(
                {
                    k: {
                        "sub-question": sub_questions[k],
                        "markscheme": markscheme[k],
                        "answer": answer[k],
                    }
                    for k in sub_questions.keys()
                },
                indent=4,
            ),
        )
        .replace(
            "{output_response_explanation}",
            output_response_exp,
        ),
    )
    ret = local_agent.generate()
    if "```" in ret:
        ret = ret.split("```")[-2].lstrip("json")
    ret = response_format.model_validate_json(ret).model_dump()
    if not with_subqs:
        return {"_": ret}
    return ret

Let’s also update our system message to better explain to the agent how it should reason about this new output structure.

output_response_explanations = dict()
output_response_explanations["with_subqs"] = (
    "For each sub-question {subquestions} you should populate the `reasoning` field with your general thoughts on each individual mark identified in the markscheme, and also a decision as to whether each of these mark should be awarded. These marks are not necessarily cumulative with regards to the marks to award, and some may be irrelevant given the student's approach or answer, in which case just respond `False` for the `should_award` field. Finally, you should put the total number of marks to award for each sub-question in the corresponding `marks` field."
)
output_response_explanations["without_subqs"] = (
    "You should populate the `reasoning` field with your general thoughts on each individual mark identified in the markscheme, and also a decision as to whether each of these mark should be awarded. These marks are not necessarily cumulative with regards to the marks to award, and some may be irrelevant given the student's approach or answer, in which case just respond `False` for the `should_award` field. Finally, you should put the total number of marks to award in the `marks` field."
)
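
To make the expected output concrete, here is a hypothetical response (content invented) for a single-part question whose markscheme references B1 and SC1; it validates against the dynamically generated format:

example_response = {
    "reasoning": {
        "B1": {
            "thoughts": "The student states the correct value with units.",
            "should_award": True,
        },
        "SC1": {
            "thoughts": "The special case does not apply to this approach.",
            "should_award": False,
        },
        "overall_thoughts": "Award B1 only.",
    },
    "marks": 1,
}
create_marks_and_reasoning_format(["B1", "SC1"]).model_validate(example_response)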

🧪 Rerun Tests

with unify.Experiment(
    "mark_type_reasoning",
    overwrite=True,
), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )

The failure modes are still entirely unchanged! o3-mini is certainly very stubborn about its decisions on these questions.

Let’s take a look at the traces to ensure that the system message template has been implemented correctly, and that the template variables in each LLM call’s system message are populated correctly.

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅

Again, let’s explore what’s going wrong in the next iteration 🔁