The full script for running this iteration can be found here.

🔍 Still Ignoring Mark Types

As usual, let’s explore why the agent might still be failing on the remaining examples 🕵️

Given that the agent is still failing to follow the instructions for each mark in the markscheme, perhaps it’s time we tried per-mark reasoning, with a separate LLM call made for each candidate mark that could be awarded. This might help the LLM consider each mark in the markscheme more deeply.

Let’s give it a try!

🔀 Queries per Mark

We still want our per-subquestion LLM to perform the final reasoning about how many marks to award for the sub-question; we just also want to provide it with the reasoning performed by each of our per-mark LLM queries.

We now have two different LLMs with two different roles, and we therefore need two different system messages.

Let’s first update the subquestion system message, in anticipation of the incoming mark-by-mark reasoning. Let’s also keep the markscheme and the mark type explanations separate, rather than naively combining them as was done in update_markscheme.

subq_system_message = """
Your task is to award a suitable number of marks for a student's answer to question {subq}, from 0 up to a maximum of {available_marks} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}


The *overall* question is:

{question}

{prior_context}

The specific question you need to mark is:

{subquestion}


Their answer to this specific question is:

{answer}


The markscheme for this specific question is:

{markscheme}

{mark_types_explanation}

{mark_observations}

{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines,
)

The "{mark_types_explanation}" placeholder can be overridden explicitly, giving us more control. Let’s create a new function extract_mark_type_explanation, inspired by update_markscheme above.

@unify.traced(name="extract_mark_type_explanation{subquestion}")
def extract_mark_type_explanation(
    subquestion: str,
    markscheme: str,
    marks_to_consider=None,
):
    # collect all mark labels (M, A, B and SC marks) referenced in the markscheme
    m_marks = sorted(list(set(re.findall(r"M\d+", markscheme))))
    a_marks = sorted(list(set(re.findall(r"A\d+", markscheme))))
    b_marks = sorted(list(set(re.findall(r"B\d+", markscheme))))
    sc_marks = sorted(list(set(re.findall(r"SC\d+", markscheme))))
    if not any(m_marks + a_marks + b_marks + sc_marks):
        return ""
    full_exp = "As a recap, {mark_types_explanation}"
    for marks in (m_marks, a_marks, b_marks, sc_marks):
        for mark in marks:
            if marks_to_consider and mark not in marks_to_consider:
                continue
            key = "".join(c for c in mark if not c.isdigit())
            num_marks = int("".join(c for c in mark if c.isdigit()))
            exp = mark_types[key]
            exp = exp.replace(
                "{num}",
                str(num_marks),
            ).replace(
                "{num_marks}",
                "1 mark" if num_marks == 1 else f"{num_marks} marks",
            )
            full_exp = full_exp.replace(
                "{mark_types_explanation}",
                exp + "\n{mark_types_explanation}",
            )
    return full_exp.replace("{mark_types_explanation}", "")
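
To make its behaviour concrete, here’s a quick illustrative call. The markscheme snippet below is made up, and the exact wording of the output depends on the mark_types dictionary defined earlier in the tutorial:

# Illustrative only: a hypothetical markscheme mentioning an M mark and an A mark.
example_markscheme = "M1 for rearranging the equation, A1 for x = 3."

# Returns a recap string with one explanation per mark label found, roughly:
# "As a recap, <explanation of M1>\n<explanation of A1>\n"
# (the first argument is only used to name the trace).
print(extract_mark_type_explanation("_a", example_markscheme))

# Restricting to a single mark via marks_to_consider, as the per-mark agents will do:
print(extract_mark_type_explanation("_a", example_markscheme, ["A1"]))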

Let’s now create the system message for our mark reasoning agent, again with the explicit {mark_types_explanation} placeholder.

mark_system_message = """
Your task is to determine whether mark {mark} should be awarded for the following student's answer to question {subq}, based on the provided markscheme.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}


The *overall* question is:

{question}

{prior_context}

The specific question you need to mark is:

{subquestion}


Their answer to this specific question is:

{answer}


The markscheme for this specific question, with the mark in question {mark} shown in bold and followed by `(to consider!)`, is as follows:

{markscheme}

{mark_types_explanation}
You should populate the `thoughts` field with your thoughts on whether the specific mark {mark} identified within the markscheme should be awarded for the student's answer. This {mark} mark might be irrelevant given the student's approach or answer, in which case just respond `False` for the `should_award` field, and explain this in the `thoughts` field. Please think carefully about your decision on awarding this {mark} mark, considering the general guidelines.
""".replace(
    "{general_guidelines}",
    general_guidelines,
)
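
The highlighting mentioned above is just a string replacement (performed in call_subq_agent below); on a hypothetical markscheme chunk it looks like this:

# Hypothetical markscheme chunk, with A1 as the mark under consideration:
chunk = "A1 for x = 3"
print(chunk.replace("A1", "**A1** (to consider!)"))
# **A1** (to consider!) for x = 3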

Let’s first define call_subq_agent, which performs the mark-by-mark reasoning via several LLM calls, one per candidate mark.

@unify.traced(name="call_subq_agent_{subq}")
def call_subq_agent(
    example_id,
    subq,
    subq_agent,
    markscheme,
    parsed_markscheme,
    mark_sys_msg,
):
    # one dedicated agent per candidate mark found in the markscheme
    mark_agents = [[k, agent.copy()] for k in [itm[0] for itm in parsed_markscheme]]
    [agnt.set_response_format(ThoughtsAndAwardDecision) for _, agnt in mark_agents]
    for i, (k, v) in enumerate(parsed_markscheme):
        mark_agents[i][1].set_system_message(
            mark_sys_msg.replace(
                "{mark}",
                k,
            )
            .replace(
                "{markscheme}",
                textwrap.indent(
                    markscheme.replace(
                        v,
                        v.replace(k, f"**{k}** (to consider!)"),
                    ),
                    " " * 4,
                ),
            )
            .replace(
                "{mark_types_explanation}",
                extract_mark_type_explanation(
                    f"_{k}({i})" if k != "_" else "",
                    markscheme,
                    [k],
                ),
            ),
        )
    if mark_agents:
        explanation = "An expert marker has already taken a look at the student's answer, and they have made the following observations for each of the candidate marks mentioned in the markscheme. You should pay special attention to these observations."
        vals = unify.map(
            lambda i, m, a: json.loads(a.generate(tags=[m + f"({i})"])),
            [tuple([i] + item) for i, item in enumerate(mark_agents)],
            name=f"Evals[{example_id}]->SubQAgent[{subq}]->MarkAgent",
        )
        # disambiguate repeated mark labels by appending an occurrence index,
        # e.g. a markscheme with two B1 marks yields keys B1(0) and B1(1)
        keys = list()
        for k, _ in mark_agents:
            keys.append(
                k + f"({len([ky for ky in keys if k in ky])})",
            )
        mark_obs_dict = dict(zip(keys, vals))
        mark_observations = (
            explanation
            + "\n\n"
            + pretty_print_dict(
                mark_obs_dict,
                indent=4,
            )
        )
    else:
        mark_observations = ""
    subq_agent.set_system_message(
        subq_agent.system_message.replace(
            "{mark_observations}",
            mark_observations,
        ),
    )
    ret = subq_agent.generate(tags=[subq])
    if "```" in ret:
        # strip any markdown code fences from the response before parsing
        ret = ret.split("```")[-2].lstrip("json")
    ret = json.loads(ret)
    if not mark_agents:
        return ret
    ret["reasoning"] = {
        **mark_obs_dict,
        "overall_thoughts": ret["reasoning"],
    }
    return ret
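
As a reminder, the structured response formats used above are defined elsewhere in the tutorial; they look roughly like the sketch below, with the field names matching how they are accessed in the code (`thoughts`/`should_award` for the per-mark agents, `reasoning`/`marks` for the per-subquestion agents):

from pydantic import BaseModel


class ThoughtsAndAwardDecision(BaseModel):
    thoughts: str
    should_award: bool


class MarksAndReasoning(BaseModel):
    reasoning: str
    marks: int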

Let’s now update call_agent, making use of our call_subq_agent function, which processes a single sub-question.

@unify.traced
def call_agent(
    example_id,
    subq_system_message,
    mark_system_message,
    question_num,
    question,
    sub_questions,
    markscheme,
    answer,
    available_marks,
):
    subq_agents = {k: agent.copy() for k in markscheme.keys()}
    with_subqs = len(markscheme) > 1
    response_formats = {k: MarksAndReasoning for k in markscheme}
    [
        agnt.set_response_format(rf)
        for agnt, rf in zip(
            subq_agents.values(),
            response_formats.values(),
        )
    ]
    mark_sys_msgs = list()
    parsed_markschemes = list()
    for i, k in enumerate(markscheme.keys()):
        parsed_markscheme = parse_marks_from_markscheme(
            f"_{k}" if k != "_" else "",
            markscheme[k],
        )
        parsed_markschemes.append(parsed_markscheme)
        this_markscheme = markscheme[k]
        # use a separate loop variable here, so we don't clobber the outer `i`
        # (which indexes the current sub-question and is used for prior context)
        for j, (mark, chunk) in enumerate(parsed_markscheme):
            this_markscheme = this_markscheme.replace(
                chunk,
                chunk.replace(
                    mark,
                    f"{mark}({len([m for m, _ in parsed_markscheme[0:j] if m == mark])})",
                ),
            )
        subq_agents[k].set_system_message(
            subq_system_message.replace(
                "{subq}",
                k.replace("_", str(question_num)),
            )
            .replace(
                "{question}",
                textwrap.indent(question, " " * 4),
            )
            .replace(
                "{subquestion}",
                textwrap.indent(sub_questions[k], " " * 4),
            )
            .replace(
                "{markscheme}",
                textwrap.indent(this_markscheme, " " * 4),
            )
            .replace(
                "{mark_types_explanation}",
                textwrap.indent(
                    extract_mark_type_explanation(
                        f"_{k}" if k != "_" else "",
                        markscheme[k],
                    ),
                    " " * 4,
                ),
            )
            .replace(
                "{answer}",
                textwrap.indent(answer[k], " " * 4),
            )
            .replace(
                "{available_marks}",
                str(available_marks[k.replace("_", "total")]),
            )
            .replace(
                "{output_response_explanation}",
                output_response_explanation,
            )
            .replace(
                "{prior_context}",
                (
                    (
                        prior_context_exp
                        + pretty_print_dict(
                            {
                                k: {
                                    "sub-question": sub_questions[k],
                                    "markscheme": markscheme[k],
                                    "answer": answer[k],
                                }
                                for k in list(sub_questions.keys())[0:i]
                            },
                            indent=4,
                        )
                    )
                    if with_subqs and i > 0
                    else ""
                ),
            ),
        )
        mark_sys_msgs.append(
            mark_system_message.replace(
                "{subq}",
                k.replace("_", str(question_num)),
            )
            .replace(
                "{question}",
                textwrap.indent(question, " " * 4),
            )
            .replace(
                "{subquestion}",
                textwrap.indent(sub_questions[k], " " * 4),
            )
            .replace(
                "{answer}",
                textwrap.indent(answer[k], " " * 4),
            )
            .replace(
                "{prior_context}",
                (
                    (
                        prior_context_exp
                        + pretty_print_dict(
                            {
                                k: {
                                    "sub-question": sub_questions[k],
                                    "markscheme": markscheme[k],
                                    "answer": answer[k],
                                }
                                for k in list(sub_questions.keys())[0:i]
                            },
                            indent=4,
                        )
                    )
                    if with_subqs and i > 0
                    else ""
                ),
            ),
        )
    rets = unify.map(
        lambda *a: call_subq_agent(example_id, *a),
        list(sub_questions.keys()),
        list(subq_agents.values()),
        list(markscheme.values()),
        parsed_markschemes,
        mark_sys_msgs,
        from_args=True,
        name=f"Evals[{example_id}]->SubQAgent",
    )
    return dict(zip(markscheme.keys(), rets))
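
To make the expected argument shapes explicit, here’s a hypothetical single sub-question call. All values below are made up; in practice the arguments come from the dataset via evaluate:

# Hypothetical example, purely to illustrate the expected argument shapes.
# Sub-question keys use "_" when the question has no lettered parts, in which
# case available_marks is looked up under the key "total".
result = call_agent(
    example_id=0,
    subq_system_message=subq_system_message,
    mark_system_message=mark_system_message,
    question_num=1,
    question="Solve 2x + 3 = 9.",
    sub_questions={"_": "Solve 2x + 3 = 9."},
    markscheme={"_": "M1 for rearranging to 2x = 6, A1 for x = 3."},
    answer={"_": "x = 3"},
    available_marks={"total": 2},
)
# result -> {"_": {"reasoning": {..per-mark observations.., "overall_thoughts": ...},
#                  "marks": <marks awarded by the LLM>}}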

We also need to update the evaluate function so that it passes both system messages through to call_agent.

@unify.log
def evaluate(
    example_id,
    question_num,
    question,
    sub_questions,
    student_answer,
    available_marks,
    markscheme,
    correct_marks,
    per_question_breakdown,
    _subq_system_message,
    _mark_system_message,
):
    pred_marks = call_agent(
        example_id,
        _subq_system_message,
        _mark_system_message,
        question_num,
        question,
        sub_questions,
        markscheme,
        student_answer,
        available_marks,
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"]
        for (k, vcor), (_, vpred) in zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d,
        }
        for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values(),
        )
    }
    return error_total
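
As a quick illustration of the two metrics (using plain integers rather than the nested dicts above): diff keeps the sign of over- vs under-marking, while error is absolute, which is why error_total is the value returned.

# Hypothetical marks, to illustrate the diff vs. error distinction:
correct = {"a": 2, "b": 1}
predicted = {"a": 1, "b": 2}
diff = {k: correct[k] - predicted[k] for k in correct}  # {"a": 1, "b": -1}
error = {k: abs(v) for k, v in diff.items()}            # {"a": 1, "b": 1}
print(sum(diff.values()), sum(error.values()))          # 0 2 -> diffs can cancel out, errors cannot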

🧪 Rerun Tests

with unify.Experiment(
    "queries_per_mark",
    overwrite=True,
), unify.Params(
    subq_system_message=subq_system_message,
    mark_system_message=mark_system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [
            dict(
                **d.entries,
                _subq_system_message=subq_system_message,
                _mark_system_message=mark_system_message,
            )
            for d in test_set_10
        ],
        name="Evals",
    )

Great, this seems to have addressed two of the three failures (on this run at least).

Let’s take a look at the traces, to ensure that the system message templates have been implemented correctly, and that each LLM call has its template variables properly populated.

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅

Again, let’s explore what’s going wrong in the next iteration 🔁