5: Better Align Guidelines + Clarify Strict Reasoning

As usual, let’s take a look and explore why the agent might be failing on the remaining examples 🕵️

The full script for running this iteration can be found here.

🔍 Ignoring Mark Types

Looking a bit deeper, it seems we’re getting the same failures as before. These errors are quite persistent, so we might need to make more radical changes to address them. Let’s revisit each of these failures in more detail, to see how we can rectify them.

Given how consistently these errors occur, it’s useful to consider the failing examples across all experiments, and to compare the different rationales our agent gives in each run. Our evaluation runs serve two purposes: they are tests for comparing parameter configurations (when the agent output depends strongly on the changed parameters 🔀), and they also act as a kind of sampling from the noise distribution for each test set example (when the agent output depends less strongly on the changed parameters 📊). This is one benefit of the flexible spreadsheet design. Experiments are not indivisible atomic groups; they’re simply labels, and the raw evaluation data can be juggled in whatever way makes sense to you as the user.

Let’s first recap the fixed question ❓, student answer 📝, markscheme ☑️, and correct marks ✅, before unpacking the various agent justifications 🤖 across all experiments.
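
As a rough illustration of treating experiments as labels rather than atomic groups, the sketch below regroups evaluation records by example instead of by experiment, and computes how often each example fails across runs. The record fields (`experiment`, `example_id`, `predicted`, `correct`), the helper names and the values are all hypothetical placeholders, not the actual log schema or real results.

```python
from collections import defaultdict

# Hypothetical evaluation records. In practice these would be exported from
# the evaluation logs; the field names and values here are placeholders.
records = [
    {"experiment": "run_1", "example_id": "A", "predicted": 2, "correct": 1},
    {"experiment": "run_2", "example_id": "A", "predicted": 0, "correct": 1},
    {"experiment": "run_1", "example_id": "B", "predicted": 3, "correct": 3},
    {"experiment": "run_2", "example_id": "B", "predicted": 2, "correct": 3},
]


def group_by_example(records):
    """Regroup records so that each example collects its runs from all experiments."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["example_id"]].append(record)
    return dict(grouped)


def failure_rates(records):
    """Fraction of runs, per example, where the predicted marks were wrong."""
    return {
        example_id: sum(r["predicted"] != r["correct"] for r in runs) / len(runs)
        for example_id, runs in group_by_example(records).items()
    }


rates = failure_rates(records)
```

An example with a failure rate close to 1.0 across many differently configured runs is likely failing for a systematic reason, rather than because of sampling noise.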

Example 207

📄 PDFs

❓ Parsed Question [2 Marks]

📝 Student's Answer

☑️ Parsed Markscheme

✅ Correct Marks [1/2] Rationale

🤖 Predicted Marks [x/2] Rationales (with added insights💡)

Clearly, the agent is really failing to pick up on the SC1 mark from the markscheme, irrespective of the various parameter changes we’ve made across each experiment run.

Example 261

📄 PDFs

❓ Parsed Question [5 Marks]

📝 Student's Answer

☑️ Parsed Markscheme

✅ Correct Marks [1/5] Rationale

🤖 Predicted Marks [x/5] Rationale (with added insights💡)

Again, regardless of our parameter variations, it seems as though the agent is not correctly understanding the M1 mark in the markscheme, which clearly states M1 for 10 × (2/5) = 4 litres red or for 10 × (3/5) = 6 litres white. Even on the two occasions where it got things right, it feels like a lucky guess, as the awarded mark was not justified via the markscheme’s M1 mark.

Example 132 (c) is failing for a slightly different reason, so we’ll consider it separately. In general, as we get deeper into the evaluation iterations, it’s often wise to consider multiple failure modes at once. Larger evals can be expensive to run, and you generally want to use all of the newly gained knowledge to improve your agent in the next evaluation run, even if this means making several unrelated changes to address several unrelated failure modes.

🔍 Lenient Reasoning

Let’s perform the same deeper analysis for Example 132 (c), and see what’s going wrong in this case.

Example 132 (c)

📄 PDFs

❓ Parsed Question [2 Marks]

📝 Student's Answer

☑️ Parsed Markscheme

✅ Correct Marks [1/2] Rationale

🤖 Predicted Marks [x/2] Rationale (with added insights💡)

In this case, the problem is not so much a misunderstanding of the mark types. It’s more an issue of hallucinating things the student never said, or of being very liberal in interpreting the answer. Perhaps the agent is confusing the markscheme with the student answer. The student’s answer “we cannot be certain this trend always holds true” is very different to “there may be other factors involved”. The former (incorrect) assumes the trend does exist, but just might not continue, while the latter (correct) is a point about correlation != causation, indicating that the apparent causation may not exist at all.

🔀 Better Align Guidelines

Firstly, the recurring problem for Examples 207 and 261 seems to be that the agent doesn’t remember and/or understand the different types of marks (B1, SC1, M1 etc.). Let’s be more explicit: we’ll parse each sub-question markscheme for the mark types it uses, add the relevant explanations directly to that sub-question’s markscheme, and see if this improves performance. Let’s first create a dictionary with the mark type explanations, written in a more direct manner to accompany the sub-question specific markschemes and to make them easier to parse:

mark_types = {
    "M": "M{num} ({num_marks}) should be awarded if a correct method is used, and should not be lost for purely numerical errors.",
    "A": "A{num} ({num_marks}) should be awarded for an accurate answer, and this depends on preceding M (method) marks. If preceding M (method) marks are not awarded, then A{num} cannot be awarded.",
    "B": "B{num} ({num_marks}) should be awarded for the correct final answer, a partially correct answer, or a correct intermediate stage (depending on how this is expressed and explained below). B{num} is independent of M (method) marks.",
    "SC": "SC{num} ({num_marks}) should be awarded for the special cases explained below, which are worthy of some credit.",
}

Let’s then write a simple function to update each sub-question specific markscheme, prepending the markscheme with the relevant definitions from our mark_types dict, so that the agent has all the relevant information close at hand:

import re


@unify.traced(name="update_markscheme{subquestion}")
def update_markscheme(subquestion: str, markscheme: str):
    m_marks = sorted(list(set(re.findall(r"M\d+", markscheme))))
    a_marks = sorted(list(set(re.findall(r"A\d+", markscheme))))
    b_marks = sorted(list(set(re.findall(r"B\d+", markscheme))))
    sc_marks = sorted(list(set(re.findall(r"SC\d+", markscheme))))
    if not any(m_marks + a_marks + b_marks + sc_marks):
        return markscheme
    markscheme = (
        "{mark_types}With this in mind, marks should be awarded as follows:\n"
        + markscheme
    )
    for marks in (m_marks, a_marks, b_marks, sc_marks):
        for mark in marks:
            key = "".join(c for c in mark if not c.isdigit())
            num_marks = int("".join(c for c in mark if c.isdigit()))
            explanation = mark_types[key]
            explanation = explanation.replace(
                "{num}",
                str(num_marks),
            ).replace(
                "{num_marks}",
                "1 mark" if num_marks == 1 else f"{num_marks} marks",
            )
            markscheme = markscheme.replace(
                "{mark_types}",
                explanation + "\n{mark_types}",
            )
    markscheme = markscheme.replace(
        "{mark_types}",
        "",
    )
    return markscheme
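
To make the effect of this transformation concrete, here is a minimal standalone sketch of the same idea. It is not the traced function above: `prepend_mark_type_definitions`, the shortened `MARK_TYPES` explanations and the example markscheme text are all hypothetical stand-ins, and the `\b` word boundary in the regex is an extra assumption to avoid matching inside longer tokens.

```python
import re

# Shortened, hypothetical stand-ins for the mark type explanations.
MARK_TYPES = {
    "M": "M{num} ({num_marks}) is awarded for a correct method.",
    "A": "A{num} ({num_marks}) is awarded for an accurate answer.",
    "B": "B{num} ({num_marks}) is independent of method marks.",
    "SC": "SC{num} ({num_marks}) is awarded for the special cases below.",
}


def prepend_mark_type_definitions(markscheme: str) -> str:
    """Prepend an explanation for every mark type found in the markscheme."""
    explanations = []
    for prefix in ("M", "A", "B", "SC"):
        # Collect the unique marks of this type, e.g. ["M1", "M2"].
        for mark in sorted(set(re.findall(rf"\b{prefix}\d+", markscheme))):
            num = int(mark[len(prefix):])
            marks_text = "1 mark" if num == 1 else f"{num} marks"
            explanations.append(
                MARK_TYPES[prefix]
                .replace("{num}", str(num))
                .replace("{num_marks}", marks_text)
            )
    if not explanations:
        # No recognised mark types, so leave the markscheme untouched.
        return markscheme
    return (
        "\n".join(explanations)
        + "\nWith this in mind, marks should be awarded as follows:\n"
        + markscheme
    )


# Hypothetical markscheme text, for illustration only.
example = "M1 for 3 × 4 seen\nA1 for 12\nSC1 for an answer of 12 with no working"
updated = prepend_mark_type_definitions(example)
print(updated)
```

Each sub-question now carries only the definitions for the mark types it actually uses, placed directly above the marking instructions they apply to.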

Let’s now update our call_agent function such that the markscheme changes are applied dynamically before the markscheme is passed to the agent:

@unify.traced
def call_agent(
    system_msg,
    question,
    sub_questions,
    markscheme,
    answer,
    available_marks_total,
):
    local_agent = agent.copy()
    with_subqs = len(markscheme) > 1
    response_format = create_response_format(
        list(markscheme.keys()) if with_subqs else None,
    )
    local_agent.set_response_format(response_format)
    if with_subqs:
        output_response_exp = output_response_explanations["with_subqs"]
        output_response_exp = output_response_exp.replace(
            "{subquestions}",
            ", ".join(list(markscheme.keys())),
        )
    else:
        output_response_exp = output_response_explanations["without_subqs"]
    markscheme = {
        k: update_markscheme(f"_{k}" if k != "_" else "", v)
        for k, v in markscheme.items()
    }
    local_agent.set_system_message(
        system_msg.replace(
            "{question}",
            textwrap.indent(question, " " * 4),
        )
        .replace(
            "{markscheme}",
            pretty_print_dict(markscheme, indent=4),
        )
        .replace(
            "{answer}",
            pretty_print_dict(answer, indent=4),
        )
        .replace(
            "{available_marks_total}",
            str(available_marks_total),
        )
        .replace(
            "{questions_markscheme_and_answers}",
            pretty_print_dict(
                {
                    k: {
                        "sub-question": sub_questions[k],
                        "markscheme": markscheme[k],
                        "answer": answer[k],
                    }
                    for k in sub_questions.keys()
                },
                indent=4,
            ),
        )
        .replace(
            "{output_response_explanation}",
            output_response_exp,
        ),
    )
    ret = local_agent.generate()
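    if "```" in ret:
        # Take the last fenced block and drop an optional "json" language tag.
        # removeprefix (Python 3.9+) removes the exact prefix, whereas
        # lstrip("json") strips any leading run of the characters j, s, o, n.
        ret = ret.split("```")[-2].removeprefix("json")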
    ret = response_format.model_validate_json(ret).model_dump()
    if not with_subqs:
        return {"_": ret}
    return ret
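
The final step of call_agent, pulling a JSON payload out of a response that may be wrapped in a fenced code block, can be sketched in isolation as follows. This is a minimal illustration using only the standard library: `extract_json_payload` and the sample reply are hypothetical, and `json.loads` stands in for the response format validation used above.

```python
import json

# Built programmatically so that this example can itself sit inside a fence.
FENCE = "`" * 3


def extract_json_payload(raw: str) -> str:
    """Return the JSON text from a reply, with or without a fenced block."""
    if FENCE not in raw:
        return raw.strip()
    # After splitting on the fence, the contents of the last fenced block
    # are the second-to-last element.
    block = raw.split(FENCE)[-2]
    # Drop an optional "json" language tag (str.removeprefix needs Python 3.9+).
    return block.strip().removeprefix("json").strip()


# Hypothetical model reply, for illustration only.
reply = "Here is my marking:\n" + FENCE + 'json\n{"marks": 1}\n' + FENCE
parsed = json.loads(extract_json_payload(reply))
```

Removing the exact prefix avoids a subtle pitfall of `lstrip("json")`, which strips any leading run of the characters `j`, `s`, `o` and `n` rather than the literal word.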

🔀 Clarify Strict Reasoning

We’ve just addressed the recurring problem for Examples 207 and 261, but the failure for Example 132 (c) was quite different. Let’s add another instruction to our general_guidelines variable, in the form of an imaginary extra piece of guidance, to try to avoid the leniency we’ve observed in the marking of Example 132 (c).

general_guidelines = (
    general_guidelines.rstrip("-")
    + """15.
When students are explaining something in their answer, then their explanation must make *exactly* the same point(s) as are made in the markscheme. The wording can be slightly different, but the underlying observations/reasons must be *identical*, unless otherwise stated *explicitly* in the markscheme.

----
"""
)
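
The pattern of stripping the closing separator, appending a numbered guideline and restoring the separator can be wrapped in a small helper. The sketch below assumes the guidelines text ends with a run of dashes (optionally followed by whitespace); `append_guideline` and the sample strings are hypothetical.

```python
def append_guideline(guidelines: str, number: int, text: str) -> str:
    """Append a numbered guideline and restore the closing '----' separator."""
    # Drop trailing whitespace and the closing dashes, if present.
    body = guidelines.rstrip().rstrip("-").rstrip()
    return f"{body}\n\n{number}.\n{text.strip()}\n\n----\n"


# Hypothetical existing guidelines, for illustration only.
existing = "1.\nShow all working.\n\n----"
once = append_guideline(existing, 2, "Explanations must match the markscheme.")
twice = append_guideline(once, 3, "Do not infer unstated reasoning.")
```

Stripping whitespace before the dashes means the helper can be applied repeatedly, even though its own output ends with a newline after the separator.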

🧪 Rerun Tests

Now that we’ve made both of these changes, let’s re-run our evals to see whether either of them addresses the problem it was intended to resolve.

with unify.Experiment(
    "align_guidelines_and_clarify_reasoning",
    overwrite=True,
), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )

Our failure modes are exactly the same as before. Clearly, the agent is still struggling to reason correctly about the different mark types. Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and that each LLM call has the template variables in the system message populated correctly.
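
Alongside inspecting the traces by eye, a quick programmatic check can flag template variables that were never substituted. This is a minimal sketch: `unfilled_placeholders` is a hypothetical helper, and it assumes placeholders always look like `{identifier}`, so JSON snippets in the prompt are not mistaken for placeholders.

```python
import re

# Matches {identifier}-style template variables, but not JSON such as {"a": 1}.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def unfilled_placeholders(message: str) -> list[str]:
    """Return the sorted, unique template variables still present in a message."""
    return sorted(set(PLACEHOLDER.findall(message)))


# Hypothetical rendered system message, for illustration only.
rendered = 'Question: What is 2 + 2?\nMarkscheme: {markscheme}\nReply as {"marks": 1}'
leftover = unfilled_placeholders(rendered)
```

An empty list suggests that every template variable in the system message was populated before the LLM call.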

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅ Again, let’s explore what’s going wrong in the next iteration 🔁