The full script for running this iteration can be found here.

πŸ” Inconsistent Formatting

Let’s take a look at our results in the table.

Example 215

Example 215 has the largest error. Let’s take a look at what’s going wrong.

Based on these two, it seems there is a small discrepancy in the text-based box formatting of the question given to the agent: the question as presented to the agent contains four boxes, but the true number is five. We can verify this by looking at the original page, which can be found here.

Looking at this failure, it’s clear that the model would benefit from having the markscheme included in its context, so that it knows what the correct answers are and how to award the marks.

Without the markscheme, the agent is unable to notice this discrepancy, and therefore presumes that a total of four numbers is required for each answer, instead of the necessary five.

πŸ”€ Add Markscheme

Let’s update the system message to include a placeholder for the markscheme.

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The question is:

{question}


The markscheme for the question is:

{markscheme}


Their answer to this question is:

{answer}


As the very final part of your response, simply provide the number of marks on a *new line*, without any additional formatting. For example:

3
"""

We then make sure to substitute the true markscheme into the system message inside call_agent:

@unify.traced
def call_agent(system_msg, question, markscheme, answer, available_marks_total):
    # Copy the agent so the shared system message isn't mutated, then
    # substitute each placeholder in the system message template.
    local_agent = agent.copy()
    local_agent.set_system_message(
        system_msg.replace(
            "{question}",
            textwrap.indent(question, " " * 4),
        )
        .replace(
            "{markscheme}",
            pretty_print_dict(markscheme, indent=4),
        )
        .replace(
            "{answer}",
            pretty_print_dict(answer, indent=4),
        )
        .replace(
            "{available_marks_total}",
            str(available_marks_total),
        ),
    )
    return local_agent.generate()
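
As an aside, call_agent relies on the pretty_print_dict helper introduced in an earlier iteration. A minimal sketch of such a helper, assuming it simply renders the dict as indented JSON, might look like this (the real implementation may differ):

import json
import textwrap


def pretty_print_dict(d, indent=0):
    # Render the dict as readable JSON, then indent every line so the block
    # sits neatly inside the system message template.
    return textwrap.indent(json.dumps(d, indent=4), " " * indent)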

We also need to update evaluate accordingly:

@unify.log
def evaluate(
    question,
    student_answer,
    available_marks_total,
    markscheme,
    correct_marks_total,
    _system_message,
):
    pred_marks = call_agent(
        _system_message,
        question,
        markscheme,
        student_answer,
        available_marks_total,
    )
    # Walk backwards through the response and take the last line containing
    # digits as the predicted total number of marks.
    _pred_marks_split = pred_marks.split("\n")
    pred_marks_total, diff_total, error_total = None, None, None
    for _substr in reversed(_pred_marks_split):
        _extracted = "".join([c for c in _substr if c.isdigit()])
        if _extracted != "":
            pred_marks_total = int(_extracted)
            diff_total = correct_marks_total - pred_marks_total
            error_total = abs(diff_total)
            break
    pred_marks = {"_": {"marks": pred_marks_total, "rationale": pred_marks}}
    return error_total
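
The mark-extraction logic is easy to sanity check in isolation. Here’s a quick illustrative example, where mock_response is just a stand-in for what call_agent would return (not a real model output):

# Illustrative only: a mocked agent response that ends with the awarded
# marks on a new line, as requested in the system message.
mock_response = (
    "The student answered parts (a) and (b) correctly, "
    "but made an arithmetic slip at the end.\n"
    "2"
)

pred_total = None
for line in reversed(mock_response.split("\n")):
    extracted = "".join(c for c in line if c.isdigit())
    if extracted != "":
        pred_total = int(extracted)
        break

assert pred_total == 2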

πŸ§ͺ Rerun Tests

with unify.Experiment("add_markscheme", overwrite=True), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )
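
For intuition, the unify.map call above behaves roughly like the sequential loop below, the main differences being that unify.map runs the evaluations concurrently and labels the batch with the provided name (this is just a sketch, not the library’s implementation):

# Roughly equivalent sequential version of the unify.map call above:
# evaluate each test example in turn, passing its fields as keyword
# arguments along with the shared system message.
for d in test_set_10:
    evaluate(**d.entries, _system_message=system_message)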

Great! Our mean error has gone down from 0.8 to 0.3; we’re definitely making progress πŸ’ͺ

Let’s take a look at the traces to ensure that the system message template has been implemented correctly, and that the template variables are populated properly in each LLM call’s system message.

It seems as though everything was implemented correctly, and the per-LLM system messages look good βœ…

We still need to address the remaining errors, though. Let’s explore what’s going wrong in another iteration.