1: Add Markscheme - Unify Documentation

The full script for running this iteration can be found here.

🔍 Inconsistent Formatting

Let’s take a look at our results in the table.

Example 215

Example 215 has the largest error. Let’s take a look at what’s going wrong.

📄 PDFs

❓ Parsed Question [8 Marks]

In this row of boxes, you start with 5 and 7.

┌───┬───┬────┬────┬────┐
│ 5 │ 7 │    │    │    │
└───┴───┴────┴────┴────┘

You add 5 and 7 to get 12 to go in the third box.
You add 7 and 12 to get 19 to go in the fourth box.
You add 12 and 19 to get 31 to go in the fifth box.

┌───┬───┬────┬────┬────┐
│ 5 │ 7 │ 12 │ 19 │ 31 │
└───┴───┴────┴────┴────┘

Complete these rows of boxes using the rule shown above.

(a)

┌───┬───┬────┬────┐
│ 4 │ 6 │    │    │ [1]
└───┴───┴────┴────┘

(b)

┌────┬────┬────┬────┐
│    │    │ 34 │ 55 │ [2]
└────┴────┴────┴────┘

(c)

┌───┬───┬────┬────┐
│ a │ b │    │    │ [2]
└───┴───┴────┴────┘

(d) Use your answer to (c) to help you fill in the missing numbers in this row of boxes.

┌───┬────┬────┬────┐
│ 6 │    │    │ 57 │ [3]
└───┴────┴────┴────┘

📝 Student's Answer

Based on these two, there seems to be a small discrepancy in the text-based box formatting of the question given to the agent: the parsed question shows four boxes per row, but the true number is five. We can verify this by looking at the original page, which can be found here.
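
As a quick cross-check of the box count, the rule can be applied to part (d) under both layouts. This is only a sketch and not part of the original script: `solve_row` is a hypothetical helper, and it assumes that only the first and last boxes are given and that whole-number answers are expected.

```python
from fractions import Fraction


def solve_row(first, last, n_boxes):
    """Fill a row where each box is the sum of the previous two,
    given only the first and last values.

    Returns None if the second box would not be a whole number.
    """
    # Track each box as coeff_a * first + coeff_b * second.
    coeffs = [(1, 0), (0, 1)]
    while len(coeffs) < n_boxes:
        (a1, b1), (a2, b2) = coeffs[-2], coeffs[-1]
        coeffs.append((a1 + a2, b1 + b2))
    a_last, b_last = coeffs[-1]
    # Solve a_last * first + b_last * second == last for the second box.
    second = Fraction(last - a_last * first, b_last)
    if second.denominator != 1:
        return None
    row = [first, int(second)]
    while len(row) < n_boxes:
        row.append(row[-2] + row[-1])
    return row


print(solve_row(6, 57, 4))  # None
print(solve_row(6, 57, 5))  # [6, 15, 21, 36, 57]
```

Under these assumptions, a four-box row starting at 6 and ending at 57 has no whole-number solution, while a five-box row does, which is consistent with the original question using five boxes.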

☑️ Parsed Markscheme (Not Given to Agent)

✅ Correct Marks [6/8] Rationale (Not Given to Agent)

🤖 Predicted Marks [1/8] Rationale

Looking at this failure, it’s obvious that the model would benefit from having the markscheme included in the context, so that it knows what the correct answers are and how to award the marks. Without the markscheme, the agent is unable to notice this discrepancy, and therefore presumes that a total of four numbers are required for each answer, instead of the necessary five.

🔀 Add Markscheme

Let’s update the system message to include a placeholder for the markscheme.

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The question is:

{question}


The markscheme for the question is:

{markscheme}


Their answer to this question is:

{answer}


As the very final part of your response, simply provide the number of marks on a *new line*, without any additional formatting. For example:

3
"""

We then make sure to update the system message with the true markscheme during call_agent:

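import textwrap

# `unify`, `agent` and `pretty_print_dict` are assumed to be set up earlier in the script.
@unify.traced
def call_agent(system_msg, question, markscheme, answer, available_marks_total):
    # Work on a copy so that each call gets its own system message.
    local_agent = agent.copy()
    # Fill in each template placeholder, indenting the inserted values by four spaces.
    local_agent.set_system_message(
        system_msg.replace(
            "{question}",
            textwrap.indent(question, " " * 4),
        )
        .replace(
            "{markscheme}",
            pretty_print_dict(markscheme, indent=4),
        )
        .replace(
            "{answer}",
            pretty_print_dict(answer, indent=4),
        )
        .replace(
            "{available_marks_total}",
            str(available_marks_total),
        ),
    )
    return local_agent.generate()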
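
The placeholder substitution can be tried in isolation, without an agent. The following is a minimal sketch: `fill_template` and `pretty_print_dict_sketch` are hypothetical names (the latter is only a stand-in for the script's `pretty_print_dict`, whose real output format may differ), and the template and values are made up for illustration.

```python
import json
import textwrap


def pretty_print_dict_sketch(d, indent=0):
    """Hypothetical stand-in for pretty_print_dict: JSON-dump, then indent."""
    return textwrap.indent(json.dumps(d, indent=2), " " * indent)


def fill_template(template, **values):
    """Replace each {name} placeholder with its value."""
    for name, value in values.items():
        template = template.replace("{" + name + "}", str(value))
    return template


# Made-up template and values, purely for illustration.
template = (
    "Max marks: {available_marks_total}\n"
    "Question:\n{question}\n"
    "Markscheme:\n{markscheme}"
)
filled = fill_template(
    template,
    available_marks_total=8,
    question=textwrap.indent("Complete the row.", " " * 4),
    markscheme=pretty_print_dict_sketch({"a": "example answer"}, indent=4),
)
print(filled)
```

One practical property of `str.replace` here is that it only touches the named placeholders, so any other literal braces in the template are left alone, whereas `str.format` would try to interpret them.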

We also need to update evaluate accordingly:

@unify.log
def evaluate(
    question,
    student_answer,
    available_marks_total,
    markscheme,
    correct_marks_total,
    _system_message,
):
    # Ask the agent to mark the answer, now with the markscheme in context.
    pred_marks = call_agent(
        _system_message,
        question,
        markscheme,
        student_answer,
        available_marks_total,
    )
    _pred_marks_split = pred_marks.split("\n")
    pred_marks_total, diff_total, error_total = None, None, None
    # Scan from the last line backwards and use the first line that contains digits.
    for _substr in reversed(_pred_marks_split):
        _extracted = "".join([c for c in _substr if c.isdigit()])
        if _extracted != "":
            pred_marks_total = int(_extracted)
            diff_total = correct_marks_total - pred_marks_total
            error_total = abs(diff_total)
            break
    # Re-bind pred_marks to a dict holding the parsed marks and the full response.
    pred_marks = {"_": {"marks": pred_marks_total, "rationale": pred_marks}}
    return error_total
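
The mark-extraction step inside evaluate can be exercised on its own. The sketch below pulls that loop into a hypothetical helper, `extract_final_marks`, and adds an illustrative variant, `extract_final_marks_strict`, which is not in the original script. The variant exists to show one edge case: joining every digit on a line turns a final line such as "3/8" into 38.

```python
import re


def extract_final_marks(response):
    """Same approach as evaluate: join all digits on the last line that has any."""
    for line in reversed(response.split("\n")):
        digits = "".join(c for c in line if c.isdigit())
        if digits:
            return int(digits)
    return None


def extract_final_marks_strict(response):
    """Illustrative variant: take only the first run of digits on that line."""
    for line in reversed(response.split("\n")):
        match = re.search(r"\d+", line)
        if match:
            return int(match.group())
    return None


print(extract_final_marks("Part (a) is correct.\n3"))  # 3
print(extract_final_marks("Total: 3/8"))               # 38
print(extract_final_marks_strict("Total: 3/8"))        # 3
```

The system message asks for the number alone on a new line, so the first case is the expected one; the other two only matter if the model ignores that instruction.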

🧪 Rerun Tests

with unify.Experiment("add_markscheme", overwrite=True), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )

Great! Our mean error has gone down from 0.8 to 0.3, so we’re definitely making progress 💪 Let’s take a look at the traces to ensure that the system message template has been implemented correctly, and that each LLM call has the template variables in its system message populated correctly.
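
Alongside inspecting the traces by eye, a small check can flag any system message that still contains an unfilled template variable. This is a sketch only: `unfilled_placeholders` is a hypothetical helper, and how the system messages are pulled out of the traces is left out.

```python
import re

# The four template variables used in the system message above.
PLACEHOLDER = re.compile(
    r"\{(question|markscheme|answer|available_marks_total)\}"
)


def unfilled_placeholders(system_message):
    """Return the sorted names of any template variables left unreplaced."""
    return sorted(set(PLACEHOLDER.findall(system_message)))


print(unfilled_placeholders("The question is:\n\n{question}\n\n{markscheme}"))
# ['markscheme', 'question']
print(unfilled_placeholders("The question is:\n\n    Complete the row."))
# []
```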

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅ We still need to try to address the remaining errors though. Let’s explore what’s going wrong in another iteration.
