Skip to main content
The full script for running this iteration can be found here.

πŸ” Inconsistent Formatting

Let’s take a look at our results in the table.

Example 215

Example 215 has the largest error. Let’s take a look at what’s going wrong.
In this row of boxes, you start with 5 and 7.
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ 5 β”‚ 7 β”‚    β”‚    β”‚    β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
You add 5 and 7 to get 12 to go in the third box.
You add 7 and 12 to get 19 to go in the fourth box.
You add 12 and 19 to get 31 to go in the fifth box.
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ 5 β”‚ 7 β”‚ 12 β”‚ 19 β”‚ 31 β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
Complete these rows of boxes using the rule shown above.(a)
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ 4 β”‚ 6 β”‚    β”‚    β”‚ [1]
β””β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
(b)
β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚    β”‚    β”‚ 34 β”‚ 55 β”‚ [2]
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
(c) Complete this row of boxes, writing your expressions in their simplest form.
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ a β”‚ b β”‚    β”‚    β”‚ [2]
β””β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
(d) Use your answer to (c) to help you fill in the missing numbers in this row of boxes.
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ 6 β”‚    β”‚    β”‚ 57 β”‚ [3]
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
(a) The next three boxes are 10, 16, and 26. So the row becomes 4, 6, 10, 16, 26.(b) The missing boxes are 8, 13, and 21, making the sequence 8, 13, 21, 34, 55.(c) a + b, a + 2b, a + 3b(d) By setting 2a + 3b = 57, we substitute a = 6 to get 12 + 3b = 57, so b = 15. Then, adding consecutively gives 6, 15, 21, 37, with the final box stated as 57 in the question.
Based on these two, it seems as though there is a small discrepency in the text-based box formatting in the question given to the agent, with the question presenting four boxes per question but the true number being five. We can verify this by looking at the original page, which can be found here.
(a) Answer: 10, 16, 26. Marks: 1(b) Answer: 8, 13, 21. Marks: 2. Guidance: M1 for one correct subtraction of two boxes(c) Answer: a + b, a + 2b, 2a + 3b. Marks: 2. Guidance: M1 for two expressions correct(d) Answer: 15, 21, 36. Marks: 3. Guidance: M1 for their β€˜2a + 3b’ = 57, M1 for substituting a = 6 into their final expression and solving for b
(a) 1/1 Mark. By adding consecutive pairs (4+6=10, 6+10=16, 10+16=26), we match the markscheme’s answer of 10,16,26. This secures the single mark (1) allocated for part (a).(b) 2/2 Marks. Here, we deduce the missing numbers by working backwards from 34 and 55 or by applying the same rule forward. According to the markscheme (2 marks), we get full credit for the fully correct set of missing numbers, with one method mark (M1) for a correct subtraction and a final accuracy mark (A1) for the correct sequence.(c) 1/2 Marks. We used the rule to form expressions but made an error in the final expression: the correct set per the markscheme should be (a+b), (a+2b), (2a+3b). We have two correct expressions, so we earn the method mark (M1), but lose the accuracy mark for an incorrect final expression.(d) 2/3 Marks. Using part (c)’s structure, we form 2a+3b, set it to 57, and solve correctly for b = 15 (which earns the two method marks, M1 and M1). However, we introduced an error when adding 15 + 21 = 37 (should be 36), so we do not achieve the final accuracy mark (A1) for a fully correct final row.
Let’s check each part:(a) The correct row with starting numbers 4 and 6 should give 4, 6, 4+6 = 10, and then 6+10 = 16. The answer given β€œ4, 6, 10, 16, 26” is one extra term. (0/1 mark)(b) For the row with boxes β€œ__, __, 34, 55” the correct answer would be 13, 21, 34, 55 (since 13 + 21 = 34 and 21 + 34 = 55). The answer given β€œ8, 13, 21, 34, 55” does not match the required four‐box row. (0/2 marks)(c) For the row β€œa, b, __, __” the missing boxes should be expressed as a + b and a + 2b in simplest form. The answer β€œa + b, a + 2b, a + 3b” gives an extra term. (1/2 marks)(d) In the row β€œ6, __, __, 57”, if we interpret the pattern as in (c) with a = 6 so that the boxes are 6, b, 6+b, 6+2b and 6 + 2b = 57, then b = 25.5. The given answer instead sets up a different equation and finds b = 15 leading to an inconsistent sequence. (0/3 marks)Total marks awarded: 0 + 0 + 1 + 0 = 11
Looking at this failure, it’s obvious that the model would benefit from having the markscheme included in the context, so that it knows what the correct answers are, and how to award the marks. Without the markscheme, the agent is unable to notice this discrepency, and therefore presumes that a total of four numbers are reqired for each answer, instead of the necessary five.

πŸ”€ Add Markscheme

Let’s update the system message to include a placeholder for the markscheme.
system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The question is:

{question}


The markscheme for the question is:

{markscheme}


Their answer to this question is:

{answer}


As the very final part of your response, simply provide the number of marks on a *new line*, without any additional formatting. For example:

3
"""
We then make sure to update the system message with the true markscheme during call_agent:
@unify.traced
def call_agent(system_msg, question, markscheme, answer, available_marks_total):
    local_agent = agent.copy()
    local_agent.set_system_message(
        system_msg.replace(
            "{question}",
            textwrap.indent(question, " " * 4),
        )
        .replace(
            "{markscheme}",
            pretty_print_dict(markscheme, indent=4),
        )
        .replace(
            "{answer}",
            pretty_print_dict(answer, indent=4),
        )
        .replace(
            "{available_marks_total}",
            str(available_marks_total),
        ),
    )
    return local_agent.generate()
We also need to update evaluate accordingly:
@unify.log
def evaluate(
    question,
    student_answer,
    available_marks_total,
    markscheme,
    correct_marks_total,
    _system_message,
):
    pred_marks = call_agent(
        _system_message,
        question,
        markscheme,
        student_answer,
        available_marks_total,
    )
    _pred_marks_split = pred_marks.split("\n")
    pred_marks_total, diff_total, error_total = None, None, None
    for _substr in reversed(_pred_marks_split):
        _extracted = "".join([c for c in _substr if c.isdigit()])
        if _extracted != "":
            pred_marks_total = int(_extracted)
            diff_total = correct_marks_total - pred_marks_total
            error_total = abs(diff_total)
            break
    pred_marks = {"_": {"marks": pred_marks_total, "rationale": pred_marks}}
    return error_total

πŸ§ͺ Rerun Tests

with unify.Experiment("add_markscheme", overwrite=True), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )
Great! Our mean error has gone down from 0.8 to 0.3, we’re definitely making progress πŸ’ͺ Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and each LLM call has the template variables in the system message populated correctly.
It seems as though everything was implemented correctly, and the per-LLM system messages look good βœ… We still need to try to address the remaining errors though. Let’s explore what’s going wrong in another iteration.
⌘I