You add 5 and 7 to get 12 to go in the third box.
You add 7 and 12 to get 19 to go in the fourth box.
You add 12 and 19 to get 31 to go in the fifth box.
(a) The next three boxes are 10, 16, and 26. So the row becomes 4, 6, 10, 16, 26.
(b) The missing boxes are 8, 13, and 21, making the sequence 8, 13, 21, 34, 55.
(c) a + b, a + 2b, a + 3b
(d) By setting 2a + 3b = 57, we substitute a = 6 to get 12 + 3b = 57, so b = 15. Then, adding consecutively gives 6, 15, 21, 37, with the final box stated as 57 in the question.
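As a quick sanity check on the arithmetic, the box-filling rule is easy to reproduce in a few lines of Python. The fill_row helper below is our own illustration, not part of the question or the marking pipeline:

def fill_row(first, second, length=5):
    # Repeatedly add the last two boxes to produce the next one.
    row = [first, second]
    while len(row) < length:
        row.append(row[-1] + row[-2])
    return row

print(fill_row(5, 7))  # [5, 7, 12, 19, 31]
print(fill_row(4, 6))  # [4, 6, 10, 16, 26]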
Based on these two,
it seems as though there is a small discrepancy in the text-based box formatting of the question given to the agent,
with each part presenting four boxes when the true number is five.
We can verify this by looking at the original page,
which can be found
here.
(a) Answer: 10, 16, 26. Marks: 1
(b) Answer: 8, 13, 21. Marks: 2. Guidance: M1 for one correct subtraction of two boxes
(c) Answer: a + b, a + 2b, 2a + 3b. Marks: 2. Guidance: M1 for two expressions correct
(d) Answer: 15, 21, 36. Marks: 3. Guidance: M1 for their "2a + 3b" = 57, M1 for substituting a = 6 into their final expression and solving for b
(a) 1/1 Mark. By adding consecutive pairs (4+6=10, 6+10=16, 10+16=26), we match the markscheme's answer of 10, 16, 26. This secures the single mark (1) allocated for part (a).
(b) 2/2 Marks. Here, we deduce the missing numbers by working backwards from 34 and 55 or by applying the same rule forward. According to the markscheme (2 marks), we get full credit for the fully correct set of missing numbers, with one method mark (M1) for a correct subtraction and a final accuracy mark (A1) for the correct sequence.
(c) 1/2 Marks. We used the rule to form expressions but made an error in the final expression: the correct set per the markscheme should be (a+b), (a+2b), (2a+3b). We have two correct expressions, so we earn the method mark (M1), but lose the accuracy mark for an incorrect final expression.
(d) 2/3 Marks. Using part (c)'s structure, we form 2a+3b, set it to 57, and solve correctly for b = 15 (which earns the two method marks, M1 and M1). However, we introduced an error when adding 15 + 21 = 37 (should be 36), so we do not achieve the final accuracy mark (A1) for a fully correct final row.
Let's check each part:
(a) The correct row with starting numbers 4 and 6 should give 4, 6, 4+6 = 10, and then 6+10 = 16. The answer given "4, 6, 10, 16, 26" is one extra term. (0/1 mark)
(b) For the row with boxes "__, __, 34, 55" the correct answer would be 13, 21, 34, 55 (since 13 + 21 = 34 and 21 + 34 = 55). The answer given "8, 13, 21, 34, 55" does not match the required four-box row. (0/2 marks)
(c) For the row "a, b, __, __" the missing boxes should be expressed as a + b and a + 2b in simplest form. The answer "a + b, a + 2b, a + 3b" gives an extra term. (1/2 marks)
(d) In the row "6, __, __, 57", if we interpret the pattern as in (c) with a = 6 so that the boxes are 6, b, 6+b, 6+2b and 6 + 2b = 57, then b = 25.5. The given answer instead sets up a different equation and finds b = 15 leading to an inconsistent sequence. (0/3 marks)
Total marks awarded: 0 + 0 + 1 + 0 = 1
1
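As an aside, the response ends with a bare number on its own line because the system message asks for the awarded marks in exactly that position, so the pipeline can read them off programmatically. A minimal sketch of that parsing step (assumed here, not quoted from the actual code):

def extract_marks(response: str) -> int:
    # The agent is instructed to place the awarded marks alone on the final line.
    return int(response.strip().splitlines()[-1])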
Looking at this failure,
it's obvious that the model would benefit from having the markscheme included in the context,
so that it knows what the correct answers are,
and how to award the marks.
Without the markscheme,
the agent is unable to notice this discrepancy,
and therefore presumes that a total of four numbers is required for each answer,
instead of the necessary five.
Let's update the system message to include a placeholder for the markscheme.
system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The question is:

{question}

The markscheme for the question is:

{markscheme}

Their answer to this question is:

{answer}

As the very final part of your response, simply provide the number of marks on a *new line*, without any additional formatting. For example:

3
"""
We then make sure to update the system message with the true markscheme inside call_agent.
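The full implementation isn't reproduced here, but a minimal sketch of what call_agent might look like after this change (the parameter names and the agent client are our assumptions, not the original code):

def call_agent(available_marks_total, question, markscheme, answer):
    # Fill every placeholder in the template, now including the markscheme.
    filled = system_message.format(
        available_marks_total=available_marks_total,
        question=question,
        markscheme=markscheme,
        answer=answer,
    )
    # `agent` is assumed to be the LLM client used throughout this walkthrough.
    return agent.generate(filled)

With the markscheme now part of every prompt, we re-run the evaluation over the test set: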
with unify.Experiment("add_markscheme", overwrite=True), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )
Great!
Our mean error has gone down from 0.8 to 0.3,
we're definitely making progress 💪
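For reference, the mean error reported here is just the average gap between the marks the agent awards and the correct marks across the test set; a minimal sketch of that metric, assuming it is computed as a mean absolute difference (the function below is our own illustration):

def mean_error(awarded, correct):
    # Average absolute difference between awarded and correct marks.
    return sum(abs(a - c) for a, c in zip(awarded, correct)) / len(correct)

print(mean_error([1, 3, 2], [4, 3, 2]))  # 1.0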
Let's take a look at the traces,
to ensure that the system message template has been implemented correctly,
and that each LLM call has its template variables populated as expected.
It seems as though everything was implemented correctly,
and the per-LLM system messages look good ✅
We still need to address the remaining errors, though.
Let's explore what's going wrong in another iteration.