1: Add Markscheme
The full script for running this iteration can be found here.
Inconsistent Formatting
Let's take a look at our results in the table.
Example 215
Example 215 has the largest error. Let's take a look at what's going wrong.
PDFs
Paper 2 -> Question 17
Paper 2 -> Question 17 Markscheme
Parsed Question [8 Marks]
In this row of boxes, you start with 5 and 7.
You add 5 and 7 to get 12 to go in the third box.
You add 7 and 12 to get 19 to go in the fourth box.
You add 12 and 19 to get 31 to go in the fifth box.
Complete these rows of boxes using the rule shown above.
(a)
(b)
(c) Complete this row of boxes, writing your expressions in their simplest form.
(d) Use your answer to (c) to help you fill in the missing numbers in this row of boxes.
Student's Answer
(a) The next three boxes are 10, 16, and 26. So the row becomes 4, 6, 10, 16, 26.
(b) The missing boxes are 8, 13, and 21, making the sequence 8, 13, 21, 34, 55.
(c) a + b, a + 2b, a + 3b
(d) By setting 2a + 3b = 57, we substitute a = 6 to get 12 + 3b = 57, so b = 15. Then, adding consecutively gives 6, 15, 21, 37, with the final box stated as 57 in the question.
Based on these two, there appears to be a small discrepancy in the text-based box formatting of the question given to the agent: the parsed question presents four boxes per row, whereas the true number is five. We can verify this by looking at the original page, which can be found here.
Parsed Markscheme (Not Given to Agent)
(a) Answer: 10, 16, 26. Marks: 1
(b) Answer: 8, 13, 21. Marks: 2. Guidance: M1 for one correct subtraction of two boxes
(c) Answer: a + b, a + 2b, 2a + 3b. Marks: 2. Guidance: M1 for two expressions correct
(d) Answer: 15, 21, 36. Marks: 3. Guidance: M1 for their "2a + 3b" = 57, M1 for substituting a = 6 into their final expression and solving for b
Correct Marks [6/8] Rationale (Not Given to Agent)
(a) 1/1 Mark. By adding consecutive pairs (4+6=10, 6+10=16, 10+16=26), we match the markscheme's answer of 10, 16, 26. This secures the single mark (1) allocated for part (a).
(b) 2/2 Marks. Here, we deduce the missing numbers by working backwards from 34 and 55 or by applying the same rule forward. According to the markscheme (2 marks), we get full credit for the fully correct set of missing numbers, with one method mark (M1) for a correct subtraction and a final accuracy mark (A1) for the correct sequence.
(c) 1/2 Marks. We used the rule to form expressions but made an error in the final expression: the correct set per the markscheme should be (a+b), (a+2b), (2a+3b). We have two correct expressions, so we earn the method mark (M1), but lose the accuracy mark for an incorrect final expression.
(d) 2/3 Marks. Using part (c)'s structure, we form 2a+3b, set it to 57, and solve correctly for b = 15 (which earns the two method marks, M1 and M1). However, we introduced an error when adding 15 + 21 = 37 (should be 36), so we do not achieve the final accuracy mark (A1) for a fully correct final row.
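The arithmetic in this rationale can be checked mechanically. The following is a minimal sketch, assuming five boxes per row as the markscheme implies; the helper names `extend_row` and `backfill_row` are illustrative and not part of the tutorial's code.

```python
def extend_row(first: int, second: int, length: int = 5) -> list[int]:
    """Build a row where each box is the sum of the two boxes before it."""
    row = [first, second]
    while len(row) < length:
        row.append(row[-2] + row[-1])
    return row


def backfill_row(fourth: int, fifth: int) -> list[int]:
    """Recover a five-box row from its last two boxes by repeated subtraction."""
    row = [fourth, fifth]
    while len(row) < 5:
        # The box before row[0] is row[1] - row[0], since row[1] is their sum.
        row.insert(0, row[1] - row[0])
    return row


# (a) The row starts with 4 and 6.
print(extend_row(4, 6))      # [4, 6, 10, 16, 26]

# (b) The row ends with 34 and 55.
print(backfill_row(34, 55))  # [8, 13, 21, 34, 55]

# (d) The fifth box is 2a + 3b = 57 with a = 6, so b = (57 - 12) / 3.
a = 6
b = (57 - 2 * a) // 3
print(extend_row(a, b))      # [6, 15, 21, 36, 57]
```

The last row confirms that the fourth box is 36, not the 37 given in the student's answer.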
Predicted Marks [1/8] Rationale
Let's check each part:
(a) The correct row with starting numbers 4 and 6 should give 4, 6, 4+6 = 10, and then 6+10 = 16. The answer given "4, 6, 10, 16, 26" is one extra term. (0/1 mark)
(b) For the row with boxes "__, __, 34, 55" the correct answer would be 13, 21, 34, 55 (since 13 + 21 = 34 and 21 + 34 = 55). The answer given "8, 13, 21, 34, 55" does not match the required four-box row. (0/2 marks)
(c) For the row "a, b, __, __" the missing boxes should be expressed as a + b and a + 2b in simplest form. The answer "a + b, a + 2b, a + 3b" gives an extra term. (1/2 marks)
(d) In the row "6, __, __, 57", if we interpret the pattern as in (c) with a = 6 so that the boxes are 6, b, 6+b, 6+2b and 6 + 2b = 57, then b = 25.5. The given answer instead sets up a different equation and finds b = 15 leading to an inconsistent sequence. (0/3 marks)
Total marks awarded: 0 + 0 + 1 + 0 = 1
1
Looking at this failure, it's clear that the model would benefit from having the markscheme included in the context, so that it knows what the correct answers are and how to award the marks.
Without the markscheme, the agent is unable to notice this discrepancy, and therefore presumes that a total of four numbers is required for each answer, instead of the necessary five.
Add Markscheme
Let's update the system message to include a placeholder for the markscheme.
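The tutorial's actual template is not reproduced here. As a minimal sketch, assuming a plain Python format string, the placeholder could look like the following; the wording and the field names (`available_marks`, `question`, `markscheme`, `answer`) are assumptions for illustration.

```python
# Hypothetical template: the wording and field names are assumptions,
# not the tutorial's actual system message.
SYSTEM_MESSAGE = """
Your task is to award a suitable number of marks for a student's answer
to a question, from 0 up to a maximum of {available_marks} marks.

The question is:

{question}

The markscheme for this specific question is:

{markscheme}

The student's answer is:

{answer}

Explain your reasoning, then give the total as an integer on the final line.
""".strip()
```

Keeping the markscheme as a named field, rather than hard-coding it, lets the same template be reused for every question.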
We then make sure to update the system message with the true markscheme during call_agent:
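A sketch of what this step might look like is shown below. It assumes the model client can be represented as any callable taking a system message and a user message; the `llm` parameter, the shortened `TEMPLATE`, and the argument names are all hypothetical stand-ins rather than the tutorial's real signature.

```python
from typing import Callable

# Hypothetical, shortened template; only the {markscheme} field matters here.
TEMPLATE = (
    "Award between 0 and {available_marks} marks.\n"
    "Question:\n{question}\n"
    "Markscheme:\n{markscheme}\n"
    "Answer:\n{answer}"
)


def call_agent(
    llm: Callable[[str, str], str],
    question: str,
    markscheme: str,
    answer: str,
    available_marks: int,
) -> str:
    """Fill the template for this example, then query the model.

    `llm` is a stand-in for whatever client is in use: any callable that
    takes (system_message, user_message) and returns the response text.
    """
    system_message = TEMPLATE.format(
        question=question,
        markscheme=markscheme,
        answer=answer,
        available_marks=available_marks,
    )
    return llm(system_message, "Please mark the student's answer.")
```

Rendering the template inside the call, instead of once up front, means each example gets its own markscheme in the system message.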
We also need to update evaluate accordingly:
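The updated function is not shown here, so the following is only a sketch. It assumes that the error is the absolute difference between predicted and correct marks, and that the predicted total is the last integer in the response; `agent`, `parse_marks`, and the argument names are hypothetical.

```python
import re
from typing import Callable


def parse_marks(response: str) -> int:
    """Read the predicted total as the last integer in the response."""
    numbers = re.findall(r"\d+", response)
    if not numbers:
        raise ValueError("no mark found in response")
    return int(numbers[-1])


def evaluate(
    agent: Callable[..., str],
    question: str,
    markscheme: str,
    answer: str,
    available_marks: int,
    correct_marks: int,
) -> int:
    """Return the absolute error in marks (an assumed error definition)."""
    # The markscheme is now passed through to the agent with the question.
    response = agent(
        question=question,
        markscheme=markscheme,
        answer=answer,
        available_marks=available_marks,
    )
    return abs(parse_marks(response) - correct_marks)
```

With the response from Example 215, which ends in a predicted total of 1 against a correct total of 6, this definition gives an error of 5 marks.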
Rerun Tests
Great! Our mean error has gone down from 0.8 to 0.3, so we're definitely making progress.
Let's take a look at the traces, to ensure that the system message template has been implemented correctly, and each LLM call has the template variables in the system message populated correctly.
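How the traces are stored and retrieved is not shown here. As a rough sketch, assuming the rendered system messages have already been collected into a list of strings, a check for leftover template fields could look like this; `unfilled_placeholders` and `check_traces` are illustrative names.

```python
import re

# Matches a leftover `{name}` field that `str.format` would have filled.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def unfilled_placeholders(system_message: str) -> list[str]:
    """List any template fields still present in a rendered message."""
    return PLACEHOLDER.findall(system_message)


def check_traces(system_messages: list[str]) -> list[int]:
    """Return the indices of logged system messages with leftover fields."""
    return [
        i
        for i, message in enumerate(system_messages)
        if unfilled_placeholders(message)
    ]
```

An empty result means every logged system message had its fields populated. It does not confirm that the right markscheme was used for each question, so a manual read of a few traces is still worthwhile.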
It seems as though everything was implemented correctly, and the per-LLM system messages look good.
We still need to try to address the remaining errors though. Let's explore what's going wrong in another iteration.