The full script for running this iteration can be found here.
Seven of the ten examples have a perfect error of 0, and the remaining three each have an error of 1. Let’s take a closer look at each of these failures, to see why they occur and how we can rectify them.
📄 PDFs
❓ Paper 1 -> Question 2
☑️ Paper 1 -> Question 2 Markscheme
❓ Parsed Question [2 Marks]
Write these in order, smallest first:
0.34, 1/3, 3.5%
…………………… …………………… ……………………
📝 Student's Answer
1/3, 0.34, 3.5%
☑️ Parsed Markscheme
3.5%, 1/3, 0.34
Part marks and guidance:
Accept correct order with equivalent values.
✅ Correct Marks [1/2] Rationale
The candidate’s order is 1/3, 0.34, 3.5%. According to the markscheme, this earns SC1 (special case, partially correct ordering) for 1 mark.
🤖 Predicted Marks [0/2] Rationale
The student’s ordering is incorrect because smallest first should be 3.5%, 1/3, 0.34, but the student provided 1/3, 0.34, 3.5%.
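As a sanity check on the expected ordering in the markscheme, here is a minimal sketch that converts each value to an exact fraction and sorts them. The names `values` and `smallest_first` are made up for this illustration and are not part of the original script.

```python
from fractions import Fraction

# The three values from the question, written as exact fractions
# so that no floating-point rounding can affect the comparison.
values = {
    "0.34": Fraction(34, 100),
    "1/3": Fraction(1, 3),
    "3.5%": Fraction(35, 1000),
}


def smallest_first(vals):
    """Return the labels sorted by their numeric value, smallest first."""
    return sorted(vals, key=vals.get)


correct_order = smallest_first(values)
student_order = ["1/3", "0.34", "3.5%"]

print(correct_order)
print(student_order == correct_order)
```

The sorted order agrees with the parsed markscheme, and the student’s order does not match it, which is why full marks are not available here.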
📄 PDFs
❓ Paper 1 -> Question 19
☑️ Paper 1 -> Question 19 Markscheme
❓ Parsed Question [5 Marks]
Peter makes a large amount of pink paint by mixing red and white paint in the ratio 2 : 3.
• Red paint costs £80 per 10 litres.
• White paint costs £5 per 10 litres.
• Peter sells his pink paint in 10-litre tins for £60 per tin.
Calculate how much profit he makes for each tin he sells.
You must show your working.
£ …………………………………………
📝 Student's Answer
Peter needs 4 litres of red paint and 6 litres of white paint for every 10-litre batch (because 2 out of 5 parts is red, 3 out of 5 parts is white).
☑️ Parsed Markscheme
25 with correct working (5 marks)
“Correct working” requires evidence of at least M1M1
Alternative method:
If 0 or 1 scored, instead award SC2 for answer 25 with no working or insufficient working
✅ Correct Marks [1/5] Rationale
Awarding 1 mark (M1) for correctly finding that a 10-litre tin contains 4 litres of red and 6 litres of white from the 2:3 ratio. No further steps towards the final profit are shown, so only this 1 mark is earned.
🤖 Predicted Marks [0/5] Rationale
The student’s response only states that Peter needs 4 litres of red paint and 6 litres of white paint per 10-litre batch. It does not show any further working or provide the calculation of the cost of the paints or the profit per tin. Therefore, the answer does not meet the full requirements for the calculation and is awarded 0 marks.
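For reference, the full calculation the question asks for can be sketched as below. This is only a worked illustration of the arithmetic using the figures in the question. The variable names are assumptions, and the steps are not claimed to map onto particular method marks.

```python
# Figures taken directly from the question.
RED_COST_PER_LITRE = 80 / 10    # £80 per 10 litres
WHITE_COST_PER_LITRE = 5 / 10   # £5 per 10 litres
TIN_LITRES = 10
TIN_PRICE = 60
RED_PARTS, WHITE_PARTS = 2, 3   # red : white = 2 : 3

# Step 1 (the only step the student completed): split the tin by the ratio.
total_parts = RED_PARTS + WHITE_PARTS
red_litres = TIN_LITRES * RED_PARTS / total_parts
white_litres = TIN_LITRES * WHITE_PARTS / total_parts

# Step 2: cost of the paint in one tin.
cost_per_tin = red_litres * RED_COST_PER_LITRE + white_litres * WHITE_COST_PER_LITRE

# Step 3: profit is the selling price minus the cost.
profit_per_tin = TIN_PRICE - cost_per_tin

print(red_litres, white_litres, cost_per_tin, profit_per_tin)
```

The final value matches the answer of 25 in the parsed markscheme, while the student’s response stops after the first step.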
📄 PDFs
❓ Paper 2 -> Question 10 (c)
☑️ Paper 2 -> Question 10 (c) Markscheme
❓ Parsed Question [2 Marks]
A newspaper headline reads: ‘High temperatures make more people buy ice cream!’ Does the graph above prove this claim? Give a reason for your decision.
📝 Student's Answer
No. The graph alone does not completely prove the claim, because we cannot be certain this trend always holds true in all situations.
☑️ Parsed Markscheme
Answer: No, because there may be other factors involved (2 marks) Guidance: B1 for ‘No’, with partial reason.
✅ Correct Marks [1/2] Rationale
The mark scheme gives 2 marks if the student states “No, there may be other factors involved.” Here, only a partial reason is given, so 1 mark is awarded for denying that the graph proves the claim but not fully explaining other possible factors.
🤖 Predicted Marks [2/2] Rationale
No Rationale Given
Overall, it’s clear that the agent is unable to properly make sense of the different mark types, such as B1, SC1, M1, A1, etc.
This is not surprising, as we’ve never explained what these terms mean in the system prompt!
Let’s add the general marking guidelines to the system prompt, so the agent knows what all of these mark terms mean, and also fully understands how to interpret the markscheme for each question.
The marking guidelines can be extracted from the beginning of any of the markscheme pdf files, such as this one.
Let’s store this in a separate variable, which will make it easier for us to parameterize the inclusion of the guidelines in future experiment iterations.
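A minimal sketch of this idea is shown below. The string is a placeholder only (the real text comes from the opening pages of a markscheme PDF), and both `GENERAL_GUIDELINES` and `guidelines_for_run` are hypothetical names chosen for illustration.

```python
# Placeholder only: the real text is extracted from the start of a markscheme PDF.
GENERAL_GUIDELINES = "<general marking guidelines extracted from the markscheme PDF>"


def guidelines_for_run(include_guidelines: bool = True) -> str:
    """Return the guidelines text, or an empty string if this experiment omits them."""
    return GENERAL_GUIDELINES if include_guidelines else ""


print(repr(guidelines_for_run()))
print(repr(guidelines_for_run(include_guidelines=False)))
```

Keeping the text behind a single switch like this means a later iteration can turn the guidelines off again without touching the rest of the prompt.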
Let’s also update the system message to include a placeholder for these general guidelines, and reword the rest of it to make clear which parts relate to the specific question and which are general guidelines.
Let’s then update the system message to include the guidelines by default.
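One way this could look is sketched below. The template wording, the placeholder names and the `build_system_message` helper are all assumptions made for illustration, not the exact system message used in the original script.

```python
# Hypothetical template: the headings separate the general guidelines
# from the material that is specific to the question being marked.
SYSTEM_MESSAGE_TEMPLATE = """Your task is to award marks for a student's answer.

General marking guidelines (these apply to every question):
{general_guidelines}

The question being marked:
{question}

The markscheme for this specific question:
{markscheme}
"""


def build_system_message(question: str, markscheme: str, general_guidelines: str) -> str:
    """Fill every placeholder in the template with the supplied text."""
    return SYSTEM_MESSAGE_TEMPLATE.format(
        general_guidelines=general_guidelines,
        question=question,
        markscheme=markscheme,
    )


message = build_system_message(
    question="Write these in order, smallest first: 0.34, 1/3, 3.5%",
    markscheme="3.5%, 1/3, 0.34",
    general_guidelines="<general marking guidelines go here>",
)
print(message)
```

`str.format` only parses braces in the template itself, so any braces inside the substituted question or guideline text are left untouched.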
The mean error has gone down from 0.3 to 0.2.
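To make the metric concrete, here is a small sketch of how the earlier figure of 0.3 arises, assuming the per-example error is the absolute difference between the predicted and correct marks (an assumption that is consistent with the three failures shown above).

```python
# (predicted, correct) marks for the three failing examples shown above.
failing = [(0, 1), (0, 1), (2, 1)]

# Assumed definition: per-example error = |predicted - correct|.
# The other seven examples were marked correctly, so each contributes 0.
errors = [abs(predicted - correct) for predicted, correct in failing] + [0] * 7

mean_error = sum(errors) / len(errors)
print(mean_error)
```

Under the same assumption, a mean of 0.2 over these ten examples corresponds to a total absolute error of 2.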
Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and each LLM call has the template variables in the system message populated correctly.
It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅
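Beyond reading the traces by eye, a check like this can also be automated. The sketch below assumes placeholders are written as `{name}` and that the rendered system messages are available as plain strings. The `unfilled_placeholders` helper is hypothetical.

```python
import re

# Matches template variables such as {question} or {general_guidelines}.
PLACEHOLDER = re.compile(r"\{[a-z_]+\}")


def unfilled_placeholders(system_message: str) -> list[str]:
    """Return any template variables still left unfilled in a rendered message."""
    return PLACEHOLDER.findall(system_message)


rendered = "General marking guidelines:\nsome guideline text\n\nQuestion:\nsome question"
broken = "General marking guidelines:\n{general_guidelines}\n\nQuestion:\nsome question"

print(unfilled_placeholders(rendered))
print(unfilled_placeholders(broken))
```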
Let’s explore the remaining errors in the next iteration 🔁