10: Clarify Method + Answer Marks
The full script for running this iteration can be found here.
🔍 Method Marks Confusion
All of the prior failures now seem to have been resolved, but we have a new regression for Example 215 (b). Let’s take a look.
Example 215 (b)
📄 PDFs
❓ Paper 2 -> Question 17 (b)
☑️ Paper 2 -> Question 17 (b) Markscheme
❓ Parsed Question [2 Marks]
Complete these rows of boxes using the rule shown above.
📝 Student's Answer
The missing boxes are 8, 13, and 21, making the sequence 8, 13, 21, 34, 55.
☑️ Parsed Markscheme
Answer: 8, 13, 21
Marks: 2
Guidance: M1 for one correct subtraction of two boxes
✅ Correct Marks [2/2] Rationale
Here, we deduce the missing numbers by working backwards from 34 and 55 or by applying the same rule forward. According to the markscheme (2 marks), we get full credit for the fully correct set of missing numbers, with one method mark (M1) for a correct subtraction and a final accuracy mark (A1) for the correct sequence.
🤖 Predicted Marks [0/2] Rationale
The student’s answer shows the correct numerical result (8, 13, 21, 34, 55) for the missing boxes. However, the mark scheme awards an M1 mark for demonstrating a correct subtraction method (specifically subtracting two boxes) in the working. Since the student’s answer does not include any working or demonstration of this method, the M1 mark should not be awarded. Therefore, based on the guidelines, no marks are given.
This is an interesting failure mode. The justification for the “correct” (ground truth) marks is actually wrong: there is no A1 mark for this question (an A1 would depend on a preceding method mark). This is irrelevant to the agent failure itself (the agent never sees the correct marks or rationale), but it’s still an interesting observation about our “ground truth” data.
Interestingly, the agent has made the same mistake as the “ground truth” rationale: it presumes the existence of an A mark where none is stated. The agent doesn’t seem to understand that a correct final answer should always earn full marks, unless the markscheme explicitly states otherwise. In such cases, M1 marks are a route to partial credit for incomplete answers, not a prerequisite for full marks, unless an A mark is explicitly referenced.
🔀 Clarify Method + Answer Marks
Let’s try to fully clarify these points for the sub-question agent, and re-run the evals.
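For instance, we might append explicit guidelines about mark types to the sub-question agent's system prompt. The snippet below is a minimal sketch of that idea; the variable names (`mark_types_guidelines`, `base_system_prompt`) and the exact wording are assumptions for illustration, not the verbatim script:

```python
# Hypothetical sketch of the prompt clarification; the names and the
# exact wording are assumptions, not taken from the actual script.
mark_types_guidelines = """
When awarding marks, apply the following rules:
- A fully correct final answer earns FULL marks by default, even with
  no working shown, unless the markscheme explicitly requires working.
- M (method) marks reward the method shown in the working; they are a
  route to PARTIAL credit, never a prerequisite for full marks.
- A (accuracy) marks depend on a preceding M mark, but only apply when
  the markscheme explicitly lists an A mark. Never presume an A mark
  exists if it is not stated.
"""

# `base_system_prompt` is a stand-in for the agent's existing prompt.
system_prompt = base_system_prompt + mark_types_guidelines
```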
🧪 Rerun Tests
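A minimal sketch of how such a rerun might look, assuming a pytest-style suite where `load_test_set` and `call_subq_agent` are hypothetical helpers standing in for the real script:

```python
import pytest

# Hypothetical helpers standing in for the real script.
from marking_agent import call_subq_agent, load_test_set

# The 10 examples used throughout this iteration.
test_set = load_test_set("TestSet10")


@pytest.mark.parametrize("example", test_set, ids=lambda e: e.id)
def test_predicted_marks_match_ground_truth(example):
    predicted = call_subq_agent(
        question=example.question,
        student_answer=example.answer,
        markscheme=example.markscheme,
    )
    assert predicted.marks == example.correct_marks
```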
We’ve finally got all 10/10 tests passing perfectly 🎉
If we were to continue optimizing our agent, we would expand the test set to "TestSet20" and continue iterating, spotting failures and applying remedies 🔁
Feel free to use this case study as a starting point, and see how far you can get. Can you get all 321 examples marked correctly? 👀
Otherwise, feel free to dive into any of the core concepts of the platform: Universal API, Logging and Interfaces.
As always, happy prompting 🧑‍💻