Complete these rows of boxes using the rule shown above.
The missing boxes are 8, 13, and 21, making the sequence 8, 13, 21, 34, 55.
Answer: 8, 13, 21
Marks: 2
Guidance:
M1 for one correct subtraction of two boxes
Here, we deduce the missing numbers by working backwards from 34 and 55 or by applying the same rule forward. According to the markscheme (2 marks), we get full credit for the fully correct set of missing numbers, with one method mark (M1) for a correct subtraction and a final accuracy mark (A1) for the correct sequence.
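To make the working-backwards step concrete, here is a minimal sketch of the subtraction rule (illustrative only, not taken from the markscheme): each box is the sum of the two before it, so each missing box is the difference of the two boxes that follow it.

```python
def complete_row(known_last_two, missing_count):
    """Recover missing boxes by repeated subtraction,
    working backwards from the last two known boxes."""
    row = list(known_last_two)          # e.g. [34, 55]
    for _ in range(missing_count):
        # Each missing box is the difference of the next two boxes.
        row.insert(0, row[1] - row[0])
    return row

print(complete_row([34, 55], 3))  # [8, 13, 21, 34, 55]
```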
The student’s answer shows the correct numerical result (8, 13, 21, 34, 55) for the missing boxes. However, the mark scheme awards an M1 mark for demonstrating a correct subtraction method (specifically subtracting two boxes) in the working. Since the student’s answer does not include any working or demonstration of this method, the M1 mark should not be awarded. Therefore, based on the guidelines, no marks are given.
This is an interesting failure mode. The justification for the “correct” (ground truth) marks is actually wrong: there is no A1 mark for this question (an A1 would in any case depend on a method mark). This is irrelevant to the agent failure itself (the agent doesn’t know the correct marks or rationale), but it’s still an interesting observation about our “ground truth” data.
Interestingly, the agent has made the same mistake that appears in the “ground truth” rationale: it presumes the existence of an A mark where none is stated. It seems the agent doesn’t understand that correct answers should always earn full marks unless otherwise explicitly stated; M1 marks are not necessary for full marks in such cases, unless an A mark is specifically referenced.
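To spell out the rule we want the agent to follow, here is a minimal sketch of the decision logic (the function name, parameters, and simplification are illustrative, not part of the actual agent):

```python
def marks_for_correct_answer(max_marks: int, method_shown: bool,
                             a_mark_referenced: bool) -> int:
    """Illustrative decision rule only, not the real agent logic.

    A fully correct answer earns full marks by default; showing working
    (the M1 criterion) is not required. The one exception is when the
    markscheme explicitly references an A mark, which depends on the
    preceding M mark having been earned.
    """
    if a_mark_referenced and not method_shown:
        # The explicitly referenced A mark cannot be awarded without
        # its supporting method mark, so full marks are withheld.
        return max_marks - 1
    return max_marks
```

For this question, where no A mark is actually stated, the correct set of missing boxes with no working shown should therefore still score the full 2 marks.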
Let’s try to fully clarify these points for the sub-question agent, and re-run the evals.
```python
output_response_explanation = """You should populate the `reasoning` field with your general thoughts on each individual mark identified in the markscheme, and also a decision as to whether each of these marks should be awarded.

If you deem that a mark *should* be awarded (such as SC1, B1, A1 etc.), then it is worth as many marks as appear in the mark type itself (SC1, B1, and A1 are therefore worth 1 mark each, A2 is worth 2 marks etc.). However, these marks are not *necessarily* cumulative with regards to the total marks to award for this sub-question, and some may be irrelevant given the student's approach or answer.

More importantly, full marks should *always* be given for a fully correct answer, unless otherwise *explicitly* stated. For example, a correct answer without any method shown should still get *full marks*, despite the M1 criteria not being met. The only exception to this is explicitly referenced A marks, which do depend on the preceding M marks being awarded.

Finally, after you've given it a lot of thought, you should put the total number of marks to award for this sub-question in the `marks` field."""
```
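As an aside, the mark labels encode their value in the trailing digits, as the explanation above describes. A tiny helper makes this concrete (illustrative only, not part of the agent):

```python
import re

def mark_value(label: str) -> int:
    """Return the number of marks a label such as 'SC1', 'B1' or 'A2' is worth
    (illustrative helper, not part of the actual agent)."""
    match = re.search(r"(\d+)$", label)
    if match is None:
        raise ValueError(f"Unrecognised mark label: {label!r}")
    return int(match.group(1))

assert [mark_value(m) for m in ("SC1", "B1", "A1", "A2")] == [1, 1, 1, 2]
```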
```python
# Log a new experiment, with the updated system messages recorded as parameters.
with unify.Experiment(
    "clarify_method_marks",
    overwrite=True,
), unify.Params(
    subq_system_message=subq_system_message,
    mark_system_message=mark_system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    # Re-run the evals across every example in the test set.
    unify.map(
        evaluate,
        [
            dict(
                **d.entries,
                _subq_system_message=subq_system_message,
                _mark_system_message=mark_system_message,
            )
            for d in test_set_10
        ],
        name="Evals",
    )
```
We’ve finally got all 10/10 tests passing perfectly 🎉
If we were to continue optimizing our agent, we would expand the test set to "TestSet20" and continue iterating, spotting failures and applying remedies 🔁
Feel free to use this case study as a starting point, and see how far you can get.
Can you get all 321 examples marked correctly? 👀