10: Clarify Method + Answer Marks
The full script for running this iteration can be found here.
🔍 Method Marks Confusion
All of the prior failures now seem to have been resolved, but we have a new regression for Example 215 (b). Let’s take a look.
Example 215 (b)
This is an interesting failure mode. The justification for the “correct” (ground truth) marks is actually wrong: there is no A1 mark for this question (which would depend on a method mark). This is irrelevant to the agent failure itself (the agent doesn’t know the correct marks or rationale), but it’s still an interesting observation about our “ground truth” data.
Interestingly, the agent has made the same mistake that appears in the “ground truth” rationale: it presumes the existence of an A mark where none is stated. It seems the agent doesn’t understand that correct answers should always earn full marks, unless otherwise explicitly stated. M1 marks are not necessary to achieve full marks in such cases, unless an A mark is specifically referenced.
🔀 Clarify Method + Answer Marks
Let’s try to fully clarify these points for the sub-question agent, and re-run the evals.
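As a rough sketch of what this clarification might look like, the extra guidance could be appended to the sub-question agent’s system message template along these lines. The template text, constant name, and helper below are illustrative assumptions, not the exact ones used in the iteration script:

```python
# Illustrative sketch only: the real system message template lives in the
# iteration script linked above, and these names are placeholders.
MARK_TYPE_GUIDELINES = """
Guidance on method (M) and answer (A) marks:
- A fully correct answer earns full marks, unless the mark scheme explicitly
  states otherwise.
- Only award or withhold an A mark if the mark scheme actually defines one;
  never presume an A mark exists.
- M marks are not a prerequisite for full marks unless the mark scheme
  explicitly ties an A mark to a preceding M mark.
"""


def build_system_message(template: str, **variables: str) -> str:
    """Populate the per-sub-question system message, appending the extra
    method/answer mark guidance at the end."""
    return template.format(**variables) + "\n" + MARK_TYPE_GUIDELINES
```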
🧪 Rerun Tests
Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and each LLM call has the template variables in the system message populated correctly.
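One quick way to sanity-check this is sketched below, assuming the populated per-LLM system messages have been pulled out of the traces as plain strings (the extraction step itself is platform-specific and not shown):

```python
import re


def check_system_messages(system_messages: list[str]) -> None:
    """Sanity-check populated system messages extracted from the traces.

    Assumes `system_messages` holds the final per-LLM system messages;
    the guidance substring matches the illustrative template above.
    """
    for msg in system_messages:
        # The new method/answer mark guidance should be present.
        assert "Guidance on method (M) and answer (A) marks" in msg
        # No template variables should be left unpopulated (e.g. "{question}").
        leftovers = re.findall(r"\{[a-z_]+\}", msg)
        assert not leftovers, f"Unpopulated template variables: {leftovers}"
```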
It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅
Also, we’ve finally got all 10/10 tests passing perfectly 🎉
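For reference, the pass count over the test set boils down to a loop like the following. Both `mark_question` and the example fields are placeholder names standing in for whatever the iteration script actually defines:

```python
def run_eval(examples: list[dict], mark_question) -> tuple[int, int]:
    """Return (passed, total) over a list of test examples.

    Each example is assumed to carry the question, the student answer, and
    the ground-truth marks; `mark_question` is the agent entry point.
    """
    passed = 0
    for example in examples:
        predicted = mark_question(example["question"], example["answer"])
        passed += int(predicted == example["correct_marks"])
    return passed, len(examples)
```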
If we were to continue optimizing our agent, we would expand the test set to "TestSet20" and continue iterating, spotting failures and applying remedies 🔁
Feel free to use this case study as a starting point, and see how far you can get. Can you get all 321 examples marked correctly? 👀
Otherwise, feel free to dive into any of the core concepts of the platform: Universal API, Logging and Interfaces.
As always, happy prompting 🧑💻