The full script for running this iteration can be found here.

🔍 Method Marks Confusion

All of the prior failures now seem to have been resolved, but we have a new regression for Example 215 (b). Let’s take a look.

Example 215 (b)

This is an interesting failure mode. The justification for the “correct” (ground truth) marks is actually wrong: there is no A1 mark for this question (an A1 mark would depend on a method mark). This is irrelevant as far as the agent failure is concerned (the agent doesn’t see the correct marks or rationale), but it’s still an interesting observation regarding our “ground truth” data.

Interestingly, the agent has made the same mistake that appears in the “ground truth” rationale: it presumes the existence of an A mark where none is stated. It seems the agent doesn’t understand that a correct answer should always earn full marks unless explicitly stated otherwise. M1 marks are not required to achieve full marks in such cases, unless an A mark is specifically referenced.
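
To make the intended behaviour concrete, the rule we want the agent to apply looks roughly like the following sketch (purely illustrative; the function and argument names are hypothetical, not part of the agent code):

def award_marks(answer_correct: bool, max_marks: int,
                a_mark_referenced: bool, m_marks_earned: int,
                a_marks_earned: int) -> int:
    # A fully correct answer earns full marks, even with no method shown,
    # unless the markscheme explicitly references an A mark.
    if answer_correct and not a_mark_referenced:
        return max_marks
    # Explicitly referenced A marks only count when the preceding M marks were earned.
    return m_marks_earned + (a_marks_earned if m_marks_earned > 0 else 0)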

🔀 Clarify Method + Answer Marks

Let’s try to fully clarify these points for the sub-question agent, and re-run the evals.

output_response_explanation = """
You should populate the `reasoning` field with your general thoughts on each individual mark identified in the markscheme, and also a decision as to whether each of these marks should be awarded.

If you deem that a mark *should* be awarded (such as SC1, B1, A1 etc.), then it is worth as many marks as appear in the mark type itself (SC1, B1, and A1 are therefore each worth 1 mark, A2 is worth 2 marks, etc.). However, these marks are not *necessarily* cumulative with regard to the total marks to award for this sub-question, and some may be irrelevant given the student's approach or answer.

More importantly, full marks should *always* be given for a fully correct answer, unless otherwise *explicitly* stated. For example, a correct answer without any method shown should still get *full marks*, despite the M1 criteria not being met. The only exception to this is explicitly referenced A marks, which do depend on the preceding M marks being awarded.

Finally, after you've given it a lot of thought, you should put the total number of marks to award for this sub-question in the `marks` field.
"""

🧪 Rerun Tests

with unify.Experiment(
    "clarify_method_marks",
    overwrite=True,
), unify.Params(
    subq_system_message=subq_system_message,
    mark_system_message=mark_system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
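    # Run the evaluation over every example in the test set, with the updated system messages.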
    unify.map(
        evaluate,
        [
            dict(
                **d.entries,
                _subq_system_message=subq_system_message,
                _mark_system_message=mark_system_message,
            )
            for d in test_set_10
        ],
        name="Evals",
    )

Let’s take a look at the traces, to ensure that the system message template has been implemented correctly and that each LLM call has its template variables populated as expected.

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅

Also, we’ve finally got all 10/10 tests passing perfectly 🎉

If we were to continue optimizing our agent, we would expand the test set to "TestSet20" and continue iterating, spotting failures and applying remedies 🔁

Feel free to use this case study as a starting point, and see how far you can get. Can you get all 321 examples marked correctly? 👀

Otherwise, feel free to dive into any of the core concepts of the platform: Universal API, Logging and Interfaces.

As always, happy prompting 🧑‍💻