Iteration 6: Mark Type Reasoning
The full script for running this iteration can be found here.
🔍 Still Ignoring Mark Types
As usual, let’s explore why the agent might be failing on the remaining examples 🕵️
Let’s try to force the agent to reason about each potential mark mentioned in the markscheme, by further refining our structured output. We’ll expand the `reasoning` field for each sub-question with a field for each mark type referenced in the sub-question markscheme, going from the following structure:
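For reference, here is a minimal sketch of the existing structure (assuming the field names used earlier in this series; the exact models live in the full script):

```python
from pydantic import BaseModel

# Before: the reasoning for each sub-question is one free-form string.
class MarksAndReasoning(BaseModel):
    reasoning: str
    marks: int
```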
To this version, which explicitly enforces reasoning about each potential mark type referenced in the markscheme:
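Roughly speaking (again a sketch, not the verbatim script), for a sub-question whose markscheme references M1 and A1, the dynamically constructed models would look like this:

```python
from pydantic import BaseModel

class ThoughtsAndAwardDecision(BaseModel):
    thoughts: str
    should_award: bool

# One ThoughtsAndAwardDecision field per mark type in the markscheme.
class PerMarkReasoning(BaseModel):
    M1: ThoughtsAndAwardDecision
    A1: ThoughtsAndAwardDecision

# After: the reasoning field is structured per mark type.
class MarksAndReasoning(BaseModel):
    reasoning: PerMarkReasoning
    marks: int
```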
This way, the agent will be forced to reason about `SC1` for Example 207, `M1` for Example 261, and `B1` for Example 132 (c).
🔀 Mark Type Reasoning
Let’s first define a function to dynamically construct the required pydantic type. For each parsed mark type, we want the model to give its thoughts and make a decision as to whether or not the mark should be awarded. Let’s create this pydantic type first:
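Something along these lines should work (a sketch; the exact field names in the full script may differ):

```python
from pydantic import BaseModel

class ThoughtsAndAwardDecision(BaseModel):
    # Free-form reasoning about whether the criteria for this mark are met.
    thoughts: str
    # The final decision for this specific mark.
    should_award: bool
```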
Let’s then create a function to dynamically construct a `PerMarkReasoning` pydantic type, with one `ThoughtsAndAwardDecision` instance for each mark detected in the sub-question markscheme.
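One way to do this is with pydantic’s `create_model`. The function name and signature here are illustrative, not necessarily those used in the full script:

```python
from pydantic import create_model

def create_per_mark_reasoning_format(marks: list[str]):
    # e.g. marks = ["M1", "A1"] -> a model with fields M1 and A1,
    # each holding a ThoughtsAndAwardDecision (defined above).
    return create_model(
        "PerMarkReasoning",
        **{mark: (ThoughtsAndAwardDecision, ...) for mark in marks},
    )
```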
Let’s then re-define `MarksAndReasoning` (previously this was statically defined, see the previous iteration) such that the `reasoning` field is no longer just a string, but is instead our newly created `PerMarkReasoning` (above).
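A sketch of the re-definition, again via `create_model` (the full script may differ in detail):

```python
from pydantic import create_model

def create_marks_and_reasoning_format(marks: list[str]):
    # The reasoning field is now a per-mark model rather than a plain string.
    return create_model(
        "MarksAndReasoning",
        reasoning=(create_per_mark_reasoning_format(marks), ...),
        marks=(int, ...),
    )
```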
Finally, let’s update the top-level function `create_response_format` such that we make use of our newly defined `create_marks_and_reasoning_format` for each sub-question.
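For example (the signature taking a mapping of sub-question to detected marks is an assumption; the full script may pass the parsed marks differently):

```python
from pydantic import create_model

def create_response_format(marks_per_subquestion: dict[str, list[str]]):
    # One MarksAndReasoning model per sub-question, keyed by sub-question id.
    return create_model(
        "Response",
        **{
            subq: (create_marks_and_reasoning_format(marks), ...)
            for subq, marks in marks_per_subquestion.items()
        },
    )
```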
We also need to write a function to parse the relevant marks from each sub-question markscheme. We can take inspiration from `update_markscheme` defined in the previous iteration, which parses the markscheme in the same manner but for a different reason. Let’s have the function extract the marks, and also the surrounding context.
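A sketch of one possible implementation (the regex and the size of the context window are assumptions):

```python
import re

def parse_available_marks_from_markscheme(markscheme: str) -> dict[str, str]:
    # Map each detected mark type (e.g. M1, A1, B1, SC1) to a snippet of
    # surrounding context from the markscheme.
    marks: dict[str, str] = {}
    for match in re.finditer(r"\b(?:M|A|B|SC)\d+\b", markscheme):
        start = max(0, match.start() - 80)
        end = min(len(markscheme), match.end() + 80)
        marks.setdefault(match.group(), markscheme[start:end])
    return marks
```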
Finally, we’ll also need to update `call_agent` such that we call `parse_available_marks_from_markscheme` on each sub-question markscheme, and then pass these parsed marks into our newly defined `create_response_format`.
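Putting it together, `call_agent` might look roughly like this. The data shapes (`question["sub_questions"]`, etc.) are hypothetical, and the OpenAI structured-output call is shown in its generic form:

```python
import json
from openai import OpenAI

client = OpenAI()

def call_agent(system_message: str, question: dict):
    # Parse the available marks for each sub-question, then build a
    # response format that enforces per-mark reasoning.
    marks_per_subquestion = {
        subq: list(parse_available_marks_from_markscheme(data["markscheme"]))
        for subq, data in question["sub_questions"].items()
    }
    response_format = create_response_format(marks_per_subquestion)
    completion = client.beta.chat.completions.parse(
        model="o3-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": json.dumps(question)},
        ],
        response_format=response_format,
    )
    return completion.choices[0].message.parsed
```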
Let’s also update our system message to better explain to the agent how it should reason about this new output structure.
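For instance, an addition along these lines (the exact wording in the full script will differ):

```python
SYSTEM_MESSAGE_ADDENDUM = """
For each sub-question, the `reasoning` field contains one entry per mark
type referenced in the markscheme (e.g. M1, A1, B1, SC1). For each entry,
first write your thoughts on whether the criteria for that specific mark
are met, then set `should_award` accordingly. Only then decide the total
`marks` for the sub-question.
"""
```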
🧪 Rerun Tests
The failure modes are still entirely unchanged! `o3-mini` is certainly very stubborn about its decisions on these questions.
Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and that the template variables in each LLM call’s system message are populated as expected.
It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅
Again, let’s explore what’s going wrong in the next iteration 🔁