3: Add Structured Output
The full script for running this iteration can be found here.
🔍 Difficulty Debugging
Let’s explore the failure modes that still remain.
The change seems to have fixed Example 261, but the other two [Examples 207 and 132 (c)] are still failing.
It’s also becoming quite difficult to track the exact discrepancy between the correct marks and those predicted by the agent, as the agent’s response is a single block of text, unlike the ground truth data, which is formatted as a dictionary with each sub-question independently marked.
Adding structured output could help the agent reason about each part of the question independently, and it would also make the response easier to parse, enabling us to present a diff at the sub-question level rather than just for the entire question. Let’s give it a try!
🔀 Add Structured Output
So, let’s go ahead and implement structured output! 🧑‍💻
Let’s first define the output we want for each sub-question:
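A minimal sketch of this with pydantic is shown below; the class name is hypothetical, but the `reasoning` and `marks` fields match the prompt excerpts further down.

```python
from pydantic import BaseModel


class MarksAndReasoning(BaseModel):
    reasoning: str  # the agent's reasoning about how many marks to award
    marks: int      # the number of marks awarded for this (sub-)question
```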
Let’s now write a simple function to build the desired pydantic output dynamically, based on the sub-questions present.
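Something along these lines would work, using pydantic’s `create_model`; the function name and argument are placeholders for whatever the script actually uses.

```python
from pydantic import create_model


def create_response_format(subquestions):
    """Build the response model dynamically, one field per sub-question.

    `subquestions` is assumed to be a list of sub-question labels such as
    ["a", "b", "c"]; if it is empty, the flat model above is used directly.
    """
    if not subquestions:
        return MarksAndReasoning
    # One required `MarksAndReasoning` entry per sub-question label.
    fields = {label: (MarksAndReasoning, ...) for label in subquestions}
    return create_model("Response", **fields)
```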
Let’s update our system prompt so the agent knows how to populate the structured response correctly. We only use the nested output structure when sub-questions are present, so we’ll want to populate the instructions dynamically, depending on the presence or absence of sub-questions. Let’s create two alternatives. Firstly, when sub-questions are present:
“For each sub-question, you should populate the `reasoning` field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award for this sub-question in the `marks` field.”
When they are not present:
“You should populate the `reasoning` field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award in the `marks` field.”
Firstly, the general template:
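A rough sketch of what this might look like (the wording and placeholder names here are assumptions, not the exact ones used in the script):

```python
system_message_template = """
Your task is to award the correct number of marks for the student's answer
to the following question, using the markscheme provided.

Question:
{question}

Markscheme:
{markscheme}

{output_response_explanation}
"""
```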
Then the two excerpts:
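In code, the two excerpts above could be stored as constants (the variable names are illustrative):

```python
output_response_explanation_w_subqs = (
    "For each sub-question, you should populate the `reasoning` field with "
    "your initial reasoning about the correct number of marks to award. "
    "Finally, you should put the number of marks to award for this "
    "sub-question in the `marks` field."
)

output_response_explanation_wo_subqs = (
    "You should populate the `reasoning` field with your initial reasoning "
    "about the correct number of marks to award. Finally, you should put "
    "the number of marks to award in the `marks` field."
)
```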
Let’s update our `call_agent` method to set the output format dynamically:
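As a rough sketch, assuming an OpenAI-style client with structured-output support (the method and attribute names here are placeholders rather than the script’s actual ones):

```python
def call_agent(self, question, markscheme, answer, subquestions):
    # Build the dynamic response format and the matching instructions.
    response_format = create_response_format(subquestions)
    output_explanation = (
        output_response_explanation_w_subqs
        if subquestions
        else output_response_explanation_wo_subqs
    )
    system_message = system_message_template.format(
        question=question,
        markscheme=markscheme,
        output_response_explanation=output_explanation,
    )
    # Ask the LLM for a response conforming to the dynamic schema.
    completion = self._client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": answer},
        ],
        response_format=response_format,
    )
    # Return the raw JSON string, to be parsed downstream in `evaluate`.
    return completion.choices[0].message.content
```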
Let’s also update our `evaluate` method to parse the returned JSON correctly, include a sub-question level diff, and extend the per-question breakdown to include the sub-question level predictions:
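Illustratively, the parsing and sub-question level diff might look like this (the attributes on `example` and the exact error metric are assumptions):

```python
import json


def evaluate(self, example):
    # Query the agent and parse the structured JSON response.
    response = self.call_agent(
        example.question, example.markscheme, example.answer, example.subquestions
    )
    predicted = json.loads(response)

    # Ground-truth marks are stored per sub-question in a dict.
    diff, predictions = {}, {}
    for label, correct_marks in example.correct_marks.items():
        pred_marks = (
            predicted[label]["marks"] if example.subquestions else predicted["marks"]
        )
        predictions[label] = pred_marks
        diff[label] = pred_marks - correct_marks

    # Per-question error, plus the sub-question level breakdown.
    error = sum(abs(d) for d in diff.values())
    return {"error": error, "diff": diff, "predictions": predictions}
```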
🧪 Rerun Tests
The mean error has actually gone back up from `0.2` to `0.4`.
Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and that the template variables in each LLM call’s system message have been populated as expected.
It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅
Let’s explore what’s going wrong in the next iteration 🔁