🔍 Difficulty Debugging
Let’s explore the remaining failure modes. The change seems to have fixed Example 261, but the other two [Examples 207 and 132 (c)] are still failing. It’s also becoming quite difficult to track the exact discrepancy between the correct marks and those predicted by the agent, as the agent’s response is a single block of text, unlike the ground truth data, which is formatted as a dictionary with each sub-question independently marked. Adding structured output would help the agent reason about each part of the question independently, and it would also make the response easier to parse, enabling us to present a diff at the sub-question level rather than just for the entire question. Let’s give it a try!

🔀 Add Structured Output
So, let’s go ahead and implement structured output! 🧑‍💻 Let’s first define the output we want for each sub-question. Firstly, the general template, for when sub-questions are present:

“For each sub-question, you should populate the `reasoning` field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award for this sub-question in the `marks` field.”

When they are not present:

“You should populate the `reasoning` field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award in the `marks` field.”
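As an illustrative sketch (not the original implementation), the per-sub-question output could be modelled with Pydantic; the `reasoning` and `marks` fields come from the instructions above, while the class names and the wrapping `sub_questions` list are assumptions:

```python
from pydantic import BaseModel, Field


class SubQuestionMark(BaseModel):
    """Structured mark for a single sub-question (hypothetical schema)."""

    reasoning: str = Field(
        description="Initial reasoning about the correct number of marks to award."
    )
    marks: int = Field(
        description="The number of marks to award for this sub-question."
    )


class QuestionMarks(BaseModel):
    """Structured marks for a full question (hypothetical schema)."""

    # One entry per sub-question; for questions without sub-questions,
    # a single entry is assumed.
    sub_questions: list[SubQuestionMark]
```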
Next, let’s update the `call_agent` method to set the output format dynamically:
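The actual method isn’t reproduced here, but a minimal standalone sketch, assuming an OpenAI-style client with structured-output parsing (the model name, function signature, and argument names are all placeholders), might look like:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


def call_agent(system_message: str, question: str,
               response_format: type[BaseModel]) -> BaseModel:
    """Query the marking agent, constraining the reply to `response_format`.

    `response_format` is chosen per question: the sub-question schema when
    sub-questions are present, and a single-answer schema otherwise.
    """
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        response_format=response_format,
    )
    return completion.choices[0].message.parsed
```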
We also need to update the `evaluate` method to parse the returned JSON correctly, include a sub-question level diff, and extend the per-question breakdown to also include the sub-question level predictions:
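Again as a rough sketch rather than the original code, the diff could be computed by pairing the predicted sub-question marks (using the hypothetical `QuestionMarks` schema above) against the ground-truth dictionary; the returned structure for the per-question breakdown is also an assumption:

```python
def evaluate(predicted: QuestionMarks, correct_marks: dict[str, int]) -> dict:
    """Compare the agent's structured prediction against the ground-truth marks.

    `correct_marks` is assumed to map sub-question labels (e.g. "a", "b", "c")
    to the correct number of marks, in the same order as the predicted list.
    """
    labels = list(correct_marks.keys())
    predictions = {
        label: pred.marks
        for label, pred in zip(labels, predicted.sub_questions)
    }
    # Positive diff => the agent awarded too many marks; negative => too few.
    diff = {label: predictions[label] - correct_marks[label] for label in labels}
    return {
        "predictions": predictions,
        "diff": diff,
        "correct": all(d == 0 for d in diff.values()),
    }
```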
🧪 Rerun Tests
After rerunning the tests, the score has gone up from 0.2 to 0.4.
Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and that each LLM call has the template variables in the system message populated correctly. It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅
Let’s explore what’s going wrong in the next iteration 🔁