Improve Agent
In the last section we created test sets of varying sizes, ready to evaluate our agent. So, it's finally time to start our data flywheel spinning! The general process for optimizing an LLM agent is quite straightforward:
Firstly, let's activate the MarkingAssistant project.
Let's also set a new context Evals, where we'll store all of our evaluation runs.
Let's now go through this data flywheel step-by-step!
🤖 Create Agent
Let's start with a simple 0-shot LLM to begin with.
Let's also download a .cache.json file which was previously generated whilst running this notebook, to avoid making any real LLM calls and to make our walkthrough deterministic.
If you'd rather go down your own unique iteration journey, then you should skip the cell below, and either remove cache="read-only" (turn off caching) or replace it with cache=True (create your own local cache) in the agent constructor above. However, this would mean many parts of the remaining walkthrough might not directly apply in your case, as the specific failure modes and the order in which they appear are likely to be different.
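For reference, here is a minimal sketch of what such an agent constructor might look like, assuming the unify.Unify chat client; the endpoint string and the exact keyword arguments are illustrative rather than the walkthrough's actual code:

```python
import unify

# Hedged sketch: endpoint string and constructor arguments are assumptions.
agent = unify.Unify(
    "o3-mini@openai",   # the model referred to later in this walkthrough
    cache="read-only",  # replay the downloaded .cache.json; use cache=True (or drop it) to go your own way
)


def call_agent(system_message: str, user_message: str) -> str:
    # single 0-shot completion; the marking instructions live in the system message
    return agent.generate(user_message, system_message=system_message)
```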
The agent needs to mark student answers to questions, out of a possible maximum number of marks. Let's give it a sensible system message to begin with:
Let's wrap our system prompt in a simple function so the system message is updated based on the specific data involved:
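As a rough illustration (the exact wording and placeholder names are assumptions, not the walkthrough's actual prompt), the template and wrapper might look something like this:

```python
system_message_template = """
You are marking a student's answer to an exam question.
The question is worth a maximum of {available_marks} marks.

Question:
{question}

Student's answer:
{student_answer}

Award an integer number of marks between 0 and {available_marks}.
""".strip()


def build_system_message(question: str, student_answer: str, available_marks: int) -> str:
    # substitute the data for this specific test example into the template
    return system_message_template.format(
        question=question,
        student_answer=student_answer,
        available_marks=available_marks,
    )
```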
🗂️ Add Tests
Great, we now have our agent implemented. So, what are some good unit tests to begin with? Rather than using all 321 examples for our first iteration, let's use the smallest subset of 10 examples, which we created in the previous section.
🧪 Run Tests
Let's add an evaluation function, and include all other arguments that we would like to log as part of the evaluation. All input arguments, intermediate variables, and return variables without a leading "_" in the name (all "non-private" arguments, returns and variables) will automatically be logged when the function is called, thanks to the unify.log decorator.
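A minimal sketch of such an evaluation function is shown below; the argument names and the extract_marks helper are hypothetical, and build_system_message / call_agent are carried over from the earlier sketches, but the @unify.log decorator is used as described above:

```python
import unify


@unify.log
def evaluate(question: str, student_answer: str, available_marks: int, correct_marks: int):
    # every non-private argument, intermediate variable and return value is logged
    system_message = build_system_message(question, student_answer, available_marks)
    response = call_agent(system_message, question)
    predicted_marks = extract_marks(response)  # hypothetical parsing helper
    error = abs(predicted_marks - correct_marks)
    return error
```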
We can then run our evaluation, with the logging included, like so:
The unify.Experiment() term creates an "experiment" parameter in the context. The unify.Params() term sets the parameters which are held constant throughout the experiment.
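Putting it together, the run might look roughly like the following; the parameter names and the test-set variable and keys are illustrative assumptions:

```python
import unify

with unify.Experiment(), unify.Params(
    system_message=system_message_template,  # held constant for this experiment
    model="o3-mini@openai",
):
    for example in test_set_10:  # the 10-example subset; name and keys are assumptions
        evaluate(
            question=example["question"],
            student_answer=example["student_answer"],
            available_marks=example["available_marks"],
            correct_marks=example["correct_marks"],
        )
```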
We have some failures, with a mean error of 0.8 across the ten examples. We'll need to dive in and understand what's going wrong.
🔁 Iteration 1
🔍 Inconsistent Formatting
Let's take a look at our results in the table:
GIF
Example 215 has the largest error. Let's take a look at what's going wrong.
❓ Question [8 Marks]:
In this row of boxes, you start with 5 and 7.
┌────┬───┬────┬────┬────┐
│ 5  │ 7 │    │    │    │
└────┴───┴────┴────┴────┘

┌────┬───┬────┬────┬────┐
│ 5  │ 7 │ 12 │ 19 │ 31 │
└────┴───┴────┴────┴────┘

┌────┬───┬────┬────┐
│ 4  │ 6 │    │    │        [1]
└────┴───┴────┴────┘

┌────┬────┬────┬────┐
│    │    │ 34 │ 55 │       [2]
└────┴────┴────┴────┘

┌────┬───┬────┬────┐
│ a  │ b │    │    │        [2]
└────┴───┴────┴────┘

┌────┬────┬────┬────┐
│ 6  │    │    │ 57 │       [3]
└────┴────┴────┴────┘
We then make sure to update the system message with the true markscheme during call_agent:
We also need to update evaluate accordingly:
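A sketch of the kind of change involved is shown below; the extra markscheme placeholder and the evaluate pass-through are assumptions about how the template is organised:

```python
system_message_template = """
You are marking a student's answer to an exam question.
The question is worth a maximum of {available_marks} marks.

Question:
{question}

Markscheme:
{markscheme}

Student's answer:
{student_answer}
""".strip()


def call_agent(question: str, student_answer: str, available_marks: int, markscheme: str) -> str:
    # the true markscheme for this specific question is now included verbatim
    system_message = system_message_template.format(
        question=question,
        markscheme=markscheme,
        student_answer=student_answer,
        available_marks=available_marks,
    )
    return agent.generate(question, system_message=system_message)

# evaluate simply gains a `markscheme` argument which it forwards to call_agent
```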
🧪 Rerun Tests
Great! Our mean error has gone down from 0.8 to 0.3; we're definitely making progress 💪
We still need to try to address the remaining errors though. Let's explore what's going wrong in another iteration.
🔁 Iteration 2
🔍 Misunderstanding Mark Types
Seven out of ten examples have an error of 0, with the remaining three having an error of 1. Let's take a closer look at these examples, and why they're failing.
GIF
Let's dive in and explore each of these failures in more detail, to see how we can rectify them.
Example 207
❓ Question [2 Marks]:
Write these in order, smallest first:
0.34, 1/3, 3.5%
……………… ……………… ………………
📝 Student's Answer:
1/3, 0.34, 3.5%
✍️ Markscheme:
3.5%, 1/3, 0.34
Part marks and guidance:
- B1 for 1/3 = 0.33… or 33…%
- or B1 for 0.34 = 34%
- or B1 for changing 3.5% to 0.035
- or SC1 for 1/3, 0.34, 3.5%
Accept correct order with equivalent values.
✅ Correct Marks [1/2] Rationale:
The candidate's order is 1/3, 0.34, 3.5%. According to the markscheme, this earns SC1 (special case, partially correct ordering) for 1 mark.
🤖 Predicted Marks [0/2] Rationale:
The student's ordering is incorrect because smallest first should be 3.5%, 1/3, 0.34, but the student provided 1/3, 0.34, 3.5%.
Example 261
❓ Question [5 Marks]:
Peter makes a large amount of pink paint by mixing red and white paint in the ratio 2 : 3.
• Red paint costs £80 per 10 litres.
• White paint costs £5 per 10 litres.
• Peter sells his pink paint in 10-litre tins for £60 per tin.
Calculate how much profit he makes for each tin he sells. You must show your working.
£ ………………………………
📝 Student's Answer:
Peter needs 4 litres of red paint and 6 litres of white paint for every 10-litre batch (because 2 out of 5 parts is red, 3 out of 5 parts is white).
✍️ Markscheme:
25 with correct working (5 marks)
"Correct working" requires evidence of at least M1M1
- M1 for 10 × (2/5) = 4 litres red or for 10 × (3/5) = 6 litres white
- M1 for red costs £8 per litre or for white costs £0.50 per litre
- M1 for cost of one 10-litre can is their 4 × their 8 + their 6 × their 0.5
- M1 for 60 − their 35
Alternative method:
- M1 for 2 : 3 = 20 litres red : 30 litres white
- M1 for 2 × £80 + 3 × £5 = £175
- M1 for (their 175) ÷ 5 = 35
- M1 for 60 − their 35
- M1 for cost of one 10-litre can is their 4 × their 8 + their 6 × their 0.5
If 0 or 1 scored, instead award SC2 for answer 25 with no working or insufficient working
✅ Correct Marks [1/5] Rationale:
Awarding 1 mark (M1) for correctly finding that a 10-litre tin contains 4 litres of red and 6 litres of white from the 2:3 ratio. No further steps towards the final profit are shown, so only this 1 mark is earned.
🤖 Predicted Marks [0/5] Rationale:
The student's response only states that Peter needs 4 litres of red paint and 6 litres of white paint per 10-litre batch. It does not show any further working or provide the calculation of the cost of the paints or the profit per tin. Therefore, the answer does not meet the full requirements for the calculation and is awarded 0 marks.
Example 132 (c)
❓ Question [2 Marks]:
A newspaper headline reads: "High temperatures make more people buy ice cream!" Does the graph above prove this claim? Give a reason for your decision.
📝 Student's Answer:
No. The graph alone does not completely prove the claim, because we cannot be certain this trend always holds true in all situations.
✍️ Markscheme:
Answer: No, because there may be other factors involved (2 marks) Guidance: B1 for "No", with partial reason.
✅ Correct Marks [1/2] Rationale:
The mark scheme gives 2 marks if the student states "No, there may be other factors involved." Here, only a partial reason is given, so 1 mark is awarded for denying that the graph proves the claim but not fully explaining other possible factors.
🤖 Predicted Marks [2/2] Rationale:
No Rationale Given
Thoughts:
Overall, it's clear that the agent is unable to properly make sense of the different mark types, such as B1, SC1, M1, A1 etc.
This is not surprising, as we've never explained what these terms mean in the system prompt!
🔀 Add Marking Guidelines
Let's add the general marking guidelines to the system prompt, so the agent knows what all of these mark terms mean, and also fully understands how to interpret the markscheme for each question.
The marking guidelines can be extracted from the beginning of any of the markscheme PDF files, such as this one.
Let's store this in a separate variable, which will make it easier for us to parameterize the inclusion of the guidelines in future experiment iterations.
Let's also update the system message to include a placeholder for these general guidelines.
Let's then update the system message to include the guidelines by default.
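Here's a rough sketch of the idea; the guideline text below is a short paraphrase of the standard mark-type conventions, not the actual extracted guidelines, and the placeholder names are assumptions:

```python
general_guidelines = """
M marks are awarded for a correct method.
A marks are awarded for an accurate answer and depend on the associated M marks.
B marks are independent of method marks.
SC marks are special-case marks defined by the individual markscheme.
""".strip()

system_message_template = """
You are marking a student's answer to an exam question.

General marking guidelines:
{general_guidelines}

Question:
{question}

Markscheme:
{markscheme}

Student's answer:
{student_answer}
""".strip()


def build_system_message(question, markscheme, student_answer, guidelines=general_guidelines):
    # the guidelines are included by default, but remain a parameter so their
    # inclusion can be toggled (or varied) in later experiments
    return system_message_template.format(
        general_guidelines=guidelines,
        question=question,
        markscheme=markscheme,
        student_answer=student_answer,
    )
```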
🧪 Rerun Tests
The mean error has gone down from 0.3 to 0.2. Let's explore the remaining errors in another iteration.
🔁 Iteration 3
🔍 Difficulty Debugging
The change seems to have fixed Example 261, but the other two [Examples 207 and 132 (c)] are still failing.
It's also becoming quite difficult to track the exact discrepancy between the correct marks and those predicted by the agent, as the agent's response is a single block of text, unlike the ground truth data, which is formatted as a dictionary with each sub-question independently marked.
Adding structured output could help the agent reason about each part of the question independently, and it will also make the response easier to parse, enabling us to present a diff at the sub-question level rather than just for the entire question. Let's give it a try!
🔀 Add Structured Output
Let's first define the output we want for each sub-question:
Let's now write a simple function to build the desired pydantic output dynamically, based on the sub-questions present.
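A sketch of how this might be done with pydantic's create_model; the field-name sanitisation and the exact model names are illustrative choices, not the walkthrough's actual code:

```python
from pydantic import BaseModel, create_model


class MarksAndReasoning(BaseModel):
    # the per-(sub)question output described above
    reasoning: str
    marks: int


def create_response_format(sub_questions: list[str] | None):
    # with no sub-questions the flat structure is used directly; otherwise we
    # build a model with one MarksAndReasoning field per sub-question
    if not sub_questions:
        return MarksAndReasoning
    fields = {
        # field names must be valid identifiers, so "a.i" becomes "a_i" here
        sq.replace(".", "_"): (MarksAndReasoning, ...)
        for sq in sub_questions
    }
    return create_model("Response", **fields)
```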
Let's update our system prompt, so the agent knows how to populate the structured response correctly. We only use the nested output structure if sub-questions are present, so we'll want to populate the instructions dynamically, depending on the presence or absence of sub-questions. Let's create two alternatives. Firstly, when sub-questions are present:
"For each sub-question, you should populate the reasoning field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award for this sub-question in the marks field."
When they are not present:
"You should populate the reasoning field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award in the marks field."
Firstly, the general template:
Then the two excerpts:
Let's update our call_agent method to set the output format dynamically:
Let's also update our evaluate method to parse the returned JSON correctly, include a sub-question-level diff, and update the per-question breakdown to also include the sub-question-level predictions:
🧪 Rerun Tests
The mean error has actually gone back up from 0.2 to 0.4. Let's explore what's going wrong in another iteration.
🔁 Iteration 4
🔍 Context Alignment
In terms of failures, let's take Example 132 as an example. For this particular question, there are 6 sub-questions (a.i, a.ii, b.i, b.ii, b.iii, c), and we're asking the LLM to do a lot in a single shot:
- understand all 16 points in the general marking guidelines
- understand all 6 of the sub-questions
- understand all 6 of the student's answers to the sub-questions
- understand the markscheme's reasoning for all 6 of the sub-questions
More importantly, the system prompt doesn't align the relevant information together. The agent receives the information like so:
Let's update the system prompt, so the information is aligned better, more like the following:
🔀 Better Align Context
First, let's abstract this into a "{questions_markscheme_and_answers}" placeholder:
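For illustration, the placeholder might be filled by a helper along these lines; the data structure assumed for the sub-questions is hypothetical:

```python
def format_subquestions(sub_questions: dict) -> str:
    # sub_questions is assumed to map each key (e.g. "a.i") to a dict with
    # "question", "markscheme" and "answer" entries
    blocks = []
    for key, sq in sub_questions.items():
        blocks.append(
            f"Sub-question ({key}):\n{sq['question']}\n\n"
            f"Markscheme ({key}):\n{sq['markscheme']}\n\n"
            f"Student's answer ({key}):\n{sq['answer']}\n"
        )
    # the result is substituted into the {questions_markscheme_and_answers} placeholder
    return "\n".join(blocks)
```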
Let's then update call_agent:
Let's also update our evaluate function, so that we pass the sub_questions into the call_agent function:
🧪 Rerun Tests
The mean error has now dropped to 0.3. Again, let's explore what's going wrong in another iteration.
🔁 Iteration 5
🔍 Ignoring Mark Types
Looking a bit deeper, it seems we're getting the same failures as before. These errors seem quite persistent; we might need to make more radical changes to try and address them.
Let's revisit each of these failures again, and explore them in more detail, to see how we can rectify them.
Given the consistent prevalence of these errors, it's useful to consider these failing examples across all experiments, and see the different kinds of rationale our agent is giving across each run.
Our evaluation runs serve both as tests for comparing parameter configurations (when the agent output depends strongly on the changed parameters) and as a kind of sampling from the noise distribution for each test set example (when the agent output depends less strongly on the changed parameters across each experiment).
This is one benefit of the flexible spreadsheet design. Experiments are not indivisible atomic groups; they're simply labels, and the raw evaluation data can be juggled in whatever way makes sense to you as the user.
GIF
Let's first recap the fixed question ❓, student answer 📝, markscheme ✍️ and correct marks ✅, before unpacking the various agent justifications 🤖 across all experiments.
Example 207
❓ Question [2 Marks]:
Write these in order, smallest first:
0.34, 1/3, 3.5%
……………… ……………… ………………
📝 Student's Answer:
1/3, 0.34, 3.5%
✍️ Markscheme:
3.5%, 1/3, 0.34
Part marks and guidance:
- B1 for 1/3 = 0.33… or 33…%
- or B1 for 0.34 = 34%
- or B1 for changing 3.5% to 0.035
- or SC1 for 1/3, 0.34, 3.5%
Accept correct order with equivalent values.
✅ Correct Marks [1/2] Rationale:
The candidate's order is 1/3, 0.34, 3.5%. According to the markscheme, this earns SC1 (special case, partially correct ordering) for 1 mark.
🤖 Predicted Marks [x/2] Rationales (with added insights 💡):
simple_agent
[0/2] ❌: The correct order should be 3.5%, 1/3, 0.34. The student's answer is incorrect. [No reference to SC1 💡]
add_markscheme
[0/2] ❌: The student's ordering is incorrect because smallest first should be 3.5%, 1/3, 0.34, but the student provided 1/3, 0.34, 3.5%. [No reference to SC1 💡]
add_marking_guidelines
[0/2] ❌: The required order, from smallest to largest, is 3.5%, 1/3, 0.34 (since 3.5% = 0.035, 1/3 ≈ 0.333, and 0.34 = 0.34). The student's answer lists the numbers as 1/3, 0.34, 3.5%, which is not in the correct order. [No reference to SC1 💡]
add_structured_output
[0/2] ❌: The correct order should list the values from smallest to largest: 3.5% (which is equivalent to 0.035), then 1/3 (approximately 0.333…), and finally 0.34. The student's answer, "1/3, 0.34, 3.5%", does not follow the required order. Therefore, no marks can be awarded for this answer. [No reference to SC1 💡]
align_context
[0/2] ❌: The correct order, based on numerical values (3.5% = 0.035, 1/3 ≈ 0.333, 0.34 = 0.34), should be 3.5%, 1/3, 0.34. The student's answer (1/3, 0.34, 3.5%) does not follow the "smallest first" requirement. Therefore, no marks should be awarded. [No reference to SC1 💡]
Clearly, the agent is really failing to pick up on the SC1 mark from the markscheme, irrespective of the various parameter changes we've made across each experiment run.
Example 261
❓ Question [5 Marks]:
Peter makes a large amount of pink paint by mixing red and white paint in the ratio 2 : 3.
• Red paint costs £80 per 10 litres.
• White paint costs £5 per 10 litres.
• Peter sells his pink paint in 10-litre tins for £60 per tin.
Calculate how much profit he makes for each tin he sells. You must show your working.
£ ………………………………
📝 Student's Answer:
Peter needs 4 litres of red paint and 6 litres of white paint for every 10-litre batch (because 2 out of 5 parts is red, 3 out of 5 parts is white).
✍️ Markscheme:
25 with correct working (5 marks)
"Correct working" requires evidence of at least M1M1
- M1 for 10 × (2/5) = 4 litres red or for 10 × (3/5) = 6 litres white
- M1 for red costs £8 per litre or for white costs £0.50 per litre
- M1 for cost of one 10-litre can is their 4 × their 8 + their 6 × their 0.5
- M1 for 60 − their 35
Alternative method:
- M1 for 2 : 3 = 20 litres red : 30 litres white
- M1 for 2 × £80 + 3 × £5 = £175
- M1 for (their 175) ÷ 5 = 35
- M1 for 60 − their 35
- M1 for cost of one 10-litre can is their 4 × their 8 + their 6 × their 0.5
If 0 or 1 scored, instead award SC2 for answer 25 with no working or insufficient working
✅ Correct Marks [1/5] Rationale:
Awarding 1 mark (M1) for correctly finding that a 10-litre tin contains 4 litres of red and 6 litres of white from the 2:3 ratio. No further steps towards the final profit are shown, so only this 1 mark is earned.
🤖 Predicted Marks [x/5] Rationale (with added insights 💡):
simple_agent
[1/5] ✅: The full solution should involve several steps:
- Determining that in a 10-litre mix, Peter uses 4 litres of red (2/5 of 10) and 6 litres of white (3/5 of 10).
- Calculating the cost per litre for each colour:
- Red: £80/10 = £8 per litre, so 4 litres costs 4 × £8 = £32.
- White: £5/10 = £0.50 per litre, so 6 litres costs 6 × £0.50 = £3.
- Adding the costs gives a total cost of £32 + £3 = £35.
- Subtracting this from the selling price, the profit per tin is £60 − £35 = £25.
The student's answer only identifies the paint volumes (4 litres red and 6 litres white) without completing the cost and profit calculations. [No reference to M1 💡]
add_markscheme
[0/5] ❌: The student's response only states that Peter needs 4 litres of red paint and 6 litres of white paint per 10-litre batch. It does not show any further working or provide the calculation of the cost of the paints or the profit per tin. Therefore, the answer does not meet the full requirements for the calculation and is awarded 0 marks. [No reference to M1 💡]
add_marking_guidelines
[1/5] ✅: The student only shows that a 10-litre tin requires 4 litres red paint and 6 litres white paint (i.e. the ratio part), but they do not continue to use these figures to work out the cost or the profit. This is an incomplete answer which does not answer the question. [No reference to M1 💡]
add_structured_output
[0/5] ❌: The student's work shows part of the method by correctly determining that a 10-litre batch contains 4 litres of red paint and 6 litres of white paint. However, the answer does not extend to calculate the cost of the red and white paint, nor does it compute the final profit per tin by subtracting the total cost from the selling price. According to the mark scheme, a correct final answer of £25 with full working is needed for 5 marks. Since the student has given only part of the working (the mix ratio) and omitted the critical profit calculation, the answer deserves only credit for knowing the ratio. In such cases, with insufficient working, the mark scheme permits awarding SC2 (2 marks) for an answer that states the correct profit value by implication. As the final profit figure is missing, awarding 2 marks is appropriate. [No reference to M1 💡]
align_context
[0/5] ❌: The answer only shows that Peter uses 4 litres of red paint and 6 litres of white paint in a 10-litre batch. However, the question asks for the profit per tin, which requires further calculation including the cost of the paints and then subtracting from the selling price (£60). No profit calculation or final numerical answer for the profit is given. As a result, none of the required steps for computing profit were provided. Therefore, no marks can be awarded. [No reference to M1 💡]
Again, regardless of our parameter variations, it seems as though the agent is not correctly understanding the M1 mark in the markscheme, which clearly states M1 for 10 × (2/5) = 4 litres red or for 10 × (3/5) = 6 litres white. Even on the two occasions where it got things right, it feels like a lucky guess, as this was not justified via the markscheme's M1 mark.
Example 132 (c) is failing for a slightly different reason, and so we'll consider this separately.
In general, as we get deeper into the evaluation iterations, it's often wise to consider multiple failure modes at once. Larger evals can be expensive to run, and you generally want to use all of the newly gained knowledge to try and improve your agent in the next evaluation run, even if this means making several unrelated changes to address several unrelated failure modes.
🔍 Lenient Reasoning
Let's perform the same deeper analysis for Example 132 (c), and see what's going wrong in this case.
Example 132 (c)
❓ Question [2 Marks]:
A newspaper headline reads: "High temperatures make more people buy ice cream!" Does the graph above prove this claim? Give a reason for your decision.
📝 Student's Answer:
No. The graph alone does not completely prove the claim, because we cannot be certain this trend always holds true in all situations.
✍️ Markscheme:
Answer: No, because there may be other factors involved (2 marks) Guidance: B1 for "No", with partial reason.
✅ Correct Marks [1/2] Rationale:
The mark scheme gives 2 marks if the student states "No, there may be other factors involved." Here, only a partial reason is given, so 1 mark is awarded for denying that the graph proves the claim but not fully explaining other possible factors.
🤖 Predicted Marks [x/2] Rationale (with added insights 💡):
simple_agent
[2/2] ❌: The response correctly explains that the graph, by itself, does not prove the claim as it only shows a correlation under the conditions measured. [No mention of other factors, which is needed 💡]
add_markscheme
[2/2] ❌: [No rationale given at all! 💡]
add_marking_guidelines
[2/2] ❌: [No rationale given at all! 💡]
add_structured_output
[2/2] ❌: The explanation correctly states that the graph alone does not prove the claim due to other potential factors affecting ice cream sales. This answer meets the criteria for 2 marks. [Not true, the student doesn't mention "other potential factors affecting ice cream sales" 💡]
align_context
[2/2] ❌: The answer indicates that the graph alone does not fully prove the claim since other factors might be involved, which adequately meets the criteria for the mark scheme. [Not true, the student doesn't mention "other factors might be involved" 💡]
In this case, it's not so much an error with understanding the mark types; it's more an issue of falsely hallucinating things the student said, or perhaps being very liberal in the interpretation. Perhaps the agent is confusing the markscheme with the student answer. The student's answer "we cannot be certain this trend always holds true" is very different to "there may be other factors involved". The former (incorrect) assumes the trend does exist, but just might not continue, while the latter (correct) is a point about correlation != causation, indicating the apparent causation may not exist at all.
🔀 Better Align Guidelines
Firstly, the recurring problem for Examples 207 and 261 seems to be that the agent doesn't remember and/or understand the different types of marks (B1, SC1, M1 etc.). Let's be more explicit: parse each sub-question markscheme for the different mark types, add the explanations directly as part of the sub-question-specific markschemes, and see if this improves performance.
Let's first create a dictionary with the mark type explanations, written in a more direct manner to accompany the sub-question-specific markschemes and to make it easier to parse:
Let's then write a simple function to update each sub-question-specific markscheme, prepending the markscheme with the relevant definitions from our mark_types dict, so that the agent has all the relevant information close at hand:
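A sketch of what this could look like; the explanation wording is a paraphrase of the standard conventions and the regex is illustrative:

```python
import re

mark_types = {
    "M": "M{n}: method mark(s), awarded for a correct method or process.",
    "A": "A{n}: accuracy mark(s), awarded for a correct answer and dependent on the associated method mark(s).",
    "B": "B{n}: independent mark(s), awarded for a correct result or statement regardless of method.",
    "SC": "SC{n}: special-case mark(s), awarded for the specific response described in the markscheme.",
}


def update_markscheme(markscheme: str) -> str:
    # find every mark code referenced in this sub-question's markscheme
    # (e.g. M1, A1, B2, SC1) and prepend the matching explanation
    referenced = sorted(set(re.findall(r"\b(M|A|B|SC)(\d+)\b", markscheme)))
    if not referenced:
        return markscheme
    explanations = [mark_types[prefix].format(n=num) for prefix, num in referenced]
    return "Mark types referenced below:\n" + "\n".join(explanations) + "\n\n" + markscheme
```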
Let's now update our call_agent method such that the markscheme changes are dynamically applied before passing to the agent:
🔀 Clarify Strict Reasoning
We've just addressed the recurring problem for Examples 207 and 261, but the failure for Example 132 (c) was quite different.
Let's add another instruction to our general_guidelines variable, with an imaginary extra piece of guidance, to try and avoid the leniency we've observed in the marking of Example 132 (c).
Despite applying no changes to the source code of evaluate, we need to run it in the notebook again so that the @unify.log decorator catches the updated functions called by evaluate, such as call_agent.
🧪 Rerun Tests
Now we've made both of these changes, let's re-run our evals to see whether each change was able to address the problem it's intended to resolve.
Our failure mechanisms are exactly the same as before; clearly the agent is still struggling to correctly reason about the different mark types.
🔁 Iteration 6
🔍 Still Ignoring Mark Types
Let's try to force the agent to reason about each potential mark mentioned in the markscheme, by further refining our structured output. Let's expand upon the reasoning field for each sub-question, with a field for each mark type referenced in the sub-question markscheme, going from the following structure:
To this version, which explicitly enforces reasoning about each potential mark type referenced in the markscheme:
This way, the agent will be forced to reason about SC1 for Example 207, M1 for Example 261, and B1 for Example 132 (c).
🔀 Mark Type Reasoning
Let's first define a function to dynamically construct the required pydantic type. For each parsed mark type, we want the model to give its thoughts and make a decision as to whether or not the mark should be awarded. Let's create this pydantic type first:
Let's then create a function to dynamically construct a PerMarkReasoning pydantic type, with one ThoughtsAndAwardDecision instance for each mark detected in the sub-question markscheme.
Let's then re-define MarksAndReasoning (previously this was statically defined, see above) such that the reasoning field is no longer just a string, but is instead our newly created PerMarkReasoning (above).
Finally, let's update the top-level function create_response_format such that we're making use of our newly defined create_marks_and_reasoning_format for each sub-question.
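Sketched out, the chain of dynamically built pydantic models might look like this; the names follow the prose above, but the exact field layout is an assumption:

```python
from pydantic import BaseModel, create_model


class ThoughtsAndAwardDecision(BaseModel):
    thoughts: str
    should_award: bool


def create_per_mark_reasoning_format(marks: list[str]):
    # one ThoughtsAndAwardDecision per mark code (e.g. "M1", "SC1") in the markscheme
    return create_model(
        "PerMarkReasoning",
        **{mark: (ThoughtsAndAwardDecision, ...) for mark in marks},
    )


def create_marks_and_reasoning_format(marks: list[str]):
    # reasoning is now the nested per-mark structure rather than a plain string
    return create_model(
        "MarksAndReasoning",
        reasoning=(create_per_mark_reasoning_format(marks), ...),
        marks=(int, ...),
    )


def create_response_format(subq_to_marks: dict[str, list[str]]):
    # one MarksAndReasoning model per sub-question, each built from that
    # sub-question's own set of available marks
    return create_model(
        "Response",
        **{
            subq.replace(".", "_"): (create_marks_and_reasoning_format(marks), ...)
            for subq, marks in subq_to_marks.items()
        },
    )
```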
We also need to write a function to parse the relevant marks from each sub-question markscheme. We can take inspiration from here, where we wrote a parser for a slightly different reason. Let's have the function extract the marks, and also the surrounding context.
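A regex-based sketch of such a parser; the return shape (each mark mapped to a snippet of surrounding context) is an assumption:

```python
import re


def parse_available_marks_from_markscheme(markscheme: str, context_chars: int = 80) -> dict[str, str]:
    # returns each mark code referenced in the markscheme (M1, A1, B2, SC1, ...)
    # together with a little surrounding context explaining what it is awarded for
    marks = {}
    for match in re.finditer(r"\b(?:M|A|B|SC)\d+\b", markscheme):
        start = max(match.start() - context_chars, 0)
        end = min(match.end() + context_chars, len(markscheme))
        marks.setdefault(match.group(), markscheme[start:end])
    return marks
```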
Finally, we'll also need to update call_agent such that we call parse_available_marks_from_markscheme on each sub-question markscheme, and then pass these into our newly defined create_response_format.
Let's also update our system message to better explain to the agent how it should reason about this new output structure.
We also need to update evaluate, given some of the other changes.
🧪 Rerun Tests
The failure modes are still entirely unchanged! o3-mini is certainly very stubborn about its decisions for these questions.
🔁 Iteration 7
🔍 Still Ignoring Mark Types
Let's see what effect our new output format had on the nature of the agent's responses, if any.
Considering Example 207, the agent still failed to award SC1 for the student's answer 1/3, 0.34, 3.5%, despite the markscheme explicitly stating SC1 for 1/3, 0.34, 3.5%. The agent's explicit thoughts about SC1 were:
🤖 No special case credit is applicable here since the order is incorrect and no alternative acceptable method is demonstrated.
This is a pretty fluffy and empty statement. Despite o3-mini being a multi-step reasoning model, perhaps we're still asking the agent to consider too many things at once.
Forcing the agent to consider one mark at a time might rectify this lack of attention to detail.
Example 132 is even more difficult, where the agent not only needs to consider each mark, but it also has six different sub-questions to reason about, each with their own set of available marks and mark types.
Let's see if using a separate LLM call per sub-question improves the performance on Example 132.
🔀 Queries per Subquestion
Firstly, let's create a new system prompt for our agent, which will reason about one sub-question at a time.
Given the changes, we can also remove the output_response_explanations dict and replace it with a single output_response_explanation string variable, since the agent no longer needs to output responses for multiple sub-questions in a single response.
Let's update call_agent to map each sub-question to a unique LLM call, and let's also add a unify.traced decorator so we're able to capture each individual LLM call in the overall trace.
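Roughly, this might look as follows; the helper names, the sub-question data structure and the response_format keyword are all assumptions carried over from the earlier sketches:

```python
import unify


@unify.traced
def call_agent(sub_questions: dict, general_guidelines: str) -> dict:
    # one LLM call per sub-question, each captured within the overall trace
    predictions = {}
    for key, sq in sub_questions.items():
        system_message = build_subq_system_message(  # hypothetical per-sub-question template helper
            sub_question=sq["question"],
            markscheme=sq["markscheme"],
            student_answer=sq["answer"],
            general_guidelines=general_guidelines,
        )
        response_format = create_marks_and_reasoning_format(
            list(parse_available_marks_from_markscheme(sq["markscheme"])),
        )
        predictions[key] = agent.generate(
            sq["question"],
            system_message=system_message,
            response_format=response_format,
        )
    return predictions
```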
Let's also update evaluate to pass the updated parameters to call_agent.
🧪 Rerun Tests
🔁 Iteration 8
🔍 Lack of Global Context
These updates have actually regressed the overall performance, leaving us with a mean error of 0.5.
Maybe the purely local reasoning has some shortcomings. Let's focus on one of the new regressions, to understand why our latest change has disrupted the agent where it was previously very consistently correct.
Example 20 (b)
❓ Question [1 Mark]:
Explain why he might not want to use Account A.
📝 Student's Answer:
He might not want to use Account A because the bank might keep the money forever and he could lose it.
✍️ Markscheme:
He may not want to leave it there for 3 years. Accept any valid reason.
✅ Correct Marks [0/1] Rationale:
The markscheme awards 1 mark for a valid reason such as not wanting to lock away funds for 3 years. Here, the stated reason is irrelevant and does not accurately reflect the restrictions of Account A.
🤖 Predicted Marks [1/1] ❌ Rationale:
The mark scheme awards 1 mark for any valid reason why the person might not want to leave the money in Account A for 3 years. The student's answer, although phrasing it as the bank might keep the money forever and thus he could lose it, is a valid reason to be cautious, even though it doesn't mention the explicit time period of 3 years. This interpretation sufficiently aligns with the mark scheme's requirement of a valid reason. Therefore, the mark is awarded.
Clearly, the agent is now taking some liberties with what constitutes a "valid reason". It's unclear why the agent is only making these mistakes now.
Let's look at some of the justifications for not awarding the mark from the previous runs.
🤖 Predicted Marks [x/1] Rationales:
simple_agent
[0/1] ✅: The answer should explain that although Account A gives slightly more money, Derrick might not want this account because his money is tied up for three years (i.e. it has poor liquidity). The student's explanation about the bank "keeping the money forever" is not an acceptable financial rationale.
add_markscheme
[0/1] ✅: The explanation provided is not a valid reason and does not address the restriction of no withdrawals for three years. Therefore, no mark can be awarded for this part either.
add_marking_guidelines
[0/1] ✅: The explanation given ("because the bank might keep the money forever and he could lose it") does not correctly address why he might not want to use Account A (namely, that his money is locked away for 3 years with no withdrawals allowed).
add_structured_output
[0/1] ✅: The answer needed to explain why Account A might not be desirable, which is because funds are locked in for three years and withdrawals are not allowed until the end of the period. The student's answer states that the bank might keep the money forever and he could lose it, which does not address the actual restriction (lack of liquidity). This is an irrelevant or incorrect reason and does not meet the mark scheme.
align_context
[0/1] ✅: The mark scheme accepts any valid reason such as not being able to withdraw the money because it is locked in for 3 years. The candidate's answer, that the bank might "keep the money forever" and he could lose it, is not a valid explanation for the funds being inaccessible. It does not clearly express that his money is tied up for 3 years. Therefore, this explanation does not meet the requirement and no mark should be awarded.
align_guidelines_and_clarify_reasoning
[0/1] ✅: The answer to part (b) should explain that Account A requires the money to be locked in for three years, which might be a disadvantage if immediate access is desired. The student's explanation, that the bank might "keep the money forever", does not address the real issue of the restriction on withdrawals. As the explanation provided is not a valid reason as per the mark scheme, no marks should be awarded for this part.
mark_type_reasoning
[0/1] ✅: The explanation provided does not address the key issue mentioned in the mark scheme (the restriction of no withdrawals for 3 years) and instead gives a non sequitur reason that the bank might "keep the money forever." This does not meet the required rationale, so no mark is awarded.
Perhaps depriving the agent of the full question prevents it from using "common sense" and realizing how "silly" the proposed answer is, in light of the overall question and the information provided to the student.
Maybe strict adherence to the markscheme alone, without the full context, is too restrictive.
Let's update our per-subquestion system prompts to also fully include the preceding sub-questions, their markschemes, and their answers. It's unlikely that the context of a later question will assist with the marking of an earlier question, and we still want to keep the agent as focused as possible on the relevant information.
🔀 Include Preceding Context
Let's first update the system prompt, re-introducing the placeholder for the aligned sub-questions, markschemes and answers, this time calling it {prior_context}, which will only be included when sub-questions are present. Let's also include the full question.
Let's also add a general explanation for the prior context, in cases where it is included.
Let's now update call_agent to pass in the required information.
Finally, let's update evaluate accordingly.
🧪 Rerun Tests
Great, so we've fixed the new regressions, but again we're back at the same three failures, failing for the same reason.
🔁 Iteration 9
🔍 Still Ignoring Mark Types
Given that the agent is still failing to follow the instructions for each mark in the markscheme, perhaps it's time we tried to perform per-mark reasoning, with a separate LLM call made for each candidate mark to award. This might help the LLM deeply consider each candidate mark mentioned in the markscheme.
Let's give it a try!
🔀 Queries per Mark
We will still want our per-subquestion LLM to perform the final reasoning about the number of marks to award for the sub-question, but we just want to provide it with the reasoning performed by each of our per-mark LLM queries.
We therefore now have two different LLM roles, and so we need two different system messages.
Let's first update the sub-question system message, in anticipation of the incoming mark-by-mark reasoning. Let's also split the markscheme and the mark type reasoning, rather than naively combining them as was done in update_markscheme.
The "{mark_types_explanation}" placeholder can be overridden explicitly, giving us more control. Let's create a new function extract_mark_type_explanation, inspired by update_markscheme above.
Let's now create the system message for our mark reasoning agent, again with the explicit {mark_types_explanation} placeholder.
Let's first define call_subq_agent, which will include mark-by-mark reasoning with several LLM calls:
Let's now update call_agent, making use of our call_subq_agent function, which processes a single sub-question.
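Sketching the two-level structure (per-mark calls feeding a final per-sub-question call); the prompt layouts, helper names and keyword arguments are assumptions rather than the walkthrough's actual code:

```python
import unify


@unify.traced
def call_subq_agent(sub_question: str, markscheme: str, student_answer: str,
                    mark_sys_msg: str, subq_sys_msg: str):
    # 1) one LLM call per candidate mark, asking only whether that single mark applies
    per_mark_reasoning = {}
    for mark in parse_available_marks_from_markscheme(markscheme):
        system_message = mark_sys_msg.format(
            mark=mark,
            mark_types_explanation=extract_mark_type_explanation(mark),  # sketched above
            markscheme=markscheme,
            sub_question=sub_question,
            student_answer=student_answer,
        )
        per_mark_reasoning[mark] = agent.generate(
            f"Should {mark} be awarded for this answer?",
            system_message=system_message,
        )

    # 2) a final per-sub-question call which sees all of the per-mark reasoning
    #    and decides the overall number of marks to award
    system_message = subq_sys_msg.format(
        sub_question=sub_question,
        markscheme=markscheme,
        student_answer=student_answer,
        per_mark_reasoning="\n".join(f"{m}: {r}" for m, r in per_mark_reasoning.items()),
    )
    return agent.generate(
        "How many marks should be awarded in total?",
        system_message=system_message,
    )
```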
We also need to update the evaluate function, to pass each of the two different system messages to the call_agent function.
🧪 Rerun Tests
Great, this seems to have addressed two of the three failures (on this run at least).
🔁 Iteration 10
🔍 Method Marks Confusion
All of the prior failures now seem to have been resolved, but we have a new regression for Example 215 (b). Let's take a look.
Example 215 (b)
❓ Question [2 Marks]:
Complete these rows of boxes using the rule shown above.
📝 Student's Answer:
The missing boxes are 8, 13, and 21, making the sequence 8, 13, 21, 34, 55.
✍️ Markscheme:
Answer: 8, 13, 21 Marks: 2 Guidance: M1 for one correct subtraction of two boxes
✅ Correct Marks [2/2] Rationale:
Here, we deduce the missing numbers by working backwards from 34 and 55 or by applying the same rule forward. According to the markscheme (2 marks), we get full credit for the fully correct set of missing numbers, with one method mark (M1) for a correct subtraction and a final accuracy mark (A1) for the correct sequence.
🤖 Predicted Marks [0/2] Rationale:
The candidate's final numerical answer is correct (the missing boxes are 8, 13, and 21), matching the answer in the mark scheme. However, the mark scheme specifically awards M1 for showing a correct subtraction (i.e. working that reverses the addition rule), which is a required part of the method. Since the candidate only provided the final answer without demonstrating any subtraction (or any equivalent working) to show the method, no evidence is provided to award the M1 mark. As A marks (for an accurate answer) depend on obtaining the method mark, no further marks can be awarded. Therefore, despite the correct final answer, the lack of required working means no marks should be awarded.
This is an interesting failure mode. Notably, the justification for the "correct" (ground truth) marks is itself wrong: there is no A1 mark for this question (which would depend on a method mark). This is irrelevant in terms of the agent failure (the agent doesn't know the correct marks or rationale), but it's still an interesting observation regarding our "ground truth" data.
Interestingly, the agent has made the same mistake as occurs in the "ground truth" rationale. Our agent presumes the existence of an A mark where none was stated. It seems like the agent doesn't understand that correct answers should always earn full marks, unless otherwise explicitly stated. M1 marks are not necessary to achieve full marks in such cases, unless an A mark is specifically referenced.
🔀 Clarify Mark + Answer Marks
Let's try to fully clarify these points for the sub-question agent, and re-run the evals.
🧪 Rerun Tests
Great, we've finally got all 10/10 tests passing perfectly 🎉