Lookg a bit deeper,
it seems we’re getting the same failures as we were before.
These errors seem quite persistent,
we might need to make more radical changes to try and address them.
Let’s revisit each of these failures again,
and explore them in more detail,
to see how we can rectify them.
Given the consistent prevalence of these errors,
it’s useful to consider these failing examples across all experiments,
and see the different kinds of rationale our agent is giving across each run.
Our evaluation runs serve both as tests for comparing parameter configurations
(when the agent output depend strongly on the changed parameters 🔀)
and they also perform a kind of sampling from the noise distribution for each test set example
(when the agent output depends less strongly on the changed parameters across each experiment 📊).
This is one benefit of the flexible spreadsheet design.
Experiments are not indivisible atomic groups,
they’re simply labels,
and the raw evaluation data can be juggled in whatever way makes sense to you as the user.
Let’s first recap the fixed question ❓,
student answer 📝,
markscheme ☑️, and correct marks ✅,
before unpacking the various agent justifications 🤖 across all experiments.
The candidate’s order is 1/3, 0.34, 3.5%. According to the markscheme, this earns SC1 (special case, partially correct ordering) for 1 mark.
The correct order should be 3.5%, 1/3, 0.34. The student’s answer is incorrect. [No reference to SC1 💡]
The student’s ordering is incorrect because smallest first should be 3.5%, 1/3, 0.34, but the student provided 1/3, 0.34, 3.5%. [No reference to SC1 💡]
The required order, from smallest to largest, is 3.5%, 1/3, 0.34 (since 3.5% = 0.035, 1/3 ≃ 0.333, and 0.34 = 0.34). The student’s answer lists the numbers as 1/3, 0.34, 3.5%, which is not in the correct order. [No reference to SC1 💡]
The correct order should list the values from smallest to largest: 3.5% (which is equivalent to 0.035), then 1/3 (approximately 0.333…), and finally 0.34. The student’s answer, ‘1/3, 0.34, 3.5%’, does not follow the required order. Therefore, no marks can be awarded for this answer. [No reference to SC1 💡]
The correct order, based on numerical values (3.5% = 0.035, 1/3 ≈ 0.333, 0.34 = 0.34), should be 3.5%, 1/3, 0.34. The student’s answer (1/3, 0.34, 3.5%) does not follow the ‘smallest first’ requirement. Therefore, no marks should be awarded. [No reference to SC1 💡]
Clearly,
the agent is really failing to pick up on the SC1 mark from the markscheme,
irrespective of the various parameter changes we’ve made across each experiment run.
Peter makes a large amount of pink paint by mixing red and white paint in the ratio 2 : 3.
• Red paint costs £80 per 10 litres.
• White paint costs £5 per 10 litres.
• Peter sells his pink paint in 10-litre tins for £60 per tin.
Calculate how much profit he makes for each tin he sells.
You must show your working.
£ …………………………………………
Peter needs 4 litres of red paint and 6 litres of white paint for every 10-litre batch (because 2 out of 5 parts is red, 3 out of 5 parts is white).
25 with correct working (5 marks)
“Correct working” requires evidence of at least M1M1
M1 for 10 × (2/5) = 4 litres red or for 10 × (3/5) = 6 litres white
M1 for red costs £8 per litre or for white costs £0.50 per litre
M1 for cost of one 10-litre can is their 4 × their 8 + their 6 × their 0.5
M1 for 60 – their 35
Alternative method:
M1 for 2 : 3 = 20 litres red : 30 litres white
M1 for 2 × £80 + 3 × £5 = £175
M1 for (their 175) ÷ 5 = 35
M1 for 60 – their 35
M1 for cost of one 10-litre can is their 4 × their 8 + their 6 × their 0.5
If 0 or 1 scored, instead award SC2 for answer 25 with no working or insufficient working
Awarding 1 mark (M1) for correctly finding that a 10-litre tin contains 4 litres of red and 6 litres of white from the 2:3 ratio. No further steps towards the final profit are shown, so only this 1 mark is earned.
The full solution should involve several steps:
Determining that in a 10‐litre mix, Peter uses 4 litres of red (2/5 of 10) and 6 litres of white (3/5 of 10).
Calculating the cost per litre for each colour:
Red: £80/10 = £8 per litre, so 4 litres costs 4 × £8 = £32.
White: £5/10 = £0.50 per litre, so 6 litres costs 6 × £0.50 = £3.
Adding the costs gives a total cost of £32 + £3 = £35.
Subtracting this from the selling price, the profit per tin is £60 − £35 = £25.
The student’s answer only identifies the paint volumes (4 litres red and 6 litres white) without completing the cost and profit calculations. [No reference to M1 💡]
The student’s response only states that Peter needs 4 litres of red paint and 6 litres of white paint per 10-litre batch. It does not show any further working or provide the calculation of the cost of the paints or the profit per tin. Therefore, the answer does not meet the full requirements for the calculation and is awarded 0 marks. [No reference to M1 💡]
The student only shows that a 10‐litre tin requires 4 litres red paint and 6 litres white paint (i.e. the ratio part), but they do not continue to use these figures to work out the cost or the profit. This is an incomplete answer which does not answer the question. [No reference to M1 💡]
The student’s work shows part of the method by correctly determining that a 10‐litre batch contains 4 litres of red paint and 6 litres of white paint. However, the answer does not extend to calculate the cost of the red and white paint, nor does it compute the final profit per tin by subtracting the total cost from the selling price. According to the mark scheme, a correct final answer of £25 with full working is needed for 5 marks. Since the student has given only part of the working (the mix ratio) and omitted the critical profit calculation, the answer deserves only credit for knowing the ratio. In such cases, with insufficient working, the mark scheme permits awarding SC2 (2 marks) for an answer that states the correct profit value by implication. As the final profit figure is missing, awarding 2 marks is appropriate. [No reference to M1 💡]
The answer only shows that Peter uses 4 litres of red paint and 6 litres of white paint in a 10‑litre batch. However, the question asks for the profit per tin, which requires further calculation including the cost of the paints and then subtracting from the selling price (£60). No profit calculation or final numerical answer for the profit is given. As a result, none of the required steps for computing profit were provided. Therefore, no marks can be awarded. [No reference to M1 💡]
Again,
regardless of our parameter variations,
it seems as though the agent is not correctly understanding the M1 mark in the markscheme,
which clearly states M1 for 10 × (2/5) = 4 litres red or for 10 × (3/5) = 6 litres white.
Even on the two occassions where it got things right,
it feels like a lucky guess,
as this was not justified via the markscheme’s M1 mark.
Example 132 (c) is failing for a slighly different reason,
and so we’ll consider this separately.
In general,
as we get deeper into the evaluation iterations,
it’s often wise to consider multiple failure modes at once.
Larger evals can be expensive to run,
and you generally want to use all of the newly gained knowledge to try and improve your agent in the next evaluation run,
even if this means making several unrelated changes to address several unrelated failure modes.
A newspaper headline reads: ‘High temperatures make more people buy ice cream!’ Does the graph above prove this claim? Give a reason for your decision.
No. The graph alone does not completely prove the claim, because we cannot be certain this trend always holds true in all situations.
Answer: No, because there may be other factors involved (2 marks)
Guidance: B1 for ‘No’, with partial reason.
The mark scheme gives 2 marks if the student states “No, there may be other factors involved.” Here, only a partial reason is given, so 1 mark is awarded for denying that the graph proves the claim but not fully explaining other possible factors.
The response correctly explains that the graph, by itself, does not prove the claim as it only shows a correlation under the conditions measured. [No mention of other factors, which is needed 💡]
[No rationale given at all! 💡]
[No rationale given at all! 💡]
The explanation correctly states that the graph alone does not prove the claim due to other potential factors affecting ice cream sales. This answer meets the criteria for 2 marks. [Not true, the student doesn’t mention “other potential factors affecting ice cream sales” 💡]
The answer indicates that the graph alone does not fully prove the claim since other factors might be involved, which adequately meets the criteria for the mark scheme. [Not true, the student doesn’t mention “other factors might be involved” 💡]
In this case,
it’s not so much an error with understanding the mark types,
it’s more of an issue with falsely hallucinating things the student said,
or perhaps being very liberal in the interpretation.
Perhaps the agent is confusing the markscheme with the student answer.
The student’s answer “we cannot be certain this trend always holds true” is very different to “there may be other factors involved”.
The former (incorrect) assumes the trend does exist,
but just might not continue,
while the (correct) latter is a point about correlation != causation,
indicating the apparent causation may not exist at all.
Firstly,
the recurring problem for Example 207 and 261 seems to be that the agent doesn’t remember and/or understand the different types of marks (B1, SC1, M1 etc.).
Let’s be more explicit,
and parse each sub-question markscheme for the different mark types,
and add the explanations directly as part of the sub-question specific markschemes,
and see if this improves performance.
Let’s first create a dictionary with the mark type explanations,
written in a more direct manner to accompany the subquestion-specific markschemes and to make it easier to parse:
mark_types ={"M":"M{num} ({num_marks}) should be awarded if a correct method is used, and should not be lost for purely numerical errors.","A":"A{num} ({num_marks}) should be awarded for an accurate answer, and this depends on preceding M (method) marks. If preceding M (method marks are not awarded, then A{num} cannot be awarded).","B":"B{num} ({num_marks}) should be awarded for the correct final answer, a partially correct answer, or a correct intermediate stage (depending on how this is expressed and explained below). B{num} is independent of M (method) marks.","SC":"SC{num} ({num_marks}) should be awarded for the special cases explained below, which are worthy of some credit.",}
Let’s then write a simple function to update each sub-question specific markscheme,
prepending the markscheme with the relevant definitions from our mark_types dict,
so that the agent has all the relevant information close at hand:
@unify.traced(name="update_markscheme{subquestion}")defupdate_markscheme(subquestion:str, markscheme:str): m_marks =sorted(list(set(re.findall(r"M\d+", markscheme)))) a_marks =sorted(list(set(re.findall(r"A\d+", markscheme)))) b_marks =sorted(list(set(re.findall(r"B\d+", markscheme)))) sc_marks =sorted(list(set(re.findall(r"SC\d+", markscheme))))ifnotany(m_marks + a_marks + b_marks + sc_marks):return markscheme markscheme =("{mark_types}With this in mind, marks should be awarded as follows:\n"+ markscheme)for marks in(m_marks, a_marks, b_marks, sc_marks):for mark in marks: key ="".join(c for c in mark ifnot c.isdigit()) num_marks =int("".join(c for c in mark if c.isdigit())) explanation = mark_types[key] explanation = explanation.replace("{num}",str(num_marks),).replace("{num_marks}","1 mark"if num_marks ==1elsef"{num_marks} marks",) markscheme = markscheme.replace("{mark_types}", explanation +"\n{mark_types}",) markscheme = markscheme.replace("{mark_types}","",)return markscheme
Let’s now update our call_agent method such that the markscheme changes are dynamically applied before passing to the agent:
@unify.traceddefcall_agent( system_msg, question, sub_questions, markscheme, answer, available_marks_total,): local_agent = agent.copy() with_subqs =len(markscheme)>1 response_format = create_response_format(list(markscheme.keys())if with_subqs elseNone,) local_agent.set_response_format(response_format)if with_subqs: output_response_exp = output_response_explanations["with_subqs"] output_response_exp = output_response_exp.replace("{subquestions}",", ".join(list(markscheme.keys())),)else: output_response_exp = output_response_explanations["without_subqs"] markscheme ={ k: update_markscheme(f"_{k}"if k !="_"else"", v)for k, v in markscheme.items()} local_agent.set_system_message( system_msg.replace("{question}", textwrap.indent(question," "*4),).replace("{markscheme}", pretty_print_dict(markscheme, indent=4),).replace("{answer}", pretty_print_dict(answer, indent=4),).replace("{available_marks_total}",str(available_marks_total),).replace("{questions_markscheme_and_answers}", pretty_print_dict({ k:{"sub-question": sub_questions[k],"markscheme": markscheme[k],"answer": answer[k],}for k in sub_questions.keys()}, indent=4,),).replace("{output_response_explanation}", output_response_exp,),) ret = local_agent.generate()if"```"in ret: ret = ret.split("```")[-2].lstrip("json") ret = response_format.model_validate_json(ret).model_dump()ifnot with_subqs:return{"_": ret}return ret
We’ve just addressed the recurring problem for Example 207 and 261,
but the failure for Example 132 (c) was quite different.
Let’s add another instructions to our general_guidelines variable,
with an imaginary extra piece of guidance,
to try and avoid the leniency we’ve observed in the marking of Example 132 (c).
general_guidelines = ( general_guidelines.rstrip("-") + """15.When students are explaining something in their answer, then their explanation must make *exactly* the same point(s) as are made in the markscheme. The wording can be slightly different, but the underlying observations/reasons must be *identical*, unless otherwise stated *explicitly* in the markscheme.----""")
Now we’ve made both of these changes,
let’s re-run our evals to see if either of these changes were able to address the problems they’re intended to resolve.
with unify.Experiment("align_guidelines_and_clarify_reasoning", overwrite=True,), unify.Params( system_message=system_message, dataset="TestSet10", source=unify.get_source(),): unify.map( evaluate,[dict(**d.entries, _system_message=system_message)for d in test_set_10], name="Evals",)
Our failure mechanisms are exactly the same as before,
clearly the agent is still struggling to correctly reason about the different mark types.
Let’s take a look at the traces,
to ensure that the system message template has been implemented correctly,
and each LLM call has the template variables in the system message populated correctly.
It seems as though everything was implemented correctly,
and the per-LLM system messages look good ✅
Again, let’s explore what’s going wrong in the next iteration 🔁