These updates have actually regressed overall performance:
the mean error is now 0.5.
Maybe the purely local reasoning has some shortcomings.
Let’s focus on one of the new regressions,
to understand why our latest change has disrupted the agent on an example where it was previously very consistently correct.
He might not want to use Account A because the bank might keep the money forever and he could lose it.
He may not want to leave it there for 3 years. Accept any valid reason.
The markscheme awards 1 mark for a valid reason such as not wanting to lock away funds for 3 years. Here, the stated reason is irrelevant and does not accurately reflect the restrictions of Account A.
The mark scheme states that any valid reason for not using Account A is acceptable, specifically mentioning not leaving the money in the account for 3 years. The candidate’s answer gives a different, yet valid, reason: that he might lose the money because the bank could keep it forever. Although not identical to the mark scheme’s example, the answer is acceptable as a valid reason, therefore the single mark should be awarded.
Clearly,
the agent is now taking some liberties with what constitutes a “valid reason”.
It’s unclear why the agent is only making these mistakes now.
Let’s look at some of the previous justifications for not awarding the mark,
from the previous runs.
The answer should explain that although Account A gives slightly more money, Derrick might not want this account because his money is tied up for three years (i.e. it has poor liquidity). The student’s explanation about the bank “keeping the money forever” is not an acceptable financial rationale.
The explanation provided is not a valid reason and does not address the restriction of no withdrawals for three years. Therefore, no mark can be awarded for this part either.
The explanation given (“because the bank might keep the money forever and he could lose it”) does not correctly address why he might not want to use Account A (namely, that his money is locked away for 3 years with no withdrawals allowed).
The answer needed to explain why Account A might not be desirable, which is because funds are locked in for three years and withdrawals are not allowed until the end of the period. The student’s answer states that the bank might keep the money forever and he could lose it, which does not address the actual restriction (lack of liquidity). This is an irrelevant or incorrect reason and does not meet the mark scheme.
The mark scheme accepts any valid reason such as not being able to withdraw the money because it is locked in for 3 years. The candidate’s answer – that the bank might ‘keep the money forever’ and he could lose it – is not a valid explanation for the funds being inaccessible. It does not clearly express that his money is tied up for 3 years. Therefore, this explanation does not meet the requirement and no mark should be awarded.
The mark scheme accepts any valid reason such as the inconvenience of having to leave the money invested for 3 years. The candidate’s explanation – that the bank might keep the money forever and he could lose it – is not a valid or relevant reason within the context of the question. Consequently, the answer does not meet the requirement for awarding the 1 mark.
The explanation required a valid reason such as the lack of liquidity (money locked for three years). The candidate’s answer instead gives an unrelated and implausible reason, making it unacceptable for the mark scheme. Therefore, no mark is awarded.
Perhaps withholding the full question prevents the agent from using “common sense” and realizing how “silly” the proposed answer is,
in light of the overall question and the information provided to the student.
Maybe strict adherence to the markscheme alone, without the full context, is too restrictive.
Let’s update our per-subquestion system prompts to also fully include the preceding sub-questions,
their markschemes,
and their answers.
It’s unlikely that the context of a later question will assist with the marking of an earlier question,
and we still want to keep the agent as focused as possible on the relevant information.
Let’s first update the system prompt,
re-introducing the placeholder for the aligned subquestions,
markschemes and answers,
this time calling it {prior_context},
which will only be included when sub-questions are present.
Let’s also include the full question.
system_message = """Your task is to award a suitable number of marks for a student's answer to question {subq}, from 0 up to a maximum of {available_marks} marks.The general marking guidelines (relevant for all questions) are as follows:{general_guidelines}The *overall* question is:{question}{prior_context}The specific question you need to mark is:{subquestion}Their answer to this specific question is:{answer}The markscheme for this specific question is:{markscheme}{output_response_explanation}""".replace( "{general_guidelines}", general_guidelines,)
Let’s also add a general explanation for the prior context,
in cases where it is included.
prior_context_exp = """All of the *preceeding* sub-questions, their specific markschemes and the student's answers are as follows:"""
Let’s now update call_agent to pass in the required information.
@unify.traced
def call_agent(
    example_id,
    system_msg,
    question_num,
    question,
    sub_questions,
    markscheme,
    answer,
    available_marks,
):
    agents = {k: agent.copy() for k in markscheme.keys()}
    with_subqs = len(markscheme) > 1
    response_formats = {
        k: create_marks_and_reasoning_format(
            [
                itm[0]
                for itm in parse_marks_from_markscheme(
                    f"_{k}" if k != "_" else "", v,
                )
            ],
        )
        for k, v in markscheme.items()
    }
    [
        agnt.set_response_format(rf)
        for agnt, rf in zip(
            agents.values(),
            response_formats.values(),
        )
    ]
    markscheme = {
        k: update_markscheme(f"_{k}" if k != "_" else "", v)
        for k, v in markscheme.items()
    }
    for i, k in enumerate(markscheme.keys()):
        agents[k].set_system_message(
            system_msg.replace(
                "{subq}", k.replace("_", str(question_num)),
            ).replace(
                "{question}", textwrap.indent(question, " " * 4),
            ).replace(
                "{subquestion}", textwrap.indent(sub_questions[k], " " * 4),
            ).replace(
                "{markscheme}", textwrap.indent(markscheme[k], " " * 4),
            ).replace(
                "{answer}", textwrap.indent(answer[k], " " * 4),
            ).replace(
                "{available_marks}",
                str(available_marks[k.replace("_", "total")]),
            ).replace(
                "{output_response_explanation}",
                output_response_explanation,
            ).replace(
                "{prior_context}",
                (
                    (
                        prior_context_exp
                        + pretty_print_dict(
                            {
                                k: {
                                    "sub-question": sub_questions[k],
                                    "markscheme": markscheme[k],
                                    "answer": answer[k],
                                }
                                for k in list(sub_questions.keys())[0:i]
                            },
                            indent=4,
                        )
                    )
                    if with_subqs and i > 0
                    else ""
                ),
            ),
        )
    rets = unify.map(
        lambda k, a: a.generate(tags=[k]),
        list(agents.items()),
        name=f"Evals[{example_id}]->SubQAgent",
    )
    rets = [
        ret.split("```")[-2].lstrip("json") if "```" in ret else ret
        for ret in rets
    ]
    rets = {
        k: response_formats[k].model_validate_json(ret).model_dump()
        for k, ret in zip(markscheme.keys(), rets)
    }
    return rets
with unify.Experiment(
    "with_preceeding_context",
    overwrite=True,
), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [
            dict(**d.entries, _system_message=system_message)
            for d in test_set_10
        ],
        name="Evals",
    )
Great,
so we’ve fixed the new regressions,
but again we’re back at the same three failures,
failing for the same reason.
Let’s take a look at the traces,
to ensure that the system message template has been implemented correctly,
and each LLM call has the template variables in the system message populated correctly.
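Beyond eyeballing the traces, a quick programmatic sanity check is possible. This is a minimal sketch, assuming the populated per-LLM system messages can be pulled out of the traces as plain strings; it simply scans each one for any template variables that survived substitution:

```python
# The template variables we expect call_agent to have substituted.
PLACEHOLDERS = [
    "{subq}",
    "{question}",
    "{prior_context}",
    "{subquestion}",
    "{answer}",
    "{markscheme}",
    "{available_marks}",
    "{output_response_explanation}",
]

def unfilled_placeholders(system_message: str):
    # Return any known placeholders still present verbatim.
    return [p for p in PLACEHOLDERS if p in system_message]

# A fully populated message should report nothing left over.
populated = "Your task is to award a suitable number of marks ..."
assert unfilled_placeholders(populated) == []
```

Running this over every system message in an experiment gives an automatic version of the manual check above, which is handy once the number of sub-question agents grows.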
It seems as though everything was implemented correctly,
and the per-LLM system messages look good ✅
Again,
let’s explore what’s going wrong in the next iteration 🔁