As usual,
let’s take a look and explore why the agent might be failing on the remaining examples 🕵️
Given that the agent is still failing to follow the instructions for each mark in the markscheme,
perhaps it’s time we tried to perform per-mark reasoning,
with a separate LLM call made for each candidate mark to award.
This might help the LLM consider each candidate mark mentioned in the markscheme more deeply.
We will still want our per-subquestion LLM to perform the final reasoning about the number of marks to award for the sub-question,
but we just want to provide it with the reasoning performed by each of our per-mark LLM queries.
We therefore now have two different LLMs,
with two different roles,
and therefore we need two different system messages.
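Before wiring this up, here is a rough sketch of the intended two-stage flow. It is purely illustrative, and mark_agent_generate / subq_agent_generate are hypothetical stand-ins for the agents we configure below:

# Illustrative sketch only -- the real implementation below also handles
# prior context, structured response formats and tracing.
def mark_subquestion(subquestion, answer, markscheme, candidate_marks):
    # Stage 1: one LLM call per candidate mark (e.g. "M1", "A1"), each
    # focused solely on whether that single mark should be awarded.
    per_mark_observations = {
        mark: mark_agent_generate(subquestion, answer, markscheme, mark)
        for mark in candidate_marks
    }
    # Stage 2: a single per-subquestion LLM call, which sees all of the
    # per-mark observations and decides the final number of marks to award.
    return subq_agent_generate(
        subquestion, answer, markscheme, per_mark_observations,
    )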
Let’s first update the subquestion system message,
in anticipation of the incoming mark-by-mark reasoning.
Let’s also keep the markscheme and the mark-type explanation as separate placeholders,
rather than naively combining them as was done in update_markscheme.
subq_system_message = """Your task is to award a suitable number of marks for a student's answer to question {subq}, from 0 up to a maximum of {available_marks} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}

The *overall* question is:

{question}

{prior_context}The specific question you need to mark is:

{subquestion}

Their answer to this specific question is:

{answer}

The markscheme for this specific question is:

{markscheme}

{mark_types_explanation}

{mark_observations}

{output_response_explanation}""".replace(
    "{general_guidelines}",
    general_guidelines,
)
The "{mark_types_explanation}" placeholder can be overriden explicitly,
giving us more control.
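Concretely, overriding the placeholder is just a string replacement on the template (the explanation text here is purely hypothetical):

# Hypothetical override -- in practice the explanation will be generated
# per sub-question by extract_mark_type_explanation, defined next.
custom_explanation = "As a recap, M1 is a method mark, awarded for a valid method..."
system_message = subq_system_message.replace(
    "{mark_types_explanation}",
    custom_explanation,
)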
Let’s create a new function extract_mark_type_explanation,
inspired by update_markscheme above.
@unify.traced(name="extract_mark_type_explanation{subquestion}")defextract_mark_type_explanation( subquestion:str, markscheme:str, marks_to_consider=None,): m_marks =sorted(list(set(re.findall(r"M\d+", markscheme)))) a_marks =sorted(list(set(re.findall(r"A\d+", markscheme)))) b_marks =sorted(list(set(re.findall(r"B\d+", markscheme)))) sc_marks =sorted(list(set(re.findall(r"SC\d+", markscheme))))ifnotany(m_marks + a_marks + b_marks + sc_marks):return"" full_exp ="As a recap, {mark_types_explanation}"for marks in(m_marks, a_marks, b_marks, sc_marks):for mark in marks:if marks_to_consider and mark notin marks_to_consider:continue key ="".join(c for c in mark ifnot c.isdigit()) num_marks =int("".join(c for c in mark if c.isdigit())) exp = mark_types[key] exp = exp.replace("{num}",str(num_marks),).replace("{num_marks}","1 mark"if num_marks ==1elsef"{num_marks} marks",) full_exp = full_exp.replace("{mark_types_explanation}", exp +"\n{mark_types_explanation}",)return full_exp.replace("{mark_types_explanation}","")
Let’s now create the system message for our mark reasoning agent,
again with the explicit {mark_types_explanation} placeholder.
mark_system_message = """Your task is to determine whether mark {mark} should be awarded for the following student's answer to question {subq}, based on the provided markscheme.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}

The *overall* question is:

{question}

{prior_context}The specific question you need to mark is:

{subquestion}

Their answer to this specific question is:

{answer}

The markscheme for this specific question, with the mark in question {mark} expressed in bold and prepended with `(to consider!)`, is as follows:

{markscheme}

{mark_types_explanation}

You should populate the `thoughts` field with your thoughts on whether the specific mark {mark} identified within the markscheme should be awarded for the student's answer. This {mark} mark might be irrelevant given the student's approach or answer, in which case just respond `False` for the `should_award` field, and explain this in the `thoughts` field. Please think carefully about your decision for awarding this {mark} mark, considering the general guidelines.""".replace(
    "{general_guidelines}",
    general_guidelines,
)
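The per-mark agents will return structured output containing the `thoughts` and `should_award` fields referenced in the system message above. If this response format isn't already defined in your notebook, a minimal pydantic sketch along these lines should suffice:

from pydantic import BaseModel

# Minimal sketch of the per-mark response format, assuming pydantic models
# are used for structured outputs (as with MarksAndReasoning earlier).
class ThoughtsAndAwardDecision(BaseModel):
    thoughts: str
    should_award: bool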
Let’s first define call_subq_agent,
which will include mark-by-mark reasoning, with several LLM calls per sub-question.
@unify.traced(name="call_subq_agent_{subq}")defcall_subq_agent( example_id, subq, subq_agent, markscheme, parsed_markscheme, mark_sys_msg,): mark_agents =[[k, agent.copy()]for k in[itm[0]for itm in parsed_markscheme]][agnt.set_response_format(ThoughtsAndAwardDecision)for _, agnt in mark_agents]for i,(k, v)inenumerate(parsed_markscheme): mark_agents[i][1].set_system_message( mark_sys_msg.replace("{mark}", k,).replace("{markscheme}", textwrap.indent( markscheme.replace( v, v.replace(k,f"**{k}** (to consider!)"),)," "*4,),).replace("{mark_types_explanation}", extract_mark_type_explanation(f"_{k}({i})"if k !="_"else"", markscheme,[k],),),)if mark_agents: explanation ="An expert marker has already taken a look at the student's answer, and they have made the following observations for each of the candidate marks mentioned in the markscheme. You should pay special attention to these observations." vals = unify.map(lambda i, m, a: json.loads(a.generate(tags=[m +f"({i})"])),[tuple([i]+ item)for i, item inenumerate(mark_agents)], name=f"Evals[{example_id}]->SubQAgent[{subq}]->MarkAgent",) keys =list()for k, _ in mark_agents: keys.append( k +f"({len([ky for ky in keys if k in ky])})",) mark_obs_dict =dict(zip(keys, vals)) mark_observations =( explanation+"\n\n"+ pretty_print_dict( mark_obs_dict, indent=4,))else: mark_observations ="" subq_agent.set_system_message( subq_agent.system_message.replace("{mark_observations}", mark_observations,),) ret = subq_agent.generate(tags=[subq])if"```"in ret: ret = ret.split("```")[-2].lstrip("json") ret = json.loads(ret)ifnot mark_agents:return ret ret["reasoning"]={**mark_obs_dict,"overall_thoughts": ret["reasoning"],}return ret
Let’s now update call_agent,
making use of our call_subq_agent function,
which processes a single sub-question.
@unify.traced
def call_agent(
    example_id,
    subq_system_message,
    mark_system_message,
    question_num,
    question,
    sub_questions,
    markscheme,
    answer,
    available_marks,
):
    subq_agents = {k: agent.copy() for k in markscheme.keys()}
    with_subqs = len(markscheme) > 1
    response_formats = {k: MarksAndReasoning for k, v in markscheme.items()}
    [
        agnt.set_response_format(rf)
        for agnt, rf in zip(
            subq_agents.values(),
            response_formats.values(),
        )
    ]
    mark_sys_msgs = list()
    parsed_markschemes = list()
    for i, k in enumerate(markscheme.keys()):
        parsed_markscheme = parse_marks_from_markscheme(
            f"_{k}" if k != "_" else "",
            markscheme[k],
        )
        parsed_markschemes.append(parsed_markscheme)
        this_markscheme = markscheme[k]
        # use a separate index `j` here so we don't shadow the sub-question
        # index `i`, which is needed below when building the prior context
        for j, (mark, chunk) in enumerate(parsed_markscheme):
            this_markscheme = this_markscheme.replace(
                chunk,
                chunk.replace(
                    mark,
                    f"{mark}({len([m for m, _ in parsed_markscheme[0:j] if m == mark])})",
                ),
            )
        subq_agents[k].set_system_message(
            subq_system_message.replace(
                "{subq}",
                k.replace("_", str(question_num)),
            ).replace(
                "{question}",
                textwrap.indent(question, " " * 4),
            ).replace(
                "{subquestion}",
                textwrap.indent(sub_questions[k], " " * 4),
            ).replace(
                "{markscheme}",
                textwrap.indent(this_markscheme, " " * 4),
            ).replace(
                "{mark_types_explanation}",
                textwrap.indent(
                    extract_mark_type_explanation(
                        f"_{k}" if k != "_" else "",
                        markscheme[k],
                    ),
                    " " * 4,
                ),
            ).replace(
                "{answer}",
                textwrap.indent(answer[k], " " * 4),
            ).replace(
                "{available_marks}",
                str(available_marks[k.replace("_", "total")]),
            ).replace(
                "{output_response_explanation}",
                output_response_explanation,
            ).replace(
                "{prior_context}",
                (
                    (
                        prior_context_exp
                        + pretty_print_dict(
                            {
                                k: {
                                    "sub-question": sub_questions[k],
                                    "markscheme": markscheme[k],
                                    "answer": answer[k],
                                }
                                for k in list(sub_questions.keys())[0:i]
                            },
                            indent=4,
                        )
                    )
                    if with_subqs and i > 0
                    else ""
                ),
            ),
        )
        mark_sys_msgs.append(
            mark_system_message.replace(
                "{subq}",
                k.replace("_", str(question_num)),
            ).replace(
                "{question}",
                textwrap.indent(question, " " * 4),
            ).replace(
                "{subquestion}",
                textwrap.indent(sub_questions[k], " " * 4),
            ).replace(
                "{answer}",
                textwrap.indent(answer[k], " " * 4),
            ).replace(
                "{prior_context}",
                (
                    (
                        prior_context_exp
                        + pretty_print_dict(
                            {
                                k: {
                                    "sub-question": sub_questions[k],
                                    "markscheme": markscheme[k],
                                    "answer": answer[k],
                                }
                                for k in list(sub_questions.keys())[0:i]
                            },
                            indent=4,
                        )
                    )
                    if with_subqs and i > 0
                    else ""
                ),
            ),
        )
    rets = unify.map(
        lambda *a: call_subq_agent(example_id, *a),
        list(sub_questions.keys()),
        list(subq_agents.values()),
        list(markscheme.values()),
        parsed_markschemes,
        mark_sys_msgs,
        from_args=True,
        name=f"Evals[{example_id}]->SubQAgent",
    )
    return dict(zip(markscheme.keys(), rets))
We also need to update the evaluate function,
to pass each of the two different system messages to the call_agent function.
with unify.Experiment(
    "queries_per_mark",
    overwrite=True,
), unify.Params(
    subq_system_message=subq_system_message,
    mark_system_message=mark_system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [
            dict(
                **d.entries,
                _subq_system_message=subq_system_message,
                _mark_system_message=mark_system_message,
            )
            for d in test_set_10
        ],
        name="Evals",
    )
Great,
this seems to have addressed two of the three failures (on this run at least).
Let’s take a look at the traces,
to ensure that the system message template has been implemented correctly,
and each LLM call has the template variables in the system message populated correctly.
It seems as though everything was implemented correctly,
and the per-LLM system messages look good ✅
Again,
let’s explore what’s going wrong in the next iteration 🔁