As usual,
let’s explore why the agent might be failing on the remaining examples 🕵️
Let’s try to force the agent to reason about each potential mark mentioned in the markscheme,
by further refining our structured output.
Let’s expand the reasoning field for each sub-question
to include a field for each mark type referenced in the sub-question markscheme,
going from the following structure:
```
Prediction:
    a:
        reasoning: str
        marks: int
    b:
        reasoning: str
        marks: int
    ...
```
To this version which explicitly enforces reasoning about each potential mark type referenced in the markscheme:
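(The structure below is a sketch; the per-mark fields, M1 and A1 here, are only examples, and the actual fields depend on the marks parsed from each sub-question's markscheme.)

```
Prediction:
    a:
        reasoning:
            M1:
                thoughts: str
                should_award: bool
            A1:
                thoughts: str
                should_award: bool
        marks: int
    b:
        reasoning:
            ...
        marks: int
    ...
```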
Let’s first define a function to dynamically construct the required pydantic type.
For each parsed mark type,
we want the model to give its thoughts and make a decision as to whether or not the mark should be awarded.
Let’s create this pydantic type first:
```python
from pydantic import BaseModel, create_model  # create_model builds pydantic models dynamically


class ThoughtsAndAwardDecision(BaseModel):
    thoughts: str
    should_award: bool
```
Let’s then create a function to dynamically construct a PerMarkReasoning pydantic type,
with one ThoughtsAndAwardDecision instance for each mark detected in the sub-question markscheme.
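A minimal sketch of how this can be done with pydantic’s create_model (the helper name create_per_mark_reasoning_format is our own; only the resulting PerMarkReasoning type is prescribed above):

```python
def create_per_mark_reasoning_format(mark_types):
    # One ThoughtsAndAwardDecision field per parsed mark label
    # (e.g. "M1", "A1", "B1", "SC1").
    return create_model(
        "PerMarkReasoning",
        **{mark: (ThoughtsAndAwardDecision, ...) for mark in mark_types},
    )
```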
Let’s then re-define MarksAndReasoning (previously this was statically defined, see previous iteration)
such that the reasoning field is no longer just a string,
but is instead our newly created PerMarkReasoning (above).
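Something along these lines (again a sketch, reusing the helper above; the function name and its mark_types argument match how it is called in create_response_format below):

```python
def create_marks_and_reasoning_format(mark_types):
    # `reasoning` is now a per-mark breakdown rather than a free-form string,
    # while `marks` stays as the integer total awarded for the sub-question.
    return create_model(
        "MarksAndReasoning",
        reasoning=(create_per_mark_reasoning_format(mark_types), ...),
        marks=(int, ...),
    )
```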
Finally,
let’s then update the top-level function create_response_format
such that we’re making use of our newly defined create_marks_and_reasoning_format for each sub-question.
```python
@unify.traced(name="create_response_format_{mark_types}")
def create_response_format(response_keys, mark_types):
    if response_keys:
        response_fields = dict(
            zip(
                response_keys,
                [
                    (create_marks_and_reasoning_format(mark_types[key]), ...)
                    for key in response_keys
                ],
            ),
        )
        return create_model("Response", **response_fields)
    else:
        return create_marks_and_reasoning_format(mark_types["_"])
```
We also need to write a function to parse the relevant marks from each sub-question markscheme.
We can take inspiration from update_markscheme defined in the previous iteration,
which parses the markscheme in the same manner but for a different reason.
Let’s have the function extract the marks,
and also the surrounding context.
```python
@unify.traced(name="parse_marks_from_markscheme{subquestion}")
def parse_marks_from_markscheme(subquestion: str, markscheme: str):
    extracted_marks = re.findall(r"(?:SC|M|A|B)\d+", markscheme)
    if not extracted_marks:
        return []
    marks_n_context = list()
    for i, mark in enumerate(extracted_marks):
        index = markscheme.find(mark)
        chunk = markscheme[0:index]
        if i > 0:
            prev_mark = extracted_marks[i - 1]
            marks_n_context[i - 1][1] += chunk
        markscheme = markscheme[index:]
        marks_n_context.append([mark, chunk])
    marks_n_context[-1][1] += markscheme
    return marks_n_context
```
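To illustrate with an invented markscheme fragment (the subquestion argument, judging by the body above, only feeds the trace name):

```python
parse_marks_from_markscheme(
    "(a)",
    "M1 for correct substitution, A1 for the final answer",
)
# -> [
#     ["M1", "M1 for correct substitution, "],
#     ["A1", "M1 for correct substitution, A1 for the final answer"],
# ]
```

Each mark comes back paired with its surrounding text, spanning roughly from the previous mark through to the next one (or the end of the markscheme).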
Finally,
we’ll also need to update call_agent such that we call parse_marks_from_markscheme on each sub-question markscheme,
and then pass the parsed marks into our newly defined create_response_format.
````python
@unify.traced
def call_agent(
    system_msg,
    question,
    sub_questions,
    markscheme,
    answer,
    available_marks_total,
):
    local_agent = agent.copy()
    with_subqs = len(markscheme) > 1
    response_format = create_response_format(
        list(markscheme.keys()) if with_subqs else None,
        {
            k: [
                itm[0]
                for itm in parse_marks_from_markscheme(f"_{k}" if k != "_" else "", v)
            ]
            for k, v in markscheme.items()
        },
    )
    local_agent.set_response_format(response_format)
    if with_subqs:
        output_response_exp = output_response_explanations["with_subqs"]
        output_response_exp = output_response_exp.replace(
            "{subquestions}",
            ", ".join(list(markscheme.keys())),
        )
    else:
        output_response_exp = output_response_explanations["without_subqs"]
    markscheme = {
        k: update_markscheme(f"_{k}" if k != "_" else "", v)
        for k, v in markscheme.items()
    }
    local_agent.set_system_message(
        system_msg.replace(
            "{question}",
            textwrap.indent(question, " " * 4),
        )
        .replace(
            "{markscheme}",
            pretty_print_dict(markscheme, indent=4),
        )
        .replace(
            "{answer}",
            pretty_print_dict(answer, indent=4),
        )
        .replace(
            "{available_marks_total}",
            str(available_marks_total),
        )
        .replace(
            "{questions_markscheme_and_answers}",
            pretty_print_dict(
                {
                    k: {
                        "sub-question": sub_questions[k],
                        "markscheme": markscheme[k],
                        "answer": answer[k],
                    }
                    for k in sub_questions.keys()
                },
                indent=4,
            ),
        )
        .replace(
            "{output_response_explanation}",
            output_response_exp,
        ),
    )
    ret = local_agent.generate()
    if "```" in ret:
        ret = ret.split("```")[-2].lstrip("json")
    ret = response_format.model_validate_json(ret).model_dump()
    if not with_subqs:
        return {"_": ret}
    return ret
````
Let’s also update our system message to better explain to the agent how it should reason about this new output structure.
```python
output_response_explanations = dict()
output_response_explanations["with_subqs"] = (
    "For each sub-question {subquestions} you should populate the `reasoning` field "
    "with your general thoughts on each individual mark identified in the markscheme, "
    "and also a decision as to whether each of these mark should be awarded. "
    "These marks are not necessarily cumulative with regards to the marks to award, "
    "and some may be irrelevant given the student's approach or answer, "
    "in which case just respond `False` for the `should_award` field. "
    "Finally, you should put the total number of marks to award for each sub-question "
    "in the corresponding `marks` field."
)
output_response_explanations["without_subqs"] = (
    "You should populate the `reasoning` field with your general thoughts on each "
    "individual mark identified in the markscheme, and also a decision as to whether "
    "each of these mark should be awarded. These marks are not necessarily cumulative "
    "with regards to the marks to award, and some may be irrelevant given the "
    "student's approach or answer, in which case just respond `False` for the "
    "`should_award` field. Finally, you should put the total number of marks to award "
    "in the `marks` field."
)
```
```python
with unify.Experiment(
    "mark_type_reasoning",
    overwrite=True,
), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )
```
The failure modes are still entirely unchanged! o3-mini is certainly very stubborn about its decisions for these questions.
Let’s take a look at the traces,
to ensure that the system message template has been implemented correctly,
and each LLM call has the template variables in the system message populated correctly.
It seems as though everything was implemented correctly,
and the per-LLM system messages look good ✅
Again,
let’s explore what’s going wrong in the next iteration 🔁