The full script for running this iteration can be found here.
As usual, let’s take a look and explore why the agent might be failing on the remaining examples 🕵️
Given that the agent is still failing to follow the instructions for each mark in the markscheme, perhaps it’s time we tried per-mark reasoning, with a separate LLM call made for each candidate mark to award. This might help the LLM consider each candidate mark mentioned in the markscheme more deeply.
Let’s give it a try!
We still want our per-subquestion LLM to perform the final reasoning about the number of marks to award for the sub-question; we just also want to provide it with the reasoning performed by each of our per-mark LLM queries.
We therefore now have two LLMs with two different roles, and so we need two different system messages.
Let’s first update the subquestion system message, in anticipation of the incoming mark-by-mark reasoning.
Let’s also split the markscheme and the mark type reasoning, rather than naively combining them as was done in `update_markscheme`.
The "{mark_types_explanation}"
placeholder can be overriden explicitly,
giving us more control.
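Below is a minimal sketch of what the updated template might look like. The placeholder names and wording are my assumptions rather than the original template, but the structure reflects the changes just described: a separate `{mark_types_explanation}` placeholder, plus a `{per_mark_reasoning}` placeholder for the incoming mark-by-mark reasoning.

```python
# A sketch of the updated sub-question system message (placeholder names
# and wording are assumptions, not the original template).
SUBQ_SYSTEM_MESSAGE = """
You are marking a student's answer to one sub-question of an exam.

The sub-question is:

{subquestion}

The markscheme for this sub-question is:

{markscheme}

{mark_types_explanation}

Independent reasoning has already been performed for each candidate mark
in the markscheme:

{per_mark_reasoning}

Given the markscheme and the per-mark reasoning above, reason about the
total number of marks to award, then state it on the final line in the
format: MARKS: <number>
"""
```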
Let’s create a new function, `extract_mark_type_explanation`, inspired by `update_markscheme` above.
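A sketch of how such a function might look, assuming the usual M/A/B mark types and a simple regular expression over the markscheme text (the real explanations and parsing may well differ):

```python
import re

# Hypothetical explanations for each mark type; the wording in the real
# script may differ.
MARK_TYPE_EXPLANATIONS = {
    "M": "M marks are method marks, awarded for a correct method.",
    "A": "A marks are accuracy marks, awarded for a correct answer, and are "
         "dependent on the associated method mark having been awarded.",
    "B": "B marks are awarded for specific results or statements, "
         "independent of method.",
}


def extract_mark_type_explanation(markscheme: str) -> str:
    """Explain only the mark types (M1, A1, B2, ...) that actually appear
    in the given markscheme."""
    types_present = sorted(set(re.findall(r"\b([MAB])\d+\b", markscheme)))
    return "\n".join(MARK_TYPE_EXPLANATIONS[t] for t in types_present)
```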
Let’s now create the system message for our mark reasoning agent, again with the explicit `{mark_types_explanation}` placeholder.
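Again as a sketch, with assumed placeholder names and wording:

```python
# A sketch of the per-mark system message, with the same explicit
# {mark_types_explanation} placeholder (wording is an assumption).
MARK_SYSTEM_MESSAGE = """
You are deciding whether a single candidate mark should be awarded for a
student's answer to one sub-question of an exam.

The sub-question is:

{subquestion}

The full markscheme for this sub-question is:

{markscheme}

{mark_types_explanation}

The single candidate mark you are considering is:

{mark}: {mark_description}

Reason carefully about whether this specific mark should be awarded for
the student's answer, then state your conclusion on the final line in
the format: AWARDED: <yes/no>
"""
```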
Let’s first define `call_subq_agent`, which will include mark-by-mark reasoning with several LLM calls.
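Something like the following, using the OpenAI client purely for illustration (the original script’s client and model may differ), and a hypothetical line-based parsing of candidate marks out of the markscheme:

```python
import re

from openai import OpenAI  # stand-in client; the original script may differ

client = OpenAI()
MODEL = "gpt-4o"  # model choice is an assumption


def call_subq_agent(subquestion, markscheme, answer,
                    subq_system_message, mark_system_message):
    """One LLM call per candidate mark, then a final per-sub-question call
    that is given all of the per-mark reasoning."""
    # Uses extract_mark_type_explanation, defined above.
    mark_types_explanation = extract_mark_type_explanation(markscheme)

    # Hypothetical parsing: treat each markscheme line that starts with a
    # mark token (M1, A1, B2, ...) as one candidate mark.
    candidate_marks = [
        line.strip() for line in markscheme.splitlines()
        if re.match(r"\s*[MAB]\d+", line)
    ]

    # Per-mark reasoning: a separate LLM call for each candidate mark.
    per_mark_reasoning = []
    for mark_line in candidate_marks:
        mark, _, description = mark_line.partition(" ")
        system = mark_system_message.format(
            subquestion=subquestion,
            markscheme=markscheme,
            mark_types_explanation=mark_types_explanation,
            mark=mark,
            mark_description=description,
        )
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": answer}],
        )
        per_mark_reasoning.append(
            f"{mark}:\n{response.choices[0].message.content}"
        )

    # Final per-sub-question call, fed with all of the per-mark reasoning.
    system = subq_system_message.format(
        subquestion=subquestion,
        markscheme=markscheme,
        mark_types_explanation=mark_types_explanation,
        per_mark_reasoning="\n\n".join(per_mark_reasoning),
    )
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": answer}],
    )
    return response.choices[0].message.content
```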
Let’s now update `call_agent`, making use of our `call_subq_agent` function, which processes a single sub-question.
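Assuming, as in earlier iterations, that the question, markscheme and answer are each keyed by sub-question (an assumption on my part), `call_agent` then reduces to a thin loop over `call_subq_agent`:

```python
def call_agent(question, markscheme, answer,
               subq_system_message, mark_system_message):
    """Mark every sub-question, delegating each to call_subq_agent.

    Assumes question, markscheme and answer are dicts keyed by
    sub-question label (e.g. "a", "b").
    """
    return {
        subq: call_subq_agent(
            question[subq],
            markscheme[subq],
            answer[subq],
            subq_system_message,
            mark_system_message,
        )
        for subq in markscheme
    }
```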
We also need to update the `evaluate` function, to pass the two different system messages through to the `call_agent` function.
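Roughly along these lines, where the example structure and the `parse_marks` helper are hypothetical stand-ins:

```python
import re


def parse_marks(response: str) -> int:
    """Hypothetical helper: pull the awarded marks from the final
    'MARKS: <n>' line of the agent's response."""
    match = re.search(r"MARKS:\s*(\d+)", response)
    return int(match.group(1)) if match else 0


def evaluate(example, subq_system_message, mark_system_message):
    """Run the agent on one dataset example, passing both system messages
    through to call_agent, and score each sub-question."""
    predictions = call_agent(
        example["question"],
        example["markscheme"],
        example["answer"],
        subq_system_message,
        mark_system_message,
    )
    return {
        subq: parse_marks(predictions[subq]) == example["correct_marks"][subq]
        for subq in predictions
    }
```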
Great, this seems to have addressed two of the three failures (on this run at least).
Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and each LLM call has the template variables in the system message populated correctly.
It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅
Again, let’s explore what’s going wrong in the next iteration 🔁