In the last section we created test sets of varying sizes, ready to evaluate our agent. So, it’s finally time to start our data flywheel spinning! The general process for optimizing an LLM agent is quite straightforward:

1️⃣ Create simplest possible agent πŸ€–
2️⃣ While True:
3️⃣     Create/expand unit tests (evals) πŸ—‚οΈ
4️⃣     While run(tests) failing: πŸ§ͺ
5️⃣        Analyze failures, understand the root cause πŸ”
6️⃣        Vary system prompt, in-context examples, tools etc. to rectify πŸ”€
7️⃣    [Optional] Beta test with users, find more failures 🚦

Firstly, let’s activate the MarkingAssistant project.

unify.activate("MarkingAssistant")

Let’s also set a new context Evals, where we’ll store all of our evaluation runs.

unify.set_context("Evals")

Let’s now go through this data flywheel step-by-step!

πŸ€– Create Agent

Let’s start with a simple 0-shot LLM.

agent = unify.Unify("o3-mini@openai", traced=True, cache="read-only")

Let’s also download a .cache.json file which was previously generated whilst running this notebook, to avoid making any real LLM calls, and to also make our walkthrough deterministic.

If you’d rather go down your own unique iteration journey, then you should skip the cell below, and either remove cache="read-only" (turn off caching) or replace it with cache=True (create your own local cache) in the agent constructor above. However, this would mean many parts of the remaining walkthrough might not directly apply in your case, as the specific failure modes and the order in which they appear are likely to be different.
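
For reference, building your own local cache (rather than downloading ours) would look something like the following sketch, based on the cache options described above:

agent = unify.Unify("o3-mini@openai", traced=True, cache=True)  # builds a local .cache.json as calls are made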

if os.path.exists(".cache.json"):
    os.remove(".cache.json")
wget.download(
    "https://raw.githubusercontent.com/"
    "unifyai/demos/refs/heads/main/"
    "marking_assistant/.cache.json"
)

The agent needs to mark student answers to questions, out of a possible maximum number of marks. Let’s give it a sensible system message to begin with:

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The question is:

{question}


Their answer to this question is:

{answer}


As the very final part of your response, simply provide the number of marks on a *new line*, without any additional formatting. For example:

3
"""

Let’s wrap our system prompt in a simple function so the system message is updated based on the specific data involved:

@unify.traced
def call_agent(system_msg, question, answer, available_marks_total):
    local_agent = agent.copy()
    local_agent.set_system_message(
        system_msg.replace(
            "{question}", question
        ).replace(
            "{answer}", json.dumps(answer, indent=4)
        ).replace(
            "{available_marks_total}", str(available_marks_total)
        )
    )
    return local_agent.generate()

πŸ—‚οΈ Add Tests

Great, we now have our agent implemented. So, what are some good unit tests to begin with? Rather than using all 321 examples for our first iteration, let’s use the smallest subset of 10 examples, which we created in the previous section.

πŸ§ͺ Run Tests

Let’s add an evaluation function, and include all other arguments that we would like to log as part of the evaluation. All input arguments, intermediate variables, and return variables whose names do not start with an underscore ("_") (i.e. all β€œnon-private” arguments, variables and returns) will automatically be logged when the function is called, thanks to the unify.log decorator.

@unify.log
def evaluate(
    question,
    student_answer,
    available_marks_total,
    correct_marks_total,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        _system_message, question, student_answer,
        available_marks_total
    )
    _pred_marks_split = pred_marks.split("\n")
    pred_marks_total, diff_total, error_total = None, None, None
    for _substr in reversed(_pred_marks_split):
        _extracted = "".join([c for c in _substr if c.isdigit()])
        if _extracted != "":
            pred_marks_total = int(_extracted)
            diff_total = correct_marks_total - pred_marks_total
            error_total = abs(diff_total)
            break
    pred_marks = {"_": {"marks": pred_marks_total, "rationale": pred_marks}}
    return error_total
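
To make the marks-extraction step concrete, here is a minimal standalone sketch of the same trailing-number parsing, applied to a toy response string (not taken from the dataset):

example_response = "The method is correct but the final value is wrong.\n\n3"
for _substr in reversed(example_response.split("\n")):
    _extracted = "".join([c for c in _substr if c.isdigit()])
    if _extracted != "":
        print(int(_extracted))  # prints 3, the marks given on the final line
        break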

We can then run our evaluation, with the logging included, like so:

with unify.Experiment("simple_agent"), unify.Params(
    system_message=system_message,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(**d, _system_message=system_message)
             for d in dataset_10
        ],
        name="Evals"
    )

The unify.Experiment() term creates an "experiment" parameter in the context. The unify.Params() term sets the parameters which are held constant throughout the experiment.

We have some failures, with a mean error of 0.8 across the ten examples. We’ll need to dive in and understand what’s going wrong.

πŸ” Iteration 1

πŸ” Inconsistent Formatting

Let’s take a look at our results in the table.

GIF

Example 215 has the largest error. Let’s take a look at what’s going wrong.

❓ Question [8 Marks]:

In this row of boxes, you start with 5 and 7.

β”Œβ”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ 5 β”‚ 7 β”‚    β”‚    β”‚    β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

You add 5 and 7 to get 12 to go in the third box.  
You add 7 and 12 to get 19 to go in the fourth box.  
You add 12 and 19 to get 31 to go in the fifth box.

β”Œβ”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ 5 β”‚ 7 β”‚ 12 β”‚ 19 β”‚ 31 β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

Complete these rows of boxes using the rule shown above.
>
> (a)

β”Œβ”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ 4 β”‚ 6 β”‚    β”‚    β”‚  [1]
β””β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

> (b)

β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚    β”‚    β”‚ 34 β”‚ 55 β”‚  [2]
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

> (c) Complete this row of boxes, writing your expressions in their simplest form.

β”Œβ”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ a β”‚ b β”‚    β”‚    β”‚  [2]
β””β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

> (d) Use your answer to (c) to help you fill in the missing numbers in this row of boxes.

β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ 6 β”‚    β”‚    β”‚ 57 β”‚  [3]
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜


πŸ“ Student's Answer:

> (a) The next three boxes are 10, 16, and 26. So the row becomes 4, 6, 10, 16, 26.
>
> (b) The missing boxes are 8, 13, and 21, making the sequence 8, 13, 21, 34, 55.
>
> (c) a + b, a + 2b, a + 3b
>
> (d) By setting 2a + 3b = 57, we substitute a = 6 to get 12 + 3b = 57, so b = 15. Then, adding consecutively gives 6, 15, 21, 37, with the final box stated as 57 in the question.

Comparing the two, it seems as though there is a small discrepancy in the text-based box formatting of the question given to the agent: the question presents *four* boxes for each part, while the true number is *five*. We can verify this by looking at the original page, which can be found [here](https://raw.githubusercontent.com/unifyai/demos/refs/heads/main/ai_tutor/pdfs/GCSE_(9%E2%80%931)_Mathematics/J560_02_Paper_2_(Foundation_Tier)_Sample_Question_Paper/paper/imgs/page16.png).

β˜‘οΈ Markscheme [Not Given to Agent]:

> (a) Answer: 10, 16, 26. Marks: 1
>
> (b) Answer: 8, 13, 21. Marks: 2. Guidance: M1 for one correct subtraction of two boxes
>
> (c) Answer: a + b, a + 2b, 2a + 3b. Marks: 2. Guidance: M1 for two expressions correct
>
> (d) Answer: 15, 21, 36. Marks: 3. Guidance: M1 for their β€˜2a + 3b’ = 57, M1 for substituting a = 6 into their final expression and solving for b

βœ… Correct Marks [6/8] Rationale [Not Given to Agent]:

> (a) 1/1 Mark. By adding consecutive pairs (4+6=10, 6+10=16, 10+16=26), we match the markscheme’s answer of 10,16,26. This secures the single mark (1) allocated for part (a).
>
> (b) 2/2 Marks. Here, we deduce the missing numbers by working backwards from 34 and 55 or by applying the same rule forward. According to the markscheme (2 marks), we get full credit for the fully correct set of missing numbers, with one method mark (M1) for a correct subtraction and a final accuracy mark (A1) for the correct sequence.
>
> (c) 1/2 Marks. We used the rule to form expressions but made an error in the final expression: the correct set per the markscheme should be (a+b), (a+2b), (2a+3b). We have two correct expressions, so we earn the method mark (M1), but lose the accuracy mark for an incorrect final expression.
>
> (d) 2/3 Marks. Using part (c)’s structure, we form 2a+3b, set it to 57, and solve correctly for b = 15 (which earns the two method marks, M1 and M1). However, we introduced an error when adding 15 + 21 = 37 (should be 36), so we do not achieve the final accuracy mark (A1) for a fully correct final row.

πŸ€– Predicted Marks [1/8] Rationale:

> Let's check each part:
>
> (a) The correct row with starting numbers 4 and 6 should give 4, 6, 4+6 = 10, and then 6+10 = 16. The answer given β€œ4, 6, 10, 16, 26” is one extra term. (0/1 mark)
>
> (b) For the row with boxes β€œ__, __, 34, 55” the correct answer would be 13, 21, 34, 55 (since 13 + 21 = 34 and 21 + 34 = 55). The answer given β€œ8, 13, 21, 34, 55” does not match the required four‐box row. (0/2 marks)
>
> (c) For the row β€œa, b, __, __” the missing boxes should be expressed as a + b and a + 2b in simplest form. The answer β€œa + b, a + 2b, a + 3b” gives an extra term. (1/2 marks)
>
> (d) In the row β€œ6, __, __, 57”, if we interpret the pattern as in (c) with a = 6 so that the boxes are 6, b, 6+b, 6+2b and 6 + 2b = 57, then b = 25.5. The given answer instead sets up a different equation and finds b = 15 leading to an inconsistent sequence. (0/3 marks)
>
> Total marks awarded: 0 + 0 + 1 + 0 = 1
>
> 1

Looking at this failure, it's obvious that the model would benefit from having the **markscheme** included in the context, so that it knows what the correct answers are, and how to award the marks.

Without the markscheme, the agent is unable to notice this discrepancy, and therefore presumes that a total of *four* numbers are required for each answer, instead of the necessary *five*.

πŸ”€ Add Markscheme

Let's update the system message to include a placeholder for the markscheme.

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The question is:

{question}


The markscheme for the question is:

{markscheme}


Their answer to this question is:

{answer}


As the very final part of your response, simply provide the number of marks on a *new line*, without any additional formatting. For example:

3
"""

We then make sure to update the system message with the true markscheme during call_agent:

@unify.traced
def call_agent(system_msg, question, markscheme, answer, available_marks_total):
    local_agent = agent.copy()
    local_agent.set_system_message(
        system_msg.replace(
            "{question}", question
        ).replace(
            "{markscheme}", json.dumps(markscheme, indent=4)
        ).replace(
            "{answer}", json.dumps(answer, indent=4)
        ).replace(
            "{available_marks_total}", str(available_marks_total)
        )
    )
    return local_agent.generate()

We also need to update evaluate accordingly:

@unify.log
def evaluate(
    question,
    student_answer,
    available_marks_total,
    markscheme,
    correct_marks_total,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        _system_message, question, markscheme, student_answer,
        available_marks_total
    )
    _pred_marks_split = pred_marks.split("\n")
    pred_marks_total, diff_total, error_total = None, None, None
    for _substr in reversed(_pred_marks_split):
        _extracted = "".join([c for c in _substr if c.isdigit()])
        if _extracted != "":
            pred_marks_total = int(_extracted)
            diff_total = correct_marks_total - pred_marks_total
            error_total = abs(diff_total)
            break
    pred_marks = {"_": {"marks": pred_marks_total, "rationale": pred_marks}}
    return error_total

πŸ§ͺ Rerun Tests

with unify.Experiment("add_markscheme"), unify.Params(
    system_message=system_message,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(**d.entries, _system_message=system_message)
             for d in test_set_10
        ],
        name="Evals"
    )

Great! Our mean error has gone down from 0.8 to 0.3; we’re definitely making progress πŸ’ͺ

We still need to try to address the remaining errors though. Let’s explore what’s going wrong in another iteration.

πŸ” Iteration 2

πŸ” Misunderstanding Mark Types

Seven out of ten examples now have an error of 0, with the remaining three each having an error of 1. Let’s take a closer look at these three examples to understand why they’re failing.

GIF

Let’s dive in and explore each of these failures in more detail, to see how we can rectify them.

Example 207

❓ Question [2 Marks]:

Write these in order, smallest first:

0.34, 1/3, 3.5%

…………………… …………………… ……………………

πŸ“ Student’s Answer:

1/3, 0.34, 3.5%

β˜‘οΈ Markscheme:

3.5%, 1/3, 0.34

Part marks and guidance:

  • B1 for 1/3 = 0.33… or 33…%
  • or B1 for 0.34 = 34%
  • or B1 for changing 3.5% to 0.035
  • or SC1 for 1/3, 0.34, 3.5%

Accept correct order with equivalent values.

βœ… Correct Marks [1/2] Rationale:

The candidate’s order is 1/3, 0.34, 3.5%. According to the markscheme, this earns SC1 (special case, partially correct ordering) for 1 mark.

πŸ€– Predicted Marks [0/2] Rationale:

The student’s ordering is incorrect because smallest first should be 3.5%, 1/3, 0.34, but the student provided 1/3, 0.34, 3.5%.

0

Example 261

❓ Question [5 Marks]:

Peter makes a large amount of pink paint by mixing red and white paint in the ratio 2 : 3.

β€’ Red paint costs Β£80 per 10 litres.
β€’ White paint costs Β£5 per 10 litres.
β€’ Peter sells his pink paint in 10-litre tins for Β£60 per tin.

Calculate how much profit he makes for each tin he sells.
You must show your working.

Β£ …………………………………………

πŸ“ Student’s Answer:

Peter needs 4 litres of red paint and 6 litres of white paint for every 10-litre batch (because 2 out of 5 parts is red, 3 out of 5 parts is white).

β˜‘οΈ Markscheme:

25 with correct working (5 marks)

β€œCorrect working” requires evidence of at least M1M1

  • M1 for 10 Γ— (2/5) = 4 litres red or for 10 Γ— (3/5) = 6 litres white
  • M1 for red costs Β£8 per litre or for white costs Β£0.50 per litre
  • M1 for cost of one 10-litre can is their 4 Γ— their 8 + their 6 Γ— their 0.5
  • M1 for 60 – their 35

Alternative method:

  • M1 for 2 : 3 = 20 litres red : 30 litres white
  • M1 for 2 Γ— Β£80 + 3 Γ— Β£5 = Β£175
  • M1 for (their 175) Γ· 5 = 35
  • M1 for 60 – their 35
  • M1 for cost of one 10-litre can is their 4 Γ— their 8 + their 6 Γ— their 0.5

If 0 or 1 scored, instead award SC2 for answer 25 with no working or insufficient working

βœ… Correct Marks [1/5] Rationale:

Awarding 1 mark (M1) for correctly finding that a 10-litre tin contains 4 litres of red and 6 litres of white from the 2:3 ratio. No further steps towards the final profit are shown, so only this 1 mark is earned.

πŸ€– Predicted Marks [0/5] Rationale:

The student’s response only states that Peter needs 4 litres of red paint and 6 litres of white paint per 10-litre batch. It does not show any further working or provide the calculation of the cost of the paints or the profit per tin. Therefore, the answer does not meet the full requirements for the calculation and is awarded 0 marks.

0

Example 132 (c)

❓ Question [2 Marks]:

A newspaper headline reads: β€˜High temperatures make more people buy ice cream!’ Does the graph above prove this claim? Give a reason for your decision.

πŸ“ Student’s Answer:

No. The graph alone does not completely prove the claim, because we cannot be certain this trend always holds true in all situations.

β˜‘οΈ Markscheme:

Answer: No, because there may be other factors involved (2 marks) Guidance: B1 for β€˜No’, with partial reason.

βœ… Correct Marks [1/2] Rationale:

The mark scheme gives 2 marks if the student states β€œNo, there may be other factors involved.” Here, only a partial reason is given, so 1 mark is awarded for denying that the graph proves the claim but not fully explaining other possible factors.

πŸ€– Predicted Marks [2/2] Rationale:

No Rationale Given

Thoughts:

Overall, it’s clear that the agent is unable to properly make sense of the different mark types, such as B1, SC1, M1, A1 etc.

This is not surprising, as we’ve never explained what these terms mean in the system prompt!

πŸ”€ Add Marking Guidelines

Let’s add the general marking guidelines to the system prompt, so the agent knows what all of these mark terms mean, and also fully understands how to interpret the markscheme for each question.

The marking guidelines can be extracted from the beginning of any of the markscheme pdf files, such as this one.

Let’s store this in a separate variable, which will make it easier for us to parameterize the inclusion of the guidelines in future experiment iterations.

general_guidelines = """----

1.
M marks are for using a correct method and are not lost for purely numerical errors.
A marks are for an accurate answer and depend on preceding M (method) marks. Therefore M0 A1 cannot be awarded.
B marks are independent of M (method) marks and are for a correct final answer, a partially correct answer, or a correct intermediate stage.
SC marks are for special cases that are worthy of some credit.

2.
Unless the answer and marks columns of the mark scheme specify M and A marks etc, or the mark scheme is β€˜banded’, then if the correct answer is clearly given and is not from wrong working full marks should be awarded.

Do not award the marks if the answer was obtained from an incorrect method, i.e. incorrect working is seen and the correct answer clearly follows from it.

3.
Where follow through (FT) is indicated in the mark scheme, marks can be awarded where the candidate’s work follows correctly from a previous answer whether or not it was correct.

Figures or expressions that are being followed through are sometimes encompassed by single quotation marks after the word their for clarity, e.g. FT 180 Γ— (their β€˜37’ + 16), or FT 300 – √(their β€˜52 + 72’). Answers to part questions which are being followed through are indicated by e.g. FT 3 Γ— their (a).

For questions with FT available you must ensure that you refer back to the relevant previous answer. You may find it easier to mark these questions candidate by candidate rather than question by question.

4.
Where dependent (dep) marks are indicated in the mark scheme, you must check that the candidate has met all the criteria specified for the mark to be awarded.

5.
The following abbreviations are commonly found in GCSE Mathematics mark schemes.
- **figs 237**, for example, means any answer with only these digits. You should ignore leading or trailing zeros and any decimal point e.g. 237000, 2.37, 2.370, 0.00237 would be acceptable but 23070 or 2374 would not.
- **isw** means **ignore subsequent working** after correct answer obtained and applies as a default.
- **nfww** means not from wrong working.
- **oe** means **or equivalent**.
- **rot** means **rounded or truncated**.
- **seen** means that you should award the mark if that number/expression is seen anywhere in the answer space, including the answer line, even if it is not in the method leading to the final answer
- **soi** means seen or implied.

6.
In questions with no final answer line, make no deductions for wrong work after an acceptable answer (ie **isw**) unless the mark scheme says otherwise, indicated by the instruction β€˜mark final answer’.

7.
In questions with a final answer line following working space:

(i) If the correct answer is seen in the body of working and the answer given on the answer line is a clear transcription error allow full marks unless the mark scheme says β€˜mark final answer’. Place the annotation βœ“ next to the correct answer.

(ii) If the correct answer is seen in the body of working but the answer line is blank, allow full marks. Place the annotation βœ“ next to the correct answer.

(iii) If the correct answer is seen in the body of working but a completely different answer is seen on the answer line, then accuracy marks for the answer are lost. Method marks could still be awarded. Use the M0, M1, M2 annotations as appropriate and place the annotation ✗ next to the wrong answer.

8.
In questions with a final answer line:

(i) If one answer is provided on the answer line, mark the method that leads to that answer.

(ii) If more than one answer is provided on the answer line and there is a single method provided, award method marks only.

(iii) If more than one answer is provided on the answer line and there is more than one method provided, award zero marks for the question unless the candidate has clearly indicated which method is to be marked.

9.
In questions with no final answer line:

(i) If a single response is provided, mark as usual.

(ii) If more than one response is provided, award zero marks for the question unless the candidate has clearly indicated which response is to be marked.

10.
When the data of a question is consistently misread in such a way as not to alter the nature or difficulty of the question, please follow the candidate’s work and allow follow through for **A** and **B** marks. Deduct 1 mark from any **A** or **B** marks earned and record this by using the MR annotation. **M** marks are not deducted for misreads.

11.
Unless the question asks for an answer to a specific degree of accuracy, always mark at the greatest number of significant figures even if this is rounded or truncated on the answer line. For example, an answer in the mark scheme is 15.75, which is seen in the working. The candidate then rounds or truncates this to 15.8, 15 or 16 on the answer line. Allow full marks for the 15.75.

12.
Ranges of answers given in the mark scheme are always inclusive.

13.
For methods not provided for in the mark scheme give as far as possible equivalent marks for equivalent work.

14.
Anything in the mark scheme which is in square brackets […] is not required for the mark to be earned, but if present it must be correct.

----"""

Let’s also update the system message to include a placeholder for these general guidelines.

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}

The question you need to mark is:

{question}


The markscheme for this specific question is:

{markscheme}


The student's answer to this question (which you need to mark) is:

{answer}


As the very final part of your response, simply provide the number of marks on a *new line*, without any additional formatting. For example:

3
"""

Let’s then update the system message to include the guidelines by default.

system_message = system_message.replace(
    "{general_guidelines}", general_guidelines
)

πŸ§ͺ Rerun Tests

with unify.Experiment("add_marking_guidelines"), unify.Params(
    system_message=system_message,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(**d.entries, _system_message=system_message)
             for d in test_set_10
        ],
        name="Evals"
    )

The mean error has gone down from 0.3 to 0.2. Let’s explore the remaining errors in another iteration.

πŸ” Iteration 3

πŸ” Difficulty Debugging

The change seems to have fixed Example 261, but the other two [Examples 207 and 132 (c)] are still failing.

It’s also becoming quite difficult to track the exact discrepancy between the correct marks and those predicted by the agent, as the agent’s response is a single block of text, unlike the ground truth data, which is formatted as a dictionary with each sub-question independently marked.

Adding structured output could both help the agent reason about each part of the question independently and make the response easier to parse, enabling us to present a diff at the sub-question level rather than just for the entire question. Let’s give it a try!

πŸ”€ Add Structured Output

Let’s first define the output we want for each sub-question:

class MarksAndReasoning(BaseModel):
    reasoning: str
    marks: int

Let’s now write a simple function to build the desired pydantic output dynamically, based on the sub-questions present.

@unify.traced
def create_response_format(response_keys):
    if response_keys:
        response_fields = dict(
            zip(response_keys, [(MarksAndReasoning, ...)] * len(response_keys))
        )
        return create_model('Response', **response_fields)
    else:
        return MarksAndReasoning
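
As a quick sanity check (assuming pydantic v2), the dynamically created model has one MarksAndReasoning field per sub-question key, and falls back to the flat model otherwise:

Response = create_response_format(["a", "b", "c"])
print(list(Response.model_fields))                      # ['a', 'b', 'c']
print(list(create_response_format(None).model_fields))  # ['reasoning', 'marks']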

Let’s update our system prompt, so the agent knows how to populate the structured response correctly. We only use the nested output structure when sub-questions are present, so we’ll want to populate the instructions dynamically, depending on the presence or absence of sub-questions. Let’s create two alternatives. Firstly, when sub-questions are present:

β€œFor each sub-question {subquestions}, you should populate the reasoning field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award for this sub-question in the marks field.”

When they are not present:

β€œYou should populate the reasoning field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award in the marks field.”

Firstly, the general template:

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}

The question you need to mark is:

{question}


The markscheme for this specific question is:

{markscheme}


The student's answer to this question (which you need to mark) is:

{answer}


{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines
)

Then the two excerpts:

output_response_explanations = dict()
output_response_explanations["with_subqs"] = "For each sub-question {subquestions}, you should populate the `reasoning` field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award for this sub-question in the `marks` field."
output_response_explanations["without_subqs"] = "You should populate the `reasoning` field with your initial reasoning about the correct number of marks to award. Finally, you should put the number of marks to award in the `marks` field."

Let’s update our call_agent method to set the output format dynamically:

@unify.traced
def call_agent(system_msg, question, markscheme, answer, available_marks_total):
    local_agent = agent.copy()
    with_subqs = len(markscheme) > 1
    response_format = create_response_format(
        list(markscheme.keys()) if with_subqs else None
    )
    local_agent.set_response_format(response_format)
    if with_subqs:
        output_response_exp = output_response_explanations["with_subqs"]
        output_response_exp = output_response_exp.replace(
            "{subquestions}", json.dumps(list(markscheme.keys()))
        )
    else:
        output_response_exp = output_response_explanations["without_subqs"]
    local_agent.set_system_message(
        system_msg.replace(
            "{question}", question
        ).replace(
            "{markscheme}", json.dumps(markscheme, indent=4)
        ).replace(
            "{answer}", json.dumps(answer, indent=4)
        ).replace(
            "{available_marks_total}", str(available_marks_total)
        ).replace(
            "{output_response_explanation}",
            output_response_exp
        )
    )
    ret = local_agent.generate()
    if "```" in ret:
        ret = ret.split("```")[-2].lstrip("json")
    ret = response_format.model_validate_json(ret).model_dump()
    if not with_subqs:
        return {"_": ret}
    return ret

Let’s also update our evaluate method to parse the returned JSON correctly, include a sub-question-level diff, and update the per-question breakdown to also include the sub-question-level predictions:

@unify.log
def evaluate(
    question,
    student_answer,
    available_marks_total,
    markscheme,
    correct_marks,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        _system_message, question, markscheme, student_answer,
        available_marks_total
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"] for (k, vcor), (_, vpred) in
        zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d
        } for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values()
        )
    }
    return error
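
As a quick illustration of the sub-question level diff logic above, applied to two toy dictionaries (not real data):

correct_example = {"a": {"marks": 1}, "b": {"marks": 2}}
predicted_example = {"a": {"marks": 1}, "b": {"marks": 1}}
diff_example = {
    k: vcor["marks"] - vpred["marks"] for (k, vcor), (_, vpred) in
    zip(correct_example.items(), predicted_example.items())
}
print(diff_example)                                # {'a': 0, 'b': 1}
print(sum(abs(v) for v in diff_example.values()))  # total error of 1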

πŸ§ͺ Rerun Tests

with unify.Experiment("add_structured_output"), unify.Params(
    system_message=system_message,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(**d.entries, _system_message=system_message)
             for d in test_set_10
        ],
        name="Evals"
    )

The mean error has actually gone back up from 0.2 to 0.4. Let’s explore what’s going wrong in another iteration.

πŸ” Iteration 4

πŸ” Context Alignment

In terms of failures, let’s consider Example 132. For this particular question, there are 6 sub-questions (a.i, a.ii, b.i, b.ii, b.iii, c), and we’re asking the LLM to do a lot in a single shot:

  1. understand all 14 points in the general marking guidelines
  2. understand all 6 of the sub-questions
  3. understand all 6 of the student’s answers to the sub-questions
  4. understand all 6 of the markscheme’s reasoning for said sub-questions

More importantly, the system prompt doesn’t group the relevant information together. The agent receives the information like so:

{
  question: [a.i, a.ii, b.i, b.ii, b.iii, c],
  markscheme: [a.i, a.ii, b.i, b.ii, b.iii, c],
  answer: [a.i, a.ii, b.i, b.ii, b.iii, c]
}

Let’s update the system prompt, so the information is aligned better, more like the following:

{
  a.i: [sub-qstn, sub-mrkshm, sub-ans],
  a.ii: [sub-qstn, sub-mrkshm, sub-ans],
  b.i: [sub-qstn, sub-mrkshm, sub-ans],
  b.ii: [sub-qstn, sub-mrkshm, sub-ans],
  b.iii: [sub-qstn, sub-mrkshm, sub-ans],
  c: [sub-qstn, sub-mrkshm, sub-ans],
}
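
This re-grouping is just a re-keying of the three dictionaries by sub-question, which the updated call_agent below does inline. A minimal sketch with hypothetical placeholder values:

example_sub_questions = {"a.i": "...", "a.ii": "..."}
example_markscheme = {"a.i": "...", "a.ii": "..."}
example_answer = {"a.i": "...", "a.ii": "..."}

questions_markscheme_and_answers = {
    k: {
        "sub-question": example_sub_questions[k],
        "markscheme": example_markscheme[k],
        "answer": example_answer[k]
    } for k in example_sub_questions.keys()
}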

πŸ”€ Better Align Context

First, let’s abstract this into a "{questions_markscheme_and_answers}" placeholder:

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}

The question you need to mark is:

{question}

The sub-question breakdown, including each sub-question, its associated markscheme and its associated answer, is as follows:

{questions_markscheme_and_answers}


{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines
)

Let’s then update call_agent:

@unify.traced
def call_agent(system_msg, question, sub_questions, markscheme, answer, available_marks_total):
    local_agent = agent.copy()
    with_subqs = len(markscheme) > 1
    response_format = create_response_format(
        list(markscheme.keys()) if with_subqs else None
    )
    local_agent.set_response_format(response_format)
    if with_subqs:
        output_response_exp = output_response_explanations["with_subqs"]
        output_response_exp = output_response_exp.replace(
            "{subquestions}", json.dumps(list(markscheme.keys()))
        )
    else:
        output_response_exp = output_response_explanations["without_subqs"]
    local_agent.set_system_message(
        system_msg.replace(
            "{question}", question
        ).replace(
            "{markscheme}", json.dumps(markscheme, indent=4)
        ).replace(
            "{answer}", json.dumps(answer, indent=4)
        ).replace(
            "{available_marks_total}", str(available_marks_total)
        ).replace(
            "{questions_markscheme_and_answers}", json.dumps(
              {
                  k: {
                      "sub-question": sub_questions[k],
                      "markscheme": markscheme[k],
                      "answer": answer[k]
                  } for k in sub_questions.keys()
              },
              indent=4
            )
        ).replace(
            "{output_response_explanation}",
            output_response_exp
        )
    )
    ret = local_agent.generate()
    if "```" in ret:
        ret = ret.split("```")[-2].lstrip("json")
    ret = response_format.model_validate_json(ret).model_dump()
    if not with_subqs:
        return {"_": ret}
    return ret

Let’s also update our evaluate function, so that we pass the sub_questions into the call_agent function:

@unify.log
def evaluate(
    question,
    sub_questions,
    student_answer,
    available_marks_total,
    markscheme,
    correct_marks,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        _system_message, question, sub_questions, markscheme, student_answer,
        available_marks_total
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"] for (k, vcor), (_, vpred) in
        zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d
        } for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values()
        )
    }
    return error

πŸ§ͺ Rerun Tests

with unify.Experiment("align_context"), unify.Params(
    system_message=system_message,
    model=agent.model,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(**d.entries, _system_message=system_message)
             for d in test_set_10
        ],
        name="Evals"
    )

The mean error has now dropped to 0.3. Again, let’s explore what’s going wrong in another iteration.

πŸ” Iteration 5

πŸ” Ignoring Mark Types

Looking a bit deeper, it seems we’re getting the same failures as we were before. These errors seem quite persistent; we might need to make more radical changes to address them.

Let’s revisit each of these failures again, and explore them in more detail, to see how we can rectify them.

Given the consistent prevalence of these errors, it’s useful to consider these failing examples across all experiments, and see the different kinds of rationale our agent is giving across each run.

Our evaluation runs serve both as tests for comparing parameter configurations (when the agent output depends strongly on the changed parameters πŸ”€) and as a kind of sampling from the noise distribution for each test set example (when the agent output depends less strongly on the changed parameters across each experiment πŸ“Š).

This is one benefit of the flexible spreadsheet design. Experiments are not indivisible atomic groups; they’re simply labels, and the raw evaluation data can be juggled in whatever way makes sense to you as the user.

GIF

Let’s first recap the fixed question ❓, student answer πŸ“, markscheme β˜‘οΈ and correct marks βœ…, before unpacking the various agent justifications πŸ€– across all experiments.

Example 207

❓ Question [2 Marks]:

Write these in order, smallest first:

0.34, 1/3, 3.5%

…………………… …………………… ……………………

πŸ“ Student’s Answer:

1/3, 0.34, 3.5%

β˜‘οΈ Markscheme:

3.5%, 1/3, 0.34

Part marks and guidance:

  • B1 for 1/3 = 0.33… or 33…%
  • or B1 for 0.34 = 34%
  • or B1 for changing 3.5% to 0.035
  • or SC1 for 1/3, 0.34, 3.5%

Accept correct order with equivalent values.

βœ… Correct Marks [1/2] Rationale:

The candidate’s order is 1/3, 0.34, 3.5%. According to the markscheme, this earns SC1 (special case, partially correct ordering) for 1 mark.

πŸ€– Predicted Marks [x/2] Rationales (with added insights πŸ’‘):

simple_agent [0/2] ❌: The correct order should be 3.5%, 1/3, 0.34. The student’s answer is incorrect. [No reference to SC1 πŸ’‘]

add_markscheme [0/2] ❌: The student’s ordering is incorrect because smallest first should be 3.5%, 1/3, 0.34, but the student provided 1/3, 0.34, 3.5%. [No reference to SC1 πŸ’‘]

add_marking_guidelines [0/2] ❌: The required order, from smallest to largest, is 3.5%, 1/3, 0.34 (since 3.5% = 0.035, 1/3 ≃ 0.333, and 0.34 = 0.34). The student’s answer lists the numbers as 1/3, 0.34, 3.5%, which is not in the correct order. [No reference to SC1 πŸ’‘]

add_structured_output [0/2] ❌: The correct order should list the values from smallest to largest: 3.5% (which is equivalent to 0.035), then 1/3 (approximately 0.333…), and finally 0.34. The student’s answer, β€˜1/3, 0.34, 3.5%’, does not follow the required order. Therefore, no marks can be awarded for this answer. [No reference to SC1 πŸ’‘]

align_context [0/2] ❌: The correct order, based on numerical values (3.5% = 0.035, 1/3 β‰ˆ 0.333, 0.34 = 0.34), should be 3.5%, 1/3, 0.34. The student’s answer (1/3, 0.34, 3.5%) does not follow the β€˜smallest first’ requirement. Therefore, no marks should be awarded. [No reference to SC1 πŸ’‘]

Clearly, the agent is really failing to pick up on the SC1 mark from the markscheme, irrespective of the various parameter changes we’ve made across each experiment run.

Example 261

❓ Question [5 Marks]:

Peter makes a large amount of pink paint by mixing red and white paint in the ratio 2 : 3.

β€’ Red paint costs Β£80 per 10 litres.
β€’ White paint costs Β£5 per 10 litres.
β€’ Peter sells his pink paint in 10-litre tins for Β£60 per tin.

Calculate how much profit he makes for each tin he sells.
You must show your working.

Β£ …………………………………………

πŸ“ Student’s Answer:

Peter needs 4 litres of red paint and 6 litres of white paint for every 10-litre batch (because 2 out of 5 parts is red, 3 out of 5 parts is white).

β˜‘οΈ Markscheme:

25 with correct working (5 marks)

β€œCorrect working” requires evidence of at least M1M1

  • M1 for 10 Γ— (2/5) = 4 litres red or for 10 Γ— (3/5) = 6 litres white
  • M1 for red costs Β£8 per litre or for white costs Β£0.50 per litre
  • M1 for cost of one 10-litre can is their 4 Γ— their 8 + their 6 Γ— their 0.5
  • M1 for 60 – their 35

Alternative method:

  • M1 for 2 : 3 = 20 litres red : 30 litres white
  • M1 for 2 Γ— Β£80 + 3 Γ— Β£5 = Β£175
  • M1 for (their 175) Γ· 5 = 35
  • M1 for 60 – their 35
  • M1 for cost of one 10-litre can is their 4 Γ— their 8 + their 6 Γ— their 0.5

If 0 or 1 scored, instead award SC2 for answer 25 with no working or insufficient working

βœ… Correct Marks [1/5] Rationale:

Awarding 1 mark (M1) for correctly finding that a 10-litre tin contains 4 litres of red and 6 litres of white from the 2:3 ratio. No further steps towards the final profit are shown, so only this 1 mark is earned.

πŸ€– Predicted Marks [x/5] Rationale (with added insights πŸ’‘):

simple_agent [1/5] βœ…: The full solution should involve several steps:

  • Determining that in a 10‐litre mix, Peter uses 4 litres of red (2/5 of 10) and 6 litres of white (3/5 of 10).
  • Calculating the cost per litre for each colour:
  • Red: Β£80/10 = Β£8 per litre, so 4 litres costs 4 Γ— Β£8 = Β£32.
  • White: Β£5/10 = Β£0.50 per litre, so 6 litres costs 6 Γ— Β£0.50 = Β£3.
  • Adding the costs gives a total cost of Β£32 + Β£3 = Β£35.
  • Subtracting this from the selling price, the profit per tin is Β£60 βˆ’ Β£35 = Β£25.

The student’s answer only identifies the paint volumes (4 litres red and 6 litres white) without completing the cost and profit calculations. [No reference to M1 πŸ’‘]

add_markscheme [0/5] ❌: The student’s response only states that Peter needs 4 litres of red paint and 6 litres of white paint per 10-litre batch. It does not show any further working or provide the calculation of the cost of the paints or the profit per tin. Therefore, the answer does not meet the full requirements for the calculation and is awarded 0 marks. [No reference to M1 πŸ’‘]

add_marking_guidelines [1/5] βœ…: The student only shows that a 10‐litre tin requires 4 litres red paint and 6 litres white paint (i.e. the ratio part), but they do not continue to use these figures to work out the cost or the profit. This is an incomplete answer which does not answer the question. [No reference to M1 πŸ’‘]

add_structured_output [0/5] ❌: The student’s work shows part of the method by correctly determining that a 10‐litre batch contains 4 litres of red paint and 6 litres of white paint. However, the answer does not extend to calculate the cost of the red and white paint, nor does it compute the final profit per tin by subtracting the total cost from the selling price. According to the mark scheme, a correct final answer of Β£25 with full working is needed for 5 marks. Since the student has given only part of the working (the mix ratio) and omitted the critical profit calculation, the answer deserves only credit for knowing the ratio. In such cases, with insufficient working, the mark scheme permits awarding SC2 (2 marks) for an answer that states the correct profit value by implication. As the final profit figure is missing, awarding 2 marks is appropriate. [No reference to M1 πŸ’‘]

align_context [0/5] ❌: The answer only shows that Peter uses 4 litres of red paint and 6 litres of white paint in a 10‑litre batch. However, the question asks for the profit per tin, which requires further calculation including the cost of the paints and then subtracting from the selling price (Β£60). No profit calculation or final numerical answer for the profit is given. As a result, none of the required steps for computing profit were provided. Therefore, no marks can be awarded. [No reference to M1 πŸ’‘]

Again, regardless of our parameter variations, it seems as though the agent is not correctly understanding the M1 mark in the markscheme, which clearly states M1 for 10 Γ— (2/5) = 4 litres red or for 10 Γ— (3/5) = 6 litres white. Even on the two occasions where it got things right, it feels like a lucky guess, as this was not justified via the markscheme’s M1 mark.

Example 132 (c) is failing for a slightly different reason, and so we’ll consider this separately.

In general, as we get deeper into the evaluation iterations, it’s often wise to consider multiple failure modes at once. Larger evals can be expensive to run, and you generally want to use all of the newly gained knowledge to try and improve your agent in the next evaluation run, even if this means making several unrelated changes to address several unrelated failure modes.

πŸ” Lenient Reasoning

Let’s perform the same deeper analysis for Example 132 (c), and see what’s going wrong in this case.

Example 132 (c)

❓ Question [2 Marks]:

A newspaper headline reads: β€˜High temperatures make more people buy ice cream!’ Does the graph above prove this claim? Give a reason for your decision.

πŸ“ Student’s Answer:

No. The graph alone does not completely prove the claim, because we cannot be certain this trend always holds true in all situations.

β˜‘οΈ Markscheme:

Answer: No, because there may be other factors involved (2 marks) Guidance: B1 for β€˜No’, with partial reason.

βœ… Correct Marks [1/2] Rationale:

The mark scheme gives 2 marks if the student states β€œNo, there may be other factors involved.” Here, only a partial reason is given, so 1 mark is awarded for denying that the graph proves the claim but not fully explaining other possible factors.

πŸ€– Predicted Marks [x/2] Rationale (with added insights πŸ’‘):

simple_agent [2/2] ❌: The response correctly explains that the graph, by itself, does not prove the claim as it only shows a correlation under the conditions measured. [No mention of other factors, which is needed πŸ’‘]

add_markscheme [2/2] ❌: [No rationale given at all! πŸ’‘]

add_marking_guidelines [2/2] ❌: [No rationale given at all! πŸ’‘]

add_structured_output [2/2] ❌: The explanation correctly states that the graph alone does not prove the claim due to other potential factors affecting ice cream sales. This answer meets the criteria for 2 marks. [Not true, the student doesn’t mention β€œother potential factors affecting ice cream sales” πŸ’‘]

align_context [2/2] ❌: The answer indicates that the graph alone does not fully prove the claim since other factors might be involved, which adequately meets the criteria for the mark scheme. [Not true, the student doesn’t mention β€œother factors might be involved” πŸ’‘]

In this case, it’s not so much an error in understanding the mark types; it’s more an issue of falsely hallucinating things the student said, or perhaps interpreting the answer very liberally. Perhaps the agent is confusing the markscheme with the student answer. The student’s answer β€œwe cannot be certain this trend always holds true” is very different to β€œthere may be other factors involved”. The former (incorrect) assumes the trend does exist but just might not continue, while the latter (correct) is a point about correlation != causation, indicating the apparent causation may not exist at all.

πŸ”€ Better Align Guidelines

Firstly, the recurring problem for Examples 207 and 261 seems to be that the agent doesn’t remember and/or understand the different types of marks (B1, SC1, M1 etc.). Let’s be more explicit: we’ll parse each sub-question markscheme for the different mark types, add the explanations directly into the sub-question-specific markschemes, and see if this improves performance.

Let’s first create a dictionary with the mark type explanations, written in a more direct manner to accompany the subquestion-specific markschemes and to make it easier to parse:

mark_types = {
    "M": "M{num} ({num_marks}) should be awarded if a correct method is used, and should not be lost for purely numerical errors.",
    "A": "A{num} ({num_marks}) should be awarded for an accurate answer, and this depends on preceding M (method) marks. If preceding M (method marks are not awarded, then A{num} cannot be awarded).",
    "B": "B{num} ({num_marks}) should be awarded for the correct final answer, a partially correct answer, or a correct intermediate stage (depending on how this is expressed and explained below). B{num} is independent of M (method) marks.",
    "SC": "SC{num} ({num_marks}) should be awarded for the special cases explained below, which are worthy of some credit."
}

Let’s then write a simple function to update each sub-question specific markscheme, prepending the markscheme with the relevant definitions from our mark_types dict, so that the agent has all the relevant information close at hand:

@unify.traced
def update_markscheme(markscheme: str):
    m_marks = sorted(list(set(re.findall(r'M\d+', markscheme))))
    a_marks = sorted(list(set(re.findall(r'A\d+', markscheme))))
    b_marks = sorted(list(set(re.findall(r'B\d+', markscheme))))
    sc_marks = sorted(list(set(re.findall(r'SC\d+', markscheme))))
    if not any(m_marks + a_marks + b_marks + sc_marks):
        return markscheme
    markscheme = """
Firstly, the mark types have the following meanings:

{mark_types}

With these definitions in mind, marks should be awarded as follows:
    """ + markscheme
    for marks in (m_marks, a_marks, b_marks, sc_marks):
        for mark in marks:
            key = "".join(c for c in mark if not c.isdigit())
            num_marks = int("".join(c for c in mark if c.isdigit()))
            explanation = mark_types[key]
            explanation = explanation.replace(
                "{num}", str(num_marks)
            ).replace(
                "{num_marks}", "1 mark" if num_marks == 1 else f"{num_marks} marks"
            )
            markscheme = markscheme.replace(
                "{mark_types}",
                key + ":/n" + explanation + "\n\n{mark_types}"
            )
    markscheme = markscheme.replace(
        "{mark_types}",
        ""
    )
    return markscheme
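
For instance, calling it on a toy markscheme snippet (not from the dataset) prepends the relevant mark type definitions before the original text:

print(update_markscheme("M1 for 10 Γ— (2/5) = 4 litres red, A1 for 25"))
# the printed markscheme now contains the M1 and A1 definitions,
# followed by the original markscheme text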

Let’s now update our call_agent method such that the markscheme changes are dynamically applied before passing to the agent:

@unify.traced
def call_agent(system_msg, question, sub_questions, markscheme, answer, available_marks_total):
    local_agent = agent.copy()
    with_subqs = len(markscheme) > 1
    response_format = create_response_format(
        list(markscheme.keys()) if with_subqs else None
    )
    local_agent.set_response_format(response_format)
    if with_subqs:
        output_response_exp = output_response_explanations["with_subqs"]
        output_response_exp = output_response_exp.replace(
            "{subquestions}", json.dumps(list(markscheme.keys()))
        )
    else:
        output_response_exp = output_response_explanations["without_subqs"]
    markscheme = {k: update_markscheme(v) for k, v in markscheme.items()}
    local_agent.set_system_message(
        system_msg.replace(
            "{question}", question
        ).replace(
            "{markscheme}", json.dumps(markscheme, indent=4)
        ).replace(
            "{answer}", json.dumps(answer, indent=4)
        ).replace(
            "{available_marks_total}", str(available_marks_total)
        ).replace(
            "{questions_markscheme_and_answers}", json.dumps(
              {
                  k: {
                      "sub-question": sub_questions[k],
                      "markscheme": markscheme[k],
                      "answer": answer[k]
                  } for k in sub_questions.keys()
              },
              indent=4
            )
        ).replace(
            "{output_response_explanation}",
            output_response_exp
        )
    )
    ret = local_agent.generate()
    if "```" in ret:
        ret = ret.split("```")[-2].lstrip("json")
    ret = response_format.model_validate_json(ret).model_dump()
    if not with_subqs:
        return {"_": ret}
    return ret

πŸ”€ Clarify Strict Reasoning

We’ve just addressed the recurring problem for Examples 207 and 261, but the failure for Example 132 (c) was quite different.

Let’s add another instruction to our general_guidelines variable, with an imaginary extra piece of guidance, to try and avoid the leniency we’ve observed in the marking of Example 132 (c).

general_guidelines = general_guidelines.rstrip("-") + """15.
When students are explaining something in their answer, then their explanation must make *exactly* the same point(s) as are made in the markscheme. The wording can be slightly different, but the underlying observations/reasons must be *identical*, unless otherwise stated *explicitly* in the markscheme.

----
"""
system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}

The question you need to mark is:

{question}

The sub-question breakdown, including each sub-question, its associated markscheme and its associated answer, is as follows:

{questions_markscheme_and_answers}


{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines
)

Although we’ve applied no changes to the source code of evaluate, we need to run this cell in the notebook again so that the @unify.log decorator catches the updated functions called by evaluate, such as call_agent.

@unify.log
def evaluate(
    question,
    sub_questions,
    student_answer,
    available_marks_total,
    markscheme,
    correct_marks,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        _system_message, question, sub_questions, markscheme, student_answer,
        available_marks_total
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"] for (k, vcor), (_, vpred) in
        zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d
        } for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values()
        )
    }
    return error

πŸ§ͺ Rerun Tests

Now we’ve made both of these changes, let’s re-run our evals to see if either of these changes were able to address the problems they’re intended to resolve.

with unify.Experiment("align_guidelines_and_clarify_reasoning"), unify.Params(
    system_message=system_message,
    model=agent.model,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(**d.entries, _system_message=system_message)
             for d in test_set_10
        ],
        name="Evals"
    )

Our failure mechanisms are exactly the same as before; clearly the agent is still struggling to correctly reason about the different mark types.

πŸ” Iteration 6

πŸ” Still Ignoring Mark Types

Let’s try to force the agent to reason about each potential mark mentioned in the markscheme, by further refining our structured output. Let’s expand upon the reasoning field for each sub-question, with a field for each mark type referenced in the sub-question markscheme, going from the following structure:

Prediction:
  a:
    reasoning: str
    marks: int
  b:
    reasoning: str
    marks: int
  ...

To this version which explicitly enforces reasoning about each potential mark type referenced in the markscheme:

Prediction:
  a:
    reasoning:
      B1:
        thoughts: str
        should_award: bool
      SC1:
        thoughts: str
        should_award: bool
      ...,
      overall_thoughts: str
    marks: int
  b:
    reasoning:
      M1:
        thoughts: str
        should_award: bool
      A1:
        thoughts: str
        should_award: bool
      ...,
      overall_thoughts: str
    marks: int
  ...

This way, the agent will be forced to reason about SC1 for Example 207, M1 for Example 261, and B1 for Example 132 (c).

πŸ”€ Mark Type Reasoning

Let’s first define a function to dynamically construct the required pydantic type. For each parsed mark type, we want the model to give its thoughts and make a decision as to whether or not the mark should be awarded. Let’s create this pydantic type first:

class ThoughtsAndAwardDecision(BaseModel):
    thoughts: str
    should_award: bool

Let’s then create a function to dynamically construct a PerMarkReasoning pydantic type, with one ThoughtsAndAwardDecision instance for each mark detected in the sub-question markscheme.

@unify.traced(name="create_per_mark_reasoning_format_{mark_types}")
def create_per_mark_reasoning_format(mark_types):
    response_fields = dict(
        zip(
            mark_types + ["overall_thoughts"], [(ThoughtsAndAwardDecision, ...)] * len(mark_types) + [(str, ...)]
        )
    )
    return create_model('PerMarkReasoning', **response_fields)

Let’s then re-define MarksAndReasoning (previously this was statically defined, see above) such that the reasoning field is no longer just a string, but is instead our newly created PerMarkReasoning (above).

@unify.traced(name="create_marks_and_reasoning_format_{mark_types}")
def create_marks_and_reasoning_format(mark_types):
    return create_model(
        'MarksAndReasoning',
        reasoning=(create_per_mark_reasoning_format(mark_types), ...),
        marks=(int, ...)
    )

Finally, let’s then update the top-level function create_response_format such that we’re making use of our newly defined create_marks_and_reasoning_format for each sub-question.

@unify.traced(name="create_response_format_{response_keys}_{mark_types}")
def create_response_format(response_keys, mark_types):
    if response_keys:
        response_fields = dict(
            zip(
                response_keys,
                [
                    (create_marks_and_reasoning_format(mark_types[key]), ...)
                    for key in response_keys
                ]
            )
        )
        return create_model('Response', **response_fields)
    else:
        return create_marks_and_reasoning_format(mark_types["_"])
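
As another quick illustration (again, not part of the agent), here’s the response format for a hypothetical two-part question where part (a) references B1 and part (b) references M1 and A1; each sub-question field holds its own dynamically created MarksAndReasoning model:

ExampleResponse = create_response_format(
    ["a", "b"],
    {"a": ["B1"], "b": ["M1", "A1"]}
)
print(list(ExampleResponse.model_fields))
# ['a', 'b']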

We also need to write a function to parse the relevant marks from each sub-question markscheme. We can take inspiration from here, where we wrote a parser for a slightly different purpose. Let’s have the function extract the marks, and also the surrounding context.

@unify.traced
def parse_marks_from_markscheme(markscheme: str):
    extracted_marks = re.findall(r'(?:SC|M|A|B)\d+', markscheme)
    if not extracted_marks:
        return []
    marks_n_context = list()
    for i, mark in enumerate(extracted_marks):
        index = markscheme.find(mark)
        chunk = markscheme[0:index]
        if i > 0:
            marks_n_context[i-1][1] += chunk
        markscheme = markscheme[index:]
        marks_n_context.append([mark, chunk])
    marks_n_context[-1][1] += markscheme
    return marks_n_context
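
For instance, running the parser on a short made-up markscheme line behaves as follows. Each mark is paired with its surrounding context (adjacent chunks can overlap, and the final mark also absorbs the remainder of the markscheme):

print(parse_marks_from_markscheme(
    "M1 for correct subtraction, A1 for correct answer"
))
# [['M1', 'M1 for correct subtraction, '],
#  ['A1', 'M1 for correct subtraction, A1 for correct answer']]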

Finally, we’ll also need to update call_agent such that we call parse_marks_from_markscheme on each sub-question markscheme, and then pass these into our newly defined create_response_format.

@unify.traced
def call_agent(system_msg, question, sub_questions, markscheme, answer, available_marks_total):
    local_agent = agent.copy()
    with_subqs = len(markscheme) > 1
    response_format = create_response_format(
        list(markscheme.keys()) if with_subqs else None,
        {k: [itm[0] for itm in parse_marks_from_markscheme(v)]
        for k, v in markscheme.items()}
    )
    local_agent.set_response_format(response_format)
    if with_subqs:
        output_response_exp = output_response_explanations["with_subqs"]
        output_response_exp = output_response_exp.replace(
            "{subquestions}", json.dumps(list(markscheme.keys()))
        )
    else:
        output_response_exp = output_response_explanations["without_subqs"]
    markscheme = {k: update_markscheme(v) for k, v in markscheme.items()}
    local_agent.set_system_message(
        system_msg.replace(
            "{question}", question
        ).replace(
            "{markscheme}", json.dumps(markscheme, indent=4)
        ).replace(
            "{answer}", json.dumps(answer, indent=4)
        ).replace(
            "{available_marks_total}", str(available_marks_total)
        ).replace(
            "{questions_markscheme_and_answers}", json.dumps(
              {
                  k: {
                      "sub-question": sub_questions[k],
                      "markscheme": markscheme[k],
                      "answer": answer[k]
                  } for k in sub_questions.keys()
              },
              indent=4
            )
        ).replace(
            "{output_response_explanation}",
            output_response_exp
        )
    )
    ret = local_agent.generate()
    if "```" in ret:
        ret = ret.split("```")[-2].lstrip("json")
    ret = response_format.model_validate_json(ret).model_dump()
    if not with_subqs:
        return {"_": ret}
    return ret

Let’s also update our system message to better explain to the agent how it should reason about this new output structure.

output_response_explanations = dict()
output_response_explanations["with_subqs"] = "For each sub-question {subquestions}, you should populate the `reasoning` field with your general thoughts on each individual mark identified in the markscheme, and also a decision as to whether each of these mark should be awarded. These marks are not necessarily cumulative with regards to the marks to award, and some may be irrelevant given the student's approach or answer, in which case just respond `False` for the `should_award` field. Finally, you should put the total number of marks to award for each sub-question in the corresponding `marks` field."
output_response_explanations["without_subqs"] = "You should populate the `reasoning` field with your general thoughts on each individual mark identified in the markscheme, and also a decision as to whether each of these mark should be awarded. These marks are not necessarily cumulative with regards to the marks to award, and some may be irrelevant given the student's approach or answer, in which case just respond `False` for the `should_award` field. Finally, you should put the total number of marks to award in the `marks` field."

We also need to update evaluate, given some of the other changes.

@unify.log
def evaluate(
    question,
    sub_questions,
    student_answer,
    available_marks_total,
    markscheme,
    correct_marks,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        _system_message, question, sub_questions, markscheme, student_answer,
        available_marks_total
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"] for (k, vcor), (_, vpred) in
        zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d
        } for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values()
        )
    }
    return error

πŸ§ͺ Rerun Tests

with unify.Experiment("mark_type_reasoning"), unify.Params(
    system_message=system_message,
    model=agent.model,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(**d.entries, _system_message=system_message)
             for d in test_set_10
        ],
        name="Evals"
    )

The failure modes are still entirely unchanged! o3-mini is certainly very stubborn about its decisions for these questions.

πŸ” Iteration 7

πŸ” Still Ignoring Mark Types

Let’s see what effect our new output format had on the nature of the agent’s responses, if any.

Considering Example 207, the agent still failed to award SC1 for the student’s answer 1/3, 0.34, 3.5%, despite the markscheme explicitly stating SC1 for 1/3, 0.34, 3.5%. The agent’s explicit thoughts about SC1 were:

πŸ€– No special case credit is applicable here since the order is incorrect and no alternative acceptable method is demonstrated.

This is a pretty fluffy and empty statement. Despite o3-mini being a multi-step reasoning model, perhaps we’re still asking the agent to consider too many things at once.

Forcing the agent to consider one mark at a time might rectify this lack of attention to detail.

Example 132 is even more difficult: the agent not only needs to consider each mark, but also has six different sub-questions to reason about, each with its own set of available marks and mark types.

Let’s see if using a separate LLM call per sub-question improves the performance on Example 132.

πŸ”€ Queries per Subquestion

Firstly, let’s create a new system prompt for our agent, which will reason about one sub-question at a time.

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}


The question you need to mark is:

{question}


Their answer is:

{answer}


The markscheme is:

{markscheme}


{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines
)

Given these changes, we can also remove the output_response_explanations dict and replace it with a single output_response_explanation string, since the agent no longer needs to output responses for multiple sub-questions in a single response.

output_response_explanation = "You should populate the `reasoning` field with your general thoughts on each individual mark identified in the markscheme, and also a decision as to whether each of these mark should be awarded. These marks are not necessarily cumulative with regards to the marks to award, and some may be irrelevant given the student's approach or answer, in which case just respond `False` for the `should_award` field. Finally, you should put the total number of marks to award in the `marks` field."

Let’s update call_agent to map each subquestion to a unique LLM call, and let’s also add a unify.traced decorator so we’re able to capture each individual LLM call in the overall trace.

@unify.traced
def call_agent(example_id, system_msg, sub_questions, markscheme, answer, available_marks):
    agents = {k: agent.copy() for k in markscheme.keys()}
    with_subqs = len(markscheme) > 1
    response_formats = {
        k: create_marks_and_reasoning_format(
            [itm[0] for itm in parse_marks_from_markscheme(v)]
        ) for k, v in markscheme.items()
    }
    [
        agnt.set_response_format(rf)
        for agnt, rf in zip(
            agents.values(), response_formats.values()
        )
    ]
    markscheme = {
        k: update_markscheme(v) for k, v in markscheme.items()
    }
    for k in markscheme.keys():
        agents[k].set_system_message(
            system_msg.replace(
                "{question}", sub_questions[k]
            ).replace(
                "{markscheme}", markscheme[k]
            ).replace(
                "{answer}", answer[k]
            ).replace(
                "{available_marks}",
                str(available_marks[k.replace("_", "total")])
            ).replace(
                "{output_response_explanation}",
                output_response_explanation
            )
        )
    rets = unify.map(
        lambda a: a.generate(),
        list(agents.values()),
        name=f"Evals[{example_id}]->SubQAgent"
    )
    rets = [
        ret.split("```")[-2].lstrip("json")
        if "```" in ret else ret
        for ret in rets
    ]
    rets = {
        k: response_formats[k].model_validate_json(ret).model_dump()
        for k, ret in zip(markscheme.keys(), rets)
    }
    return rets

Let’s also update evaluate to pass the updated parameters to call_agent.

@unify.log
def evaluate(
    example_id,
    question,
    sub_questions,
    student_answer,
    available_marks,
    markscheme,
    correct_marks,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        example_id, _system_message, sub_questions, markscheme, student_answer,
        available_marks
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"] for (k, vcor), (_, vpred) in
        zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d
        } for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values()
        )
    }
    return error

πŸ§ͺ Rerun Tests

with unify.Experiment("queries_per_subquestion"), unify.Params(
    system_message=system_message,
    model=agent.model,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(**d.entries, _system_message=system_message)
             for d in test_set_10
        ],
        name="Evals"
    )

πŸ” Iteration 8

πŸ” Lack of Global Context

These updates have actually regressed overall performance, with the mean error now at 0.5.

Maybe the purely local reasoning has some shortcomings. Let’s focus on one of the new regressions, to understand why our latest change has disrupted the agent in a case where it was previously consistently correct.

Example 20 (b)

❓ Question [1 Mark]:

Explain why he might not want to use Account A.

πŸ“ Student’s Answer:

He might not want to use Account A because the bank might keep the money forever and he could lose it.

β˜‘οΈ Markscheme:

He may not want to leave it there for 3 years. Accept any valid reason.

βœ… Correct Marks [0/1] Rationale:

The markscheme awards 1 mark for a valid reason such as not wanting to lock away funds for 3 years. Here, the stated reason is irrelevant and does not accurately reflect the restrictions of Account A.

πŸ€– Predicted Marks [1/1] ❌ Rationale:

The mark scheme awards 1 mark for any valid reason why the person might not want to leave the money in Account A for 3 years. The student’s answer, although phrasing it as the bank might keep the money forever and thus he could lose it, is a valid reason to be cautious, even though it doesn’t mention the explicit time period of 3 years. This interpretation sufficiently aligns with the mark scheme’s requirement of a valid reason. Therefore, the mark is awarded.

Clearly, the agent is now taking some liberties with what constitutes a β€œvalid reason”. It’s unclear why the agent is only making these mistakes now.

Let’s look at some of the previous justifications for not awarding the mark, from the previous runs.

πŸ€– Predicted Marks [x/1] Rationales:

simple_agent [0/1] βœ…: The answer should explain that although Account A gives slightly more money, Derrick might not want this account because his money is tied up for three years (i.e. it has poor liquidity). The student’s explanation about the bank β€œkeeping the money forever” is not an acceptable financial rationale.

add_markscheme [0/1] βœ…: The explanation provided is not a valid reason and does not address the restriction of no withdrawals for three years. Therefore, no mark can be awarded for this part either.

add_marking_guidelines [0/1] βœ…: The explanation given (β€œbecause the bank might keep the money forever and he could lose it”) does not correctly address why he might not want to use Account A (namely, that his money is locked away for 3 years with no withdrawals allowed).

add_structured_output [0/1] βœ…: The answer needed to explain why Account A might not be desirable, which is because funds are locked in for three years and withdrawals are not allowed until the end of the period. The student’s answer states that the bank might keep the money forever and he could lose it, which does not address the actual restriction (lack of liquidity). This is an irrelevant or incorrect reason and does not meet the mark scheme.

align_context [0/1] βœ…: The mark scheme accepts any valid reason such as not being able to withdraw the money because it is locked in for 3 years. The candidate’s answer – that the bank might β€˜keep the money forever’ and he could lose it – is not a valid explanation for the funds being inaccessible. It does not clearly express that his money is tied up for 3 years. Therefore, this explanation does not meet the requirement and no mark should be awarded.

align_guidelines_and_clarify_reasoning [0/1] βœ…: The answer to part (b) should explain that Account A requires the money to be locked in for three years which might be a disadvantage if immediate access is desired. The student’s explanation – that the bank might β€˜keep the money forever’ – does not address the real issue of the restriction on withdrawals. As the explanation provided is not a valid reason as per the mark scheme, no marks should be awarded for this part.

mark_type_reasoning [0/1] βœ…: The explanation provided does not address the key issue mentioned in the mark scheme (the restriction of no withdrawals for 3 years) and instead gives a non sequitur reason that the bank might β€˜keep the money forever.’ This does not meet the required rationale, so no mark is awarded.

Perhaps withholding the full question from the agent prevents it from using β€œcommon sense” and realizing how β€œsilly” the proposed answer is, in light of the overall question and the information provided to the student.

Maybe strict adherence to the markscheme alone, without the full context, is too restrictive.

Let’s update our per-subquestion system prompts to also fully include the preceding sub-questions, their markschemes, and their answers. It’s unlikely that the context of a later question will assist with the marking of an earlier question, and we still want to keep the agent as focused as possible on the relevant information.

πŸ”€ Include Preceding Context

Let’s first update the system prompt, re-introducing the placeholder for the aligned subquestions, markschemes and answers, this time calling it {prior_context}, which will only be included when sub-questions are present. Let’s also include the full question.

system_message = """
Your task is to award a suitable number of marks for a student's answer to question {subq}, from 0 up to a maximum of {available_marks} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}


The *overall* question is:

{question}

{prior_context}

The specific question you need to mark is:

{subquestion}


Their answer to this specific question is:

{answer}


The markscheme for this specific question is:

{markscheme}


{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines
)

Let’s also add a general explanation for the prior context, in cases where it is included.

prior_context_exp = """
All of the *preceding* sub-questions, their specific markschemes and the student's answers are as follows:
"""

Let’s now update call_agent to pass in the required information.

@unify.traced
def call_agent(example_id, system_msg, question_num, question, sub_questions, markscheme, answer, available_marks):
    agents = {k: agent.copy() for k in markscheme.keys()}
    with_subqs = len(markscheme) > 1
    response_formats = {
        k: create_marks_and_reasoning_format(
            [itm[0] for itm in parse_marks_from_markscheme(v)]
        ) for k, v in markscheme.items()
    }
    [
        agnt.set_response_format(rf)
        for agnt, rf in zip(
            agents.values(), response_formats.values()
        )
    ]
    markscheme = {
        k: update_markscheme(v) for k, v in markscheme.items()
    }
    for i, k in enumerate(markscheme.keys()):
        agents[k].set_system_message(
            system_msg.replace(
                "{subq}", k.replace("_", str(question_num))
            ).replace(
                "{question}", question,
            ).replace(
                "{subquestion}", sub_questions[k]
            ).replace(
                "{markscheme}", markscheme[k]
            ).replace(
                "{answer}", answer[k]
            ).replace(
                "{available_marks}",
                str(available_marks[k.replace("_", "total")])
            ).replace(
                "{output_response_explanation}",
                output_response_explanation
            ).replace(
            "{prior_context}", (prior_context_exp + json.dumps(
              {
                  k: {
                      "sub-question": sub_questions[k],
                      "markscheme": markscheme[k],
                      "answer": answer[k]
                  } for k in list(sub_questions.keys())[0:i]
              },
              indent=4
            )) if with_subqs and i > 0 else ""
          )
        )
    rets = unify.map(
        lambda a: a.generate(),
        list(agents.values()),
        name=f"Evals[{example_id}]->SubQAgent"
    )
    rets = [
        ret.split("```")[-2].lstrip("json")
        if "```" in ret else ret
        for ret in rets
    ]
    rets = {
        k: response_formats[k].model_validate_json(ret).model_dump()
        for k, ret in zip(markscheme.keys(), rets)
    }
    return rets

Finally, let’s update evaluate accordingly.

@unify.log
def evaluate(
    example_id,
    question_num,
    question,
    sub_questions,
    student_answer,
    available_marks,
    available_marks_total,
    markscheme,
    correct_marks,
    correct_marks_total,
    per_question_breakdown,
    _system_message,
):
    pred_marks = call_agent(
        example_id, _system_message, question_num, question, sub_questions, markscheme, student_answer, available_marks
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"] for (k, vcor), (_, vpred) in
        zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d
        } for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values()
        )
    }
    return error

πŸ§ͺ Rerun Tests

with unify.Experiment("with_preceeding_context"), unify.Params(
    system_message=system_message,
    model=agent.model,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(**d.entries, _system_message=system_message)
             for d in test_set_10
        ],
        name="Evals"
    )

Great, so we’ve fixed the new regressions, but we’re back to the same three failures, failing for the same reasons.

πŸ” Iteration 9

πŸ” Still Ignoring Mark Types

Given that the agent is still failing to follow the instructions for each mark in the markscheme, perhaps it’s time we tried to perform per-mark reasoning, with a separate LLM call made for each candidate mark to award. This might help the LLM deeply consider each candidate mark mentioned in the markscheme.

Let’s give it a try!

πŸ”€ Queries per Mark

We will still want our per-subquestion LLM to perform the final reasoning about the number of marks to award for the sub-question, but we just want to provide it with the reasoning performed by each of our per-mark LLM queries.

We now have two different LLMs with two different roles, and therefore we need two different system messages.

Let’s first update the subquestion system message, in anticipation of the incoming mark-by-mark reasoning. Let’s also split the markscheme and the mark type reasoning, rather than naively combining these as was done in update_markscheme.

subq_system_message = """
Your task is to award a suitable number of marks for a student's answer to question {subq}, from 0 up to a maximum of {available_marks} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}


The *overall* question is:

{question}

{prior_context}

The specific question you need to mark is:

{subquestion}


Their answer to this specific question is:

{answer}


The markscheme for this specific question is:

{markscheme}

{mark_types_explanation}

{mark_observations}

{output_response_explanation}
""".replace(
    "{general_guidelines}",
    general_guidelines
)

The "{mark_types_explanation}" placeholder can be overriden explicitly, giving us more control. Let’s create a new function extract_mark_type_explanation, inspired from update_markscheme above.

@unify.traced(name="extract_mark_type_explanation_{marks_to_consider}")
def extract_mark_type_explanation(markscheme: str, marks_to_consider=None):
    m_marks = sorted(list(set(re.findall(r'M\d+', markscheme))))
    a_marks = sorted(list(set(re.findall(r'A\d+', markscheme))))
    b_marks = sorted(list(set(re.findall(r'B\d+', markscheme))))
    sc_marks = sorted(list(set(re.findall(r'SC\d+', markscheme))))
    if not any(m_marks + a_marks + b_marks + sc_marks):
        return ""
    full_exp = """As a recap, the general guidelines for each of these mark types are as follows:

{mark_types_explanation}"""
    for marks in (m_marks, a_marks, b_marks, sc_marks):
        for mark in marks:
            if marks_to_consider and mark not in marks_to_consider:
                continue
            key = "".join(c for c in mark if not c.isdigit())
            num_marks = int("".join(c for c in mark if c.isdigit()))
            exp = mark_types[key]
            exp = exp.replace(
                "{num}", str(num_marks)
            ).replace(
                "{num_marks}", "1 mark" if num_marks == 1 else f"{num_marks} marks"
            )
            full_exp = full_exp.replace(
                "{mark_types_explanation}",
                key + ":/n" + exp + "\n\n{mark_types_explanation}"
            )
    return full_exp.replace("{mark_types_explanation}", "")
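
As a rough illustration (the exact wording depends on the mark_types templates defined earlier in the notebook), calling this on a made-up markscheme that references M1 and A1 returns a recap covering just those mark types, and marks_to_consider lets us restrict the recap to a single mark, as the per-mark agents will do below:

example_ms = "M1 for a correct method, A1 for the correct final answer"

# recap covering every mark type found in the markscheme
print(extract_mark_type_explanation(example_ms))

# recap restricted to a single mark (as used for the per-mark agents below)
print(extract_mark_type_explanation(example_ms, marks_to_consider=["M1"]))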

Let’s now create the system message for our mark reasoning agent, again with the explicit {mark_types_explanation} placeholder.

mark_system_message = """
Your task is to determine whether mark {mark} should be awarded for the following student's answer to question {subq}, based on the provided markscheme.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}


The *overall* question is:

{question}

{prior_context}

The specific question you need to mark is:

{subquestion}


Their answer to this specific question is:

{answer}


The markscheme for this specific question, with the mark in question {mark} shown in bold and followed by `(to consider!)`, is as follows:

{markscheme}

{mark_types_explanation}

You should populate the `thoughts` field with your thoughts on whether the specific mark identified within the markscheme should be awarded for the student's answer. The mark might be irrelevant given the student's approach or answer, in which case just respond `False` for the `should_award` field, and explain this in the `thoughts` field. Please think carefully about your decision for the mark, considering the general guidelines.
""".replace(
    "{general_guidelines}",
    general_guidelines
)

Let’s first define call_subq_agent, which performs the mark-by-mark reasoning, making a separate LLM call for each candidate mark parsed from the markscheme before the final sub-question marking call.

@unify.traced(name="call_subq_agent_{subq}")
def call_subq_agent(example_id, subq, subq_agent, markscheme, mark_sys_msg):
    parsed_markscheme = parse_marks_from_markscheme(markscheme)
    mark_agents = [
        [k, agent.copy()] for k in
        [itm[0] for itm in parsed_markscheme]
    ]
    [
        agnt.set_response_format(ThoughtsAndAwardDecision)
        for _, agnt in mark_agents
    ]
    for i, (k, v) in enumerate(parsed_markscheme):
        mark_agents[i][1].set_system_message(
            mark_sys_msg.replace(
                "{mark}", k
            ).replace(
                "{markscheme}", markscheme
            ).replace(
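                # within the markscheme text, bold this specific mark and tag it `(to consider!)`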
                v, v.replace(k, f"**{k}** (to consider!)")
            ).replace(
                "{mark_types_explanation}",
                extract_mark_type_explanation(markscheme, [k])
            )
        )
    if mark_agents:
        explanation = "An expert marker has already taken a look at the student's answer, and they have made the following observations for each of the candidate marks mentioned in the markscheme. You should pay special attention to these observations."
        vals = unify.map(
            lambda a: json.loads(a.generate()),
            [agnt for _, agnt in mark_agents],
            name=f"Evals[{example_id}]->SubQAgent[{subq}]->MarkAgent"
        )
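        # de-duplicate mark keys: if the same mark (e.g. B1) appears twice in
        # the markscheme, the second occurrence becomes "B1(1)", and so on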
        keys = list()
        for k, _ in mark_agents:
            if k not in keys:
                keys.append(k)
                continue
            keys.append(
                k + f"({len([ky for ky in keys if k in ky])})"
            )
        mark_obs_dict = dict(zip(keys, vals))
        mark_observations = explanation + "\n\n" + json.dumps(
            mark_obs_dict, indent=4
        )
    else:
        mark_observations = ""
    subq_agent.set_system_message(
        subq_agent.system_message.replace(
            "{mark_observations}",
            mark_observations
        )
    )
    ret = subq_agent.generate()
    if "```" in ret:
        ret = ret.split("```")[-2].lstrip("json")
    ret = json.loads(ret)
    if not mark_agents:
        return ret
    ret["reasoning"] = {
        **mark_obs_dict,
        "overall_thoughts": ret["reasoning"]
    }
    return ret

Let’s now update call_agent, making use of our call_subq_agent function, which processes a single sub-question.

@unify.traced
def call_agent(
    example_id,
    subq_system_message,
    mark_system_message,
    question_num,
    question,
    sub_questions,
    markscheme,
    answer,
    available_marks
):
    subq_agents = {k: agent.copy() for k in markscheme.keys()}
    with_subqs = len(markscheme) > 1
    response_formats = {
        k: MarksAndReasoning for k, v in markscheme.items()
    }
    [
        agnt.set_response_format(rf)
        for agnt, rf in zip(
            subq_agents.values(), response_formats.values()
        )
    ]
    mark_sys_msgs = list()
    for i, k in enumerate(markscheme.keys()):
        subq_agents[k].set_system_message(
            subq_system_message.replace(
                "{subq}", k.replace("_", str(question_num))
            ).replace(
                "{question}", question,
            ).replace(
                "{subquestion}", sub_questions[k]
            ).replace(
                "{markscheme}", markscheme[k]
            ).replace(
                "{mark_types_explanation}",
                extract_mark_type_explanation(markscheme[k])
            ).replace(
                "{answer}", answer[k]
            ).replace(
                "{available_marks}",
                str(available_marks[k.replace("_", "total")])
            ).replace(
                "{output_response_explanation}",
                output_response_explanation
            ).replace(
            "{prior_context}", (prior_context_exp + json.dumps(
              {
                  k: {
                      "sub-question": sub_questions[k],
                      "markscheme": markscheme[k],
                      "answer": answer[k]
                  } for k in list(sub_questions.keys())[0:i]
              },
              indent=4
            )) if with_subqs and i > 0 else ""
          )
        )
        mark_sys_msgs.append(
            mark_system_message.replace(
                "{subq}", k.replace("_", str(question_num))
            ).replace(
                "{question}", question,
            ).replace(
                "{subquestion}", sub_questions[k]
            ).replace(
                "{answer}", answer[k]
            ).replace(
            "{prior_context}", (prior_context_exp + json.dumps(
              {
                  k: {
                      "sub-question": sub_questions[k],
                      "markscheme": markscheme[k],
                      "answer": answer[k]
                  } for k in list(sub_questions.keys())[0:i]
              },
              indent=4
            )) if with_subqs and i > 0 else ""
          )
        )
    rets = unify.map(
        lambda *a: call_subq_agent(example_id, *a),
        list(sub_questions.keys()),
        list(subq_agents.values()),
        list(markscheme.values()),
        mark_sys_msgs,
        from_args=True,
        name=f"Evals[{example_id}]->SubQAgent"
    )
    return dict(zip(markscheme.keys(), rets))

We also need to update the evaluate function, to pass each of the two different system messages to the call_agent function.

@unify.log
def evaluate(
    example_id,
    question_num,
    question,
    sub_questions,
    student_answer,
    available_marks,
    available_marks_total,
    markscheme,
    correct_marks,
    correct_marks_total,
    per_question_breakdown,
    _subq_system_message,
    _mark_system_message
):
    pred_marks = call_agent(
        example_id,
        _subq_system_message,
        _mark_system_message,
        question_num,
        question,
        sub_questions,
        markscheme,
        student_answer,
        available_marks
    )
    pred_marks_total = sum([v["marks"] for v in pred_marks.values()])
    diff = {
        k: vcor["marks"] - vpred["marks"] for (k, vcor), (_, vpred) in
        zip(correct_marks.items(), pred_marks.items())
    }
    error = {k: abs(v) for k, v in diff.items()}
    diff_total = sum(diff.values())
    error_total = sum(error.values())
    per_question_breakdown = {
        k: {
            **per_question_breakdown[k],
            "predicted_marks": pm,
            "diff": d
        } for (k, pqb), pm, d in zip(
            per_question_breakdown.items(),
            pred_marks.values(),
            diff.values()
        )
    }
    return error

πŸ§ͺ Rerun Tests

agent.set_cache("read-only")
with unify.Experiment("queries_per_mark"), unify.Params(
    subq_system_message=subq_system_message,
    mark_system_message=mark_system_message,
    model=agent.model,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(
                 **d.entries,
                 _subq_system_message=subq_system_message,
                 _mark_system_message=mark_system_message
            )
             for d in test_set_10
        ],
        name="Eval",
    )

Great, this seems to have addressed two of the three failures (on this run at least).

πŸ” Iteration 10

πŸ” Method Marks Confusion

All of the prior failures now seem to have been resolved, but we have a new regression for Example 215 (b). Let’s take a look.

Example 215 (b)

❓ Question [2 Marks]:

Complete these rows of boxes using the rule shown above.

πŸ“ Student’s Answer:

The missing boxes are 8, 13, and 21, making the sequence 8, 13, 21, 34, 55.

β˜‘οΈ Markscheme:

Answer: 8, 13, 21 Marks: 2 Guidance: M1 for one correct subtraction of two boxes

βœ… Correct Marks [2/2] Rationale:

Here, we deduce the missing numbers by working backwards from 34 and 55 or by applying the same rule forward. According to the markscheme (2 marks), we get full credit for the fully correct set of missing numbers, with one method mark (M1) for a correct subtraction and a final accuracy mark (A1) for the correct sequence.

πŸ€– Predicted Marks [0/2] Rationale:

The candidate’s final numerical answer is correct (the missing boxes are 8, 13, and 21), matching the answer in the mark scheme. However, the mark scheme specifically awards M1 for showing a correct subtraction (i.e. working that reverses the addition rule), which is a required part of the method. Since the candidate only provided the final answer without demonstrating any subtraction (or any equivalent working) to show the method, no evidence is provided to award the M1 mark. As A marks (for an accurate answer) depend on obtaining the method mark, no further marks can be awarded. Therefore, despite the correct final answer, the lack of required working means no marks should be awarded.

This is an interesting failure mode. Notably, the justification for the β€œcorrect” (ground truth) marks is itself wrong: there is no A1 mark for this question (which would depend on a method mark). This is irrelevant to the agent’s failure (the agent doesn’t know the correct marks or rationale), but it’s still a noteworthy observation about our β€œground truth” data.

Interestingly, the agent has made the same mistake that appears in the β€œground truth” rationale: it presumes the existence of an A mark where none is stated. It seems the agent doesn’t understand that a fully correct answer should always earn full marks, unless otherwise explicitly stated; M1 marks are not required for full marks in such cases, unless an A mark is specifically referenced.

πŸ”€ Clarify Method + Answer Marks

Let’s try to fully clarify these points for the sub-question agent, and re-run the evals.

output_response_explanation = """
You should populate the `reasoning` field with your general thoughts on each individual mark identified in the markscheme, and also a decision as to whether each of these marks should be awarded.

If you deem that a mark *should* be awarded (such as SC1, B1, A1 etc.), then it is worth as many marks as appear in the mark type itself (SC1, B1, and A1 are therefore worth 1 mark each, A2 is worth 2 marks etc.). However, these marks are not *necessarily* cumulative with regards to the total marks to award for this sub-question, and some may be irrelevant given the student's approach or answer.

More importantly, full marks should *always* be given for a fully correct answer, unless otherwise *explicitly* stated. For example, a correct answer without any method shown should still get *full marks*, despite the M1 criteria not being met. The only exception to this is explicitly referenced A marks, which do depend on the preceding M marks being awarded.

Finally, after you've given it a lot of thought, you should put the total number of marks to award for this sub-question in the `marks` field.
"""

πŸ§ͺ Rerun Tests

with unify.Experiment("clarify_method_marks"), unify.Params(
    subq_system_message=subq_system_message,
    mark_system_message=mark_system_message,
    model=agent.model,
    dataset="dataset_10"
):
    unify.map(
        evaluate,
        [
             dict(
                 **d.entries,
                 _subq_system_message=subq_system_message,
                 _mark_system_message=mark_system_message
            )
             for d in test_set_10
        ],
        name="Eval"
    )

Great, we’ve finally got all 10/10 tests passing perfectly πŸŽ‰

Finally, if you’re running this notebook in Google Colab, you can download the .cache.json file generated during this walkthrough like so:

from google.colab import files
files.download('.cache.json')