2: Add Marking Guidelines - Unify Documentation

The full script for running this iteration can be found here.

🔍 Misunderstanding Mark Types

Seven of the ten examples have an error of 0, and the remaining three each have an error of 1. Let’s take a closer look at each of these failures, to see why they occur and how we can rectify them.
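
As a quick sanity check, the mean error can be reproduced by hand from the marks listed for the three failing examples. This is only a sketch: it assumes the error is the absolute difference between the correct and predicted marks (the metric itself is defined in an earlier iteration), and the dictionary is typed in manually rather than pulled from the logs.

```python
# (correct marks, predicted marks) for the three failing examples.
failing = {"207": (1, 0), "261": (1, 0), "132 (c)": (1, 2)}

# Assumed metric: absolute difference between correct and predicted marks.
errors = [abs(correct - predicted) for correct, predicted in failing.values()]
errors += [0] * 7  # the other seven examples were marked perfectly

mean_error = sum(errors) / len(errors)
print(mean_error)
```

Three errors of 1 across ten examples is where the starting mean error of 0.3 comes from.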

Example 207

📄 PDFs

❓ Parsed Question [2 Marks]

📝 Student's Answer

☑️ Parsed Markscheme

✅ Correct Marks [1/2] Rationale

🤖 Predicted Marks [0/2] Rationale

Example 261

📄 PDFs

❓ Parsed Question [5 Marks]

📝 Student's Answer

☑️ Parsed Markscheme

✅ Correct Marks [1/5] Rationale

🤖 Predicted Marks [0/5] Rationale

Example 132 (c)

📄 PDFs

❓ Parsed Question [2 Marks]

📝 Student's Answer

☑️ Parsed Markscheme

✅ Correct Marks [1/2] Rationale

🤖 Predicted Marks [2/2] Rationale

Thoughts

Overall, it’s clear that the agent is unable to properly make sense of the different mark types, such as B1, SC1, M1, A1 etc. This is not surprising, as we’ve never explained what these terms mean in the system prompt!

🔀 Add Marking Guidelines

Let’s add the general marking guidelines to the system prompt, so the agent knows what all of these mark terms mean, and also fully understands how to interpret the markscheme for each question.

The marking guidelines can be extracted from the beginning of any of the markscheme PDF files, such as this one. Let’s store them in a separate variable, which will make it easier for us to parameterize the inclusion of the guidelines in future experiment iterations.

general_guidelines = """----

1.
M marks are for using a correct method and are not lost for purely numerical errors.
A marks are for an accurate answer and depend on preceding M (method) marks. Therefore M0 A1 cannot be awarded.
B marks are independent of M (method) marks and are for a correct final answer, a partially correct answer, or a correct intermediate stage.
SC marks are for special cases that are worthy of some credit.

2.
Unless the answer and marks columns of the mark scheme specify M and A marks etc, or the mark scheme is ‘banded’, then if the correct answer is clearly given and is not from wrong working full marks should be awarded.

Do not award the marks if the answer was obtained from an incorrect method, i.e. incorrect working is seen and the correct answer clearly follows from it.

3.
Where follow through (FT) is indicated in the mark scheme, marks can be awarded where the candidate’s work follows correctly from a previous answer whether or not it was correct.

Figures or expressions that are being followed through are sometimes encompassed by single quotation marks after the word their for clarity, e.g. FT 180 × (their ‘37’ + 16), or FT 300 – (their ‘52 + 72’). Answers to part questions which are being followed through are indicated by e.g. FT 3 × their (a).

For questions with FT available you must ensure that you refer back to the relevant previous answer. You may find it easier to mark these questions candidate by candidate rather than question by question.

4.
Where dependent (dep) marks are indicated in the mark scheme, you must check that the candidate has met all the criteria specified for the mark to be awarded.

5.
The following abbreviations are commonly found in GCSE Mathematics mark schemes.
- **figs 237**, for example, means any answer with only these digits. You should ignore leading or trailing zeros and any decimal point e.g. 237000, 2.37, 2.370, 0.00237 would be acceptable but 23070 or 2374 would not.
- **isw** means **ignore subsequent working** after correct answer obtained and applies as a default.
- **nfww** means not from wrong working.
- **oe** means **or equivalent**.
- **rot** means **rounded or truncated**.
- **seen** means that you should award the mark if that number/expression is seen anywhere in the answer space, including the answer line, even if it is not in the method leading to the final answer
- **soi** means seen or implied.

6.
In questions with no final answer line, make no deductions for wrong work after an acceptable answer (ie **isw**) unless the mark scheme says otherwise, indicated by the instruction ‘mark final answer’.

7.
In questions with a final answer line following working space:

(i) If the correct answer is seen in the body of working and the answer given on the answer line is a clear transcription error allow full marks unless the mark scheme says ‘mark final answer’. Place the annotation ✓ next to the correct answer.

(ii) If the correct answer is seen in the body of working but the answer line is blank, allow full marks. Place the annotation ✓ next to the correct answer.

(iii) If the correct answer is seen in the body of working but a completely different answer is seen on the answer line, then accuracy marks for the answer are lost. Method marks could still be awarded. Use the M0, M1, M2 annotations as appropriate and place the annotation  next to the wrong answer.

8.
In questions with a final answer line:

(i) If one answer is provided on the answer line, mark the method that leads to that answer.

(ii) If more than one answer is provided on the answer line and there is a single method provided, award method marks only.

(iii) If more than one answer is provided on the answer line and there is more than one method provided, award zero marks for the question unless the candidate has clearly indicated which method is to be marked.

9.
In questions with no final answer line:

(i) If a single response is provided, mark as usual.

(ii) If more than one response is provided, award zero marks for the question unless the candidate has clearly indicated which response is to be marked.

10.
When the data of a question is consistently misread in such a way as not to alter the nature or difficulty of the question, please follow the candidate’s work and allow follow through for **A** and **B** marks. Deduct 1 mark from any **A** or **B** marks earned and record this by using the MR annotation. **M** marks are not deducted for misreads.

11.
Unless the question asks for an answer to a specific degree of accuracy, always mark at the greatest number of significant figures even if this is rounded or truncated on the answer line. For example, an answer in the mark scheme is 15.75, which is seen in the working. The candidate then rounds or truncates this to 15.8, 15 or 16 on the answer line. Allow full marks for the 15.75.

12.
Ranges of answers given in the mark scheme are always inclusive.

13.
For methods not provided for in the mark scheme give as far as possible equivalent marks for equivalent work.

14.
Anything in the mark scheme which is in square brackets […] is not required for the mark to be earned, but if present it must be correct.

----"""
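
Some of these abbreviations are precise enough to express as code, which can help when checking the agent's rationale by hand. As an illustration only, here is a minimal sketch of the **figs** rule from point 5. `matches_figs` is a hypothetical helper, not part of the mark scheme or of `unify`, and it assumes answers are plain non-negative decimal strings.

```python
def matches_figs(figs: str, answer: str) -> bool:
    """Return True if `answer` contains only the digits in `figs`,
    ignoring the decimal point and any leading or trailing zeros."""
    digits = answer.replace(".", "").strip("0")
    return digits.isdigit() and digits == figs.strip("0")


# The examples given in the guidelines for "figs 237".
accepted = ["237000", "2.37", "2.370", "0.00237"]
rejected = ["23070", "2374"]

print([matches_figs("237", a) for a in accepted])
print([matches_figs("237", a) for a in rejected])
```

The agent itself still applies this rule through the prompt; the helper is just a way to confirm by hand what the guideline accepts.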

Let’s also update the system message to include a placeholder for these general guidelines, and reword other parts of it to make clear which parts relate to the specific question and which are general guidelines.

system_message = """
Your task is to award a suitable number of marks for a student's answer to a question, from 0 up to a maximum of {available_marks_total} marks.

The general marking guidelines (relevant for all questions) are as follows:

{general_guidelines}

The question you need to mark is:

{question}


The markscheme for this specific question is:

{markscheme}


The student's answer to this question (which you need to mark) is:

{answer}


As the very final part of your response, simply provide the number of marks on a *new line*, without any additional formatting. For example:

3
"""
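
Because the prompt asks for the mark on a new line at the very end of the response, the prediction can be read back from the last non-empty line. The sketch below shows one way this could be done. `parse_marks` is a hypothetical helper and is not necessarily how the `evaluate` function from earlier iterations does it.

```python
def parse_marks(response: str, available_marks_total: int) -> int:
    """Read the awarded marks from the last non-empty line of a response."""
    lines = response.strip().splitlines()
    if not lines:
        raise ValueError("empty response")
    # int() raises ValueError if the final line is not a bare integer.
    marks = int(lines[-1].strip())
    if not 0 <= marks <= available_marks_total:
        raise ValueError(f"{marks} is outside 0..{available_marks_total}")
    return marks


example_response = "The method is correct, so M1 is awarded, but A0.\n\n1\n"
print(parse_marks(example_response, available_marks_total=2))
```

Failing loudly on a malformed final line makes formatting mistakes by the model visible, instead of silently scoring them as 0.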

Let’s then substitute the guidelines into the system message, so that they are included by default.

system_message = system_message.replace(
    "{general_guidelines}",
    general_guidelines,
)
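
One likely reason `str.replace` is convenient here (this is a reading of the code, not something stated above) is that it fills in only the named placeholder and leaves the per-question placeholders untouched for later. `str.format`, by contrast, raises a `KeyError` for any placeholder that is not yet supplied. A small sketch with a hypothetical, shortened template:

```python
# Hypothetical, shortened template used only for this illustration.
template = "Guidelines:\n{general_guidelines}\n\nQuestion:\n{question}"

# replace() fills one placeholder and leaves the others in place.
partially_filled = template.replace(
    "{general_guidelines}", "M marks are for method."
)
print(partially_filled)

# format() needs every placeholder at once, so this raises KeyError.
try:
    template.format(general_guidelines="M marks are for method.")
    format_failed = False
except KeyError:
    format_failed = True
print(format_failed)
```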

🧪 Rerun Tests

with unify.Experiment("add_marking_guidelines", overwrite=True), unify.Params(
    system_message=system_message,
    dataset="TestSet10",
    source=unify.get_source(),
):
    unify.map(
        evaluate,
        [dict(**d.entries, _system_message=system_message) for d in test_set_10],
        name="Evals",
    )

The mean error has gone down from 0.3 to 0.2. Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and each LLM call has the template variables in the system message populated correctly.
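
Alongside inspecting the traces by eye, a simple check for leftover placeholders can catch template mistakes. This is a minimal sketch under stated assumptions: `unfilled_placeholders` is a hypothetical helper, and it assumes placeholders are lowercase names in curly braces, as in the system message above.

```python
import re

# Matches template variables such as {question} or {general_guidelines}.
PLACEHOLDER = re.compile(r"\{[a-z_]+\}")


def unfilled_placeholders(message: str) -> list[str]:
    """Return any template variables still present in a rendered message."""
    return PLACEHOLDER.findall(message)


rendered = "The question you need to mark is:\n\nSolve 3x = 12"
broken = "The markscheme for this specific question is:\n\n{markscheme}"

print(unfilled_placeholders(rendered))
print(unfilled_placeholders(broken))
```

Question or answer text that itself contains braces (for example LaTeX such as `\frac{a}{b}`) would trigger false positives, so treat any match as a prompt to look at the trace rather than as a definite failure.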

It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅ Let’s explore the remaining errors in the next iteration 🔁
