📄 PDFs
❓ Paper 1 -> Question 2
☑️ Paper 1 -> Question 2 Markscheme
❓ Parsed Question [2 Marks]
📝 Student's Answer
☑️ Parsed Markscheme
✅ Correct Marks [1/2] Rationale
🤖 Predicted Marks [x/2] Rationales (with added insights💡)
simple_agent [0/2] ❌
add_markscheme [0/2] ❌
add_marking_guidelines [0/2] ❌
add_structured_output [0/2] ❌
align_context [0/2] ❌
SC1 mark from the markscheme, irrespective of the various parameter changes we’ve made across each experiment run.
📄 PDFs
❓ Paper 1 -> Question 19
☑️ Paper 1 -> Question 19 Markscheme
❓ Parsed Question [5 Marks]
📝 Student's Answer
☑️ Parsed Markscheme
✅ Correct Marks [1/5] Rationale
🤖 Predicted Marks [x/5] Rationale (with added insights💡)
simple_agent [1/5] ✅
add_markscheme [0/5] ❌
add_marking_guidelines [1/5] ✅
add_structured_output [0/5] ❌
align_context [0/5] ❌
M1 for 10 × (2/5) = 4 litres red or for 10 × (3/5) = 6 litres white.
Even on the two occasions where it got things right, it feels like a lucky guess, as the awarded mark was not justified via the markscheme’s M1 mark.
Example 132 (c) is failing for a slightly different reason, and so we’ll consider this separately. In general, as we get deeper into the evaluation iterations, it’s often wise to consider multiple failure modes at once. Larger evals can be expensive to run, so you generally want to use all of the newly gained knowledge to improve your agent in the next evaluation run, even if this means making several unrelated changes to address several unrelated failure modes.
📄 PDFs
❓ Paper 2 -> Question 10 (c)
☑️ Paper 2 -> Question 10 (c) Markscheme
❓ Parsed Question [2 Marks]
📝 Student's Answer
☑️ Parsed Markscheme
✅ Correct Marks [1/2] Rationale
🤖 Predicted Marks [x/2] Rationale (with added insights💡)
simple_agent [2/2] ❌
add_markscheme [2/2] ❌
add_marking_guidelines [2/2] ❌
add_structured_output [2/2] ❌
align_context [2/2] ❌
mark_types dict, so that the agent has all the relevant information close at hand:
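As a minimal sketch of what such a dict might look like: the descriptions below are assumptions based on common GCSE marking conventions, not the exact wording used in this walkthrough, and the helpers `marks_in` and `legend` are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical descriptions of each mark type (assumed wording).
mark_types = {
    "M": "method mark, awarded for a correct method or partial method",
    "A": "accuracy mark, awarded after a correct method or process",
    "B": "unconditional accuracy mark, awarded independently of method",
    "SC": "special case mark, awarded for a specific listed response",
}

# Try longer prefixes first so that "SC1" is matched as one code.
_prefixes = "|".join(sorted(mark_types, key=len, reverse=True))
_code_pattern = re.compile(r"\b(?:" + _prefixes + r")\d+\b")


def marks_in(markscheme: str) -> list[str]:
    """Return the distinct mark codes (e.g. 'M1', 'SC1') in order of appearance."""
    seen: list[str] = []
    for code in _code_pattern.findall(markscheme):
        if code not in seen:
            seen.append(code)
    return seen


def legend(markscheme: str) -> str:
    """Build a short legend explaining only the mark codes this markscheme uses."""
    lines = []
    for code in marks_in(markscheme):
        prefix = code.rstrip("0123456789")
        lines.append(f"{code}: {mark_types[prefix]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(legend("M1 for 10 × (2/5) = 4 litres red or for 10 × (3/5) = 6 litres white"))
```

Keeping the legend limited to the codes that actually appear in a given markscheme keeps the prompt short while still telling the agent what each mark means.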
call_agent method such that the markscheme changes are dynamically applied before passing to the agent:
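One way this could work is sketched below. It assumes the agent is any callable that takes a prompt string and returns a string, and that the number in a mark code is the number of marks it is worth; `annotate`, the prompt layout, and the `call_agent` signature are all hypothetical, not the implementation used in this walkthrough.

```python
import re
from typing import Callable

# Hypothetical short names for each mark type (assumed wording).
MARK_TYPES = {
    "M": "method mark",
    "A": "accuracy mark",
    "B": "unconditional accuracy mark",
    "SC": "special case mark",
}
_CODE = re.compile(r"\b(SC|M|A|B)(\d+)\b")


def annotate(markscheme: str) -> str:
    """Expand each mark code inline, e.g. 'M1' -> 'M1 (method mark, worth 1 mark)'."""

    def expand(match: re.Match) -> str:
        kind, num = match.group(1), int(match.group(2))
        unit = "mark" if num == 1 else "marks"
        return f"{match.group(0)} ({MARK_TYPES[kind]}, worth {num} {unit})"

    return _CODE.sub(expand, markscheme)


def call_agent(
    agent: Callable[[str], str],
    question: str,
    markscheme: str,
    answer: str,
    available_marks: int,
) -> str:
    """Apply the markscheme annotations, then hand the full prompt to the agent."""
    prompt = (
        f"Question [{available_marks} marks]:\n{question}\n\n"
        f"Markscheme:\n{annotate(markscheme)}\n\n"
        f"Student's answer:\n{answer}"
    )
    return agent(prompt)
```

Because the annotation happens inside `call_agent`, the stored markscheme text stays untouched and the same dataset can be reused across experiment runs.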
general_guidelines variable, with an imaginary extra piece of guidance, to try and avoid the leniency we’ve observed in the marking of Example 132 (c).
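A minimal sketch of such an update follows. Both the baseline guideline and the extra anti-leniency guideline are invented for illustration, as is the `build_system_message` helper; the real guidance text is not shown here.

```python
# Hypothetical baseline guidance (assumed wording, not the original text).
BASE_GUIDELINES = [
    "Award marks strictly according to the markscheme provided.",
]

# Hypothetical extra guidance aimed at reducing lenient marking.
EXTRA_GUIDELINE = (
    "Only award a mark when the student's working or answer explicitly meets "
    "the condition stated for that mark; do not give the benefit of the doubt."
)

# One guideline per line, so the agent sees them as a short list.
general_guidelines = "\n".join(f"- {g}" for g in BASE_GUIDELINES + [EXTRA_GUIDELINE])


def build_system_message(guidelines: str) -> str:
    """Embed the guidelines in a system message for the marking agent."""
    return f"You are marking a maths exam.\n\nGeneral guidelines:\n{guidelines}"


if __name__ == "__main__":
    print(build_system_message(general_guidelines))
```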