0: Setup
The first step of the flywheel is to create our agent. Let’s add something simple to begin with, and we can add more complexity in future iterations 🔁
The full script for running this iteration can be found here.
🤖 Create Agent
We can start with a simple 0-shot LLM.
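A minimal sketch of such an agent is shown below. The model/provider string is illustrative, and the cache argument matches the behaviour described next; treat this as a sketch rather than the exact constructor used in the original script.

```python
import unify

# A simple 0-shot agent: a single LLM client, no tools, no few-shot examples.
# The model/provider string is illustrative. cache="read-only" replays responses
# from the local .cache.json file discussed below, so no real LLM calls are made.
agent = unify.Unify("gpt-4o@openai", cache="read-only")
```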
Let’s also download a .cache.json file which was previously generated whilst running through this demo. This avoids the need to make any real LLM calls, saving time and money, and it also ensures the walkthrough is fully deterministic.

If you’d rather go down your own unique iteration journey, then you should skip the cell below, and either remove cache="read-only" (turn off caching) or replace it with cache=True (create your own local cache) in the agent constructor above. However, this would mean many parts of the remaining walkthrough might not directly apply in your case, as the specific failure modes and the order in which they appear are likely to be different.
The agent needs to mark student answers to questions, awarding a score out of the maximum number of marks available. Let’s give it a sensible system message template to begin with:
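A sketch of what such a template might look like is below; the exact wording and the placeholder names ({question}, {markscheme}, {answer}, {available_marks}) are assumptions for illustration, not necessarily those used in the original walkthrough.

```python
# Illustrative system message template; the placeholder names are assumptions.
SYSTEM_MESSAGE_TEMPLATE = """\
You are marking a student's answer to an exam question.

Question:
{question}

Mark scheme:
{markscheme}

Student answer:
{answer}

Award an integer number of marks between 0 and {available_marks}, following the
mark scheme as closely as possible. Explain your reasoning, then give the final
mark on the last line in the format: MARKS: <number>
"""
```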
In our test set, the student answer is stored as a dictionary with keys for each sub-question. When no sub-questions are present, a dummy "_" key is used instead (to preserve the dict type for all answers). Rather than unpacking the dictionary directly into the system message, let’s instead add a simple “pretty print” function to format this more elegantly.
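A minimal sketch of such a helper, assuming each value in the answer dict is a plain string:

```python
def pretty_print_dict(d: dict) -> str:
    # No sub-questions: the dict holds a single dummy "_" key,
    # so just return the answer text directly.
    if set(d.keys()) == {"_"}:
        return d["_"]
    # Otherwise, put each sub-question answer on its own labelled line.
    return "\n".join(f"({key}) {value}" for key, value in d.items())
```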
Let’s now wrap our system message template in a simple function to populate the template variables with the specific data involved, before querying the agent for a response. We also make use of the pretty_print_dict function for the answer, as explained above.
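A sketch of such a wrapper is shown below. The function name, the argument names, and the agent.generate(...) call signature are assumptions; adapt them to however your client and data are actually set up.

```python
def call_agent(question: str, markscheme: str, answer: dict, available_marks: int) -> str:
    # Populate the template with this example's data,
    # pretty-printing the answer dict as described above.
    system_message = SYSTEM_MESSAGE_TEMPLATE.format(
        question=question,
        markscheme=markscheme,
        answer=pretty_print_dict(answer),
        available_marks=available_marks,
    )
    # Query the agent with the populated system message.
    return agent.generate(
        user_message="Please mark the student's answer.",
        system_message=system_message,
    )
```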
🗂️ Add Tests
Great, we now have our agent implemented. So, what are some good unit tests to begin with? Rather than using all 321 examples for our first iteration, let’s use the smallest subset of 10 examples, which we created in the previous section.
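A minimal sketch of loading that subset, assuming it was saved locally as test_set_10.json in the previous section (the filename is hypothetical):

```python
import json

# Hypothetical filename; assumes the 10-example subset from the previous
# section was saved locally as JSON.
with open("test_set_10.json", "r") as f:
    test_set_10 = json.load(f)
```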
🧪 Run Tests
Let’s add an evaluation function, and include all other arguments that we would like to log as part of the evaluation. All input arguments, intermediate variables, and return variables without a leading "_" in the name (that is, all “non-private” arguments, intermediate variables, and returns) will automatically be logged when the function is called, due to the inclusion of the unify.log decorator.
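A sketch of what such an evaluation function might look like is below. The field names, the simple absolute-error metric, and the way the mark is parsed out of the response are assumptions for illustration; the decorator usage follows the description above.

```python
@unify.log
def evaluate(
    question: str,
    student_answer: dict,
    available_marks: int,
    markscheme: str,
    correct_marks: int,
):
    # Query the agent and parse the integer mark from the final line of its reply.
    response = call_agent(question, markscheme, student_answer, available_marks)
    pred_marks = int(response.split("MARKS:")[-1].strip())
    # Error is the absolute difference between predicted and correct marks.
    # Everything without a leading "_" in the name is logged automatically.
    error = abs(pred_marks - correct_marks)
    return error
```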
We can then run our evaluation, with the logging included, like so:
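A sketch of the evaluation loop is below, using unify.Experiment and unify.Params as context managers in line with the explanation that follows; the example field names and the choice of parameters to hold constant are assumptions carried over from the earlier sketches.

```python
with unify.Experiment("simple_agent", overwrite=True), unify.Params(
    system_message_template=SYSTEM_MESSAGE_TEMPLATE,
    dataset="test_set_10",
):
    for example in test_set_10:
        evaluate(
            question=example["question"],
            student_answer=example["student_answer"],
            available_marks=example["available_marks"],
            markscheme=example["markscheme"],
            correct_marks=example["correct_marks"],
        )
```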
The unify.Experiment() term creates an "experiment" parameter in the context, with value "simple_agent" in this case. The overwrite=True argument will remove all prior logs with experiment parameter equal to "simple_agent". This is useful if you would like to re-run an experiment (clearing any previous runs). If you would like to accumulate data for a specific experiment, then the overwrite flag should not be set, and new logs will simply be added to the experiment without deleting any prior logs. The unify.Params() term sets the parameters which are held constant throughout the experiment.
By looking in our interface, we can see that we have some failures, with a mean error of 0.8 across the ten examples.
Let’s take a look at the traces, to ensure that the system message template has been implemented correctly, and that the template variables in the system message are populated properly for each LLM call.
It seems as though everything was implemented correctly, and the per-LLM system messages look good ✅
So, for the next iteration, we’ll need to dive in and understand why the agent is failing to make the correct prediction in some cases 🔁