Upload Datasets
In the previous section we built a usage dashboard, showing the traffic coming from 100 active users on the platform 📈
Let’s now create some datasets so we can keep tabs on our users and also have some ground truth data in order to optimize our LLM agent in the next section.
Users
So, let’s first create the “Users” dataset to store all of the user data in one place 🗂️
Upload
We have exported these details to users.json. We can easily upload this to the platform via the Python client.
Let’s add imports, activate the project, download the data, and read the data. We can then create a dataset from these entries.
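A minimal sketch of these steps might look like the following, assuming the Python client exposes an activate() helper and a Dataset constructor that accepts the raw entries plus a name (the project name and download URL are placeholders; see the full upload script linked below for the canonical version):

```python
import json

import requests
import unify

# Activate the project so the dataset is created in the right place
# (the project name here is just an illustrative placeholder).
unify.activate("marking-assistant")

# Download the exported user data (placeholder URL) and read it.
USERS_JSON_URL = "https://example.com/users.json"
with open("users.json", "wb") as f:
    f.write(requests.get(USERS_JSON_URL).content)
with open("users.json") as f:
    users = json.load(f)

# Create the dataset locally, ready to be pushed to the platform.
dataset = unify.Dataset(users, name="Users")
```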
It’s good practice to use .sync() when uploading, as this performs a bi-directional sync, uploading and downloading data so that both the local and upstream copies hold the superset. In this case the dataset did not exist upstream, so .sync() was equivalent to calling .upload().
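Continuing the sketch above, the sync step is a single call:

```python
# Bi-directional sync: uploads anything missing upstream and downloads
# anything missing locally, so both copies end up holding the superset.
dataset.sync()

# Since the "Users" dataset did not yet exist upstream, this was
# equivalent to a plain upload:
# dataset.upload()
```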
The full script for uploading this dataset (and the ones mentioned below) can be found here.
Analyze
Let’s now create a new Dataset tab in our interface, and set the context of the entire tab to Datasets, such that all tables will only have access to this context. The only dataset, Users, is then loaded into the table automatically.
Sort Surnames
We can sort the data alphabetically by surname if we want it to be better organized.
Search for Students
We can also search for any student in the search bar.
Test Set
Before we deploy an agent to mark the user answers in production, we need to be able to evaluate the agent’s performance. This is what a test set is for: pairing agent inputs with desired agent outputs.
Upload
Let’s assume an expert marker has provided a dataset of questions, candidate answers and the correct number of marks to award for each answer, alongside their rationale for awarding those marks.
In our case, this was synthetically generated by OpenAI’s o1 via this script, but for the sake of example, we can assume it was generated by expert human markers.
The data was then organized into a test set using this script, and the resultant data is saved in test_set.json.
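For illustration, each entry pairs the agent’s inputs (the question, its markscheme and a candidate answer) with the desired outputs (the marks to award and the expert’s rationale). The field names below are assumptions based on the checks performed later in this section, not the exact schema of test_set.json:

```python
example_entry = {
    "paper": "Paper 1",              # which exam paper the question comes from
    "question_num": 3,               # question number within the paper
    "text-only": True,               # image-based questions are excluded
    "question": "...",               # the question text shown to the agent
    "markscheme": "...",             # the parsed markscheme for the question
    "answer": "...",                 # a candidate answer to be marked
    "marks": 2,                      # ground truth marks awarded by the expert
    "rationale": "...",              # the expert's reasoning for this award
    "available_marks_total": 3,      # total marks available for the question
}
```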
Full Dataset
As before, let’s download the data, read the data, create a dataset, and upload it into the platform.
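Under the same assumptions as before (placeholder URL, assumed constructor and dataset name), a sketch of this flow:

```python
import json

import requests
import unify

# Download and read the test set (placeholder URL).
TEST_SET_URL = "https://example.com/test_set.json"
with open("test_set.json", "wb") as f:
    f.write(requests.get(TEST_SET_URL).content)
with open("test_set.json") as f:
    test_set = json.load(f)

# Create and upload the full test set.
full_test_set = unify.Dataset(test_set, name="TestSet")
full_test_set.upload()
```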
Sub-Datasets
Whilst we’re improving our LLM agent (next section), we won’t necessarily want to test against all 321 examples every time. This would be both very costly and very time consuming, and needlessly wasteful early on, when a handful of examples will suffice to point us in the right direction.
Therefore, let’s create some subsets of the full test dataset. Let’s start with 10 examples, then repeatedly double the size up to 160 (almost half of the full 321). The test set has already been shuffled, so we can simply take increasing slices starting from the beginning.
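Since the slices are simple prefixes, the sub-dataset creation can be sketched like this (the sub-dataset names are assumptions):

```python
# The test set has already been shuffled, so increasing prefixes give
# representative sub-datasets of increasing size.
for size in (10, 20, 40, 80, 160):
    sub_dataset = unify.Dataset(test_set[:size], name=f"TestSet{size}")
    sub_dataset.upload()
```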
As mentioned above, the full script for uploading all of these datasets into the platform can be found here.
Analyze
Let’s now analyze the test datasets, to verify everything is as it should be 🔍
Check Sub-Datasets
Let’s take a look at our different dataset slices.
Each sub-dataset contains the same logs as the main dataset, and as every other overlapping sub-dataset. This means any in-place updates to the logs will be reflected in all datasets.
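The shared-log behaviour is handled upstream by the platform, but a loose local analogy is that the Python slices we took above also reference the same underlying entry objects:

```python
full = test_set
subset = test_set[:10]

# The slice holds references to the same entry dicts as the full list,
# so an in-place update is visible through both views.
subset[0]["rationale"] = "Updated rationale"
assert full[0]["rationale"] == "Updated rationale"
```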
Verify All Data is Present
Let’s verify that all of the data is present, as expected.
Number of Papers
We should have a total of three papers, as per this original PDF, and as per the parsed representation we extracted here. We can confirm this by grouping by the papers.
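If you prefer to double-check this locally rather than in the table, a quick group count over the loaded test_set list works (assuming each entry records its paper under a "paper" key, as in the illustrative entry above):

```python
from collections import Counter

# Count how many test set entries belong to each paper.
papers = Counter(entry["paper"] for entry in test_set)
print(papers)
assert len(papers) == 3  # expecting exactly three papers
```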
Number of Questions
Each paper should have a set number of questions. Checking the original PDF, we can see that:
- Paper 1 -> 21 Questions
- Paper 2 -> 19 Questions
- Paper 3 -> 19 Questions
However, questions are also omitted from the test set if:
- they involve an image as part of the question (they are not text-only)
- the question was not correctly_parsed (example full parsed paper)
- the question markscheme was not correctly_parsed (example full parsed markscheme)
Let’s check how many questions were used from each of the three papers. We can simply group by the question number as well, and then unfold these nested groups.
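The equivalent nested grouping can be sketched locally like so (again assuming "paper" and "question_num" keys):

```python
from collections import defaultdict

# Collect the distinct question numbers present for each paper.
questions_per_paper = defaultdict(set)
for entry in test_set:
    questions_per_paper[entry["paper"]].add(entry["question_num"])

for paper, questions in sorted(questions_per_paper.items()):
    print(paper, len(questions))
```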
As expected, we can see that not all of the questions are replicated in the test set. Upon further inspection, we can see that the following questions are all omitted due to the inclusion of necessary images as part of the question ("text-only": false):
Number of Target Answers
Finally, each question should have N+1 candidate answers which are expected to receive [0, 1, ..., N] marks, where N is the total number of available marks. Let’s verify this is correct by adding the available_marks_total column, and verifying that the group sizes are all equal to available_marks_total + 1.
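A local sketch of the same check, under the field name assumptions used throughout this section:

```python
from collections import defaultdict

# Group the candidate answers by (paper, question).
answers_per_question = defaultdict(list)
for entry in test_set:
    answers_per_question[(entry["paper"], entry["question_num"])].append(entry)

for (paper, question), entries in answers_per_question.items():
    total = entries[0]["available_marks_total"]
    # Each question should have one candidate answer per possible mark,
    # i.e. N + 1 answers awarded 0, 1, ..., N marks respectively.
    assert len(entries) == total + 1
    assert sorted(e["marks"] for e in entries) == list(range(total + 1))
```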
It seems as though the data is all formatted as we expected it to be, and the 321 ground truth examples do indeed exhaustively fill all possible marks for all the valid (text-only) questions across all papers ✅
That’s it, we’re now ready to implement our LLM agent, and start iterating to improve the performance! 🔁