Upload Datasets
Users
In the previous section we built a usage dashboard, showing the traffic coming from 100 active users on the platform 📈
Let’s now create a “Users” dataset to store all of the user data in one place 🗂️
We have exported these details to users.json. We can easily upload this to the platform via the Python client.
Let’s add imports, activate the project, download the data, and read the data.
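A rough sketch of these steps might look as follows; the project name and download URL are placeholders, and the exact client calls may differ slightly from what is shown here:

```python
import json

import requests  # used here just to fetch the raw file; any download method works
import unify  # the platform's Python client

# Activate the project from the previous section (placeholder project name)
unify.activate("marking-assistant")

# Download users.json (illustrative URL, not the real location of the file)
response = requests.get("https://example.com/users.json")
response.raise_for_status()
with open("users.json", "wb") as f:
    f.write(response.content)

# Read the data into memory as a list of user entries
with open("users.json") as f:
    users = json.load(f)
```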
We can then create a dataset like so:
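The snippet below is only a sketch: the Dataset constructor arguments and the dataset name are assumptions, so the exact call may differ.

```python
# Wrap the user entries in a dataset (constructor arguments are assumed here)
dataset = unify.Dataset(users, name="Users")

# Sync with the platform (equivalent to an upload here, as explained below)
dataset.sync()
```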
It’s good practice to use .sync() when uploading, as this performs a bi-directional sync, uploading and downloading data so that both the local and upstream copies hold the superset of the two. In this case the dataset did not exist upstream, and so .sync() was equivalent to calling .upload().
Let’s now create a new Dataset Tab in our interface, and set the context of the entire tab to Datasets, such that all tables will only have access to this context. The only dataset, Users, is then loaded into the table automatically.
We can sort the data alphabetically to give it a more structured view.
We can also search for any student in the search bar.
Test Set
Before we deploy an agent to mark the user answers in production, we need to be able to evaluate the agent’s performance. This is what a test set is for: matching agent inputs with desired agent outputs.
Let’s assume we have a dataset of questions, answers, and the correct number of marks to award for each answer, provided by an expert marker alongside their rationale for awarding that number of marks.
In our case, this was synthetically generated by OpenAI’s o1 via this script, but for the sake of example, we can assume it was generated by expert human markers.
The data was then organized into a test set using this script, and the resultant data is saved in test_set.json.
As before, let’s download the data, read it, create a dataset, and upload it to the platform.
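A sketch mirroring the Users upload above, with a placeholder URL for test_set.json and an assumed dataset name:

```python
import json

import requests
import unify

# Download test_set.json (illustrative URL)
response = requests.get("https://example.com/test_set.json")
response.raise_for_status()
with open("test_set.json", "wb") as f:
    f.write(response.content)

# Read the test set into memory
with open("test_set.json") as f:
    test_set = json.load(f)

# Create the dataset and sync (i.e. upload) it
unify.Dataset(test_set, name="TestSet").sync()
```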
Let’s see if the test set has been created correctly.
Whilst we’re improving our LLM agent (next section), we won’t necessarily want to test against all 321 examples every time. This would be both very costly and very time consuming, and needlessly wasteful early on, when a handful of examples will suffice to point us in the right direction.
Therefore, let’s create some subsets of the full test dataset. Let’s start with 10 examples, and then repeatedly double up to 160 (almost half the full size). The test set has already been shuffled, so we can simply take increasing slices starting from the beginning.
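A sketch of the slicing, assuming test_set is the list read from test_set.json above; the sub-dataset naming convention is just one possible choice:

```python
# Take increasing slices from the front of the (already shuffled) test set
for size in (10, 20, 40, 80, 160):
    unify.Dataset(test_set[:size], name=f"TestSet{size}").sync()
```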
Let’s take a look at our different dataset slices.
Each sub-dataset contains the same logs as the main dataset, and as every other overlapping sub-dataset. For example, any in-place updates to the logs will be reflected in all datasets.
That’s it, we’re now ready to implement our LLM agent, and start iterating to improve the performance! 🔁