Datasets - Unify Documentation

The unify.Dataset class is the best way to manage and create datasets. Datasets support indexing as well as value-based addition and removal.

import unify

my_dataset = unify.Dataset([0, 1, 2], name="my_dataset")
my_dataset[0] = -1
my_dataset += [3, 4, 5]
my_dataset -= [1, 2]
print(my_dataset)

unify.Dataset([-1, 3, 4, 5], name="my_dataset")

Under the hood, datasets make use of the same logging primitives as unify.log. The only differences is that datasets each have their own context, which is automatically managed by the unify.Dataset class.

Uploading

You can upload any dataset to your interface like so.

my_dataset.upload()

The unify.Dataset class automatically sets the context as f"Datasets/{name}", so your dataset can be viewed at Datasets/my_dataset.

Expand

click image to maximize

Setting overwrite=True will overwrite any existing dataset with the same name (even if the upstream dataset contained more data which is not included in the uploaded data).

my_dataset.upload(overwrite=True)

Downloading

If your dataset already exists upstream, you can download it like so.

my_dataset = unify.Dataset.download("my_dataset")

Setting overwrite=True will overwrite your local dataset (even if your local dataset contained more data than is present in the download).

my_dataset.download(overwrite=True)

Syncing

Finally, if you want to sync your local dataset with the upstream version, to achieve the superset of the upstream and local data, you can do so like so.

my_dataset.sync()

This first uploads your local dataset with overwrite=False to create the superset upstream, and then downloads with overwrite=True to get this superset dataset downloaded locally, with every entry now including the unique log ID.

Duplicates

By default, the unify.Dataset class will not allow duplicate values.

unify.Dataset([0, 1, 2, 2]) # fails

If you would like to allow duplicates, you can override the default behavior by setting the allow_duplicates flag when creating the dataset.

unify.Dataset([0, 1, 2, 2], allow_duplicates=True)

When allow_duplicates is set to False, then all upstream logs with identical values to local (id-less) logs will be assumed to represent the same log, and the unset log ids of these local logs will be updated to match the upstream ids with the matching values. If allow_duplicates is set to True, then any upstream logs with identical values to local logs will assume to represent different logs unless the log ids match exactly. If duplicates are not explicitly required for a dataset, then it’s best to use the default behaviour, and leave allow_duplicates set to False. Even if duplicates are needed, adding an extra example_id column with allow_duplicates kept as False can be worthwhile to avoid accidental duplication, especially if you’re regularly syncing datasets between local and upstream sources.

import unify
import random

my_dataset = unify.Dataset(
    list(range(100)), name="my_dataset"
).sync()
my_sub_dataset = unify.Dataset(
    random.sample(my_dataset, 10), name="my_sub_dataset"
).sync()

Expand

click image to maximize

Editing logs (rows) in one dataset will be reflected in the other.

for log in my_dataset:
    val = log.entries["data"]
    log.update_entries(data=val*100)

Expand

click image to maximize

Logs can also be removed from one dataset table, without removing it from the other. However, if the log is removed from both datasets, then the log itself will be deleted entirely.

Expand

click image to maximize

​Uploading

​Downloading

​Syncing

​Duplicates

Uploading

Downloading

Syncing

Duplicates