Splits and configurations
Machine learning datasets are commonly organized into splits, and they may also have configurations. These internal structures provide the scaffolding for building out a dataset and determine how it is divided and organized. Understanding a dataset’s structure can help you create your own dataset and know which subset of data to use at each stage of model training and evaluation.
Splits
Every processed and cleaned dataset contains splits, specific subsets of data reserved for specific needs. The most common splits are:
train: data used to train a model; this data is exposed to the model
validation: data reserved for evaluation and improving model hyperparameters; this data is hidden from the model
test: data reserved for evaluation only; this data is completely hidden from the model and ourselves
The validation and test sets are especially important to ensure a model is actually learning instead of overfitting, or just memorizing the data.
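As a rough illustration of how the three splits relate, the sketch below partitions a list of examples using only the Python standard library. The function name make_splits, the 80/10/10 proportions, and the fixed seed are assumptions chosen for this example; published datasets usually ship with splits already defined by their authors.

```python
import random


def make_splits(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle examples and partition them into train/validation/test splits."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("validation and test fractions must sum to less than 1")

    # Copy before shuffling so the caller's data is left untouched.
    shuffled = list(examples)
    # A seeded generator makes the partition reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    return {
        "test": shuffled[:n_test],
        "validation": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }


splits = make_splits(range(100))
for name, subset in splits.items():
    print(name, len(subset))
```

Because every example lands in exactly one split, nothing the model sees during training can leak into the validation or test sets.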
Configurations
A configuration is a higher-level internal structure than a split, and a configuration contains splits. You can think of a configuration as a sub-dataset contained within a larger dataset. It is a useful structure for adding further layers of organization to a dataset. For example, if you take a look at the Multilingual LibriSpeech (MLS) dataset, you’ll notice there are eight different languages. While you can create a dataset containing all eight languages, it’s probably neater to create a dataset with each language as a configuration. This way, users can immediately load the data for their language of interest instead of preprocessing the whole dataset to filter for a specific language.
Configurations are flexible, and can be used to organize a dataset along whatever objective you’d like. For example, the SceneParse150 dataset uses configurations to organize the dataset by task. One configuration is dedicated to segmenting the whole image, while the other configuration is for instance segmentation.
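One way to picture the relationship is as a nested mapping from configuration to split to examples. The sketch below is a plain-Python toy, not the API of any dataset library: TOY_DATASET, load_split, and the example records are all hypothetical names and data made up for illustration.

```python
# Toy stand-in for a multilingual dataset: configuration -> split -> examples.
# The configuration names and records below are invented for illustration.
TOY_DATASET = {
    "german": {
        "train": ["guten morgen", "danke"],
        "test": ["auf wiedersehen"],
    },
    "dutch": {
        "train": ["goedemorgen", "dank je"],
        "test": ["tot ziens"],
    },
}


def load_split(dataset, config, split):
    """Return the examples for one split of one configuration."""
    if config not in dataset:
        raise KeyError(
            f"unknown configuration {config!r}; choose from {sorted(dataset)}"
        )
    splits = dataset[config]
    if split not in splits:
        raise KeyError(f"unknown split {split!r}; choose from {sorted(splits)}")
    return splits[split]


# Selecting a configuration first means no filtering by language is needed.
german_train = load_split(TOY_DATASET, "german", "train")
print(german_train)
```

Each configuration carries its own set of splits, so a user who asks for one configuration never has to touch the data belonging to the others.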