Overview

Mirdata is a library that aims to standardize how audio datasets are accessed in Python, removing the need to write custom loaders in every project and improving reproducibility. Working with datasets usually involves the cumbersome step of downloading data and writing functions that load related files (for example, audio and annotations) into a standard format for experiments or evaluation. Mirdata does all of this for you:

import mirdata

print(mirdata.list_datasets())

tinysol = mirdata.initialize('tinysol')
tinysol.download()

# get annotations and audio for a random track
example_track = tinysol.choice_track()
instrument = example_track.instrument_full
pitch = example_track.pitch
y, sr = example_track.audio

Mirdata loaders contain methods to:

download(): download (or give instructions to download) a dataset
load_*(): load a dataset’s files (audio, metadata, annotations, etc.) into standard formats that are compatible with mir_eval, so you don’t have to write the loaders yourself
validate(): validate that a dataset is complete and correct
cite(): quickly print a dataset’s relevant citation
access track and multitrack objects for grouping multiple annotations for a particular track/multitrack
and more
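
Conceptually, validating a dataset means comparing the files on disk with the checksums recorded for it. The following is a minimal standard-library sketch of that idea, not mirdata's implementation: md5_of and check_files are hypothetical helpers, and the expected mapping of relative path to md5 is an assumed, simplified format.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_files(data_home, expected):
    """Compare files under data_home with an expected {relative_path: md5} mapping.

    Returns (missing, mismatched): two lists of relative paths.
    """
    missing, mismatched = [], []
    for relative_path, checksum in expected.items():
        full_path = Path(data_home) / relative_path
        if not full_path.is_file():
            missing.append(relative_path)
        elif md5_of(full_path) != checksum:
            mismatched.append(relative_path)
    return missing, mismatched


# Demo on a throwaway directory: one intact file, one altered file, one absent file.
with tempfile.TemporaryDirectory() as data_home:
    (Path(data_home) / "a.txt").write_bytes(b"hello")
    (Path(data_home) / "b.txt").write_bytes(b"changed")
    expected = {
        "a.txt": hashlib.md5(b"hello").hexdigest(),
        "b.txt": hashlib.md5(b"original").hexdigest(),
        "c.txt": hashlib.md5(b"never written").hexdigest(),
    }
    missing, mismatched = check_files(data_home, expected)

print("missing:", missing)
print("mismatched:", mismatched)
```

With a real dataset you would call validate() on the initialized loader rather than writing this yourself.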

See the Tutorial for a detailed explanation of how to get started using this library.

mirdata design principles

Ease of use and contribution

We designed Mirdata to be easy to use and easy to contribute to. Mirdata simplifies the research pipeline considerably, facilitating research in a wider diversity of tasks and musical datasets. We provide detailed examples of how to interact with the library in the Tutorial, as well as a detailed explanation of how to contribute in Contributing to Mirdata. Additionally, we have a repository of Jupyter notebooks with usage examples of the different datasets.

Reproducibility

We aim for Mirdata to increase research reproducibility by providing a common framework for MIR researchers to compare and validate their data. If mistakes are found in annotations, or if audio versions change, the community can fix them through Mirdata while still being able to compare methods going forward.

canonical versions

The dataset loaders in Mirdata are written for what we call the canonical version of a dataset. Whenever possible, this should be the official release of the dataset as published by the dataset creator(s). When this is not possible (e.g. for data that is no longer available), we find as many copies of the data as possible from different researchers (at least 4) and use the most common one. To make this process transparent, when there are doubts about data consistency we open an issue and leave it to the community to discuss what to use.
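
The "most common copy" rule can be illustrated with a short sketch. It assumes that copies are compared by the md5 checksum of their contents, which is an assumption made for illustration; most_common_copy and the source labels are hypothetical.

```python
import hashlib
from collections import Counter


def most_common_copy(copies, minimum_copies=4):
    """Pick the checksum shared by the largest number of copies.

    copies maps a source label to the raw bytes of that source's copy.
    Returns (checksum, sorted list of sources whose copy has that checksum).
    """
    if len(copies) < minimum_copies:
        raise ValueError(
            f"need at least {minimum_copies} copies, got {len(copies)}"
        )
    checksums = {
        source: hashlib.md5(data).hexdigest() for source, data in copies.items()
    }
    winner, _count = Counter(checksums.values()).most_common(1)[0]
    agreeing = sorted(s for s, c in checksums.items() if c == winner)
    return winner, agreeing


# Four hypothetical copies of the same file; one differs from the rest.
copies = {
    "lab_a": b"version 1",
    "lab_b": b"version 1",
    "lab_c": b"version 2",
    "lab_d": b"version 1",
}
winner, agreeing = most_common_copy(copies)
print(winner, agreeing)
```

A tie between checksums is not resolved meaningfully by this sketch; that is the kind of case the text above leaves to community discussion.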

Standardization

Different datasets have different annotations, metadata, etc. We try to respect the idiosyncrasies of each dataset as much as we can. For this reason, tracks in each Dataset in Mirdata have different attributes, e.g. some may have artist information and some may not. However, some elements are common to most datasets, and in these cases we standardize them to increase the usability of the library. Examples include the annotation classes in Mirdata, e.g. BeatData.

indexes

Indexes in mirdata are manifests of the files in a dataset and their corresponding md5 checksums. Specifically, an index is a JSON file with the mandatory top-level key version and at least one of the optional top-level keys metadata, tracks, multitracks, or records. An index might look like:

Example Index

{
    "version": "1.0.0",
    "metadata": {
        "metadata_file_1": [
            // the relative path for metadata_file_1
            "path_to_metadata/metadata_file_1.csv",
            // metadata_file_1 md5 checksum
            "bb8b0ca866fc2423edde01325d6e34f7"
        ],
        "metadata_file_2": [
            // the relative path for metadata_file_2
            "path_to_metadata/metadata_file_2.csv",
            // metadata_file_2 md5 checksum
            "6cce186ce77a06541cdb9f0a671afb46"
        ]
    },
    "tracks": {
        "track1": {
            "audio": ["audio_files/track1.wav", "6c77777ce77a06541cdb9f0a671afb46"],
            "beats": ["annotations/track1.beats.csv", "ab8b0ca866fc2423edde01325d6e34f7"],
            "sections": ["annotations/track1.sections.txt", "05abeca866fc2423edde01325d6e34f7"]
        },
        "track2": {
            "audio": ["audio_files/track2.wav", "6c77777ce77a06542cdb9f0a672afb46"],
            "beats": ["annotations/track2.beats.csv", "ab8b0ca866fc2423edde02325d6e34f7"],
            "sections": ["annotations/track2.sections.txt", "05abeca866fc2423edde02325d6e34f7"]
        },
        ...
    }
}
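
An index like the one above could be generated with a short script. The following is a minimal sketch rather than mirdata's own tooling: build_index and md5_of are hypothetical helpers, only the audio key is filled in, and the audio_files/*.wav layout is assumed from the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the hex md5 digest of a (small) file."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def build_index(data_home, version="1.0.0"):
    """Build {"version": ..., "tracks": {track_id: {"audio": [path, md5]}}}.

    Track ids are taken from the audio file names without their extension.
    """
    data_home = Path(data_home)
    tracks = {}
    for audio_path in sorted(data_home.glob("audio_files/*.wav")):
        relative = audio_path.relative_to(data_home).as_posix()
        tracks[audio_path.stem] = {"audio": [relative, md5_of(audio_path)]}
    return {"version": version, "tracks": tracks}


# Demo with placeholder bytes standing in for real audio files.
with tempfile.TemporaryDirectory() as data_home:
    audio_dir = Path(data_home) / "audio_files"
    audio_dir.mkdir()
    (audio_dir / "track1.wav").write_bytes(b"fake audio 1")
    (audio_dir / "track2.wav").write_bytes(b"fake audio 2")
    index = build_index(data_home)

print(json.dumps(index, indent=4))
```

Paths are stored relative to the dataset root so that the index stays valid wherever the data is downloaded.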

The optional top-level keys (tracks, multitracks and records) relate to different organizations of music datasets. tracks is used when a dataset is organized as a collection of individual tracks, namely mono or multi-channel audio, or spectrograms only, together with their respective annotations. multitracks is used when a dataset consists of multitracks, i.e. groups of tracks that are directly related to each other. Finally, records is used when a dataset consists of groups of tables (e.g. relational databases), as many recommendation datasets do.
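
The top-level key rules for an index (a mandatory version plus at least one of metadata, tracks, multitracks, or records) can be checked mechanically. The sketch below is an illustration only; index_problems is a hypothetical helper and not part of mirdata.

```python
OPTIONAL_KEYS = ("metadata", "tracks", "multitracks", "records")


def index_problems(index):
    """Return a list of human-readable problems with an index's top-level keys."""
    problems = []
    if "version" not in index:
        problems.append("missing mandatory key: version")
    if not any(key in index for key in OPTIONAL_KEYS):
        problems.append("needs at least one of: " + ", ".join(OPTIONAL_KEYS))
    return problems


good_index = {"version": "1.0.0", "tracks": {}}
bad_index = {"tracks": {}}

print(index_problems(good_index))
print(index_problems(bad_index))
```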

See "1. Create an index" in the contributing docs for more information about mirdata indexes.

annotations

mirdata provides Annotation classes of various kinds, which give a standard interface to different annotation formats. These classes are compatible with the mir_eval library’s expected format. Support for other formats can be added on request.
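
To give a feel for what a standard annotation interface buys you, here is a small sketch of an annotation container that validates its contents once, at construction. BeatAnnotation is a hypothetical stand-in written for this example; it is not mirdata's BeatData, and its fields and checks are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BeatAnnotation:
    """Hypothetical beat annotation container (not mirdata's BeatData)."""

    times: List[float]                     # beat times in seconds
    positions: Optional[List[int]] = None  # beat position within the bar, if known

    def __post_init__(self):
        if any(t < 0 for t in self.times):
            raise ValueError("beat times must be non-negative")
        pairs = zip(self.times, self.times[1:])
        if any(later <= earlier for earlier, later in pairs):
            raise ValueError("beat times must be strictly increasing")
        if self.positions is not None and len(self.positions) != len(self.times):
            raise ValueError("positions and times must have the same length")


beats = BeatAnnotation(times=[0.5, 1.0, 1.5, 2.0], positions=[1, 2, 3, 4])
print(beats.times, beats.positions)
```

Because every loader returns the same kind of object, downstream code can rely on one set of attributes instead of handling each dataset's file format separately.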

metadata

When available, we provide extensive and easy-to-access metadata to facilitate metadata-specific analysis of tracks. Metadata is available as attributes at the track level, e.g. track.artist.
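
Since not every dataset provides the same metadata, code that runs across several datasets may want to read attributes defensively. The sketch below uses plain Python objects as stand-ins for tracks; the objects and the artist_or_unknown helper are hypothetical and only illustrate attribute access with a default.

```python
from types import SimpleNamespace

# Stand-ins for track objects from two datasets, one of which has no artist field.
track_with_artist = SimpleNamespace(track_id="t1", artist="Some Artist")
track_without_artist = SimpleNamespace(track_id="t2")


def artist_or_unknown(track):
    """Return track.artist when present and non-empty, else 'unknown'."""
    return getattr(track, "artist", None) or "unknown"


print(artist_or_unknown(track_with_artist))
print(artist_or_unknown(track_without_artist))
```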