Tutorial
This tutorial covers how to use mirdata, a Python library designed to make it easy to load and work with common music information retrieval (MIR) datasets.
In this tutorial, we will cover:
Installing mirdata
Initializing a dataset
Downloading a dataset
Validating a dataset
Loading tracks
Accessing annotations and metadata
Advanced options for download and tracks
Usage examples: mirdata in your pipeline, with TensorFlow, with PyTorch, and in Google Colab
Quickstart
First, install mirdata. We recommend doing this inside a conda or virtual environment for reproducibility.
pip install mirdata
Then, get your data by simply doing:
# Basic Usage Example
import mirdata

# 1. List all available datasets
print(mirdata.list_datasets())

# 2. Initialize a dataset loader
dataset = mirdata.initialize("orchset", data_home='/choose/where/data/live')

# 3. Download the dataset
dataset.download()

# 4. Validate the dataset
dataset.validate()

# 5. Load tracks
random_track = dataset.choice_track()

# 6. Access metadata and annotations
print(random_track)
Below, we elaborate on each step a bit more:
Initializing a dataset
To use a loader (for example, orchset), you need to initialize it by calling:
dataset = mirdata.initialize('orchset', data_home='/choose/where/data/live')
This will create a dataset loader object that you can use to access the dataset’s tracks, metadata, and annotations.
You can specify the directory where the data is stored by passing a path to data_home.
Dataset versions
Mirdata supports working with multiple dataset versions.
To see all available versions of a specific dataset, run mirdata.list_dataset_versions('orchset').
Use the version parameter if you wish to use a version other than the default one, as shown in the example below.
# To see all available versions of a specific dataset:
mirdata.list_dataset_versions('orchset')
# Use the 'version' parameter if you wish to use a version other than the default one.
dataset = mirdata.initialize('orchset', data_home='/choose/where/data/live', version="1.0")
Downloading a dataset
To download the dataset, you can use the download() method of the dataset loader object:
dataset.download() # Dataset is downloaded to ~/mir_datasets/orchset
By default, the dataset will be downloaded to the mir_datasets folder in your home directory.
Note
For downloading in a custom folder, partial downloads, and other advanced options, see the Advanced download options section below.
Validating a dataset
To ensure that the dataset files are correctly downloaded and not corrupted, you can use the validate() method of the dataset loader object:
dataset.validate()
This method checks the integrity of the dataset files and reports any files that are missing or corrupted.
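Conceptually, validation compares each downloaded file against a reference checksum recorded in the dataset index. The following is a minimal sketch of that idea using only the standard library; the helper names (md5_checksum, validate_files) are illustrative and not part of mirdata's API:

```python
import hashlib
import os
import tempfile

def md5_checksum(path: str) -> str:
    """Compute the MD5 checksum of a file, reading in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()

def validate_files(expected: dict) -> list:
    """Return the paths that are missing or whose checksum does not match."""
    bad = []
    for path, checksum in expected.items():
        if not os.path.exists(path) or md5_checksum(path) != checksum:
            bad.append(path)
    return bad

# Demo: write a file and validate it against its own checksum
with tempfile.TemporaryDirectory() as tmp:
    audio = os.path.join(tmp, "track.wav")
    with open(audio, "wb") as fh:
        fh.write(b"fake audio bytes")
    reference = {audio: md5_checksum(audio)}
    print(validate_files(reference))  # []
```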
Loading a random track
We can choose a random track from a dataset with the choice_track() method:
random_track = dataset.choice_track()
This returns a random track from the dataset, which can be useful for testing or exploration purposes.
Note
For loading all tracks, loading a single track, or loading tracks with specific IDs, see the Advanced track options section below.
Annotations and metadata
After choosing a track, we can access its metadata and annotations. To print the metadata and annotations associated with the track, you can simply print the track object:
# For this example, we will use the random_track from above.
print(random_track)
This will print the metadata and annotations associated with the track, such as composer, work, excerpt, and paths to audio files.
# Example output
>>> Track(
alternating_melody=True,
audio_path_mono="user/mir_datasets/orchset/audio/mono/Beethoven-S3-I-ex1.wav",
audio_path_stereo="user/mir_datasets/orchset/audio/stereo/Beethoven-S3-I-ex1.wav",
composer="Beethoven",
contains_brass=False,
contains_strings=True,
contains_winds=True,
excerpt="1",
melody_path="user/mir_datasets/orchset/GT/Beethoven-S3-I-ex1.mel",
only_brass=False,
only_strings=False,
only_winds=False,
predominant_melodic_instruments=['strings', 'winds'],
track_id="Beethoven-S3-I-ex1",
work="S3-I",
audio_mono: (np.ndarray, float),
audio_stereo: (np.ndarray, float),
melody: F0Data,
)
Annotation classes
Mirdata defines annotation-specific data classes. These data classes are meant to standardize the format for all loaders, and are compatible with mir_eval. The list and descriptions of available annotation classes can be found in Annotations.
Note: these classes may be extended when a loader requires it.
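To illustrate the idea, a frame-level F0 annotation class essentially bundles parallel arrays of times and frequencies and validates them on creation. The class below is a simplified, hypothetical stand-in, not mirdata's actual F0Data implementation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SimpleF0Data:
    """Simplified sketch of a frame-level F0 annotation class."""
    times: List[float]        # timestamps in seconds
    frequencies: List[float]  # F0 values in Hz (0 = unvoiced)

    def __post_init__(self):
        # A real annotation class validates its inputs on creation
        if len(self.times) != len(self.frequencies):
            raise ValueError("times and frequencies must have equal length")

melody = SimpleF0Data(times=[0.00, 0.01, 0.02], frequencies=[220.0, 0.0, 440.0])
print(len(melody.times))  # 3
```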
Advanced download options
This section covers advanced dataset download configurations and options available in mirdata:
Downloading the dataset to a custom folder
Partially downloading a dataset
Downloading the dataset index only
Accessing data on non-local filesystems
Downloading the dataset to a custom folder
dataset = mirdata.initialize('orchset', data_home='/Users/leslieknope/Desktop/orchset123')
dataset.download() # Dataset is downloaded to the folder "orchset123" on Leslie Knope's desktop
Since data_home is specified, orchset will be read from and written to this custom location.
Partially downloading a dataset
The download() function allows partial downloads of a dataset. In other words, if applicable, the user can
select which elements of the dataset they want to download. Each dataset has a REMOTES dictionary where all
the available elements are listed.
# Elements should be specified as a list of keys in the REMOTES dictionary.
dataset.download(partial_download=['element_A', 'element_B', 'element_C'])
Partial downloads example
cante100 has different elements as seen in the REMOTES dictionary. Thus, we can specify which of these elements are
downloaded, by passing to the download() function the list of keys in REMOTES that we are interested in. This
list is passed to the download() function through the partial_download variable.
REMOTES = {
"spectrogram": download_utils.RemoteFileMetadata(
filename="cante100_spectrum.zip",
url="https://zenodo.org/record/1322542/files/cante100_spectrum.zip?download=1",
checksum="0b81fe0fd7ab2c1adc1ad789edb12981", # the md5 checksum
destination_dir="cante100_spectrum", # relative path for where to unzip the data, or None
),
"melody": download_utils.RemoteFileMetadata(
filename="cante100midi_f0.zip",
url="https://zenodo.org/record/1322542/files/cante100midi_f0.zip?download=1",
checksum="cce543b5125eda5a984347b55fdcd5e8", # the md5 checksum
destination_dir="cante100midi_f0", # relative path for where to unzip the data, or None
),
"notes": download_utils.RemoteFileMetadata(
filename="cante100_automaticTranscription.zip",
url="https://zenodo.org/record/1322542/files/cante100_automaticTranscription.zip?download=1",
checksum="47fea64c744f9fe678ae5642a8f0ee8e", # the md5 checksum
destination_dir="cante100_automaticTranscription", # relative path for where to unzip the data, or None
),
"metadata": download_utils.RemoteFileMetadata(
filename="cante100Meta.xml",
url="https://zenodo.org/record/1322542/files/cante100Meta.xml?download=1",
checksum="6cce186ce77a06541cdb9f0a671afb46", # the md5 checksum
),
"README": download_utils.RemoteFileMetadata(
filename="cante100_README.txt",
url="https://zenodo.org/record/1322542/files/cante100_README.txt?download=1",
checksum="184209b7e7d816fa603f0c7f481c0aae", # the md5 checksum
),
}
A partial download example for cante100 dataset could be:
dataset = mirdata.initialize('cante100', data_home='/choose/where/data/live')
dataset.download(partial_download=['spectrogram', 'melody', 'metadata'])
Note
Not all datasets support partial downloads. To check whether a dataset does, inspect its REMOTES
dictionary: partial downloads are possible when it defines more than one downloadable element.
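The selection logic amounts to requiring that the keys passed to partial_download form a subset of the keys in REMOTES. Here is a sketch of that check; the function below is illustrative, not mirdata's implementation:

```python
def check_partial_download(remotes: dict, requested: list) -> list:
    """Return the requested keys, raising if any are not defined in REMOTES."""
    unknown = [key for key in requested if key not in remotes]
    if unknown:
        raise ValueError(f"Unknown REMOTES keys: {unknown}")
    return requested

# Keys mirroring the cante100 REMOTES example above (values omitted)
remotes = {"spectrogram": None, "melody": None, "notes": None, "metadata": None}
print(check_partial_download(remotes, ["spectrogram", "melody"]))
```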
Downloading dataset index only
All dataset loaders in Mirdata have a download() function that downloads:
The canonical version of the dataset (when available)
The dataset index, which indicates the list of clips and paths to audio and annotation files
The index is downloaded by running download(["index"]) and is stored in Mirdata’s indexes folder (mirdata/datasets/indexes).
# Download the dataset index
dataset.download(["index"])
# Check the path to the downloaded index
print(dataset.index_path)
Accessing data on non-local filesystems
mirdata uses the smart_open library, which supports non-local filesystems such as Google Cloud Storage (GCS) and Amazon S3 (AWS).
If your data lives on GCS, for example, simply set the data_home variable accordingly
when initializing a dataset:
dataset = mirdata.initialize("orchset", data_home="gs://my-bucket/my-subfolder/orchset")
# everything should work the same as if the data were local
dataset.validate()
Note that the data on the remote filesystem must have a folder structure identical to what dataset.download() produces,
and we do not support downloading (i.e. writing) to remote filesystems, only reading from them. To prepare a new dataset for use with mirdata,
we recommend running dataset.download() on a local filesystem, and then manually transferring the folder contents to the remote
filesystem.
mp3 data
For a variety of reasons, mirdata doesn’t support remote reading of mp3 files, so some datasets with mp3 audio may have tracks with unavailable attributes.
Advanced track options
This section covers advanced options for working with tracks in datasets. These methods provide flexible ways to access and manipulate track data based on your specific research needs:
Loading all tracks
Loading tracks with track ID
Loading tracks
# Initialize the dataset
dataset = mirdata.initialize("orchset")

# Load all tracks in the dataset as a dictionary with the track_ids as keys and track objects as values.
tracks = dataset.load_tracks()

# Iterate over the tracks
for key, track in tracks.items():
    print(key, track.audio_path)
To load tracks from a dataset, you can use the load_tracks() method. This method returns a dictionary where the keys are track IDs and the values are track objects.
tracks = dataset.load_tracks()
This will load all tracks in the dataset, allowing you to access their audio and annotations.
Next, you can iterate over the tracks dictionary to access each track’s audio path and other attributes:
for key, track in tracks.items():
print(key, track.audio_path)
Loading tracks with track ID
# Initialize the dataset
dataset = mirdata.initialize("orchset")

# Get the list of track IDs
track_ids = dataset.track_ids

# Loop over the track_ids list to directly access each track in the dataset
for track_id in dataset.track_ids:
    print(track_id, dataset.track(track_id).audio_path)
To load tracks by their track IDs, first get the list of track IDs:
track_ids = dataset.track_ids
Next, loop over the track_ids list to directly access each track in the dataset:
for track_id in dataset.track_ids:
print(track_id, dataset.track(track_id).audio_path)
Advanced Usage
Using mirdata in your pipeline
This section shows how to use Mirdata in your machine learning pipeline.
import mir_eval
import mirdata
import numpy as np
import sox

def very_bad_melody_extractor(audio_path):
    duration = sox.file_info.duration(audio_path)
    time_stamps = np.arange(0, duration, 0.01)
    melody_f0 = np.random.uniform(low=80.0, high=800.0, size=time_stamps.shape)
    return time_stamps, melody_f0

# Evaluate on the full dataset
orchset = mirdata.initialize("orchset")
orchset_scores = {}
orchset_data = orchset.load_tracks()

for track_id, track_data in orchset_data.items():
    est_times, est_freqs = very_bad_melody_extractor(track_data.audio_path_mono)

    ref_melody_data = track_data.melody
    ref_times = ref_melody_data.times
    ref_freqs = ref_melody_data.frequencies

    score = mir_eval.melody.evaluate(ref_times, ref_freqs, est_times, est_freqs)
    orchset_scores[track_id] = score

# Split the results by composer and by instrumentation
composer_scores = {}
strings_no_strings_scores = {True: {}, False: {}}

for track_id, track_data in orchset_data.items():
    if track_data.composer not in composer_scores.keys():
        composer_scores[track_data.composer] = {}

    composer_scores[track_data.composer][track_id] = orchset_scores[track_id]
    strings_no_strings_scores[track_data.contains_strings][track_id] = \
        orchset_scores[track_id]
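With per-track scores grouped as above, a common final step is averaging a metric within each group. Assuming each score dictionary contains an 'Overall Accuracy' key, as returned by mir_eval.melody.evaluate, the aggregation could look like this (the data below is hypothetical, for illustration only):

```python
def mean_metric(scores: dict, metric: str) -> float:
    """Average a single metric over a {track_id: score_dict} mapping."""
    values = [score[metric] for score in scores.values()]
    return sum(values) / len(values)

# Hypothetical per-composer scores in the same shape as composer_scores above
composer_scores = {
    "Beethoven": {"Beethoven-S3-I-ex1": {"Overall Accuracy": 0.10},
                  "Beethoven-S3-I-ex2": {"Overall Accuracy": 0.20}},
    "Ravel": {"Ravel-Bolero-ex1": {"Overall Accuracy": 0.30}},
}
for composer, scores in composer_scores.items():
    print(composer, mean_metric(scores, "Overall Accuracy"))
```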
Using mirdata with tensorflow
This example shows how to use Mirdata with TensorFlow’s tf.data.Dataset API to create a dataset generator for the ORCHSET dataset.
import mirdata
import numpy as np
import tensorflow as tf

def orchset_generator():
    # using the default data_home
    orchset = mirdata.initialize("orchset")
    track_ids = orchset.track_ids

    for track_id in track_ids:
        track = orchset.track(track_id)
        audio_signal, sample_rate = track.audio_mono

        yield {
            "audio": audio_signal.astype(np.float32),
            "sample_rate": sample_rate,
            "annotation": {
                "times": track.melody.times.astype(np.float32),
                "freqs": track.melody.frequencies.astype(np.float32),
            },
            "metadata": {"track_id": track.track_id}
        }

dataset = tf.data.Dataset.from_generator(
    orchset_generator,
    {
        "audio": tf.float32,
        "sample_rate": tf.float32,
        "annotation": {"times": tf.float32, "freqs": tf.float32},
        "metadata": {'track_id': tf.string}
    }
)
Using mirdata with pytorch
This example shows how to use Mirdata with PyTorch’s torch.utils.data.Dataset and DataLoader to create a dataset generator.
import torch
import numpy as np
import mirdata
from torch.utils.data import Dataset, DataLoader


class MIRDataset(Dataset):

    def __init__(self, dataset_name: str):
        # Initialize the loader, download if required, and validate
        self.loader = mirdata.initialize(dataset_name)
        self.loader.download()
        self.loader.validate()

        # Get the length of the longest tracks + annotations in the dataset
        # Torch dataloader requires all tensors to have the same dims
        # So we'll use this to pad items that are too short
        self.longest_track = max(
            [len(self.loader.track(tid).audio_mono[0]) for tid in self.loader.track_ids]
        )
        self.longest_annotation = max(
            [len(self.loader.track(tid).melody.times) for tid in self.loader.track_ids]
        )

    @staticmethod
    def pad(to_pad: np.ndarray, pad_size: int) -> np.ndarray:
        """Right-pads a 1D array to `pad_size`"""
        return np.pad(
            to_pad, (0, pad_size - len(to_pad)), mode="constant", constant_values=0.0
        )

    def __len__(self) -> int:
        return len(self.loader.track_ids)

    def __getitem__(self, item: int) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
        # Unpack the current track
        track_id = self.loader.track_ids[item]
        track = self.loader.track(track_id)

        # Get the audio and annotations
        audio_signal, sample_rate = track.audio_mono
        times = track.melody.times
        frequencies = track.melody.frequencies

        # Right pad everything to satisfy torch's requirement for equal dims
        audio_signal_padded = self.pad(audio_signal, self.longest_track)
        times_padded = self.pad(times, self.longest_annotation)
        frequencies_padded = self.pad(frequencies, self.longest_annotation)

        return (
            audio_signal_padded.astype(np.float32),
            times_padded.astype(np.float32),
            frequencies_padded.astype(np.float32),
        )


md = DataLoader(MIRDataset("orchset"), batch_size=2, shuffle=True, drop_last=False)
for audio, times, freqs in md:
    pass  # train your model on this data
Using mirdata in Google Colab
Google Colab provides a browser-based Python environment with free GPU support, which is useful for exploring datasets quickly.
There are two options for using a mirdata dataset in Colab: download the dataset directly in Google Colab, or access a dataset that was downloaded outside of Google Colab.
Colab Example Notebook
To get your own editable copy of the notebook, select File -> Save a copy in Drive.