Contributing to Mirdata
We encourage contributions to mirdata, especially new dataset loaders. To contribute a new loader, follow the steps indicated below and open a Pull Request (PR) against the GitHub repository. For any question or comment about your contribution, you can always submit an issue or open a discussion in the repository.
To reduce friction, we may make commits on top of contributors' PRs. If you do not want us
to, please tag your PR with please-do-not-edit.
Before you start
Quick link to contributing templates
If you’re familiar with Mirdata’s API already, you can find the template files for contributing here, and the loader checklist for submitting your PR here.
Installing mirdata for development purposes
To install Mirdata for development purposes:
First, fork the Mirdata repository on GitHub and clone your fork locally.
Then, from the root of your local clone, install all the dependencies:
# Install core dependencies
pip install .
# Install testing dependencies
pip install ."[tests]"
# Install docs dependencies
pip install ."[docs]"
# Install dataset-specific dependencies
pip install ."[dataset]"  # where dataset can be dali | haydn_op20 | cipi ...
Note
We recommend installing pyenv to manage your Python versions
and installing all Mirdata requirements. You will want to install the latest supported Python versions (see README.md).
Once pyenv and the Python versions are configured, install pytest. Make sure you have installed all the necessary pytest
plugins to automatically test your code successfully (e.g. pytest-cov).
Before running the tests, make sure to have formatted mirdata/ and tests/ with black.
black mirdata/ tests/
Also, make sure that your code passes the flake8 and mypy checks specified in the lint-python.yml GitHub Actions workflow.
flake8 mirdata --count --select=E9,F63,F7,F82 --show-source --statistics
python -m mypy mirdata --ignore-missing-imports --allow-subclassing-any
Finally, run:
pytest -vv --cov-report term-missing --cov-report=xml --cov=mirdata tests/ --local
All tests should pass!
Writing a new dataset loader
The steps to add a new dataset loader to Mirdata are:
Before starting
Before starting, check if your dataset falls into one of these non-standard cases:
Is the dataset download restricted or not fully-downloadable? If so, see this section
Does the dataset require dependencies not currently in mirdata? If so, see this section
Does the dataset have multiple versions? If so, see this section
1. Create an index
Mirdata’s structure relies on indexes. Indexes are dictionaries containing information about the structure of the dataset, which is necessary for the loading and validating functionalities of Mirdata. In particular, indexes contain information about the files included in the dataset, their location and checksums. Indexes do not contain tags, annotations, or descriptors. The necessary steps are:
To create an index, first create a script in
scripts/, e.g. make_dataset_index.py, which generates an index file. Then run the script on the dataset and save the index in
mirdata/datasets/indexes/ as dataset_index_<version>.json, where <version> indicates which version of the dataset was used (e.g. 1.0). When the dataloader is completed and the PR is accepted, upload the index to our Zenodo community. See more details here.
The script make_<datasetname>_index.py should automate the generation of an index by computing the MD5 checksums for the files of a dataset located at data_path.
Users can adapt this function to create an index for their dataset by adding their file paths and using the md5 function to generate checksums for their files.
Example Make Index Script
import argparse
import glob
import json
import os

from mirdata.validate import md5

DATASET_INDEX_PATH = "../mirdata/datasets/indexes/dataset_index.json"


def make_dataset_index(dataset_data_path):
    annotation_dir = os.path.join(dataset_data_path, "annotation")
    annotation_files = glob.glob(os.path.join(annotation_dir, "*.lab"))
    track_ids = sorted([os.path.basename(f).split(".")[0] for f in annotation_files])

    # top-key level metadata
    metadata_checksum = md5(os.path.join(dataset_data_path, "id_mapping.txt"))
    index_metadata = {"metadata": {"id_mapping": ("id_mapping.txt", metadata_checksum)}}

    # top-key level tracks
    index_tracks = {}
    for track_id in track_ids:
        audio_checksum = md5(
            os.path.join(dataset_data_path, "Wavfile/{}.wav".format(track_id))
        )
        annotation_checksum = md5(
            os.path.join(dataset_data_path, "annotation/{}.lab".format(track_id))
        )

        index_tracks[track_id] = {
            "audio": ("Wavfile/{}.wav".format(track_id), audio_checksum),
            "annotation": ("annotation/{}.lab".format(track_id), annotation_checksum),
        }

    # top-key level version
    dataset_index = {"version": None}

    # combine all in dataset index
    dataset_index.update(index_metadata)
    dataset_index.update({"tracks": index_tracks})

    with open(DATASET_INDEX_PATH, "w") as fhandle:
        json.dump(dataset_index, fhandle, indent=2)


def main(args):
    make_dataset_index(args.dataset_data_path)


if __name__ == "__main__":
    PARSER = argparse.ArgumentParser(description="Make dataset index file.")
    PARSER.add_argument(
        "dataset_data_path", type=str, help="Path to dataset data folder."
    )
    main(PARSER.parse_args())
More examples of scripts used to create dataset indexes can be found in the scripts folder.
Note
Users should be able to create the dataset indexes without the need for additional dependencies that are not included in Mirdata by default. Should you need an additional dependency for a specific reason, please open an issue to discuss with the Mirdata maintainers the need for it.
Tracks
Most MIR datasets are organized as a collection of tracks and annotations. In such case, the index should make use of the tracks
top-level key. A dictionary should be stored under the tracks top-level key where the keys are the unique track ids of the dataset.
The values are a dictionary of files associated with a track id, along with their checksums. These files can be for instance audio files
or annotations related to the track id. File paths are relative to the top level directory of a dataset.
Index Examples - Tracks
If the version 1.0 of a given dataset has the structure:
> Example_Dataset/
    > audio/
        track1.wav
        track2.wav
        track3.wav
    > annotations/
        track1.csv
        Track2.csv
        track3.csv
    > metadata/
        metadata_file.csv
The top level directory is Example_Dataset and the relative path for track1.wav
would be audio/track1.wav. Any unavailable fields are indicated with null. A possible index file for this example would be:
{
    "version": "1.0",
    "tracks": {
        "track1": {
            "audio": [
                "audio/track1.wav",  // the relative path for track1's audio file
                "912ec803b2ce49e4a541068d495ab570"  // track1.wav's md5 checksum
            ],
            "annotation": [
                "annotations/track1.csv",  // the relative path for track1's annotation
                "2cf33591c3b28b382668952e236cccd5"  // track1.csv's md5 checksum
            ]
        },
        "track2": {
            "audio": [
                "audio/track2.wav",
                "65d671ec9787b32cfb7e33188be32ff7"
            ],
            "annotation": [
                "annotations/Track2.csv",
                "e1964798cfe86e914af895f8d0291812"
            ]
        },
        "track3": {
            "audio": [
                "audio/track3.wav",
                "60edeb51dc4041c47c031c4bfb456b76"
            ],
            "annotation": [
                "annotations/track3.csv",
                "06cb006cc7b61de6be6361ff904654b3"
            ]
        }
    },
    "metadata": {
        "metadata_file": [
            "metadata/metadata_file.csv",
            "7a41b280c7b74e2ddac5184708f9525b"
        ]
    }
}
Note
In this example there is a (purposeful) mismatch between the name of the audio file track2.wav and its corresponding annotation file, Track2.csv, compared with the other pairs. This mismatch should be included in the index. This type of slight difference in filenames happens often in publicly available datasets, making pairing audio and annotation files more difficult. We use a fixed, version-controlled index to account for this kind of mismatch, rather than relying on string parsing on load.
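The structural rules above (a version key, track-level dictionaries of (relative path, checksum) pairs, null for unavailable fields) can be captured in a small sanity check. The helper below, check_index, is a hypothetical illustration written for this guide, not part of mirdata's API:

```python
def check_index(index):
    """Illustrative structural check for a mirdata-style index dict.

    Each entry under "tracks" maps a file key to a
    (relative path, md5 checksum) pair; unavailable fields are null/None.
    """
    if "version" not in index:
        return False
    for files in index.get("tracks", {}).values():
        for rel_path, checksum in files.values():
            # paths must be relative to the dataset's top-level directory
            if rel_path is not None and rel_path.startswith("/"):
                return False
            # md5 hex digests are always 32 characters long
            if checksum is not None and len(checksum) != 32:
                return False
    return True
```

Running this on the example index above would pass; an index missing the version key, or containing absolute paths, would fail.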
Multitracks
Index Examples - Multitracks
If the version 1.0 of a given multitrack dataset has the structure:
> Example_Dataset/
> audio/
multitrack1-voice1.wav
multitrack1-voice2.wav
multitrack1-accompaniment.wav
multitrack1-mix.wav
multitrack2-voice1.wav
multitrack2-voice2.wav
multitrack2-accompaniment.wav
multitrack2-mix.wav
> annotations/
multitrack1-voice-f0.csv
multitrack2-voice-f0.csv
multitrack1-f0.csv
multitrack2-f0.csv
> metadata/
metadata_file.csv
The top level directory is Example_Dataset and the relative path for multitrack1-voice1
would be audio/multitrack1-voice1.wav. Any unavailable fields are indicated with null. A possible index file for this example would be:
{
    "version": "1.0",
    "tracks": {
        "multitrack1-voice": {
            "audio_voice1": ["audio/multitrack1-voice1.wav", "<md5 checksum>"],
            "audio_voice2": ["audio/multitrack1-voice2.wav", "<md5 checksum>"],
            "voice-f0": ["annotations/multitrack1-voice-f0.csv", "<md5 checksum>"]
        },
        "multitrack1-accompaniment": {
            "audio_accompaniment": ["audio/multitrack1-accompaniment.wav", "<md5 checksum>"]
        },
        "multitrack2-voice": {...},
        ...
    },
    "multitracks": {
        "multitrack1": {
            "tracks": ["multitrack1-voice", "multitrack1-accompaniment"],
            "audio": ["audio/multitrack1-mix.wav", "<md5 checksum>"],
            "f0": ["annotations/multitrack1-f0.csv", "<md5 checksum>"]
        },
        "multitrack2": {...}
    },
    "metadata": {
        "metadata_file": [
            "metadata/metadata_file.csv",
            "7a41b280c7b74e2ddac5184708f9525b"
        ]
    }
}
Note
In this example, we group audio_voice1 and audio_voice2 in a single Track because the annotation
voice-f0 corresponds to their mixture. In contrast, the annotation f0 is extracted from
the multitrack mix and is therefore stored in the multitracks group. The multitrack multitrack1 has an
additional track multitrack1-mix.wav, which may be the master track, the final mix,
or a recording of multitrack1 with another microphone.
2. Create a module
Once the index is created you can create the loader. For that, we suggest you use the following template and adjust it for your dataset. To quickstart a new module:
Copy the example below and save it to
mirdata/datasets/<your_dataset_name>.py
Find & Replace Example with <your_dataset_name>.
Remove any lines beginning with # --, which are there as guidelines.
Example Module
Copy and save it to mirdata/datasets/<your_dataset_name>.py.
"""Example Dataset Loader

.. admonition:: Dataset Info
    :class: dropdown

    Please include the following information at the top level docstring for the dataset's module `dataset.py`:

    1. Describe annotations included in the dataset
    2. Indicate the size of the dataset (e.g. number of files and duration in hours)
    3. Mention the origin of the dataset (e.g. creator, institution)
    4. Describe the type of music included in the dataset
    5. Indicate any relevant papers related to the dataset
    6. Include a description of how the data can be accessed and the license it uses (if applicable)

"""
import csv
import json
import os
from typing import BinaryIO, Optional, TextIO, Tuple

# -- import whatever you need here and remove
# -- example imports you won't use
import librosa
import numpy as np
from smart_open import open  # if you use the open function, make sure you include this line!

from mirdata import annotations, core, download_utils, io

# -- Add any relevant citations here
BIBTEX = """
@article{article-minimal,
    author = "L[eslie] B. Lamport",
    title = "The Gnats and Gnus Document Preparation System",
    journal = "G-Animal's Journal",
    year = "1986"
},
@article{article-minimal2,
    author = "L[eslie] B. Lamport",
    title = "The Gnats and Gnus Document Preparation System 2",
    journal = "G-Animal's Journal",
    year = "1987"
}
"""
# -- NAME is the dataset identifier, i.e. the filename of your dataset module
NAME = "example"

# -- INDEXES specifies different versions of a dataset
# -- "default" and "test" specify which key should be used by default and when running tests
# -- Each index is defined by {"version": core.Index instance}
# -- | filename: index name
# -- | url: Zenodo direct download link of the index (will be available after the index upload is
# --   accepted to the Audio Data Loaders Zenodo community).
# -- | checksum: Checksum of the index hosted at Zenodo.
# -- Direct url for download and checksum can be found in the Zenodo entry of the dataset.
# -- A sample index is a mini-version that makes it easier to test large datasets.
# -- There must be a local sample index for testing for each remote index.
INDEXES = {
    "default": "1.2",
    "test": "sample",
    "1.2": core.Index(
        filename="beatles_index_1.2.json",
        url="https://zenodo.org/records/14007830/files/beatles_index_1.2.json?download=1",
        checksum="6e1276bdab6de05446ddbbc75e6f6cbe",
    ),
    "sample": core.Index(filename="beatles_index_1.2_sample.json"),
}

# -- REMOTES is a dictionary containing all files that need to be downloaded.
# -- The keys should be descriptive (e.g. 'annotations', 'audio').
# -- When having data that can be partially downloaded, remember to set up
# -- destination_dir correctly to download the files following the correct structure.
REMOTES = {
    'remote_data': download_utils.RemoteFileMetadata(
        filename='a_zip_file.zip',
        url='http://website/hosting/the/zipfile.zip',
        checksum='00000000000000000000000000000000',  # -- the md5 checksum
        destination_dir='path/to/unzip'  # -- relative path for where to unzip the data, or None
    ),
}

# -- Include any information that should be printed when downloading
# -- remove this variable if you don't need to print anything during download
DOWNLOAD_INFO = """
Include any information you want to be printed when dataset.download() is called.
These can be instructions for how to download the dataset (e.g. request access on zenodo),
caveats about the download, etc
"""

# -- Include the dataset's license information
LICENSE_INFO = """
The dataset's license information goes here.
"""
class Track(core.Track):
    """Example track class
    # -- YOU CAN AUTOMATICALLY GENERATE THIS DOCSTRING BY CALLING THE SCRIPT:
    # -- `scripts/print_track_docstring.py my_dataset`
    # -- note that you'll first need to have a test track (see "Adding tests to your dataset" below)

    Args:
        track_id (str): track id of the track

    Attributes:
        audio_path (str): path to audio file
        annotation_path (str): path to annotation file
        # -- Add any of the dataset specific attributes here

    Cached Properties:
        annotation (EventData): a description of this annotation

    """

    def __init__(self, track_id, data_home, dataset_name, index, metadata):
        # -- this sets the following attributes:
        # -- * track_id
        # -- * _dataset_name
        # -- * _data_home
        # -- * _track_paths
        # -- * _track_metadata
        super().__init__(
            track_id,
            data_home,
            dataset_name=dataset_name,
            index=index,
            metadata=metadata,
        )

        # -- add any dataset specific attributes here
        self.audio_path = self.get_path("audio")
        self.annotation_path = self.get_path("annotation")

        # -- if the dataset has an *official* e.g. train/test split, use this
        # -- reserved attribute (can be a property if needed)
        self.split = ...

    # -- If the dataset has metadata that needs to be accessed by Tracks,
    # -- such as a table mapping track ids to composers for the full dataset,
    # -- add them as properties like this instead of in the __init__.
    @property
    def composer(self) -> Optional[str]:
        return self._track_metadata.get("composer")

    # -- `annotation` will behave like an attribute, but it will only be loaded
    # -- and saved when someone accesses it. Useful when loading slightly
    # -- bigger files or for bigger datasets. By default, we make any time
    # -- series data loaded from a file a cached property
    @core.cached_property
    def annotation(self) -> Optional[annotations.EventData]:
        return load_annotation(self.annotation_path)

    # -- `audio` will behave like an attribute, but it will only be loaded
    # -- when someone accesses it and it won't be stored. By default, we make
    # -- any memory heavy information (like audio) properties
    @property
    def audio(self) -> Optional[Tuple[np.ndarray, float]]:
        """The track's audio

        Returns:
            * np.ndarray - audio signal
            * float - sample rate

        """
        return load_audio(self.audio_path)

# -- if the dataset contains multitracks, you can define a MultiTrack similar to a Track
# -- you can delete the block of code below if the dataset has no multitracks
class MultiTrack(core.MultiTrack):
    """Example multitrack class

    Args:
        mtrack_id (str): multitrack id
        data_home (str): Local path where the dataset is stored.
            If `None`, looks for the data in the default directory, `~/mir_datasets/Example`

    Attributes:
        mtrack_id (str): track id
        tracks (dict): {track_id: Track}
        track_audio_property (str): the name of the attribute of Track which
            returns the audio to be mixed
        # -- Add any of the dataset specific attributes here

    Cached Properties:
        annotation (EventData): a description of this annotation

    """

    def __init__(
        self, mtrack_id, data_home, dataset_name, index, track_class, metadata
    ):
        # -- this sets the following attributes:
        # -- * mtrack_id
        # -- * _dataset_name
        # -- * _data_home
        # -- * _multitrack_paths
        # -- * _metadata
        # -- * _track_class
        # -- * _index
        # -- * track_ids
        super().__init__(
            mtrack_id=mtrack_id,
            data_home=data_home,
            dataset_name=dataset_name,
            index=index,
            track_class=track_class,
            metadata=metadata,
        )

        # -- optionally add any multitrack specific attributes here
        self.mix_path = ...  # this can be called whatever makes sense for the dataset
        self.annotation_path = ...

        # -- if the dataset has an *official* e.g. train/test split, use this
        # -- reserved attribute (can be a property if needed)
        self.split = ...

    # -- If you want to support multitrack mixing in this dataset, set this property
    @property
    def track_audio_property(self):
        return "audio"  # the attribute of Track, e.g. Track.audio, which returns the audio to mix

    # -- multitracks can optionally have mix-level cached properties and properties
    @core.cached_property
    def annotation(self) -> Optional[annotations.EventData]:
        """output type: description of output"""
        return load_annotation(self.annotation_path)

    @property
    def audio(self) -> Optional[Tuple[np.ndarray, float]]:
        """The multitrack's mix audio

        Returns:
            * np.ndarray - audio signal
            * float - sample rate

        """
        return load_audio(self.mix_path)

# -- this decorator allows this function to take a string or an open bytes file as input
# -- and in either case converts it to an open file handle.
# -- It also checks if the file exists
# -- and, if None is passed, None will be returned
@io.coerce_to_bytes_io
def load_audio(fhandle: BinaryIO) -> Tuple[np.ndarray, float]:
    """Load an Example audio file.

    Args:
        fhandle (str or file-like): path or file-like object pointing to an audio file

    Returns:
        * np.ndarray - the audio signal
        * float - The sample rate of the audio file

    """
    # -- for example, the code below. This should be dataset specific!
    # -- By default we load to mono
    # -- change this if it doesn't make sense for your dataset.
    return librosa.load(fhandle, sr=None, mono=True)


# -- Write any necessary loader functions for loading the dataset's data

# -- this decorator allows this function to take a string or an open file as input
# -- and in either case converts it to an open file handle.
# -- It also checks if the file exists
# -- and, if None is passed, None will be returned
@io.coerce_to_string_io
def load_annotation(fhandle: TextIO) -> Optional[annotations.EventData]:
    # -- because of the decorator, the file is already open
    reader = csv.reader(fhandle, delimiter=' ')
    intervals = []
    annotation = []
    for line in reader:
        intervals.append([float(line[0]), float(line[1])])
        annotation.append(line[2])

    # there are several annotation types in annotations.py
    # They should be initialized with data, followed by their units
    # see annotations.py for a complete list of types and units.
    annotation_data = annotations.EventData(
        np.array(intervals), "s", np.array(annotation), "open"
    )
    return annotation_data

# -- use this decorator so the docs are complete
@core.docstring_inherit(core.Dataset)
class Dataset(core.Dataset):
    """The Example dataset"""

    def __init__(self, data_home=None, version="default"):
        super().__init__(
            data_home,
            version,
            name=NAME,
            track_class=Track,
            bibtex=BIBTEX,
            indexes=INDEXES,
            remotes=REMOTES,
            download_info=DOWNLOAD_INFO,
            license_info=LICENSE_INFO,
        )

    # -- if your dataset has a top-level metadata file, write a loader for it here
    # -- you do not have to include this function if there is no metadata
    @core.cached_property
    def _metadata(self):
        # -- load metadata however makes sense for your dataset
        metadata_path = os.path.join(self.data_home, 'example_metadata.json')
        with open(metadata_path, 'r') as fhandle:
            metadata = json.load(fhandle)

        return metadata

    # -- if your dataset needs to overwrite the default download logic, do it here.
    # -- this function is usually not necessary unless you need very custom download logic
    def download(
        self, partial_download=None, force_overwrite=False, cleanup=False
    ):
        """Download the dataset

        Args:
            partial_download (list or None):
                A list of keys of remotes to partially download.
                If None, all data is downloaded
            force_overwrite (bool):
                If True, existing files are overwritten by the downloaded files.
            cleanup (bool):
                Whether to delete any zip/tar files after extracting.

        Raises:
            ValueError: if invalid keys are passed to partial_download
            IOError: if a downloaded file's checksum is different from expected

        """
        # -- see download_utils.downloader for basic usage - if you only need to call downloader
        # -- once, you do not need this function at all.
        # -- only write a custom function if you need it!
You may find existing loaders useful as references. For many more examples, see the datasets folder.
Declare constant variables
Please include the variables BIBTEX, INDEXES, REMOTES, and LICENSE_INFO at the beginning of your module.
While BIBTEX (containing the BibTeX-formatted citation of the dataset), INDEXES (index URLs, checksums and versions),
and LICENSE_INFO (the license that covers the dataset in the dataloader) are mandatory, REMOTES is only defined if the dataset is openly downloadable.
INDEXES
As seen in the example, there are two ways to define an index: by providing a URL to download the index file, or by providing the filename of the index file, assuming it is available locally (like sample indexes).
The full indexes for each version of the dataset should be retrieved from our Zenodo community. See more details here.
The sample indexes should be stored locally in the
tests/indexes/ folder, and accessed directly through filename. See more details here.
Note
We recommend setting the highest version of the dataset as the default version in the INDEXES variable.
However, if there is a reason to use a different version as the default, please do so.
When defining a remote index in INDEXES, simply also pass the arguments url and checksum to the Index class:
"1.0": core.Index(
filename="example_index_1.0.json", # the name of the index file
url=<url>, # the download link
checksum=<checksum>, # the md5 checksum
)
Remote indexes get downloaded along with the data when calling .download(), and are stored in <data_home>/mirdata/datasets/indexes.
REMOTES
REMOTES should be a dictionary of RemoteFileMetadata objects, which are used to download the dataset files. See an example below:
REMOTES = {
"annotations": download_utils.RemoteFileMetadata(
filename="The Beatles Annotations.tar.gz",
url="http://isophonics.net/files/annotations/The%20Beatles%20Annotations.tar.gz",
checksum="62425c552d37c6bb655a78e4603828cc",
destination_dir="annotations",
),
}
Add more RemoteFileMetadata objects to the REMOTES dictionary if the dataset is split into multiple files.
Please use download_utils.RemoteFileMetadata to fetch the dataset from an online repository; it takes care of the download process and checksum validation, and addresses corner cases.
Please do NOT use specific functions like download_zip_file or download_and_extract individually in your loader.
Note
The direct URL for download and the checksum can be found in the Zenodo entries of the dataset and index. Bear in mind that the URL and checksum for the index will only be available once a maintainer of the Audio Data Loaders Zenodo community has accepted the index upload.
For other repositories, you may need to generate the checksum yourself.
You may use the function provided in mirdata.validate.py.
Make sure to include, in the docstring of the dataloader, information about the following relevant aspects of the dataset you are integrating:
The dataset name.
A general purpose description, and the task it is used for.
Details about the coverage: how many clips, how many hours of audio, how many classes, the annotations available, etc.
The license of the dataset (even if you have included the LICENSE_INFO variable already).
The authors of the dataset, the organization in which it was created, and the year of creation (even if you have included the BIBTEX variable already).
Please also reference any relevant link or website that users can check for more information.
Important
In addition to the module docstring, you should write docstrings for every new class and function you write. See the documentation tutorial for practical information on best documentation practices. Proper documentation helps users understand the dataset and its purpose, and enhances transparency. Please do not include complicated tables, big pieces of text, or unformatted copy-pasted text; it is important that the docstring is clean and the information is very clear to users. This will also encourage users to use the dataloader! For many more examples, see the datasets folder.
Note
If the dataset you are trying to integrate stores every clip in a separate compressed file, it cannot currently be supported by Mirdata. Feel free to open an issue to discuss a solution (hopefully for the near future!)
3. Add tests
To finish your contribution, include tests that check the integrity of your loader. For this, follow these steps:
Make a toy version of the dataset in the tests folder
tests/resources/mir_datasets/my_dataset/, so you can test against a small amount of data. For example:
Include all audio and annotation files for one track of the dataset
For each audio/annotation file, reduce the audio length to 1-2 seconds and remove all but a few of the annotations.
If the dataset has a metadata file, reduce the length to a few lines.
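Trimming the toy audio files can be done with any audio tool; as one illustration, the sketch below keeps only the first couple of seconds of a WAV file using Python's standard-library wave module (the helper name trim_wav is ours, not part of mirdata):

```python
import wave


def trim_wav(in_path, out_path, seconds=2.0):
    """Write a copy of a wav file keeping only its first `seconds`,
    to produce a small test resource."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        # keep at most `seconds` worth of frames
        n_frames = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_frames)
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)  # nframes is fixed up on close
        dst.writeframes(frames)
```

For compressed formats (mp3, flac, ...) a tool such as ffmpeg or sox is more appropriate, since wave only handles uncompressed WAV.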
Test all of the dataset specific code, e.g. the public attributes of the Track class, the load functions and any other custom functions you wrote. See the tests folder for reference. If your loader has a custom download function, add tests similar to this loader.
Locally run
pytest -s tests/test_full_dataset.py --local --dataset my_dataset before submitting your loader to make sure everything is working. If your dataset has multiple versions, test each (non-default) version by running pytest -s tests/test_full_dataset.py --local --dataset my_dataset --dataset-version my_version.
Note
We have written automated tests for every loader’s cite, download, validate, load, and track_ids functions,
as well as some basic edge cases of the Track class, so you don’t need to write tests for these!
Example Test File
"""Tests for example dataset
"""
import numpy as np
import pytest

from mirdata import annotations
from mirdata.datasets import example
from tests.test_utils import run_track_tests


def test_track():
    default_trackid = "some_id"
    data_home = "tests/resources/mir_datasets/dataset"
    dataset = example.Dataset(data_home, version="test")
    track = dataset.track(default_trackid)

    expected_attributes = {
        "track_id": "some_id",
        "audio_path": "tests/resources/mir_datasets/example/Wavfile/some_id.wav",
        "song_id": "some_id",
        "annotation_path": "tests/resources/mir_datasets/example/annotation/some_id.pv",
    }

    expected_property_types = {"annotation": annotations.XData}

    assert track._track_paths == {
        "audio": ["Wavfile/some_id.wav", "278ae003cb0d323e99b9a643c0f2eeda"],
        "annotation": ["Annotation/some_id.pv", "0d93a011a9e668fd80673049089bbb14"],
    }

    run_track_tests(track, expected_attributes, expected_property_types)

    # test audio loading functions
    audio, sr = track.audio
    assert sr == 44100
    assert audio.shape == (44100 * 2,)


def test_load_annotation():
    # load a file which exists
    annotation_path = "tests/resources/mir_datasets/dataset/Annotation/some_id.pv"
    annotation_data = example.load_annotation(annotation_path)

    # check types
    assert type(annotation_data) == annotations.XData
    assert type(annotation_data.times) is np.ndarray
    # ... etc

    # check values
    assert np.array_equal(annotation_data.times, np.array([0.016, 0.048]))
    # ... etc


def test_metadata():
    data_home = "tests/resources/mir_datasets/dataset"
    dataset = example.Dataset(data_home, version="test")
    metadata = dataset._metadata
    assert metadata["some_id"] == "something"
Running your tests locally
Before creating a PR, you should run all the tests. But before that, make sure to have formatted mirdata/ and tests/ with black.
black mirdata/ tests/
Also, make sure that they pass flake8 and mypy tests specified in lint-python.yml github action workflow.
flake8 mirdata --count --select=E9,F63,F7,F82 --show-source --statistics
python -m mypy mirdata --ignore-missing-imports --allow-subclassing-any
Finally, run all the tests locally like this:
pytest -vv --cov-report term-missing --cov-report=xml --cov=mirdata --black tests/ --local
The --local flag skips tests that are built to run only in the remote testing environment.
To run one specific test file:
pytest tests/datasets/test_ikala.py
Finally, there is one local test you should run, which we can’t easily run in our testing environment.
pytest -s tests/test_full_dataset.py --local --dataset dataset
Where dataset is the name of the module of the dataset you added. The -s tells pytest not to skip print
statements, which is useful here for seeing the download progress bar when testing the download function.
This tests that your dataset downloads, validates, and loads properly for every track. This test takes a long time for some datasets, but it’s important to ensure the integrity of the library.
The --skip-download flag can be added to the pytest command to run the tests without downloading the data.
Note that this is just for convenience during debugging - the tests should eventually all pass without this flag.
Reducing the testing space usage
Important
We are trying to keep the test resources folder size as small as possible, because it can get really heavy as new loaders are added. We kindly ask the contributors to reduce the size of the testing data if possible (e.g. trimming the audio tracks, keeping just two rows for csv files).
4. Update Mirdata documentation
Before you submit your loader make sure to:
Add your module to
docs/source/mirdata.rst in alphabetical order
Add your module to
docs/source/table.rst in alphabetical order, as follows:
* - Dataset
- Downloadable?
- Annotation Types
- Tracks
- License
An example of this for the Beatport EDM key dataset:
* - Beatport EDM key
- - audio: ✅
- annotations: ✅
- - global :ref:`key`
- 1486
- .. image:: https://licensebuttons.net/l/by-sa/3.0/88x31.png
:target: https://creativecommons.org/licenses/by-sa/4.0
(you can check that this was done correctly by clicking on the readthedocs check when you open a PR). You can find license badges images and links here.
5. Uploading the index to Zenodo
We store all dataset indexes in an online repository on Zenodo.
To use a dataloader, users may retrieve the index by running the dataset.download() function that is also used to download the dataset.
To download only the index, you may run .download(["index"]). The index will be automatically downloaded and stored in the expected folder in Mirdata.
From a contributor point of view, you may create the index, store it locally, and develop the dataloader.
All JSON files in mirdata/indexes/ are included in the .gitignore file,
so there is no need to remove them when pushing to the remote branch during development; git will ignore them.
Important
When creating the PR, please submit your index to our Zenodo community:
First, click on New upload.
Add your index in the Upload files section.
Let Zenodo create a DOI for your index, so click No.
Resource type is Other.
Title should be mirdata-<dataset-id>_index_<version>, e.g. mirdata-beatles_index_1.2.
Add yourself as the Creator of this entry.
The license of the index should be the same as Mirdata.
Visibility should be set as Public.
Note
<dataset-id> is the identifier we use to initialize the dataset using mirdata.initialize(). It’s also the filename of your dataset module.
6. Create a Pull Request
Please create a Pull Request with all your development. When starting your PR, please use the new_loader.md template;
it will simplify the reviewing process and also help you make a complete PR. You can do that by adding
&template=new_loader.md at the end of the URL when you are creating the PR:
...mir-dataset-loaders/mirdata/compare?expand=1 will become
...mir-dataset-loaders/mirdata/compare?expand=1&template=new_loader.md.
Docs
Staged docs for every new PR are built, and you can look at them by clicking on the “readthedocs” test in a PR.
To quickly troubleshoot any issues, you can build the docs locally by navigating to the docs folder, and running
make html (note, you must have sphinx installed). Then open the generated _build/source/index.html
file in your web browser to view.
Troubleshooting
If github shows a red X next to your latest commit, it means one of our checks is not passing. This could mean:
running black has failed – this means that your code is not formatted according to black's code style. To fix this, simply run the following from inside the top level folder of the repository:
black mirdata/ tests/
Your code does not pass the flake8 test.
flake8 mirdata --count --select=E9,F63,F7,F82 --show-source --statistics
Your code does not pass the mypy test.
python -m mypy mirdata --ignore-missing-imports --allow-subclassing-any
the test coverage is too low – this means that there are too many new lines of code introduced that are not tested.
the docs build has failed – this means that one of the changes you made to the documentation has caused the build to fail. Check the formatting in your changes and make sure they are consistent.
the tests have failed – this means at least one of the tests is failing. Run the tests locally to make sure they are passing. If they are passing locally but failing in the check, open an issue and we can help debug.
Common non-standard cases
Not fully-downloadable datasets
Sometimes, parts of music datasets are not publicly available due to e.g. copyright restrictions. In these cases, we aim to make sure that the version used in mirdata is the original one, and not a variant.
Before starting a PR, if a dataset is not fully downloadable:
Contact the mirdata team by opening an issue or PR so we can discuss how to proceed with the closed dataset.
Show that the version used to create the checksum is the “canonical” one, either by getting the version from the dataset creator, or by verifying equivalence with several other copies of the dataset.
Datasets needing extra dependencies
If a new dataset requires a library that is not included in setup.py, please open an issue. In general, if the new library will be useful for many future datasets, we will add it as a dependency. If it is specific to one dataset, we will add it as an optional dependency.
To add an optional dependency, add the dataset name as a key in extras_require in setup.py, and list any additional dependencies. Additionally, mock the dependencies in docs/conf.py by adding it to the autodoc_mock_imports list.
When importing these optional dependencies in the dataset module, use a try/except clause and log instructions if the user hasn’t installed the extra requirements.
For example, if a module called example_dataset requires a module called asdf, it should be imported as follows:
import logging

try:
    import asdf
except ImportError:
    logging.error(
        "In order to use example_dataset you must have asdf installed. "
        "Please reinstall mirdata using `pip install 'mirdata[example_dataset]'`"
    )
    raise ImportError
Datasets with multiple versions
There are some datasets where the loading code is the same, but there are multiple
versions of the data (e.g. updated annotations, or an additional set of tracks which
follow the same paradigm). In this case, only one loader should be written, and
multiple versions can be defined by creating additional indexes. Indexes follow the
naming convention <datasetname>_index_<version>.json, thus a dataset with two
versions simply has two index files. Different versions are tracked using the
INDEXES variable:
INDEXES = {
    "default": "1.0",
    "test": "sample",
    "1.0": core.Index(filename="example_index_1.0.json"),
    "2.0": core.Index(filename="example_index_2.0.json"),
    "sample": core.Index(filename="example_index_sample.json")
}
By default, mirdata loads the version specified as default in INDEXES
when running mirdata.initialize('example'), but a specific version can
be loaded by running mirdata.initialize('example', version='2.0').
Different indexes can refer to different subsets of the same larger dataset,
or can reference completely different data. All data needed for all versions
should be specified via keys in REMOTES, and by default, mirdata will
download everything. If one version only needs a subset
of the data in REMOTES, it can be specified using the partial_download
argument of core.Index. For example, if REMOTES has the keys
['audio', 'v1-annotations', 'v2-annotations'], the INDEXES dictionary
could look like:
INDEXES = {
    "default": "1.0",
    "test": "1.0",
    "1.0": core.Index(filename="example_index_1.0.json", partial_download=['audio', 'v1-annotations']),
    "2.0": core.Index(filename="example_index_2.0.json", partial_download=['audio', 'v2-annotations']),
}
Documentation
This documentation is in rst format. It is built using Sphinx and hosted on readthedocs. The API documentation is built using autodoc, which autogenerates documentation from the code’s docstrings. We use the napoleon plugin for building docs in Google docstring style. See the next section for docstring conventions.
mirdata uses Google’s Docstring formatting style. Here are some common examples.
Note
The small formatting details in these examples are important. Differences in new lines, indentation, and spacing make
a difference in how the documentation is rendered. For example writing Returns: will render correctly, but Returns
or Returns : will not.
Functions:
def add_to_list(list_of_numbers, scalar):
    """Add a scalar to every element of a list.
    You can write a continuation of the function description here on the next line.

    You can optionally write more about the function here. If you want to add an example
    of how this function can be used, you can do it like below.

    Example:
        .. code-block:: python

            foo = add_to_list([1, 2, 3], 2)

    Args:
        list_of_numbers (list): A short description that fits on one line.
        scalar (float):
            Description of the second parameter. If there is a lot to say you can
            overflow to a second line.

    Returns:
        list: Description of the return. The type here is not in parentheses

    """
    return [x + scalar for x in list_of_numbers]
Functions with more than one return value:
def multiple_returns():
    """This function has no arguments, but more than one return value. Autodoc with napoleon doesn't handle this well,
    and we use this formatting as a workaround.

    Returns:
        * int - the first return value
        * bool - the second return value

    """
    return 42, True
One-line docstrings
def some_function():
    """
    One line docstrings must be on their own separate line, or autodoc does not build them properly
    """
    ...
Objects
"""Description of the class
overflowing to a second line if it's long

Some more details here

Args:
    foo (str): First argument to the __init__ method
    bar (int): Second argument to the __init__ method

Attributes:
    foobar (str): First track attribute
    barfoo (bool): Second track attribute

Cached Properties:
    foofoo (list): Cached properties are special mirdata attributes
    barbar (None): They are lazy loaded properties.
    barf (bool): Document them with this special header.

"""
Conventions
Opening files
Mirdata uses the smart_open library under the hood in order to support reading data from
remote filesystems. If your loader needs to either call the python open command, or if
it needs to use os.path.exists, you’ll need to include the line
from smart_open import open
at the top of your dataset module and use open as you normally would.
Sometimes dependency libraries accept file paths as input to certain functions and open the files
internally - whenever possible mirdata avoids this, and passes in file-objects directly.
If you just need os.path.exists, you’ll need to replace
it with a try/except:
# original code that uses os.path.exists
file_path = "flululu.txt"
if not os.path.exists(file_path):
    raise FileNotFoundError(f"{file_path} not found, did you run .download?")
with open(file_path, "r") as fhandle:
    ...

# replacement code that is compatible with remote filesystems
try:
    with open(file_path, "r") as fhandle:
        ...
except FileNotFoundError:
    raise FileNotFoundError(f"{file_path} not found, did you run .download?")
Loading from files
We use the following libraries for loading data from files:
| Format              | Library     |
|---------------------|-------------|
| audio (wav, mp3, …) | librosa     |
| midi                | pretty_midi |
| json                | json        |
| csv                 | csv         |
| yaml                | pyyaml      |
| hdf5 / h5           | h5py        |
If a file format needed for a dataset is not included in this list, please see this section
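As an illustration of the table above, a loader for a hypothetical beats csv (two columns: time in seconds, beat position; the function name and file layout are made up for this example) could look like:

```python
import csv

def load_beats(fhandle):
    """Parse rows like '0.50,1' into parallel lists of times and positions."""
    times, positions = [], []
    for row in csv.reader(fhandle):
        times.append(float(row[0]))
        positions.append(int(row[1]))
    return times, positions
```

Taking an open file handle rather than a path keeps the loader compatible with the file-object conventions described below.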
Track Attributes
If the dataset has an official split (e.g. train/test), use the reserved attribute Track.split, or MultiTrack.split, which will enable some dataset-level helper functions like dataset.get_track_splits. If there is no official split, do not use this attribute.
Custom track attributes should be global, track-level data.
For some datasets, there is a separate, dataset-level metadata file
with track-level metadata, e.g. as a csv. When a single file is needed
for more than one track, we recommend writing a _metadata cached property (which
returns a dictionary, either keyed by track_id or freeform)
in the Dataset class (see the dataset module example code above). When this is specified,
it will populate a track’s hidden _track_metadata field, which can be accessed from
the Track class.
For example, if _metadata returns a dictionary of the form:
{
    'track1': {
        'artist': 'A',
        'genre': 'Z'
    },
    'track2': {
        'artist': 'B',
        'genre': 'Y'
    }
}
the _track_metadata for track_id=track2 will be:
{
    'artist': 'B',
    'genre': 'Y'
}
Missing Data
If a Track has a property, for example a type of annotation, that is present for some tracks and not others,
the property should be set to None when it isn’t available.
The index should only contain key-values for files that exist.
Custom Decorators
cached_property
This is used primarily for Track classes.
This decorator causes an object's method to behave like
an attribute (aka, like the @property decorator), but caches
the value in memory after it is first accessed. This is used
for data which is relatively large and loaded from files.
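The load-once behaviour can be sketched with the standard library's functools.cached_property (mirdata defines its own equivalent decorator; the toy class below is illustrative):

```python
from functools import cached_property

class Track:
    """Toy class showing that the cached property loads only once."""

    def __init__(self):
        self.load_count = 0

    @cached_property
    def annotation(self):
        # Pretend this is an expensive file load; it runs only on
        # first access, after which the value is cached in memory.
        self.load_count += 1
        return [0.5, 1.0, 1.5]
```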
docstring_inherit
This decorator is used for children of the Dataset class, and copies the Attributes from the parent class to the docstring of the child. This gives us clear and complete docs without a lot of copy-paste.
coerce_to_bytes_io/coerce_to_string_io
These are two decorators used to simplify the loading of various Track members
in addition to giving users the ability to use file streams instead of paths in
case the data is in a remote location e.g. GCS. The decorators modify the function
to:
Return None if None is passed in.
Open a file if a string path is passed in, either in 'w' mode for string_io or 'wb' for bytes_io, and pass the file handle to the decorated function.
Pass the file handle to the decorated function if a file-like object is passed.
This cannot be used if the function to be decorated takes multiple arguments.
coerce_to_bytes_io should not be used if trying to load an mp3 with librosa as libsndfile does not support
mp3 yet and audioread expects a path.
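The coercion pattern described above can be sketched as follows. This is an illustration only, not mirdata's actual implementation: the real decorators live in mirdata, and details such as the open mode differ (read mode is used here since the decorated load_text example reads).

```python
from functools import wraps

def coerce_to_string_io(func):
    """Sketch: accept None, a string path, or a file-like object."""
    @wraps(func)
    def wrapper(path_or_file):
        if path_or_file is None:
            # Return None if None is passed in.
            return None
        if isinstance(path_or_file, str):
            # Open string paths and pass the handle to the function.
            with open(path_or_file, "r") as fhandle:
                return func(fhandle)
        # File-like objects are passed through unchanged.
        return func(path_or_file)
    return wrapper

@coerce_to_string_io
def load_text(fhandle):
    """Hypothetical single-argument loader."""
    return fhandle.read()
```

Note that, as stated above, this pattern only works for functions taking a single argument.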