.. _tutorial:

========
Tutorial
========

This tutorial covers how to use mirdata to access and work with music datasets. mirdata is a Python library designed to make it easy to load and work with common music information retrieval (MIR) datasets. In this tutorial, we will cover:

* Installing mirdata
* Initializing a dataset
* Downloading a dataset
* Validating a dataset
* Loading tracks
* Accessing annotations and metadata
* Advanced options for download and tracks
* Usage examples of mirdata in your pipeline, with TensorFlow, with PyTorch, and in Google Colab

------

----------
Quickstart
----------

First, install mirdata. We recommend doing this inside a conda or virtual environment for reproducibility.

.. code-block:: bash

    pip install mirdata

Then, get your data by simply doing:

.. code-block:: python
    :linenos:

    # Basic usage example
    import mirdata

    # 1. List all available datasets
    print(mirdata.list_datasets())

    # 2. Initialize a dataset loader
    dataset = mirdata.initialize("orchset", data_home='/choose/where/data/live')

    # 3. Download the dataset
    dataset.download()

    # 4. Validate the dataset
    dataset.validate()

    # 5. Load a random track
    random_track = dataset.choice_track()

    # 6. Access metadata and annotations
    print(random_track)

Below, we elaborate on each step a bit more:

Initializing a dataset
----------------------

To use a loader (for example, ``orchset``), you need to initialize it by calling:

.. code-block:: python

    dataset = mirdata.initialize('orchset', data_home='/choose/where/data/live')

This creates a dataset loader object that you can use to access the dataset's tracks, metadata, and annotations. You can specify the directory where the data is stored by passing a path to ``data_home``.

.. admonition:: Dataset versions
    :class: attention

    mirdata supports working with multiple dataset versions. To see all available versions of a specific dataset, run ``mirdata.list_dataset_versions('orchset')``.
    Use the ``version`` parameter if you wish to use a version other than the default one. For an example, see below.

    .. toggle::

        .. code-block:: python

            # To see all available versions of a specific dataset:
            mirdata.list_dataset_versions('orchset')

            # Use the 'version' parameter if you wish to use a version other than the default one.
            dataset = mirdata.initialize('orchset', data_home='/choose/where/data/live', version="1.0")

Downloading a dataset
---------------------

To download the dataset, use the ``download()`` method of the dataset loader object:

.. code-block:: python

    dataset.download()  # Dataset is downloaded to ~/mir_datasets/orchset

By default, the dataset is downloaded to the ``mir_datasets`` folder in your home directory.

.. admonition:: Note
    :class: attention

    For downloading to a custom folder, partial downloads, and other advanced options, see the `Advanced download options`_ section below.

Validating a dataset
--------------------

To ensure that the dataset files are correctly downloaded and not corrupted, use the ``validate()`` method of the dataset loader object:

.. code-block:: python

    dataset.validate()

This method checks the integrity of the dataset files and raises an error if any files are missing or corrupted.

Loading a random track
----------------------

We can choose a random track from a dataset with the ``choice_track()`` method:

.. code-block:: python

    random_track = dataset.choice_track()

This returns a random track from the dataset, which can be useful for testing or exploration.

.. admonition:: Note
    :class: attention

    For loading all tracks, loading a single track, or loading tracks with specific IDs, see the `Advanced track options`_ section below.

Annotations and metadata
------------------------

After choosing a track, we can access its metadata and annotations. To print the metadata and annotations associated with the track, simply print the track object:

.. code-block:: python

    # For this example, we will use the random_track from above.
    print(random_track)

This will print the metadata and annotations associated with the track, such as the composer, work, excerpt, and paths to the audio files.

.. code-block:: python

    # Example output
    >>> Track(
          alternating_melody=True,
          audio_path_mono="user/mir_datasets/orchset/audio/mono/Beethoven-S3-I-ex1.wav",
          audio_path_stereo="user/mir_datasets/orchset/audio/stereo/Beethoven-S3-I-ex1.wav",
          composer="Beethoven",
          contains_brass=False,
          contains_strings=True,
          contains_winds=True,
          excerpt="1",
          melody_path="user/mir_datasets/orchset/GT/Beethoven-S3-I-ex1.mel",
          only_brass=False,
          only_strings=False,
          only_winds=False,
          predominant_melodic_instruments=['strings', 'winds'],
          track_id="Beethoven-S3-I-ex1",
          work="S3-I",
          audio_mono: (np.ndarray, float),
          audio_stereo: (np.ndarray, float),
          melody: F0Data,
        )

.. admonition:: Annotation classes
    :class: attention

    mirdata defines annotation-specific data classes. These data classes are meant to standardize the format for all loaders, and are compatible with `mir_eval `_. The list and descriptions of available annotation classes can be found in :ref:`annotations`.

    **Note: These classes may be extended in the case that a loader requires it.**

-----

-------------------------
Advanced download options
-------------------------

This section covers the advanced dataset download configurations and options available in mirdata:

* Downloading the dataset to a custom folder
* Partially downloading a dataset
* Downloading the dataset index only
* Accessing data on non-local filesystems

Downloading a dataset to a custom folder
----------------------------------------

.. code-block:: python

    dataset = mirdata.initialize('orchset', data_home='/Users/leslieknope/Desktop/orchset123')
    dataset.download()  # Dataset is downloaded to the folder "orchset123" on Leslie Knope's desktop

Since ``data_home`` is specified, orchset will be read from and written to this custom location.

Partially downloading a dataset
-------------------------------

The ``download()`` function allows partial downloads of a dataset. In other words, where applicable, the user can select which elements of the dataset to download. Each dataset has a ``REMOTES`` dictionary where all the available elements are listed.

.. code-block:: python

    # Elements should be specified as a list of keys in the REMOTES dictionary.
    dataset.download(partial_download=['element_A', 'element_B', 'element_C'])

.. admonition:: Partial downloads example

    .. toggle::

        ``cante100`` has several elements, as seen in its ``REMOTES`` dictionary. We can specify which of these elements are downloaded by passing the list of ``REMOTES`` keys we are interested in to the ``download()`` function through the ``partial_download`` argument.

        .. code-block:: python

            REMOTES = {
                "spectrogram": download_utils.RemoteFileMetadata(
                    filename="cante100_spectrum.zip",
                    url="https://zenodo.org/record/1322542/files/cante100_spectrum.zip?download=1",
                    checksum="0b81fe0fd7ab2c1adc1ad789edb12981",  # the md5 checksum
                    destination_dir="cante100_spectrum",  # relative path for where to unzip the data, or None
                ),
                "melody": download_utils.RemoteFileMetadata(
                    filename="cante100midi_f0.zip",
                    url="https://zenodo.org/record/1322542/files/cante100midi_f0.zip?download=1",
                    checksum="cce543b5125eda5a984347b55fdcd5e8",  # the md5 checksum
                    destination_dir="cante100midi_f0",  # relative path for where to unzip the data, or None
                ),
                "notes": download_utils.RemoteFileMetadata(
                    filename="cante100_automaticTranscription.zip",
                    url="https://zenodo.org/record/1322542/files/cante100_automaticTranscription.zip?download=1",
                    checksum="47fea64c744f9fe678ae5642a8f0ee8e",  # the md5 checksum
                    destination_dir="cante100_automaticTranscription",  # relative path for where to unzip the data, or None
                ),
                "metadata": download_utils.RemoteFileMetadata(
                    filename="cante100Meta.xml",
                    url="https://zenodo.org/record/1322542/files/cante100Meta.xml?download=1",
                    checksum="6cce186ce77a06541cdb9f0a671afb46",  # the md5 checksum
                ),
                "README": download_utils.RemoteFileMetadata(
                    filename="cante100_README.txt",
                    url="https://zenodo.org/record/1322542/files/cante100_README.txt?download=1",
                    checksum="184209b7e7d816fa603f0c7f481c0aae",  # the md5 checksum
                ),
            }

        A partial download for the ``cante100`` dataset could look like:

        .. code-block:: python

            dataset = mirdata.initialize('cante100', data_home='/choose/where/data/live')
            dataset.download(partial_download=['spectrogram', 'melody', 'metadata'])

.. admonition:: Note
    :class: warning

    Not all datasets support partial downloads. To check if a dataset does, check that its ``REMOTES`` dictionary is not empty.
Downloading the dataset index only
----------------------------------

All dataset loaders in mirdata have a ``download()`` function that downloads:

* The :ref:`canonical ` version of the dataset (when available)
* The dataset index, which lists the clips and the paths to the audio and annotation files

The index alone is downloaded by running ``download(["index"])`` and is stored in mirdata's indexes folder (``mirdata/datasets/indexes``).

.. code-block:: python

    # Download the dataset index
    dataset.download(["index"])

    # Check the path to the downloaded index
    print(dataset.index_path)

Accessing data on non-local filesystems
---------------------------------------

mirdata uses the smart_open_ library, which supports non-local filesystems such as GCS and AWS. If your data lives, e.g., on Google Cloud Storage (GCS), simply set the ``data_home`` variable accordingly when initializing a dataset. For example:

.. _smart_open: https://pypi.org/project/smart-open/

.. code-block:: python

    dataset = mirdata.initialize("orchset", data_home="gs://my-bucket/my-subfolder/orchset")

    # everything should work the same as if the data were local
    dataset.validate()

Note that the data on the remote filesystem **must have a folder structure identical** to what is created by ``dataset.download()``, and we do not support downloading (i.e. writing) to remote filesystems, only reading from them. To prepare a new dataset for use with mirdata, we recommend running ``dataset.download()`` on a local filesystem and then manually transferring the folder contents to the remote filesystem.

.. admonition:: mp3 data
    :class: warning

    For a variety of reasons, mirdata doesn't support remote reading of mp3 files, so some datasets with mp3 audio may have tracks with unavailable attributes.

-----

----------------------
Advanced track options
----------------------

This section covers advanced options for working with tracks in datasets.
These methods provide flexible ways to access track data based on your specific research needs:

* Loading all tracks
* Loading tracks with track ID

Loading tracks
--------------

.. code-block:: python
    :linenos:

    # Initialize the dataset
    dataset = mirdata.initialize("orchset")

    # Load all tracks in the dataset as a dictionary with track_ids as keys and track objects as values
    tracks = dataset.load_tracks()

    # Iterate over the tracks
    for key, track in tracks.items():
        print(key, track.audio_path)

To load tracks from a dataset, use the ``load_tracks()`` method. This method returns a dictionary where the keys are track IDs and the values are track objects.

.. code-block:: python

    tracks = dataset.load_tracks()

This loads all tracks in the dataset, giving you access to their audio and annotations. You can then iterate over the ``tracks`` dictionary to access each track's audio path and other attributes:

.. code-block:: python

    for key, track in tracks.items():
        print(key, track.audio_path)

Loading tracks with track ID
----------------------------

.. code-block:: python
    :linenos:

    # Initialize the dataset
    dataset = mirdata.initialize("orchset")

    # Get the list of track IDs
    track_ids = dataset.track_ids

    # Loop over the track_ids list to directly access each track in the dataset
    for track_id in track_ids:
        print(track_id, dataset.track(track_id).audio_path)

To load tracks by track ID, first get the list of track IDs:

.. code-block:: python

    track_ids = dataset.track_ids

Next, loop over the ``track_ids`` list to directly access each track in the dataset:

.. code-block:: python

    for track_id in dataset.track_ids:
        print(track_id, dataset.track(track_id).audio_path)

---------

--------------
Advanced Usage
--------------

Using mirdata in your pipeline
------------------------------

This section shows how to use mirdata in your machine learning pipeline.

.. code-block:: python
    :linenos:

    import mir_eval
    import mirdata
    import numpy as np
    import sox

    def very_bad_melody_extractor(audio_path):
        duration = sox.file_info.duration(audio_path)
        time_stamps = np.arange(0, duration, 0.01)
        melody_f0 = np.random.uniform(low=80.0, high=800.0, size=time_stamps.shape)
        return time_stamps, melody_f0

    # Evaluate on the full dataset
    orchset = mirdata.initialize("orchset")
    orchset_scores = {}
    orchset_data = orchset.load_tracks()
    for track_id, track_data in orchset_data.items():
        est_times, est_freqs = very_bad_melody_extractor(track_data.audio_path_mono)

        ref_melody_data = track_data.melody
        ref_times = ref_melody_data.times
        ref_freqs = ref_melody_data.frequencies

        score = mir_eval.melody.evaluate(ref_times, ref_freqs, est_times, est_freqs)
        orchset_scores[track_id] = score

    # Split the results by composer and by instrumentation
    composer_scores = {}
    strings_no_strings_scores = {True: {}, False: {}}
    for track_id, track_data in orchset_data.items():
        if track_data.composer not in composer_scores:
            composer_scores[track_data.composer] = {}

        composer_scores[track_data.composer][track_id] = orchset_scores[track_id]
        strings_no_strings_scores[track_data.contains_strings][track_id] = orchset_scores[track_id]

Using mirdata with TensorFlow
-----------------------------

This example shows how to use mirdata with TensorFlow's ``tf.data.Dataset`` API to create a dataset generator for the Orchset dataset.

.. code-block:: python
    :linenos:

    import mirdata
    import numpy as np
    import tensorflow as tf

    def orchset_generator():
        # using the default data_home
        orchset = mirdata.initialize("orchset")
        track_ids = orchset.track_ids
        for track_id in track_ids:
            track = orchset.track(track_id)
            audio_signal, sample_rate = track.audio_mono
            yield {
                "audio": audio_signal.astype(np.float32),
                "sample_rate": sample_rate,
                "annotation": {
                    "times": track.melody.times.astype(np.float32),
                    "freqs": track.melody.frequencies.astype(np.float32),
                },
                "metadata": {"track_id": track.track_id},
            }

    dataset = tf.data.Dataset.from_generator(
        orchset_generator,
        {
            "audio": tf.float32,
            "sample_rate": tf.float32,
            "annotation": {"times": tf.float32, "freqs": tf.float32},
            "metadata": {"track_id": tf.string},
        },
    )

Using mirdata with PyTorch
--------------------------

This example shows how to use mirdata with PyTorch's ``torch.utils.data.Dataset`` and ``DataLoader`` to create a dataset generator.

.. code-block:: python
    :linenos:

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, Dataset

    import mirdata

    class MIRDataset(Dataset):
        def __init__(self, dataset_name: str):
            # Initialize the loader, download if required, and validate
            self.loader = mirdata.initialize(dataset_name)
            self.loader.download()
            self.loader.validate()
            # Get the length of the longest tracks + annotations in the dataset.
            # The torch DataLoader requires all tensors in a batch to have the
            # same dims, so we'll use this to pad items that are too short.
            self.longest_track = max(
                len(self.loader.track(tid).audio_mono[0]) for tid in self.loader.track_ids
            )
            self.longest_annotation = max(
                len(self.loader.track(tid).melody.times) for tid in self.loader.track_ids
            )

        @staticmethod
        def pad(to_pad: np.ndarray, pad_size: int) -> np.ndarray:
            """Right-pads a 1D array to `pad_size`."""
            return np.pad(
                to_pad, (0, pad_size - len(to_pad)), mode="constant", constant_values=0.0
            )

        def __len__(self) -> int:
            return len(self.loader.track_ids)

        def __getitem__(self, item: int) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
            # Unpack the current track
            track_id = self.loader.track_ids[item]
            track = self.loader.track(track_id)

            # Get the audio and annotations
            audio_signal, sample_rate = track.audio_mono
            times = track.melody.times
            frequencies = track.melody.frequencies

            # Right-pad everything to satisfy torch's requirement for equal dims
            audio_signal_padded = self.pad(audio_signal, self.longest_track)
            times_padded = self.pad(times, self.longest_annotation)
            frequencies_padded = self.pad(frequencies, self.longest_annotation)

            return (
                audio_signal_padded.astype(np.float32),
                times_padded.astype(np.float32),
                frequencies_padded.astype(np.float32),
            )

    dataloader = DataLoader(MIRDataset("orchset"), batch_size=2, shuffle=True, drop_last=False)
    for audio, times, freqs in dataloader:
        pass  # train your model on this data

Using mirdata in Google Colab
-----------------------------

Google Colab provides a browser-based Python environment with free GPU support, which is useful for exploring datasets quickly. You have two options for using a ``mirdata`` dataset in Colab: downloading the dataset directly in Google Colab, or accessing a dataset downloaded outside of Google Colab.

.. admonition:: Colab Example Notebook

    | For a Google Colab example notebook, check the link here: `Google Colab Example Notebook `_.
    | If you want to use the notebook, you can make a copy of it to your Google Drive by clicking ``File -> Save a copy in Drive``.
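When the same notebook is run both locally and in Colab, one convenient pattern is to detect the Colab runtime and choose ``data_home`` accordingly. This is a plain-Python sketch, not part of mirdata's API, and the ``/content`` path is illustrative:

```python
import importlib.util
import os

# The google.colab module is only importable inside a Colab runtime,
# so its presence is a reliable indicator of where we are running.
IN_COLAB = importlib.util.find_spec("google.colab") is not None

if IN_COLAB:
    # Colab's ephemeral disk lives under /content (illustrative path)
    data_home = "/content/mir_datasets/orchset"
else:
    # Fall back to mirdata's default location in the home directory
    data_home = os.path.expanduser("~/mir_datasets/orchset")

# dataset = mirdata.initialize("orchset", data_home=data_home)
```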