dataloaders

Overview

Basic datasets including MNIST, CIFAR, and Speech Commands will auto-download. Source code for these datamodules is in basic.py.

By default, data is downloaded to ./data/, where . is the top-level directory of this repository (e.g. state-spaces).

Advanced Usage

After downloading and preparing data, the paths can be configured in several ways.

  1. To download all data to a different folder, for example on a different disk, set the environment variable DATA_PATH, which defaults to ./data.

  2. For fine-grained control over the path of a particular dataset, set dataset.data_dir in the config. For example, if the LRA ListOps files are located in /home/lra/listops-1000/ instead of the default ./data/listops/, pass in +dataset.data_dir=/home/lra/listops-1000 on the command line or modify the config file directly.

  3. As a simple workaround, symlinks can be created, e.g. ln -s /home/lra/listops-1000 ./data/listops. All three options are illustrated below.
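
The sketch below illustrates all three options. The paths are placeholders, and the python -m train entrypoint is an assumption about how training is launched in this repository; substitute your actual command.

# Option 1: redirect all datasets by setting the environment variable
export DATA_PATH=/mnt/disk2/data

# Option 2: override a single dataset's directory at launch (Hydra override syntax)
python -m train <your other args> +dataset.data_dir=/home/lra/listops-1000

# Option 3: symlink already-prepared data into the default location
ln -s /home/lra/listops-1000 ./data/listops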

Data Preparation

Datasets that must be manually downloaded include LRA, WikiText-103, BIDMC, and other audio datasets used in SaShiMi.

By default, these should go under $DATA_PATH/, which defaults to ./data. The two are used interchangeably in the remainder of this README.

Long Range Arena (LRA)

LRA can be downloaded from its GitHub page (google-research/long-range-arena). These datasets should be organized as follows:

$DATA_PATH/
  pathfinder/
    pathfinder32/
    pathfinder64/
    pathfinder128/
    pathfinder256/
  aan/
  listops/

The other two datasets in the suite ("Image" i.e. grayscale sequential CIFAR-10; "Text" i.e. char-level IMDB sentiment classification) are both auto-downloaded.

The following sequence of commands prepares the LRA datasets in the default data path:

cd data
wget https://storage.googleapis.com/long-range-arena/lra_release.gz
tar xvf lra_release.gz
mv lra_release/lra_release/listops-1000 listops    # ListOps
mv lra_release/lra_release/tsv_data aan            # AAN retrieval task
mkdir pathfinder
mv lra_release/lra_release/pathfinder* pathfinder/
rm -r lra_release                                  # clean up the extracted archive

Speech Commands (SC)

The full SC dataset is auto-downloaded into ./data/SpeechCommands/. Specific subsets, such as the SC10 subset, can be toggled in the config or on the command line.

For the SC09 audio generation dataset, copy the digit subfolders of ./data/SpeechCommands into ./data/sc09/{zero,one,two,three,four,five,six,seven,eight,nine}. Also copy the ./data/SpeechCommands/{validation_list,test_list}.txt files into ./data/sc09/, as in the sketch below.
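
A minimal sketch of this copy, assuming the files live directly under ./data/SpeechCommands; depending on the torchaudio version they may instead sit under a nested subdirectory (e.g. speech_commands_v0.02/), in which case adjust SRC.

SRC=./data/SpeechCommands
mkdir -p ./data/sc09
for digit in zero one two three four five six seven eight nine; do
  cp -r $SRC/$digit ./data/sc09/   # copy each digit subclass
done
cp $SRC/{validation_list,test_list}.txt ./data/sc09/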

WikiText-103

The WikiText-103 language modeling dataset can be downloaded by the getdata.sh script from the Transformer-XL codebase. By default, the datamodule looks for it under $DATA_PATH/wt103.

cd {repo}/data
wget https://raw.githubusercontent.com/kimiyoung/transformer-xl/master/getdata.sh
bash getdata.sh             # downloads WikiText-103 (and other LM datasets) into ./data/
mv data/wikitext-103 wt103  # move it to the location the datamodule expects

A trained model checkpoint is available in the SaShiMi release. (Note that it uses a vanilla isotropic S4 model and is included there only for convenience.)

BIDMC

See prepare/bidmc/README.md.

Informer Forecasting Datasets

The ETTH, ETTM, Weather, and ECL experiments originally from the Informer paper can be downloaded as informer.zip and extracted inside ./data.
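
Assuming informer.zip has already been downloaded into ./data, extraction is a one-liner; verify afterwards that the extracted folders match the paths your config expects.

cd ./data
unzip informer.zip   # extracts the ETTH, ETTM, Weather, and ECL files here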

Other Audio

Instructions for other audio datasets used by the SaShiMi paper, including Beethoven and YoutubeMix, can be found in the SaShiMi README.

Adding a Dataset [WIP]

Datasets generally consist of two components.

  1. The first is the torch.utils.data.Dataset class which defines the raw data, or (data, target) pairs.

  2. The second is a SequenceDataset class, which defines how to set up the dataset as well as the dataloaders. This class is very similar to PyTorch Lightning's LightningDataModule and satisfies an interface described below.

Dataset classes are sometimes defined in the datasets/ subfolder, while datamodules are all defined in top-level files in this folder and imported by __init__.py.
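
Concretely, the layout of this folder looks roughly like the following (file names other than basic.py and __init__.py are illustrative):

dataloaders/
  __init__.py    # imports the datamodules defined in this folder
  basic.py       # datamodules for MNIST, CIFAR, Speech Commands, etc.
  ...            # other top-level datamodule files
  datasets/      # raw torch.utils.data.Dataset definitions (for some datasets)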

Basic examples of datamodules are provided in basic.py.

Some guidance for adding a custom audio dataset was provided in Issue #23.

SequenceDataset [WIP]

TODO:

  • Add documentation for adding a new dataset
  • Restructure folder so that each dataset is in its own file
  • Use Hydra to instantiate datamodules