Version: v1.4.0

Dataset format: Speech recognition

Dataset format

The data for a speech recognition experiment needs to be in a zip file (1) containing a CSV file (2) and an audio folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───audio_folder_name (3)
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       ...
Note

You can have multiple CSV files in the zip file that you can use as train, validation, and test dataframes:

  • A train CSV file needs to follow the format described above
  • A validation CSV file needs to follow the same format as a train CSV file
  • A test CSV file needs to follow the same format as a train CSV file, but does not require a label column
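
As a minimal sketch of the note above, the following standard-library Python writes three CSV files from one list of rows, with the test file omitting the label column. The file names (train.csv, validation.csv, test.csv), the column names file and transcript, and the helper write_split_csvs are illustrative assumptions, not names required by H2O Hydrogen Torch.

```python
import csv
from pathlib import Path


def write_split_csvs(rows, out_dir, n_valid=1, n_test=1):
    """Write train.csv, validation.csv, and test.csv into out_dir.

    rows is a list of dicts with the assumed keys "file" and "transcript".
    The last n_test rows go to the test CSV (label column dropped), the
    n_valid rows before them go to validation, and the rest go to train.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    n_train = len(rows) - n_valid - n_test
    if n_train <= 0:
        raise ValueError("not enough rows for the requested split")

    labeled = ["file", "transcript"]
    splits = {
        "train.csv": (rows[:n_train], labeled),
        "validation.csv": (rows[n_train:n_train + n_valid], labeled),
        # The test CSV keeps only the audio column; the label is optional there.
        "test.csv": (rows[n_train + n_valid:], ["file"]),
    }
    for name, (subset, columns) in splits.items():
        with open(out_dir / name, "w", newline="", encoding="utf-8") as f:
            # extrasaction="ignore" silently drops keys not listed in columns.
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(subset)
    return [out_dir / name for name in splits]
```

The split here is positional for simplicity; in practice you would likely shuffle the rows first or split by speaker.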
  1. The available dataset connectors require the data for a speech recognition experiment to be in a zip file.
    Note

    To learn how to upload your zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A CSV file containing the following columns:
    • An audio column containing the names of the audio files for the experiment, where each name includes the file's audio extension
      Note
      • To learn about supported audio extensions for a speech recognition experiment, see Supported audio extensions for speech recognition.
      • The names of the audio files do not specify the data directory (location of the audio in the zip file). You can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
      Tip

      For most supported speech architectures, use audio clips of up to 30 seconds. Attempting to train with longer speech samples may lead to:

      • Out-of-memory (OOM) issues even on high VRAM GPUs
      • Poor training performance
    • One label column containing the text transcript of the audio
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An audio folder that contains all the audio files specified in the audio column; H2O Hydrogen Torch uses the audio files in this folder to run the experiment.
    Note

    All audio file names need to specify an audio extension. To learn about supported audio extensions, see Supported audio extensions for speech recognition.
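
The layout described in this section can be checked before uploading. The following is a minimal sketch using only the Python standard library, not a tool that ships with H2O Hydrogen Torch. It assumes that the CSV file and the audio folder sit at the root of the zip file and that the CSV is UTF-8 encoded; the function name check_speech_zip and its parameters are hypothetical.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath


def check_speech_zip(zip_path, csv_name, audio_folder, audio_column, label_column=None):
    """Return a list of problems found in a speech recognition dataset zip.

    Assumes csv_name and audio_folder are located at the root of the zip.
    Pass label_column=None for a test CSV that has no label column.
    """
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
        if csv_name not in names:
            return [f"CSV file not found in zip: {csv_name}"]
        with zf.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text)
            rows = list(reader)
            columns = reader.fieldnames or []

    problems = []
    required = [audio_column] + ([label_column] if label_column else [])
    for column in required:
        if column not in columns:
            problems.append(f"missing column: {column}")
    if audio_column not in columns:
        return problems

    for row in rows:
        audio_name = row.get(audio_column) or ""
        # Every audio file name needs to specify an audio extension.
        if not PurePosixPath(audio_name).suffix:
            problems.append(f"no audio extension: {audio_name}")
        # Every listed audio file needs to exist in the audio folder.
        if f"{audio_folder}/{audio_name}" not in names:
            problems.append(f"not in audio folder: {audio_name}")
    return problems
```

An empty list means the checks passed. The sketch does not verify that an extension is one of the supported audio extensions, because that list is documented separately.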

Example

The minds14_US_speech_recognition.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a speech recognition problem. The structure of the zip file is:

minds14-US_speech_recognition.zip
│   └───annotations.csv
│   └───audio
│       └───0.wav
│       └───1.wav
│       └───2.wav
│       ...

The first three rows of the CSV file are:

file  | transcript                                                                | duration
0.wav | I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER [...]              | 11
1.wav | I'M WONDERING HOW TO SET UP A JOINT ACCOUNT WITH MY WIFE [...]            | 7
2.wav | HI I'D LIKE TO SET UP A JOINT ACCOUNT WIH MY PARTNER I'M NOT SEEING [...] | 24
Note
  • The duration column is not required when formatting your dataset for a speech recognition experiment
  • In this example, the file names in the file column do not include the data directory. The data directory therefore needs to be specified when uploading the dataset, by selecting the audio folder as the value of the Data folder setting. For more information, see Import dataset settings.
  • To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Demo (preprocessed) datasets.
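
An optional duration column like the one in this example can be derived from the audio files themselves, which also makes it easy to spot clips longer than the 30 seconds recommended earlier. The sketch below is an illustration that uses Python's standard wave module, so it only handles uncompressed PCM WAV files; the function names are hypothetical, and this is not necessarily how the example dataset's duration values were produced.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        # Duration is the number of audio frames divided by frames per second.
        return wav.getnframes() / wav.getframerate()


def find_long_clips(paths, max_seconds=30.0):
    """Return (path, duration) pairs for clips longer than max_seconds."""
    durations = ((p, wav_duration_seconds(p)) for p in paths)
    return [(p, d) for p, d in durations if d > max_seconds]
```

Clips returned by find_long_clips are candidates for trimming or splitting before they are added to the audio folder.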
