Version: v1.4.0

Predictions download formats: Speech recognition

Overview

Predictions in H2O Hydrogen Torch are downloaded as a zip file. The format and content of that file depend first on the problem type of the predictions and then on how the predictions were generated. There are two scenarios for how predictions are generated.

  • Scenario 1: Predictions from a completed experiment

    Predictions downloaded from a completed experiment on the View experiments card are packaged in a zip file. This zip file contains the following files:

    1. validation_predictions.csv: This is a structured dataframe in CSV format, presenting the final predictions for the provided validation dataframe.

    2. validation_raw_predictions.pkl: This is a Pickle file, which is essentially a pickled Python dictionary. It contains raw predictions for the provided validation dataframe.

      If the experiment included a test dataframe, H2O Hydrogen Torch also includes two additional files in the same zip file:

    3. test_predictions.csv: This is another structured dataframe in CSV format, displaying the final predictions for the provided test dataframe.

    4. test_raw_predictions.pkl: Similar to the validation set, this is a Pickle file with raw predictions for the provided test dataframe.

  • Scenario 2: Predictions generated by scoring on new data

    Predictions generated by scoring on new data through the H2O Hydrogen Torch UI (on the Predict data card) are downloaded in a zip file. This zip file includes the following files:

    1. test_predictions.csv: This is a structured dataframe in CSV format, showing the final predictions for the provided test dataframe.
    2. test_raw_predictions.pkl: This is a Pickle file, a pickled Python dictionary containing raw predictions for the provided test dataframe.
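The file lists in the two scenarios above can be checked against a downloaded archive. The following is a minimal sketch, assuming only the file names given above; the helper name `list_prediction_files` is hypothetical, and because the folder layout inside the zip file is not described here, matching is done on base names only.

```python
import io
import zipfile

# File names taken from the two scenarios described above.
VALIDATION_FILES = {"validation_predictions.csv", "validation_raw_predictions.pkl"}
TEST_FILES = {"test_predictions.csv", "test_raw_predictions.pkl"}


def list_prediction_files(zip_source):
    """Return the known prediction files found in the archive, grouped by split.

    zip_source may be a path or a binary file-like object. Only base names
    are compared, since the internal folder layout is an assumption.
    """
    with zipfile.ZipFile(zip_source) as archive:
        base_names = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return {
        "validation": sorted(VALIDATION_FILES & base_names),
        "test": sorted(TEST_FILES & base_names),
    }


# Illustration with an in-memory archive that mimics a Scenario 2 download
# (empty placeholder files, not real predictions).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("test_predictions.csv", "")
    archive.writestr("test_raw_predictions.pkl", b"")
demo.seek(0)
print(list_prediction_files(demo))
```

An empty "validation" list together with a populated "test" list suggests the archive came from scoring on new data (Scenario 2) rather than from a completed experiment.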

Formats

The Pickle file contains the following keys:

  • predictions
    • A 1-dimensional NumPy array that contains predicted transcriptions
  • [label_column]
    • A 1-dimensional NumPy array that contains label transcriptions. The name of the key is [label_column], where label_column refers to the name of the label column in the train dataframe.
      Note

      You can define the [label_column] under the Dataset settings section when building a speech recognition experiment.

  • [audio_column]
    • A 1-dimensional NumPy array that contains input audio names. The name of the key is [audio_column], where audio_column refers to the name of the audio column in the train dataframe.
      Note

      You can define the [audio_column] under the Dataset settings section when building a speech recognition experiment.
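Because the three arrays described above are aligned element by element, they can be combined into one record per audio file. The following is a minimal sketch, assuming all three keys are present in the dictionary; the helper names and the column names "transcript" and "audio_path" are hypothetical stand-ins for your own label and audio columns. Plain lists stand in for the NumPy arrays here, which iterate the same way.

```python
import pickle


def load_raw_predictions(path):
    """Load the pickled dictionary of raw predictions.

    Only unpickle files you trust: pickle.load can execute arbitrary code.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def to_rows(raw, label_column, audio_column):
    """Pair each audio name with its label and predicted transcription."""
    predictions = raw["predictions"]
    labels = raw[label_column]
    audio_names = raw[audio_column]
    if not (len(predictions) == len(labels) == len(audio_names)):
        raise ValueError("expected arrays of equal length")
    return [
        {"audio": audio, "label": label, "prediction": prediction}
        for audio, label, prediction in zip(audio_names, labels, predictions)
    ]


# Hypothetical dictionary shaped like the keys described above.
example = {
    "predictions": ["hello world", "good day"],
    "transcript": ["hello world", "good bye"],   # assumed label column name
    "audio_path": ["a.wav", "b.wav"],            # assumed audio column name
}
for row in to_rows(example, "transcript", "audio_path"):
    print(row)
```

With a real download, `to_rows(load_raw_predictions(path), label_column, audio_column)` would be the equivalent call, using the column names set under Dataset settings.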

Note

Open CSV and Pickle files with Python

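Using Python, you can open a CSV or Pickle file containing predictions as follows. The example uses files from a text classification experiment; for a speech recognition experiment, the dictionary keys are the ones listed under Formats.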

import pickle
import pandas as pd

# Final predictions as a dataframe
df = pd.read_csv('text_classification/validation_predictions.csv')

# Raw predictions as a Python dictionary
with open('text_classification/validation_raw_predictions.pkl', 'rb') as f:
    out = pickle.load(f)

print(out.keys())
dict_keys(['predictions', 'comment_text', 'labels'])

print(df.head(1))
id                comment_text                                       label_toxic  label_severe_toxic  label_obscene  label_threat  label_insult  label_identity_hate  fold  pred_label_toxic  pred_label_severe_toxic  pred_label_obscene  pred_label_threat  pred_label_insult  pred_label_identity_hate
000103f0d9cfb60f  D'aww! He matches this background colour I'm s...  0            0                   0              0             0             0                    0     0.00041           0.000168                 0.000328            0.000142           0.000247           0.000155
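Once the predicted and label transcriptions are loaded, one common way to compare them is the word error rate (WER): the word-level edit distance divided by the number of words in the label. The following is a minimal sketch of that general technique; it is an illustration only, and whether H2O Hydrogen Torch computes its own metric in exactly this way (for example, with respect to text normalization) is not stated here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("the cat sat", "the bat sat"))
```

Applied to the Pickle file, this function would be called once per pair of entries from the [label_column] and predictions arrays.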
