Version: v1.3.0

Predictions: File formats

Overview

Downloaded predictions come in one of two formats, depending on how you generated them:

Predictions downloaded from a completed experiment on the View experiments card are in a .zip file containing the following files:
- validation_predictions.csv
  - The .csv file is a structured dataframe with final predictions for the provided validation dataframe.
- validation_raw_predictions.pkl
  - The .pkl file is a pickled Python dictionary with raw predictions for the provided validation dataframe.
- If the experiment contained a test dataframe, H2O Hydrogen Torch also includes the following two files in the .zip file:
  - test_predictions.csv
    - The .csv file is a structured dataframe with final predictions for the provided test dataframe.
  - test_raw_predictions.pkl
    - The .pkl file is a pickled Python dictionary with raw predictions for the provided test dataframe.
Predictions generated by scoring on new data (through the H2O Hydrogen Torch UI) are downloaded in a .zip file containing the following files:
- test_predictions.csv
  - The .csv file is a structured dataframe with final predictions for the provided test dataframe.
- test_raw_predictions.pkl
  - The .pkl file is a pickled Python dictionary with raw predictions for the provided test dataframe.
Note
See Download a prediction to learn how to download generated predictions through the H2O Hydrogen Torch UI.
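
As a rough illustration of the archive layout described above, the following sketch builds a small stand-in .zip in memory and reads both files back using only the Python standard library. Only the file names come from the list above; the file contents, the column names, and the read_prediction_zip helper are made up for the example.

```python
import csv
import io
import pickle
import zipfile


def read_prediction_zip(zip_bytes, split="validation"):
    """Return (csv_rows, raw_dict) for one split of a predictions archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(f"{split}_predictions.csv") as f:
            rows = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8")))
        with archive.open(f"{split}_raw_predictions.pkl") as f:
            raw = pickle.load(f)
    return rows, raw


# Build a tiny stand-in archive (hypothetical contents) to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("validation_predictions.csv", "image,pred_label\na.jpg,0.7\n")
    archive.writestr(
        "validation_raw_predictions.pkl", pickle.dumps({"labels": ["label"]})
    )

rows, raw = read_prediction_zip(buffer.getvalue())
print(rows[0]["pred_label"], raw["labels"])
```

Unpickling can execute arbitrary code, so only load .pkl files from a source you trust.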

Image

Image regression

For image regression, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

predictions
- A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of labels.
labels
- A 1-dimensional NumPy array that contains label names.
{image_column}
- A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.

Note

You can define the {image_column} under the Dataset settings section when building an image regression experiment.

All the N columns in the train dataframe.
A column named pred_{label_column_name} that contains predictions for the label column, where label_column_name refers to the label column name found in the train dataframe.

Note

For multi-label image regression experiments, more than one pred_{label_column_name} column is in the .csv, one for the predicted value of each of the label columns from the train dataframe.
You can define the label_column_name(s) under the Dataset settings section when building an image regression experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
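
The relationship between the .pkl keys and the pred_{label_column_name} columns can be sketched as follows. This is an illustration on a made-up dictionary, not output from a real experiment; it assumes that column j of predictions corresponds to labels[j], and the key name "image" and the label names are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the unpickled dictionary of a two-label experiment.
raw = {
    "predictions": np.array([[1.5, 0.2], [3.0, 0.9], [2.1, 0.4]]),  # shape (n, m)
    "labels": np.array(["height", "width"]),
    "image": np.array(["a.jpg", "b.jpg", "c.jpg"]),  # key named after the image column
}


def predictions_by_label(raw):
    """Split the (n, m) predictions array into one 1-D array per label.

    Assumes column j of `predictions` belongs to `labels[j]`.
    """
    return {
        f"pred_{label}": raw["predictions"][:, j]
        for j, label in enumerate(raw["labels"])
    }


columns = predictions_by_label(raw)
print(sorted(columns))
```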

3D image regression

For 3D image regression, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

predictions
- A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of labels.
labels
- A 1-dimensional NumPy array that contains label names.
{image_column}
- A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.

Note

You can define the {image_column} under the Dataset settings section when building a 3D image regression experiment.

All the N columns in the train dataframe.
A column named pred_{label_column_name} that contains predictions for the label column, where label_column_name refers to the label column name found in the train dataframe.

Note

For multi-label 3D image regression experiments, more than one pred_{label_column_name} column is in the .csv, one for the predicted value of each of the label columns from the train dataframe.
You can define the label_column_name(s) under the Dataset settings section when building a 3D image regression experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.

Image classification

For image classification, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

predictions
- A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of classes.
labels
- A 1-dimensional NumPy array that contains label names.
{image_column}
- A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.

Note

You can define the {image_column} under the Dataset settings section when building an image classification experiment.

All the N columns in the train dataframe.
A column named pred_{label_column_name} that contains probabilities for the label column, where label_column_name refers to the label column name found in the train dataframe.

Note

For multi-label and multi-class image classification experiments, more than one pred_{label_column_name} column is in the .csv, one for the predicted probability of each of the label columns from the train dataframe.
You can define the label_column_name(s) under the Dataset settings section when building an image classification experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
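
For a single-label, multi-class experiment, the most probable class per observation can be read off the predictions array as sketched below. The dictionary is a made-up stand-in, the key name "image" is hypothetical, and the sketch assumes that column j of predictions corresponds to labels[j]. For multi-label experiments, thresholding each column separately would be the more natural choice.

```python
import numpy as np

# Hypothetical stand-in for the unpickled dictionary of a three-class experiment.
raw = {
    "predictions": np.array([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]),  # shape (n, m)
    "labels": np.array(["cat", "dog", "bird"]),
    "image": np.array(["a.jpg", "b.jpg"]),
}


def top_class(raw):
    """Return the most probable label name and its probability per observation."""
    best = raw["predictions"].argmax(axis=1)
    return raw["labels"][best], raw["predictions"].max(axis=1)


names, probabilities = top_class(raw)
for image, name, probability in zip(raw["image"], names, probabilities):
    print(image, name, probability)
```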

3D image classification

For 3D image classification, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

predictions
- A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of classes.
labels
- A 1-dimensional NumPy array that contains label names.
{image_column}
- A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.

Note

You can define the {image_column} under the Dataset settings section when building a 3D image classification experiment.

All the N columns in the train dataframe.
A column named pred_{label_column_name} that contains probabilities for the label column, where label_column_name refers to the label column name found in the train dataframe.

Note

For multi-label and multi-class 3D image classification experiments, more than one pred_{label_column_name} column is in the .csv, one for the predicted probability of each of the label columns from the train dataframe.
You can define the label_column_name(s) under the Dataset settings section when building a 3D image classification experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.

Image metric learning

For image metric learning, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

embeddings
- A 2-dimensional NumPy array that contains image embeddings. The shape of the array is (n, m), where n represents the number of observations, while m represents the {embedding_size} where {embedding_size} refers to the selected embedding size value used during the experiment. Images with nearby embedding vectors are predicted to have similar content.
  Note
  You can define the {embedding_size} under the Architecture settings section when building an image metric learning experiment.
cosine_similarities
- A 2-dimensional NumPy array that contains cosine similarities between validation/test images. The shape of the array is as follows: number_of_observations x {top_k_similar} where {top_k_similar} refers to the selected Top K Similar value used during the experiment.
similar_images
- A 2-dimensional NumPy array that contains indices of similar validation/test images. The shape of the array is as follows: number_of_observations x {top_k_similar} where {top_k_similar} refers to the selected Top K Similar value used during the experiment.
{image_column}
- A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.

Note

You can define the {image_column} under the Dataset settings section when building an image metric learning experiment.

All the N columns found in the train dataframe.
Three columns named top_{k}_similar_image, where k ranges from 1 to 3, each containing the name of the k-th most similar image to the input image.
Three columns named top_{k}_cosine_similarity, where k ranges from 1 to 3, each containing the cosine similarity between the input image and the corresponding top_{k}_similar_image.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
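
The index-based similar_images array can be turned into image names as sketched below. The data is made up, the key name "image" is hypothetical, and the sketch assumes that each index refers to a position in the {image_column} array of the same dictionary.

```python
import numpy as np

# Hypothetical stand-in for the unpickled dictionary (Top K Similar = 2).
raw = {
    "image": np.array(["a.jpg", "b.jpg", "c.jpg"]),
    "similar_images": np.array([[1, 2], [0, 2], [1, 0]]),  # (n, top_k_similar)
    "cosine_similarities": np.array([[0.9, 0.3], [0.9, 0.5], [0.5, 0.3]]),
}


def similar_image_names(raw, image_key="image"):
    """Map the (n, k) index array to an (n, k) array of image names.

    Assumes the indices point into raw[image_key].
    """
    return raw[image_key][raw["similar_images"]]


neighbours = similar_image_names(raw)
print(neighbours[0], raw["cosine_similarities"][0])
```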

Image object detection

For image object detection, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

boxes
- A 3-dimensional NumPy array that contains predicted bounding boxes. The shape of the array is as follows: number_of_observations x number_of_bounding_boxes x 4.
  Note
  The number_of_bounding_boxes is limited to 100 most confident boxes, all in the format of: (x_min, y_min, x_max, y_max).
confidences
- A 2-dimensional NumPy array that contains bounding box confidences (from 0 to 1). The shape of the array is (n, m), where n represents the number of observations, while m represents the number of bounding boxes.
classes
- A 2-dimensional NumPy array that contains class names of bounding boxes. The shape of the array is (n, m), where n represents the number of observations while m represents the number of bounding boxes.
{image_column}
- A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.
  Note
  You can define the {image_column} under the Dataset settings section when building an image object detection experiment.

A column named {image_column_name} where image_column_name refers to the image column name in the train dataframe.
Note
You can define the image_column_name under the Dataset settings section when building an image object detection experiment.
A column named x_min containing the minimum x coordinates for the bounding boxes.
A column named y_min containing the minimum y coordinates for the bounding boxes.
A column named x_max containing the maximum x coordinates for the bounding boxes.
A column named y_max containing the maximum y coordinates for the bounding boxes.
A column named confidence containing the confidence scores of all the corresponding bounding boxes. Only bounding boxes with a confidence score larger than the {probability_threshold} are considered.
Note
You can define the {probability_threshold} under the Validation settings section when building an image object detection experiment.
A column named {class_name_column} containing the class names of all the corresponding bounding boxes, where class_name_column refers to the name of a column in the train dataframe referring to the class names.

Note

To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
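
One way to relate the .pkl arrays to the per-box rows of the .csv is to keep only the boxes above a confidence threshold and flatten them into one record per box. The sketch below does this on made-up data; the key name "image", the output field names, and the threshold value are assumptions, and it is not necessarily how H2O Hydrogen Torch builds the .csv itself.

```python
import numpy as np

# Hypothetical stand-in for the unpickled dictionary (2 images, 2 boxes each;
# real files hold up to 100 boxes per image).
raw = {
    "image": np.array(["a.jpg", "b.jpg"]),
    "boxes": np.array(
        [[[10, 20, 50, 60], [0, 0, 5, 5]], [[30, 30, 40, 45], [1, 2, 3, 4]]],
        dtype=float,
    ),  # (n, boxes, 4) as (x_min, y_min, x_max, y_max)
    "confidences": np.array([[0.9, 0.2], [0.6, 0.55]]),
    "classes": np.array([["car", "car"], ["person", "car"]]),
}


def detections_to_rows(raw, image_key, threshold):
    """Flatten boxes with confidence above `threshold` into one dict per box."""
    rows = []
    per_image = zip(raw[image_key], raw["boxes"], raw["confidences"], raw["classes"])
    for name, boxes, confidences, classes in per_image:
        keep = confidences > threshold  # boolean mask over this image's boxes
        for box, confidence, class_name in zip(
            boxes[keep], confidences[keep], classes[keep]
        ):
            x_min, y_min, x_max, y_max = (float(v) for v in box)
            rows.append(
                {
                    "image": str(name),
                    "x_min": x_min,
                    "y_min": y_min,
                    "x_max": x_max,
                    "y_max": y_max,
                    "confidence": float(confidence),
                    "class_name": str(class_name),
                }
            )
    return rows


rows = detections_to_rows(raw, image_key="image", threshold=0.5)
print(len(rows), rows[0])
```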

Image semantic segmentation

For image semantic segmentation, the validation and test .csv and .pkl prediction files, for the most part, have similar formats; differences are noted below:

`.pkl` file keys
`.csv` file columns

masks
- A 4-dimensional NumPy array that contains pixel-wise probabilities. The shape of the array is as follows: number_of_observations x number_of_classes x {image_height} x {image_width}.
  Note
  You can define the {image_height} and {image_width} under the Image settings section when building an image semantic segmentation experiment.
original_image_shapes
- A 2-dimensional NumPy array that contains shapes of the original input images. The shape of the array is as follows: number_of_observations x 2, where the 2nd dimension contains original_image_height and original_image_width of the corresponding input image.
rle_predictions
- A 2-dimensional NumPy array that contains RLE-encoded predictions for each class. The shape of the array is as follows: number_of_observations x number_of_classes. You can use RLE predictions with corresponding original_image_shapes to decode RLE-encoded strings to binary masks.
class_names
- The class_names refers to a list containing all the class names. The class names follow the order of the class names in the 4-dimensional NumPy masks array.
{image_column}
- A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.

Note

You can define the {image_column} under the Dataset settings section when building an image semantic segmentation experiment.

All the N columns in the train dataframe.
Note
The .csv file repeats each original row in the train dataframe X times, where X refers to the {number_of_classes}. Each repeated row contains the run-length-encoded mask prediction for a different class.

In the case that the train dataframe contains a {class_name_column} and {rle_mask_column}:

A column named {class_name_column} containing input class names, where class_name_column refers to the name of the column in the train dataframe that refers to the class names.
A column named {rle_mask_column} containing all the true Run-length encodings (RLEs) in the train dataframe.

Note

You can define the {class_name_column} and {rle_mask_column} under the Dataset settings section when building an image semantic segmentation experiment.

In the case that the test dataframe does not contain a {class_name_column} or {rle_mask_column} or both:

The first column in the .csv file has the name class_id, and no column with true Run-length encodings (RLEs).
A column with the prefix pred_ followed by the suffix {rle_mask_column} that contains the predicted Run-length encodings (RLEs) of all the predictions, where rle_mask_column refers to the name of the Run-length encodings mask column in the train dataframe.

Note

If there is no {rle_mask_column} in the train dataframe, this column is named pred_mask.
If no mask is predicted, then the column value is an empty string.
You can define the {rle_mask_column} under the Dataset settings section when building an image semantic segmentation experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
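
Decoding an RLE string into a binary mask with the help of original_image_shapes could look roughly like the sketch below. The exact RLE convention is not specified in this section, so the sketch assumes the widely used "start length start length ..." encoding with 1-based starts over the flattened image, and exposes the flattening order as a parameter. The rle_decode helper, the class names, and the data are hypothetical; verify the convention against your own files before relying on it.

```python
import numpy as np


def rle_decode(rle, shape, order="F"):
    """Decode a 'start length ...' RLE string into a (height, width) binary mask.

    Assumptions (not confirmed by the documentation): starts are 1-based and
    refer to the image flattened in `order` ("F" = column-major, "C" = row-major).
    An empty string yields an all-zero mask.
    """
    height, width = (int(s) for s in shape)
    flat = np.zeros(height * width, dtype=np.uint8)
    values = [int(token) for token in rle.split()]
    for start, length in zip(values[0::2], values[1::2]):
        flat[start - 1 : start - 1 + length] = 1
    return flat.reshape((height, width), order=order)


# Hypothetical stand-in for the unpickled dictionary (1 image, 2 classes).
raw = {
    "rle_predictions": np.array([["2 3 8 1", ""]], dtype=object),
    "original_image_shapes": np.array([[3, 3]]),
    "class_names": ["road", "sky"],
}

road_mask = rle_decode(raw["rle_predictions"][0][0], raw["original_image_shapes"][0])
sky_mask = rle_decode(raw["rle_predictions"][0][1], raw["original_image_shapes"][0])
print(road_mask)
```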

3D image semantic segmentation

For 3D image semantic segmentation, the validation and test .csv and .pkl prediction files, for the most part, have similar formats; differences are noted below:

`.pkl` file keys
`.csv` file columns

masks
- A 5-dimensional NumPy array that contains pixel-wise probabilities. The shape of the array is as follows: number_of_observations x number_of_classes x {image_height} x {image_width} x {image_depth}.
  Note
  You can define the {image_height}, {image_width} and {image_depth} under the Image settings section when building a 3D image semantic segmentation experiment.
original_image_shapes
- A 2-dimensional NumPy array that contains shapes of the original input images. The shape of the array is as follows: number_of_observations x 3, where the 2nd dimension contains original_image_height, original_image_width and original_image_depth of the corresponding input image.
rle_predictions
- A 2-dimensional NumPy array that contains RLE-encoded predictions for each class. The shape of the array is as follows: number_of_observations x number_of_classes. You can use RLE predictions with corresponding original_image_shapes to decode RLE-encoded strings to binary masks.
class_names
- The class_names refers to a list containing all the class names. The class names follow the order of the class names in the 5-dimensional NumPy masks array.
{image_column}
- A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.

Note

You can define the {image_column} under the Dataset settings section when building a 3D image semantic segmentation experiment.

All the N columns in the train dataframe.
Note
The .csv file repeats each original row in the train dataframe X times, where X refers to the {number_of_classes}. Each repeated row contains the run-length-encoded mask prediction for a different class.

In the case that the train dataframe contains a {class_name_column} and {rle_mask_column}:

A column named {class_name_column} that contains input class names, where class_name_column refers to the name of the column in the train dataframe that refers to the class names.
A column named {rle_mask_column} that contains all the true Run-length encodings (RLEs) in the train dataframe.

Note

You can define the {class_name_column} and {rle_mask_column} under the Dataset settings section when building a 3D image semantic segmentation experiment.

In the case that the test dataframe does not contain a {class_name_column} or {rle_mask_column} or both:

The first column in the .csv file has the name class_id, and no column with true Run-length encodings (RLEs).
A column with the prefix pred_ followed by the suffix {rle_mask_column} that contains the predicted Run-length encodings (RLEs) of all the predictions, where rle_mask_column refers to the name of the Run-length encodings mask column in the train dataframe.

Note

If there is no {rle_mask_column} in the train dataframe, this column is named pred_mask.
If no mask is predicted, then the column value is an empty string.
You can define the {rle_mask_column} under the Dataset settings section when building a 3D image semantic segmentation experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
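
The 5-dimensional masks array can be reduced to a single class index per voxel by taking the most probable class along the class axis, as sketched below on made-up data. This assumes the classes are mutually exclusive, which may not hold for every experiment; if they are not, thresholding each class channel separately is the alternative. The class names are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the masks key: (n, classes, height, width, depth).
class_names = ["background", "liver", "tumor"]
masks = np.zeros((1, 3, 2, 2, 2))
masks[:, 0] = 0.5            # background is moderately likely everywhere
masks[0, 2, 1, 1, 0] = 0.9   # one voxel where "tumor" is the most probable class


def dominant_class_map(masks):
    """Return an (n, height, width, depth) array of class indices."""
    return masks.argmax(axis=1)


class_map = dominant_class_map(masks)

# Number of voxels assigned to each class, in class_names order.
voxel_counts = np.bincount(class_map.ravel(), minlength=len(class_names))
print(dict(zip(class_names, voxel_counts.tolist())))
```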

Image instance segmentation

For image instance segmentation, the validation and test .csv and .pkl prediction files, for the most part, have similar formats; differences are noted below:

`.pkl` file keys
`.csv` file columns

raw_probabilities
- A 4-dimensional NumPy array that contains pixel-wise probabilities. The shape of the array is as follows: number_of_observations x (number_of_classes + 2) x {image_height} x {image_width}. Two additional channels (+ 2) are added to the number_of_classes corresponding to individual instance borders and borders between instances.
  Note
  You can define the {image_height} and {image_width} under the Image settings section when building an image instance segmentation experiment.
instance_predictions
- A list of 3-dimensional NumPy arrays containing instance predictions, where each instance is represented as a separate integer starting from 1 for each class. The length of the list is number_of_observations and the shape of each array is as follows: original_image_height x original_image_width x number_of_classes, where original_image_height and original_image_width are height and width of the corresponding input image.
confidences
- A list of dictionaries containing prediction confidences for each instance; the length of the list is N (number_of_observations). Each element of the list is a dictionary with keys representing the class names and values representing the confidences for each instance ID (starting from 1).
class_names
- The class_names refers to a list containing all the class names. The class names follow the order of the class names in the 4-dimensional NumPy raw_probabilities array and in the 3-dimensional NumPy arrays of the instance_predictions list.
{image_column}
- A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.

Note

You can define the {image_column} under the Dataset settings section when building an image instance segmentation experiment.

A column named {image_column_name} where image_column_name refers to the image column name in the train dataframe.
Note
You can define the image_column_name under the Dataset settings section when building an image instance segmentation experiment.
A column named {class_name_column} containing the class names for each instance predicted, where class_name_column refers to the name of the column in the train dataframe that refers to the class names.
A column named instance_rle containing the run-length-encoded (RLE) mask for each instance.
A column named confidence containing the confidence scores for each instance.

Note

To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
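
Counting the predicted instances per class from one entry of instance_predictions could be sketched as follows. The data and class names are made up, and the sketch assumes that the value 0 marks pixels that belong to no instance, which is not stated explicitly in this section.

```python
import numpy as np

# Hypothetical stand-in for one element of instance_predictions:
# (original_image_height, original_image_width, number_of_classes).
class_names = ["cell", "nucleus"]
instance_map = np.zeros((3, 4, 2), dtype=int)
instance_map[0, 0:2, 0] = 1  # first "cell" instance
instance_map[2, 2:4, 0] = 2  # second "cell" instance
instance_map[1, 1, 1] = 1    # single "nucleus" instance


def count_instances(instance_map, class_names):
    """Count distinct positive instance IDs in each class channel.

    Assumes 0 means "no instance" (background).
    """
    counts = {}
    for channel, name in enumerate(class_names):
        ids = np.unique(instance_map[:, :, channel])
        counts[name] = int((ids > 0).sum())
    return counts


counts = count_instances(instance_map, class_names)
print(counts)
```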

Text

Text regression

For text regression, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

predictions
- A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of labels.
labels
- A 1-dimensional NumPy array that contains label names.
{text_column}
- A 1-dimensional NumPy array that contains input texts. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.

Note

You can define the {text_column} under the Dataset settings section when building a text regression experiment.

All the N columns in the train dataframe.
A column named pred_{label_column_name} containing predictions for the label column, where label_column_name refers to the label column name found in the train dataframe.
Note
- For multi-label text regression experiments, more than one pred_{label_column_name} column is in the .csv, one for the predicted value of each of the label columns from the train dataframe.
- You can define the label_column_name(s) under the Dataset settings section when building a text regression experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.

Text classification

For text classification, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

predictions
- A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of classes.
labels
- A 1-dimensional NumPy array that contains label names.
{text_column}
- A 1-dimensional NumPy array that contains input texts. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.

Note

You can define the {text_column} under the Dataset settings section when building a text classification experiment.

All the N columns in the train dataframe.
A column named pred_{label_column_name} containing probabilities for the label column, label_column_name refers to the label column name found in the train dataframe.

Note

For multi-label and multi-class text classification experiments, more than one pred_{label_column_name} column is in the .csv referring to the predicted probability for each of the label columns from the train dataframe.
You can define the label_column_name(s) under the Dataset settings section when building a text classification experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.

Text sequence to sequence

For text sequence to sequence, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

{text_column}
- A 1-dimensional NumPy array that contains the input text observed. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.
  Note
  You can define the {text_column} under the Dataset settings section when building a text sequence to sequence experiment.
predicted_text
- A 1-dimensional NumPy array that contains predictions in a string format for the input text column in the train dataframe.

All the N columns found in the train dataframe.
A column with a prefix pred_ followed by {name_of_the_output_text} that contains predictions for the output text ({label_columns}).
Note
You can define the {label_columns} under the Dataset settings section when building a text sequence to sequence experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.

Text span predictions

For text span predictions, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

{question_column}
- A 1-dimensional NumPy array that contains the input question text. The name of the key is {question_column} where question_column refers to the name of the question text column in the train dataframe.
  Note
  You can define the {question_column} under the Dataset settings section when building a text span predictions experiment.
{context_column}
- A 1-dimensional NumPy array that contains the input context text. The name of the key is {context_column} where context_column refers to the name of the context text column in the train dataframe.
  Note
  You can define the {context_column} under the Dataset settings section when building a text span predictions experiment.
predictions
- A 1-dimensional NumPy array that contains predictions in a string format for every input question. The predicted string is a substring of the corresponding context text.
predicted_{answer_column_name}_top_k
- A 1-dimensional NumPy array that contains top-K predictions (in a string form) for the answer column, where k represents the number of predictions the model generated for the answer column.
predicted_answers_score
- A 2-dimensional NumPy array that contains (unnormalized) scores for each predicted answer. Higher scores indicate higher confidence in the prediction.
predicted_answers_null_score
- A 2-dimensional NumPy array that contains the (unnormalized) score that the model assigns to the question having no answer. The difference between the answer and the null scores can be seen as a measure of confidence in the answer.
  Note
  The null score for each answer to a given question may differ. This difference can happen if the model splits the context into multiple spans and predicts each span individually. The null score corresponds to the span where the predicted answer is found.

All the N columns found in the train dataframe.
A column named pred_{answer_column_name} containing predictions (in a string form) for the answer column, where the answer_column_name refers to the name of the answer column found in the train dataframe.
A set of N columns, with the following name convention: pred_{answer_column_name}_top_{k}.
- N refers to the number of answers the model generated for the answer column
  Note
  The number of answers the model generates is determined by the number specified in the following dataset setting: Number of predicted answers.
- answer_column_name refers to the name of the answer column found in the train dataframe
- k refers to the rank of the prediction for the answer column, where k can represent a number between 1 to N, where N refers to the specified number of predictions to generate. Generated predictions are ranked from highest to lowest, where 1 refers to the highest prediction.
  Note
  You can specify the number of predictions for the answer column using the following dataset setting: Number of predicted answers.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
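
The difference between the answer scores and the null scores can be computed directly from the two arrays, as sketched below on made-up numbers. The sketch assumes both arrays have the shape (number_of_observations, number_of_predicted_answers) and that column 0 holds the top-ranked answer. Treating a negative margin as "probably unanswerable" is a heuristic chosen for the example, not documented product behavior.

```python
import numpy as np

# Hypothetical stand-in for the unpickled dictionary (2 questions, 2 answers each).
raw = {
    "predicted_answers_score": np.array([[7.2, 3.1], [1.0, 0.4]]),
    "predicted_answers_null_score": np.array([[0.5, 0.5], [2.5, 2.0]]),
}


def answer_margins(raw):
    """Return answer score minus null score for every predicted answer."""
    return raw["predicted_answers_score"] - raw["predicted_answers_null_score"]


margins = answer_margins(raw)

# Heuristic: flag questions whose top-ranked answer scores below the null score.
likely_unanswerable = margins[:, 0] < 0
print(margins, likely_unanswerable)
```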

Text token classification

For text token classification, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

probabilities
- A list of 2-dimensional NumPy arrays that contains word-level probabilities for each token, where the length of the list is N (number_of_observations). The shape of each array in the list is as follows: text_length x number_of_classes, where text_length is the number of words in the input text.
predictions
- A 1-dimensional NumPy array that contains predictions in the form of a list of predicted classes for each input word found in the input text.
labels
- A 1-dimensional NumPy array that contains label names.
{text_column}
- A 1-dimensional NumPy array that contains input texts. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.

All the N columns found in the train dataframe.
A column named pred_{label_column_name} containing predictions for the {label_column_name} column in the form of a space-separated string of predicted classes, one for each input word.
Note
You can define the label_column_name under the Dataset settings section when building a text token classification experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
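
The word-level probabilities can be converted into space-separated class strings like those in the .csv, as sketched below. The data is made up, and the sketch assumes that column j of each probabilities array corresponds to labels[j]; taking the most probable class per word may also differ from how the stored predictions were produced.

```python
import numpy as np

# Hypothetical stand-in for the unpickled dictionary (2 texts, 2 classes).
raw = {
    "probabilities": [
        np.array([[0.9, 0.1], [0.2, 0.8]]),  # text with 2 words
        np.array([[0.6, 0.4]]),              # text with 1 word
    ],
    "labels": np.array(["O", "PER"]),
}


def predicted_tag_strings(raw):
    """Pick the most probable class per word and join the names with spaces."""
    return [
        " ".join(raw["labels"][word_probabilities.argmax(axis=1)])
        for word_probabilities in raw["probabilities"]
    ]


tag_strings = predicted_tag_strings(raw)
print(tag_strings)
```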

Text metric learning

For text metric learning, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

embeddings
- A 2-dimensional NumPy array that contains text embeddings. The shape of the array is (n, m), where n represents the number of observations, while m represents the embedding size. Texts with similar embeddings are predicted to have a similar semantic meaning.
  Note
  You can define the {embedding_size} under the Architecture settings section when building a text metric learning experiment.
cosine_similarities
- A 2-dimensional NumPy array that contains cosine similarities between validation (test) texts. The shape of the array is as follows: number_of_observations x {top_k_similar} where {top_k_similar} refers to the selected Top K Similar value used during the experiment.
similar_texts
- A 2-dimensional NumPy array that contains indices of similar validation (test) texts. The shape of the array is as follows: number_of_observations x {top_k_similar} where {top_k_similar} refers to the selected Top K Similar value used during the experiment.
{text_column}
- A 1-dimensional NumPy array that contains texts from the original text column in the train dataframe. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.
  Note
  You can define the {text_column} under the Dataset settings section when building a text metric learning experiment.

All the N columns found in the train dataframe.
Three columns named top_{k}_similar_text, where k ranges from 1 to 3, each containing the k-th most similar text to the input text.
Three columns named top_{k}_cosine_similarity, where k ranges from 1 to 3, each containing the cosine similarity between the input text and the corresponding top_{k}_similar_text.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
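
As an illustration of how the keys above relate to each other, here is a minimal sketch that looks up the most similar texts for one observation. The dictionary is synthetic stand-in data, the key name text is a hypothetical {text_column}, and the sketch assumes that the indices in similar_texts refer to row positions in the same validation (test) dataframe; verify this against your own files.

```python
import numpy as np

# Synthetic stand-in for the dictionary loaded from a *_raw_predictions.pkl file.
# "text" is a hypothetical {text_column} name; top_k_similar is 2 here.
out = {
    "embeddings": np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]),
    "cosine_similarities": np.array([[0.8, 0.0], [0.8, 0.6], [0.6, 0.0]]),
    "similar_texts": np.array([[1, 2], [0, 2], [1, 0]]),
    "text": np.array(["cheap flights", "low-cost airfare", "apple pie recipe"]),
}


def most_similar(i):
    """Return (text, cosine similarity) pairs for the i-th observation."""
    # Assumption: similar_texts holds row indices into the same dataframe.
    return [
        (str(out["text"][j]), float(s))
        for j, s in zip(out["similar_texts"][i], out["cosine_similarities"][i])
    ]


for text, score in most_similar(0):
    print(f"{score:.2f}  {text}")
```

Each row of similar_texts and cosine_similarities has {top_k_similar} entries, so the helper returns that many pairs per observation.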

Audio

Audio regression

For audio regression, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

predictions
- A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of labels.
labels
- A 1-dimensional NumPy array that contains label names.
{audio_column}
- A 1-dimensional NumPy array that contains input audio names. The name of the key is {audio_column} where audio_column refers to the name of the audio column in the train dataframe.
  Note
  You can define the {audio_column} under the Dataset settings section when building an audio regression experiment.

All the N columns in the train dataframe.
A column named pred_{label_column_name} containing predictions for the label column, where label_column_name refers to the label column name found in the train dataframe.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
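
The following sketch shows one way the raw predictions could be turned into a table that resembles the pred_{label_column_name} columns of the .csv file. The data is synthetic, the key name audio is a hypothetical {audio_column}, and the sketch assumes that the j-th column of predictions corresponds to the j-th entry of labels.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dictionary loaded from a *_raw_predictions.pkl file.
# "audio" is a hypothetical {audio_column} name.
out = {
    "predictions": np.array([[0.2, 3.1], [0.7, 2.4], [0.5, 2.9]]),  # shape (n, m)
    "labels": np.array(["loudness", "pitch"]),                      # m label names
    "audio": np.array(["a.wav", "b.wav", "c.wav"]),                 # n file names
}

# Assumption: column j of `predictions` belongs to labels[j].
pred_df = pd.DataFrame(
    out["predictions"],
    columns=[f"pred_{name}" for name in out["labels"]],
)
pred_df.insert(0, "audio", out["audio"])

print(pred_df)
```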

Audio classification

For audio classification, the validation and test .csv and .pkl prediction files have the same format:

`.pkl` file keys
`.csv` file columns

predictions
- A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of classes.
labels
- A 1-dimensional NumPy array that contains label names.
{audio_column}
- A 1-dimensional NumPy array that contains input audio names. The name of the key is {audio_column} where audio_column refers to the name of the audio column in the train dataframe.
  Note
  You can define the {audio_column} under the Dataset settings section when building an audio classification experiment.

All the N columns in the train dataframe.
A column named pred_{label_column_name} containing probabilities for the label column, where label_column_name refers to the label column name found in the train dataframe.
note
- For multi-label and multi-class audio classification experiments, the .csv file contains more than one pred_{label_column_name} column, each holding the predicted probability for one of the label columns from the train dataframe.
- You can define the label_column_name(s) under the Dataset settings section when building an audio classification experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
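
A minimal sketch of turning class probabilities into predicted classes follows. The data is synthetic, the key name audio is a hypothetical {audio_column}, and class_names is a hypothetical list that the reader supplies; the order of the probability columns and the 0.25 threshold are assumptions to check against your own experiment.

```python
import numpy as np

# Synthetic stand-in for the dictionary loaded from a *_raw_predictions.pkl file.
out = {
    "predictions": np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]),  # shape (n, m)
    "audio": np.array(["a.wav", "b.wav"]),  # hypothetical {audio_column}
}

# Hypothetical class names, one per probability column (the order is an assumption).
class_names = ["dog", "cat", "bird"]

# Multi-class: pick the most probable class for each observation.
top_idx = out["predictions"].argmax(axis=1)
top_class = [class_names[i] for i in top_idx]
top_prob = out["predictions"].max(axis=1)

# Multi-label: keep every class whose probability reaches a chosen threshold.
threshold = 0.25  # illustrative value, not a Hydrogen Torch default
multi_label = [
    [class_names[j] for j in np.flatnonzero(row >= threshold)]
    for row in out["predictions"]
]

for name, cls, prob in zip(out["audio"], top_class, top_prob):
    print(name, cls, round(float(prob), 2))
```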

Speech

Speech recognition

For speech recognition, the validation and test .csv and .pkl files have the same format:

`.pkl` file keys
`.csv` file columns

predictions
- A 1-dimensional NumPy array that contains predicted transcriptions.
{label_column}
- A 1-dimensional NumPy array that contains label transcriptions. The name of the key is {label_column} where label_column refers to the name of the label column in the train dataframe.
  Note
  You can define the {label_column} under the Dataset settings section when building a speech recognition experiment.
{audio_column}
- A 1-dimensional NumPy array that contains input audio names. The name of the key is {audio_column} where audio_column refers to the name of the audio column in the train dataframe.
  Note
  You can define the {audio_column} under the Dataset settings section when building a speech recognition experiment.

Note

The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.

All the N columns in the train dataframe
A column named predicted containing predicted transcripts
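
Because the predicted and label transcriptions are aligned by position, they can be compared row by row. The sketch below computes a plain word error rate with a word-level edit distance; it is an illustration on synthetic data, not necessarily the metric H2O Hydrogen Torch reports, and the key names transcript and audio are hypothetical {label_column} and {audio_column} names.

```python
import numpy as np


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Synthetic stand-in for the dictionary loaded from a *_raw_predictions.pkl file.
out = {
    "predictions": np.array(["turn on light", "stop the music"]),
    "transcript": np.array(["turn on the light", "stop the music"]),
    "audio": np.array(["a.wav", "b.wav"]),
}

wers = [
    word_error_rate(str(ref), str(hyp))
    for ref, hyp in zip(out["transcript"], out["predictions"])
]
for name, wer in zip(out["audio"], wers):
    print(name, wer)
```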

Open `.csv` and `.pkl` files with Python

Using Python, a .csv or .pkl file containing predictions can be opened as follows:

import pickle
import pandas as pd

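# Final predictions: a structured dataframe
df = pd.read_csv('text_classification/validation_predictions.csv')

# Raw predictions: a pickled Python dictionary
with open('text_classification/validation_raw_predictions.pkl', 'rb') as f:
    out = pickle.load(f)

# List the keys available in the raw predictions dictionary
print(out.keys())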

dict_keys(['predictions', 'comment_text', 'labels'])

print(df.head(1))

id	comment_text	label_toxic	label_severe_toxic	label_obscene	label_threat	label_insult	label_identity_hate	fold	pred_label_toxic	pred_label_severe_toxic	pred_label_obscene	pred_label_threat	pred_label_insult	pred_label_identity_hate
000103f0d9cfb60f	D'aww! He matches this background colour I'm s...	0	0	0	0	0	0	0	0.00041	0.000168	0.000328	0.000142	0.000247	0.000155
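
Once both files are loaded, it can be useful to confirm that the i-th dictionary item lines up with the i-th dataframe row. The sketch below does this on small synthetic files written to a temporary directory; the file contents and the column names comment_text and label_toxic only mirror the example above and are not real experiment output.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Write small synthetic prediction files so that the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "validation_predictions.csv")
pkl_path = os.path.join(tmp_dir, "validation_raw_predictions.pkl")

pd.DataFrame({
    "comment_text": ["first comment", "second comment"],
    "label_toxic": [0, 1],
    "pred_label_toxic": [0.1, 0.9],
}).to_csv(csv_path, index=False)

with open(pkl_path, "wb") as f:
    pickle.dump({
        "predictions": np.array([[0.1], [0.9]]),
        "comment_text": np.array(["first comment", "second comment"]),
        "labels": np.array(["label_toxic"]),
    }, f)

# Load both files in the same way as shown above.
df = pd.read_csv(csv_path)
with open(pkl_path, "rb") as f:
    out = pickle.load(f)

# Inspect the type and shape of every dictionary item.
for key, value in out.items():
    print(key, type(value).__name__, getattr(value, "shape", None))

# Check that the i-th dictionary item matches the i-th dataframe row.
rows_match = len(df) == len(out["predictions"])
texts_match = df["comment_text"].tolist() == out["comment_text"].tolist()
preds_match = np.allclose(df["pred_label_toxic"].to_numpy(), out["predictions"][:, 0])
print(rows_match, texts_match, preds_match)
```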

