Version: v1.2.0

Predictions: File formats

Downloaded predictions come in one of two formats, depending on how you generated them:

  1. Predictions downloaded from a completed experiment on the View experiments card are in a .zip file containing the following files:
    • validation_predictions.csv
      • The .csv file is a structured dataframe with final predictions for the provided validation dataframe.
    • validation_raw_predictions.pkl
      • The .pkl file is a pickled Python dictionary with raw predictions for the provided validation dataframe.
    • If the experiment contained a test dataframe, H2O Hydrogen Torch also includes the following two files in the .zip file:
      • test_predictions.csv
        • The .csv file is a structured dataframe with final predictions for the provided test dataframe.
      • test_raw_predictions.pkl
        • The .pkl file is a pickled Python dictionary with raw predictions for the provided test dataframe.
  2. Predictions generated by scoring on new data are in a .zip file containing the following files:
    • test_predictions.csv
      • The .csv file is a structured dataframe with final predictions for the provided test dataframe.
    • test_raw_predictions.pkl
      • The .pkl file is a pickled Python dictionary with raw predictions for the provided test dataframe.
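
As a minimal sketch of checking which of the files listed above a downloaded archive contains, the following uses Python's standard zipfile module. The function name list_prediction_files and the in-memory demo archive are illustrative assumptions, not part of H2O Hydrogen Torch; the sketch also assumes the files may sit inside a folder in the archive, so it compares base names only.

```python
import io
import zipfile
from pathlib import PurePosixPath

# File names as listed above. The test files are present only when the
# experiment had a test dataframe or the archive comes from scoring new data.
EXPECTED_FILES = (
    "validation_predictions.csv",
    "validation_raw_predictions.pkl",
    "test_predictions.csv",
    "test_raw_predictions.pkl",
)


def list_prediction_files(zip_source):
    """Return the expected prediction files found in the archive, by base name.

    zip_source can be a path to a .zip file or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        names = {PurePosixPath(name).name for name in archive.namelist()}
    return [name for name in EXPECTED_FILES if name in names]


# Demo with a made-up in-memory archive so the sketch runs on its own.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("validation_predictions.csv", "id\n1\n")
    demo.writestr("validation_raw_predictions.pkl", b"")
buffer.seek(0)

found = list_prediction_files(buffer)
print(found)
```

For a real download, pass the path of the .zip file instead of the in-memory buffer.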

Image regression

For image regression, the validation and test .csv and .pkl files have the same format:

  • predictions
    • A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of labels.
  • labels
    • A 1-dimensional NumPy array that contains label names.
  • {image_column}
    • A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.
Note

You can define the {image_column} under the Dataset settings section when building an image regression experiment.
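
The keys described above can be combined into one table. The sketch below is an illustration under assumptions: the dictionary is made up, "image" stands in for your {image_column}, and the helper predictions_to_frame is hypothetical. It assumes the columns of predictions follow the order of labels.

```python
import numpy as np
import pandas as pd

# Made-up raw predictions with the keys described above.
# "image" stands in for {image_column}.
raw = {
    "predictions": np.array([[0.2, 1.5], [0.9, 0.1], [1.1, 0.7]]),
    "labels": np.array(["width", "height"]),
    "image": np.array(["a.jpg", "b.jpg", "c.jpg"]),
}


def predictions_to_frame(raw, image_column):
    """Build a dataframe with one row per observation and one column per label."""
    n, m = raw["predictions"].shape
    if m != len(raw["labels"]):
        raise ValueError("expected one prediction column per label")
    if n != len(raw[image_column]):
        raise ValueError("expected one image name per observation")
    return pd.DataFrame(
        raw["predictions"],
        columns=list(raw["labels"]),
        index=pd.Index(raw[image_column], name=image_column),
    )


frame = predictions_to_frame(raw, "image")
print(frame)
```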

Image classification

For image classification, the validation and test .csv and .pkl files have the same format:

  • predictions
    • A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of classes.
  • labels
    • A 1-dimensional NumPy array that contains label names.
  • {image_column}
    • A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.
Note

You can define the {image_column} under the Dataset settings section when building an image classification experiment.
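
To turn class probabilities into a single predicted class per image, one option is to take the most probable column. This is a sketch with made-up data; it assumes a single-label problem and that the columns of predictions follow the order of labels. The helper top_class is hypothetical, and for multi-label problems a per-class threshold would be the more natural choice.

```python
import numpy as np

# Made-up raw predictions; "image" stands in for {image_column}.
raw = {
    "predictions": np.array([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]),
    "labels": np.array(["cat", "dog", "bird"]),
    "image": np.array(["a.jpg", "b.jpg"]),
}


def top_class(raw):
    """Return the most probable class name and its probability for each row."""
    best = raw["predictions"].argmax(axis=1)             # column index per row
    names = raw["labels"][best]                          # map indices to names
    scores = raw["predictions"][np.arange(len(best)), best]
    return names, scores


names, scores = top_class(raw)
for image, name, score in zip(raw["image"], names, scores):
    print(image, name, score)
```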

Image metric learning

For image metric learning, the validation and test .csv and .pkl files have the same format:

  • embeddings
    • A 2-dimensional NumPy array that contains image embeddings. The shape of the array is (n, m), where n represents the number of observations, while m represents the {embedding_size} where {embedding_size} refers to the selected embedding size value used during the experiment. Images with nearby embedding vectors are predicted to have similar content.
      Note

      You can define the {embedding_size} under the Architecture settings section when building an image metric learning experiment.

  • cosine_similarities
    • A 2-dimensional NumPy array that contains cosine similarities between validation/test images. The shape of the array is as follows: number_of_observations x {top_k_similar} where {top_k_similar} refers to the selected Top K Similar value used during the experiment.
  • similar_images
    • A 2-dimensional NumPy array that contains indices of similar validation/test images. The shape of the array is as follows: number_of_observations x {top_k_similar} where {top_k_similar} refers to the selected Top K Similar value used during the experiment.
  • {image_column}
    • A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.
Note

You can define the {image_column} under the Dataset settings section when building an image metric learning experiment.
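
The similar_images indices can be resolved back to image names. The sketch below uses made-up data and a hypothetical helper, similar_image_names. It assumes that the indices refer to rows of the same file and that cosine_similarities[i, j] is the similarity belonging to similar_images[i, j]; neither assumption is confirmed by the description above.

```python
import numpy as np

# Made-up raw predictions; "image" stands in for {image_column},
# and the Top K Similar value is 2 here.
raw = {
    "cosine_similarities": np.array([[0.9, 0.4], [0.9, 0.3], [0.4, 0.3]]),
    "similar_images": np.array([[1, 2], [0, 2], [0, 1]]),
    "image": np.array(["a.jpg", "b.jpg", "c.jpg"]),
}


def similar_image_names(raw, image_column, row):
    """Pair each similar image name with its cosine similarity for one row."""
    indices = raw["similar_images"][row]
    names = raw[image_column][indices]
    similarities = raw["cosine_similarities"][row]
    return list(zip(names.tolist(), similarities.tolist()))


matches = similar_image_names(raw, "image", row=0)
print(matches)
```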

Image object detection

For image object detection, the validation and test .csv and .pkl files have the same format:

  • boxes
    • A 3-dimensional NumPy array that contains predicted bounding boxes. The shape of the array is as follows: number_of_observations x number_of_bounding_boxes x 4.
      Note

      The number_of_bounding_boxes is limited to the 100 most confident boxes. Each box is in the format (x_min, y_min, x_max, y_max).

  • confidences
    • A 2-dimensional NumPy array that contains bounding box confidences (from 0 to 1). The shape of the array is (n, m), where n represents the number of observations, while m represents the number of bounding boxes.
  • classes
    • A 2-dimensional NumPy array that contains class names of bounding boxes. The shape of the array is (n, m), where n represents the number of observations while m represents the number of bounding boxes.
  • {image_column}
    • A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.
      Note

      You can define the {image_column} under the Dataset settings section when building an image object detection experiment.

Note

To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
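
Because up to 100 boxes are kept per image, a common follow-up step is to keep only the confident ones. The sketch below uses made-up data; the helper confident_boxes and the 0.5 threshold are illustrative assumptions, and "image" stands in for your {image_column}.

```python
import numpy as np

# Made-up raw predictions for one image with two boxes.
raw = {
    "boxes": np.array([[[10, 20, 50, 60], [0, 0, 5, 5]]], dtype=float),
    "confidences": np.array([[0.92, 0.15]]),
    "classes": np.array([["car", "person"]]),
    "image": np.array(["a.jpg"]),
}


def confident_boxes(raw, row, threshold=0.5):
    """Return the boxes of one observation whose confidence reaches the threshold."""
    keep = raw["confidences"][row] >= threshold          # boolean mask per box
    boxes = raw["boxes"][row][keep]
    confidences = raw["confidences"][row][keep]
    classes = raw["classes"][row][keep]
    return [
        {
            "class": str(cls),
            "confidence": float(conf),
            "box": tuple(box.tolist()),                  # (x_min, y_min, x_max, y_max)
        }
        for box, conf, cls in zip(boxes, confidences, classes)
    ]


kept = confident_boxes(raw, row=0)
print(kept)
```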

Image semantic segmentation

For image semantic segmentation, the validation and test .csv and .pkl files, for the most part, have similar formats; differences are noted below:

  • masks
    • A 4-dimensional NumPy array that contains pixel-wise probabilities. The shape of the array is as follows: number_of_observations x number_of_classes x {image_height} x {image_width}.
      Note

      You can define the {image_height} and {image_width} under the Image settings section when building an image semantic segmentation experiment.

  • original_image_shapes
    • A 2-dimensional NumPy array that contains shapes of the original input images. The shape of the array is as follows: number_of_observations x 2, where the 2nd dimension contains original_image_height and original_image_width of the corresponding input image.
  • rle_predictions
    • A 2-dimensional NumPy array that contains RLE-encoded predictions for each class. The shape of the array is as follows: number_of_observations x number_of_classes. You can use RLE predictions with corresponding original_image_shapes to decode RLE-encoded strings to binary masks.
  • class_names
    • A list containing all the class names. The class names follow the order of the classes in the 4-dimensional NumPy masks array.
  • {image_column}
    • A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.
Note

You can define the {image_column} under the Dataset settings section when building an image semantic segmentation experiment.
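
One way to read the masks array is to pick the most probable class for every pixel. The sketch below uses a made-up array with one observation, two classes, and a 2 x 2 image. It assumes that exactly one class per pixel is wanted, which may not fit every experiment; the helper most_probable_class is hypothetical.

```python
import numpy as np

class_names = ["road", "car"]  # made-up class names

# Shape: number_of_observations x number_of_classes x image_height x image_width
masks = np.zeros((1, 2, 2, 2))
masks[0, 0] = [[0.9, 0.2], [0.6, 0.1]]  # probabilities for "road"
masks[0, 1] = [[0.1, 0.8], [0.4, 0.9]]  # probabilities for "car"


def most_probable_class(masks):
    """Collapse the class axis: one class index per pixel and observation."""
    return masks.argmax(axis=1)


label_map = most_probable_class(masks)
print(label_map[0])
print([[class_names[i] for i in row] for row in label_map[0]])
```

The resulting class indices follow the order of class_names, as described above.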

Image instance segmentation

For image instance segmentation, the validation and test .csv and .pkl files, for the most part, have similar formats; differences are noted below:

  • raw_probabilities
    • A 4-dimensional NumPy array that contains pixel-wise probabilities. The shape of the array is as follows: number_of_observations x (number_of_classes + 2) x {image_height} x {image_width}. The two additional channels (+ 2) correspond to individual instance borders and borders between instances.
      Note

      You can define the {image_height} and {image_width} under the Image settings section when building an image instance segmentation experiment.

  • instance_predictions
    • A list of 3-dimensional NumPy arrays containing instance predictions, where each instance is represented as a separate integer starting from 1 for each class. The length of the list is number_of_observations and the shape of each array is as follows: original_image_height x original_image_width x number_of_classes, where original_image_height and original_image_width are height and width of the corresponding input image.
  • confidences
    • A list of dictionaries containing prediction confidences for each instance; the length of the list is N (number_of_observations). Each element of the list is a dictionary with keys representing the class names and values representing the confidences for each instance ID (starting from 1).
  • class_names
    • A list containing all the class names. The class names follow the order of the class names in the 4-dimensional NumPy raw_probabilities array and in the 3-dimensional NumPy arrays of the instance_predictions list.
  • {image_column}
    • A 1-dimensional NumPy array that contains input image names. The name of the key is {image_column} where image_column refers to the name of the image column in the train dataframe.
Note

You can define the {image_column} under the Dataset settings section when building an image instance segmentation experiment.

Note

To learn how to open the .csv and .pkl files, see Open .csv and .pkl files with Python.
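
Since each instance is stored as a separate integer per class, counting instances comes down to counting distinct positive integers in each class channel. The sketch below uses a made-up 3 x 4 image with two classes. It assumes that 0 marks pixels without an instance, which the description above implies but does not state; the helper count_instances is hypothetical.

```python
import numpy as np

class_names = ["cell", "nucleus"]  # made-up class names

# One array per observation:
# original_image_height x original_image_width x number_of_classes
prediction = np.zeros((3, 4, 2), dtype=int)
prediction[0, 0:2, 0] = 1   # first "cell" instance
prediction[2, 2:4, 0] = 2   # second "cell" instance
prediction[1, 1, 1] = 1     # single "nucleus" instance
instance_predictions = [prediction]


def count_instances(instance_map, class_names):
    """Count distinct positive instance IDs in each class channel."""
    counts = {}
    for channel, name in enumerate(class_names):
        ids = np.unique(instance_map[:, :, channel])
        counts[name] = int((ids > 0).sum())  # assumption: 0 means no instance
    return counts


counts = [count_instances(p, class_names) for p in instance_predictions]
print(counts)
```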

Text regression

For text regression, the validation and test .csv and .pkl files have the same format:

  • predictions
    • A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of labels.
  • labels
    • A 1-dimensional NumPy array that contains label names.
  • {text_column}
    • A 1-dimensional NumPy array that contains input texts. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.
Note

You can define the {text_column} under the Dataset settings section when building a text regression experiment.

Text classification

For text classification, the validation and test .csv and .pkl files have the same format:

  • predictions
    • A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of classes.
  • labels
    • A 1-dimensional NumPy array that contains label names.
  • {text_column}
    • A 1-dimensional NumPy array that contains input texts. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.
Note

You can define the {text_column} under the Dataset settings section when building a text classification experiment.

Text sequence to sequence

For text sequence to sequence, the validation and test .csv and .pkl files have the same format:

  • {text_column}
    • A 1-dimensional NumPy array that contains the input text observed. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.
      Note

      You can define the {text_column} under the Dataset settings section when building a text sequence to sequence experiment.

  • predicted_text
    • A 1-dimensional NumPy array that contains the predicted text, in string format, for each input text.

Text span predictions

For text span predictions, the validation and test .csv and .pkl files have the same format:

  • {question_column}
    • A 1-dimensional NumPy array that contains the input question text. The name of the key is {question_column} where question_column refers to the name of the question text column in the train dataframe.
      Note

      You can define the {question_column} under the Dataset settings section when building a text span predictions experiment.

  • {context_column}
    • A 1-dimensional NumPy array that contains the input context text. The name of the key is {context_column} where context_column refers to the name of the context text column in the train dataframe.
      Note

      You can define the {context_column} under the Dataset settings section when building a text span predictions experiment.

  • predictions
    • A 1-dimensional NumPy array that contains predictions in a string format for every input question. The predicted string is a substring of the corresponding context text.
  • predicted_{answer_column_name}_top_k
    • A 1-dimensional NumPy array that contains the top-k predictions (in string form) for the answer column, where k represents the number of predictions the model generated for the answer column.

Text token classification

For text token classification, the validation and test .csv and .pkl files have the same format:

  • probabilities
    • A list of 2-dimensional NumPy arrays that contains word-level probabilities for each token, where the length of the list is N (number_of_observations). The shape of each array in the list is as follows: text_length x number_of_classes, where text_length is the number of words in the input text.
  • predictions
    • A 1-dimensional NumPy array that contains predictions in the form of a list of predicted classes for each input word found in the input text.
  • labels
    • A 1-dimensional NumPy array that contains label names.
  • {text_column}
    • A 1-dimensional NumPy array that contains input texts. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.
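
The word-level probabilities can be mapped to class names by taking the most probable class for each word. The sketch below uses made-up data and a hypothetical helper, word_level_classes. It assumes the columns of each probabilities array follow the order of labels.

```python
import numpy as np

# Made-up raw predictions for two texts: one with two words, one with one word.
raw = {
    "probabilities": [
        np.array([[0.9, 0.1], [0.2, 0.8]]),  # text_length x number_of_classes
        np.array([[0.6, 0.4]]),
    ],
    "labels": np.array(["O", "PER"]),
}


def word_level_classes(raw):
    """Return, for each text, the most probable class name of every word."""
    labels = raw["labels"]
    return [labels[p.argmax(axis=1)].tolist() for p in raw["probabilities"]]


classes_per_text = word_level_classes(raw)
print(classes_per_text)
```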

Text metric learning

For text metric learning, the validation and test .csv and .pkl files have the same format:

  • embeddings
    • A 2-dimensional NumPy array that contains text embeddings. The shape of the array is (n, m), where n represents the number of observations, while m represents the {embedding_size}, where {embedding_size} refers to the selected embedding size value used during the experiment. Texts with similar embeddings are predicted to have a similar semantic meaning.
      Note

      You can define the {embedding_size} under the Architecture settings section when building a text metric learning experiment.

  • cosine_similarities
    • A 2-dimensional NumPy array that contains cosine similarities between validation (test) texts. The shape of the array is as follows: number_of_observations x {top_k_similar} where {top_k_similar} refers to the selected Top K Similar value used during the experiment.
  • similar_texts
    • A 2-dimensional NumPy array that contains indices of similar validation (test) texts. The shape of the array is as follows: number_of_observations x {top_k_similar} where {top_k_similar} refers to the selected Top K Similar value used during the experiment.
  • {text_column}
    • A 1-dimensional NumPy array that contains texts from the original text column in the train dataframe. The name of the key is {text_column} where text_column refers to the name of the text column in the train dataframe.
      Note

      You can define the {text_column} under the Dataset settings section when building a text metric learning experiment.
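
To illustrate how embeddings relate to cosine similarities, the sketch below computes the top-k most similar rows directly from an embeddings array with NumPy. It is an illustration only: the function top_k_cosine is hypothetical, the data is made up, and whether the stored cosine_similarities and similar_texts arrays are computed exactly this way (for example, whether a text can match itself) is not confirmed by the description above.

```python
import numpy as np


def top_k_cosine(embeddings, k):
    """Return indices and cosine similarities of the k most similar rows per row."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # guard against zero vectors
    similarities = unit @ unit.T                      # pairwise cosine similarity
    np.fill_diagonal(similarities, -np.inf)           # assumption: skip self-matches
    order = np.argsort(-similarities, axis=1)[:, :k]  # most similar first
    return order, np.take_along_axis(similarities, order, axis=1)


# Made-up embeddings: three texts, embedding size 2.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
indices, similarities = top_k_cosine(embeddings, k=1)
print(indices)
print(similarities)
```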

Audio regression

For audio regression, the validation and test .csv and .pkl files have the same format:

  • predictions
    • A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of labels.
  • labels
    • A 1-dimensional NumPy array that contains label names.
  • {audio_column}
    • A 1-dimensional NumPy array that contains input audio names. The name of the key is {audio_column} where audio_column refers to the name of the audio column in the train dataframe.
      Note

      You can define the {audio_column} under the Dataset settings section when building an audio regression experiment.

Audio classification

For audio classification, the validation and test .csv and .pkl files have the same format:

  • predictions
    • A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of classes.
  • labels
    • A 1-dimensional NumPy array that contains label names.
  • {audio_column}
    • A 1-dimensional NumPy array that contains input audio names. The name of the key is {audio_column} where audio_column refers to the name of the audio column in the train dataframe.
      Note

      You can define the {audio_column} under the Dataset settings section when building an audio classification experiment.

Open .csv and .pkl files with Python

Using Python, you can open a .csv or .pkl file containing predictions as follows:

import pickle
import pandas as pd

# Final predictions as a dataframe
df = pd.read_csv('text_classification/validation_predictions.csv')

# Raw predictions as a Python dictionary
with open('text_classification/validation_raw_predictions.pkl', 'rb') as f:
    out = pickle.load(f)

print(out.keys())
# dict_keys(['predictions', 'comment_text', 'labels'])

print(df.head(1))

The first row of the dataframe, listed column by column:

  id                        000103f0d9cfb60f
  comment_text              D'aww! He matches this background colour I'm s...
  label_toxic               0
  label_severe_toxic        0
  label_obscene             0
  label_threat              0
  label_insult              0
  label_identity_hate       0
  fold                      0
  pred_label_toxic          0.00041
  pred_label_severe_toxic   0.000168
  pred_label_obscene        0.000328
  pred_label_threat         0.000142
  pred_label_insult         0.000247
  pred_label_identity_hate  0.000155
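
After loading a .pkl file, it can help to check each key against the shapes described in the sections above. The sketch below summarizes a raw predictions dictionary; the helper describe_raw_predictions is hypothetical, and the dictionary is made up and round-tripped through pickle in memory so the example runs without a file. Note that pickle.load can execute arbitrary code, so open only .pkl files from a source you trust.

```python
import pickle

import numpy as np


def describe_raw_predictions(raw):
    """Map each key to its type name and its shape (arrays) or length (other containers)."""
    summary = {}
    for key, value in raw.items():
        if isinstance(value, np.ndarray):
            summary[key] = ("ndarray", value.shape)
        elif hasattr(value, "__len__"):
            summary[key] = (type(value).__name__, len(value))
        else:
            summary[key] = (type(value).__name__, None)
    return summary


# Made-up dictionary standing in for the content of a raw predictions file.
made_up = {
    "predictions": np.zeros((2, 3)),
    "comment_text": np.array(["first text", "second text"]),
    "labels": np.array(["label_a", "label_b", "label_c"]),
}
raw = pickle.loads(pickle.dumps(made_up))  # same result as pickle.load on a file

summary = describe_raw_predictions(raw)
for key, info in summary.items():
    print(key, info)
```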
