Predictions: File formats
Overview
Downloaded predictions follow one of two formats, depending on how you generated them:
- Predictions downloaded from a completed experiment on the View experiments card are in a `.zip` file containing the following files:
  - `validation_predictions.csv`: The `.csv` file is a structured dataframe with final predictions for the provided validation dataframe.
  - `validation_raw_predictions.pkl`: The `.pkl` file is a pickled Python dictionary with raw predictions for the provided validation dataframe.
  - If the experiment contained a test dataframe, H2O Hydrogen Torch also includes the following two files in the `.zip` file:
    - `test_predictions.csv`: The `.csv` file is a structured dataframe with final predictions for the provided test dataframe.
    - `test_raw_predictions.pkl`: The `.pkl` file is a pickled Python dictionary with raw predictions for the provided test dataframe.
- Predictions generated by scoring on new data (through the H2O Hydrogen Torch UI) are downloaded in a `.zip` file containing the following files:
  - `test_predictions.csv`: The `.csv` file is a structured dataframe with final predictions for the provided test dataframe.
  - `test_raw_predictions.pkl`: The `.pkl` file is a pickled Python dictionary with raw predictions for the provided test dataframe.
Note: See Download a prediction to learn how to download generated predictions through the H2O Hydrogen Torch UI.
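As a rough illustration of the archive layout described above, the following sketch reads every `*_predictions.csv` and `*_raw_predictions.pkl` file from a `.zip` archive using only the Python standard library. The helper name `read_prediction_archive` and the tiny stand-in archive are hypothetical; a real download contains NumPy arrays inside the pickle, so NumPy must be installed to unpickle it.

```python
import csv
import io
import pickle
import zipfile


def read_prediction_archive(zip_bytes):
    """Return (rows, raw) for the prediction files found in the archive.

    rows: dict mapping each .csv file name to a list of row dictionaries
    raw:  dict mapping each .pkl file name to the unpickled dictionary
    """
    rows, raw = {}, {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("_raw_predictions.pkl"):
                # Unpickling can execute code: only open archives you trust.
                raw[name] = pickle.loads(archive.read(name))
            elif name.endswith("_predictions.csv"):
                text = archive.read(name).decode("utf-8")
                rows[name] = list(csv.DictReader(io.StringIO(text)))
    return rows, raw


# Build a tiny stand-in archive so the sketch runs without a real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("validation_predictions.csv", "image,pred_label\na.jpg,0.5\n")
    archive.writestr(
        "validation_raw_predictions.pkl",
        pickle.dumps({"predictions": [[0.5]], "labels": ["label"]}),
    )

rows, raw = read_prediction_archive(buffer.getvalue())
print(sorted(rows), sorted(raw))
```

For a downloaded file, you would pass the bytes of the archive, for example `open(path, "rb").read()`, where `path` points to your own download.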
Image
Image regression
For image regression, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
  - `predictions`: A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations and m represents the number of labels.
  - `labels`: A 1-dimensional NumPy array that contains label names.
  - `{image_column}`: A 1-dimensional NumPy array that contains input image names. The name of the key is `{image_column}`, where `image_column` refers to the name of the image column in the train dataframe.
    Note: You can define the `{image_column}` under the Dataset settings section when building an image regression experiment.
- `.csv` file columns
  - All the N columns in the train dataframe.
  - A column named `pred_{label_column_name}` that contains predictions for the label column, where `label_column_name` refers to the label column name found in the train dataframe.
    Note:
    - For multi-label image regression experiments, the `.csv` file contains more than one `pred_{label_column_name}` column, one with the predicted value for each of the label columns from the train dataframe.
    - You can define the `label_column_name`(s) under the Dataset settings section when building an image regression experiment.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
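To make the relationship between the `.pkl` keys and the `pred_{label_column_name}` columns concrete, here is a minimal sketch that turns a raw regression dictionary into one record per observation. The dictionary is a hand-built stand-in: the image column name `image_path` and the label names `age` and `height` are hypothetical, and the helper `to_rows` is not part of H2O Hydrogen Torch.

```python
import numpy as np

# Stand-in for the unpickled dictionary (hypothetical column and label names).
raw = {
    "predictions": np.array([[31.5, 172.0], [45.25, 181.5], [27.0, 165.5]]),
    "labels": np.array(["age", "height"]),
    "image_path": np.array(["a.jpg", "b.jpg", "c.jpg"]),
}


def to_rows(raw, image_column):
    """Build one dictionary per observation with pred_{label} entries."""
    predictions = raw["predictions"]
    labels = raw["labels"]
    # The (n, m) shape must agree with the number of images and labels.
    assert predictions.shape == (len(raw[image_column]), len(labels))
    rows = []
    for i, name in enumerate(raw[image_column]):
        row = {image_column: str(name)}
        for j, label in enumerate(labels):
            row[f"pred_{label}"] = float(predictions[i, j])
        rows.append(row)
    return rows


rows = to_rows(raw, "image_path")
print(rows[0])
```

The same pattern applies to the other regression and classification problem types, since their raw dictionaries share the `predictions` and `labels` keys.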
3D image regression
For 3D image regression, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
  - `predictions`: A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations and m represents the number of labels.
  - `labels`: A 1-dimensional NumPy array that contains label names.
  - `{image_column}`: A 1-dimensional NumPy array that contains input image names. The name of the key is `{image_column}`, where `image_column` refers to the name of the image column in the train dataframe.
    Note: You can define the `{image_column}` under the Dataset settings section when building a 3D image regression experiment.
- `.csv` file columns
  - All the N columns in the train dataframe.
  - A column named `pred_{label_column_name}` that contains predictions for the label column, where `label_column_name` refers to the label column name found in the train dataframe.
    Note:
    - For multi-label 3D image regression experiments, the `.csv` file contains more than one `pred_{label_column_name}` column, one with the predicted value for each of the label columns from the train dataframe.
    - You can define the `label_column_name`(s) under the Dataset settings section when building a 3D image regression experiment.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
Image classification
For image classification, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
  - `predictions`: A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations and m represents the number of classes.
  - `labels`: A 1-dimensional NumPy array that contains label names.
  - `{image_column}`: A 1-dimensional NumPy array that contains input image names. The name of the key is `{image_column}`, where `image_column` refers to the name of the image column in the train dataframe.
    Note: You can define the `{image_column}` under the Dataset settings section when building an image classification experiment.
- `.csv` file columns
  - All the N columns in the train dataframe.
  - A column named `pred_{label_column_name}` that contains probabilities for the label column, where `label_column_name` refers to the label column name found in the train dataframe.
    Note:
    - For multi-label and multi-class image classification experiments, the `.csv` file contains more than one `pred_{label_column_name}` column, one with the predicted probability for each of the label columns from the train dataframe.
    - You can define the `label_column_name`(s) under the Dataset settings section when building an image classification experiment.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
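The class probabilities described above can be reduced to a predicted class per observation. The sketch below assumes that the columns of `predictions` follow the order of `labels` (the source does not state this explicitly) and uses made-up class names; the 0.5 threshold for the multi-label case is likewise only an illustrative choice.

```python
import numpy as np

# Stand-in for the unpickled dictionary (hypothetical class names).
raw = {
    "predictions": np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]),
    "labels": np.array(["cat", "dog", "bird"]),
}

# Multi-class: pick the most probable class for each observation.
top_index = raw["predictions"].argmax(axis=1)
top_label = raw["labels"][top_index]
top_probability = raw["predictions"].max(axis=1)

# Multi-label: threshold every class independently (0.5 is an assumption).
multi_label = raw["predictions"] >= 0.5

print(top_label, top_probability)
```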
3D image classification
For 3D image classification, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
  - `predictions`: A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations and m represents the number of classes.
  - `labels`: A 1-dimensional NumPy array that contains label names.
  - `{image_column}`: A 1-dimensional NumPy array that contains input image names. The name of the key is `{image_column}`, where `image_column` refers to the name of the image column in the train dataframe.
    Note: You can define the `{image_column}` under the Dataset settings section when building a 3D image classification experiment.
- `.csv` file columns
  - All the N columns in the train dataframe.
  - A column named `pred_{label_column_name}` that contains probabilities for the label column, where `label_column_name` refers to the label column name found in the train dataframe.
    Note:
    - For multi-label and multi-class 3D image classification experiments, the `.csv` file contains more than one `pred_{label_column_name}` column, one with the predicted probability for each of the label columns from the train dataframe.
    - You can define the `label_column_name`(s) under the Dataset settings section when building a 3D image classification experiment.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
Image metric learning
For image metric learning, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
  - `embeddings`: A 2-dimensional NumPy array that contains image embeddings. The shape of the array is (n, m), where n represents the number of observations and m represents the `{embedding_size}`, where `{embedding_size}` refers to the selected embedding size value used during the experiment. Images with nearby embedding vectors are predicted to have similar content.
    Note: You can define the `{embedding_size}` under the Architecture settings section when building an image metric learning experiment.
  - `cosine_similarities`: A 2-dimensional NumPy array that contains cosine similarities between validation/test images. The shape of the array is `number_of_observations` x `{top_k_similar}`, where `{top_k_similar}` refers to the selected Top K Similar value used during the experiment.
  - `similar_images`: A 2-dimensional NumPy array that contains indices of similar validation/test images. The shape of the array is `number_of_observations` x `{top_k_similar}`, where `{top_k_similar}` refers to the selected Top K Similar value used during the experiment.
  - `{image_column}`: A 1-dimensional NumPy array that contains input image names. The name of the key is `{image_column}`, where `image_column` refers to the name of the image column in the train dataframe.
    Note: You can define the `{image_column}` under the Dataset settings section when building an image metric learning experiment.
- `.csv` file columns
  - All the N columns found in the train dataframe.
  - Three columns named `top_{k}_similar_image`, where k is a number from 1 to 3, each containing the name of the k-th most similar image to the input image.
  - Three columns named `top_{k}_cosine_similarity`, where k is a number from 1 to 3, each containing the cosine similarity value between the input image and the `top_{k}_similar_image`.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
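The following sketch illustrates how the `embeddings`, `cosine_similarities`, and `similar_images` arrays relate to each other. It recomputes the top-k neighbours from a small set of made-up embeddings and maps the resulting indices back to image names. Excluding each image from its own neighbour list is an assumption of this sketch; the source does not say whether the product does so. With a real download you would read the three arrays from the `.pkl` dictionary instead of recomputing them.

```python
import numpy as np

# Made-up data: four images with an embedding size of 2.
image_names = np.array(["a.jpg", "b.jpg", "c.jpg", "d.jpg"])
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
top_k = 2

# Cosine similarity is the dot product of L2-normalised vectors.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T

# Assumption: an image is never listed as its own neighbour.
np.fill_diagonal(similarity, -np.inf)

# Indices of the top_k most similar images, most similar first.
similar_images = np.argsort(-similarity, axis=1)[:, :top_k]
cosine_similarities = np.take_along_axis(similarity, similar_images, axis=1)

# Index the name array with the (n, top_k) index array to get names.
similar_names = image_names[similar_images]
print(similar_names)
```

Both `similar_images` and `cosine_similarities` end up with the shape `number_of_observations` x `top_k`, matching the shapes listed for the `.pkl` keys.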
Image object detection
For image object detection, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
  - `boxes`: A 3-dimensional NumPy array that contains predicted bounding boxes. The shape of the array is `number_of_observations` x `number_of_bounding_boxes` x 4.
    Note: The `number_of_bounding_boxes` is limited to the 100 most confident boxes, all in the format (`x_min`, `y_min`, `x_max`, `y_max`).
  - `confidences`: A 2-dimensional NumPy array that contains bounding box confidences (from 0 to 1). The shape of the array is (n, m), where n represents the number of observations and m represents the number of bounding boxes.
  - `classes`: A 2-dimensional NumPy array that contains class names of bounding boxes. The shape of the array is (n, m), where n represents the number of observations and m represents the number of bounding boxes.
  - `{image_column}`: A 1-dimensional NumPy array that contains input image names. The name of the key is `{image_column}`, where `image_column` refers to the name of the image column in the train dataframe.
    Note: You can define the `{image_column}` under the Dataset settings section when building an image object detection experiment.
- `.csv` file columns
  - A column named `{image_column_name}`, where `image_column_name` refers to the image column name in the train dataframe.
    Note: You can define the `image_column_name` under the Dataset settings section when building an image object detection experiment.
  - A column named `x_min` containing the minimum x coordinates for the bounding boxes.
  - A column named `y_min` containing the minimum y coordinates for the bounding boxes.
  - A column named `x_max` containing the maximum x coordinates for the bounding boxes.
  - A column named `y_max` containing the maximum y coordinates for the bounding boxes.
  - A column named `confidence` containing the confidence scores of the corresponding bounding boxes. Only bounding boxes with a confidence score larger than the `probability_threshold` are considered.
    Note: You can define the `{probability_threshold}` under the Validation settings section when building an image object detection experiment.
  - A column named `{class_name_column}` containing the class names of the corresponding bounding boxes, where `class_name_column` refers to the name of the column in the train dataframe that contains the class names.
To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
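As a sketch of how the raw arrays map to the per-box rows of the `.csv` file, the code below keeps only boxes whose confidence exceeds a threshold and emits one record per remaining box. The data is made up, the names `image_path` and `class_name` are hypothetical column names, and the helper `boxes_to_rows` is not part of H2O Hydrogen Torch. How the product pads images with fewer boxes is not described in the source, so the sketch does not handle padding.

```python
import numpy as np

# Stand-in for the unpickled dictionary: 2 images, 2 boxes each.
raw = {
    "boxes": np.array(
        [
            [[10, 20, 50, 80], [0, 0, 5, 5]],
            [[30, 40, 90, 120], [15, 15, 25, 35]],
        ],
        dtype=float,
    ),
    "confidences": np.array([[0.92, 0.10], [0.75, 0.60]]),
    "classes": np.array([["car", "car"], ["person", "dog"]]),
    "image_path": np.array(["a.jpg", "b.jpg"]),
}


def boxes_to_rows(raw, image_column, probability_threshold):
    """Flatten the (n, boxes, 4) arrays into one record per kept box."""
    rows = []
    for i, name in enumerate(raw[image_column]):
        # Keep boxes with a confidence larger than the threshold.
        keep = raw["confidences"][i] > probability_threshold
        kept = zip(
            raw["boxes"][i][keep],
            raw["confidences"][i][keep],
            raw["classes"][i][keep],
        )
        for box, confidence, class_name in kept:
            x_min, y_min, x_max, y_max = (float(v) for v in box)
            rows.append(
                {
                    image_column: str(name),
                    "x_min": x_min,
                    "y_min": y_min,
                    "x_max": x_max,
                    "y_max": y_max,
                    "confidence": float(confidence),
                    "class_name": str(class_name),
                }
            )
    return rows


rows = boxes_to_rows(raw, "image_path", probability_threshold=0.5)
print(len(rows), rows[0])
```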
Image semantic segmentation
For image semantic segmentation, the validation and test `.csv` and `.pkl` prediction files have, for the most part, similar formats; differences are noted below:
- `.pkl` file keys
  - `masks`: A 4-dimensional NumPy array that contains pixel-wise probabilities. The shape of the array is `number_of_observations` x `number_of_classes` x `{image_height}` x `{image_width}`.
    Note: You can define the `{image_height}` and `{image_width}` under the Image settings section when building an image semantic segmentation experiment.
  - `original_image_shapes`: A 2-dimensional NumPy array that contains shapes of the original input images. The shape of the array is `number_of_observations` x 2, where the 2nd dimension contains the `original_image_height` and `original_image_width` of the corresponding input image.
  - `rle_predictions`: A 2-dimensional NumPy array that contains RLE-encoded predictions for each class. The shape of the array is `number_of_observations` x `number_of_classes`. You can use the RLE predictions with the corresponding `original_image_shapes` to decode RLE-encoded strings to binary masks.
  - `class_names`: A list containing all the class names. The class names follow the order of the classes in the 4-dimensional NumPy `masks` array.
  - `{image_column}`: A 1-dimensional NumPy array that contains input image names. The name of the key is `{image_column}`, where `image_column` refers to the name of the image column in the train dataframe.
    Note: You can define the `{image_column}` under the Dataset settings section when building an image semantic segmentation experiment.
- `.csv` file columns
  - All the N columns in the train dataframe.
    Note: The `.csv` file repeats each original row in the train dataframe X times, with each row containing a different run-length-encoded mask prediction for a given class, where X refers to the `{number_of_classes}`.
  - In the case that the train dataframe contains a `{class_name_column}` and `{rle_mask_column}`:
    - A column named `{class_name_column}` containing input class names, where `class_name_column` refers to the name of the column in the train dataframe that contains the class names.
    - A column named `{rle_mask_column}` containing all the true run-length encodings (RLEs) in the train dataframe.
    Note: You can define the `{class_name_column}` and `{rle_mask_column}` under the Dataset settings section when building an image semantic segmentation experiment.
  - In the case that the test dataframe does not contain a `{class_name_column}` or `{rle_mask_column}` or both:
    - The first column in the `.csv` file is named `class_id`, and there is no column with true run-length encodings (RLEs).
  - A column with the prefix `pred_` followed by the suffix `{rle_mask_column}` that contains the predicted run-length encodings (RLEs) of all the predictions, where `rle_mask_column` refers to the name of the run-length encodings mask column in the train dataframe.
    Note:
    - If there is no `{rle_mask_column}` in the train dataframe, this column is named `pred_mask`.
    - If no mask is predicted, then the column value is an empty string.
    - You can define the `{rle_mask_column}` under the Dataset settings section when building an image semantic segmentation experiment.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
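To illustrate the layout of the `masks` array, the sketch below thresholds made-up pixel-wise probabilities into binary masks and counts the predicted pixels per class. The class names and the 0.5 threshold are assumptions chosen for illustration. Note that `masks` uses the configured `{image_height}` x `{image_width}`, so comparing against the original image requires resizing to the sizes stored in `original_image_shapes`, which this sketch omits.

```python
import numpy as np

# Made-up data: 1 observation, 2 classes, image height 2, image width 3.
class_names = ["road", "car"]
masks = np.array(
    [
        [
            [[0.9, 0.8, 0.1], [0.4, 0.2, 0.1]],  # probabilities for "road"
            [[0.1, 0.1, 0.8], [0.2, 0.6, 0.9]],  # probabilities for "car"
        ]
    ]
)

threshold = 0.5  # illustrative value, not taken from the product
binary_masks = masks > threshold

# Count predicted pixels per class for the first observation; the class
# axis (axis 1) follows the order of class_names.
pixel_counts = {
    name: int(binary_masks[0, c].sum()) for c, name in enumerate(class_names)
}
print(binary_masks.shape, pixel_counts)
```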
3D image semantic segmentation
For 3D image semantic segmentation, the validation and test `.csv` and `.pkl` prediction files have, for the most part, similar formats; differences are noted below:
- `.pkl` file keys
  - `masks`: A 5-dimensional NumPy array that contains pixel-wise probabilities. The shape of the array is `number_of_observations` x `number_of_classes` x `{image_height}` x `{image_width}` x `{image_depth}`.
    Note: You can define the `{image_height}`, `{image_width}`, and `{image_depth}` under the Image settings section when building a 3D image semantic segmentation experiment.
  - `original_image_shapes`: A 2-dimensional NumPy array that contains shapes of the original input images. The shape of the array is `number_of_observations` x 3, where the 2nd dimension contains the `original_image_height`, `original_image_width`, and `original_image_depth` of the corresponding input image.
  - `rle_predictions`: A 2-dimensional NumPy array that contains RLE-encoded predictions for each class. The shape of the array is `number_of_observations` x `number_of_classes`. You can use the RLE predictions with the corresponding `original_image_shapes` to decode RLE-encoded strings to binary masks.
  - `class_names`: A list containing all the class names. The class names follow the order of the classes in the 5-dimensional NumPy `masks` array.
  - `{image_column}`: A 1-dimensional NumPy array that contains input image names. The name of the key is `{image_column}`, where `image_column` refers to the name of the image column in the train dataframe.
    Note: You can define the `{image_column}` under the Dataset settings section when building a 3D image semantic segmentation experiment.
- `.csv` file columns
  - All the N columns in the train dataframe.
    Note: The `.csv` file repeats each original row in the train dataframe X times, with each row containing a different run-length-encoded mask prediction for a given class, where X refers to the `{number_of_classes}`.
  - In the case that the train dataframe contains a `{class_name_column}` and `{rle_mask_column}`:
    - A column named `{class_name_column}` that contains input class names, where `class_name_column` refers to the name of the column in the train dataframe that contains the class names.
    - A column named `{rle_mask_column}` that contains all the true run-length encodings (RLEs) in the train dataframe.
    Note: You can define the `{class_name_column}` and `{rle_mask_column}` under the Dataset settings section when building a 3D image semantic segmentation experiment.
  - In the case that the test dataframe does not contain a `{class_name_column}` or `{rle_mask_column}` or both:
    - The first column in the `.csv` file is named `class_id`, and there is no column with true run-length encodings (RLEs).
  - A column with the prefix `pred_` followed by the suffix `{rle_mask_column}` that contains the predicted run-length encodings (RLEs) of all the predictions, where `rle_mask_column` refers to the name of the run-length encodings mask column in the train dataframe.
    Note:
    - If there is no `{rle_mask_column}` in the train dataframe, this column is named `pred_mask`.
    - If no mask is predicted, then the column value is an empty string.
    - You can define the `{rle_mask_column}` under the Dataset settings section when building a 3D image semantic segmentation experiment.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
Image instance segmentation
For image instance segmentation, the validation and test `.csv` and `.pkl` prediction files have, for the most part, similar formats; differences are noted below:
- `.pkl` file keys
  - `raw_probabilities`: A 4-dimensional NumPy array that contains pixel-wise probabilities. The shape of the array is `number_of_observations` x (`number_of_classes` + 2) x `{image_height}` x `{image_width}`. The two additional channels (+ 2) added to the `number_of_classes` correspond to individual instance borders and borders between instances.
    Note: You can define the `{image_height}` and `{image_width}` under the Image settings section when building an image instance segmentation experiment.
  - `instance_predictions`: A list of 3-dimensional NumPy arrays containing instance predictions, where each instance is represented as a separate integer starting from 1 for each class. The length of the list is `number_of_observations`, and the shape of each array is `original_image_height` x `original_image_width` x `number_of_classes`, where `original_image_height` and `original_image_width` are the height and width of the corresponding input image.
  - `confidences`: A list of dictionaries containing prediction confidences for each instance; the length of the list is N (`number_of_observations`). Each element of the list is a dictionary with keys representing the class names and values representing the confidences for each instance ID (starting from 1).
  - `class_names`: A list containing all the class names. The class names follow the order of the classes in the 4-dimensional NumPy `raw_probabilities` array and in the 3-dimensional NumPy arrays of the `instance_predictions` list.
  - `{image_column}`: A 1-dimensional NumPy array that contains input image names. The name of the key is `{image_column}`, where `image_column` refers to the name of the image column in the train dataframe.
    Note: You can define the `{image_column}` under the Dataset settings section when building an image instance segmentation experiment.
- `.csv` file columns
  - A column named `{image_column_name}`, where `image_column_name` refers to the image column name in the train dataframe.
    Note: You can define the `image_column_name` under the Dataset settings section when building an image instance segmentation experiment.
  - A column named `{class_name_column}` containing the class names for each instance predicted, where `class_name_column` refers to the name of the column in the train dataframe that contains the class names.
  - A column named `instance_rle` that contains the run-length-encoded (RLE) mask for each instance.
  - A column named `confidence` containing the confidence scores for each instance.
To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
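The sketch below walks through the `instance_predictions` and `confidences` structures on made-up data, listing each instance with its pixel area and confidence. Two points are assumptions of this sketch rather than facts from the source: the value 0 is treated as background, and the confidences of a class are taken to be a sequence ordered by instance ID, so that the instance with ID k is found at position k - 1. Check both against your own file before relying on them.

```python
import numpy as np

# Made-up data: one 3 x 4 image with 2 classes (height x width x classes).
class_names = ["cell", "nucleus"]
instance_map = np.zeros((3, 4, 2), dtype=int)
instance_map[0, 0:2, 0] = 1  # first "cell" instance
instance_map[2, 2:4, 0] = 2  # second "cell" instance
instance_map[1, 1, 1] = 1    # single "nucleus" instance

raw = {
    "instance_predictions": [instance_map],
    # Assumption: values are sequences ordered by instance ID (1, 2, ...).
    "confidences": [{"cell": [0.9, 0.6], "nucleus": [0.8]}],
    "class_names": class_names,
}


def summarize_instances(raw):
    """List every predicted instance with its area and confidence."""
    summary = []
    for i, instances in enumerate(raw["instance_predictions"]):
        for c, class_name in enumerate(raw["class_names"]):
            channel = instances[:, :, c]
            for instance_id in np.unique(channel):
                if instance_id == 0:  # assumption: 0 marks background
                    continue
                summary.append(
                    {
                        "observation": i,
                        "class_name": class_name,
                        "instance_id": int(instance_id),
                        "area": int((channel == instance_id).sum()),
                        "confidence": raw["confidences"][i][class_name][
                            int(instance_id) - 1
                        ],
                    }
                )
    return summary


summary = summarize_instances(raw)
print(summary)
```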
Text
Text regression
For text regression, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
  - `predictions`: A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations and m represents the number of labels.
  - `labels`: A 1-dimensional NumPy array that contains label names.
  - `{text_column}`: A 1-dimensional NumPy array that contains input texts. The name of the key is `{text_column}`, where `text_column` refers to the name of the text column in the train dataframe.
    Note: You can define the `{text_column}` under the Dataset settings section when building a text regression experiment.
- `.csv` file columns
  - All the N columns in the train dataframe.
  - A column named `pred_{label_column_name}` containing predictions for the label column, where `label_column_name` refers to the label column name found in the train dataframe.
    Note:
    - For multi-label text regression experiments, the `.csv` file contains more than one `pred_{label_column_name}` column, one with the predicted value for each of the label columns from the train dataframe.
    - You can define the `label_column_name`(s) under the Dataset settings section when building a text regression experiment.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
Text classification
For text classification, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
  - `predictions`: A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations and m represents the number of classes.
  - `labels`: A 1-dimensional NumPy array that contains label names.
  - `{text_column}`: A 1-dimensional NumPy array that contains input texts. The name of the key is `{text_column}`, where `text_column` refers to the name of the text column in the train dataframe.
    Note: You can define the `{text_column}` under the Dataset settings section when building a text classification experiment.
- `.csv` file columns
  - All the N columns in the train dataframe.
  - A column named `pred_{label_column_name}` containing probabilities for the label column, where `label_column_name` refers to the label column name found in the train dataframe.
    Note:
    - For multi-label and multi-class text classification experiments, the `.csv` file contains more than one `pred_{label_column_name}` column, one with the predicted probability for each of the label columns from the train dataframe.
    - You can define the `label_column_name`(s) under the Dataset settings section when building a text classification experiment.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
Text sequence to sequence
For text sequence to sequence, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
  - `{text_column}`: A 1-dimensional NumPy array that contains the input text observed. The name of the key is `{text_column}`, where `text_column` refers to the name of the text column in the train dataframe.
    Note: You can define the `{text_column}` under the Dataset settings section when building a text sequence to sequence experiment.
  - `predicted_text`: A 1-dimensional NumPy array that contains predictions in a string format for the input text column in the train dataframe.
- `.csv` file columns
  - All the N columns found in the train dataframe.
  - A column with the prefix `pred_` followed by `{name_of_the_output_text}` that contains predictions for the output text (`{label_columns}`).
    Note: You can define the `{label_columns}` under the Dataset settings section when building a text sequence to sequence experiment.
Note:
- The i-th sample of each item in the `.pkl` dictionary matches the i-th row of the `.csv` dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open .csv and .pkl files with Python.
Text span predictions
For text span predictions, the validation and test .csv
and .pkl
prediction files have the same format:
- `.pkl` file keys
- `.csv` file columns
- {question_column}
- A 1-dimensional NumPy array that contains the input question text. The name of the key is
{question_column}
wherequestion_column
refers to the name of the question text column in the train dataframe.NoteYou can define the
{question_column}
under the Dataset settings section when building a Text Span Predictions experiment.
- A 1-dimensional NumPy array that contains the input question text. The name of the key is
- {context_column}
- A 1-dimensional NumPy array that contains the input context text. The name of the key is
{context_column}
wherecontext_column
refers to the name of the context text column in the train dataframe.NoteYou can define the
{context_column}
under the Dataset settings section when building a text span predictions experiment.
- A 1-dimensional NumPy array that contains the input context text. The name of the key is
- predictions
- A 1-dimensional NumPy array that contains predictions in a string format for every input question. The predicted string is a substring of the corresponding context text.
- predicted_{answer_column_name}_top_k
- A 1-dimensional NumPy array that contains top-K predictions (in a string form) for the answer column, where k represents the number of predictions the model generated for the answer column.
- predicted_answers_score
- A 2-dimensional NumPy array that contains (unnormalized) scores for each predicted answer. Higher scores indicate higher confidence in the prediction.
- predicted_answers_null_score
- A 2-dimensional NumPy array that contains the (unnormalized) score that the model assigns to the question having no answer. The difference between the answer and the null scores can be seen as a measure of confidence in the answer.Note
The null score for each answer to a given question may differ. This difference can happen if the model splits the context into multiple spans and predicts each span individually. The null score corresponds to the span where the predicted answer is found.
- A 2-dimensional NumPy array that contains the (unnormalized) score that the model assigns to the question having no answer. The difference between the answer and the null scores can be seen as a measure of confidence in the answer.
- All the N columns found in the train dataframe.
- A column named
pred_{answer_column_name}
containing predictions (in a string form) for the answer column, where theanswer_column_name
refers to the name of the answer column found in the train dataframe. - A set of N columns, with the following name convention:
pred_{answer_column_name}_top_{k}
.- N refers to the number of answers the model generated for the answer column Note
The number of answers the model generates is determined by the number specified in the following dataset setting: Number of predicted answers.
answer_column_name
refers to the name of the answer column found in the train dataframek
refers to the rank of the prediction for the answer column, where k can represent a number between 1 to N, where N refers to the specified number of predictions to generate. Generated predictions are ranked from highest to lowest, where 1 refers to the highest prediction.NoteYou can specify the number of predictions for the answer column using the following dataset setting: Number of predicted answers.
- N refers to the number of answers the model generated for the answer column
- The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open `.csv` and `.pkl` files with Python.
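The ranked answer columns described above can be collected per row with a small helper. This is a minimal sketch using only the standard library; the column name `answer`, the sample CSV text, and the helper `ranked_answers` are hypothetical and stand in for a real `validation_predictions.csv`.

```python
import csv
import io

# Hypothetical answer column name and a made-up one-row CSV for illustration.
answer_column_name = "answer"
csv_text = (
    "question,answer,pred_answer,pred_answer_top_1,pred_answer_top_2\n"
    "Who wrote it?,Ada,Ada,Ada,Ada Lovelace\n"
)


def ranked_answers(row, answer_column_name):
    """Return the pred_{answer_column_name}_top_{k} values ordered by rank k."""
    prefix = f"pred_{answer_column_name}_top_"
    ranked = []
    for name, value in row.items():
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdigit():
            ranked.append((int(suffix), value))
    # Rank 1 is the highest-ranked prediction, so sort ascending by k.
    return [value for _, value in sorted(ranked)]


rows = list(csv.DictReader(io.StringIO(csv_text)))
answers = ranked_answers(rows[0], answer_column_name)
print(answers)
```

With a real file, `io.StringIO(csv_text)` would be replaced by an open file handle for the downloaded `.csv`.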
Text token classification
For text token classification, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
- `.csv` file columns
- probabilities
- A list of 2-dimensional NumPy arrays that contains word-level probabilities for each token, where the length of the list is N (`number_of_observations`). The shape of each array in the list is `text_length` x `number_of_classes`, where `text_length` is the number of words in the input text.
- predictions
- A 1-dimensional NumPy array that contains predictions in the form of a list of predicted classes for each input word found in the input text.
- labels
- A 1-dimensional NumPy array that contains label names.
- {text_column}
- A 1-dimensional NumPy array that contains input texts. The name of the key is `{text_column}`, where `text_column` refers to the name of the text column in the train dataframe.
- All the N columns found in the train dataframe.
- A column named `pred_{label_column_name}` containing predictions for the `{label_column_name}` column, in the form of a string of space-separated predicted classes, one for each input word.
Note: You can define the `label_column_name` under the Dataset settings section when building a text token classification experiment.
- The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open `.csv` and `.pkl` files with Python.
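To illustrate how the word-level `probabilities` relate to the input words, the following sketch pairs each word with its most probable class. It assumes that words are separated by whitespace and that the most probable class is the predicted one; both are assumptions, and all values below are made up.

```python
import numpy as np

# Made-up raw output for two texts and three classes (illustrative only).
labels = np.array(["O", "PER", "LOC"])
texts = ["Anna visited Paris", "Hello world"]
probabilities = [
    np.array([[0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.2, 0.1, 0.7]]),
    np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]),
]


def words_with_classes(text, probs, labels):
    """Pair each whitespace-separated word with its most probable class."""
    words = text.split()
    if len(words) != probs.shape[0]:
        raise ValueError("text_length does not match the number of words")
    best = probs.argmax(axis=1)  # index of the top class for each word
    return [(word, str(labels[i])) for word, i in zip(words, best)]


tagged = [words_with_classes(t, p, labels) for t, p in zip(texts, probabilities)]
# Space-separated class string, similar in shape to the .csv prediction column.
csv_style = [" ".join(label for _, label in row) for row in tagged]
print(tagged[0])
print(csv_style)
```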
Text metric learning
For text metric learning, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
- `.csv` file columns
- embeddings
- A 2-dimensional NumPy array that contains text embeddings. The shape of the array is (n, m), where n represents the number of observations, while m represents the embedding size. Texts with similar embeddings are predicted to have a similar semantic meaning.
Note: You can define the `{embedding_size}` under the Architecture settings section when building a text metric learning experiment.
- cosine_similarities
- A 2-dimensional NumPy array that contains cosine similarities between validation (test) texts. The shape of the array is `number_of_observations` x `{top_k_similar}`, where `{top_k_similar}` refers to the Top K Similar value selected for the experiment.
- similar_texts
- A 2-dimensional NumPy array that contains indices of similar validation (test) texts. The shape of the array is `number_of_observations` x `{top_k_similar}`, where `{top_k_similar}` refers to the Top K Similar value selected for the experiment.
- {text_column}
- A 1-dimensional NumPy array that contains texts from the original text column in the train dataframe. The name of the key is `{text_column}`, where `text_column` refers to the name of the text column in the train dataframe.
Note: You can define the `{text_column}` under the Dataset settings section when building a text metric learning experiment.
- All the N columns found in the train dataframe.
- Three columns named `top_{k}_similar_text`, where k ranges from 1 to 3, each containing the k-th most similar text to the input text.
- Three columns named `top_{k}_cosine_similarity`, where k ranges from 1 to 3, each containing the cosine similarity between the input text and the corresponding `top_{k}_similar_text`.
- The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open `.csv` and `.pkl` files with Python.
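As a rough sketch of how the `similar_texts` indices and `cosine_similarities` values fit together, the example below looks up the neighbours of one text. It assumes the indices point into the array stored under the `{text_column}` key; the key name `comment_text`, the helper `neighbours`, and all values are hypothetical.

```python
import numpy as np

# Made-up raw output with three texts and Top K Similar = 2 (illustrative only).
text_column = "comment_text"  # hypothetical text column name
out = {
    text_column: np.array(["cheap flights", "low-cost airfare", "garden tools"]),
    "similar_texts": np.array([[1, 2], [0, 2], [1, 0]]),
    "cosine_similarities": np.array([[0.92, 0.10], [0.92, 0.15], [0.15, 0.10]]),
}


def neighbours(out, text_column, row):
    """Return (text, cosine similarity) pairs for the texts similar to `row`."""
    texts = out[text_column]
    indices = out["similar_texts"][row]
    similarities = out["cosine_similarities"][row]
    return [(str(texts[i]), float(s)) for i, s in zip(indices, similarities)]


result = neighbours(out, text_column, 0)
print(result)
```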
Audio
Audio regression
For audio regression, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
- `.csv` file columns
- predictions
- A 2-dimensional NumPy array that contains label predictions. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of labels.
- labels
- A 1-dimensional NumPy array that contains label names.
- {audio_column}
- A 1-dimensional NumPy array that contains input audio names. The name of the key is `{audio_column}`, where `audio_column` refers to the name of the audio column in the train dataframe.
Note: You can define the `{audio_column}` under the Dataset settings section when building an audio regression experiment.
- All the N columns in the train dataframe.
- A column named `pred_{label_column_name}` containing predictions for the label column, where `label_column_name` refers to the label column name found in the train dataframe.
- The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open `.csv` and `.pkl` files with Python.
Audio classification
For audio classification, the validation and test `.csv` and `.pkl` prediction files have the same format:
- `.pkl` file keys
- `.csv` file columns
- predictions
- A 2-dimensional NumPy array that contains class probabilities. The shape of the array is (n, m), where n represents the number of observations, while m represents the number of classes.
- labels
- A 1-dimensional NumPy array that contains label names.
- {audio_column}
- A 1-dimensional NumPy array that contains input audio names. The name of the key is `{audio_column}`, where `audio_column` refers to the name of the audio column in the train dataframe.
Note: You can define the `{audio_column}` under the Dataset settings section when building an audio classification experiment.
- All the N columns in the train dataframe.
- A column named `pred_{label_column_name}` containing probabilities for the label column, where `label_column_name` refers to the label column name found in the train dataframe.
Note:
- For multi-label and multi-class audio classification experiments, the `.csv` file contains more than one `pred_{label_column_name}` column, each holding the predicted probability for one of the label columns from the train dataframe.
- You can define the `label_column_name`(s) under the Dataset settings section when building an audio classification experiment.
- The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open `.csv` and `.pkl` files with Python.
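The class probabilities in `predictions` can be turned into class names with the `labels` array. The sketch below shows one way to do this for a multi-class and a multi-label reading of the same array; the key name `audio`, the file names, the 0.5 threshold, and the probability values are assumptions chosen for illustration.

```python
import numpy as np

# Made-up raw output for two audio files and three classes (illustrative only).
out = {
    "predictions": np.array([[0.7, 0.2, 0.1], [0.05, 0.15, 0.8]]),
    "labels": np.array(["dog", "cat", "bird"]),
    "audio": np.array(["a.wav", "b.wav"]),  # hypothetical audio column name
}

# Multi-class reading: pick the most probable class for each observation.
top_class = out["labels"][out["predictions"].argmax(axis=1)]

# Multi-label reading: keep every class at or above a chosen threshold.
threshold = 0.5  # hypothetical cut-off, not a product default
multi_label = [
    [str(label) for label, p in zip(out["labels"], row) if p >= threshold]
    for row in out["predictions"]
]

print(top_class.tolist())
print(multi_label)
```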
Speech
Speech recognition
For speech recognition, the validation and test `.csv` and `.pkl` files have the same format:
- `.pkl` file keys
- `.csv` file columns
- predictions
- A 1-dimensional NumPy array that contains predicted transcriptions.
- {label_column}
- A 1-dimensional NumPy array that contains label transcriptions. The name of the key is `{label_column}`, where `label_column` refers to the name of the label column in the train dataframe.
Note: You can define the `{label_column}` under the Dataset settings section when building a speech recognition experiment.
- {audio_column}
- A 1-dimensional NumPy array that contains input audio names. The name of the key is `{audio_column}`, where `audio_column` refers to the name of the audio column in the train dataframe.
Note: You can define the `{audio_column}` under the Dataset settings section when building a speech recognition experiment.
- The i-th sample of each output's dictionary item matches the i-th row of the dataframe.
- To learn how to open the `.csv` and `.pkl` files, see Open `.csv` and `.pkl` files with Python.
- All the N columns in the train dataframe.
- A column named `predicted` containing predicted transcripts.
Open `.csv` and `.pkl` files with Python
Using Python, a `.csv` or `.pkl` file containing predictions can be opened as follows:
import pickle
import pandas as pd
df = pd.read_csv('text_classification/validation_predictions.csv')
with open('text_classification/validation_raw_predictions.pkl', 'rb') as f:
    out = pickle.load(f)
print(out.keys())
dict_keys(['predictions', 'comment_text', 'labels'])
print(df.head(1))
id | comment_text | label_toxic | label_severe_toxic | label_obscene | label_threat | label_insult | label_identity_hate | fold | pred_label_toxic | pred_label_severe_toxic | pred_label_obscene | pred_label_threat | pred_label_insult | pred_label_identity_hate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00041 | 0.000168 | 0.000328 | 0.000142 | 0.000247 | 0.000155 |
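Because the i-th sample of each dictionary item is expected to match the i-th row of the dataframe, a quick length check after loading can catch mismatched files. The following is a minimal sketch that round-trips a made-up dictionary through `pickle` in memory instead of reading a real file; the helper `misaligned_keys` and the choice to skip the `labels` key (which holds one entry per label rather than per row) are assumptions.

```python
import io
import pickle

import numpy as np

# Made-up raw predictions, pickled in memory to mimic a downloaded .pkl file.
raw = {
    "predictions": np.array([[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]]),
    "comment_text": np.array(["first", "second", "third"]),
    "labels": np.array(["label_a", "label_b"]),
}
buffer = io.BytesIO()
pickle.dump(raw, buffer)
buffer.seek(0)
out = pickle.load(buffer)


def misaligned_keys(out, n_rows, skip=("labels",)):
    """Return keys whose number of samples differs from the dataframe length."""
    return [
        key for key, value in out.items()
        if key not in skip and len(value) != n_rows
    ]


bad = misaligned_keys(out, n_rows=3)  # with a real .csv, use len(df)
print(bad)
```

Note that `pickle.load` can execute arbitrary code while loading, so only open `.pkl` files that come from a source you trust.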