Predictions download formats: Text span prediction
Overview
When you download predictions in H2O Hydrogen Torch, they come in a zip file. The format and content of the files in that zip file depend first on the problem type of the predictions and then on how you generated them. There are two scenarios for how predictions are generated.
Scenario 1: Predictions from a completed experiment
Predictions downloaded from a completed experiment on the View experiments card are packaged in a zip file. This zip file contains the following files:
- validation_predictions.csv: This is a structured dataframe in CSV format, presenting the final predictions for the provided validation dataframe.
- validation_raw_predictions.pkl: This is a Pickle file, which is essentially a pickled Python dictionary. It contains raw predictions for the provided validation dataframe.
If the experiment included a test dataframe, H2O Hydrogen Torch also includes two additional files in the same zip file:
- test_predictions.csv: This is another structured dataframe in CSV format, displaying the final predictions for the provided test dataframe.
- test_raw_predictions.pkl: Similar to the validation set, this is a Pickle file with raw predictions for the provided test dataframe.
Scenario 2: Predictions generated by scoring on new data
Predictions generated by scoring on new data through the H2O Hydrogen Torch UI (on the Predict data card) are downloaded in a zip file. This zip file includes the following files:
- test_predictions.csv: This is a structured dataframe in CSV format, showing the final predictions for the provided test dataframe.
- test_raw_predictions.pkl: This is a Pickle file, a pickled Python dictionary containing raw predictions for the provided test dataframe.
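As a rough illustration of working with either archive, the sketch below lists the members of a predictions zip file and groups them into final (.csv) and raw (.pkl) prediction files. The helper name `list_prediction_files` and the demo archive built in memory are assumptions made for illustration; for a real download, pass the path of the zip file instead.

```python
import io
import zipfile


def list_prediction_files(archive):
    """Group the members of a predictions zip by file type.

    `archive` can be a path or a file-like object; zipfile.ZipFile accepts both.
    """
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
    return {
        "csv": sorted(n for n in names if n.endswith(".csv")),
        "pkl": sorted(n for n in names if n.endswith(".pkl")),
    }


# Demo archive built in memory with placeholder contents (hypothetical data
# standing in for a zip file downloaded from the Predict data card).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("test_predictions.csv", "question,context,pred_answer\n")
    zf.writestr("test_raw_predictions.pkl", b"")
buffer.seek(0)

files = list_prediction_files(buffer)
print(files)
```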
Formats
The Pickle file contains the following keys:
- [question_column]
  - A 1-dimensional NumPy array that contains the input question text. The name of the key is [question_column], where question_column refers to the name of the question text column in the train dataframe.
  - Note: You can define the [question_column] under the Dataset settings section when building a text span prediction experiment.
- [context_column]
  - A 1-dimensional NumPy array that contains the input context text. The name of the key is [context_column], where context_column refers to the name of the context text column in the train dataframe.
  - Note: You can define the [context_column] under the Dataset settings section when building a text span prediction experiment.
- predictions
  - A 1-dimensional NumPy array that contains predictions in a string format for every input question. The predicted string is a substring of the corresponding context text.
- predicted_[answer_column_name]_top_k
  - A 1-dimensional NumPy array that contains top-K predictions (in string form) for the answer column, where K represents the number of predictions the model generated for the answer column.
- predicted_answers_score
  - A 2-dimensional NumPy array that contains (unnormalized) scores for each predicted answer. Higher scores indicate higher confidence in the prediction.
- predicted_answers_null_score
  - A 2-dimensional NumPy array that contains the (unnormalized) score that the model assigns to the question having no answer. The difference between the answer and the null scores can be seen as a measure of confidence in the answer.
  - Note: The null score for each answer to a given question may differ. This difference can happen if the model splits the context into multiple spans and predicts each span individually. The null score corresponds to the span where the predicted answer is found.
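The confidence measure mentioned for the null score (answer score minus null score) can be sketched as follows. The toy dictionary stands in for a loaded .pkl file: its values are made up, and the assumptions that both score arrays share the shape (number of questions, number of predicted answers) and that the first column holds the top-ranked answer are illustrative only.

```python
import numpy as np

# Toy stand-in for the dictionary loaded from a raw predictions .pkl file.
# All values are made up; shapes assume 2 questions with 3 predicted answers each.
out = {
    "predicted_answers_score": np.array([[7.5, 4.0, 1.5], [2.0, 1.0, 0.5]]),
    "predicted_answers_null_score": np.array([[1.5, 1.5, 2.0], [3.0, 3.0, 3.0]]),
}

# Difference between the answer score and the null score for each predicted answer.
margin = out["predicted_answers_score"] - out["predicted_answers_null_score"]

# A positive margin means the model scored the answer above "no answer".
# Only the first answer per question is checked (assumed to be the top-ranked one).
top_answer_beats_null = margin[:, 0] > 0

print(margin)
print(top_answer_beats_null)
```

Because the scores are unnormalized, the margin is probably best used for ranking or thresholding within one model's output rather than read as a probability.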
The CSV file contains the following columns:
- All the columns found in the train dataframe
- A column named pred_{answer_column_name} containing predictions (in string form) for the answer column, where answer_column_name refers to the name of the answer column found in the train dataframe
- A set of N columns with the following naming convention: pred_{answer_column_name}_top_{k}
  - N refers to the number of answers the model generated for the answer column
  - answer_column_name refers to the name of the answer column found in the train dataframe
  - k refers to the rank of the prediction for the answer column and ranges from 1 to N. Generated predictions are ranked from highest to lowest, where 1 refers to the highest-ranked prediction.
  - Note: The number of answers the model generates is determined by the following dataset setting: Number of predicted answers.
- The i-th sample of each item in the Pickle file's dictionary matches the i-th row of the CSV dataframe.
- To learn how to open the CSV and Pickle files, see Open CSV and Pickle files with Python.
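To make the naming convention for the ranked prediction columns concrete, here is a minimal sketch that generates the expected column names. The function name `top_k_column_names`, the answer column name "answer", and the value 3 for Number of predicted answers are all hypothetical.

```python
def top_k_column_names(answer_column_name, n):
    """Return the ranked prediction column names for ranks 1 through n."""
    return [f"pred_{answer_column_name}_top_{k}" for k in range(1, n + 1)]


# "answer" is a hypothetical answer column name, and 3 is a hypothetical
# value for the Number of predicted answers setting.
columns = top_k_column_names("answer", 3)
print(columns)
```

A list like this can be used to select just the ranked prediction columns from the loaded CSV dataframe.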
Open CSV and Pickle files with Python
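Using Python, a CSV or Pickle file containing predictions can be opened as follows. The example uses files from a text classification experiment, as its file paths and output show; files from a text span prediction experiment are opened the same way: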
import pickle
import pandas as pd
df = pd.read_csv('text_classification/validation_predictions.csv')
with open('text_classification/validation_raw_predictions.pkl', 'rb') as f:
    out = pickle.load(f)
print(out.keys())
dict_keys(['predictions', 'comment_text', 'labels'])
print(df.head(1))
id | comment_text | label_toxic | label_severe_toxic | label_obscene | label_threat | label_insult | label_identity_hate | fold | pred_label_toxic | pred_label_severe_toxic | pred_label_obscene | pred_label_threat | pred_label_insult | pred_label_identity_hate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00041 | 0.000168 | 0.000328 | 0.000142 | 0.000247 | 0.000155 |
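For text span prediction files, the same loading steps might look like the sketch below, which also checks that the dictionary entries line up with the CSV rows and that each prediction is a substring of its context. The files are toy stand-ins written to a temporary directory; the column names "question", "context", and "answer" and all values are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy raw predictions using hypothetical question and context column names.
raw = {
    "question": np.array(["Who wrote it?", "Where is it?"]),
    "context": np.array(["Ada wrote it.", "It is in Oslo."]),
    "predictions": np.array(["Ada", "Oslo"]),
}

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Write toy files that mimic the two files found in a predictions zip.
    with open(tmp_dir / "test_raw_predictions.pkl", "wb") as f:
        pickle.dump(raw, f)
    pd.DataFrame(
        {
            "question": raw["question"],
            "context": raw["context"],
            "pred_answer": raw["predictions"],
        }
    ).to_csv(tmp_dir / "test_predictions.csv", index=False)

    # Load both files with pandas and pickle.
    df = pd.read_csv(tmp_dir / "test_predictions.csv")
    with open(tmp_dir / "test_raw_predictions.pkl", "rb") as f:
        out = pickle.load(f)

# The i-th entry of each dictionary item should match the i-th CSV row.
rows_match = len(out["predictions"]) == len(df)
values_match = list(out["predictions"]) == df["pred_answer"].tolist()

# Each prediction is expected to be a substring of its context.
is_substring = all(p in c for p, c in zip(out["predictions"], out["context"]))

print(rows_match, values_match, is_substring)
```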