Predictions download formats: Text metric learning
Overview
When you download predictions in H2O Hydrogen Torch, they arrive as a zip file. The format and content of the files inside depend first on the problem type of the predictions, and then on how you generated them. On the point of "how you generate them," there are two scenarios.
Scenario 1: Predictions from a completed experiment
Predictions downloaded from a completed experiment on the View experiments card are packaged in a zip file. This zip file contains the following files:
- `validation_predictions.csv`: A structured dataframe in CSV format containing the final predictions for the provided validation dataframe.
- `validation_raw_predictions.pkl`: A Pickle file (a pickled Python dictionary) containing raw predictions for the provided validation dataframe.

If the experiment included a test dataframe, H2O Hydrogen Torch also includes two additional files in the same zip file:

- `test_predictions.csv`: A structured dataframe in CSV format containing the final predictions for the provided test dataframe.
- `test_raw_predictions.pkl`: A Pickle file with raw predictions for the provided test dataframe.
Scenario 2: Predictions generated by scoring on new data
Predictions generated by scoring on new data through the H2O Hydrogen Torch UI (on the Predict data card) are downloaded in a zip file. This zip file includes the following files:
- `test_predictions.csv`: A structured dataframe in CSV format containing the final predictions for the provided test dataframe.
- `test_raw_predictions.pkl`: A Pickle file (a pickled Python dictionary) containing raw predictions for the provided test dataframe.
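In either scenario, the downloaded archive can also be unpacked programmatically with Python's standard-library `zipfile` module. The sketch below is self-contained, so it first creates a stand-in archive using the Scenario 2 file names; the archive name `predictions.zip` is made up and the actual download name may differ:

```python
import zipfile
from pathlib import Path

# Hypothetical path to the downloaded archive; adjust to your download location.
archive = Path("predictions.zip")

# For this self-contained sketch, create a stand-in archive containing
# empty files with the names documented for Scenario 2.
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("test_predictions.csv", "id,text\n")
    zf.writestr("test_raw_predictions.pkl", b"")

# Extract everything into a local directory.
out_dir = Path("predictions")
with zipfile.ZipFile(archive) as zf:
    zf.extractall(out_dir)

print(sorted(p.name for p in out_dir.iterdir()))
# ['test_predictions.csv', 'test_raw_predictions.pkl']
```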
Formats
- `.pkl` file keys
- `.csv` file columns
The Pickle file contains the following keys:
- `embeddings`: A 2-dimensional NumPy array that contains text embeddings. The shape of the array is (n, m), where n represents the number of observations and m represents the embedding size. Texts with similar embeddings are predicted to have a similar semantic meaning.
  Note: You can define the embedding size under the Architecture settings section when building a text metric learning experiment.
- `cosine_similarities`: A 2-dimensional NumPy array that contains cosine similarities between validation (test) texts. The shape of the array is `number_of_observations` x `{top_k_similar}`, where `{top_k_similar}` refers to the Top K Similar value selected for the experiment.
- `similar_texts`: A 2-dimensional NumPy array that contains indices of similar validation (test) texts. The shape of the array is `number_of_observations` x `{top_k_similar}`, where `{top_k_similar}` refers to the Top K Similar value selected for the experiment.
- `[text_column]`: A 1-dimensional NumPy array that contains texts from the original text column in the train dataframe. The name of the key is `[text_column]`, where `text_column` refers to the name of the text column in the train dataframe.
  Note: You can define the text column under the Dataset settings section when building a text metric learning experiment.
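Taken together, the keys above can be illustrated with a self-contained sketch. Since no real `.pkl` file is at hand here, the sketch builds the raw-predictions dictionary itself with plain NumPy (the sizes, the `text` key name, and a Top K Similar of 3 are all made-up values) and then round-trips it through `pickle`. Whether H2O Hydrogen Torch excludes a text from its own similar-text list is not stated above; this sketch simply skips the self-match:

```python
import pickle
import numpy as np

rng = np.random.default_rng(0)
n, m, top_k = 5, 8, 3          # observations, embedding size, Top K Similar

embeddings = rng.normal(size=(n, m))               # (n, m), as documented

# Normalize rows so dot products become cosine similarities.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
all_sims = normed @ normed.T                       # (n, n) pairwise similarities

# Indices of the top_k most similar texts per row (position 0 is the
# self-match with similarity 1.0, so it is skipped) ...
order = np.argsort(-all_sims, axis=1)
similar_texts = order[:, 1:top_k + 1]              # (n, top_k)
# ... and the corresponding cosine similarity values.
cosine_similarities = np.take_along_axis(all_sims, similar_texts, axis=1)

# Assemble a dictionary with the documented keys; "text" stands in for
# the dataset's actual text column name.
raw = {
    "embeddings": embeddings,
    "cosine_similarities": cosine_similarities,
    "similar_texts": similar_texts,
    "text": np.array([f"example text {i}" for i in range(n)]),
}

with open("raw_predictions_demo.pkl", "wb") as f:
    pickle.dump(raw, f)

with open("raw_predictions_demo.pkl", "rb") as f:
    out = pickle.load(f)

print(sorted(out.keys()))
# ['cosine_similarities', 'embeddings', 'similar_texts', 'text']
print(out["embeddings"].shape, out["similar_texts"].shape)
# (5, 8) (5, 3)
```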
The CSV file contains the following columns:
- All the N columns found in the train dataframe
- Three columns named `top_{k}_similar_text`, where `k` is a number between 1 and 3; each contains the top-k text most similar to the input text
- Three columns named `top_{k}_cosine_similarity`, where `k` is a number between 1 and 3; each contains the cosine similarity value between the input text and the corresponding `top_{k}_similar_text`
- The i-th element of each dictionary item in the Pickle file matches the i-th row of the CSV dataframe.
- To learn how to open the CSV and Pickle files, see Open CSV and Pickle files with Python.
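As a concrete illustration of that column layout, here is a stand-in dataframe with one train column (`text`, a made-up name) plus the six prediction columns described above; the values are invented for the example:

```python
import pandas as pd

# Stand-in predictions dataframe; column names follow the documented
# top_{k}_similar_text / top_{k}_cosine_similarity pattern for k = 1..3.
df = pd.DataFrame({
    "text": ["how to cook rice", "train a model"],
    "top_1_similar_text": ["cooking rice basics", "model training tips"],
    "top_2_similar_text": ["rice recipes", "fit an estimator"],
    "top_3_similar_text": ["boiling water", "tune hyperparameters"],
    "top_1_cosine_similarity": [0.91, 0.88],
    "top_2_cosine_similarity": [0.85, 0.80],
    "top_3_cosine_similarity": [0.72, 0.66],
})

# Collect the similarity columns for k = 1..3 and inspect the first row.
sim_cols = [f"top_{k}_cosine_similarity" for k in range(1, 4)]
print(df[sim_cols].iloc[0].tolist())
# [0.91, 0.85, 0.72]
```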
Open CSV and Pickle files with Python
Using Python, a CSV or Pickle file containing predictions can be opened as follows (this example uses files from a text classification experiment, so the dictionary keys differ from the text metric learning keys described above):

```python
import pickle

import pandas as pd

df = pd.read_csv('text_classification/validation_predictions.csv')

with open('text_classification/validation_raw_predictions.pkl', 'rb') as f:
    out = pickle.load(f)

print(out.keys())
# dict_keys(['predictions', 'comment_text', 'labels'])
```
```python
print(df.head(1))
```

| id | comment_text | label_toxic | label_severe_toxic | label_obscene | label_threat | label_insult | label_identity_hate | fold | pred_label_toxic | pred_label_severe_toxic | pred_label_obscene | pred_label_threat | pred_label_insult | pred_label_identity_hate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00041 | 0.000168 | 0.000328 | 0.000142 | 0.000247 | 0.000155 |