Prediction settings: Text sequence to sequence
Overview
To score (predict) new data through the H2O Hydrogen Torch UI with a built model, you need to specify certain settings, referred to as prediction settings. These comprise dataset, prediction, and environment settings similar to those used when creating an experiment. The prediction settings for a text sequence to sequence model are described below.
General settings
Experiment
Defines the model (experiment) H2O Hydrogen Torch uses to score new data.
Prediction name
Defines the name of the prediction.
Dataset settings
Dataset
Specifies the dataset to score.
Test dataframe
Defines the file containing the test dataframe that H2O Hydrogen Torch scores.
- Image regression | 3D image regression | Image classification | 3D image classification | Image metric learning | Text regression | Text classification | Text sequence to sequence | Text span prediction | Text token classification | Text metric learning | Audio regression | Audio classification
- Defines a .csv or .pq file containing the test dataframe that H2O Hydrogen Torch utilizes for scoring.
 Note: The test dataframe should have the same format as the train dataframe but does not require label columns.
- Image object detection | Image semantic segmentation | 3D image semantic segmentation | Image instance segmentation
- Defines a .pq file containing the test dataframe that H2O Hydrogen Torch utilizes for scoring.
Text column
Defines the column name with the input text that H2O Hydrogen Torch uses during scoring.
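The relationship between the train and test dataframes can be illustrated with a minimal sketch. It assumes a hypothetical train file with an input column named `text` and a label column named `summary`; both column names and the helper `make_test_csv` are illustrative only and are not part of H2O Hydrogen Torch. The test file keeps the same format but drops the label column.

```python
import csv
import io

def make_test_csv(train_csv_text, label_columns):
    """Return CSV text with the same columns as the input, minus the label columns."""
    reader = csv.DictReader(io.StringIO(train_csv_text))
    kept = [name for name in reader.fieldnames if name not in label_columns]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({name: row[name] for name in kept})
    return out.getvalue()

# Hypothetical train-style data: "text" is the input, "summary" is the label.
train_like = (
    "text,summary\n"
    "The cat sat on the mat.,Cat sits.\n"
    "It rained all day.,Rainy day.\n"
)
test_csv = make_test_csv(train_like, label_columns={"summary"})
print(test_csv)
```

With a file like this, the Text column setting would be set to `text`.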
Prediction settings
Metric
Defines the evaluation metric that H2O Hydrogen Torch uses to evaluate the model's accuracy on generated predictions.
Options
Image regression | 3D image regression | Text regression | Audio regression
- MAE: Mean absolute error
- The Mean Absolute Error (MAE) is an average of the absolute errors. The MAE units are the same as the predicted target, which is useful for understanding whether the size of the error is of concern or not. The smaller the MAE the better the model’s performance.
 
- MSE: Mean squared error
- The MSE metric measures the average of the squares of the errors or deviations. MSE takes the distances from the points to the regression line (these distances are the “errors”) and squares them to remove any negative signs. MSE incorporates both the variance and the bias of the predictor.
- MSE also gives more weight to larger differences. The bigger the error, the more it is penalized. For example, if your correct answers are 2,3,4 and the algorithm guesses 1,4,3, then the absolute error on each one is exactly 1, so squared error is also 1, and the MSE is 1. But if the algorithm guesses 2,3,6, then the errors are 0,0,2, the squared errors are 0,0,4, and the MSE is a higher 1.333. The smaller the MSE, the better the model’s performance.
 
- RMSE: Root mean squared error
- The Root Mean Squared Error (RMSE) metric evaluates how well a model can predict a continuous value. The RMSE units are the same as the predicted target, which is useful for understanding if the size of the error is of concern or not. The smaller the RMSE, the better the model’s performance.
- RMSE penalizes outliers more, as compared to MAE, so it is useful if we want to avoid having large errors.
 
- MAPE: Mean absolute percentage error
- Mean Absolute Percentage Error (MAPE) measures the size of the error in percentage terms. It is calculated as the average of the unsigned percentage error.
- MAPE is useful when target values are across different scales.
 
- SMAPE: Symmetric mean absolute percentage error
- Unlike the MAPE, which divides the absolute errors by the absolute actual values, the SMAPE divides by the mean of the absolute actual and the absolute predicted values. This is important when the actual values can be 0 or near 0. Actual values near 0 cause the MAPE value to become infinitely high. Because SMAPE includes both the actual and the predicted values, the SMAPE value can never be greater than 200%.
 
- R2: R squared
- The R2 value represents the degree to which the predicted value and the actual value move in unison. The R2 value varies between 0 and 1, where 0 represents no correlation between the predicted and actual value and 1 represents complete correlation.
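The error-based regression metrics above can be sketched in a few lines. This is an illustrative implementation of the standard definitions, not H2O Hydrogen Torch's own code; the function name `regression_metrics` is hypothetical, and the handling of zero denominators (infinite MAPE, a 0/0 SMAPE term counted as 0) is an assumption made for this example. It reuses the 2, 3, 4 example from the MSE description.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, MAPE (%) and SMAPE (%) for two equal-length lists."""
    n = len(actual)
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    mae = sum(abs_errors) / n
    mse = sum(e * e for e in abs_errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the actual value; reported as infinity here if any actual is 0.
    if any(a == 0 for a in actual):
        mape = float("inf")
    else:
        mape = 100 * sum(e / abs(a) for e, a in zip(abs_errors, actual)) / n
    # SMAPE divides by the mean of |actual| and |predicted|; a 0/0 term counts as 0.
    smape_terms = []
    for a, p, e in zip(actual, predicted, abs_errors):
        denom = (abs(a) + abs(p)) / 2
        smape_terms.append(0.0 if denom == 0 else e / denom)
    smape = 100 * sum(smape_terms) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "SMAPE": smape}

first = regression_metrics([2, 3, 4], [1, 4, 3])   # every error is 1, so MSE is 1.0
second = regression_metrics([2, 3, 4], [2, 3, 6])  # one error of 2, so MSE is 4/3
print(first["MSE"], round(second["MSE"], 3))
print(second["MAE"] < first["MAE"], second["MSE"] > first["MSE"])
```

The second set of guesses has a lower MAE but a higher MSE, which shows how squaring gives more weight to a single large error.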
 
Image classification | 3D image classification | Text classification | Audio classification
- LogLoss: Logarithmic loss
- The logarithmic loss metric can be used to evaluate the performance of a binomial or multinomial classifier. Unlike AUC which looks at how well a model can classify a binary target, logloss evaluates how close a model’s predicted values (uncalibrated probability estimates) are to the actual target value. For example, does a model tend to assign a high predicted value like .80 for the positive class, or does it show a poor ability to recognize the positive class and assign a lower predicted value like .50? Logloss can be any value greater than or equal to 0, with 0 meaning that the model correctly assigns a probability of 0% or 100%.
 
- ROC_AUC: Area under the receiver operating characteristic curve
- This model metric is used to evaluate how well a binary classification model is able to distinguish between true positives and false positives. For multi-class problems, this score is computed by micro-averaging the ROC curves for each class.
- An Area Under the Curve (AUC) of 1 indicates a perfect classifier, while an AUC of .5 indicates a poor classifier whose performance is no better than random guessing.
 
- F1
- The F1 score is calculated from the harmonic mean of the precision and recall. An F1 score of 1 means both precision and recall are perfect, and the model correctly identified all the positive cases and didn’t mark a negative case as a positive case. If either precision or recall is very low, it is reflected with an F1 score closer to 0.
- Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
- Precision is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives).
- Recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (the true positives + the false negatives).
 
- Micro-averaging: H2O Hydrogen Torch micro-averages the F1 metric (score).
- Multi-class: For multi-class classification experiments utilizing an F1 metric, the derived micro-average F1 metric might look suspicious because, in that case, the micro-average F1 metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing an F1 metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using one-hot encoding. As a result, the experiment is treated as a multi-class classification experiment, and the F1 metric is calculated incorrectly.
 
 
- F2
- The F2 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike the F1 score, which gives equal weight to precision and recall, the F2 score gives more weight to recall than to precision. More weight should be given to recall for cases where False Negatives are considered worse than False Positives. For example, if your use case is to predict which customers will churn, you may consider False Negatives worse than False Positives. In this case, you want your predictions to capture all of the customers that will churn. Some of these customers may not be at risk for churning, but the extra attention they receive is not harmful. More importantly, no customers actually at risk of churning have been missed.
- Formula: F2 = 5 * (Precision * Recall) / ((4 * Precision) + Recall)
- Precision is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives).
- Recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (the true positives + the false negatives).
 
- Micro-averaging: H2O Hydrogen Torch micro-averages the F2 metric (score).
- Multi-class: For multi-class classification experiments utilizing an F2 metric, the derived micro-average F2 metric might look suspicious because, in that case, the micro-average F2 metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing an F2 metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using one-hot encoding. As a result, the experiment is treated as a multi-class classification experiment, and the F2 metric is calculated incorrectly.
 
 
- F05
- The F05 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike the F1 score, which gives equal weight to precision and recall, the F05 score gives more weight to precision than to recall. More weight should be given to precision for cases where False Positives are considered worse than False Negatives. For example, if your use case is to predict which products you will run out of, you may consider False Positives worse than False Negatives. In this case, you want your predictions to be very precise and only capture the products that will definitely run out. If you predict a product will need to be restocked when it actually doesn’t, you incur cost by having purchased more inventory than you actually need.
- Formula: F05 = 1.25 * (Precision * Recall) / ((0.25 * Precision) + Recall)
- Precision is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives).
- Recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (the true positives + the false negatives).
 
- Micro-averaging: H2O Hydrogen Torch micro-averages the F05 metric (score).
- Multi-class: For multi-class classification experiments utilizing an F05 metric, the derived micro-average F05 metric might look suspicious because, in that case, the micro-average F05 metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing an F05 metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using one-hot encoding. As a result, the experiment is treated as a multi-class classification experiment, and the F05 metric is calculated incorrectly.
 
 
- Precision
- The precision metric measures the proportion of predicted positives that are true positives.
- Formula: Precision = True Positive / (True Positive + False Positive)
- Micro-averaging: H2O Hydrogen Torch micro-averages the precision metric (score).
- Multi-class: For multi-class classification experiments utilizing a precision metric, the derived micro-average precision metric might look suspicious because, in that case, the micro-average precision metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing a precision metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using one-hot encoding. As a result, the experiment is treated as a multi-class classification experiment, and the precision metric is calculated incorrectly.
 
 
- Recall
- The recall metric measures the proportion of actual positives that the model correctly identifies.
- Formula: Recall = True Positive / (True Positive + False Negative)
- Micro-averaging: H2O Hydrogen Torch micro-averages the recall metric (score).
- Multi-class: For multi-class classification experiments utilizing a recall metric, the derived micro-average recall metric might look suspicious because, in that case, the micro-average recall metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing a recall metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using one-hot encoding. As a result, the experiment is treated as a multi-class classification experiment, and the recall metric is calculated incorrectly.
 
 
- Accuracy
- In binary classification, Accuracy is the number of correct predictions made as a ratio of all predictions made. In multiclass classification, the set of labels predicted for a sample must exactly match the corresponding set of labels in target values.
 
- MCC: Matthews correlation coefficient
- The goal of the Matthews Correlation Coefficient (MCC) metric is to represent the confusion matrix of a model as a single number. The MCC metric combines the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) using the following MCC equation: MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)).
- Unlike metrics like Accuracy, MCC is a good scorer to use when the target variable is imbalanced. In the case of imbalanced data, high Accuracy can be found by predicting the majority class. Metrics like Accuracy and F1 can be misleading, especially in the case of imbalanced data, because they do not consider the relative size of the four confusion matrix categories. MCC, on the other hand, takes the proportion of each class into account. The MCC value ranges from -1 to 1 where -1 indicates a classifier that predicts the opposite class from the actual value, 0 means the classifier does no better than random guessing, and 1 indicates a perfect classifier.
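Two of the claims above can be checked with a short sketch: that a micro-averaged F score equals accuracy for single-label multi-class data, and that MCC exposes a majority-class predictor that accuracy rewards. The code uses the general F-beta form, (1 + beta^2) * P * R / (beta^2 * P + R), which gives F1, F2, and F05 for beta = 1, 2, and 0.5. The function names and the toy labels are hypothetical, and returning 0 when a denominator is 0 is an assumed convention; this is not H2O Hydrogen Torch's implementation.

```python
import math

def f_beta(precision, recall, beta):
    """F-beta score; beta = 1, 2 and 0.5 correspond to F1, F2 and F05."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom

def micro_counts(y_true, y_pred):
    """Sum true positives, false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    return tp, fp, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0 for a zero denominator is an assumed convention.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy single-label multi-class data: 3 of 5 predictions are correct.
y_true = ["a", "b", "c", "a", "b"]
y_pred = ["a", "c", "c", "a", "a"]
tp, fp, fn = micro_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro_f1 = f_beta(precision, recall, beta=1)
print(micro_f1, accuracy)  # equal up to floating-point rounding

# Imbalanced binary data where every sample is predicted positive:
# accuracy is 0.9, but MCC is 0.
majority_mcc = mcc(tp=90, fp=10, tn=0, fn=0)
print(majority_mcc)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro-averaged precision and recall both reduce to the share of correct predictions.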
 
Image object detection
- mAP: Mean average precision
Image semantic segmentation | 3D image semantic segmentation
- IoU: Intersection over union
- Dice
Image instance segmentation
- COCO_mAP: COCO (Common Objects in Context) mean average precision
Image metric learning | Text metric learning
- mAP: Mean average precision
Text token classification
- CONLL_MICRO_F1_SCORE
- Micro F1 score calculated in CoNLL style
 
- CONLL_MACRO_F1_SCORE
- Macro F1 score calculated in CoNLL style
 
- MICRO_F1_SCORE: Micro F1 score
- MACRO_F1_SCORE: Macro F1 score
Text span prediction
- Jaccard
- F1
- Accuracy
- Top_2_Accuracy
- Top_3_Accuracy
- Top_4_Accuracy
- Top_5_Accuracy
Text sequence to sequence
- BLEU
- Computes the BLEU metric given hypotheses and references
 
- CHRF
- Computes the chrF(++) metric given hypotheses and references
 
- TER
- Computes the translation edit rate metric given hypotheses and references
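To give a feel for how a hypothesis is compared against a reference, here is a minimal sketch of clipped (modified) n-gram precision, the building block of BLEU. It is deliberately incomplete: full BLEU also combines several n-gram orders and applies a brevity penalty, and chrF and TER use different comparisons. The function names and whitespace tokenization are assumptions for this example, and the code is not taken from H2O Hydrogen Torch.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp = ngram_counts(hypothesis.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Clipping: an n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return clipped / total

hypothesis = "the the the cat"
reference = "the cat sat"
unigram_precision = modified_precision(hypothesis, reference, 1)  # 2 of 4 unigrams
bigram_precision = modified_precision(hypothesis, reference, 2)   # 1 of 3 bigrams
print(unigram_precision, bigram_precision)
```

Clipping is what keeps a hypothesis that repeats "the" from scoring perfectly: only one of its three occurrences is credited because the reference contains the word once.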
 
Speech recognition
- WER: Word error rate
- CER: Character error rate
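Both error rates are commonly defined as the edit (Levenshtein) distance between the reference and the prediction, divided by the reference length, counted in words for WER and in characters for CER. The sketch below follows that common definition; the function names are hypothetical, a non-empty reference is assumed, and any text normalization a real scorer might apply is left out.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
# One substituted word ("sat" -> "sit") and one deleted word ("the"): 2 edits over 6 words.
print(wer(reference, hypothesis))
print(cer(reference, hypothesis))
```

Lower values are better for both metrics, and a value above 1 is possible when the prediction contains many extra words or characters.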
Max length inference
Defines the maximum length value H2O Hydrogen Torch uses for the generated text.
- Similar to the Max length setting in the Tokenizer Settings section (when defining the settings of the experiment), this setting specifies the maximum number of tokens to predict for a given prediction sample.
- This setting affects both the predictions and the evaluation metrics. Choose a value based on the dataset and the average length of the output sequences you expect the model to predict.
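The effect of a maximum length on generated text can be sketched with a toy generation loop. The "model" here is a hypothetical lookup table, and the token names and the `generate` function are made up for illustration; whether an end-of-sequence token counts toward the limit is an implementation detail that this sketch does not try to mirror.

```python
EOS = "<eos>"

# Hypothetical toy "model": maps the last token to the next token.
TOY_NEXT_TOKEN = {
    "<bos>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": EOS,
}

def generate(max_length):
    """Generate tokens until the end-of-sequence token or max_length is reached."""
    tokens = []
    last = "<bos>"
    while len(tokens) < max_length:
        last = TOY_NEXT_TOKEN[last]
        if last == EOS:
            break
        tokens.append(last)
    return tokens

print(generate(max_length=2))   # ['the', 'cat']
print(generate(max_length=10))  # ['the', 'cat', 'sat', 'down']
```

A limit that is too small truncates the prediction, which lowers metrics that compare it with a full-length reference; a larger limit only costs extra runtime when the model does not stop on its own.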
Do sample
Determines whether to sample from the next token distribution instead of choosing the token with the highest probability. If turned On, the next token in a predicted sequence is sampled based on the probabilities. If turned Off, the token with the highest probability is always chosen.
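The difference between the two modes can be sketched for a single decoding step. The probabilities, token names, and the `pick_next_token` helper are hypothetical; a real model produces a distribution over its whole vocabulary at every step.

```python
import random

# Hypothetical next-token distribution for one decoding step.
next_token_probs = {"cat": 0.5, "dog": 0.3, "bird": 0.2}

def pick_next_token(probs, do_sample, rng):
    """Sample from the distribution if do_sample is True, else take the most likely token."""
    if do_sample:
        tokens = list(probs)
        weights = [probs[token] for token in tokens]
        return rng.choices(tokens, weights=weights, k=1)[0]
    return max(probs, key=probs.get)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
greedy = [pick_next_token(next_token_probs, False, rng) for _ in range(5)]
sampled = [pick_next_token(next_token_probs, True, rng) for _ in range(1000)]

print(greedy)  # always the most likely token
print({token: sampled.count(token) for token in next_token_probs})
```

With sampling turned off, repeated runs give identical output; with sampling turned on, less likely tokens appear roughly in proportion to their probabilities.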
Num beams
Defines the number of beams to use for beam search. The default value is 1 (a single beam), which means no beam search.
Using more beams increases prediction runtime while potentially improving accuracy.
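Why more beams can help is easiest to see on a toy example. The sketch below assumes a hypothetical two-step model, given as a table of next-token probabilities per prefix, in which the most likely first token does not lead to the most likely full sequence. The table and the `beam_search` function are illustrative only.

```python
import math

# Hypothetical toy model: next-token probabilities conditioned on the prefix.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(num_beams, steps=2):
    """Keep the num_beams most probable prefixes at every step; return the best one."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in TOY_MODEL[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    best_prefix, best_score = beams[0]
    return best_prefix, math.exp(best_score)

print(beam_search(num_beams=1))  # greedy: commits to "A", ends with probability about 0.33
print(beam_search(num_beams=2))  # keeps "B" alive, finds ("B", "x") with probability about 0.36
```

Each extra beam multiplies the number of candidates scored per step, which is where the additional runtime comes from.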
Temperature
Defines the temperature to use for sampling from the next token distribution during validation and inference. In other words, the defined temperature controls the randomness of predictions by scaling the logits before applying softmax. A higher temperature makes the distribution more random.
- Modify the temperature value if you have the Do Sample setting enabled (On).
- To learn more about this setting, refer to the following article: How to generate text: using different decoding methods for language generation with Transformers.
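Scaling the logits before softmax can be sketched as follows. The logits are made-up values, the function name is hypothetical, and the temperature is assumed to be greater than 0.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the probability of the top token falls and the distribution flattens, so sampling becomes more random; values below 1 sharpen the distribution toward the most likely token.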
Environment settings
GPUs
Specifies the list of GPUs H2O Hydrogen Torch can use for scoring. GPUs are listed by name, referring to their system ID (starting from 1). If no GPUs are selected, H2O Hydrogen Torch utilizes CPUs for model scoring.