Task 3: Experiment scoring and analysis concepts
Overview
As noted in the Automatic Machine Learning Introduction with Driverless AI tutorial, it is essential to evaluate a model's performance once training completes, especially before deploying it to production. Performance metrics help us assess the quality of the built model and determine an appropriate model score threshold for making predictions.
Several metrics exist for assessing binary classification machine learning models, including Receiver Operating Characteristic (ROC) curves, Precision and Recall (often combined as Prec-Recall), Lift, Gain, and K-S Charts. Each metric evaluates different aspects of the model's performance.
The following concepts provide a high-level overview of metrics used in H2O Driverless AI to assess performance for classification models it generates. In-depth explanations for each metric can be found in the additional resources provided at the end of this task.
Binary
A binary classification model predicts which of two categories (classes) each element of a given set belongs to. In our example, the two categories (classes) are defaulting on a home loan and not defaulting.
Refer here for more information on binary classification models.
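To make the idea concrete, here is a minimal sketch of a binary classifier built with scikit-learn on synthetic data. This is an illustration only, not how Driverless AI trains its models; the library, the synthetic features, and the 0.5 threshold are all assumptions for the example.

```python
# Minimal sketch: train a toy binary classifier on synthetic "loan" data.
# scikit-learn is used purely for illustration; Driverless AI builds
# its models automatically.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two classes: 1 = defaulted on the home loan, 0 = did not default.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba returns P(class 0) and P(class 1) per row; a threshold
# on P(class 1) turns probabilities into the two class labels.
probs = model.predict_proba(X_test)[:, 1]
labels = (probs >= 0.5).astype(int)  # assumed threshold of 0.5
```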
ROC (Receiver Operating Characteristic) curve
The ROC curve visually shows how well a binary classifier distinguishes between classes (e.g., loan default or not default). It plots true positive rate (sensitivity) vs. false positive rate (1-specificity) for various classification thresholds, helping us choose the optimal threshold for our specific needs.
Refer to ROC for more information on ROC curves.
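As an illustration of how the curve is produced, the sketch below computes ROC points and the AUC with scikit-learn from toy labels and probabilities; none of these values come from Driverless AI, and the library choice is an assumption for the example.

```python
# Illustrative sketch: plot an ROC curve from true labels and
# predicted probabilities (toy values, not Driverless AI output).
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]                        # actual outcomes
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.6, 0.7, 0.3]  # P(default)

# fpr/tpr are computed at every distinct threshold in y_prob.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

plt.plot(fpr, tpr, label=f"ROC (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], "--", label="random classifier")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```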
Precision-Recall curve
The Precision-Recall curve (P-R curve) helps assess classification models, especially with imbalanced datasets. It plots precision (exactness of positive results) vs. recall (completeness of positive results) for different thresholds. In simpler terms, it shows a trade-off between how relevant and how complete the model's positive predictions are.
Refer to Precision-Recall for more information on Precision-Recall curves.
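The sketch below shows the same idea with scikit-learn: precision and recall are computed at every threshold, and the average precision summarizes the curve as a single number. The labels and probabilities are toy values assumed for illustration.

```python
# Illustrative sketch: precision-recall curve and average precision
# from toy labels and predicted probabilities.
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.6, 0.7, 0.3]

# precision/recall are evaluated at every distinct threshold in y_prob.
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
ap = average_precision_score(y_true, y_prob)  # area under the P-R curve
print(f"Average precision: {ap:.3f}")
```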
Evaluation metrics
- Accuracy (ACC): The percentage of correctly classified instances (both positive and negative).
- F1 Score (F1): The harmonic mean of precision and recall; it penalizes models whose precision and recall are far apart.
- F0.5, F2: Variations of F1 score that put more weight on precision (F0.5) or recall (F2).
- MCC (Matthews Correlation Coefficient): A balanced measure that considers true positives, negatives, false positives, and false negatives.
- Log Loss: A measure of how well the model's predicted probabilities match the actual outcomes, lower is better.
- GINI: A measure of a model's ability to separate classes, related to the area under the ROC curve (GINI = 2 × AUC − 1); not to be confused with the Gini impurity used to split decision trees.
Refer here for more information on evaluation metrics.
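All of these metrics can be computed from a model's predicted labels and probabilities. Below is a minimal sketch using scikit-learn on toy predictions; the library and the 0.5 threshold are assumptions for illustration (Driverless AI reports these scores automatically).

```python
# Illustrative sketch: compute the listed metrics from toy labels and
# probabilities (threshold 0.5 turns probabilities into labels).
from sklearn.metrics import (accuracy_score, f1_score, fbeta_score,
                             matthews_corrcoef, log_loss, roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.6, 0.7, 0.3]
y_pred = [int(p >= 0.5) for p in y_prob]

print("ACC:     ", accuracy_score(y_true, y_pred))
print("F1:      ", f1_score(y_true, y_pred))
print("F0.5:    ", fbeta_score(y_true, y_pred, beta=0.5))  # weights precision
print("F2:      ", fbeta_score(y_true, y_pred, beta=2))    # weights recall
print("MCC:     ", matthews_corrcoef(y_true, y_pred))
print("LogLoss: ", log_loss(y_true, y_prob))                # lower is better
print("GINI:    ", 2 * roc_auc_score(y_true, y_prob) - 1)   # GINI = 2*AUC - 1
```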
Gain and lift charts
Gain and Lift charts measure a classification model's effectiveness by comparing the results obtained with the trained model against those of a random model (or no model). They help us evaluate the classifier's performance and answer questions such as: what percentage of all positive responses is captured when only a selected percentage of the sample is targeted? They also show how much better you can expect to do with the model than with a random model (or no model).
Refer here for more information on Gain and Lift charts.
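One way to see the computation: sort rows by predicted probability, then compare the share of positives captured in the top slices against what random selection would capture. The sketch below uses NumPy on synthetic scores; the data and slice sizes are assumptions for illustration, not Driverless AI's own implementation.

```python
# Illustrative sketch: cumulative gain and lift by top-percentage slice
# after sorting rows by predicted probability.
import numpy as np

rng = np.random.default_rng(42)
y_prob = rng.random(1000)                          # toy predicted P(default)
y_true = (rng.random(1000) < y_prob).astype(int)   # outcomes correlated with scores

order = np.argsort(-y_prob)      # highest-scoring rows first
sorted_true = y_true[order]
total_pos = sorted_true.sum()

for pct in (10, 20, 30, 40, 50):
    n = len(sorted_true) * pct // 100
    gain = sorted_true[:n].sum() / total_pos   # share of positives captured
    lift = gain / (pct / 100)                  # vs. random selection
    print(f"top {pct:2d}%: gain = {gain:.2f}, lift = {lift:.2f}")
```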
Kolmogorov-Smirnov (KS) chart
The Kolmogorov-Smirnov (K-S) statistic measures a classification model's performance as the degree of separation between the predicted-score distributions of positives and negatives on validation or test data.
Refer to the Kolmogorov-Smirnov chart for more information on K-S charts.
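Concretely, K-S is the maximum vertical gap between the cumulative distributions of predicted scores for the positive and negative classes. A minimal sketch, assuming SciPy and synthetic scores rather than Driverless AI output:

```python
# Illustrative sketch: the two-sample K-S statistic is the maximum gap
# between the cumulative score distributions of positives and negatives.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Toy scores: positives tend to receive higher predicted probabilities.
scores_pos = rng.beta(4, 2, size=500)  # predicted P(default) for defaulters
scores_neg = rng.beta(2, 4, size=500)  # predicted P(default) for non-defaulters

ks_stat, p_value = ks_2samp(scores_pos, scores_neg)
print(f"K-S statistic: {ks_stat:.3f}")  # closer to 1 = better separation
```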