library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 minutes 57 seconds
## H2O cluster timezone: Europe/Prague
## H2O data parsing timezone: UTC
## H2O cluster version: 3.37.0.99999
## H2O cluster version age: 4 minutes
## H2O cluster name: tomasfryda
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.74 GB
## H2O cluster total cores: 16
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.1.3 (2022-03-10)
h2o.no_progress()
df <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv")
response <- "quality"
predictors <- c(
"fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide",
"total sulfur dioxide", "density", "pH", "sulphates", "alcohol", "type"
)
df_splits <- h2o.splitFrame(df, seed = 1)
train <- df_splits[[1]]
test <- df_splits[[2]]
aml <- h2o.automl(predictors, response, train, max_runtime_secs = 120)
h2o.explain(aml, test)
The leaderboard lists the trained models with their metrics. When given an H2OAutoML object, it shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the supplied newdata. At most 20 models are shown by default.
 | model_id | rmse | mse | mae | rmsle | mean_residual_deviance | training_time_ms | predict_time_per_row_ms | algo |
---|---|---|---|---|---|---|---|---|---|
1 | StackedEnsemble_AllModels_3_AutoML_1_20220408_154334 | 0.617685776617159 | 0.381535718635142 | 0.438378587069498 | 0.0936590959349768 | 0.381535718635142 | 478 | 0.039631 | StackedEnsemble |
2 | StackedEnsemble_AllModels_2_AutoML_1_20220408_154334 | 0.618013024931484 | 0.381940098984963 | 0.439959229672551 | 0.0937501017812696 | 0.381940098984963 | 356 | 0.019012 | StackedEnsemble |
3 | StackedEnsemble_AllModels_1_AutoML_1_20220408_154334 | 0.618214658361304 | 0.382189363812783 | 0.440410150437152 | 0.0937798961585461 | 0.382189363812783 | 344 | 0.012239 | StackedEnsemble |
4 | StackedEnsemble_BestOfFamily_3_AutoML_1_20220408_154334 | 0.618930632968796 | 0.383075128427154 | 0.441219995671507 | 0.0938838970587147 | 0.383075128427154 | 469 | 0.016097 | StackedEnsemble |
5 | StackedEnsemble_BestOfFamily_2_AutoML_1_20220408_154334 | 0.619252731243278 | 0.38347394515226 | 0.442115525875666 | 0.0939366793118372 | 0.38347394515226 | 368 | 0.013446 | StackedEnsemble |
6 | StackedEnsemble_BestOfFamily_4_AutoML_1_20220408_154334 | 0.620728928088902 | 0.385304402166397 | 0.443204693679439 | 0.094117617659717 | 0.385304402166397 | 242 | 0.02157 | StackedEnsemble |
7 | DRF_1_AutoML_1_20220408_154334 | 0.623399620739062 | 0.388627087137606 | 0.451090313693956 | 0.0946911453486115 | 0.388627087137606 | 3195 | 0.005967 | DRF |
8 | XRT_1_AutoML_1_20220408_154334 | 0.62606109981841 | 0.391952500705837 | 0.453111049733941 | 0.0951338672138981 | 0.391952500705837 | 2223 | 0.006284 | DRF |
9 | GBM_grid_1_AutoML_1_20220408_154334_model_7 | 0.637915265136001 | 0.406935885493535 | 0.476742598832093 | 0.0964849953561878 | 0.406935885493535 | 649 | 0.006073 | GBM |
10 | GBM_grid_1_AutoML_1_20220408_154334_model_6 | 0.641354902962878 | 0.411336111554523 | 0.476741007078573 | 0.0969221694646524 | 0.411336111554523 | 626 | 0.005926 | GBM |
11 | GBM_grid_1_AutoML_1_20220408_154334_model_9 | 0.642958284406168 | 0.413395355486523 | 0.476448028841323 | 0.0975142896460704 | 0.413395355486523 | 601 | 0.002098 | GBM |
12 | XGBoost_grid_1_AutoML_1_20220408_154334_model_10 | 0.648563761818005 | 0.420634953143521 | 0.47752383277203 | 0.0978531204988241 | 0.420634953143521 | 1142 | 0.00208 | XGBoost |
13 | GBM_grid_1_AutoML_1_20220408_154334_model_5 | 0.648599756724984 | 0.420681644423708 | 0.483004603881046 | 0.0980326905339133 | 0.420681644423708 | 825 | 0.004597 | GBM |
14 | GBM_4_AutoML_1_20220408_154334 | 0.649691920305335 | 0.422099591310033 | 0.487717334731695 | 0.0982314888081577 | 0.422099591310033 | 878 | 0.004574 | GBM |
15 | StackedEnsemble_BestOfFamily_1_AutoML_1_20220408_154334 | 0.650043181849592 | 0.422556138269142 | 0.486853519201754 | 0.0981231711888431 | 0.422556138269142 | 576 | 0.007308 | StackedEnsemble |
16 | GBM_grid_1_AutoML_1_20220408_154334_model_3 | 0.650982115890274 | 0.423777715208978 | 0.4882610338403 | 0.0986011433390065 | 0.423777715208978 | 566 | 0.005961 | GBM |
17 | GBM_grid_1_AutoML_1_20220408_154334_model_4 | 0.653741189108367 | 0.427377542336821 | 0.496609192879076 | 0.0986346994882898 | 0.427377542336821 | 606 | 0.005631 | GBM |
18 | XGBoost_grid_1_AutoML_1_20220408_154334_model_2 | 0.654850868750006 | 0.428829660302637 | 0.465980194210394 | 0.0992954680663143 | 0.428829660302637 | 2734 | 0.001862 | XGBoost |
19 | GBM_3_AutoML_1_20220408_154334 | 0.655021303157682 | 0.429052907590388 | 0.498592026724169 | 0.0988840102934742 | 0.429052907590388 | 792 | 0.004217 | GBM |
20 | XGBoost_grid_1_AutoML_1_20220408_154334_model_5 | 0.658951956027864 | 0.434217680352948 | 0.45547348612504 | 0.099599420783599 | 0.434217680352948 | 1615 | 0.001332 | XGBoost |
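Beyond the combined `h2o.explain` output, the leaderboard can be pulled out on its own for further inspection. A minimal sketch, assuming the `aml` object from above and a running H2O cluster:

```r
library(h2o)

# Retrieve the leaderboard as an H2OFrame; extra_columns = "ALL" adds
# columns such as training_time_ms and predict_time_per_row_ms.
lb <- h2o.get_leaderboard(aml, extra_columns = "ALL")
print(lb, n = 25)

# Convert to a plain data.frame for local sorting and filtering.
lb_df <- as.data.frame(lb)
head(lb_df)
```

The leader model itself is available as `aml@leader`, e.g. for scoring the test set with `h2o.predict(aml@leader, test)`.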
Residual analysis plots the fitted values against the residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with model selection, e.g., using a simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see “striped” lines of residuals, that is an artifact of an integer-valued (as opposed to real-valued) response variable.
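This plot can also be produced for a single model rather than the whole explain report. A sketch, assuming the `aml` and `test` objects from above:

```r
# Fitted values vs. residuals for the leader model on the test set.
# h2o.residual_analysis_plot takes a single model, not an AutoML object.
h2o.residual_analysis_plot(aml@leader, test)
```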
The variable importance plot shows the relative importance of the most important variables in the model.
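To inspect variable importance for one model directly, a sketch (assuming the `aml` object from above; stacked ensembles do not report standard variable importance, so a tree-based model is picked here):

```r
# Select the best non-ensemble model of a given algorithm from the run.
best_gbm <- h2o.get_best_model(aml, algorithm = "gbm")

# Bar chart of relative importances, and the underlying table.
h2o.varimp_plot(best_gbm)
h2o.varimp(best_gbm)
```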
The variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g., Deep Learning, XGBoost). So that the variable importance of categorical columns can be compared across all model types, we summarize the variable importance across all one-hot encoded features and return a single importance value for the original categorical feature. By default, the models and variables are ordered by their similarity.
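The heatmap can likewise be generated standalone from the AutoML object; a sketch assuming the `aml` object from above:

```r
# Variable importance across the AutoML models (ensembles are excluded,
# since they have no standard variable importance), with one-hot encoded
# categorical importances aggregated per original column.
h2o.varimp_heatmap(aml)
```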