Cross Validation
^^^^^^^^^^^^^^^^

Which parameters are used for or with cross validation?
######################################################################

- ``nfolds``
- ``keep_cross_validation_models``
- ``keep_cross_validation_predictions``
- ``keep_cross_validation_fold_assignment``
- ``fold_assignment``
- ``fold_column``

If a user activates cross validation in one of the algorithms (``h2o.randomForest()``, ``h2o.gbm()``, etc.), will H2O output estimates of model performance on only the holdout sets?
########################################################################################################################################################################################################

No. H2O builds ``nfolds + 1`` models in total: the "main" model on 100% of the training data, and ``nfolds`` "cross-validation" models that use disjoint holdout "validation" sets (obtained from the training data) to estimate how well the main model generalizes. The main model contains a cross-validation metrics object that is computed from the combined holdout predictions (obtained by setting ``xval`` to true in ``h2o.performance``), as well as a table containing the statistics of various metrics across all ``nfolds`` cross-validation models (e.g., the mean and stddev of the logloss, rmse, etc.).

You can also get the performance of the main model on the ``training_frame`` dataset if you specify ``train = TRUE`` (R) or ``train = True`` (Python) when you ask for a model performance metric. If you provide a ``validation_frame`` during cross-validation, you can get the performance of the main model on that frame by specifying ``valid = TRUE`` (R) or ``valid = True`` (Python) when you ask for a model performance metric.

Can H2O automatically feed back the implications of the cross-validation results to improve the algorithm during training, and tune some of the model's hyperparameters?
##############################################################################################################################################################################################

Yes. H2O can use cross-validation for parameter tuning if early stopping is enabled (``stopping_rounds > 0``). In that case, cross-validation is used to automatically tune the number of epochs for Deep Learning or the number of trees for DRF/GBM. The main model will use the mean number of epochs across all cross-validation models.

If a ``validation_frame`` isn't specified, does supplying the ``nfolds`` parameter activate cross-validation scoring on the ``training_frame`` dataset's holdouts?
####################################################################################################################################################################################

Yes (if ``nfolds > 1``).

Does the model only train on the training data?
############################################################

The model only ever trains on training data, but it can use validation data (if provided) to tune parameters related to early stopping (epochs, number of trees). If no validation data is provided, these parameters are tuned on the training data.

Does supplying the ``validation_frame`` parameter activate scoring on the ``validation_frame`` dataset instead of the ``training_frame`` dataset?
################################################################################################################################################################

No. The models always score on the training frame (unless this is explicitly turned off, which is only possible in Deep Learning), but if a validation frame is provided, the model scores on that as well (and can use it for parameter tuning such as early stopping). It's always a good idea to provide a validation set. If you don't want to "sacrifice" data, use cross-validation instead.
Then, you can still provide a validation frame, but you don't have to (and it isn't used for parameter tuning either, just for metrics reporting).

If the ``nfolds`` parameter is not specified but ``validation_frame`` and ``training_frame`` are, is cross-validation activated with some default value for ``nfolds``?
##############################################################################################################################################################################################

No. When a training frame and a validation frame are supplied without the ``nfolds`` parameter, training is done on the ``training_frame`` and validation is done on the ``validation_frame``. Cross-validation is only activated if ``nfolds > 1``.

Is early stopping (``stopping_rounds > 0``) based on the ``validation_frame`` dataset, if provided, and otherwise based on the ``training_frame`` dataset?
##########################################################################################################################################################################

Yes.
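
The ``nfolds + 1`` scheme described above can be illustrated with a small, library-free sketch. This is not H2O's implementation: the toy "model" simply predicts the mean of its training targets, the modulo fold split is one simple deterministic choice, and the names ``modulo_folds``, ``fit_mean`` and ``cross_validate`` are hypothetical helpers introduced only for this illustration.

```python
from statistics import mean


def modulo_folds(n_rows, nfolds):
    """Assign row i to fold i % nfolds (assumed split, for illustration)."""
    return [i % nfolds for i in range(n_rows)]


def fit_mean(targets):
    """Toy 'model': always predicts the mean of its training targets."""
    return mean(targets)


def cross_validate(y, nfolds):
    """Build nfolds cross-validation models plus one main model."""
    folds = modulo_folds(len(y), nfolds)
    holdout_pred = [None] * len(y)
    cv_models, fold_mse = [], []

    for k in range(nfolds):
        # Each CV model trains on every row that is NOT in fold k ...
        model = fit_mean([y[i] for i in range(len(y)) if folds[i] != k])
        cv_models.append(model)
        # ... and is scored on the disjoint holdout rows of fold k.
        holdout = [i for i in range(len(y)) if folds[i] == k]
        for i in holdout:
            holdout_pred[i] = model
        fold_mse.append(mean((y[i] - model) ** 2 for i in holdout))

    # The main model is trained on 100% of the training data.
    main_model = fit_mean(y)

    # Cross-validation metric computed from the combined holdout predictions.
    combined_mse = mean((yi - pi) ** 2 for yi, pi in zip(y, holdout_pred))

    return {
        "main_model": main_model,
        "cv_models": cv_models,
        "fold_mse": fold_mse,          # per-fold metrics (basis for mean/stddev)
        "combined_mse": combined_mse,  # metric on combined holdout predictions
    }


result = cross_validate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], nfolds=3)
n_models = len(result["cv_models"]) + 1  # nfolds CV models + the main model
```

Every row lands in exactly one holdout set, so the combined holdout predictions cover the whole training frame once, while the main model itself never appears in the cross-validation metrics.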