.. _parameters_H2OAutoML:

Parameters of H2OAutoML
-----------------------

Affected Classes
################

- ``ai.h2o.sparkling.ml.algos.H2OAutoML``
- ``ai.h2o.sparkling.ml.algos.classification.H2OAutoMLClassifier``
- ``ai.h2o.sparkling.ml.algos.regression.H2OAutoMLRegressor``

Parameters
##########

- *Each parameter also has a corresponding getter and setter method.*
  *(E.g.:* ``label`` *->* ``getLabel()`` *,* ``setLabel(...)`` *)*

blendingDataFrame
  This parameter is used for computing the predictions that serve as the training frame for the meta-learner. If provided, this triggers blending mode on the stacked ensemble training stage. Blending mode is faster than cross-validating the base learners (though these ensembles may not perform as well as the Super Learner ensemble). The parameter is not serializable!

  *Scala default value:* ``null`` *; Python default value:* ``None``

ignoredCols
  Names of columns to ignore for training.

  *Scala default value:* ``null`` *; Python default value:* ``None``

leaderboardDataFrame
  This parameter allows the user to specify a particular data frame to use to score and rank models on the leaderboard. This data frame will not be used for anything besides leaderboard scoring.

  *Scala default value:* ``null`` *; Python default value:* ``None``

monotoneConstraints
  Each key must correspond to a feature name and each value must be either 1 or -1.

  *Scala default value:* ``Map()`` *; Python default value:* ``{}``

balanceClasses
  Balance training data class counts via over/under-sampling (for imbalanced data).

  *Scala default value:* ``false`` *; Python default value:* ``False``

classSamplingFactors
  Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.


  *Scala default value:* ``null`` *; Python default value:* ``None``

columnsToCategorical
  List of columns to convert to categorical before modelling.

  *Scala default value:* ``Array()`` *; Python default value:* ``[]``

convertInvalidNumbersToNa
  If set to 'true', the model converts invalid numbers to NA when making predictions.

  *Scala default value:* ``false`` *; Python default value:* ``False``

convertUnknownCategoricalLevelsToNa
  If set to 'true', the model converts unknown categorical levels to NA when making predictions.

  *Scala default value:* ``false`` *; Python default value:* ``False``

customDistributionFunc
  Reference to custom distribution, format: ``language:keyName=funcName``.

  *Scala default value:* ``null`` *; Python default value:* ``None``

customMetricFunc
  Reference to custom evaluation function, format: ``language:keyName=funcName``.

  *Scala default value:* ``null`` *; Python default value:* ``None``

dataFrameSerializer
  The full name of the serializer used for serialization and deserialization of Spark DataFrames to a JSON value within NullableDataFrameParam.

  *Default value:* ``"ai.h2o.sparkling.utils.JSONDataFrameSerializer"``

detailedPredictionCol
  Column containing additional prediction details; its content depends on the model type.

  *Default value:* ``"detailed_prediction"``

distribution
  Distribution function used by algorithms that support it; other algorithms use their defaults. Possible values are ``"AUTO"``, ``"bernoulli"``, ``"quasibinomial"``, ``"modified_huber"``, ``"multinomial"``, ``"ordinal"``, ``"gaussian"``, ``"poisson"``, ``"gamma"``, ``"tweedie"``, ``"huber"``, ``"laplace"``, ``"quantile"``, ``"fractionalbinomial"``, ``"negativebinomial"``, ``"custom"``.

  *Default value:* ``"AUTO"``

excludeAlgos
  A list of algorithms to skip during the model-building phase. Possible values are ``"GLM"``, ``"DRF"``, ``"GBM"``, ``"DeepLearning"``, ``"StackedEnsemble"``, ``"XGBoost"``.


  *Scala default value:* ``null`` *; Python default value:* ``None``

exploitationRatio
  The budget ratio (between 0 and 1) dedicated to the exploitation (vs exploration) phase.

  *Default value:* ``-1.0``

exportCheckpointsDir
  Path to a directory where every generated model will be stored.

  *Scala default value:* ``null`` *; Python default value:* ``None``

featuresCols
  Names of feature columns.

  *Scala default value:* ``Array()`` *; Python default value:* ``[]``

foldCol
  Fold column (contains fold IDs) in the training frame. These assignments are used to create the folds for cross-validation of the models.

  *Scala default value:* ``null`` *; Python default value:* ``None``

huberAlpha
  Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).

  *Default value:* ``0.9``

includeAlgos
  A list of algorithms to restrict to during the model-building phase. Possible values are ``"GLM"``, ``"DRF"``, ``"GBM"``, ``"DeepLearning"``, ``"StackedEnsemble"``, ``"XGBoost"``.

  *Scala default value:* ``Array("GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost")`` *; Python default value:* ``["GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost"]``

keepBinaryModels
  If set to true, all binary models created during execution of the ``fit`` method will be kept in the DKV of the H2O-3 cluster.

  *Scala default value:* ``false`` *; Python default value:* ``False``

keepCrossValidationFoldAssignment
  Whether to keep cross-validation assignments.

  *Scala default value:* ``false`` *; Python default value:* ``False``

keepCrossValidationModels
  Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster.

  *Scala default value:* ``false`` *; Python default value:* ``False``

keepCrossValidationPredictions
  Whether to keep the cross-validation predictions.

  This needs to be set to TRUE if running the same AutoML object for repeated runs because CV predictions are required to build additional Stacked Ensemble models in AutoML.

  *Scala default value:* ``false`` *; Python default value:* ``False``

labelCol
  Response column.

  *Default value:* ``"label"``

maxAfterBalanceSize
  Maximum relative size of the training data after balancing class counts (defaults to 5.0 and can be less than 1.0). Requires balance_classes.

  *Scala default value:* ``5.0f`` *; Python default value:* ``5.0``

maxModels
  Maximum number of models to build (optional). Always set this parameter to ensure AutoML reproducibility: all models are then trained until convergence and none is constrained by a time budget.

  *Default value:* ``0``

maxRuntimeSecs
  This argument specifies the maximum time that the AutoML process will run for. If both max_runtime_secs and max_models are specified, then the AutoML run will stop as soon as it hits either of these limits. If neither max_runtime_secs nor max_models is specified, then max_runtime_secs defaults to 3600 seconds (1 hour).

  *Default value:* ``0.0``

maxRuntimeSecsPerModel
  Maximum time to spend on each individual model (optional). Note that models constrained by a time budget are not guaranteed to be reproducible.

  *Default value:* ``0.0``

nfolds
  Number of folds for k-fold cross-validation (defaults to -1 (AUTO), otherwise it must be >=2 or use 0 to disable). Disabling prevents Stacked Ensembles from being built.

  *Default value:* ``-1``

predictionCol
  Prediction column name.

  *Default value:* ``"prediction"``

projectName
  Optional project name used to group models from multiple AutoML runs into a single Leaderboard; derived from the training data name if not specified.

  *Scala default value:* ``null`` *; Python default value:* ``None``

quantileAlpha
  Desired quantile for Quantile regression, must be between 0 and 1.

  *Default value:* ``0.5``

seed
  Seed for random number generator; set to a value other than -1 for reproducibility.


  *Scala default value:* ``-1L`` *; Python default value:* ``-1``

sortMetric
  Metric used to sort leaderboard. Possible values are ``"AUTO"``, ``"deviance"``, ``"logloss"``, ``"MSE"``, ``"RMSE"``, ``"MAE"``, ``"RMSLE"``, ``"AUC"``, ``"mean_per_class_error"``.

  *Default value:* ``"AUTO"``

splitRatio
  Accepts values in the range [0, 1.0], which determine how large a part of the dataset is used for training and how large a part for validation. For example, 0.8 -> 80% training, 20% validation. This parameter is ignored when validationDataFrame is set.

  *Default value:* ``1.0``

stoppingMetric
  Metric to use for early stopping (AUTO: logloss for classification, deviance for regression). Possible values are ``"AUTO"``, ``"deviance"``, ``"logloss"``, ``"MSE"``, ``"RMSE"``, ``"MAE"``, ``"RMSLE"``, ``"AUC"``, ``"AUCPR"``, ``"lift_top_group"``, ``"misclassification"``, ``"mean_per_class_error"``, ``"anomaly_score"``, ``"AUUC"``, ``"ATE"``, ``"ATT"``, ``"ATC"``, ``"qini"``, ``"custom"``, ``"custom_increasing"``.

  *Default value:* ``"AUTO"``

stoppingRounds
  Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable).

  *Default value:* ``3``

stoppingTolerance
  Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much).

  *Default value:* ``-1.0``

tweediePower
  Tweedie power for Tweedie regression, must be between 1 and 2.

  *Default value:* ``1.5``

validationDataFrame
  A data frame dedicated to validation of the trained model. If this parameter is not set, a validation frame is created via the 'splitRatio' parameter. The parameter is not serializable!

  *Scala default value:* ``null`` *; Python default value:* ``None``

weightCol
  Weights column in the training frame, which specifies the row weights used in model training.


  *Scala default value:* ``null`` *; Python default value:* ``None``

withContributions
  Enables or disables generating a sub-column of detailedPredictionCol containing Shapley values of original features.

  *Scala default value:* ``false`` *; Python default value:* ``False``

withLeafNodeAssignments
  Enables or disables computation of leaf node assignments.

  *Scala default value:* ``false`` *; Python default value:* ``False``

withStageResults
  Enables or disables computation of stage results.

  *Scala default value:* ``false`` *; Python default value:* ``False``
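
The getter/setter naming convention and a few of the value ranges stated in the parameter list can be sketched in plain Python. The helpers ``accessor_names`` and ``check_params`` are hypothetical and are not part of Sparkling Water; the sketch covers only some of the constraints documented here and treats range boundaries as inclusive, which is an assumption.

```python
from typing import Any, Dict, List, Tuple


def accessor_names(param: str) -> Tuple[str, str]:
    """Return the (getter, setter) method names for a parameter name.

    Follows the convention stated in this list: label -> getLabel, setLabel.
    """
    capitalized = param[0].upper() + param[1:]
    return "get" + capitalized, "set" + capitalized


def check_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a dict of parameter values.

    Hypothetical helper: only a handful of the documented constraints are
    checked, and range boundaries are assumed to be inclusive.
    """
    problems = []

    # splitRatio accepts values in the range [0, 1.0] (default 1.0).
    if not 0.0 <= params.get("splitRatio", 1.0) <= 1.0:
        problems.append("splitRatio must be in the range [0, 1.0]")

    # nfolds is -1 (AUTO), 0 (disabled), or at least 2.
    nfolds = params.get("nfolds", -1)
    if nfolds not in (-1, 0) and nfolds < 2:
        problems.append("nfolds must be -1 (AUTO), 0 (disabled) or >= 2")

    # quantileAlpha and huberAlpha must lie between 0 and 1.
    for name, default in (("quantileAlpha", 0.5), ("huberAlpha", 0.9)):
        if not 0.0 <= params.get(name, default) <= 1.0:
            problems.append(name + " must be between 0 and 1")

    # monotoneConstraints maps feature names to 1 or -1.
    for feature, direction in params.get("monotoneConstraints", {}).items():
        if direction not in (1, -1):
            problems.append("monotoneConstraints[" + feature + "] must be 1 or -1")

    # classSamplingFactors requires balanceClasses to be enabled.
    if params.get("classSamplingFactors") is not None and not params.get("balanceClasses", False):
        problems.append("classSamplingFactors requires balanceClasses")

    return problems


settings = {"splitRatio": 0.8, "nfolds": 0, "monotoneConstraints": {"age": 1}}
print(accessor_names("splitRatio"))  # ('getSplitRatio', 'setSplitRatio')
print(check_params(settings))        # []
```

A check of this kind runs before any cluster is started, so an out-of-range value is reported without waiting for ``fit`` to fail.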