Create a synthetic H2O Frame with random data. You can specify the number of rows/columns, as well as column types: integer, real, boolean, time, string, categorical. The frame may also have a dedicated “response” column, and some of the entries in the dataset may be created as missing.
POST /3/ModelMetrics/models/{model}/frames/{frame}
Return the scoring metrics for the specified Frame with the specified Model. If the Frame has already been scored with the Model then cached results will be returned; otherwise predictions for all rows in the Frame will be generated and the metrics will be returned.
POST /3/ModelMetrics/predictions_frame/{predictions_frame}/actuals_frame/{actuals_frame}
Create a ModelMetrics object from the predicted and actual values, and a domain for classification problems or a distribution family for regression problems.
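The two ModelMetrics routes above take their identifiers as path segments. A minimal sketch of building those paths in Python follows; the helper names are hypothetical, and percent-encoding the identifiers is an assumption about what a client should do with keys that contain special characters.

```python
from urllib.parse import quote


def metrics_path(model: str, frame: str) -> str:
    """Path for scoring `frame` with `model` (intended for a POST request)."""
    return (
        f"/3/ModelMetrics/models/{quote(model, safe='')}"
        f"/frames/{quote(frame, safe='')}"
    )


def metrics_from_frames_path(predictions_frame: str, actuals_frame: str) -> str:
    """Path for building metrics from a predictions frame and an actuals frame."""
    return (
        f"/3/ModelMetrics/predictions_frame/{quote(predictions_frame, safe='')}"
        f"/actuals_frame/{quote(actuals_frame, safe='')}"
    )
```

For example, `metrics_path("gbm_1", "test.hex")` yields `/3/ModelMetrics/models/gbm_1/frames/test.hex`; both identifiers here are made up for illustration.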
Return the model in the MOJO format. This format can then be interpreted by gen_model.jar in order to perform prediction / scoring. Currently works for GBM and DRF algos only.
Create a frame with random (uniformly distributed) data. You can specify how many columns of each type to make and the desired range for each column type.
ANOVA table frame key containing the Type III SS calculation, degrees of freedom, F-statistics and p-values. The content of this frame is repeated in the model summary.
Seed for pseudo random number generator (if applicable)
In
standardize boolean
Standardize numeric columns to have zero mean and unit variance
In
family enum
Family. Use binomial for classification with logistic regression, others are for regression problems.
In
tweedie_variance_power double
Tweedie variance power
In
tweedie_link_power double
Tweedie link power
In
theta double
Theta
In
alpha double[]
Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.
In
lambda double[]
Regularization strength
In
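To make the roles of alpha and lambda concrete, the sketch below computes the standard elastic net penalty, lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2). This is the textbook form; any additional scaling applied by the solver is not stated in this reference, and the function name is hypothetical.

```python
def elastic_net_penalty(coefficients, alpha, lambda_):
    """Elastic net penalty for a coefficient vector (intercept excluded).

    alpha = 1 gives a pure L1 (Lasso) penalty, alpha = 0 a pure L2 (Ridge)
    penalty, and values in between mix the two.
    """
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lambda_ * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)
```

With coefficients `[1.0, -2.0]` and `lambda_ = 1.0`, an alpha of 0.5 weighs the L1 norm (3) and half the squared L2 norm (2.5) equally.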
lambda_search boolean
Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min.
In
solver enum
AUTO will set the solver based on the given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda search with L1 penalty; L_BFGS scales better for datasets with many columns.
Plug Values (a single-row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues)
In
non_negative boolean
Restrict coefficients (not intercept) to be non-negative
In
compute_p_values boolean
Request p-values computation. P-values work only with the IRLSM solver and no regularization.
In
max_iterations int
Maximum number of iterations
In
link enum
Link function.
In
prior double
Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.
In
highest_interaction_term int
Limit the order of interaction terms: 2 means interactions between 2 columns only, 3 means interactions between up to three columns, and so on. Defaults to 2.
In
type int
Refers to the SS type: 1, 2, 3, or 4. Only type 3 is currently supported.
In
early_stopping boolean
Stop early when there is no more relative improvement on train or validation (if provided).
In
save_transformed_framekeys boolean
true to save the keys of transformed predictors and interaction column.
In
nparallelism int
Number of models to build in parallel. Default to 4. Adjust according to your system.
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
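The huber_alpha description mentions a threshold between quadratic and linear loss. The sketch below shows the standard Huber loss and one plausible way to derive the threshold from huber_alpha, as a quantile of the absolute residuals. The quantile rule (a simple nearest-rank pick) and both function names are assumptions for illustration, not a statement of how the library computes it.

```python
def huber_loss(residual, delta):
    """Standard Huber loss: quadratic up to `delta`, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def huber_delta(residuals, huber_alpha):
    """Pick the threshold as the huber_alpha-quantile of absolute residuals.

    Assumption: a simple nearest-rank quantile; other interpolation rules
    would give slightly different thresholds.
    """
    if not 0.0 <= huber_alpha <= 1.0:
        raise ValueError("huber_alpha must be between 0 and 1")
    ordered = sorted(abs(r) for r in residuals)
    index = min(len(ordered) - 1, int(huber_alpha * len(ordered)))
    return ordered[index]
```

A larger huber_alpha pushes the threshold outward, so more residuals are treated quadratically and fewer are treated as outliers.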
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
missing_values_handling enum
Handling of missing values. Either MeanImputation, Skip or PlugValues.
In/Out
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
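The equivalence between a weight of 2 and a repeated row, and between a weight of 0 and an excluded row, can be checked on a simple weighted statistic. The sketch below uses a weighted mean as a stand-in for a weighted loss; the function is hypothetical and not part of the API.

```python
def weighted_mean(values, weights):
    """Weighted mean of `values`; negative weights are rejected."""
    if any(w < 0 for w in weights):
        raise ValueError("negative weights are not allowed")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total


# A weight of 2 on the first row matches repeating that row twice.
doubled = weighted_mean([1.0, 4.0], [2, 1])
repeated = weighted_mean([1.0, 1.0, 4.0], [1, 1, 1])

# A weight of 0 matches dropping the row entirely.
zeroed = weighted_mean([1.0, 100.0], [1, 0])
```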
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
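The interaction of stopping_rounds, stopping_metric and stopping_tolerance can be sketched as a moving-average check over the scoring history. The version below is a simplified illustration under stated assumptions: it compares the best of the last k moving averages against the moving average that ended k scoring events earlier, and it is not guaranteed to match the server's exact rule.

```python
def should_stop(history, stopping_rounds, stopping_tolerance, lower_is_better=True):
    """Return True if the metric has stopped improving.

    history: metric values, one per scoring event, oldest first.
    stopping_rounds: k, the moving-average length (0 disables the check).
    stopping_tolerance: minimum relative improvement required to continue.
    """
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    # Flip the sign so that "smaller is better" holds in both cases.
    values = history if lower_is_better else [-v for v in history]
    averages = [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]
    reference = averages[-(k + 1)]    # moving average from k events ago
    best_recent = min(averages[-k:])  # best of the last k moving averages
    if reference == 0:
        return best_recent >= reference
    improvement = (reference - best_recent) / abs(reference)
    return improvement < stopping_tolerance
```

For a logloss-style history such as `[1.0, 0.5, 0.5, 0.5, 0.5]` with `stopping_rounds=2`, the recent averages no longer beat the reference, so the check reports that training should stop.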
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
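The `language:keyName=funcName` format used by custom_metric_func and custom_distribution_func can be split into its three parts as sketched below. The parser and the `FuncRef` type are hypothetical helpers for client-side validation; how the server parses the string is not described here.

```python
from typing import NamedTuple


class FuncRef(NamedTuple):
    language: str
    key_name: str
    func_name: str


def parse_func_ref(reference: str) -> FuncRef:
    """Split 'language:keyName=funcName' into its three components."""
    language, colon, rest = reference.partition(":")
    key_name, equals, func_name = rest.partition("=")
    if not (colon and equals and language and key_name and func_name):
        raise ValueError(f"expected language:keyName=funcName, got {reference!r}")
    return FuncRef(language, key_name, func_name)
```

For instance, the made-up reference `python:my_key=metrics.MyMetric` splits into language `python`, key `my_key` and function `metrics.MyMetric`.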
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Choose a weak learner type. Defaults to AUTO, which means DRF.
In
learn_rate double
Learning rate (from 0.0 to 1.0)
In
weak_learner_params string
Customized parameters for the weak_learner algorithm.
In
seed long
Seed for pseudo random number generator (if applicable)
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Method for computing PCA (Caution: GLRM is currently experimental and unstable)
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
k int
Rank of matrix approximation
In/Out
max_iterations int
Maximum number of iterations for PCA
In/Out
target_num_exemplars int
Targeted number of exemplars
In/Out
rel_tol_num_exemplars double
Relative tolerance for number of exemplars (e.g., 0.5 is +/- 50 percent)
In/Out
seed long
RNG seed for initialization
In/Out
use_all_factor_levels boolean
Whether first factor level is included in each categorical expansion
In/Out
save_mapping_frame boolean
Whether to export the mapping of the aggregated frame
In/Out
num_iteration_without_new_exemplar int
The number of iterations to run before aggregator exits if the number of exemplars collected didn’t change
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Model performance based stopping criteria for the AutoML run.
In
nfolds int
Number of folds for k-fold cross-validation (defaults to -1 (AUTO), otherwise it must be >=2 or use 0 to disable). Disabling prevents Stacked Ensembles from being built.
In
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (defaults to 5.0 and can be less than 1.0). Requires balance_classes.
In
keep_cross_validation_predictions boolean
Whether to keep the cross-validation predictions. This needs to be set to TRUE if running the same AutoML object for repeated runs because CV predictions are required to build additional Stacked Ensemble models in AutoML.
In
keep_cross_validation_models boolean
Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster.
In
keep_cross_validation_fold_assignment boolean
Whether to keep cross-validation assignments.
In
export_checkpoints_dir string
Path to a directory where every generated model will be stored.
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
project_name string
Optional project name used to group models from multiple AutoML runs into a single Leaderboard; derived from the training data name if not specified.
In/Out
distribution enum
Distribution function used by algorithms that support it; other algorithms use their defaults.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
AutoMLBuildModelsV99
exclude_algos enum[]
A list of algorithms to skip during the model-building phase.
In
include_algos enum[]
A list of algorithms to restrict to during the model-building phase.
In
exploitation_ratio double
The budget ratio (between 0 and 1) dedicated to the exploitation (vs exploration) phase.
ID of the H2OFrame used to train the metalearning algorithm in Stacked Ensembles (instead of relying on cross-validated predicted values). When provided, it is also recommended to disable cross-validation by setting nfolds=0 and to provide a leaderboard frame for scoring purposes.
Weights column in the training frame, which specifies the row weights used in model training.
In
ignored_columns string[]
Names of columns to ignore in the training frame when building models.
In
sort_metric enum
Metric used to sort leaderboard
In
AutoMLKeyV3
name string
Name (string representation) for this Key.
In/Out
type string
Name (string representation) for the type of Keyed this Key points to.
In/Out
URL string
URL for the resource that this Key points to, if one exists.
In/Out
AutoMLStoppingCriteriaV99
seed long
Seed for random number generator; set to a value other than -1 for reproducibility.
In
max_models int
Maximum number of models to build (optional). Always set this parameter to ensure AutoML reproducibility: all models are then trained until convergence and none is constrained by a time budget.
In
max_runtime_secs double
This argument specifies the maximum time that the AutoML process will run for. If both max_runtime_secs and max_models are specified, then the AutoML run will stop as soon as it hits either of these limits. If neither max_runtime_secs nor max_models are specified, then max_runtime_secs defaults to 3600 seconds (1 hour).
In
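The way max_runtime_secs and max_models combine can be sketched as a small decision function. Using `None` to mean "not specified" is an assumption made for readability (the API may use a different sentinel), and the function names are hypothetical.

```python
DEFAULT_MAX_RUNTIME_SECS = 3600.0  # applies only when neither limit is given


def effective_runtime_limit(max_runtime_secs=None, max_models=None):
    """Resolve the time budget, applying the one-hour default if needed."""
    if max_runtime_secs is None and max_models is None:
        return DEFAULT_MAX_RUNTIME_SECS
    return max_runtime_secs


def should_stop_automl(elapsed_secs, models_built,
                       max_runtime_secs=None, max_models=None):
    """Stop as soon as either the time limit or the model limit is hit."""
    runtime_limit = effective_runtime_limit(max_runtime_secs, max_models)
    out_of_time = runtime_limit is not None and elapsed_secs >= runtime_limit
    out_of_models = max_models is not None and models_built >= max_models
    return out_of_time or out_of_models
```

Note that when only max_models is given, this sketch applies no time limit at all, which is consistent with the reproducibility advice on max_models.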
max_runtime_secs_per_model double
Maximum time to spend on each individual model (optional). Note that models constrained by a time budget are not guaranteed reproducible.
In
stopping_rounds int
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression)
In
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
k int
The max. number of clusters. If estimate_k is disabled, the model will find k centroids, otherwise it will find up to k centroids.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
In/Out
auc_type enum
Set default multinomial AUC type.
In/Out
ColSpecifierV3
column_name string
Name of the column
In/Out
is_member_of_frames string[]
List of fields which specify columns that must contain this column
In/Out
ColV3
label string
label
Out
missing_count long
missing
Out
zero_count long
zeros
Out
positive_infinity_count long
positive infinities
Out
negative_infinity_count long
negative infinities
Out
mins double[]
mins
Out
maxs double[]
maxs
Out
mean double
mean
Out
sigma double
sigma
Out
type string
datatype: {enum, string, int, real, time, uuid}
Out
domain string[]
domain; not-null for categorical columns only
Out
domain_cardinality int
cardinality of this column’s domain; not-null for categorical columns only
Out
data double[]
data
Out
string_data string[]
string data
Out
precision byte
decimal precision, -1 for all digits
Out
histogram_bins long[]
Histogram bins; null if not computed
Out
histogram_base double
Start of histogram bin zero
Out
histogram_stride double
Stride per bin
Out
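The three histogram fields can be combined to recover the value range of each bin. The sketch below assumes that bin i covers the half-open interval starting at histogram_base + i * histogram_stride; that interval convention and the function name are assumptions, since the reference only gives the base and the stride.

```python
def histogram_bin_ranges(histogram_bins, histogram_base, histogram_stride):
    """Return (lower, upper, count) for every bin of a column histogram."""
    return [
        (
            histogram_base + i * histogram_stride,
            histogram_base + (i + 1) * histogram_stride,
            count,
        )
        for i, count in enumerate(histogram_bins)
    ]
```

With counts `[3, 0, 5]`, a base of 10.0 and a stride of 2.5, the bins span 10.0 to 12.5, 12.5 to 15.0 and 15.0 to 17.5.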
percentiles double[]
Percentile values, matching the default percentiles
Out
ColumnSpecsBase
name string
Column Name
Out
type string
Column Type
Out
format string
Column Format (printf)
Out
description string
Column Description
Out
ColumnsMappingV3
from string[]
Input column(s) from the same encoding group.
In
to string[]
Output column(s) generated by the application of target encoding to the from group.
A list of pairwise (first order) column interactions.
In
use_all_factor_levels boolean
(Internal. For development only!) Indicates whether to use all factor levels.
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Number of data columns (in addition to the first response column)
In
seed long
Random number seed that determines the random values
In
randomize boolean
Whether frame should be randomized
In
value long
Constant value (for randomize=false)
In
real_range double
Range for real variables (-range … range)
In
categorical_fraction double
Fraction of categorical columns (for randomize=true)
In
factors int
Factor levels for categorical variables
In
integer_fraction double
Fraction of integer columns (for randomize=true)
In
integer_range int
Range for integer variables (-range … range)
In
binary_fraction double
Fraction of binary columns (for randomize=true)
In
binary_ones_fraction double
Fraction of 1’s in binary columns
In
time_fraction double
Fraction of date/time columns (for randomize=true)
In
string_fraction double
Fraction of string columns (for randomize=true)
In
missing_fraction double
Fraction of missing values
In
has_response boolean
Whether an additional response column should be generated
In
response_factors int
Number of factor levels of the first column (1=real, 2=binomial, N=multinomial)
In
positive_response boolean
For real-valued response variable: Whether the response should be positive only.
In
_fields string
Filter on the set of output fields: if you set _fields="foo,bar,baz", then only those fields will be included in the output; or you can specify _fields="-goo,gee" to include all fields except goo and gee. If the result contains nested data structures, then you can refer to the fields within those structures as well. For example if you specify _fields="foo(oof),bar(-rab)", then only fields foo and bar will be included, and within foo there will be only field oof, whereas within bar all fields except rab will be reported.
Random number seed that determines the random values.
In
nrows int
Number of rows.
In
ncols_real int
Number of real-valued columns. Values in these columns will be uniformly distributed between real_lb and real_ub.
In
ncols_int int
Number of integer columns.
In
ncols_enum int
Number of enum (categorical) columns.
In
ncols_bool int
Number of boolean (binary) columns.
In
ncols_str int
Number of string columns.
In
ncols_time int
Number of time columns.
In
real_lb double
Lower bound for the range of the real-valued columns.
In
real_ub double
Upper bound for the range of the real-valued columns.
In
int_lb int
Lower bound for the range of integer columns.
In
int_ub int
Upper bound for the range of integer columns.
In
enum_nlevels int
Number of levels (categories) for the enum columns.
In
bool_p double
Fraction of ones in each boolean (binary) column.
In
time_lb long
Lower bound for the range of time columns (in ms since the epoch).
In
time_ub long
Upper bound for the range of time columns (in ms since the epoch).
In
str_length int
Length of generated strings in string columns.
In
missing_fraction double
Fraction of missing values.
In
response_type enum
Type of the response column to add.
In
response_lb double
Lower bound for the response variable (real/int/time types).
In
response_ub double
Upper bound for the response variable (real/int/time types).
In
response_p double
Frequency of 1s for the bool (binary) response column.
In
response_nlevels int
Number of categorical levels for the enum response column.
In
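The parameters above can be assembled into a request body on the client side. The sketch below builds a form-encoded string using only parameter names listed in this section; the default values are illustrative, the helper is hypothetical, and sending the parameters form-encoded is an assumption about the transport.

```python
from urllib.parse import urlencode


def create_frame_params(**overrides):
    """Return a form-encoded parameter string for a random-frame request."""
    params = {
        "nrows": 100,
        "ncols_real": 2,
        "ncols_int": 1,
        "ncols_enum": 1,
        "real_lb": -1.0,
        "real_ub": 1.0,
        "int_lb": 0,
        "int_ub": 10,
        "enum_nlevels": 3,
        "missing_fraction": 0.05,
    }
    params.update(overrides)
    # Basic sanity checks before anything is sent to the server.
    if params["real_lb"] > params["real_ub"] or params["int_lb"] > params["int_ub"]:
        raise ValueError("lower bound exceeds upper bound")
    if not 0.0 <= params["missing_fraction"] <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    return urlencode(params)
```

Calling `create_frame_params(nrows=10)` keeps the illustrative defaults and overrides only the row count.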
_fields string
Filter on the set of output fields: if you set _fields="foo,bar,baz", then only those fields will be included in the output; or you can specify _fields="-goo,gee" to include all fields except goo and gee. If the result contains nested data structures, then you can refer to the fields within those structures as well. For example if you specify _fields="foo(oof),bar(-rab)", then only fields foo and bar will be included, and within foo there will be only field oof, whereas within bar all fields except rab will be reported.
In
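The include and exclude forms of `_fields` can be mimicked on a plain dictionary. The sketch below handles only the flat forms (a list of names, or a list prefixed with a minus sign); the nested `foo(oof)` syntax is deliberately left out, and the function is a hypothetical client-side illustration rather than the server's implementation.

```python
def filter_fields(record: dict, fields_spec: str) -> dict:
    """Apply a flat _fields-style filter to a dictionary.

    "foo,bar"   keeps only foo and bar.
    "-goo,gee"  keeps everything except goo and gee.
    """
    spec = fields_spec.strip()
    if not spec:
        return dict(record)
    exclude = spec.startswith("-")
    names = {name.strip() for name in spec.lstrip("-").split(",") if name.strip()}
    if exclude:
        return {k: v for k, v in record.items() if k not in names}
    return {k: v for k, v in record.items() if k in names}
```

Applied to a record with keys foo, bar, goo and gee, both `"foo,bar"` and `"-goo,gee"` return the same two entries.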
CreateFrameV3
_exclude_fields string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”
Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors).
In
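The default rule for the number of candidate variables per split can be written out directly. In the sketch below, rounding down and enforcing a minimum of one variable are assumptions, since the description gives only sqrt(p) and p/3; the function name is hypothetical.

```python
import math


def default_candidates_per_split(num_predictors: int, classification: bool) -> int:
    """Default number of variables sampled at each split when -1 is given."""
    if num_predictors < 1:
        raise ValueError("need at least one predictor")
    if classification:
        return max(1, math.isqrt(num_predictors))  # floor(sqrt(p))
    return max(1, num_predictors // 3)             # floor(p / 3)
```

With 16 predictors this gives 4 candidates for classification and 5 for regression.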
binomial_double_trees boolean
For binary classification: Build 2x as many trees (one per class) - can lead to higher accuracy.
In
sample_rate double
Row sample rate per tree (from 0.0 to 1.0)
In
ntrees int
Number of trees.
In
max_depth int
Maximum tree depth (0 for unlimited).
In
min_rows double
Fewest allowed (weighted) observations in a leaf.
In
nbins int
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point
In
nbins_top_level int
For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level
In
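Read together, nbins ("at least") and nbins_top_level ("at most", halved per level) suggest a bin count that starts high at the root and shrinks with depth until it reaches nbins. The sketch below encodes that reading; treating nbins as a floor is an interpretation of the two descriptions, and the default values of 20 and 1024 are illustrative only.

```python
def bins_at_depth(depth: int, nbins: int = 20, nbins_top_level: int = 1024) -> int:
    """Histogram bin count for numerical columns at a given tree depth."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    halved = nbins_top_level >> depth  # halve once per level below the root
    return max(nbins, halved)
```

Under these illustrative values the root uses 1024 bins, depth 3 uses 128, and from depth 6 onward the count stays at 20.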
nbins_cats int
For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
In
r2_stopping double
r2_stopping is no longer supported and will be ignored if set. Please use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop making trees when the R^2 metric equaled or exceeded this value.
In
seed long
Seed for pseudo random number generator (if applicable)
In
build_tree_one_node boolean
Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets.
In
sample_rate_per_class double[]
A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree
In
col_sample_rate_per_tree double
Column sample rate per tree (from 0.0 to 1.0)
In
col_sample_rate_change_per_level double
Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)
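A plausible reading of "relative change for every level" is a multiplicative factor applied once per tree level. The sketch below assumes exactly that, with the result capped at 1.0; the function name and the cap are assumptions for illustration.

```python
def col_sample_rate_at_depth(base_rate, change_per_level, depth):
    """Column sampling rate at a given depth (root = 0).

    Assumption: rate = base_rate * change_per_level ** depth, capped at 1.0.
    """
    if not (0.0 < change_per_level <= 2.0):
        raise ValueError("change_per_level must be > 0.0 and <= 2.0")
    return min(1.0, base_rate * change_per_level ** depth)


for depth in range(4):
    print(depth, col_sample_rate_at_depth(0.8, 0.5, depth))
```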
In
score_tree_interval int
Score the model after every so many trees. Disabled if set to 0.
In
min_split_improvement double
Minimum relative improvement in squared error reduction for a split to happen
In
histogram_type enum
What type of histogram to use for finding optimal split points
In
calibrate_model boolean
Use Platt Scaling (default) or Isotonic Regression to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.
In
in_training_checkpoints_dir string
Create checkpoints into defined directory while training process is still running. In case of cluster shutdown, this checkpoint can be used to restart training.
In
in_training_checkpoints_tree_interval int
Checkpoint the model after every so many trees. Parameter is used only when in_training_checkpoints_dir is defined
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Check if response column is constant. If enabled, then an exception is thrown if the response column is a constant value. If disabled, then model will train regardless of the response column being a constant value or not.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
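The equivalence between weights and repeated rows can be checked on a simple statistic. The sketch below uses a weighted mean as a stand-in for a training loss; the data values are made up for illustration.

```python
def weighted_mean(values, weights):
    """Mean of values where each row counts in proportion to its weight."""
    if any(w < 0 for w in weights):
        raise ValueError("negative weights are not allowed")
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


values = [1.0, 4.0, 10.0]
weights = [1, 2, 0]  # weight 2 repeats the row; weight 0 excludes it

weighted = weighted_mean(values, weights)

# The same data with the second row physically repeated and the third removed.
expanded = [1.0, 4.0, 4.0]
plain = sum(expanded) / len(expanded)

print(weighted, plain)
```

Both computations give the same result, while the weighted version leaves the frame size unchanged.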
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
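The stopping rule built from stopping_rounds, stopping_metric and stopping_tolerance can be sketched as below. This is a simplified interpretation for a non-negative metric where lower is better (such as logloss), not H2O's actual implementation; the function names and the requirement of at least 2k scoring events are assumptions.

```python
def moving_averages(scores, k):
    """Simple moving averages of length k over the scoring history."""
    return [sum(scores[i - k + 1:i + 1]) / k for i in range(k - 1, len(scores))]


def should_stop(scores, stopping_rounds, stopping_tolerance=0.0):
    """Return True if the moving average did not improve enough.

    Assumptions: lower is better, the metric is non-negative, and at least
    2 * stopping_rounds scoring events are needed before stopping can trigger.
    """
    k = stopping_rounds
    if k <= 0 or len(scores) < 2 * k:
        return False  # 0 disables early stopping
    averages = moving_averages(scores, k)
    reference = averages[-k - 1]      # last average before the recent k events
    best_recent = min(averages[-k:])  # best average among the recent k events
    # Stop when the relative improvement is smaller than the tolerance.
    return best_recent >= reference * (1.0 - stopping_tolerance)


flat = [1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
print(should_stop(flat, stopping_rounds=2, stopping_tolerance=0.001))
print(should_stop(improving, stopping_rounds=2, stopping_tolerance=0.001))
```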
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
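The reference format language:keyName=funcName used by custom_metric_func and custom_distribution_func can be split into its three parts as sketched below. The parser and the example reference string are hypothetical; only the format itself comes from the descriptions above.

```python
def parse_func_reference(reference):
    """Split 'language:keyName=funcName' into its three components."""
    language, colon, rest = reference.partition(":")
    key_name, equals, func_name = rest.partition("=")
    if not (colon and equals and language and key_name and func_name):
        raise ValueError(f"expected language:keyName=funcName, got {reference!r}")
    return {"language": language, "key_name": key_name, "func_name": func_name}


# Hypothetical reference string, for illustration only.
print(parse_func_reference("python:my_metric_key=MyMetric"))
```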
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs.
In/Out
activation enum
Activation function.
In/Out
hidden int[]
Hidden layer sizes (e.g. [100, 100]).
In/Out
epochs double
How many times the dataset should be iterated (streamed), can be fractional.
In/Out
train_samples_per_iteration long
Number of training samples (globally) per MapReduce iteration. Special values are 0: one epoch, -1: all available data (e.g., replicated training data), -2: automatic.
In/Out
target_ratio_comm_to_comp double
Target ratio of communication overhead to computation. Only for multi-node operation and train_samples_per_iteration = -2 (auto-tuning).
In/Out
seed long
Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded.
In/Out
adaptive_rate boolean
Adaptive learning rate.
In/Out
rho double
Adaptive learning rate time decay factor (similarity to prior updates).
In/Out
epsilon double
Adaptive learning rate smoothing factor (to avoid divisions by zero and allow progress).
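To show how rho and epsilon act together, the sketch below implements a textbook ADADELTA update on a one-dimensional problem. H2O documents its adaptive rate as ADADELTA, but this is not H2O's code; the function name and the values rho=0.99 and epsilon=1e-8 are chosen for illustration.

```python
import math


def adadelta_minimize(grad, x0, steps, rho=0.99, epsilon=1e-8):
    """Minimize a 1-D function with textbook ADADELTA, given its gradient."""
    x = x0
    avg_sq_grad = 0.0  # decayed average of squared gradients
    avg_sq_step = 0.0  # decayed average of squared updates
    for _ in range(steps):
        g = grad(x)
        # rho controls how much of the previous history is retained.
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # epsilon avoids division by zero and lets the first step make progress.
        step = -math.sqrt(avg_sq_step + epsilon) / math.sqrt(avg_sq_grad + epsilon) * g
        avg_sq_step = rho * avg_sq_step + (1.0 - rho) * step * step
        x += step
    return x


# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 1.
x = adadelta_minimize(lambda value: 2.0 * value, x0=1.0, steps=100)
print(x)
```

No learning rate is supplied: the step size comes from the ratio of the two running averages.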
In/Out
rate double
Learning rate (higher => less stable, lower => slower convergence).
A list of H2OFrame ids to initialize the bias vectors of this model with.
In/Out
loss enum
Loss function.
In/Out
score_interval double
Shortest time interval (in seconds) between model scoring.
In/Out
score_training_samples long
Number of training set samples for scoring (0 for all).
In/Out
score_validation_samples long
Number of validation set samples for scoring (0 for all).
In/Out
score_duty_cycle double
Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring).
In/Out
classification_stop double
Stopping criterion for classification error fraction on training data (-1 to disable).
In/Out
regression_stop double
Stopping criterion for regression error (MSE) on training data (-1 to disable).
In/Out
quiet_mode boolean
Enable quiet mode for less output to standard output.
In/Out
score_validation_sampling enum
Method used to sample validation dataset for scoring.
In/Out
overwrite_with_best_model boolean
If enabled, override the final model with the best model found during training.
In/Out
autoencoder boolean
Auto-Encoder.
In/Out
use_all_factor_levels boolean
Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.
In/Out
standardize boolean
If enabled, automatically standardize the data. If disabled, the user must provide properly scaled input data.
In/Out
diagnostics boolean
Enable diagnostics for hidden layers.
In/Out
variable_importances boolean
Compute variable importances for input features (Gedeon method) - can be slow for large networks.
In/Out
fast_mode boolean
Enable fast mode (minor approximation in back-propagation).
In/Out
force_load_balance boolean
Force extra load balancing to increase training speed for small datasets (to keep all cores busy).
In/Out
replicate_training_data boolean
Replicate the entire training dataset onto every node for faster training on small datasets.
In/Out
single_node_mode boolean
Run on a single node for fine-tuning of model parameters.
In/Out
shuffle_training_data boolean
Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is close to #nodes x #rows, or if using balance_classes).
In/Out
missing_values_handling enum
Handling of missing values. Either MeanImputation or Skip.
In/Out
sparse boolean
Sparse data handling (more efficient for data with lots of 0 values).
In/Out
col_major boolean
#DEPRECATED Use a column major weight matrix for input layer. Can speed up forward propagation, but might slow down backpropagation.
In/Out
average_activation double
Average activation for sparse auto-encoder. #Experimental
In/Out
sparsity_beta double
Sparsity regularization. #Experimental
In/Out
max_categorical_features int
Max. number of categorical features, enforced via hashing. #Experimental
In/Out
reproducible boolean
Force reproducibility on small data (will be slow - only uses 1 thread).
In/Out
export_weights_and_biases boolean
Whether to export Neural Network weights and biases to H2O Frames.
In/Out
mini_batch_size int
Mini-batch size (smaller leads to better fit, larger can speed up and generalize better).
In/Out
elastic_averaging boolean
Elastic averaging between compute nodes can improve distributed model convergence. #Experimental
In/Out
elastic_averaging_moving_rate double
Elastic averaging moving rate (only if elastic averaging is enabled).
In/Out
elastic_averaging_regularization double
Elastic averaging regularization strength (only if elastic averaging is enabled).
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Number of randomly sampled observations used to train each Extended Isolation Forest tree.
In
extension_level int
Maximum is N - 1 (N = numCols). Minimum is 0. Extended Isolation Forest with extension_level = 0 behaves like Isolation Forest.
In
seed long
Seed for pseudo random number generator (if applicable)
In
score_tree_interval int
Score the model after every so many trees. Disabled if set to 0.
In
disable_training_metrics boolean
Disable calculating training metrics (expensive on large datasets)
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Schema name for this field, if it is_schema, or the name of the enum, if it’s an enum.
In
name string
Field name in the Schema
Out
type string
Type for this field
Out
is_schema boolean
Type for this field is itself a Schema.
Out
value Polymorphic
Value for this field
Out
help string
A short help description to appear alongside the field in a UI
Out
label string
The label that should be displayed for the field if the name is insufficient
Out
required boolean
Is this field required, or is the default value generally sufficient?
Out
level enum
How important is this field? The web UI uses the level to do a slow reveal of the parameters
Out
direction enum
Is this field an input, output or inout?
Out
is_inherited boolean
Is the field inherited from the parent schema?
Out
inherited_from string
If this field is inherited from a class higher in the hierarchy, which one?
Out
is_gridable boolean
Is the field gridable (i.e., it can be used in grid call)
Out
values string[]
For enum-type fields the allowed values are specified using the values annotation; this is used in UIs to tell the user the allowed values, and for validation
Out
json boolean
Should this field be rendered in the JSON representation?
Out
is_member_of_frames string[]
For Vec-type fields this is the set of Frame-type fields which must contain the named column; for example, for a SupervisedModel the response_column must be in both the training_frame and (if it’s set) the validation_frame
Out
is_mutually_exclusive_with string[]
For Vec-type fields this is the set of other Vec-type fields which must contain mutually exclusive values; for example, for a SupervisedModel the response_column must be mutually exclusive with the weights_column
Seed for pseudo random number generator (if applicable)
In
family enum
Family. Use binomial for classification with logistic regression, others are for regression problems.
In
tweedie_variance_power double
Tweedie variance power
In
tweedie_link_power double
Tweedie link power
In
theta double
Theta
In
solver enum
AUTO will set the solver based on given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda-search with L1 penalty; L_BFGS scales better for datasets with many columns.
In
alpha double[]
Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.
In
lambda double[]
Regularization strength
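The roles of alpha and lambda can be illustrated with the standard elastic-net penalty, lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2). The sketch below assumes this common form; the function name and the coefficient values are made up for the example.

```python
def elastic_net_penalty(coefficients, alpha, lam):
    """Elastic-net penalty: lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2).

    Assumption: the conventional elastic-net form; the intercept is excluded.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


coefficients = [1.0, -2.0]
print(elastic_net_penalty(coefficients, alpha=1.0, lam=0.5))  # pure Lasso
print(elastic_net_penalty(coefficients, alpha=0.0, lam=0.5))  # pure Ridge
print(elastic_net_penalty(coefficients, alpha=0.5, lam=0.5))  # mixture
```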
In
startval double[]
double array to initialize coefficients for GAM.
In
lambda_search boolean
Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min.
In
early_stopping boolean
Stop early when there is no more relative improvement on train or validation (if provided)
In
nlambdas int
Number of lambdas to be used in a search. Default indicates: If alpha is zero, with lambda search set to True, the value of nlambdas is set to 30 (fewer lambdas are needed for ridge regression); otherwise it is set to 100.
In
standardize boolean
Standardize numeric columns to have zero mean and unit variance
Plug Values (a single row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues)
In
non_negative boolean
Restrict coefficients (not intercept) to be non-negative
In
max_iterations int
Maximum number of iterations
In
beta_epsilon double
Converge if beta changes less (using L-infinity norm) than beta_epsilon. ONLY applies to IRLSM solver
In
objective_epsilon double
Converge if objective value changes less than this. Default indicates: If lambda_search is set to True the value of objective_epsilon is set to .0001. If the lambda_search is set to False and lambda is equal to zero, the value of objective_epsilon is set to .000001, for any other value of lambda the default value of objective_epsilon is set to .0001.
In
gradient_epsilon double
Converge if objective changes less (using L-infinity norm) than this, ONLY applies to L-BFGS solver. Default indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001, otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.
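The conditional defaults for objective_epsilon and gradient_epsilon described above can be written out as two small functions. The function names are hypothetical; the values follow the two descriptions.

```python
def default_objective_epsilon(lambda_search, lambda_value):
    """Default objective_epsilon as described for this parameter."""
    if lambda_search:
        return 1e-4
    return 1e-6 if lambda_value == 0 else 1e-4


def default_gradient_epsilon(lambda_search, lambda_value):
    """Default gradient_epsilon (L-BFGS only) as described for this parameter."""
    if lambda_search:
        return 1e-8 if lambda_value == 0 else 1e-6
    return 1e-6 if lambda_value == 0 else 1e-4


for search in (False, True):
    for lam in (0.0, 0.1):
        print(search, lam,
              default_objective_epsilon(search, lam),
              default_gradient_epsilon(search, lam))
```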
In
obj_reg double
Likelihood divider in objective value computation, default is 1/nobs
In
link enum
Link function.
In
intercept boolean
Include constant term in the model
In
prior double
Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.
In
cold_start boolean
Only applicable to multiple alpha/lambda values when calling GLM from GAM. If false, build the next model for next set of alpha/lambda values starting from the values provided by current model. If true will start GLM model from scratch.
In
lambda_min_ratio double
Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.
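A lambda search driven by lambda_max, lambda_min_ratio and nlambdas can be sketched as a decreasing sequence of regularization strengths. The log-spaced (geometric) spacing below is an assumption, as are the function names; the default rules follow the nlambdas and lambda_min_ratio descriptions.

```python
def default_lambda_min_ratio(n_obs, n_vars):
    """0.0001 when observations outnumber variables, 0.01 otherwise.

    Assumption: the n_obs == n_vars case, which the text leaves open, uses 0.01.
    """
    return 1e-4 if n_obs > n_vars else 1e-2


def default_nlambdas(alpha):
    """30 lambdas for ridge (alpha == 0) under lambda search, otherwise 100."""
    return 30 if alpha == 0 else 100


def lambda_path(lambda_max, lambda_min_ratio, nlambdas):
    """Geometric sequence from lambda_max down to lambda_max * lambda_min_ratio."""
    if nlambdas < 2:
        return [lambda_max]
    factor = lambda_min_ratio ** (1.0 / (nlambdas - 1))
    return [lambda_max * factor ** i for i in range(nlambdas)]


path = lambda_path(lambda_max=1.0, lambda_min_ratio=1e-4, nlambdas=5)
print(path)
```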
Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. Default indicates: If the IRLSM solver is used, the value of max_active_predictors is set to 5000 otherwise it is set to 100000000.
In
interactions string[]
A list of predictor column indices to interact. All pairwise combinations will be computed for the list.
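"All pairwise combinations" of the listed predictors can be enumerated as below. The column names are placeholders and the helper name is hypothetical.

```python
from itertools import combinations


def pairwise_interactions(columns):
    """Return every unordered pair of the given predictor columns."""
    return list(combinations(columns, 2))


# Placeholder column names, for illustration only.
print(pairwise_interactions(["a", "b", "c"]))
```

A list of n predictors yields n * (n - 1) / 2 interaction pairs, so the count grows quadratically.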
A list of pairwise (first order) column interactions.
In
compute_p_values boolean
Request p-values computation, p-values work only with IRLSM solver and no regularization
In
remove_collinear_columns boolean
In case of linearly dependent columns, remove some of the dependent columns
In
store_knot_locations boolean
If set to true, will return knot locations as a double[][] array for the gam column names found in knots_for_gam. Defaults to false.
In
num_knots int[]
Number of knots for gam predictors. If specified, must specify one for each gam predictor. For monotone I-splines, minimum = 2, for cs spline, minimum = 3. For thin plate, minimum is size of polynomial basis + 2.
In
spline_orders int[]
Order of I-splines or NBSplineTypeI M-splines used for gam predictors. If specified, must be the same size as gam_columns. For I-splines, the spline_orders will be the same as the polynomials used to generate the splines. For M-splines, the polynomials used to generate the splines will be spline_order-1. Values for bs=0 or 1 will be ignored.
In
splines_non_negative boolean[]
Valid for I-spline (bs=2) only. True if the I-splines are monotonically increasing (and monotonically non-decreasing) and False if the I-splines are monotonically decreasing (and monotonically non-increasing). If specified, must be the same size as gam_columns. Values for other spline types will be ignored. Defaults to true.
In
gam_columns string[][]
Arrays of predictor column names for gam for smoothers using single or multiple predictors like {{‘c1’},{‘c2’,‘c3’},{‘c4’},…}
In
scale double[]
Smoothing parameter for gam predictors. If specified, must be of the same length as gam_columns
In
bs int[]
Basis function type for each gam predictors, 0 for cr, 1 for thin plate regression with knots, 2 for monotone I-splines, 3 for NBSplineTypeI M-splines (refer to doc here: https://github.com/h2oai/h2o-3/issues/6926). If specified, must be the same size as gam_columns
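Several GAM parameters must line up one-to-one with gam_columns. The sketch below is a hypothetical client-side check of those length rules, the allowed bs codes, and the stated minimum of 2 knots for monotone I-splines; it is not part of H2O.

```python
def validate_gam_params(gam_columns, bs=None, num_knots=None, scale=None,
                        spline_orders=None):
    """Check per-smoother arrays against gam_columns; raise ValueError on mismatch."""
    expected = len(gam_columns)
    per_smoother = {"bs": bs, "num_knots": num_knots, "scale": scale,
                    "spline_orders": spline_orders}
    for name, value in per_smoother.items():
        if value is not None and len(value) != expected:
            raise ValueError(f"{name} must have {expected} entries, got {len(value)}")
    if bs is not None:
        for code in bs:
            if code not in (0, 1, 2, 3):
                raise ValueError(f"unknown bs code: {code}")
    if bs is not None and num_knots is not None:
        for code, knots in zip(bs, num_knots):
            # bs=2 is the monotone I-spline, which needs at least 2 knots.
            if code == 2 and knots < 2:
                raise ValueError("I-splines need at least 2 knots")
    return True


# Placeholder column names matching the shape shown for gam_columns.
columns = [["c1"], ["c2", "c3"], ["c4"]]
print(validate_gam_params(columns, bs=[0, 1, 2], num_knots=[5, 10, 4],
                          scale=[0.1, 0.1, 0.1]))
```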
In
keep_gam_cols boolean
Save keys of model matrix
In
standardize_tp_gam_cols boolean
Standardize tp (thin plate) predictor columns
In
scale_tp_penalty_mat boolean
Scale penalty matrix for tp (thin plate) smoothers as in R
In
knot_ids string[]
Array storing frame keys of knots. One for each gam column set specified in gam_columns
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
missing_values_handling enum
Handling of missing values. Either MeanImputation, Skip or PlugValues.
In/Out
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.
In
max_abs_leafnode_pred double
Maximum absolute value of a leaf node prediction
In
pred_noise_bandwidth double
Bandwidth (sigma) of Gaussian multiplicative noise ~N(1,sigma) for tree node predictions
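Multiplicative Gaussian noise of the form N(1, sigma) scales each node prediction by a random factor centered on 1. The sketch below illustrates that idea with the standard library; the function name is hypothetical, and treating sigma as a standard deviation and a bandwidth of 0 as "no noise" are assumptions.

```python
import random


def add_prediction_noise(prediction, bandwidth, rng):
    """Scale a node prediction by a factor drawn from N(1, bandwidth).

    Assumptions: bandwidth is a standard deviation; 0 or less means no noise.
    """
    if bandwidth <= 0:
        return prediction
    return prediction * rng.gauss(1.0, bandwidth)


rng = random.Random(42)
noisy = [add_prediction_noise(2.0, 0.1, rng) for _ in range(10000)]
mean_noisy = sum(noisy) / len(noisy)
print(mean_noisy)  # stays close to the original prediction of 2.0
```

Because the noise factor has mean 1, predictions are unchanged on average while individual node values are perturbed.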
In
interaction_constraints string[][]
A set of allowed column interactions.
In
auto_rebalance boolean
Allow automatic rebalancing of training and validation datasets
In
ntrees int
Number of trees.
In
max_depth int
Maximum tree depth (0 for unlimited).
In
min_rows double
Fewest allowed (weighted) observations in a leaf.
In
nbins int
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point
In
nbins_top_level int
For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level
In
nbins_cats int
For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
In
r2_stopping double
r2_stopping is no longer supported and will be ignored if set. Use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop building trees once the R^2 metric equaled or exceeded this value.
In
seed long
Seed for pseudo random number generator (if applicable)
In
build_tree_one_node boolean
Run on one node only; no network overhead, but fewer CPUs are used. Suitable for small datasets.
In
sample_rate_per_class double[]
A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree
In
col_sample_rate_per_tree double
Column sample rate per tree (from 0.0 to 1.0)
In
col_sample_rate_change_per_level double
Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)
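A plausible reading of this parameter is that the column sampling rate is multiplied by the change factor once per tree level. The sketch below assumes exactly that, plus a cap at 1.0; both the compounding rule and the name `col_sample_rate_at_level` are assumptions made for illustration.

```python
def col_sample_rate_at_level(base_rate, change_per_level, depth):
    """Effective column sampling rate at a tree depth (root = 0)."""
    if not 0.0 < change_per_level <= 2.0:
        raise ValueError("change_per_level must be > 0.0 and <= 2.0")
    rate = base_rate * change_per_level ** depth
    return min(1.0, rate)  # a sampling rate cannot exceed 1.0
```

Under this reading, a base rate of 0.8 with a change factor of 0.5 samples 20% of the columns at depth 2, while a factor above 1.0 quickly saturates at 1.0.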
In
score_tree_interval int
Score the model after every so many trees. Disabled if set to 0.
In
min_split_improvement double
Minimum relative improvement in squared error reduction for a split to happen
In
histogram_type enum
What type of histogram to use for finding optimal split points
In
calibrate_model boolean
Use Platt Scaling (default) or Isotonic Regression to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.
In
in_training_checkpoints_dir string
Write checkpoints to the given directory while training is still running. If the cluster shuts down, the latest checkpoint can be used to restart training.
In
in_training_checkpoints_tree_interval int
Checkpoint the model after every so many trees. Parameter is used only when in_training_checkpoints_dir is defined
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
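For reference, quantile regression is commonly fit by minimizing the pinball loss, in which quantile_alpha weights under-predictions and (1 - quantile_alpha) weights over-predictions. The sketch below shows that standard loss for a single observation; the function name is hypothetical, and the source does not state how H2O computes it internally.

```python
def pinball_loss(actual, predicted, quantile_alpha):
    """Standard quantile (pinball) loss for one observation."""
    if not 0.0 < quantile_alpha < 1.0:
        raise ValueError("quantile_alpha must be between 0 and 1")
    residual = actual - predicted
    if residual >= 0:
        return quantile_alpha * residual          # prediction too low
    return (quantile_alpha - 1.0) * residual      # prediction too high
```

With quantile_alpha = 0.9, predicting too low costs nine times as much as predicting too high by the same amount, which pushes the fit toward the 90th percentile.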
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
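Keeping only the most frequent levels can be illustrated as follows. How the remaining levels are represented is not stated above, so mapping them to a single placeholder is an assumption of this sketch, as are the names `limit_levels` and `other`.

```python
from collections import Counter


def limit_levels(values, max_categorical_levels, other="other"):
    """Keep the most frequent levels and map everything else to a placeholder."""
    counts = Counter(values)
    kept = {level for level, _ in counts.most_common(max_categorical_levels)}
    return [value if value in kept else other for value in values]
```

For a column with levels a (3 times), b (2 times) and c (once), a limit of 2 keeps a and b and replaces c with the placeholder.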
In
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Check if response column is constant. If enabled, an exception is thrown if the response column is a constant value. If disabled, the model will train regardless of whether the response column is constant.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
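The stated equivalence between a weight of 2 and a repeated row can be checked on a simple statistic. The sketch below uses a weighted mean as a stand-in for a weighted loss; `weighted_mean` is a hypothetical helper, not an H2O function.

```python
def weighted_mean(values, weights):
    """Mean of values where each row counts according to its weight."""
    if any(weight < 0 for weight in weights):
        raise ValueError("negative weights are not allowed")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```

Weighting the rows [1.0, 3.0] with [2, 1] gives the same result as the plain mean of [1.0, 1.0, 3.0], and a row with weight 0 does not affect the result at all.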
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
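The idea behind the ‘Stratified’ option is that every fold receives roughly the same share of each response class. A deterministic round-robin version of that idea is sketched below; H2O's own assignment is not described here and may well be randomized, and both `stratified_folds` and its `nfolds` argument are names chosen for this illustration.

```python
from collections import defaultdict


def stratified_folds(labels, nfolds):
    """Assign a fold index to each row, cycling through folds per class."""
    if nfolds < 2:
        raise ValueError("nfolds must be at least 2")
    seen_per_class = defaultdict(int)
    folds = []
    for label in labels:
        folds.append(seen_per_class[label] % nfolds)  # next fold for this class
        seen_per_class[label] += 1
    return folds
```

Because each class cycles through the folds independently, the class proportions in any fold differ from the overall proportions by at most one row per class.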
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Seed for pseudo random number generator (if applicable).
In
family enum
Family. Use binomial for classification with logistic regression, others are for regression problems.
In
tweedie_variance_power double
Tweedie variance power
In
dispersion_learning_rate double
Dispersion learning rate is only valid for tweedie family dispersion parameter estimation using ml. It must be > 0. This controls how much the dispersion parameter estimate is to be changed when the calculated loglikelihood actually decreases with the new dispersion. In this case, instead of setting new dispersion = dispersion + change, we set new dispersion = dispersion + dispersion_learning_rate * change. Defaults to 0.5.
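The damped update described above can be written out directly. The sketch below follows the wording of the description; the function name and the way the two log-likelihood values are passed in are assumptions made for illustration.

```python
def next_dispersion(dispersion, change, loglik_current, loglik_candidate,
                    dispersion_learning_rate=0.5):
    """Apply a full step, or a damped step if the candidate lowers the log-likelihood.

    loglik_candidate is the log-likelihood evaluated at dispersion + change.
    """
    if dispersion_learning_rate <= 0:
        raise ValueError("dispersion_learning_rate must be > 0")
    if loglik_candidate < loglik_current:
        # The full step made things worse, so take a scaled-down step.
        return dispersion + dispersion_learning_rate * change
    return dispersion + change
```

With the default rate of 0.5, a proposed change of 0.4 that lowers the log-likelihood is cut to 0.2, while a change that raises it is applied in full.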
In
tweedie_link_power double
Tweedie link power.
In
theta double
Theta
In
solver enum
AUTO will set the solver based on the given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda search with an L1 penalty; L_BFGS scales better for datasets with many columns.
In
alpha double[]
Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.
In
lambda double[]
Regularization strength
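How alpha and lambda combine can be seen from the usual elastic net penalty, lambda * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2). The sketch below evaluates that standard form for a single alpha and lambda; the function name is hypothetical, and excluding the intercept from the coefficient list is left to the caller.

```python
def elastic_net_penalty(coefficients, alpha, lam):
    """Elastic net penalty for one (alpha, lambda) pair."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    l1 = sum(abs(b) for b in coefficients)        # Lasso part
    l2_squared = sum(b * b for b in coefficients)  # Ridge part
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)
```

Setting alpha to 1 leaves only the L1 term (Lasso), and setting it to 0 leaves only the L2 term (Ridge), matching the description of alpha above.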
In
lambda_search boolean
Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min.
In
early_stopping boolean
Stop early when there is no more relative improvement on train or validation (if provided).
In
nlambdas int
Number of lambdas to be used in a search. Default indicates: if alpha is zero and lambda_search is set to True, the value of nlambdas is set to 30 (fewer lambdas are needed for ridge regression); otherwise it is set to 100.
In
score_iteration_interval int
Perform scoring for every score_iteration_interval iterations.
In
standardize boolean
Standardize numeric columns to have zero mean and unit variance.
In
cold_start boolean
Only applicable to multiple alpha/lambda values. If false, build the model for the next set of alpha/lambda values starting from the values provided by the current model. If true, start each GLM model from scratch.
In
influence enum
If set to dfbetas will calculate the difference in beta when a datarow is included and excluded in the dataset.
Plug Values (a single row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues).
In
non_negative boolean
Restrict coefficients (not intercept) to be non-negative.
In
max_iterations int
Maximum number of iterations. Value should be >= 1. Set a value of 0 only when just the model coefficient names and model coefficient dimensions are needed.
In
beta_epsilon double
Converge if beta changes less (using L-infinity norm) than beta_epsilon. ONLY applies to IRLSM solver.
In
objective_epsilon double
Converge if objective value changes less than this. Default (of -1.0) indicates: If lambda_search is set to True the value of objective_epsilon is set to .0001. If the lambda_search is set to False and lambda is equal to zero, the value of objective_epsilon is set to .000001, for any other value of lambda the default value of objective_epsilon is set to .0001.
In
gradient_epsilon double
Converge if objective changes less (using L-infinity norm) than this, ONLY applies to L-BFGS solver. Default (of -1.0) indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001, otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.
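The default rules for objective_epsilon and gradient_epsilon are easier to follow as code. The sketch below only transcribes the rules stated in the two descriptions above, for a single lambda value; the function names are hypothetical.

```python
def default_objective_epsilon(lambda_search, lam):
    """Default used when objective_epsilon is left at -1.0."""
    if lambda_search:
        return 1e-4
    return 1e-6 if lam == 0 else 1e-4


def default_gradient_epsilon(lambda_search, lam):
    """Default used when gradient_epsilon is left at -1.0 (L-BFGS only)."""
    if lambda_search:
        return 1e-8 if lam == 0 else 1e-6
    return 1e-6 if lam == 0 else 1e-4
```

In both cases an unregularized fit (lambda equal to zero) gets the tighter tolerance, and lambda search tightens gradient_epsilon by a further two orders of magnitude.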
In
obj_reg double
Likelihood divider in objective value computation, default (of -1.0) will set it to 1/nobs.
In
link enum
Link function.
In
dispersion_parameter_method enum
Method used to estimate the dispersion parameter for Tweedie, Gamma and Negative Binomial only.
In
startval double[]
Double array used to initialize the coefficients for GLM. If standardize is true, the standardized coefficients should be used. Otherwise, use the regular coefficients.
In
calc_like boolean
If true, will return the likelihood function value.
In
generate_variable_inflation_factors boolean
If true, will generate variable inflation factors for numerical predictors. Defaults to false.
In
intercept boolean
Include constant term in the model
In
build_null_model boolean
If set, will build a model with only the intercept. Default to false.
In
fix_dispersion_parameter boolean
Only used for Tweedie, Gamma and Negative Binomial GLM. If set, will use the dispersion parameter in init_dispersion_parameter as the standard error and use it to calculate the p-values. Defaults to false.
In
init_dispersion_parameter double
Only used for Tweedie, Gamma and Negative Binomial GLM. Store the initial value of dispersion parameter. If fix_dispersion_parameter is set, this value will be used in the calculation of p-values.
In
prior double
Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.
In
lambda_min_ratio double
Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.
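The default for lambda_min_ratio and the resulting search range can be sketched as follows. The default rule is taken from the description above (which does not say what happens when the two counts are equal; the sketch treats that case like "fewer observations"). The geometric spacing of the lambda values is a common convention and an assumption here, and both function names are hypothetical.

```python
def default_lambda_min_ratio(n_observations, n_variables):
    """Default ratio of the smallest lambda to lambda_max."""
    return 1e-4 if n_observations > n_variables else 1e-2


def lambda_path(lambda_max, lambda_min_ratio, nlambdas):
    """Geometrically spaced lambdas from lambda_max down to lambda_max * ratio."""
    if nlambdas < 2:
        return [lambda_max]
    step = lambda_min_ratio ** (1.0 / (nlambdas - 1))
    return [lambda_max * step ** i for i in range(nlambdas)]
```

For example, with lambda_max = 1.0, a ratio of 0.0001 and 5 values, the path runs from 1.0 down to 0.0001 in equal multiplicative steps.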
Linear constraints: used to specify linear constraints involving more than one coefficient in standard form. It is only supported for solver IRLSM. It contains four columns: names (strings for coefficient names or constant), values, types (strings of ‘Equal’ or ‘LessThanEqual’), constraint_numbers (0 for first linear constraint, 1 for second linear constraint, …).
In
max_active_predictors int
Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. Default indicates: If the IRLSM solver is used, the value of max_active_predictors is set to 5000 otherwise it is set to 100000000.
In
interactions string[]
A list of predictor column indices to interact. All pairwise combinations will be computed for the list.
A list of pairwise (first order) column interactions.
In
compute_p_values boolean
Request p-values computation, p-values work only with IRLSM solver.
In
fix_tweedie_variance_power boolean
If true, will fix tweedie variance power value to the value set in tweedie_variance_power.
In
remove_collinear_columns boolean
In case of linearly dependent columns, remove the dependent columns.
In
generate_scoring_history boolean
If set to true, will generate scoring history for GLM. This may significantly slow down the algo.
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
missing_values_handling enum
Handling of missing values. Either MeanImputation, Skip or PlugValues.
In/Out
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs.
In/Out
dispersion_epsilon double
If the change in the dispersion parameter estimate or in the loglikelihood value is smaller than dispersion_epsilon, break out of the dispersion parameter estimation loop using maximum likelihood.
In/Out
tweedie_epsilon double
When estimating the tweedie dispersion parameter using maximum likelihood, this is used to choose the lower and upper indices in the approximation of the infinite series summation.
In/Out
max_iterations_dispersion int
Control the maximum number of iterations in the dispersion parameter estimation loop using maximum likelihood.
In/Out
init_optimal_glm boolean
If true, will initialize coefficients with values derived from GLM runs without linear constraints. Only available for linear constraints.
In/Out
separate_linear_beta boolean
If true, will keep the beta constraints and linear constraints separate. After new coefficients are found, first beta constraints will be applied followed by the application of linear constraints. Note that the beta constraints in this case will not be part of the objective function. If false, will combine the beta and linear constraints.
In/Out
constraint_eta0 double
For constrained GLM only. It affects the setting of eta_k+1=eta_0/power(ck+1, alpha).
In/Out
constraint_tau double
For constrained GLM only. It affects the setting of c_k+1=tau*c_k.
In/Out
constraint_alpha double
For constrained GLM only. It affects the setting of eta_k = eta_0/pow(c_0, alpha).
In/Out
constraint_beta double
For constrained GLM only. It affects the setting of eta_k+1 = eta_k/pow(c_k, beta).
In/Out
constraint_c0 double
For constrained GLM only. It affects the initial setting of epsilon_k = 1/c_0.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
[Deprecated] Use representation_name instead. Frame key to save resulting X.
In
representation_name string
Frame key to save resulting X
In
expand_user_y boolean
Expand categorical columns in user-specified initial Y
In
impute_original boolean
Reconstruct original training data by reversing transform
In
recover_svd boolean
Recover singular values and eigenvectors of XY
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Level of parallelism during grid model building. 1 = sequential building (default). 0 for adaptive parallelism. Any number > 1 sets the exact number of models built in parallel.
In
recovery_dir string
Path to a directory where grid will save everything necessary to resume training after cluster crash.
Milliseconds since the epoch for the time that this H2OError instance was created. Generally this is a short time since the underlying error occurred.
Out
error_url string
Error url
Out
msg string
Message intended for the end user (a data scientist).
Out
dev_msg string
Potentially more detailed message intended for a developer (e.g. a front end engineer or someone designing a language binding).
Out
http_status int
HTTP status code for this error.
Out
values Map
Any values that are relevant to reporting or handling this error. Examples are a key name if the error is on a key, or a field name and object name if it’s on a specific field.
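A client can turn an error response with these fields into a short message for logging. The sketch below assumes the error arrives as a JSON object whose keys match the field names listed above; the helper name `summarize_error` and the sample payload are hypothetical and not output captured from a real server.

```python
import json


def summarize_error(payload):
    """Build a one-line summary from an H2OError-style JSON string."""
    error = json.loads(payload)
    summary = f"HTTP {error['http_status']}: {error['msg']}"
    values = error.get("values") or {}
    if values:
        # Append the contextual values, sorted for a stable message.
        details = ", ".join(f"{key}={value}" for key, value in sorted(values.items()))
        summary += f" ({details})"
    return summary


# Hypothetical payload, for illustration only.
sample_payload = json.dumps({
    "http_status": 404,
    "msg": "Frame not found",
    "dev_msg": "Lookup failed for the requested frame key",
    "values": {"key": "missing_frame"},
})
```

Here msg is shown to the end user, while dev_msg and values stay available for more detailed diagnostics.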
Milliseconds since the epoch for the time that this H2OError instance was created. Generally this is a short time since the underlying error occurred.
Out
error_url string
Error url
Out
msg string
Message intended for the end user (a data scientist).
Out
dev_msg string
Potentially more detailed message intended for a developer (e.g. a front end engineer or someone designing a language binding).
Out
http_status int
HTTP status code for this error.
Out
values Map
Any values that are relevant to reporting or handling this error. Examples are a key name if the error is on a key, or a field name and object name if it’s on a specific field.
Plug Values (a single row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues).
In
family enum
Family. Only gaussian is supported now.
In
rand_family enum
Set distribution of random effects. Only Gaussian is implemented now.
In
max_iterations int
Maximum number of iterations. Value should be >= 1. Set a value of 0 only when just the model coefficient names and model coefficient dimensions are needed.
In
tau_u_var_init double
Initial variance of random coefficient effects. If set, should provide a value > 0.0. If not set, will be randomly set in the model building process.
In
tau_e_var_init double
Initial variance of random noise. If set, should provide a value > 0.0. If not set, will be randomly set in the model building process.
In
random_columns string[]
Random columns indices for HGLM.
In
method enum
Only EM is implemented as a method to obtain the fixed and random coefficients and the various variances.
In
em_epsilon double
Converge if beta/ubeta/tmat/tauEVar changes less (using L-infinity norm) than em_epsilon. ONLY applies to EM method.
In
random_intercept boolean
If true, will allow random component to the GLM coefficients.
In
group_column string
Group column is the column that is categorical and used to generate the groups in HGLM
In
gen_syn_data boolean
If true, add gaussian noise with variance specified in parms._tau_e_var_init.
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
missing_values_handling enum
Handling of missing values. Either MeanImputation, Skip or PlugValues.
In/Out
initial_fixed_effects double[]
An array that contains initial values of the fixed effects coefficient.
An H2OFrame id that contains initial values of the random effects coefficient. The row names should be the random coefficient names. If you are not sure what the random coefficient names are, build an HGLM model with max_iterations = 0 and check the model output field random_coefficient_names. The number of rows of this frame should be the number of level 2 units. Again, to figure this out, build an HGLM model with max_iterations = 0 and check the model output field group_column_names. The number of rows should equal the length of group_column_names.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
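The interaction of stopping_rounds, stopping_metric and stopping_tolerance can be sketched as follows. This is a simplified reading of the descriptions above, not the H2O implementation: the function names, the choice of reference window, and the handling of a zero reference value are assumptions.

```python
def moving_averages(values, k):
    """Simple moving averages of length k over a list of metric values."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


def should_stop(history, stopping_rounds, stopping_tolerance, lower_is_better=True):
    """Return True if the metric has stopped improving (hypothetical helper).

    history: stopping_metric values, one per scoring event, oldest first.
    """
    k = stopping_rounds
    if k == 0:
        return False  # 0 disables early stopping
    if len(history) < 2 * k:
        return False  # not enough scoring events to compare two windows
    averages = moving_averages(history, k)
    reference = averages[-k - 1]  # moving average from k scoring events ago
    recent = averages[-k:]        # the k most recent moving averages
    best_recent = min(recent) if lower_is_better else max(recent)
    gain = (reference - best_recent) if lower_is_better else (best_recent - reference)
    if reference == 0:
        return gain <= 0  # assumption: fall back to an absolute comparison
    # Stop when the relative improvement is below the tolerance.
    return gain / abs(reference) < stopping_tolerance


if __name__ == "__main__":
    logloss = [0.60, 0.50, 0.45, 0.449, 0.4489, 0.4488]
    print(should_stop(logloss, stopping_rounds=2, stopping_tolerance=0.01))
```

Using a moving average rather than the raw metric makes the decision less sensitive to noise in individual scoring events.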
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
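The reference format language:keyName=funcName used by custom_metric_func and custom_distribution_func can be split into its three parts as in the sketch below. The parser and the example value are hypothetical and only illustrate the format string; they are not part of the H2O API.

```python
def parse_function_reference(reference):
    """Split 'language:keyName=funcName' into its parts (hypothetical helper)."""
    language, colon, rest = reference.partition(":")
    if not colon or not language:
        raise ValueError("expected 'language:keyName=funcName'")
    key_name, equals, func_name = rest.partition("=")
    if not equals or not key_name or not func_name:
        raise ValueError("expected 'language:keyName=funcName'")
    return {"language": language, "key_name": key_name, "func_name": func_name}


if __name__ == "__main__":
    # The key and function names below are made up for illustration.
    print(parse_function_reference("python:my_metric_key=metrics.MyMetric"))
```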
export_checkpoints_dir string
Automatically export generated models to this directory.
Plug Values (a single row frame containing values that will be used to impute missing values of the training/validation frame, use in conjunction with missing_values_handling = PlugValues).
In
max_iterations int
Maximum number of iterations.
In
prior double
Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.
In
algorithm_params string
Customized parameters for the machine learning algorithm specified in the algorithm parameter.
In
protected_columns string[]
Columns that contain features that are sensitive and need to be protected (legally, or otherwise), if applicable. These features (e.g. race, gender, etc) should not drive the prediction of the response.
In
total_information_threshold double
A number between 0 and 1 representing a threshold for total information, defaulting to 0.1. For a specific feature, if the total information is higher than this threshold, and the corresponding net information is also higher than the threshold net_information_threshold, that feature will be considered admissible. The total information is the x-axis of the Core Infogram. Default is -1 which gets set to 0.1.
In
net_information_threshold double
A number between 0 and 1 representing a threshold for net information, defaulting to 0.1. For a specific feature, if the net information is higher than this threshold, and the corresponding total information is also higher than the total_information_threshold, that feature will be considered admissible. The net information is the y-axis of the Core Infogram. Default is -1 which gets set to 0.1.
In
relevance_index_threshold double
A number between 0 and 1 representing a threshold for the relevance index, defaulting to 0.1. This is only used when protected_columns is set by the user. For a specific feature, if the relevance index value is higher than this threshold, and the corresponding safety index is also higher than the safety_index_threshold, that feature will be considered admissible. The relevance index is the x-axis of the Fair Infogram. Default is -1 which gets set to 0.1.
In
safety_index_threshold double
A number between 0 and 1 representing a threshold for the safety index, defaulting to 0.1. This is only used when protected_columns is set by the user. For a specific feature, if the safety index value is higher than this threshold, and the corresponding relevance index is also higher than the relevance_index_threshold, that feature will be considered admissible. The safety index is the y-axis of the Fair Infogram. Default is -1 which gets set to 0.1.
In
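The admissibility rule shared by the four threshold parameters above can be sketched as a small function. The helper names are assumptions; the logic follows the descriptions only: a feature is admissible when both of its index values are higher than their thresholds, and a threshold of -1 is replaced by 0.1.

```python
DEFAULT_THRESHOLD = 0.1  # value that -1 is replaced with, per the descriptions


def resolve_threshold(value):
    """Map the sentinel -1 to the default and validate the range."""
    if value == -1:
        return DEFAULT_THRESHOLD
    if not 0.0 <= value <= 1.0:
        raise ValueError("threshold must be between 0 and 1, or -1 for the default")
    return value


def is_admissible(x_value, y_value, x_threshold=-1, y_threshold=-1):
    """Core Infogram: x = total information, y = net information.
    Fair Infogram: x = relevance index, y = safety index.
    (Hypothetical helper, not an H2O function.)
    """
    return (x_value > resolve_threshold(x_threshold)
            and y_value > resolve_threshold(y_threshold))


if __name__ == "__main__":
    print(is_admissible(0.8, 0.3))   # both above 0.1
    print(is_admissible(0.8, 0.05))  # y value too low
```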
data_fraction double
The fraction of training frame to use to build the infogram model. Defaults to 1.0, and any value greater than 0 and less than or equal to 1.0 is acceptable.
In
top_n_features int
An integer specifying the number of columns to evaluate in the infogram. The columns are ranked by variable importance, and the top N are evaluated. Defaults to 50.
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
algorithm enum
Type of machine learning algorithm used to build the infogram. Options include ‘AUTO’ (gbm), ‘deeplearning’ (Deep Learning with default parameters), ‘drf’ (Random Forest with default parameters), ‘gbm’ (GBM with default parameters), ‘glm’ (GLM with default parameters), or ‘xgboost’ (if available, XGBoost with default parameters).
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”
In
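A request URL with _exclude_fields can be assembled with the Python standard library, as in this minimal sketch. The base URL is an assumption (a locally running server); only the /3/Frames path and the parameter value come from the description above.

```python
from urllib.parse import urlencode


def frames_url(base_url, exclude_fields):
    """Build a /3/Frames URL that excludes the given JSON field paths."""
    # Keep '/' and ',' unescaped so the query matches the documented form.
    query = urlencode({"_exclude_fields": ",".join(exclude_fields)}, safe="/,")
    return f"{base_url}/3/Frames?{query}"


if __name__ == "__main__":
    # "http://localhost:54321" is an assumed address for illustration.
    print(frames_url("http://localhost:54321", ["frames/frame_id/URL", "__meta"]))
```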
session_key string
Session ID
In/Out
session_properties_allowed boolean
Indicates whether users are allowed to set and modify session properties
Out
InputSchemaV4
_fields string
Filter on the set of output fields: if you set _fields=“foo,bar,baz”, then only those fields will be included in the output; or you can specify _fields=“-goo,gee” to include all fields except goo and gee. If the result contains nested data structures, then you can refer to the fields within those structures as well. For example if you specify _fields=“foo(oof),bar(-rab)”, then only fields foo and bar will be included, and within foo there will be only field oof, whereas within bar all fields except rab will be reported.
In
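The include/exclude semantics of _fields can be illustrated on a flat dictionary. This sketch is a local simulation written for this example, not server code, and it deliberately does not handle the nested form such as foo(oof).

```python
def filter_fields(record, fields_spec):
    """Apply a flat _fields specification to a dict (hypothetical helper).

    "foo,bar"   keeps only foo and bar.
    "-goo,gee"  keeps everything except goo and gee.
    """
    names = [name.strip() for name in fields_spec.split(",") if name.strip()]
    if not names:
        return dict(record)
    if names[0].startswith("-"):
        # A leading '-' turns the whole list into an exclusion list.
        excluded = {names[0][1:], *names[1:]}
        return {key: value for key, value in record.items() if key not in excluded}
    included = set(names)
    return {key: value for key, value in record.items() if key in included}


if __name__ == "__main__":
    record = {"foo": 1, "bar": 2, "goo": 3, "gee": 4}
    print(filter_fields(record, "foo,bar"))
    print(filter_fields(record, "-goo,gee"))
```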
InteractionV3
_exclude_fields string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”
Whether to create pairwise quadratic interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.
In/Out
max_factors int
Max. number of factor levels in pair-wise interaction terms (if enforced, one extra catch-all factor will be made)
In/Out
min_occurrence int
Min. occurrence threshold for factor levels in pair-wise interaction terms
Number of randomly sampled observations used to train each Isolation Forest tree. Only one of parameters sample_size and sample_rate should be defined. If sample_rate is defined, sample_size will be ignored.
In
sample_rate double
Rate of randomly sampled observations used to train each Isolation Forest tree. Needs to be in range from 0.0 to 1.0. If set to -1, sample_rate is disabled and sample_size will be used instead.
In
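How sample_size and sample_rate interact can be sketched as below. The precedence rule follows the two descriptions above; the default of 256 for sample_size, the rounding, and the cap at the number of rows are assumptions made for this illustration.

```python
def rows_per_tree(n_rows, sample_size=256, sample_rate=-1.0):
    """Number of rows sampled for each Isolation Forest tree (hypothetical helper)."""
    if sample_rate == -1:
        # sample_rate is disabled, so sample_size is used instead.
        return min(sample_size, n_rows)
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0.0, 1.0] or -1 to disable")
    # sample_rate is defined, so sample_size is ignored.
    return int(round(sample_rate * n_rows))


if __name__ == "__main__":
    print(rows_per_tree(10_000))                   # falls back to sample_size
    print(rows_per_tree(10_000, sample_rate=0.1))  # sample_size is ignored
```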
mtries int
Number of variables randomly sampled as candidates at each split. If set to -1, defaults to (number of predictors)/3.
In
contamination double
Contamination ratio - the proportion of anomalies in the input dataset. If undefined (-1) the predict function will not mark observations as anomalies and only anomaly score will be returned. Defaults to -1 (undefined).
In
ntrees int
Number of trees.
In
max_depth int
Maximum tree depth (0 for unlimited).
In
min_rows double
Fewest allowed (weighted) observations in a leaf.
In
nbins int
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point
In
nbins_top_level int
For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level
In
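One way to read nbins and nbins_top_level together is that the bin count starts at nbins_top_level at the root, halves at every level, and does not drop below nbins. The sketch below encodes that reading; the default values of 20 and 1024 and the use of nbins as a floor are assumptions, not statements about the H2O implementation.

```python
def bins_at_depth(depth, nbins=20, nbins_top_level=1024):
    """Histogram bins used for numerical columns at a tree depth (hypothetical helper)."""
    # Halve the root-level bin count once per level, but keep at least nbins.
    return max(nbins, nbins_top_level >> depth)


if __name__ == "__main__":
    print([bins_at_depth(depth) for depth in range(8)])
```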
nbins_cats int
For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
In
r2_stopping double
r2_stopping is no longer supported and will be ignored if set. Please use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop making trees when the R^2 metric equaled or exceeded this value.
In
seed long
Seed for pseudo random number generator (if applicable)
In
build_tree_one_node boolean
Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets.
In
sample_rate_per_class double[]
A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree
In
col_sample_rate_per_tree double
Column sample rate per tree (from 0.0 to 1.0)
In
col_sample_rate_change_per_level double
Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)
In
score_tree_interval int
Score the model after every so many trees. Disabled if set to 0.
In
min_split_improvement double
Minimum relative improvement in squared error reduction for a split to happen
In
histogram_type enum
What type of histogram to use for finding optimal split points
In
calibrate_model boolean
Use Platt Scaling (default) or Isotonic Regression to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.
In
in_training_checkpoints_dir string
Write checkpoints to the specified directory while training is still running. In case of cluster shutdown, such a checkpoint can be used to restart training.
In
in_training_checkpoints_tree_interval int
Checkpoint the model after every so many trees. Parameter is used only when in_training_checkpoints_dir is defined
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
(experimental) Name of the response column in the validation frame. Response column should be binary and indicate not anomaly/anomaly.
In/Out
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Check if the response column is constant. If enabled, an exception is thrown if the response column is a constant value. If disabled, the model will train regardless of whether the response column is constant.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
out_of_bounds enum
Method of handling values of X predictor that are outside of the bounds seen in training.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Filter on the set of output fields: if you set _fields=“foo,bar,baz”, then only those fields will be included in the output; or you can specify _fields=“-goo,gee” to include all fields except goo and gee. If the result contains nested data structures, then you can refer to the fields within those structures as well. For example if you specify _fields=“foo(oof),bar(-rab)”, then only fields foo and bar will be included, and within foo there will be only field oof, whereas within bar all fields except rab will be reported.
In
JobKeyV3
name string
Name (string representation) for this Key.
In/Out
type string
Name (string representation) for the type of Keyed this Key points to.
In/Out
URL string
URL for the resource that this Key points to, if one exists.
This option allows you to specify a dataframe, where each row represents an initial cluster center. The user-specified points must have the same number of columns as the training observations. The number of rows must equal the number of clusters
In
max_iterations int
Maximum training iterations (if estimate_k is enabled, then this is for each inner Lloyds iteration)
In
standardize boolean
Standardize columns before computing distances
In
seed long
RNG Seed
In
init enum
Initialization mode
In
estimate_k boolean
Whether to estimate the number of clusters (<=k) iteratively and deterministically.
In
cluster_size_constraints int[]
An array specifying the minimum number of points that should be in each cluster. The length of the constraints array has to be the same as the number of clusters.
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
k int
The max. number of clusters. If estimate_k is disabled, the model will find k centroids, otherwise it will find up to k centroids.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Compute reconstruction error (optional, only for Deep Learning AutoEncoder models)
In
reconstruction_error_per_feature boolean
Compute reconstruction error per feature (optional, only for Deep Learning AutoEncoder models)
In
deep_features_hidden_layer int
Extract Deep Features for given hidden layer (optional, only for Deep Learning models)
In
deep_features_hidden_layer_name string
Extract Deep Features for given hidden layer by name (optional, only for Deep Water models)
In
reconstruct_train boolean
Reconstruct original training frame (optional, only for GLRM models)
In
project_archetypes boolean
Project GLRM archetypes back into original feature space (optional, only for GLRM models)
In
reverse_transform boolean
Reverse transformation applied during training to model output (optional, only for GLRM models)
In
leaf_node_assignment boolean
Return the leaf node assignment (optional, only for DRF/GBM models)
In
leaf_node_assignment_type enum
Type of the leaf node assignment (optional, only for DRF/GBM models)
In
predict_staged_proba boolean
Predict the class probabilities at each stage (optional, only for GBM models)
In
predict_contributions boolean
Predict the feature contributions - Shapley values (optional, only for DRF, GBM and XGBoost models)
In
row_to_tree_assignment boolean
Return which row is used in which tree (optional, only for GBM models)
In
predict_contributions_output_format enum
Specify how to output feature contributions in XGBoost - XGBoost by default outputs contributions for 1-hot encoded features, specifying a Compact output format will produce a per-feature contribution
In
top_n int
Only for predict_contributions function - sort Shapley values and return top_n highest (optional)
In
bottom_n int
Only for predict_contributions function - sort Shapley values and return bottom_n lowest (optional)
In
compare_abs boolean
Only for predict_contributions function - sort absolute Shapley values (optional)
In
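The effect of top_n, bottom_n and compare_abs on a single row of contributions can be sketched as follows. This is an illustration of the sorting described above using a plain dictionary; the function name, the input layout and the returned structure are assumptions and do not reproduce the output frame of the actual API.

```python
def select_contributions(contributions, top_n=0, bottom_n=0, compare_abs=False):
    """Pick the highest and lowest Shapley values for one row (hypothetical helper).

    contributions: dict mapping feature name to its contribution.
    Returns (top, bottom) as lists of (feature, value) pairs.
    """
    if compare_abs:
        sort_key = lambda item: abs(item[1])  # rank by magnitude
    else:
        sort_key = lambda item: item[1]       # rank by signed value
    ranked = sorted(contributions.items(), key=sort_key, reverse=True)
    top = ranked[:top_n]
    bottom = ranked[::-1][:bottom_n]  # lowest first
    return top, bottom


if __name__ == "__main__":
    row = {"age": 0.30, "income": -0.50, "tenure": 0.05}  # made-up values
    print(select_contributions(row, top_n=1, bottom_n=1))
    print(select_contributions(row, top_n=1, bottom_n=1, compare_abs=True))
```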
feature_frequencies boolean
Retrieve the feature frequencies on paths in trees in tree-based models (optional, only for GBM, DRF and Isolation Forest)
In
exemplar_index int
Retrieve all members for a given exemplar (optional, only for Aggregator models)
In
deviances boolean
Compute the deviances per row (optional, only for classification or regression models)
In
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In
auc_type string
Set default multinomial AUC type. Must be one of: “AUTO”, “NONE”, “MACRO_OVR”, “WEIGHTED_OVR”, “MACRO_OVO”, “WEIGHTED_OVO”. Default is “NONE” (optional, only for multinomial classification).
In
auuc_type string
Set default AUUC type for uplift binomial classification. Must be one of: “AUTO”, “qini”, “lift”, “gain”. Default is “AUTO” (optional, only for uplift binomial classification).
In
auuc_nbins int
Set number of bins to calculate AUUC. Must be -1 or higher than 0. Default is -1 which means 1000 (optional, only for uplift binomial classification).
Specify background frame used as a reference for calculating SHAP.
In
output_space boolean
If true, transform contributions so that they sum up to the difference in the output space (applicable iff contributions are in link space). Note that this transformation is an approximation and the contributions won’t be exact SHAP values.
In
output_per_reference boolean
If true, return contributions against each background sample (aka reference), i.e. phi(feature, x, bg), otherwise return contributions averaged over the background sample (phi(feature, x) = E_{bg} phi(feature, x, bg))
In
_exclude_fields string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”
For Vec-type fields this is the set of Frame-type fields which must contain the named column; for example, for a SupervisedModel the response_column must be in both the training_frame and (if it’s set) the validation_frame
In
is_mutually_exclusive_with string[]
For Vec-type fields this is the set of other Vec-type fields which must contain mutually exclusive values; for example, for a SupervisedModel the response_column must be mutually exclusive with the weights_column
In
name string
name in the JSON, e.g. “lambda”
Out
label string
[DEPRECATED] same as name.
Out
help string
help for the UI, e.g. “regularization multiplier, typically used for foo bar baz etc.”
Out
required boolean
the field is required
Out
type string
Java type, e.g. “double”
Out
default_value Polymorphic
default value, e.g. 1
Out
actual_value Polymorphic
actual value as set by the user and / or modified by the ModelBuilder, e.g., 10
Out
input_value Polymorphic
input value as set by the user, e.g., 10
Out
level string
the importance of the parameter, used by the UI, e.g. “critical”, “extended” or “expert”
Out
values string[]
list of valid values for use by the front-end
Out
gridable boolean
Parameter can be used in grid call
Out
ModelParametersSchemaV3
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
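A reference in the language:keyName=funcName format can be split into its three parts as sketched below. The parser and the sample value are hypothetical and only illustrate the shape of the string.

```python
from typing import NamedTuple

class FuncRef(NamedTuple):
    language: str
    key_name: str
    func_name: str

def parse_func_ref(ref):
    """Split 'language:keyName=funcName' into its components."""
    language, colon, rest = ref.partition(":")
    key_name, equals, func_name = rest.partition("=")
    if not (colon and equals and language and key_name and func_name):
        raise ValueError("expected language:keyName=funcName, got %r" % ref)
    return FuncRef(language, key_name, func_name)

# Sample value invented for this example.
ref = parse_func_ref("python:my_metric_key=metrics.MyMetric")
print(ref)
```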
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Seed for pseudo random number generator (if applicable)
In
family enum
Family. For maxr/maxrsweep, only gaussian. For backward, ordinal and multinomial families are not supported
In
tweedie_variance_power double
Tweedie variance power
In
tweedie_link_power double
Tweedie link power
In
theta double
Theta
In
solver enum
AUTO will set the solver based on the given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda search with L1 penalty; L_BFGS scales better for datasets with many columns.
In
alpha double[]
Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.
In
lambda double[]
Regularization strength
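How alpha mixes the two penalties and how lambda scales them can be sketched with the common elastic-net form lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2). The factor of 1/2 on the L2 term is a convention assumed here, and the function name is hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic-net penalty for coefficients beta (intercept excluded).

    alpha = 1 gives a pure L1 (Lasso) penalty, alpha = 0 a pure L2 (Ridge)
    penalty; values in between mix the two.
    """
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

beta = [3.0, -4.0]
print(elastic_net_penalty(beta, lam=0.1, alpha=1.0))  # Lasso only
print(elastic_net_penalty(beta, lam=0.1, alpha=0.0))  # Ridge only
print(elastic_net_penalty(beta, lam=0.1, alpha=0.5))  # equal mix
```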
In
lambda_search boolean
Use lambda search starting at lambda max, given lambda is then interpreted as lambda min
In
multinode_mode boolean
For maxrsweep only. If enabled, will attempt to perform sweeping action using multiple nodes in the cluster. Defaults to false.
In
build_glm_model boolean
For maxrsweep mode only. If true, will return full-blown GLM models with the desired predictor subsets. If false, only the predictor subsets and predictor coefficients are returned. This is for speeding up the model selection process. Users can choose to build the GLM models themselves by using the predictor subsets. Defaults to false.
In
early_stopping boolean
Stop early when there is no more relative improvement on train or validation (if provided)
In
nlambdas int
Number of lambdas to be used in a search. Default indicates: If alpha is zero, with lambda search set to True, the value of nlambdas is set to 30 (fewer lambdas are needed for ridge regression); otherwise it is set to 100.
In
score_iteration_interval int
Perform scoring for every score_iteration_interval iterations
In
standardize boolean
Standardize numeric columns to have zero mean and unit variance
In
cold_start boolean
Only applicable to multiple alpha/lambda values. If false, build the next model for next set of alpha/lambda values starting from the values provided by current model. If true will start GLM model from scratch.
Plug Values (a single-row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues)
In
non_negative boolean
Restrict coefficients (not intercept) to be non-negative
In
max_iterations int
Maximum number of iterations
In
beta_epsilon double
Converge if beta changes less (using L-infinity norm) than beta epsilon. ONLY applies to IRLSM solver
In
objective_epsilon double
Converge if objective value changes less than this. Default (of -1.0) indicates: If lambda_search is set to True the value of objective_epsilon is set to .0001. If the lambda_search is set to False and lambda is equal to zero, the value of objective_epsilon is set to .000001, for any other value of lambda the default value of objective_epsilon is set to .0001.
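The default resolution described above can be written out as a small function. This is a sketch of the stated rules only; the function name is hypothetical.

```python
def resolve_objective_epsilon(objective_epsilon, lambda_search, lam):
    """Resolve the -1.0 sentinel for objective_epsilon as described above."""
    if objective_epsilon != -1.0:
        return objective_epsilon  # user supplied an explicit value
    if lambda_search:
        return 1e-4
    # No lambda search: the tighter tolerance applies only when lambda is zero.
    return 1e-6 if lam == 0 else 1e-4

print(resolve_objective_epsilon(-1.0, lambda_search=False, lam=0.0))
```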
In
gradient_epsilon double
Converge if objective changes less (using L-infinity norm) than this, ONLY applies to L-BFGS solver. Default (of -1.0) indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001, otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.
In
obj_reg double
Likelihood divider in objective value computation, default (of -1.0) will set it to 1/nobs
In
link enum
Link function.
In
startval double[]
Double array to initialize coefficients for GLM.
In
calc_like boolean
If true, will return likelihood function value for GLM.
In
intercept boolean
Include constant term in the model
In
prior double
Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.
In
lambda_min_ratio double
Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.
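The defaults for lambda_min_ratio and nlambdas can be combined into a sketch of a lambda search path. The log-spaced sequence from lambda_max down to lambda_max * lambda_min_ratio is an assumption made for illustration, as is the treatment of the case where observations equal variables (not covered by the description); all function names are hypothetical.

```python
def default_lambda_min_ratio(nobs, nvars):
    # The equal case is unspecified in the description; it is treated
    # like nobs < nvars here (assumption).
    return 1e-4 if nobs > nvars else 1e-2

def default_nlambdas(alpha, lambda_search):
    # Fewer lambdas are needed for ridge regression (alpha == 0).
    return 30 if (lambda_search and alpha == 0) else 100

def lambda_path(lambda_max, nobs, nvars, alpha):
    """Log-spaced lambdas from lambda_max to lambda_max * ratio (assumed shape)."""
    ratio = default_lambda_min_ratio(nobs, nvars)
    n = default_nlambdas(alpha, lambda_search=True)
    step = ratio ** (1.0 / (n - 1))
    return [lambda_max * step ** i for i in range(n)]

path = lambda_path(lambda_max=1.0, nobs=1000, nvars=10, alpha=0.5)
print(len(path), path[0], path[-1])
```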
Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. Default indicates: If the IRLSM solver is used, the value of max_active_predictors is set to 5000 otherwise it is set to 100000000.
In
compute_p_values boolean
Request p-values computation, p-values work only with IRLSM solver and no regularization
In
remove_collinear_columns boolean
In case of linearly dependent columns, remove some of the dependent columns
In
max_predictor_number int
Maximum number of predictors to be considered when building GLM models. Defaults to 1.
In
min_predictor_number int
For mode = ‘backward’ only. Minimum number of predictors to be considered when building GLM models starting with all predictors to be included. Defaults to 1.
In
nparallelism int
Number of models to build in parallel. Defaults to 0.0, which is adaptive to the system capability
In
p_values_threshold double
For mode = ‘backward’ only. If specified, will stop the model building process when all coefficients' p-values drop below this threshold
In
influence enum
If set to dfbetas, will calculate the difference in beta when a data row is included in and excluded from the dataset.
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
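For context, the role of the desired quantile can be seen in the standard pinball (quantile) loss, sketched below. This is the textbook definition, shown as an illustration rather than as the exact loss code used during training; the function name is hypothetical.

```python
def pinball_loss(actual, predicted, quantile_alpha):
    """Pinball loss for one observation at the given quantile."""
    if not 0.0 <= quantile_alpha <= 1.0:
        raise ValueError("quantile_alpha must be between 0 and 1")
    error = actual - predicted
    # Under-prediction is weighted by alpha, over-prediction by (1 - alpha).
    return quantile_alpha * error if error >= 0 else (quantile_alpha - 1.0) * error

# With a high quantile, predicting too low costs more than predicting too high.
print(pinball_loss(10.0, 8.0, 0.9))
print(pinball_loss(8.0, 10.0, 0.9))
```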
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
missing_values_handling enum
Handling of missing values. Either MeanImputation, Skip or PlugValues.
In/Out
mode enum
Mode: selects the model selection algorithm. Options are ‘allsubsets’ for all subsets; ‘maxr’, which uses sequential replacement and GLM to build all models (slow, but works with cross-validation and validation frames for more robust results); ‘maxrsweep’, which uses sequential replacement and the sweeping action and is much faster than ‘maxr’; and ‘backward’ for backward selection.
In/Out
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Min. standard deviation to use for observations with not enough data
In
eps_sdev double
Cutoff below which standard deviation is replaced with min_sdev
In
min_prob double
Min. probability to use for observations with not enough data
In
eps_prob double
Cutoff below which probability is replaced with min_prob
In
compute_metrics boolean
Compute metrics on training data
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
In/Out
seed long
Seed for pseudo random number generator (only used for cross-validation and fold_assignment = “Random” or “AUTO”)
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Specify the algorithm to use for computing the principal components: GramSVD - uses a distributed computation of the Gram matrix, followed by a local SVD; Power - computes the SVD using the power iteration method (experimental); Randomized - uses randomized subspace iteration method; GLRM - fits a generalized low-rank model with L2 loss function and no regularization and solves for the SVD using local matrix algebra (experimental)
In
pca_impl enum
Specify the implementation to use for computing PCA (via SVD or EVD): MTJ_EVD_DENSEMATRIX - eigenvalue decompositions for dense matrix using MTJ; MTJ_EVD_SYMMMATRIX - eigenvalue decompositions for symmetric matrix using MTJ; MTJ_SVD_DENSEMATRIX - singular-value decompositions for dense matrix using MTJ; JAMA - eigenvalue decompositions for dense matrix using JAMA. References: JAMA - http://math.nist.gov/javanumerics/jama/; MTJ - https://github.com/fommil/matrix-toolkits-java/
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
k int
Rank of matrix approximation
In/Out
max_iterations int
Maximum training iterations
In/Out
seed long
RNG seed for initialization
In/Out
use_all_factor_levels boolean
Whether first factor level is included in each categorical expansion
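The effect of including or dropping the first factor level can be sketched with a plain one-hot expansion. The helper below is hypothetical and only illustrates the resulting column layout.

```python
def expand_categorical(values, levels, use_all_factor_levels):
    """One-hot expand `values` over `levels`.

    When use_all_factor_levels is False, the first level is dropped and
    is represented by an all-zero row.
    """
    kept = levels if use_all_factor_levels else levels[1:]
    return [[1 if v == level else 0 for level in kept] for v in values]

levels = ["red", "green", "blue"]
full = expand_categorical(["red", "blue"], levels, use_all_factor_levels=True)
reduced = expand_categorical(["red", "blue"], levels, use_all_factor_levels=False)
print(full)
print(reduced)
```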
In/Out
compute_metrics boolean
Whether to compute metrics on the training data
In/Out
impute_missing boolean
Whether to impute missing entries with the column mean
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Coefficient of the kernel (currently RBF gamma for gaussian kernel, -1 means 1/#features)
In
rank_ratio double
Desired rank of the ICF matrix expressed as a ratio of the number of input rows (-1 means use sqrt(#rows)).
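The two -1 sentinels above (kernel coefficient and rank ratio) can be resolved as sketched below. The function names and the rounding of the rank to a whole number are assumptions for this example.

```python
import math

def resolve_kernel_coefficient(coefficient, n_features):
    """-1 means 1 / (number of features)."""
    return 1.0 / n_features if coefficient == -1 else coefficient

def resolve_icf_rank(rank_ratio, n_rows):
    """-1 means sqrt(number of rows); otherwise a fraction of the rows.

    Rounding to the nearest integer (and at least 1) is an assumption.
    """
    rank = math.sqrt(n_rows) if rank_ratio == -1 else rank_ratio * n_rows
    return max(1, round(rank))

print(resolve_kernel_coefficient(-1, n_features=4))
print(resolve_icf_rank(-1, n_rows=10000))
print(resolve_icf_rank(0.25, n_rows=400))
```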
In
positive_weight double
Weight of positive (+1) class of observations
In
negative_weight double
Weight of negative (-1) class of observations
In
disable_training_metrics boolean
Disable calculating training metrics (expensive on large datasets)
In
sv_threshold double
Threshold for accepting a candidate observation into the set of support vectors
In
max_iterations int
Maximum number of iterations of the algorithm
In
fact_threshold double
Convergence threshold of the Incomplete Cholesky Factorization (ICF)
In
feasible_threshold double
Convergence threshold for primal-dual residuals in the IPM iteration
In
surrogate_gap_threshold double
Feasibility criterion of the surrogate duality gap (eta)
In
mu_factor double
Increasing factor mu
In
seed long
Seed for pseudo random number generator (if applicable)
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Key-reference to an initialized instance of a Decryption Tool
In
force_col_types boolean
If true, will force the column types to be either the ones in Parquet schema for Parquet files or the ones specified in column_types. This parameter is used for numerical columns only. Other column settings will happen without setting this parameter. Defaults to false.
In
tz_adjust_to_local boolean
Adjust the imported time from GMT timezone to cluster timezone.
In
_exclude_fields string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”
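A request path carrying _exclude_fields can be assembled with the standard library as sketched below. The helper name is hypothetical; the endpoint and field paths are the ones from the example in the description.

```python
from urllib.parse import urlencode

def frames_url(exclude_fields):
    """Build the /3/Frames path with a comma-separated _exclude_fields value."""
    # Keep '/' and ',' unescaped so the query matches the documented form.
    query = urlencode({"_exclude_fields": ",".join(exclude_fields)}, safe="/,")
    return "/3/Frames?" + query

url = frames_url(["frames/frame_id/URL", "__meta"])
print(url)
```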
Check header: 0 means guess, +1 means 1st line is header not data, -1 means 1st line is data not header
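The three header-check values can be illustrated with a small splitter over already-tokenized rows. The guessing heuristic for 0 (treat the first line as a header when none of its fields parse as numbers) is an assumption for this sketch and not the parser's actual rule; the function names are hypothetical.

```python
def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def split_header(rows, check_header):
    """Return (header, data) according to check_header: -1, 0, or +1."""
    if check_header not in (-1, 0, 1):
        raise ValueError("check_header must be -1, 0, or 1")
    if not rows:
        return None, []
    if check_header == 0:
        # Guess (assumed heuristic): a header has no numeric fields.
        has_header = not any(_is_number(field) for field in rows[0])
    else:
        has_header = check_header == 1
    return (rows[0], rows[1:]) if has_header else (None, rows)

rows = [["id", "price"], ["1", "9.5"], ["2", "7.0"]]
print(split_header(rows, 0))
print(split_header(rows, -1))
```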
In
number_columns int
Number of columns
In
column_names string[]
Column names
In
column_types string[]
Value types for columns
In
force_col_types boolean
If true, will force the column types to be either the ones in Parquet schema for Parquet files or the ones specified in column_types. This parameter is used for numerical columns only. Other column settings will happen without setting this parameter. Defaults to false.
In
domains string[][]
Domains for categorical columns
In
na_strings string[][]
NA strings for columns
In
chunk_size int
Size of individual parse tasks
In
delete_on_done boolean
Delete input key after parse
In
blocking boolean
Block until the parse completes (as opposed to returning early and requiring polling)
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Seed for random number generator; set to a value other than -1 for reproducibility.
In/Out
max_models int
Maximum number of models to build (optional).
In/Out
max_runtime_secs double
Maximum time to spend building models (optional).
In/Out
stopping_rounds int
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression)
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
strategy enum
Hyperparameter space search strategy.
In/Out
RapidsExpressionV3
name string
(Class) name of the language construct
In
pattern string
Code fragment pattern.
In
description string
Description of the functionality provided by this language construct.
In
RapidsFrameV3
ast string
A Rapids AstRoot expression
In
session_id string
Session key
In
id string
[DEPRECATED] Key name to assign Frame results
In
_exclude_fields string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”
Seed for pseudo random number generator (if applicable).
In
algorithm enum
The algorithm to use to generate rules.
In
min_rule_length int
Minimum length of rules. Defaults to 3.
In
max_rule_length int
Maximum length of rules. Defaults to 3.
In
max_num_rules int
The maximum number of rules to return. Defaults to -1, which means the number of rules is selected by diminishing returns in model deviance.
In
model_type enum
Specifies type of base learners in the ensemble.
In
rule_generation_ntrees int
Specifies the number of trees to build in the tree model. Defaults to 50.
In
remove_duplicates boolean
Whether to remove rules which are identical to an earlier rule. Defaults to true.
In
lambda double[]
Lambda for LASSO regressor.
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Method for computing SVD (Caution: Randomized is currently experimental and unstable)
In
nv int
Number of right singular vectors
In
max_iterations int
Maximum iterations
In
seed long
RNG seed for k-means++ initialization
In
keep_u boolean
Save left singular vectors?
In
u_name string
Frame key to save left singular vectors
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
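For intuition, quantile regression is commonly formulated with the pinball (quantile) loss, which penalizes under- and over-prediction asymmetrically according to the desired quantile. The sketch below shows that standard loss; it is an illustration under that assumption, not H2O's internal code.

```python
def pinball_loss(y_true, y_pred, quantile_alpha):
    """Pinball loss for a single observation at the given quantile."""
    if not 0.0 <= quantile_alpha <= 1.0:
        raise ValueError("quantile_alpha must be between 0 and 1")
    residual = y_true - y_pred
    if residual >= 0:
        # Under-prediction is weighted by quantile_alpha.
        return quantile_alpha * residual
    # Over-prediction is weighted by (1 - quantile_alpha).
    return (quantile_alpha - 1.0) * residual


# With quantile_alpha = 0.9, under-predicting by 2 costs far more than over-predicting by 2.
under = pinball_loss(10.0, 8.0, 0.9)
over = pinball_loss(10.0, 12.0, 0.9)
```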
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
use_all_factor_levels boolean
Whether first factor level is included in each categorical expansion
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the prediction returned for that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression)
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
early_stopping boolean
Use early stopping
In/Out
strategy enum
Hyperparameter space search strategy.
In/Out
SessionIdV4
session_key string
Session ID
In
__schema string
URL describing the schema of the current object.
In
SessionPropertyV3
_exclude_fields string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”
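A request path with _exclude_fields can be assembled with the Python standard library as sketched below. The helper frames_url is hypothetical; only the /3/Frames path and the field paths come from the description above.

```python
from urllib.parse import urlencode


def frames_url(exclude_fields):
    """Build a /3/Frames request path that excludes the given JSON field paths."""
    # Keep '/' and ',' unescaped so the query matches the documented form.
    query = urlencode({"_exclude_fields": ",".join(exclude_fields)}, safe="/,")
    return "/3/Frames?" + query


url = frames_url(["frames/frame_id/URL", "__meta"])
```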
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point
In
nbins_top_level int
For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level
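Read together, nbins and nbins_top_level suggest that the histogram size halves at each tree level, starting from nbins_top_level and never dropping below nbins. The sketch below encodes that reading; the function bins_at_depth and the default values shown are illustrative assumptions, not a statement of H2O's exact behavior.

```python
def bins_at_depth(depth, nbins=20, nbins_top_level=1024):
    """Histogram bins for numerical columns at a given tree depth (root is depth 0)."""
    # Halve the top-level bin count once per level, with nbins as the floor.
    return max(nbins, nbins_top_level >> depth)


bins_per_level = [bins_at_depth(d) for d in range(8)]
```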
In
nbins_cats int
For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
In
r2_stopping double
r2_stopping is no longer supported and will be ignored if set; please use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop making trees when the R^2 metric equaled or exceeded this value.
In
seed long
Seed for pseudo random number generator (if applicable)
In
build_tree_one_node boolean
Run on one node only; no network overhead, but fewer CPUs are used. Suitable for small datasets.
In
sample_rate_per_class double[]
A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree
In
col_sample_rate_per_tree double
Column sample rate per tree (from 0.0 to 1.0)
In
col_sample_rate_change_per_level double
Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)
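As a rough illustration, a relative change per level can be read as a multiplicative factor applied once per tree level to some base column sampling rate. The sketch below assumes exactly that; the function column_rate_at_depth, the base_rate argument, and the cap at 1.0 are assumptions for illustration.

```python
def column_rate_at_depth(base_rate, change_per_level, depth):
    """Effective column sampling rate at a tree depth (root is depth 0)."""
    if not 0.0 < change_per_level <= 2.0:
        raise ValueError("change_per_level must be > 0.0 and <= 2.0")
    # Multiply by the factor once per level; a sampling rate cannot exceed 1.0.
    return min(1.0, base_rate * change_per_level ** depth)


shrinking = [column_rate_at_depth(0.8, 0.5, d) for d in range(3)]
capped = column_rate_at_depth(0.8, 1.5, 1)
```

A factor below 1.0 samples fewer columns at deeper levels, a factor above 1.0 samples more, and 1.0 leaves the rate unchanged.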
In
score_tree_interval int
Score the model after every so many trees. Disabled if set to 0.
In
min_split_improvement double
Minimum relative improvement in squared error reduction for a split to happen
In
histogram_type enum
What type of histogram to use for finding optimal split points
In
calibrate_model boolean
Use Platt Scaling (default) or Isotonic Regression to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.
In
in_training_checkpoints_dir string
Create checkpoints in the defined directory while the training process is still running. In case of cluster shutdown, the latest checkpoint can be used to restart training.
In
in_training_checkpoints_tree_interval int
Checkpoint the model after every so many trees. Parameter is used only when in_training_checkpoints_dir is defined
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Check if the response column is constant. If enabled, an exception is thrown if the response column is a constant value. If disabled, the model will train regardless of whether the response column is constant.
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the prediction returned for that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Keep level one frame used for metalearner training.
In
seed long
Seed for random numbers; passed through to the metalearner algorithm. Defaults to -1 (time-based random number)
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
List of models or grids (or their ids) to ensemble/stack together. Grids are expanded to individual models. If not using blending frame, then models must have been cross-validated using nfolds > 1, and folds must be identical across models.
In/Out
metalearner_algorithm enum
Type of algorithm to use as the metalearner. Options include ‘AUTO’ (GLM with non-negative weights; if validation_frame is present, a lambda search is performed), ‘deeplearning’ (Deep Learning with default parameters), ‘drf’ (Random Forest with default parameters), ‘gbm’ (GBM with default parameters), ‘glm’ (GLM with default parameters), ‘naivebayes’ (NaiveBayes with default parameters), or ‘xgboost’ (if available, XGBoost with default parameters).
In/Out
metalearner_nfolds int
Number of folds for K-fold cross-validation of the metalearner algorithm (0 to disable or >= 2).
In/Out
metalearner_fold_assignment enum
Cross-validation fold assignment scheme for metalearner cross-validation. Defaults to AUTO (which is currently set to Random). The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the prediction returned for that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
The list of steps to be executed (Mutually exclusive with alias).
In/Out
StepV99
id string
The id of the step (must be unique per step provider).
In/Out
group int
The group of execution of the given step (groups are executed in ascending order of priority). Steps with group=0 are skipped. Defaults to -1 to use the default group assigned to the step id.
In/Out
weight int
The relative weight for the given step (can impact time and/or number of models allocated for this step). Steps with weight=0 are skipped. Defaults to -1 to use the default weight assigned to the step id.
Mapping between input column(s) and their corresponding target encoded output column(s). Please note that there can be multiple columns on the input/from side if columns grouping was used, and there can also be multiple columns on the output/to side if the target was multiclass.
List of categorical columns or groups of categorical columns to encode. When groups of columns are specified, each group is encoded as a single column (interactions are created internally).
In
keep_original_categorical_columns boolean
If true, the original non-encoded categorical features will remain in the result frame.
In
blending boolean
If true, enables blending of posterior probabilities (computed for a given categorical value) with prior probabilities (computed on the entire set). This helps mitigate the effect of categorical values with small cardinality. The blending effect can be tuned using the inflection_point and smoothing parameters.
In
inflection_point double
Inflection point of the sigmoid used to blend probabilities (see blending parameter). For a given categorical value, if it appears fewer than inflection_point times in a data sample, then the influence of the posterior probability will be smaller than that of the prior.
In
smoothing double
Smoothing factor corresponds to the inverse of the slope at the inflection point on the sigmoid used to blend probabilities (see blending parameter). If smoothing tends towards 0, then the sigmoid used for blending turns into a Heaviside step function.
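The blending described by the blending, inflection_point and smoothing parameters can be sketched with a logistic weight on the posterior. The code below is a minimal illustration consistent with those descriptions; the function blended_encoding and the exact formula are assumptions, not a copy of H2O's implementation.

```python
import math


def blended_encoding(posterior, prior, n, inflection_point, smoothing):
    """Blend a per-level posterior with the global prior.

    n is the number of times the categorical value appears in the data sample.
    """
    # Sigmoid weight: 0.5 at n == inflection_point; a smaller smoothing value
    # makes the transition steeper (approaching a step function).
    lam = 1.0 / (1.0 + math.exp((inflection_point - n) / smoothing))
    return lam * posterior + (1.0 - lam) * prior


at_inflection = blended_encoding(0.9, 0.3, n=10, inflection_point=10, smoothing=20)
rare_level = blended_encoding(0.9, 0.3, n=1, inflection_point=10, smoothing=20)
common_level = blended_encoding(0.9, 0.3, n=1000, inflection_point=10, smoothing=20)
```

A rare level is pulled toward the prior, while a frequent level keeps an encoding close to its own posterior.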
In
data_leakage_handling enum
Data leakage handling strategy used to generate the encoding. Supported options are:
1) “none” (default) - no holdout, using the entire training frame.
2) “leave_one_out” - current row’s response value is subtracted from the per-level frequencies pre-calculated on the entire training frame.
3) “k_fold” - encodings for a fold are generated based on out-of-fold data.
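The “leave_one_out” strategy can be illustrated on a toy column: each row's own response is subtracted from its level's totals before the mean is taken. This is a minimal sketch assuming a mean-of-response encoding; the function leave_one_out_encode and the fallback to the global mean for single-row levels are illustrative assumptions.

```python
from collections import defaultdict


def leave_one_out_encode(levels, responses):
    """Encode each row with its level's mean response, excluding the row itself."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Per-level totals pre-calculated on the entire training data.
    for level, y in zip(levels, responses):
        sums[level] += y
        counts[level] += 1
    prior = sum(responses) / len(responses)

    encoded = []
    for level, y in zip(levels, responses):
        remaining = counts[level] - 1
        if remaining == 0:
            # Assumption: a level seen only once falls back to the global mean.
            encoded.append(prior)
        else:
            # Subtract the current row's response from the level totals.
            encoded.append((sums[level] - y) / remaining)
    return encoded


encoded = leave_one_out_encode(["a", "a", "a", "b", "b"], [1, 0, 1, 1, 0])
```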
In
noise double
The amount of noise to add to the encoded column. Use 0 to disable noise, and -1 (=AUTO) to let the algorithm determine a reasonable amount of noise.
In
seed long
Seed used to generate the noise. By default, the seed is chosen randomly.
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the prediction returned for that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Force encoding mode for training data: when using a leakage handling strategy different from None, training data should be transformed with this flag set to true (Defaults to false).
In
blending boolean
Enables or disables blending. Defaults to the value assigned at model creation.
In
inflection_point double
Inflection point. Defaults to the value assigned at model creation.
In
smoothing double
Smoothing. Defaults to the value assigned at model creation.
In
noise double
Noise. Defaults to the value assigned at model creation.
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”
In
matches string[]
matches
Out
UnlockKeysV3
_exclude_fields string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”
In
UpliftDRFModelOutputV3
default_auuc_thresholds double[]
Default thresholds used to calculate the AUUC metric. If validation is enabled, thresholds from the validation metrics are saved here; otherwise, thresholds from the training metrics are used.
Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors).
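The default described above can be computed as sketched below. Rounding down and enforcing a minimum of one column are assumptions made for illustration; the function default_mtries is hypothetical.

```python
import math


def default_mtries(num_predictors, classification):
    """Default number of candidate columns per split when the parameter is -1."""
    if classification:
        value = math.isqrt(num_predictors)   # floor of sqrt(p)
    else:
        value = num_predictors // 3          # floor of p/3
    # Assumption: always consider at least one column.
    return max(1, value)


classification_default = default_mtries(16, classification=True)
regression_default = default_mtries(16, classification=False)
```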
In
sample_rate double
Row sample rate per tree (from 0.0 to 1.0)
In
treatment_column string
Define the column which will be used for computing uplift gain to select best split for a tree. The column has to divide the dataset into treatment (value 1) and control (value 0) groups.
In
uplift_metric enum
Divergence metric used to find best split when building an uplift tree.
In
auuc_type enum
Metric used to calculate Area Under Uplift Curve.
In
auuc_nbins int
Number of bins to calculate Area Under Uplift Curve.
In
ntrees int
Number of trees.
In
max_depth int
Maximum tree depth (0 for unlimited).
In
min_rows double
Fewest allowed (weighted) observations in a leaf.
In
nbins int
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point
In
nbins_top_level int
For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level
In
nbins_cats int
For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
In
r2_stopping double
r2_stopping is no longer supported and will be ignored if set; please use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop making trees when the R^2 metric equaled or exceeded this value.
In
seed long
Seed for pseudo random number generator (if applicable)
In
build_tree_one_node boolean
Run on one node only; no network overhead, but fewer CPUs are used. Suitable for small datasets.
In
sample_rate_per_class double[]
A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree
In
col_sample_rate_per_tree double
Column sample rate per tree (from 0.0 to 1.0)
In
col_sample_rate_change_per_level double
Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)
In
score_tree_interval int
Score the model after every so many trees. Disabled if set to 0.
In
min_split_improvement double
Minimum relative improvement in squared error reduction for a split to happen
In
histogram_type enum
What type of histogram to use for finding optimal split points
In
calibrate_model boolean
Use Platt Scaling (default) or Isotonic Regression to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.
In
in_training_checkpoints_dir string
Create checkpoints in the defined directory while the training process is still running. In case of cluster shutdown, the latest checkpoint can be used to restart training.
In
in_training_checkpoints_tree_interval int
Checkpoint the model after every so many trees. Parameter is used only when in_training_checkpoints_dir is defined
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
In
balance_classes boolean
Balance training data class counts via over/under-sampling (for imbalanced data).
In/Out
class_sampling_factors float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
In/Out
max_after_balance_size float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
In/Out
max_confusion_matrix_size int
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Check if the response column is constant. If enabled, an exception is thrown if the response column is a constant value. If disabled, the model will train regardless of whether the response column is constant.
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the prediction returned for that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
Set threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled; useful range is (0, 1e-5).
In
norm_model enum
Use Hierarchical Softmax
In
epochs int
Number of training iterations to run
In
min_word_freq int
Discard words that appear fewer than min_word_freq times.
Id of a data frame that contains a pre-trained (external) word2vec model
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the prediction returned for that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0.
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.
A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.
In
max_abs_leafnode_pred float
(same as max_delta_step) Maximum absolute value of a leaf node prediction
In
max_delta_step float
(same as max_abs_leafnode_pred) Maximum absolute value of a leaf node prediction
In
score_tree_interval int
Score the model after every so many trees. Disabled if set to 0.
In
seed long
Seed for pseudo random number generator (if applicable)
In
min_split_improvement float
(same as gamma) Minimum relative improvement in squared error reduction for a split to happen
In
gamma float
(same as min_split_improvement) Minimum relative improvement in squared error reduction for a split to happen
In
nthread int
Number of parallel threads that can be used to run XGBoost. Cannot exceed H2O cluster limits (-nthreads parameter). Defaults to the maximum available.
In
build_tree_one_node boolean
Run on one node only. This avoids network overhead but uses fewer CPUs. Suitable for small datasets.
In
save_matrix_directory string
Directory where to save matrices passed to XGBoost library. Useful for debugging.
In
calibrate_model boolean
Use Platt Scaling (default) or Isotonic Regression to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.
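To show the idea behind Platt scaling, the sketch below fits a sigmoid that maps raw model scores to probabilities. It is a minimal illustration, not H2O's calibration code: it uses plain gradient descent and omits the label smoothing of Platt's original method, and the function names and data are hypothetical.

```python
import math


def fit_platt(scores, labels, learning_rate=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by minimizing log loss with gradient descent."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for score, label in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * score + b)))
            grad_a += (p - label) * score
            grad_b += p - label
        a -= learning_rate * grad_a / n
        b -= learning_rate * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Hypothetical held-out scores and their true binary labels.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
print(calibrate(1.5, a, b))
```

Calibration is normally fitted on data that was not used to train the model, so that the sigmoid corrects the model's scores rather than memorizing the training labels.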
In
max_bins int
For tree_method=hist only: maximum number of bins
In
max_leaves int
For tree_method=hist only: maximum number of leaves
In
tree_method enum
Tree method
In
grow_policy enum
Grow policy. depthwise grows trees as in standard GBM; lossguide grows trees as in LightGBM.
In
booster enum
Booster type
In
reg_lambda float
L2 regularization
In
reg_alpha float
L1 regularization
In
quiet_mode boolean
Enable quiet mode
In
sample_type enum
For booster=dart only: sample_type
In
normalize_type enum
For booster=dart only: normalize_type
In
rate_drop float
For booster=dart only: rate_drop (0..1)
In
one_drop boolean
For booster=dart only: one_drop
In
skip_drop float
For booster=dart only: skip_drop (0..1)
In
dmatrix_type enum
Type of DMatrix. For sparse, NAs and 0 are treated equally.
In
backend enum
Backend. By default (auto), a GPU is used if available.
In
gpu_id int[]
Which GPU(s) to use.
In
interaction_constraints string[][]
A set of allowed column interactions.
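Since the type is string[][], the value is a list of column-name groups. The sketch below shows one possible reading, in which two columns may interact only when they appear together in a group. The helper may_interact and the column names are hypothetical, and the treatment of columns that appear in no group is an assumption.

```python
def may_interact(column_a, column_b, interaction_constraints):
    """True if both columns appear together in at least one allowed group."""
    return any(
        column_a in group and column_b in group
        for group in interaction_constraints
    )


# Hypothetical groups: age may interact with income, height with weight.
interaction_constraints = [["age", "income"], ["height", "weight"]]
print(may_interact("age", "income", interaction_constraints))
print(may_interact("age", "weight", interaction_constraints))
```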
In
scale_pos_weight float
Controls the effect of observations with positive labels in relation to the observations with negative labels on gradient calculation. Useful for imbalanced problems.
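A common starting point for this value, suggested in the XGBoost documentation, is the ratio of negative to positive observations. The sketch below computes that ratio; the helper name and the 0/1 label encoding are assumptions.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) labels."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == 0)
    if positives == 0:
        raise ValueError("no positive labels; the ratio is undefined")
    return negatives / positives


# Hypothetical imbalanced response: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
print(suggested_scale_pos_weight(labels))
```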
In
eval_metric string
Specification of evaluation metric that will be passed to the native XGBoost backend.
In
score_eval_metric_only boolean
If enabled, score only the evaluation metric. This can make model training faster if scoring is frequent (e.g., at each iteration).
In
distribution enum
Distribution function
In
tweedie_power double
Tweedie power for Tweedie regression, must be between 1 and 2.
In
quantile_alpha double
Desired quantile for Quantile regression, must be between 0 and 1.
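The effect of quantile_alpha can be seen in the standard quantile (pinball) loss, sketched below. The function name is hypothetical; the formula is the usual textbook definition, not code taken from H2O.

```python
def pinball_loss(actual, predicted, quantile_alpha):
    """Quantile loss: under-prediction costs alpha, over-prediction costs 1 - alpha."""
    residual = actual - predicted
    return max(quantile_alpha * residual, (quantile_alpha - 1.0) * residual)


# With quantile_alpha = 0.9, predicting too low is penalized more than too high.
print(pinball_loss(actual=10.0, predicted=8.0, quantile_alpha=0.9))
print(pinball_loss(actual=8.0, predicted=10.0, quantile_alpha=0.9))
```

Because the two directions are penalized asymmetrically, minimizing this loss pushes predictions toward the requested quantile rather than toward the mean.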
In
huber_alpha double
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
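One way to read this parameter is that the threshold between the quadratic and linear regions is taken as the huber_alpha quantile of the absolute residuals. The sketch below follows that reading. The function names, the simple index-based quantile and the residual values are assumptions for illustration, not H2O's implementation.

```python
def huber_delta(residuals, huber_alpha):
    """Threshold taken as the huber_alpha quantile of the absolute residuals."""
    ordered = sorted(abs(r) for r in residuals)
    index = min(int(huber_alpha * len(ordered)), len(ordered) - 1)
    return ordered[index]


def huber_loss(residual, delta):
    """Quadratic for small residuals, linear beyond the threshold delta."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual ** 2
    return delta * (magnitude - 0.5 * delta)


# Hypothetical residuals with one large outlier.
residuals = [0.1, -0.2, 0.3, -5.0]
delta = huber_delta(residuals, huber_alpha=0.5)
print([huber_loss(r, delta) for r in residuals])
```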
In
max_categorical_levels int
For every categorical feature, only use this many most frequent categorical levels for model training. Only used for categorical_encoding == EnumLimited.
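The sketch below illustrates the idea of keeping only the most frequent levels of a categorical column. The helper limit_levels and the "other" placeholder are hypothetical; how H2O treats the remaining levels is not stated here.

```python
from collections import Counter


def limit_levels(values, max_categorical_levels, other_label="other"):
    """Keep the most frequent levels and replace the rest with a placeholder."""
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(max_categorical_levels)}
    return [value if value in keep else other_label for value in values]


# Hypothetical column with four levels, limited to the two most frequent.
column = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
print(limit_levels(column, max_categorical_levels=2))
```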
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more because of the larger loss function pre-factor. If a row has weight = 0, the prediction returned for that row is zero, which is incorrect. To get accurate predictions, remove all rows with weight == 0.
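The equivalence between weights and repeated rows can be checked on a weighted loss. The sketch below uses a weighted mean squared error; the function name and the data are hypothetical.

```python
def weighted_mse(actual, predicted, weights):
    """Mean squared error in which each row contributes in proportion to its weight."""
    total_weight = sum(weights)
    weighted_error = sum(
        w * (a - p) ** 2 for a, p, w in zip(actual, predicted, weights)
    )
    return weighted_error / total_weight


# Weight 2 on the first row and weight 0 on the last row ...
with_weights = weighted_mse([1.0, 2.0, 3.0], [1.5, 2.0, 10.0], [2, 1, 0])
# ... matches repeating the first row twice and dropping the last row.
with_repeats = weighted_mse([1.0, 1.0, 2.0], [1.5, 1.5, 2.0], [1, 1, 1])
print(with_weights, with_repeats)
```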
Column with cross-validation fold index assignment per observation.
In/Out
fold_assignment enum
Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
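Two assignment schemes can be sketched as follows. This is a deterministic illustration only: the function names and the nfolds argument are assumptions, and H2O's own Stratified scheme is not necessarily a round-robin like the one shown here.

```python
from collections import defaultdict


def modulo_folds(n_rows, nfolds):
    """Assign row i to fold i % nfolds."""
    return [row % nfolds for row in range(n_rows)]


def stratified_folds(response, nfolds):
    """Deal the rows of each response class into folds in round-robin order."""
    seen = defaultdict(int)  # rows encountered so far, per class
    folds = []
    for label in response:
        folds.append(seen[label] % nfolds)
        seen[label] += 1
    return folds


# Hypothetical binary response; each fold gets a similar share of each class.
response = ["y", "n", "n", "n", "y", "n", "y", "n"]
print(modulo_folds(len(response), nfolds=2))
print(stratified_folds(response, nfolds=2))
```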
In/Out
categorical_encoding enum
Encoding scheme for categorical features
In/Out
ignored_columns string[]
Names of columns to ignore for training.
In/Out
ignore_const_cols boolean
Ignore constant columns.
In/Out
score_each_iteration boolean
Whether to score during each iteration of model training.
Early stopping based on convergence of stopping_metric. Training stops if the simple moving average of length k of the stopping_metric does not improve for k scoring events, where k = stopping_rounds. Use 0 to disable.
In/Out
max_runtime_secs double
Maximum allowed runtime in seconds for model training. Use 0 to disable.
In/Out
stopping_metric enum
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
In/Out
stopping_tolerance double
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
In/Out
gainslift_bins int
Number of bins in the Gains/Lift table. 0 disables the table. The default value, -1, selects automatic binning.
In/Out
custom_metric_func string
Reference to custom evaluation function, format: language:keyName=funcName
In/Out
custom_distribution_func string
Reference to custom distribution, format: language:keyName=funcName
In/Out
export_checkpoints_dir string
Automatically export generated models to this directory.