R/targetencoder.R
h2o.targetencoder.Rd
Transformation of a categorical variable with a mean value of the target variable
h2o.targetencoder(
  x,
  y,
  training_frame,
  model_id = NULL,
  fold_column = NULL,
  columns_to_encode = NULL,
  keep_original_categorical_columns = TRUE,
  blending = FALSE,
  inflection_point = 10,
  smoothing = 20,
  data_leakage_handling = c("leave_one_out", "k_fold", "none", "LeaveOneOut", "KFold", "None"),
  noise = 0.01,
  seed = -1,
  ...
)
x  (Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used. 

y  The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, then a regression model will be trained, otherwise it will train a classification model. 
training_frame  Id of the training data frame. 
model_id  Destination id for this model; autogenerated if not specified. 
fold_column  Column with cross-validation fold index assignment per observation. 
columns_to_encode  List of categorical columns or groups of categorical columns to encode. When groups of columns are specified, each group is encoded as a single column (interactions are created internally). 
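keep_original_categorical_columns  Logical. If TRUE, the original non-encoded categorical columns are kept in the result frame. Defaults to TRUE. 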

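blending  Logical. If TRUE, the posterior probabilities (computed for a given categorical value) are blended with the prior probabilities (computed on the entire set). The blending can be tuned with the `inflection_point` and `smoothing` parameters. Defaults to FALSE. 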

inflection_point  Inflection point of the sigmoid used to blend probabilities (see `blending` parameter). If a given categorical value appears fewer than `inflection_point` times in a data sample, then the influence of the posterior probability will be smaller than that of the prior. Defaults to 10. 
smoothing  Smoothing factor corresponds to the inverse of the slope at the inflection point on the sigmoid used to blend probabilities (see `blending` parameter). If smoothing tends towards 0, then the sigmoid used for blending turns into a Heaviside step function. Defaults to 20. 
data_leakage_handling  Data leakage handling strategy used to generate the encoding. Supported options are: 1) "none" (default): no holdout, using the entire training frame. 2) "leave_one_out": the current row's response value is subtracted from the per-level frequencies pre-calculated on the entire training frame. 3) "k_fold": encodings for a fold are generated based on out-of-fold data. Must be one of: "leave_one_out", "k_fold", "none", "LeaveOneOut", "KFold", "None". Defaults to None. 
noise  The amount of noise to add to the encoded column. Use 0 to disable noise, and -1 (=AUTO) to let the algorithm determine a reasonable amount of noise. Defaults to 0.01. 
seed  Seed for random numbers (affects certain parts of the algo that are stochastic and those might or might not be enabled by default). Defaults to -1 (time-based random number). 
...  Mainly used for backwards compatibility, to allow deprecated parameters. 
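The interaction between `blending`, `inflection_point` and `smoothing` can be illustrated with a small base R sketch. It assumes the blending weight is a logistic sigmoid of the level count, which is consistent with the descriptions above but is not taken from the H2O source; `blend_encoding` and its argument names are hypothetical and not part of the h2o package.

```r
# Hypothetical helper, NOT part of the h2o API: blend a per-level posterior
# mean with the global prior mean using a sigmoid weight (assumed form).
blend_encoding <- function(n, posterior, prior,
                           inflection_point = 10, smoothing = 20) {
  # Weight given to the posterior; it is exactly 0.5 when n == inflection_point
  # and approaches a step function as smoothing tends towards 0.
  lambda <- 1 / (1 + exp((inflection_point - n) / smoothing))
  lambda * posterior + (1 - lambda) * prior
}

prior     <- 0.4   # mean of the target over the whole sample
posterior <- 0.9   # mean of the target within one categorical level

rare     <- blend_encoding(n = 2,   posterior, prior)  # pulled toward the prior
midpoint <- blend_encoding(n = 10,  posterior, prior)  # halfway between the two
frequent <- blend_encoding(n = 200, posterior, prior)  # close to the posterior
```

With these assumptions, a level seen fewer than `inflection_point` times gets a weight below 0.5, so the prior dominates its encoding.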
if (FALSE) {
library(h2o)
h2o.init()

# Import the titanic dataset
f <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
titanic <- h2o.importFile(f)

# Set response as a factor
response <- "survived"
titanic[response] <- as.factor(titanic[response])

# Split the dataset into train and test
splits <- h2o.splitFrame(data = titanic, ratios = .8, seed = 1234)
train <- splits[[1]]
test <- splits[[2]]

# Choose which columns to encode
encode_columns <- c("home.dest", "cabin", "embarked")

# Train a TE model
te_model <- h2o.targetencoder(x = encode_columns,
                              y = response,
                              training_frame = train,
                              fold_column = "pclass",
                              data_leakage_handling = "KFold")

# New target encoded train and test sets
train_te <- h2o.transform(te_model, train)
test_te <- h2o.transform(te_model, test)
}
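To make the three `data_leakage_handling` strategies concrete, the following base R sketch computes plain mean encodings on a toy data frame without blending or noise. It does not call h2o and is only an illustration of the descriptions above; the data frame `df` and the `enc_*` variables are made up for this example.

```r
# Toy data (hypothetical): one categorical column, a binary target,
# and a fold assignment playing the role of `fold_column`.
df <- data.frame(
  level  = c("a", "a", "a", "b", "b", "b"),
  target = c(1, 0, 1, 0, 0, 1),
  fold   = c(1, 2, 1, 2, 1, 2)
)

# "none": per-level mean computed on the entire frame.
enc_none <- ave(df$target, df$level, FUN = mean)

# "leave_one_out": subtract the current row's target from its level's
# sum and count (undefined for levels that occur only once).
lvl_sum <- ave(df$target, df$level, FUN = sum)
lvl_n   <- ave(df$target, df$level, FUN = length)
enc_loo <- (lvl_sum - df$target) / (lvl_n - 1)

# "k_fold": for each row, per-level mean over rows from the other folds.
enc_kfold <- vapply(seq_len(nrow(df)), function(i) {
  out_of_fold <- df$level == df$level[i] & df$fold != df$fold[i]
  mean(df$target[out_of_fold])
}, numeric(1))

cbind(df, enc_none, enc_loo, enc_kfold)
```

In this sketch only `enc_none` lets a row's own response contribute to its encoding; the other two strategies exclude it, either row by row or fold by fold.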