Target encoding is the process of replacing a categorical value with the mean of the target variable for that level. Any non-categorical columns are automatically dropped by the target encoder model.
Note: You can also use target encoding to convert categorical columns to numeric. This can help improve machine learning accuracy, since algorithms tend to have a hard time dealing with high-cardinality columns. The Jupyter notebook, categorical predictors with tree based model, discusses two methods for dealing with high-cardinality columns:
Comparing model performance after removing high-cardinality columns
Parameter tuning (specifically tuning
Target Encoding Parameters
training_frame: (Required) Specify the dataset that you want to use when you are ready to build a Target Encoding model.
y: (Required) Specify the target column that you are attempting to predict. The data can be numeric or categorical.
x: Specify a vector containing the names or indices of the categorical columns that will be target encoded. If x is missing, then all categorical columns except the response column are used.
data_leakage_handling: Specify one of the following data leakage handling strategies:
none: Do not holdout anything. Using whole frame for training. This is the default value.
k_fold: Encodings for a fold are generated based on out-of-fold data.
leave_one_out: The current row’s response value is subtracted from the pre-calculated per-level frequencies.
fold_column: Specify the name or column index of the fold column in the data. Applicable only if data_leakage_handling="k_fold". This defaults to None/NULL (no fold_column).
keep_original_categorical_columns: Specify if the original categorical columns that are being encoded should be included in the result frame. This value defaults to True/TRUE.
blending: Specify whether the target average should be weighted based on the count of the group. It is often the case that some groups may have a small number of records and the target average will be unreliable. To prevent this, the blended average takes a weighted average of the group’s target value and the global target value. This value defaults to False/FALSE.
inflection_point: Specify the inflection point value. This value is used for blending when blending=True and to calculate lambda. It determines half of the minimal sample size for which we completely trust the estimate based on the sample in the particular level of the categorical variable. This value defaults to 10.
smoothing: The smoothing value is used for blending when blending=True and to calculate lambda (Micci-Barreca, 2001). Smoothing controls the rate of transition between the particular level's posterior probability and the prior probability. For smoothing values approaching 0, the transition becomes a hard threshold between the posterior and the prior probability. This value defaults to 20.
noise: Specify the amount of random noise that should be added to the target average in order to prevent overfitting. Set to 0 to disable noise. This value defaults to 0.01 times the range of \(y\).
seed: The seed for the pseudorandom number generator, mainly used to generate draws from the uniform distribution for random noise. Defaults to -1 (random seed).
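The interaction of inflection_point and smoothing can be sketched with the standard Micci-Barreca blending formula, lambda = 1 / (1 + exp(-(n - inflection_point) / smoothing)), where n is the level's row count. This is an illustrative assumption about the formula's shape, not a transcription of H2O's internals:

```python
# Sketch of the blended estimate: small levels are pulled toward the
# global target mean, large levels keep (close to) their own mean.
import math


def blended_mean(level_mean, level_count, global_mean,
                 inflection_point=10, smoothing=20):
    # lambda rises from ~0 to ~1 as level_count grows past inflection_point;
    # smoothing controls how sharp that transition is.
    lam = 1.0 / (1.0 + math.exp(-(level_count - inflection_point) / smoothing))
    return lam * level_mean + (1.0 - lam) * global_mean


# A level seen only twice is pulled toward the global mean 0.4,
# while a level with 500 rows keeps essentially its own mean 1.0.
rare = blended_mean(1.0, 2, 0.4)
common = blended_mean(1.0, 500, 0.4)
```

At level_count == inflection_point, lambda is exactly 0.5, i.e. the level's mean and the global mean are weighted equally, which is what "half of the minimal sample size for which we completely trust the estimate" refers to.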
Target Encoding Model Methods
transform: Apply the transformation to target-encoded columns based on the encoding maps generated during training. Available parameters include:
frame: The H2O frame to which you are applying target encoding transformations.
blending: User can override the blending value defined on the model.
inflection_point: User can override the inflection_point value defined on the model.
smoothing: User can override the smoothing value defined on the model.
noise: User can override the noise value defined on the model.
as_training: User should set this parameter to True/TRUE when transforming the training dataset, and leave it as False/FALSE when applying the transformation to non-training data. Defaults to False/FALSE.
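The leave_one_out strategy described above can be sketched as follows, assuming per-level target sums and row counts have been precomputed. The row's own response is removed before taking the mean, so a training row never leaks its own target into its encoding:

```python
# Sketch of leave-one-out target encoding for a single training row.
# sums[level] and counts[level] are the precomputed target sum and row
# count for that categorical level over the whole training frame.
def loo_encode(level, y, sums, counts):
    return (sums[level] - y) / (counts[level] - 1)


# Level "B" has targets (1, 1, 0): sum 2, count 3.
sums, counts = {"B": 2}, {"B": 3}
row_encoding = loo_encode("B", 1, sums, counts)  # (2 - 1) / (3 - 1) = 0.5
```

This is the per-row adjustment that as_training=True enables; on non-training data no response is subtracted and the full per-level statistics are used.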
In this example, we will be trying to predict survived using the popular titanic dataset: https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv. One of the predictors is cabin, a categorical column with a large number of unique values. To perform target encoding on cabin, we calculate the mean of survived per cabin. So instead of using cabin as a predictor in our model, we could use the target encoding of cabin.
Daniele Micci-Barreca. 2001. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. SIGKDD Explor. Newsl. 3, 1 (July 2001), 27-32.