The H2O Aggregator method is a clustering-based method for reducing a numerical/categorical dataset into a dataset with fewer rows. If the dataset has categorical columns, then for each categorical column, Aggregator will:
Accumulate the category frequencies.
For the top 1,000 or fewer categories (by frequency), generate dummy variables (known as one-hot encoding in machine learning and as dummy coding in statistics).
Calculate the first eigenvector of the covariance matrix of these dummy variables.
Replace each row's value in the categorical column with the eigenvector entry corresponding to that row's dummy variable.
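The per-column procedure above can be sketched with NumPy. This is a hand-rolled illustration, not H2O's implementation; the function name `eigen_encode` is made up, and the 1,000-category cap follows the description above:

```python
import numpy as np

def eigen_encode(column, max_categories=1000):
    """Collapse one categorical column into a single numeric column by
    projecting its one-hot (dummy) encoding onto the first eigenvector
    of the dummy variables' covariance matrix."""
    column = np.asarray(column)
    # 1. Accumulate the category frequencies.
    cats, counts = np.unique(column, return_counts=True)
    # 2. Keep at most the top `max_categories` categories by frequency,
    #    then one-hot encode membership in that set.
    top = cats[np.argsort(counts)[::-1][:max_categories]]
    dummies = (column[:, None] == top[None, :]).astype(float)
    # 3. First eigenvector of the covariance matrix of the dummies
    #    (eigh is appropriate because the covariance matrix is symmetric).
    eigvals, eigvecs = np.linalg.eigh(np.cov(dummies, rowvar=False))
    first = eigvecs[:, np.argmax(eigvals)]
    # 4. Each row has exactly one dummy set to 1, so this projection picks
    #    out the eigenvector entry for that row's category.
    return dummies @ first

encoded = eigen_encode(["a", "b", "a", "c", "b", "a"])
```

Rows sharing a category map to the same numeric value, so distances between rows become ordinary numeric distances.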
Aggregator maintains outliers as outliers, but lumps dense clusters together into exemplars, with an attached count column showing the number of member points.
The Aggregator method behaves just like any other unsupervised model. You can ignore columns, which will then be dropped for the distance computation. Training itself creates the aggregated H2O Frame, which also includes the count of members for every row/exemplar. The aggregated frame always includes the full original content of the training frame, even if some columns were ignored for the distance computation. Scoring/prediction is overloaded with a function that returns the members of a given exemplar, addressed by its row index from 0…Nexemplars (this time without a count).
Defining an Aggregator Model
Parameters are optional unless specified as required.
num_iteration_without_new_exemplar: The number of iterations to run before Aggregator exits if the number of exemplars collected doesn't change. This option defaults to 500.
rel_tol_num_exemplars: Specify the relative tolerance for the number of exemplars (e.g., 0.5 is +/- 50 percent). This option defaults to 0.5.
save_mapping_frame: When this option is enabled, the mapping of rows in the aggregated frame to rows in the original/raw frame will be created and exported. This option defaults to False.
target_num_exemplars: Specify a value for the targeted number of exemplars. This option defaults to 5000.
categorical_encoding: Specify one of the following encoding schemes for handling categorical features:
AUTO (default): Allow the algorithm to decide. In Aggregator, the algorithm will automatically perform Eigen encoding.
OneHotInternal: Create N+1 new columns on the fly for categorical features with N levels.
binary: No more than 32 columns per categorical feature.
Eigen: k columns per categorical feature, keeping only the projections of the one-hot-encoded matrix onto the k-dimensional eigen space.
LabelEncoder: Convert every enum into the integer of its index (for example, level 0 -> 0, level 1 -> 1, etc.).
EnumLimited: Automatically reduce categorical levels to the most prevalent ones during Aggregator training, keeping only the T (10) most frequent levels.
export_checkpoints_dir: Specify a directory to which generated models will be automatically exported.
ignore_const_cols: Enable this option to ignore constant training columns, since no information can be gained from them. This option defaults to True (enabled).
model_id: Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
training_frame: Required. Specify the dataset used to build the model.
NOTE: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.
transform: Specify the transformation method for numeric columns in the training data. One of: NONE, STANDARDIZE, NORMALIZE (default), DEMEAN, or DESCALE.
x: Specify a vector containing the character names of the predictors in the model.
The output of the aggregation is a new aggregated frame that can be accessed in R and Python.
Below is a simple example showing how to build an Aggregator model.