Generalized Low Rank Models (GLRM)¶
Generalized Low Rank Models (GLRM) is an algorithm for dimensionality reduction of a dataset. It is a general, parallelized optimization algorithm that applies to a variety of loss and regularization functions. Categorical columns are handled by expansion into 0/1 indicator columns for each level. With this approach, GLRM is useful for reconstructing missing values and identifying important features in heterogeneous data.
What is a Low-Rank Model?¶
Given large collections of data with numeric and categorical values, entries in the table may be noisy or even missing altogether. Low rank models facilitate the understanding of tabular data by producing a condensed vector representation for every row and column in the dataset. Specifically, given a data table A with m rows and n columns, a GLRM consists of a decomposition of A into numeric matrices X and Y. The matrix X has the same number of rows as A, but only a small, user-specified number of columns k. The matrix Y has k rows and d columns, where d is equal to the total dimension of the embedded features in A. For example, if A has 4 numeric columns and 1 categorical column with 3 distinct levels (e.g., red, blue and green), then Y will have 7 columns. When A contains only numeric features, the number of columns in A and Y are identical.
Both X and Y have practical interpretations. Each row of Y is an archetypal feature formed from the columns of A, and each row of X corresponds to a row of A projected into this reduced dimension feature space. We can approximately reconstruct A from the matrix product XY, which has rank k. The number k is chosen to be much less than both m and n: a typical value for 1 million rows and 2,000 columns of numeric data is k = 15. The smaller k is, the more compression gained from the low-rank representation.
Defining a GLRM Model¶
Parameters are optional unless specified as required.
expand_user_y: Specify whether to expand categorical columns in the user-specified initial Y value. This option defaults to
gamma_x: Specify the regularization weight on the X matrix. This option defaults to
gamma_y: Specify the regularization weight on the Y matrix. This option defaults to
k: Required Specify the rank of matrix approximation. This option defaults to
init: Specify the initialization mode. One of:
impute_original: Specify whether to reconstruct the original training data by reversing the data transform after projecting archetypes. This option defaults to
init_step_size: Specify the initial step size. This option defaults to
loss: Specify the numeric loss function. One of:
loss_by_col: Specify the loss function by column override. One of:
loss_by_col_idx: Specify the loss function by column index override.
max_updates: Specify the maximum number of updates. This option defaults to
min_step_size: Specify the minimum step size. This option defaults to
multi_loss: Specify either
"ordinal"for the categorical loss function.
loss="periodic", specify the length of the period. This option defaults to
recover_svd: Specify whether to recover singular values and eigenvectors of XY. This option defaults to
regularization_x: Specify the regularization function for the X matrix. One of:
regularization_y: Specify the regularization function for the Y matrix. One of:
representation_name: Specify the frame key to save the resulting X.
svd_method: Specify the method for computing SVD during initialization. One of:
"randomized"is currently experimental.
user_x: Specify the initial X value.
user_y: Specify the initial Y value.
export_checkpoints_dir: Specify a directory to which generated models will be automatically exported.
ignore_const_cols: Specify whether to ignore constant training columns, since no information can be gained from them. This option defaults to
ignored_columns: (Python and Flow only) Specify the column or columns to be exclude from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the All button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the None button. To search for a specific column, type the column name in the Search field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the Only show columns with more than 0% missing values field. To change the selections for the hidden columns, use the Select Visible or Deselect Visible buttons.
max_iterations: Specify the maximum number of training iterations. The range is 0 to 1e6, and the option defaults to
max_runtime_secs: Specify the maximum allowed runtime in seconds for model training. This option defaults to
model_id: Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
score_each_iteration: Specify whether to score during each iteration of the model training. This option defaults to
seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations. This value defaults to
-1(time-based random number).
training_frame: Required Specify the dataset used to build the model.
Note: In Flow, if you click the Build a model button from the
Parsecell, the training frame is entered automatically.
transform: Specify the transformation method for numeric columns in the training data. One of:
x: Specify a vector containing the names or indices of the predictor variables to use when building the model. If
xis missing, then all columns are used.
With an initial GLRM model, when you give a dataset (A) to GLRM, it builds as:
\(Y\) is the archetype
\(X\) contains the coefficients to the archetype
When a new dataset (B) is passed to the GLRM model, it performs as:
\(Y\) is the original archetype generated from the training dataset
\(Xnew\) contains the new coefficients to the archetype
When you call
score in Java or
predict in Python or R with a new dataset, \(Xnew \times Y\) is returned. However, if you call
transform in Java or
transform_frame in Python or R, only \(Xnew\) is returned. See the following example for how to use
Below is a simple example showing how to build a Generalized Low Rank model.
What types of data can be used with GLRM?
GLRM can handle mixed numeric, categorical, ordinal and Boolean data with an arbitrary number of missing values. It allows the user to apply regularization to X and Y, imposing restrictions like non-negativity appropriate to a particular data science context.
What are the benefits to using low rank models?
Memory: Saving only the X and Y matrices can significantly reduce the amount of memory required to store a large data set. A file that is 10 GB can be compressed down to 100 MB. When we need the original data again, we can reconstruct it on the fly from X and Y with minimal loss in accuracy.
Speed: GLRM can be used to compress data with high-dimensional, heterogeneous features into a few numeric columns. This leads to a huge speed-up in model building and prediction, especially by machine learning algorithms that scale poorly with the size of the feature space.
Feature Engineering: The Y matrix represents the most important combination of features from the training data. These condensed features (called archetypes) can be analyzed, visualized, and incorporated into various data science applications.
Missing Data Imputation: Reconstructing a data set from X and Y will automatically impute missing values. This imputation is accomplished by intelligently leveraging the information contained in the known values of each feature, as well as user-provided parameters such as the loss function.