Version: v0.14.0

Size dependency

Size Dependency is a validation test that lets you analyze how different sizes of training data affect the accuracy of a selected model. In particular, Size Dependency facilitates model stability analysis and can, for example, indicate whether augmenting the existing training data is likely to improve model accuracy.

H2O Model Validation selects an appropriate sampling technique for the Size Dependency validation test based on the selected model type. Available sampling techniques are as follows:

  • Random sampling
    • H2O Model Validation uses random sampling to create new sub-training samples for Independent and Identically Distributed (IID) models.
  • Expanding window sampling
    • H2O Model Validation uses expanding window sampling to create new sub-training samples for time series models, using the time column to ensure that sub-training samples grow from the most recent to the oldest data.

Before either sampling technique (random or expanding window) is applied, H2O Model Validation splits the original training data into folds. Building the sub-training samples from folds improves generalization and data balance.

For IID models, folds and sub-training samples are created at random. For time series models, folds and sub-training samples are created using a time column to ensure that sub-training samples grow from the most recent to the oldest data points.
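The fold-based sampling described above can be sketched in plain Python. This is a minimal illustration, not H2O Model Validation's implementation: the function names are hypothetical, and it assumes that the k-th sub-training sample is the union of the first k folds, which the text implies but does not state.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_folds(rows: Sequence[T], n_folds: int) -> List[List[T]]:
    """Split rows into n_folds contiguous chunks of near-equal size."""
    base, extra = divmod(len(rows), n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(list(rows[start:start + size]))
        start += size
    return folds


def cumulative_samples(folds: List[List[T]]) -> List[List[T]]:
    """Assumption: sample k is the union of folds 1..k, so samples only grow."""
    samples, current = [], []
    for fold in folds:
        current = current + fold  # new list, so earlier samples stay unchanged
        samples.append(current)
    return samples


def random_samples(rows: Sequence[T], n_folds: int, seed: int = 0) -> List[List[T]]:
    """IID case: shuffle the rows, then build growing samples from random folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return cumulative_samples(split_into_folds(shuffled, n_folds))


def expanding_window_samples(
    rows: Sequence[T], time_key: Callable[[T], float], n_folds: int
) -> List[List[T]]:
    """Time series case: order rows newest first, so each sample extends
    the previous one further into the past."""
    newest_first = sorted(rows, key=time_key, reverse=True)
    return cumulative_samples(split_into_folds(newest_first, n_folds))
```

With N = 4 folds, both functions return four samples covering roughly 25%, 50%, 75%, and 100% of the training data. Only the ordering of rows before folding differs.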

Given N folds, H2O Model Validation retrains the Driverless AI experiment N times. Each iteration changes only the training dataset, which is replaced with a new sub-training sample, and generates a scorer for further analysis.
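The retraining loop can be pictured with a small, generic sketch. It does not call Driverless AI or H2O Model Validation: `train_fn` and `score_fn` are hypothetical stand-ins for retraining the experiment and scoring the result, and the toy "model" below is just the mean of the training values.

```python
from typing import Callable, List, Sequence, Tuple


def size_dependency_scores(
    samples: Sequence[Sequence[float]],
    train_fn: Callable[[Sequence[float]], float],
    score_fn: Callable[[float], float],
) -> List[Tuple[int, float]]:
    """Retrain once per sub-training sample and record (sample size, score)."""
    results = []
    for sample in samples:
        model = train_fn(sample)  # only the training data changes per iteration
        results.append((len(sample), score_fn(model)))
    return results


# Toy stand-ins (assumptions for illustration only).
holdout = [3.0, 3.0]


def train_mean(sample: Sequence[float]) -> float:
    """'Model' that always predicts the mean of its training values."""
    return sum(sample) / len(sample)


def mean_absolute_error(prediction: float) -> float:
    """Score a constant prediction against the fixed holdout values."""
    return sum(abs(y - prediction) for y in holdout) / len(holdout)


samples = [[1.0], [1.0, 3.0], [1.0, 3.0, 5.0]]
scores = size_dependency_scores(samples, train_mean, mean_absolute_error)
print(scores)  # [(1, 2.0), (2, 1.0), (3, 0.0)]
```

Plotting the score against the sample size gives the kind of curve this test is meant to expose: an error that is still falling at the largest sample suggests that more training data may help.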

The image below illustrates how the original training data is sampled under random or expanding window sampling when N (the number of folds) equals 4.

sampling.png

