Version: v0.17.0

Settings: Robustness

Overview

H2O Model Validation offers an array of settings for a robustness test. Below, each setting is described in turn.

Settings

Test name

This setting specifies the name of the test. By default, H2O Model Validation assigns a name to the test, which you can change.

Model

This setting specifies the model H2O Model Validation utilizes for the robustness test.

Model training dataset

note

Model training dataset is one of the model's informational fields, not a setting. It identifies the dataset that was used to train the selected model.

Primary dataset

caution

The primary dataset must follow the model's training dataset format.

This setting specifies the dataset H2O Model Validation utilizes to assess the model's robustness. H2O Model Validation first applies the model to the dataset and calculates the original model score. It then perturbs the dataset N times to create N perturbed datasets, where N is the value of the following robustness setting: Number of iterations to perturbate.

After generating the N perturbed dataset(s), H2O Model Validation applies the model to each of them and calculates a perturbed model score for each.
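
The flow described above can be sketched in a few lines of Python. This is only an illustration of the described steps, not H2O Model Validation's implementation: `run_robustness_test`, `score_fn`, and `perturb_fn` are hypothetical names, and the sketch assumes each iteration perturbs the original primary dataset rather than a previously perturbed copy.

```python
import random
from typing import Callable, List, Tuple

Dataset = List[List[float]]


def run_robustness_test(
    score_fn: Callable[[Dataset], float],
    dataset: Dataset,
    perturb_fn: Callable[[Dataset, random.Random], Dataset],
    n_iterations: int,
    seed: int = 0,
) -> Tuple[float, List[float]]:
    """Score the original dataset, then score N perturbed copies of it."""
    rng = random.Random(seed)
    # The original model score is calculated first, before any perturbation.
    original_score = score_fn(dataset)
    perturbed_scores = []
    for _ in range(n_iterations):
        # Assumption: every iteration starts from the original dataset.
        perturbed = perturb_fn(dataset, rng)
        perturbed_scores.append(score_fn(perturbed))
    return original_score, perturbed_scores


# Toy stand-ins (hypothetical): the "score" is the mean of the first column,
# and the perturbation adds small Gaussian noise to every value.
def mean_first_column(data: Dataset) -> float:
    return sum(row[0] for row in data) / len(data)


def add_noise(data: Dataset, rng: random.Random) -> Dataset:
    return [[value + rng.gauss(0.0, 0.1) for value in row] for row in data]


toy_dataset = [[1.0], [2.0], [3.0]]
original, perturbed = run_robustness_test(
    mean_first_column, toy_dataset, add_noise, n_iterations=5
)
```

Comparing `original` with the values in `perturbed` shows how much the score moves under noise; a robust model is expected to show only small differences.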

Perturbation size

This setting specifies the noise level (perturbation size). The perturbation size can range between 0 and 1, where a value closer to 1 indicates a higher noise level. H2O Model Validation applies the specified noise level to each perturbed dataset it generates.

Number of iterations to perturbate

This setting specifies the number of times the primary dataset is perturbed and scored by the model. In other words, H2O Model Validation perturbs the primary dataset N times to create N perturbed datasets, where N is the value of this setting.

After generating the N perturbed dataset(s), H2O Model Validation applies the model to each of them and calculates a perturbed model score for each.

Features to perturbate

This setting defines the features (columns) to perturbate in the primary dataset.

Options

  • All
    • This option perturbates all features (columns) except the target, date, and time column(s).
  • Custom selection
    • This option enables you to specify the features (columns) to perturbate. You cannot perturbate the target, date, or time column(s).
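
As a rough illustration of the two options, the following sketch resolves which columns would be perturbed. The function name `select_features` and its parameters are hypothetical and only mirror the rules stated above; they are not part of H2O Model Validation.

```python
from typing import List, Optional, Sequence


def select_features(
    columns: Sequence[str],
    target: str,
    datetime_columns: Sequence[str],
    custom_selection: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return the columns to perturb under the "All" or "Custom selection" option."""
    excluded = {target, *datetime_columns}
    if custom_selection is None:
        # "All": every column except the target and the date/time columns.
        return [c for c in columns if c not in excluded]
    # "Custom selection": reject excluded or unknown columns.
    invalid = [c for c in custom_selection if c in excluded or c not in columns]
    if invalid:
        raise ValueError(f"Columns cannot be perturbed: {invalid}")
    return list(custom_selection)


example_columns = ["age", "income", "signup_date", "churned"]
all_features = select_features(example_columns, "churned", ["signup_date"])
custom_features = select_features(
    example_columns, "churned", ["signup_date"], custom_selection=["age"]
)
```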

Perturbation method for numerical features

This setting defines the perturbation method for numerical features (columns).

caution

Categorical features: H2O Model Validation utilizes a default perturbation method for categorical features referred to as frequency-based perturbation. To illustrate the default method, consider the following example:

Let's say we have three categories: X, Y, and Z. We want to understand how often each category appears in our data. For example, 30% of the data is X, 30% is Y, and 40% is Z.

Now, we want to make some changes to our data. We will randomly choose some samples and change their category. The chance of a sample being changed is determined by the Perturbation size setting. For example, if the perturbation size is 0.1 (which means 10%), then about 10% of the samples will be changed.

When a sample is changed, we decide its new category randomly. The chances of it becoming category X, Y, or Z are based on the percentages we found earlier (that is, 30%, 30%, and 40%). So, if a sample is changed, there is a 30% chance it becomes X, a 30% chance it becomes Y and a 40% chance it becomes Z.

In simple terms, we are looking at how often different categories appear in our data and then randomly changing some samples to different categories based on those frequencies.
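
The example above can be sketched as follows. This is a minimal illustration of the described idea using only the Python standard library, not the product's actual code; `perturb_categorical` is a hypothetical name.

```python
import random
from collections import Counter
from typing import List, Sequence


def perturb_categorical(
    values: Sequence[str], perturbation_size: float, seed: int = 0
) -> List[str]:
    """Redraw a random share of samples from the observed category frequencies."""
    rng = random.Random(seed)
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]  # e.g. X: 3, Y: 3, Z: 4
    perturbed = []
    for value in values:
        # Each sample is selected for change with probability perturbation_size.
        if rng.random() < perturbation_size:
            # The new category is drawn according to the observed frequencies.
            perturbed.append(rng.choices(categories, weights=weights, k=1)[0])
        else:
            perturbed.append(value)
    return perturbed


sample = ["X"] * 3 + ["Y"] * 3 + ["Z"] * 4  # 30% X, 30% Y, 40% Z
perturbed_sample = perturb_categorical(sample, perturbation_size=0.1)
```

Note that, in this sketch, a selected sample can be redrawn as its original category, because the new category is drawn from the overall frequencies.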

Options

  • Raw
    • This option adds Gaussian noise to the feature values.
      caution

      This method may not be appropriate in certain cases, such as:

      1. When the data is discrete (for example, 1, 2, 3, ..., 10), a perturbed value such as 1.2 is invalid because it does not belong to the discrete set.
      2. When the data follows a skewed distribution with a long tail, the calculated standard deviation can become unstable, which makes it challenging to select an appropriate perturbation size.
  • Quantile
    • This option converts the feature values into quantile space, perturbs the quantiles with uniform noise, and then transforms the perturbed quantiles back to the original space.
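
To make the difference between the two options concrete, here is a small sketch of both ideas. It rests on assumptions that the text above does not confirm: the Gaussian noise is scaled by the feature's standard deviation times the perturbation size, the quantile of a value is its empirical rank, and the uniform noise is drawn from the range plus or minus the perturbation size. The function names are hypothetical.

```python
import bisect
import random
import statistics
from typing import List, Sequence


def perturb_raw(
    values: Sequence[float], perturbation_size: float, rng: random.Random
) -> List[float]:
    """Add Gaussian noise (assumed scale: std. deviation * perturbation size)."""
    sigma = statistics.pstdev(values) * perturbation_size
    return [v + rng.gauss(0.0, sigma) for v in values]


def perturb_quantile(
    values: Sequence[float], perturbation_size: float, rng: random.Random
) -> List[float]:
    """Perturb empirical quantiles with uniform noise, then map back to values."""
    sorted_values = sorted(values)
    n = len(sorted_values)
    perturbed = []
    for v in values:
        # Empirical quantile: share of observations less than or equal to v.
        q = bisect.bisect_right(sorted_values, v) / n
        # Assumed noise range: [-perturbation_size, +perturbation_size].
        q_new = q + rng.uniform(-perturbation_size, perturbation_size)
        q_new = min(1.0, max(0.0, q_new))
        # Map the perturbed quantile back to an observed value.
        index = min(n - 1, max(0, round(q_new * n) - 1))
        perturbed.append(sorted_values[index])
    return perturbed


rng = random.Random(0)
discrete_feature = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
raw_result = perturb_raw(discrete_feature, 0.2, rng)
quantile_result = perturb_quantile(discrete_feature, 0.2, rng)
```

In this sketch, `perturb_quantile` only ever returns values that already occur in the feature, so a discrete feature stays discrete, whereas `perturb_raw` generally produces values outside the original set.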

