Running an Experiment

  1. After Driverless AI is installed and started, open a browser (Chrome recommended) and navigate to <server>:12345.
  2. The first time you log in to Driverless AI, you will be prompted to read and accept the Evaluation Agreement. You must accept the terms before continuing. Review the agreement, then click I agree to these terms to continue.
  3. Log in by entering unique credentials. For example:
Username: h2oai
Password: h2oai

Note that these credentials do not restrict access to Driverless AI; they are used to tie experiments to users. If you log in with different credentials, you will not see any previously run experiments.

  4. As with accepting the Evaluation Agreement, the first time you log in, you will be prompted to enter your License Key. Click the Enter License button, then paste the License Key into the License Key entry field. Click Save to continue. This license key will be saved in the host machine’s /license folder.
Note: Contact sales@h2o.ai for information on how to purchase a Driverless AI license.
  5. The Home page appears, showing all datasets that have been imported. Note that the first time you log in, this list will be empty.

    [Figure: Add Dataset]

Add datasets using one of the following methods:

Drag and drop files from your local machine directly onto this page. Note that this method currently works for files that are less than 10 GB.

or

  1. Click the Add Dataset button. Note that if Driverless AI was started with data connectors enabled for HDFS, S3, Google Cloud Storage, and/or Google BigQuery, then a dropdown will appear allowing you to specify where to begin browsing for the dataset. Refer to Data Connectors for more information.

    [Figure: Add Dataset]
  2. In the Explore File System field, type the location of the dataset. Driverless AI autofills the browse line as you type the file location. When you locate the folder that includes your datasets, you can choose to import the entire folder or one or more individual files.

Notes:

  • Datasets must be in delimited text format.
  • Driverless AI can detect the following separators: comma (,), pipe (|), semicolon (;), and tab (\t).
  • When importing a folder, the entire folder and all of its contents are read into Driverless AI as a single file.
  • When importing a folder, all of the files in the folder must have the same columns.
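
If you plan to import a folder as a single dataset, it can be worth confirming beforehand that every file has the same columns. The following is a minimal pandas sketch, not part of Driverless AI; the folder path, file extension, and separator are hypothetical:

    import glob
    import pandas as pd

    folder = "/data/transactions"               # hypothetical folder to import

    # Read only the header row of each delimited file and compare column lists.
    columns_per_file = {
        path: list(pd.read_csv(path, sep=",", nrows=0).columns)
        for path in sorted(glob.glob(folder + "/*.csv"))
    }

    reference = next(iter(columns_per_file.values()))
    for path, cols in columns_per_file.items():
        if cols != reference:
            print("Column mismatch in", path, ":", cols)
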
[Figure: Search for file]
  6. After importing your data, you can run an experiment by clicking the [Click for Actions] button beside the dataset that you want to use. This opens a submenu that allows you to Visualize or Predict a dataset. (Note: You can delete an unused dataset by hovering over it, clicking the X button, and then confirming the delete. You cannot delete a dataset that was used in an active experiment; you have to delete the experiment first.) Click Predict to begin an experiment.
[Figure: Datasets action menu]
  7. The Experiment Settings form displays and auto-fills with the selected dataset. Optionally specify a validation dataset and/or a test dataset.
  • The validation set is used to tune parameters (models, features, etc.). If a validation dataset is not provided, the training data is used (with holdout splits). If a validation dataset is provided, training data is not used for parameter tuning - only for training. A validation dataset can help to improve the generalization performance on shifting data distributions.
  • The test dataset is used for final-stage scoring and is the dataset against which model metrics are computed. Test set predictions will be available at the end of the experiment. This dataset is not used during training of the modeling pipeline.

Keep in mind that these datasets must have the same number of columns as the training dataset. Also note that if provided, the validation set is not sampled down, so it can lead to large memory usage, even if accuracy=1 (which reduces the train size).

  8. Specify the target (response) column. Note that not all explanatory functionality will be available for multiclass classification scenarios (scenarios with more than two outcomes). When the target column is selected, Driverless AI automatically provides the target column type and the number of rows. If this is a classification problem, then the UI shows unique and frequency statistics (Target Freq/Most Freq) for numerical columns. If this is a regression problem, then the UI shows the dataset mean and standard deviation values.

Notes Regarding Frequency:

  • For data imported in versions <= 1.0.19, TARGET FREQ and MOST FREQ both represent the count of the least frequent class for numeric target columns and the count of the most frequent class for categorical target columns.
  • For data imported in versions 1.0.20-1.0.22, TARGET FREQ and MOST FREQ both represent the frequency of the target class (second class in lexicographic order) for binomial target columns; the count of the most frequent class for categorical multinomial target columns; and the count of the least frequent class for numeric multinomial target columns.
  • For data imported in version 1.0.23 (and later), TARGET FREQ is the frequency of the target class for binomial target columns, and MOST FREQ is the most frequent class for multinomial target columns.
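
Outside of Driverless AI, the statistics shown for the target column can be approximated with pandas. A minimal sketch; the file name, target column, and the simple classification heuristic are hypothetical:

    import pandas as pd

    df = pd.read_csv("train.csv")              # hypothetical training dataset
    target = df["default_flag"]                # hypothetical target column

    if target.nunique() <= 20:                 # crude stand-in for "classification"
        counts = target.value_counts()
        print("unique classes:", target.nunique())
        print("most frequent class count:", counts.max())
        print("least frequent class count:", counts.min())
    else:                                      # otherwise treat as regression
        print("mean:", target.mean())
        print("std:", target.std())
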
  9. The next step is to set the parameters and settings for the experiment. (Refer to the Experiment Settings section that follows for more information about these settings.) You can set the parameters individually, or you can let Driverless AI infer the parameters and then override any that you disagree with. Available parameters and settings include the following:
  • Dropped Columns: Columns that you do not want to use as predictors, such as ID columns, columns with data leakage, etc.

  • Weight Column: The column that indicates the per row observation weights. If “None” is specified, each row will have an observation weight of 1.

  • Fold Column: The column that indicates the fold. If “None” is specified, the folds will be determined by Driverless AI. This is set to “Disabled” if a validation set is used.

  • Time Column: The column that provides a time order, if applicable. If “AUTO” is specified, Driverless AI will auto-detect a potential time order. If “OFF” is specified, auto-detection is disabled. This is set to “Disabled” if a validation set is used.

  • Desired relative accuracy from 1 to 10

  • Desired relative time from 1 to 10

  • Desired relative interpretability from 1 to 10

  • Specify the scorer to use for this experiment. The scorers vary based on whether this is a classification or regression experiment. Available scorers include:

    • Regression: GINI, R2, MSE, RMSE, RMSLE, RMSPE, MAE, MAPE, SMAPE
    • Classification: GINI, MCC, F05, F1, F2, Accuracy, Logloss, AUC, AUCPR

    If a scorer is not selected, Driverless AI will select one based on the dataset and experiment. (A short sketch of computing several of these scorers appears after the settings below.)

Additional settings:

  • If this is a classification problem, then click the Classification button. Note that Driverless AI determines the problem type based on the response column. Though not recommended, you can override this setting and specify whether this is a classification or regression problem.
  • Click the Reproducible button to build this with a random seed.
  • Specify whether to enable GPUs. (Note that this option is ignored on CPU-only systems.)
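
For reference, several of the scorers listed above can be computed outside of Driverless AI with scikit-learn. This is a minimal sketch with hypothetical validation values; GINI is derived here from AUC (2*AUC - 1), which is a common convention and an assumption rather than Driverless AI's documented formula:

    import numpy as np
    from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                                 roc_auc_score, log_loss, f1_score)

    # Regression scorers on hypothetical actual/predicted values.
    y_true = np.array([3.1, 0.5, 2.2, 7.8])
    y_pred = np.array([2.9, 0.7, 2.0, 7.1])
    print("R2  :", r2_score(y_true, y_pred))
    print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
    print("MAE :", mean_absolute_error(y_true, y_pred))

    # Classification scorers on hypothetical labels / positive-class probabilities.
    y_cls = np.array([0, 0, 1, 1])
    p_cls = np.array([0.1, 0.4, 0.35, 0.8])
    auc = roc_auc_score(y_cls, p_cls)
    print("AUC    :", auc)
    print("GINI   :", 2 * auc - 1)
    print("LOGLOSS:", log_loss(y_cls, p_cls))
    print("F1     :", f1_score(y_cls, (p_cls >= 0.5).astype(int)))
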
[Figure: Experiment settings]
  10. Click Launch Experiment to start the experiment.

The experiment launches with a randomly generated experiment name. You can change this name at any time during or after the experiment. Mouse over the name of the experiment to view an edit icon, then type in the desired name.

As the experiment runs, a running status displays in the upper middle portion of the UI. First, Driverless AI determines the backend and whether GPUs are available. It then starts parameter tuning, followed by feature engineering. Finally, Driverless AI builds the scoring pipeline.

In addition to the status, the UI also displays details about the dataset, the iteration data (internal validation) for each cross validation fold along with any specified scorer value, the feature importance values, and CPU/Memory information (including Notifications, Logs, and Trace info). For classification problems, the lower right section includes a toggle between an ROC curve, Precision-Recall graph, Lift chart, Gains chart, and GPU Usage information (if GPUs are available). For regression problems, the lower right section includes a toggle between an Actual vs. Predicted chart and GPU Usage information (if GPUs are available). (Refer to the Experiment Graphs section for more information.) Upon completion, an Experiment Summary section will populate in the lower right section.

The bottom portion of the experiment screen will show any warnings that Driverless AI encounters. You can hide this pane by clicking the x icon.

You can stop experiments that are currently running. Click the Finish button to stop the experiment. This jumps the experiment to the end and completes the ensembling and the deployment package. You can also click Abort to terminate the experiment. (You will be prompted to confirm the abort.) Note that aborted experiments will not display on the Experiments page.

[Figure: Experiment]

Experiment Settings

This section describes the settings that are available when running an experiment.

Dropped Columns

Dropped columns are columns that you do not want to be used as predictors in the experiment.

Validation Dataset

The validation dataset is used for tuning the modeling pipeline. If provided, the entire training data will be used for training, and validation of the modeling pipeline is performed with only this validation dataset. This is not generally recommended, but can make sense if the data are non-stationary. In such a case, the validation dataset can help to improve the generalization performance on shifting data distributions.

This dataset must have the same number of columns (and column types) as the training dataset. Also note that if provided, the validation set is not sampled down, so it can lead to large memory usage, even if accuracy=1 (which reduces the train size).

Test Dataset

The test dataset is used for testing the modeling pipeline and creating test predictions. The test set is never used during training of the modeling pipeline. (Results are the same whether a test set is provided or not.) If a test dataset is provided, then test set predictions will be available at the end of the experiment.

Weight Column

Optional: Column that indicates the observation weight (a.k.a. sample or row weight), if applicable. This column must be numeric with values >= 0. Rows with higher weights have higher importance. The weight affects model training through a weighted loss function, and affects model scoring through weighted metrics. The weight column is not used when making test set predictions (but scoring of the test set predictions can use the weight).
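
To see how row weights change a metric, scikit-learn's metrics accept a sample_weight argument that behaves analogously to weighted scoring. A minimal sketch with hypothetical values, not Driverless AI code:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    y_true = np.array([10.0, 12.0, 50.0])
    y_pred = np.array([11.0, 12.5, 40.0])
    weights = np.array([1.0, 1.0, 3.0])        # hypothetical weight column (>= 0)

    # Unweighted vs. weighted RMSE: the heavily weighted third row dominates the
    # weighted score, just as it would dominate a weighted training loss.
    print("RMSE         :", np.sqrt(mean_squared_error(y_true, y_pred)))
    print("weighted RMSE:", np.sqrt(mean_squared_error(y_true, y_pred, sample_weight=weights)))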

Fold Column

Optional: Column to use to create stratification folds during (cross-)validation, if applicable. Must be of integer or categorical type. Rows with the same value in the fold column represent cohorts, and each cohort is assigned to exactly one fold. This can help to build better models when the data is grouped naturally. If left empty, the data is assumed to be i.i.d. (independent and identically distributed). For example, when viewing data for a pneumonia dataset, person_id would be a good Fold Column. This is because the data may include multiple diagnostic snapshots per person, and we want to ensure that the same person’s characteristics show up only in either the training or validation frames, but not in both, to avoid data leakage. Note that a fold column cannot be specified if a validation set is used.
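
The cohort-per-fold behavior described above can be imitated outside of Driverless AI with scikit-learn's GroupKFold. This is an illustrative sketch only, with a hypothetical person_id cohort column:

    import pandas as pd
    from sklearn.model_selection import GroupKFold

    # Hypothetical data with several diagnostic snapshots per person.
    df = pd.DataFrame({
        "person_id": [1, 1, 2, 2, 3, 3, 4, 4],
        "feature":   [0.2, 0.3, 0.5, 0.4, 0.9, 0.8, 0.1, 0.2],
        "target":    [0, 0, 1, 1, 0, 1, 0, 0],
    })

    # Every row for a given person_id lands in exactly one fold, so the same person
    # never appears in both the training and validation splits.
    for train_idx, valid_idx in GroupKFold(n_splits=2).split(df, groups=df["person_id"]):
        print("train persons:", sorted(df.loc[train_idx, "person_id"].unique()),
              "| valid persons:", sorted(df.loc[valid_idx, "person_id"].unique()))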

Time Column

Optional: Column that provides a time order, if applicable. Can improve model performance and model validation accuracy for problems where the target values are auto-correlated with respect to the ordering. Each observation’s time stamp is used to order the observations in a causal way (generally, to avoid training on the future to predict the past).

The values in this column must be a datetime format understood by pandas.to_datetime(), like “2017-11-29 00:30:35” or “2017/11/29”. If [AUTO] is selected, all string columns are tested for potential date/datetime content and considered as potential time columns. The natural row order of the training data is also considered in case no date/datetime columns are detected. If the data is (nearly) independent and identically distributed (i.i.d.), then no time column is needed. If [OFF] is selected, no time order is used for modeling, and data may be shuffled randomly (any potential temporal causality will be ignored). Note that a time column cannot be specified if a validation set is used.
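
As a small illustration of the datetime handling referenced above, the column can be parsed with pandas.to_datetime() and the data ordered in time before any manual split. A sketch with hypothetical values; Driverless AI performs its own ordering internally when a time column is used:

    import pandas as pd

    df = pd.DataFrame({
        "event_time": ["2017-11-29 00:30:35", "2017-11-30 10:15:00", "2017-12-01 08:00:00",
                       "2017-12-02 12:00:00", "2017-12-03 09:45:00"],
        "target": [0, 1, 0, 1, 0],
    })

    # Parse the time column and sort so earlier observations come first
    # (train on the past, validate on the future).
    df["event_time"] = pd.to_datetime(df["event_time"])
    df = df.sort_values("event_time").reset_index(drop=True)

    n_train = int(len(df) * 0.8)               # hypothetical 80/20 time-based split
    train, valid = df.iloc[:n_train], df.iloc[n_train:]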

Accuracy

The following table describes how the Accuracy value affects a Driverless AI experiment.

Accuracy | Max Rows | Ensemble Level | Target Transformation | Parameter Tuning Level | Num Individuals | Num Folds | Only First Fold Model | Distribution Check
1        | 10K      | 0              | False                 | 0                      | Auto            | 3         | True                  | No
2        | 100K     | 0              | False                 | 0                      | Auto            | 3         | True                  | No
3        | 500K     | 0              | False                 | 0                      | Auto            | 3         | True                  | No
4        | 1M       | 0              | False                 | 0                      | Auto            | 3-4       | True                  | No
5        | 2M       | 1              | True                  | 1                      | Auto            | 3-4       | True                  | Yes
6        | 5M       | 1              | True                  | 1                      | Auto            | 3-5       | True                  | Yes
7        | 10M      | <=2            | True                  | 2                      | Auto            | 3-10      | True                  | Yes
8        | 10M      | <=2            | True                  | 2                      | Auto            | 4-10      | if >= 5M rows         | Yes
9        | 20M      | <=3            | True                  | 3                      | Auto            | 4-10      | if >= 5M rows         | Yes
10       | None     | <=3            | True                  | 3                      | Auto            | 4-10      | if >= 5M rows         | Yes

Note: A check for a shift in the distribution between train and test is done for accuracy >= 5.

The list below includes more information about the parameters that are used when calculating accuracy.

  • Max Rows: The maximum number of rows to use in model training.
    • For classification, stratified random sampling is performed.
    • For regression, random sampling is performed.
  • Ensemble Level: The level of ensembling done for the final model.
    • 0: single model
    • 1: 2 4-fold models ensembled together
    • 2: 5 5-fold models ensembled together
    • 3: 8 5-fold models ensembled together
    • If the ensemble level is greater than 0, then the final model score shows an error estimate that includes the data generalization error (standard deviation of scores over folds) and the error in the estimate of the score (the bootstrap score’s standard deviation with a sample size the same as the data size). (A minimal bootstrap sketch follows this list.)
    • For accuracy >= 8, the estimate of the error in the validation score reduces, and the error in the score is dominated by the data generalization error.
    • The error in the test score is estimated by the maximum of the bootstrap estimate (with a sample size equal to the test set size) and the error in the validation score.
  • Target Transformation: Try target transformations and choose the transformation that has the best score.

    Possible transformations: identity, unit_box, log, square, square root, double square root, inverse, Anscombe, logit, sigmoid

  • Parameter Tuning Level: The level of parameter tuning done.
    • 0: no parameter tuning
    • 1: 8 different parameter settings
    • 2: 16 different parameter settings
    • 3: 32 different parameter settings
    • Optimal model parameters are chosen based on a combination of the model’s accuracy, training speed, and complexity.
  • Num Individuals: The number of individuals in the population for the genetic algorithm.
    • Each individual is a gene. The more genes, the more combinations of features are tried.
    • The number of individuals is automatically determined and can depend on the number of GPUs. Typical values are between 4 and 16.
  • Num Folds: The number of internal validation splits done for each pipeline.
    • If the problem is a classification problem, then stratified folds are created.
  • Only First Fold Model: Whether to use only the first fold split for internal validation to save time.
    • Example: Setting Num Folds to 3 and Only First Fold Model = True means you are splitting the data into 67% training and 33% validation.
    • If “Only First Fold Model” is False, then errors on the score shown during feature engineering include the data generalization error (standard deviation of scores over folds) and the error in the estimate of the score (the bootstrap score’s standard deviation with a sample size the same as the data size).
    • If “Only First Fold Model” is True, then errors on the score shown during feature engineering include only the error in the estimate of the score (the bootstrap score’s standard deviation with a sample size the same as the data size).
    • For accuracy >= 8, the estimate of the error in the score reduces, and the error in the score is dominated by the data generalization error. This provides the most accurate generalization error.
  • Early Stopping Rounds: Time-based; determined by the Time table below.
  • Distribution Check: Checks whether validation or test data are drawn from the same distribution as the training data. Note that this check is purely informative; Driverless AI does not take information from the test set into consideration during training.
  • Strategy: The feature selection strategy (used to prune away features that do not clearly improve the model score). Feature selection is triggered by interpretability: Strategy = “FS” if interpretability >= 6; otherwise Strategy is None.
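
The error-bar logic above relies on a bootstrap estimate of the score’s standard deviation. The following is a minimal sketch of that idea, not Driverless AI code, assuming scikit-learn and NumPy and a hypothetical set of validation labels and predictions:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(42)

    # Hypothetical validation labels and positive-class probabilities.
    y = rng.integers(0, 2, size=1000)
    p = np.clip(0.6 * y + rng.normal(0.2, 0.25, size=1000), 0, 1)

    # Bootstrap the score: resample the validation rows with replacement
    # (sample size equal to the data size) and take the standard deviation
    # of the resampled scores as the error in the score estimate.
    scores = []
    for _ in range(200):
        idx = rng.integers(0, len(y), size=len(y))
        scores.append(roc_auc_score(y[idx], p[idx]))

    print("AUC:", roc_auc_score(y, p), "+/-", np.std(scores))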

Time

This specifies the relative time for completing the experiment (i.e., higher settings take longer). Early stopping will take place if the experiment does not improve the score for the specified number of iterations.

Time | Iterations | Early Stopping Rounds
1    | 1-5        | None
2    | 10         | 5
3    | 30         | 5
4    | 40         | 5
5    | 50         | 10
6    | 100        | 10
7    | 150        | 15
8    | 200        | 20
9    | 300        | 30
10   | 500        | 50

Note: See the Accuracy table for cases when not based upon time.
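
As a rough illustration of the early-stopping behavior described above (a sketch only, not Driverless AI’s implementation), assuming a hypothetical per-iteration scoring function:

    # Minimal early-stopping sketch: stop when the validation score has not
    # improved for `early_stopping_rounds` consecutive iterations.
    def run_iterations(score_fn, max_iterations=100, early_stopping_rounds=10):
        best_score, best_iteration = float("-inf"), 0
        for i in range(1, max_iterations + 1):
            score = score_fn(i)                 # hypothetical per-iteration score
            if score > best_score:
                best_score, best_iteration = score, i
            elif i - best_iteration >= early_stopping_rounds:
                break                           # no improvement for N rounds
        return best_iteration, best_score

    # Example: the score peaks at iteration 30, so the loop stops at iteration 40.
    print(run_iterations(lambda i: 0.8 - 0.001 * abs(i - 30), 200, 10))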

Interpretability

Interpretability | Ensemble Level | Monotonicity Constraints
<= 5             | <= 3           | Disabled
>= 6             | <= 2           | Disabled
>= 7             | <= 2           | Enabled
>= 8             | <= 1           | Enabled
10               | 0              | Enabled

Interpretability | Transformers**
<= 5             | All
6                | Interpretability#5 - [TruncSvdNum, ClusterDist]
7                | Interpretability#6 - [ClusterIDTargetEncodeMulti, ClusterIDTargetEncodeSingle]
8                | Interpretability#7 - [NumToCatTargetEncodeSingle, NumToCatTargetEncodeMulti, Frequent]
9                | Interpretability#8 - [NumToCatWeightOfEvidence, NumToCatWoE, NumCatTargetEncodeMulti, NumCatTargetEncodeSingle]
10               | Interpretability#9 - [BulkInteractions, WeightOfEvidence, CvCatNumEncode, NumToCatWeightOfEvidenceMonotonic]

** The notation Interpretability#N - [transformers] means the transformer set available at interpretability N minus the listed transformers; in other words, it shows which transformers are lost when raising the interpretability by 1.

** Exception: NumToCatWeightOfEvidenceMonotonic is removed for interpretability <= 6.

** For interpretability <= 10, i.e. only [Filter for numeric, Frequent for categorical, DateTime for Date+Time, Date for dates, and Text for text]

  • Target Transformers:

    For regression, applied on target before any other transformations.

    Interpretability | Target Transformer
    <= 10            | TargetTransformer_identity
    <= 10            | TargetTransformer_unit_box
    <= 10            | TargetTransformer_log
    <= 9             | TargetTransformer_square
    <= 9             | TargetTransformer_sqrt
    <= 8             | TargetTransformer_double_sqrt
    <= 6             | TargetTransformer_logit
    <= 6             | TargetTransformer_sigmoid
    <= 5             | TargetTransformer_Anscombe
    <= 4             | TargetTransformer_inverse
  • Monotonicity Constraints:

    If enabled, the model will satisfy knowledge about monotonicity in the data and monotone relationships between the predictors and the target variable. For example, in house price prediction, the house price should increase with lot size and number of rooms, and should decrease with crime rate in the area. If enabled, Driverless AI will automatically determine if monotonicity is present and enforce it in its modeling pipelines.

  • Data Types Detected:

    • categorical
    • date
    • datetime
    • numeric
    • text
  • Transformers used on raw features to generate new features:

    Interpretability | Transformer
    <= 10            | Filter
    <= 10            | DateTime
    <= 10            | Date
    <= 10            | Text
    <= 10            | TextLin
    <= 10            | CvTargetEncodeMulti
    <= 10            | CvTargetEncodeSingle
    <= 9             | CvCatNumEncode
    <= 9             | WeightOfEvidence
    <= 9 and >= 7    | NumToCatWeightOfEvidenceMonotonic
    <= 9             | BulkInteractions
    <= 8             | NumToCatWeightOfEvidence
    <= 8             | NumCatTargetEncodeMulti
    <= 8             | NumCatTargetEncodeSingle
    <= 7             | Frequent
    <= 7             | NumToCatTargetEncodeMulti
    <= 7             | NumToCatTargetEncodeSingle
    <= 6             | ClusterIDTargetEncodeMulti
    <= 6             | ClusterIDTargetEncodeSingle
    <= 5             | TruncSvdNum
    <= 5             | ClusterDist

    ** Default N-way interactions are up to 8-way, except:
    • BulkInteractions are always 2-way.
    • Interactions are minimal-way (e.g., 1-way for CvTargetEncode) if interpretability = 10.
  • Feature importance threshold below which features are removed

    Interpretability | Threshold
    10               | config.toml varimp_threshold_at_interpretability_10
    9                | varimp_threshold_at_interpretability_10 / 5.0
    8                | varimp_threshold_at_interpretability_10 / 7.0
    7                | varimp_threshold_at_interpretability_10 / 10.0
    6                | varimp_threshold_at_interpretability_10 / 20.0
    5                | varimp_threshold_at_interpretability_10 / 30.0
    4                | varimp_threshold_at_interpretability_10 / 50.0
    3                | varimp_threshold_at_interpretability_10 / 500.0
    2                | varimp_threshold_at_interpretability_10 / 5000.0
    1                | 1E-30

    ** Also used for strategy=FS dropping of features, but the threshold is the above value multiplied by config.varimp_fspermute_factor.

  • Base model used for scoring features and building final model

    Interpretability | Allowed Base Model
    10               | Only GLM if glm_enable_more==True or glm_enable_exlcusive==True; GBM+GLM if glm_enable==True; otherwise only GBM
    9                | GBM unless glm_enable_exlcusive==True; GBM+GLM if glm_enable_more==True
    8                | GBM unless glm_enable_exlcusive==True; GBM+GLM if glm_enable_more==True
    7                | GBM unless glm_enable_exlcusive==True; GBM+GLM if glm_enable_more==True
    6                | GBM unless glm_enable_exlcusive==True; GBM+GLM if glm_enable_more==True
    5                | GBM unless glm_enable_exlcusive==True
    4                | GBM unless glm_enable_exlcusive==True
    3                | GBM unless glm_enable_exlcusive==True
    2                | GBM unless glm_enable_exlcusive==True
    1                | GBM unless glm_enable_exlcusive==True

    ** When mixing GBM and GLM in parameter tuning, the search space is split 50%/50% between GBM and GLM.

Experiment Graphs

This section describes the dashboard graphs that are displayed for running and completed experiments. These graphs are interactive. Hover over a point on the graph for more details about the point.

Binary Classification Experiments

For Binary Classification experiments, Driverless AI shows ROC Curves, a Precision-Recall graph, a Lift chart, and a Gains chart.

[Figure: Experiment graphs]
  • ROC: This shows the Receiver Operating Characteristic curve on validation data. The area under this curve is called AUC. The True Positive Rate (TPR) is the fraction of actual positives that are predicted correctly, and the False Positive Rate (FPR) is the fraction of actual negatives that are incorrectly predicted as positive. Each point corresponds to a classification threshold (e.g., YES if probability >= 0.3, else NO). For each threshold, there is a unique confusion matrix that represents the balance between TPR and FPR. In general, the most useful operating points are in the top left corner.
Hover over a point in the ROC curve to see the True Positive, True Negative, False Positive, False Negative, Threshold, FPR, TPR, Accuracy, F1, and MCC value for that point.
  • Precision-Recall: This shows the Precision-Recall curve on validation data. The area under this curve is called AUCPR.
  • Precision: correct positive predictions (TP) / all positive predictions (TP + FP).
  • Recall: correct positive predictions (TP) / all actual positives (TP + FN).

Each point corresponds to a classification threshold (e.g., YES if probability >= 0.3, else NO). For each threshold, there is a unique confusion matrix that represents the balance between Recall and Precision. The Precision-Recall curve can be more insightful than the ROC curve for highly imbalanced datasets.

Hover over a point in this graph to see the True Positive, True Negative, False Positive, False Negative, Threshold, Recall, Precision, Accuracy, F1, and MCC value for that point.

  • Lift: This chart shows lift stats on validation data. For example, “How many times more observations of the positive target class are in the top predicted 1%, 2%, 10%, etc. (cumulative) compared to selecting observations randomly?” By definition, the Lift at 100% is 1.0.
Hover over a point in the Lift chart to view the quantile percentage and cumulative lift value for that point.
  • Gains: This shows Gains stats on validation data. For example, “What fraction of all observations of the positive target class are in the top predicted 1%, 2%, 10%, etc. (cumulative)?” By definition, the Gains at 100% are 1.0.
Hover over a point in the Gains chart to view the quantile percentage and cumulative gain value for that point.
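
The quantities behind these graphs can be reproduced from a set of validation labels and predicted probabilities. The following is a minimal NumPy sketch with hypothetical values, not Driverless AI code:

    import numpy as np

    # Hypothetical validation labels and positive-class probabilities.
    y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
    p = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])

    # One point on the ROC / Precision-Recall curves: pick a threshold and build
    # the confusion matrix it implies (YES if probability >= threshold, else NO).
    threshold = 0.5
    pred = p >= threshold
    tp = np.sum(pred & (y == 1)); fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1)); tn = np.sum(~pred & (y == 0))
    print("TPR (recall):", tp / (tp + fn))
    print("FPR:", fp / (fp + tn))
    print("Precision:", tp / (tp + fp))

    # Cumulative gains and lift at the top 30% of predictions: what fraction of
    # all positives is captured there, and how does that compare to random selection?
    order = np.argsort(-p)
    top = order[: int(0.3 * len(p))]
    gain = y[top].sum() / y.sum()
    print("Gain @30%:", gain, "Lift @30%:", gain / 0.3)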

Multiclass Classification Experiments

The ROC curve, Precision-Recall, Lift chart, and Gains chart are also shown for multiclass problems. Driverless AI does this by considering the multi-class problem as multiple one-vs-all problems. This method is known as micro-averaging (reference: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#multiclass-settings).

For example, you may want to predict the species in the iris data. The predictions would look something like this:

class.Iris-setosa | class.Iris-versicolor | class.Iris-virginica
0.9628            | 0.021                 | 0.0158
0.0182            | 0.3172                | 0.6646
0.0191            | 0.9534                | 0.0276

To create the ROC, Lift, and Gains chart, Driverless AI converts the results to 3 one-vs-all problems:

prob-setosa | actual-setosa | prob-versicolor | actual-versicolor | prob-virginica | actual-virginica
0.9628      | 1             | 0.021           | 0                 | 0.0158         | 0
0.0182      | 0             | 0.3172          | 1                 | 0.6646         | 0
0.0191      | 0             | 0.9534          | 1                 | 0.0276         | 0

The result is 3 vectors of predicted and actual values for binomial problems. Driverless AI concatenates these 3 vectors together to compute the ROC curve, lift, and gains chart.

predicted = [0.9628, 0.0182, 0.0191, 0.021, 0.3172, 0.9534, 0.0158, 0.6646, 0.0276]
actual = [1, 0, 0, 0, 1, 1, 0, 0, 0]
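
Given those concatenated vectors, the micro-averaged curve and its AUC can be reproduced with standard tooling. A minimal sketch assuming scikit-learn (Driverless AI computes these values internally):

    from sklearn.metrics import roc_auc_score, roc_curve

    # Concatenated one-vs-all vectors from the iris example above.
    predicted = [0.9628, 0.0182, 0.0191, 0.021, 0.3172, 0.9534, 0.0158, 0.6646, 0.0276]
    actual = [1, 0, 0, 0, 1, 1, 0, 0, 0]

    # Treat the concatenation as a single binary problem (micro-averaging).
    fpr, tpr, thresholds = roc_curve(actual, predicted)
    print("micro-averaged AUC:", roc_auc_score(actual, predicted))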

Regression Experiments

An Actual vs. Predicted graph is displayed for Regression experiments. This shows the actual versus predicted values on validation data. A small sample of values is displayed. A perfect model would place all points on the diagonal line.

Hover over a point on the graph to view the Actual and Predicted values for that point.

[Figure: Experiment graphs]