H2O Driverless AI is an artificial intelligence (AI) platform that automates some of the most difficult data science and machine learning workflows such as feature engineering, model validation, model tuning, model selection and model deployment. It aims to achieve highest predictive accuracy, comparable to expert data scientists, but in much shorter time thanks to end-to-end automation. Driverless AI also offers automatic visualizations and machine learning interpretability (MLI). Especially in regulated industries, model transparency and explanation are just as important as predictive performance.

This section provides answers to frequently asked questions. If you have additional questions about using Driverless AI, post them on Stack Overflow using the driverless-ai tag at http://stackoverflow.com/questions/tagged/driverless-ai.


How is Driverless AI different than any other black box ML algorithm?

Driverless AI uses many techniques (some older and some cutting-edge) for interpreting black box models including creating reason codes for every prediction the system makes. We have also created numerous open source code examples and free publications that explain these techniques. Please see the list below for links to these resources and for references for the interpretability techniques.

  • Open source interpretability examples:
  • Free Machine Learning Interpretability publications:
  • Machine Learning Techniques already in Driverless AI:


How can I change my username and password?

The username and password is tied to the experiments you have created. For example, if I log in with the username/password: megan/megan and start an experiment, then I would need to log back in with the same username and password to see those experiments. The username and password, however, does not limit your access to Driverless AI. If you want to use a new user name and password, you can log in again with a new username and password, but keep in mind that you won’t see your old experiments.

Can Driverless AI run on CPU-only machines?

Yes, Driverless AI can run on machines with CPUs only, though GPUs are recommended. Installation instructions are available for GPU and CPU systems. Refer to Installing and Upgrading Driverless AI for more information.

How often do new versions come out?

Driverless AI is in early stage development, and new releases come out quite frequently. Currently, you can expect a new release approximately every 7 to 10 days.

How can I upgrade to a newer version of Driverless AI?

Upgrade instructions vary depending on your environment. Refer to the installation section for your environment. Upgrade instructions are included there.

Why do I have to tag my release as latest?

  • The Docker convention is that it expects :latest to exist in the repo. Driverless AI ships with a tag as in this example: opsh2oai/h2oai-runtime:1.2.0.cuda9. To make rollback and deployment deterministic, you must apply a tag of latest to reflect what you want latest to be. For example, if a user downloads the versioned download of 1.2.0.cuda9 (https://s3-us-west-2.amazonaws.com/h2o-internal-release/docker/driverless-ai-docker-cuda9-runtime-rel-1.2.0.cuda9.gz), that version will have a 1.2.0.cuda9 tag on it. Perform the following to tag your Driverless AI version (for example, 1.2.0.cuda9) to latest:
docker tag opsh2oai/h2oai-runtime:1.2.0.cuda9 opsh2oai/h2oai-runtime:latest

What kind of authentication is supported in Driverless AI?

Driverless AI supports LDAP, PAM, and Local authentication. This can be configured by setting the appropriate environment variables in the config.toml file or by specifying the environment variables when starting Driverless AI. Refer to Data Connectors for more information.

Can I set up SSL on Driverless AI?

Yes, you can set up HTTPS/SSL on Driverless AI running in an AWS environment. HTTPS/SSL needs to be configured on the host machine, and the necessary ports will need to be opened on the AWS side. You will need to have your own SSL cert or you can create a self signed cert for yourself.

The following is a very simple example showing how to configure HTTPS with a proxy pass to the port on the container 12345 with the keys placed in /etc/nginx/. Replace <server_name> with your server name.

server {
    listen 80;
    return 301 https://$host$request_uri;

server {
    listen 443;

    # Specify your server name here
    server_name <server_name>;

    ssl_certificate           /etc/nginx/cert.crt;
    ssl_certificate_key       /etc/nginx/cert.key;
    ssl on;
    ssl_session_cache  builtin:1000  shared:SSL:10m;
    ssl_protocols  TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers HIGH:!aNULL:!eNULL:!EXPORT:!CAMELLIA:!DES:!MD5:!PSK:!RC4;
    ssl_prefer_server_ciphers on;

    access_log            /var/log/nginx/dai.access.log;

    location / {
      proxy_set_header        Host $host;
      proxy_set_header        X-Real-IP $remote_addr;
      proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header        X-Forwarded-Proto $scheme;

      # Fix the “It appears that your reverse proxy set up is broken" error.
      proxy_pass          http://localhost:12345;
      proxy_read_timeout  90;

      # Specify your server name for the redirect
      proxy_redirect      http://localhost:12345 https://<server_name>;

More information about SSL for Nginx in Ubuntu 16.04 can be found here: https://www.digitalocean.com/community/tutorials/how-to-create-a-self-signed-ssl-certificate-for-nginx-in-ubuntu-16-04.


Is there a file size limit for datasets?

The file size for datasets is limited by GPU memory, but we continue to make optimizations for getting more data into an experiment.

How much memory does Driverless AI require in order to run experiments?

Right now, Driverless AI requires 10x the size of the data in system memory.

How many columns can Driverless AI handle?

Driverless AI has been tested on datasets with 10k columns. When running experiments on wide data, Driverless AI automatically checks if it is running out of memory, and if it is, it reduces the number of features until it can fit in memory. This may lead to a worse model, but Driverless AI shouldn’t crash because the data is wide.

How should I use Driverless AI if I have large data?

For large datasets, we recommend sampling your data for Driverless AI. For very large datasets, the complete dataset is not actually required to find good features. It is much more important to use the complete data to train the final supervised learning model.

For large dataset, the recommended steps are:

  1. Sample data and run Driverless AI.
  2. Extract the feature engineering pipeline. (This would be the Python scoring package.)
  3. Apply the feature engineering to the complete data.
  4. Train a supervised learning model on the complete data using something like H2O core.

How does Driverless AI detect the ID column?

The ID column logic is that the column is named ‘id’, ‘Id’, ‘ID’ or ‘iD’ exactly. (It does not check the number of unique values.) For now, if you want to ensure that your ID column is downloaded with the predictions, then you would want to name it one of those names.

Can Driverless AI handle data with missing values/nulls?

Yes, data that is imported into Driverless AI can include missing values. Feature engineering is fully aware of missing values, and missing values are treated as information - either as a special categorical level or as a special number. So for target encoding, for example, rows with a certain missing feature will belong to the same group. For Categorical Encoding where aggregations of a numeric columns are calculated for a grouped categorical column, missing values are kept. The formula for calculating the mean is the sum of non-missing values divided by the count of all values. For clustering, we impute missing values. And for frequency encoding, we count the number of rows that have a certain missing feature.

How are outliers handled?

Outliers are not removed from the data. Instead Driverless AI finds the best way to represent data with outliers. For example, Driverless AI may find that binning a variable with outliers improves performance.

For target columns, Driverless AI first determines the best representation of the column. It may find that for a target column with outliers, it is best to predict the log of the column.

If I drop several columns from the Train dataset, will Driverless AI understand that it needs to drop the same columns from the Test dataset?

If you drop columns from the training dataset, Driverless AI will do the same for the validation and test datasets (if the columns are present). There is no need for these columns because no features will be created from them.

Which algorithms are used in Driverless AI?

Features are engineered with a proprietary stack of statistical models including some of the most sophisticated target encoding and likelihood estimates, but we also employ linear models, neural nets, clustering and dimensionality reduction models and more.

On top of the engineered features, XGBoost models (and ensembles thereof, including GLM if enabled via config.toml and GBM) are used to make predictions. More models, such as linear models and neural nets, will be available in future releases.

Why does Driverless AI only use GBM for predictions?

Since 2006, boosting methods have proven to be the most accurate for noisy predictive modeling tasks outside of pattern recognition in images and sound (https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf). The advent of XGBoost and Kaggle only cemented this position.

We are planning to introduce linear models when interpretability is set to 10, and we also plan to introduce neural nets for high-cardinality multinomial classification problems. Also remember, there are many modeling algorithms for you to try in H2O-3.

What objective function is used in XGBoost?

The objective function used in XGBoost is:

  • reg:linear for regression
  • binary:logistic or multi:softprob for classification

The objective function does not change depending on the scorer chosen. The scorer influences parameter tuning only.

For regression, Tweedie/Gamma/Poisson/etc. regression is not yet supported, but Driverless AI handles various target transforms so many target distributions can be handled very effectively already.

Further details for the XGBoost instantiations can be found in the logs and in the model summary, both of which can be downloaded from the GUI or are found in the /tmp/h2oai_experiment_<name>/ folder on the server.

Does Driverless AI perform internal or external validation?

Driverless AI does internal validation when only training data is provided. It does external validation when training and validation data are provided. In either scenario, the validation data is used for all parameter tuning (models and features), not just for feature selection. Parameter tuning includes target transformation, model selection, feature engineering, feature selection, stacking, etc.


  • Internal validation (only training data given):
    • Ideal when data is either close to i.i.d., or for time-series problems
    • Internal holdouts are used for parameter tuning, with temporal causality for time-series problems
    • Will do the full spectrum from single holdout split to 5-fold CV, depending on accuracy settings
    • No need to split training data manually
    • Final models are trained using CV on the training data
  • External validation (training + validation data given):
    • Ideal when there’s some amount of drift in the data, and the validation set mimics the test set data better than the training data
    • No training data wasted during training because training data not used for parameter tuning
    • Validation data is used only for parameter tuning, and is not part of training data
    • No CV possible because we explicitly do not want to overfit on the training data
    • Not allowed for time-series problems (see below)

Tip: If you want both training and validation data to be used for parameter tuning (the training process), just concatenate the datasets together and turn them both into training data for the “internal validation” method.

How does Driverless AI prevent overfitting?

Driverless AI performs a number of checks to prevent overfitting. For example, during certain transformations, Driverless AI calculates the average on out-of-fold data using cross validation. Driverless AI also performs early stopping for every model built, ensuring that the model build will stop when it ceases to improve on holdout data. And additional steps to prevent overfitting include checking for i.i.d. and avoiding leakage during feature engineering.

A blog post describing Driverless AI overfitting protection in greater detail is available here: https://blog.h2o.ai/2018/03/driverless-ai-prevents-overfitting-leakage/.

How’s does Driverless AI avoid the multiple hypothesis (MH) problem?

Or more specifically, like many brute force methods for tuning hyperparameters/model selection, Driverless AI runs up against the multihypothesis problem (MH). For example, if I randomly generated a gazillion models, the odds that a few will do awesome on the test data that they are all measured against is pretty large, simply by sheer luck. How does Driverless AI address this?

Driverless AI uses a variant of the reusable holdout technique to address the multiple hypothesis problem. Refer to https://pdfs.semanticscholar.org/25fe/96591144f4af3d8f8f79c95b37f415e5bb75.pdf for more information.

How does Driverless AI suggest the experiment settings?

When you run an experiment on a dataset, the experiment settings (Accuracy, Time, and Interpretability) are automatically suggested by Driverless AI. For example, Driverless AI may suggest the parameters Accuracy = 7, Time = 3, Interpretability = 6, based on your data.

Driverless AI will automatically suggest experiment settings based on the number of columns and number of rows in your dataset. The settings are suggested to ensure best handling when the data is small. If the data is small, Driverless AI will suggest the settings that prevent overfitting and ensure the full dataset is utilized.

If the number of rows and number of columns are each below a certain threshold, then:

  • Accuracy will be increased up to 8.
    • The accuracy is increased so that cross validation is done. (We don’t want to “throw away” any data for internal validation purposes.)
  • Interpretability will be increased up to 8.
    • The higher the interpretability setting, the smaller the number of features in the final model.
    • More complex features are not allowed.
    • This prevents overfitting.
  • Time will be decreased down to 2.
    • There will be fewer feature engineering iterations to prevent overfitting.

What happens when I set Interpretability and Accuracy to the same number?

The answer is currently that interpretability controls which features are created and what features are kept. (Also above interpretability = 6, monotonicity constraints are used in XGBoost.) The accuracy refers to how hard DAI then tries to make those features into the most accurate model.

Why does the final model performance appear to be worse than previous iterations?

There are a few things to remember:

  • Driverless AI creates a best effort estimate of the generalization performance of the best modeling pipeline found so far
  • The performance estimation is always based on holdout data (data unseen by the model).
  • If no validation dataset is provided, the training data is split internally to create internal validation holdout data (once or multiple times or cross-validation, depending on the accuracy settings).
  • If no validation dataset is provided, for accuracy <= 7, a single holdout split is used, and a “lucky” or “unlucky” split can bias estimates for small datasets or datasets with high variance.
  • If a validation dataset is provided, then all performance estimates are solely based on the entire validation dataset (independent of accuracy settings).
  • All scores reported are based on bootstrapped-based statistical methods and come with error bars that represent a range of estimate uncertainty.

After the final iteration, a best final model is trained on a final set of engineered features. Depending on accuracy settings, a more accurate estimation of generalization performance may be done using cross-validation. Also, the final model may be a stacked ensemble consisting of multiple base models, which generally leads to better performance. Consequently, in rare cases, the difference in performance estimation method can lead to the final model’s estimated performance seeming poorer than those from previous iterations. (i.e., The final model’s estimated score is significantly worse than the last iteration score and error bars don’t overlap.) In that case, it is very likely that the final model performance estimation is more accurate, and the prior estimates were biased due to a “lucky” split. To confirm this, you can re-run the experiment multiple times (without setting the reproducible flag).

If you would like to minimize the likelihood of the final model performance appearing worse than previous iterations, here are some recommendations:

  • Increase accuracy settings
  • Provide a validation dataset
  • Provide more data

How can I see the performance metrics on the test data?

As long as you provide a target column in the test set, Driverless AI will show the best estimate of the final model’s performance on the test set at the end of the experiment. The test set is never used to tune parameters (unlike to what Kagglers often do), so this is purely a convenience. Of course, you can still make test set predictions and compute your own metrics using a method of your choice.

How can I see all the performance metrics possible for my experiment?

At the end of the experiment, the model’s estimated performance on all provided datasets with a target column is printed in the experiment logs. For example, for the test set:

Final scores on test (external holdout) +/- stddev:
               GINI = 0.87794 +/- 0.035305 (more is better)
                MCC = 0.71124 +/- 0.043232 (more is better)
                F05 = 0.79175 +/- 0.04209 (more is better)
                 F1 = 0.75823 +/- 0.038675 (more is better)
                 F2 = 0.82752 +/- 0.03604 (more is better)
           ACCURACY = 0.91513 +/- 0.011975 (more is better)
            LOGLOSS = 0.28429 +/- 0.016682 (less is better)
              AUCPR = 0.79074 +/- 0.046223 (more is better)
 optimized: AUC = 0.93386 +/- 0.018856 (more is better)

What if my training/validation and testing data sets come from different distributions?

In general, Driverless AI uses training data to engineer features and train models and validation data to tune all parameters. If no external validation data is given, the training data is used to create internal holdouts. The way holdouts are created internally depends on whether there is a strong time dependence, see the point below. If the data has no obvious time dependency (e.g., if there is no time column neither implicit or explicit), or if the data can be sorted arbitrarily and it won’t affect the outcome (e.g., Iris data, predicting flower species from measurements), and if the test dataset is different (e.g., new flowers or only large flowers), then the model performance on validation (either internal or external) as measured during training won’t be achieved during final testing due to the obvious inability of the model to generalize.

What if my data has a time dependency?

If you know that your data has a strong time dependency, select a time column before starting the experiment. The time column must be in a Datetime format that can be parsed by pandas, such as “2017-11-06 14:32:21”, “Monday, June 18, 2012” or “Jun 18 2018 14:34:00” etc., or contain only integers.

If you are unsure about the strength of the time dependency, run two experiments: One with time column set to “[OFF]” and one with time column set to “[AUTO]” (or pick a time column yourself).

Why can’t I specify a validation data set for time-series problems? Why do you look at the test set for time-series problems

The problem with validation vs test in the time series setting is that there is only one valid way to define the split. If a test set is given, its length in time defines the validation split and the validation data has to be part of train. Otherwise the time-series validation won’t be useful.

For instance: Let’s assume we have train = [1,2,3,4,5,6,7,8,9,10] and test = [12,13], where integers define time periods (e.g., weeks). For this example, the most natural train/valid split that mimics the test scenario would be: train = [1,2,3,4,5,6,7] and valid = [9,10], and month 8 is skipped for training. Note that we will look at the start time and the duration of the test set only (if provided), and not at the contents of the test data (neither features nor target). If the user provides validation = [8,9,10] instead of test data, then this could lead to inferior validation strategy and worse generalization. Hence, we use the user-given test set only to create the optimal internal train/validation splits. If no test set is provided, the user can provide the length of the test set (in periods), the length of the train/test gap (in periods) and the length of the period itself (in seconds).

Does Driverless AI handle weighted data?

Yes. You can optionally provide an extra weight column in your training (and validation) data with non-negative observation weights. This can be useful to implement domain-specific effects such as exponential weighting in time or class weights. All of our algorithms and metrics in Driverless AI support observation weights, but note that estimated likelihoods can be skewed as a consequence.

How does Driverless AI handle fold assignments for weighted data?

Currently, Driverless AI does not take the weights into account during fold creation, but you can provide a fold column to enforce your own grouping, i.e., to keep rows that belong to the same group together (either in train or valid). The fold column has to be a categorical column (integers ok) that assigns a group ID to each row. (It needs to have at least 5 groups because we do up to 5-fold CV.)

Where can I get details of the various transformations performed in an experiment?

Download the experiment’s log .zip file from the GUI. This zip file includes summary information, log information, and a gene_summary.txt file with details of the transformations used in the experiment. Specificially, there is a details folder with all subprocess logs.

On the server, the experiment specific files are inside the /tmp/h2oai_experiment_<name>/ folder after the experiment completes, particularly h2oai_experiment_logs_<name>.zip and h2oai_experiment_summary_<name>.zip.

How can I download the predictions onto the machine where Driverless AI is running?

When you select Score on Another Dataset, the predictions will automatically be stored on the machine where Driverless AI is running. They will be saved in the following locations (and can be opened again by Driverless AI, both for .csv and .bin):

  • Training Data Predictions: tmp/h2oai_experiment_<name>/train_preds.csv (also saved as .bin)
  • Testing Data Predictions: tmp/h2oai_experiment_<name>/test_preds.csv (also saved as .bin)
  • New Data Predictions: tmp/h2oai_experiment_<name>/automatically_generated_name.csv. Note that the automatically generated name will match the name of the file downloaded to your local computer.