H2O Driverless AI Release Notes

H2O Driverless AI is a high-performance, GPU-enabled, client-server application for the rapid development and deployment of state-of-the-art predictive analytics models. It reads tabular data from various sources and automates data visualization, grand-master level automatic feature engineering, model validation (overfitting and leakage prevention), model parameter tuning, model interpretability and model deployment. H2O Driverless AI is currently targeting common regression, binomial classification, and multinomial classification applications including loss-given-default, probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, and predictive asset maintenance models. It also handles time-series problems for individual or grouped time-series such as weekly sales predictions per store and department, with time-causal feature engineering and validation schemes. The ability to model unstructured data is coming soon.

High-level capabilities:

  • Client/server application for rapid experimentation and deployment of state-of-the-art supervised machine learning models

  • User-friendly GUI

  • Python and R client API

  • Automatically creates machine learning modeling pipelines for highest predictive accuracy

  • Automates data cleaning, feature selection, feature engineering, model selection, model tuning, ensembling

  • Automatically creates stand-alone batch scoring pipeline for in-process scoring or client/server scoring via HTTP or TCP protocols in Python

  • Automatically creates stand-alone (MOJO) low latency scoring pipeline for in-process scoring or client/server scoring via HTTP or TCP protocols, in C++ (with R and Python runtimes) and Java (runs anywhere)

  • Multi-GPU and multi-CPU support for powerful workstations and NVIDIA DGX supercomputers

  • Machine Learning model interpretation module with global and local model interpretation

  • Automatic Visualization module

  • Multi-user support

  • Backward compatibility

Problem types supported:

  • Regression (continuous target variable like age, income, price or loss prediction, time-series forecasting)

  • Binary classification (0/1 or “N”/“Y”, for fraud prediction, churn prediction, failure prediction, etc.)

  • Multinomial classification (“negative”/“neutral”/“positive” or 0/1/2/3 or 0.5/1.0/2.0 for categorical target variables, for prediction of membership type, next-action, product recommendation, sentiment analysis, etc.)

Data types supported:

  • Tabular structured data, rows are observations, columns are fields/features/variables

  • Numeric, categorical and textual fields

  • Missing values are allowed

  • i.i.d. (independent and identically distributed) data

  • Time-series data with a single time-series (time flows across the entire dataset, not per block of data)

  • Grouped time-series (e.g., sales per store per department per week, all in one file, with 3 columns for store, dept, week)

  • Time-series problems with a gap between training and testing (i.e., the time to deploy), and a known forecast horizon (after which model has to be retrained)
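
To make the gap and forecast horizon in the last item concrete, here is a minimal Python sketch. It is not Driverless AI code: the function name `split_with_gap` and its parameters are hypothetical, and it assumes time is given as integer periods (for example, week numbers).

```python
def split_with_gap(periods, train_end, gap, horizon):
    """Split time periods into a training set and a test set.

    train_end: last period included in training
    gap:       periods skipped after training (the time to deploy)
    horizon:   number of periods the model is expected to forecast
    """
    test_start = train_end + gap + 1
    test_end = test_start + horizon - 1
    train = [p for p in periods if p <= train_end]
    test = [p for p in periods if test_start <= p <= test_end]
    return train, test


weeks = list(range(1, 21))
train, test = split_with_gap(weeks, train_end=12, gap=2, horizon=4)
print(train[-1], test)  # last training week, then the forecast weeks
```

Periods that fall inside the gap are used for neither training nor testing, which mirrors the delay between the end of training data and the first prediction made in production.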

Data types supported via custom recipes:

  • Image

  • Video

  • Audio

  • Graphs

Data sources supported:

  • Local file system or NFS

  • File upload from browser or Python client

  • S3 (Amazon)

  • Hadoop (HDFS)

  • Azure Blob storage

  • Blue Data Tap

  • Google BigQuery

  • Google Cloud storage

  • kdb+

  • Minio

  • Snowflake

  • JDBC

  • Custom Data Recipe BYOR (Python, bring your own recipe)

File formats supported:

  • Plain text formats of columnar data (.csv, .tsv, .txt)

  • Compressed archives (.zip, .gz, .bz2)

  • Excel

  • Parquet

  • Feather

  • Python datatable (.jay)


DAI architecture



DAI roadmap


Change Log

Version 1.8.10 LTS (Feb 23, 2021)

  • New Features:

  • MOJO updates:

    • Upgraded MOJO runtime dependency to 2.5.10

    • Added MOJO support to compute Shapley for Tree and Linear-based boosting models. Note that support for Shapley values for ensemble GLM models is currently in beta

  • Improvements:

    • Added more verbosity to MLI logs

  • Bug Fixes:

    • Fixed stall detected in LightGBM models on P2.8x Amazon EC2 instances

Version 1.8.9 LTS (Oct 19, 2020)

  • New Features:

    • Add configurable CSRF (Cross-site request forgery) protection on API endpoints

    • Add protection against concurrent sessions

  • Improvements:

    • Hide webserver technology info from all API endpoints

    • Improved BYOR security by introducing configurable static analysis of the code

    • Improved session verification and authenticity

    • Improved security for internals API handlers via encryption

  • Bug Fixes:

    • Fix user session autologout after session expiration

    • Fix for properly cleaning closed sessions

    • Fix invalid redirection to static artifacts when using reverse proxy and URL prefix

    • Fix import of files without extension

Version 1.8.8 LTS (Sep 30, 2020)

  • New Features:

    • Give user control over number of saved variable importances (i.e., Python and R clients can get more than 14 values back) (max_varimp_to_save in config.toml file)

    • Added holiday calendars for 24 more countries; users can select the list of countries for which to create is-holiday features in time series experiments

    • Enable GPU support for LightGBM models on IBM Power

    • Expose k-LIME MOJO in MLI

    • Support rhel8-like systems

    • Introducing an option to log in using the JWT token injected by the reverse proxy

    • Allow user to specify data delimiter/separator from configuration (datatable_separator in config.toml file)

    • Add support of encrypted keystore for sensitive config.toml values

    • Save transformed column names for Shapley value computation in MOJO

  • Improvements:

    • Add more consistency in handling files without an extension

    • Improve web server request handling and disallow redirection outside of application

    • Improve log file formatting to facilitate parsing

    • Improve logging for connectors

    • Improve air-gapped support for custom recipes

    • Allow Snowflake Stage tables to be optional

  • Bug Fixes:

    • Fix OpenID and TLS login redirection when deploying behind reverse proxy

    • Fix Cgroup memory detection on IBM Power

    • Various MLI fixes

    • Various UI fixes

  • Documentation updates:

Version LTS (June 23, 2020)

  • New Features:

    • Add ability to push artifacts to a Bitbucket server

    • Add per-feature user control for monotonicity constraints for XGBoostGBM, LightGBM and DecisionTree models

  • Bug Fixes:

    • Fix Hive kerberos impersonation

    • Fix a DTap connector issue by using the proper login username for impersonation

    • Fix monotonicity constraints for XGBoostGBM, LightGBM and DecisionTree models

Version 1.8.7 LTS (June 15, 2020)

  • New Features:

    • Add intercept term to k-LIME csv

    • Add control of default categorical & numeric feature rendering in DAI PD/ICE

    • Add ability to restrict custom recipe upload to a specific git repository and branch

    • Add translations for Korean and Chinese

    • Add ability to use multiple authentication methods simultaneously

  • Improvements:

    • Improve behavior of systemctl in the case Driverless AI fails to start

    • Improve logging behavior for JDBC and Hive connectors

    • Improve behavior of C++ scorer, fewer unnecessary files saved in tmp directory

    • Improve Docker image behavior in Kubernetes

    • Improve LDAP authentication to allow for anonymous binding

    • Improve speed of feature selection for experiments on large, wide, imbalanced datasets

    • Improve speed of data import on busy system

  • Bug fixes:

    • Fix automatic Kaggle submission and score retrieval

    • Fix intermittent Java exception seen by surrogate DRF model in MLI when several MLI jobs are run concurrently

    • Fix issue with deleting Deployments if linked Experiment was deleted

    • Fix issue causing Jupyter Notebooks to not work properly in Docker Image

    • Fix custom recipe scorers not being displayed on Diagnostics page

    • Fix issue with AWS Lambda Deployment not handling dropped columns properly

    • Fix issue with not being able to limit number of GPUs for specific experiment

    • Fix in-server scoring inaccuracies for certain models built in 1.7.1 and 1.8.0 (standalone scoring not affected)

    • Fix rare datatable type casting exception

  • Documentation updates:

    • The “Maximum Number of Rows to Perform Permutation-Based Feature Selection” expert setting now has a default value of 500,000

    • Improved Hive and Snowflake connector documentation

    • Updated the Main.java example in the Java Scoring Pipeline chapter

    • Added documentation describing how to change the language in the UI before starting the application

    • Added information about how custom recipes are described and documented in the Autoreport

    • Updated the LDAP authentication documentation

    • Improved the Linux DEB and RPM installation instructions

    • Improved the AWS Community AMI installation instructions

    • Improved documentation for the Reproducible button

Version 1.8.6 LTS (Apr 30, 2020)

  • New Features:

    • Add expert setting to reduce size of MOJO scoring pipelines (and hence reduce latency and memory usage for inference)

    • Enable Lambda deployment for IBM Power

    • Add restart button for Deployments

    • Add automatic Kaggle submission for supported datasets, show private/public scores (requires Kaggle API Username/Key)

    • Show warning if single final model is worse on back-testing splits (for time series) or cross-validation folds (for IID) than the fold models (indicates issue with signal or fit)

    • Update R client API to include autodoc, experiment preview, dataset download, autovis functions

    • Add button in expert settings that toggles some effective settings to make a small MOJO production pipeline

    • Add an option to upload artifacts to S3 or a Git repository

  • Improvements:

    • Improve experiment restart/refit robustness if model type is changed

    • Extra protection against dropping features

    • Improve implementation of Hive connector

  • Bug fixes:

    • Upgrade datatable to fix endless loop during stats calculation at file import

    • Web server and UI now respect dynamic base URL suffix

    • Fix incorrect min_rows in MLI when providing weight column with small values

    • Fix segfault in MOJO for TensorFlow/PyTorch models

    • Fix elapsed time for MLI

    • Enable GPU by default for R client

    • Fix python scoring h2oai ModuleNotFound error

    • Update no_drop_features toml and expert button to more generally avoid dropping features

    • Fix datatable mmap strategy

  • Documentation updates:

    • Add documentation for enabling the Hive data connector

    • Add documentation for updating expired DAI licenses on AWS Lambda deployments using a script

    • Documentation for uploading artifacts now includes support for S3 and Git in the artifacts store

    • Improve documentation for one-hot encoding

    • Improve documentation for systemd logs/journalctl

    • Improve documentation for time series ‘unavailable columns at prediction time’

    • Improve documentation for Azure blob storage

    • Improve documentation for MOJO scoring pipeline

    • Add information about reducing the size of a MOJO using a new expert setting

Version 1.8.5 LTS (Mar 09, 2020)

  • New Features:

    • Handle large (up to 10k) multiclass problems, including GUI improvements in such cases

    • Detect class imbalance for binary problems where target class is non-rare

    • Add feature count to iteration panel

    • Add experiment lineage pdf in experiment summary zip file

    • Issue warnings if final pipeline scores are unstable across (cross-)validation folds

    • Issue warning if Constant Model is improving quality of final pipeline (sign of bad signal)

    • Report origin of leakage detection as from model fit (AUC/R2), GINI, or correlation
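
The leakage origins named in the last item (AUC, GINI, correlation) are closely related quantities. As a rough illustration only, the Python sketch below computes the AUC of a single raw feature against a binary target by pairwise comparison and converts it to a Gini coefficient using Gini = 2 * AUC - 1. The function names and the 0.95 threshold are hypothetical and are not Driverless AI settings.

```python
def auc_single_feature(feature, target):
    """AUC obtained by using one raw feature as the score for a 0/1 target."""
    pos = [f for f, t in zip(feature, target) if t == 1]
    neg = [f for f, t in zip(feature, target) if t == 0]
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(feature, target):
    """Gini coefficient derived from the single-feature AUC."""
    return 2.0 * auc_single_feature(feature, target) - 1.0


def looks_leaky(feature, target, threshold=0.95):
    # abs() so that a strongly inverse relationship is flagged as well.
    return abs(gini(feature, target)) >= threshold


target = [0, 0, 1, 1, 0, 1]
leaky = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]   # separates the classes perfectly
noisy = [0.5, 0.9, 0.4, 0.8, 0.1, 0.6]   # weak relationship to the target
print(looks_leaky(leaky, target), looks_leaky(noisy, target))
```

A feature that predicts the target almost perfectly on its own is suspicious because it often encodes information that would not be available at prediction time.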

  • Improvements:

    • Improve handling of ID columns

    • Improve exception handling to improve stability of raising python exceptions

    • Improve exception handling when any individual transformer or model throws an exception or segfaults

    • Improve robustness of restart and refit experiment to changes in experiment choices

    • Improve handling of missing values when transforming dataset

    • Improve robustness of custom recipe importing of modules

    • Improve documentation for installation instructions

    • Improve selection of initial lag sizes for time series

    • Improve LightGBM stability for regression problems for certain mutation parameters

  • Documentation updates:

    • Improved documentation for time-series experiments

    • Added topics describing how to re-enable the Data Recipe URL and Data Recipe File connectors

    • For users running older versions of the Standalone Python Scoring Pipeline, added information describing how to install upgraded versions of outdated dependencies

    • Improved the description for the “Sampling Method for Imbalanced Binary Classification Problems” expert setting

    • Added constraints related to the REST server deployments

    • Noted required vs optional parameters in the HDFS connector topics

    • Added an FAQ indicating that MOJOs are thread safe

    • On Windows 10, only Docker installs are supported

    • Added information about the Recommendations AutoViz graph

    • Added information to the Before you Begin Installing topic that master.db files are not backward compatible with earlier Driverless AI versions

  • Bug fixes:

    • Update LightGBM for bug fixes, including hangs and avoiding hard-coded library paths

    • Stabilize use of psutil package

    • Fix time-series experiments when test set has missing target values

    • Fix python scoring to not depend upon original data_directory

    • Fix preview for custom time series validation splits and low accuracy

    • Fix ignored minimum lag size setting for single time series

    • Fix parsing of Excel files with datetime columns

    • Fix column type detection for columns with mostly missing values

    • Fix invalid display of 0.0000 score in iteration scores

    • Various MLI fixes (don’t show invalid graphs, fix PDP sort order, overlapping labels)

    • Various bug fixes

Version LTS (Feb 4, 2020)

Available here

  • Add option for dynamic port allocation

  • Documentation for AWS community AMI

  • Various bug fixes (MLI UI)

Version 1.8.4 LTS (Jan 31, 2020)

Available here

  • New Features:

    • Added ‘Scores’ tab in experiment page to show detailed tuning tables and scores for models and folds

    • Added Constant Model (constant predictions) and use it as reference model by default

    • Show score of global constant predictions in experiment summary as reference

    • Added support for setting up mutual TLS for Driverless AI

    • Added option to use client/personal certificate as an authentication method

  • Documentation Updates:

    • Added sections for enabling mTLS and Client Certificate authentication

    • Constant Model is now included in the list of Supported Algorithms

    • Added a section describing the Model Scores page

    • Improved the C++ Scoring Pipeline documentation describing the process for importing datatable

    • Improved documentation for the Java Scoring Pipeline

  • Bug fixes:

    • Fix refitting of final pipeline when new features are added

    • Various bug fixes

Version 1.8.3 LTS (Jan 22, 2020)

Available here

  • Added option to upload experiment artifacts to a configured disk location

  • Various bug fixes (correct feature engineering from time column, migration for brain restart)

Version 1.8.2 LTS (Jan 17, 2020)

Available here

  • New Features:

    • Decision Tree model

      • Automatically enabled for accuracy <= 7 and interpretability >= 7

      • Supports all problem types: regression/binary/multiclass

      • Using LightGBM GPU/CPU backend with MOJO

      • Visualization of tree splits and leaf node decisions as part of pipeline visualization

    • Per-Column Imputation Scheme (experimental)

      • Select one of [const, mean, median, min, max, quantile] imputation scheme at start of experiment

      • Select method of calculation of imputation value: either on entire dataset or inside each pipeline’s training data split

      • Disabled by default and must be enabled at startup time to be effective

    • Show MOJO size and scoring latency (for C++/R/Python runtime) in experiment summary

    • Automatically prune low weight base models in final ensemble (based on interpretability setting) to reduce final model complexity

    • Automatically convert non-raw github URLs for custom recipes to raw source code URLs
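
The per-column imputation schemes listed above (const, mean, median, min, max, quantile) and the two ways of computing the fill value can be illustrated with a small Python sketch. This is not the Driverless AI implementation: the function names, the `None` marker for missing values, and the simple index-based quantile rule are all assumptions made for illustration.

```python
import statistics


def imputation_value(values, scheme, const=0.0, q=0.5):
    """Return the fill value for one column under the given scheme."""
    present = sorted(v for v in values if v is not None)
    if scheme == "const":
        return const
    if scheme == "mean":
        return statistics.fmean(present)
    if scheme == "median":
        return statistics.median(present)
    if scheme == "min":
        return present[0]
    if scheme == "max":
        return present[-1]
    if scheme == "quantile":
        # Simple index-based quantile; other interpolation rules exist.
        idx = min(len(present) - 1, int(q * len(present)))
        return present[idx]
    raise ValueError(f"unknown scheme: {scheme}")


def impute(values, fill):
    """Replace missing entries (None) with the fill value."""
    return [fill if v is None else v for v in values]


column = [4.0, None, 1.0, 7.0, None, 2.0]

# Option A: fill value computed on the entire column.
fill_all = imputation_value(column, "median")

# Option B: fill value computed only on a training split (first 4 rows here).
fill_train = imputation_value(column[:4], "median")

print(impute(column, fill_all), fill_train)
```

The two options can give different fill values, as they do here. Computing the value inside each training split avoids using validation rows to derive it.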

  • Improvements:

    • Speed up feature evolution for time-series and low-accuracy experiments

    • Improved accuracy of feature evolution algorithm

    • Feature transformer interpretability, total count, and importance accounted for in genetic algorithm’s model and feature selection

    • Binary confusion matrix in ROC curve of experiment page is made consistent with Diagnostics (flipped positions of TP/TN)

    • Only include custom recipes in Python scoring pipeline if the experiment uses any custom recipes

    • Additional documentation (New OpenID config options, JDBC data connector syntax)

    • Improved AutoReport’s transformer descriptions

    • Improved progress reporting during Autoreport creation

    • Improved speed of automatic interaction search for imbalanced multiclass problems

    • Improved accuracy of single final model for GLM and FTRL

    • Allow config_overrides to be a list/vector of parameters for R client API

    • Disable early stopping for Random Forest models by default, and expose new ‘rf_early_stopping’ mode (optional)

    • Create identical example data (again, as in 1.8.0 and before) for all scoring pipelines

    • Upgraded versions of datatable and Java

    • Installed graphviz in Docker image; the MOJO package and Autoreport now include a .png file of the pipeline visualization. Note: For RPM/DEB/TAR SH installs, users can install graphviz to get this optional functionality

  • Documentation Updates:

    • Added a simple example for modifying a dataset by recipe using live code

    • Added a section describing how to impute datasets (experimental)

    • Added Decision Trees to list of supported algorithms

    • Fixed examples for enabling JDBC connectors

    • Added information describing how to use a JDBC driver that is not tested in house

    • Updated the Missing Values Handling topic to include sections for “Clustering in Transformers” and “Isolation Forest Anomaly Score Transformer”

    • Improved the “Fold Column” description

  • Bug Fixes:

    • Fix various reasons why final model score was too far off from best feature evolution score

    • Delete temporary files created during test set scoring

    • Fixed target transformer tuning (was potentially mixing up target transformers between feature evolution and final model)

    • Fixed tensorflow_nlp_have_gpus_in_production=true mode

    • Fixed partial dependence plots for missing datetime values and no longer show them for text columns

    • Fixed time-series GUI for quarterly data

    • Feature transformer exploration limited to no more than 1000 new features (Small data on 10/10/1 would try too many features)

    • Fixed Kaggle pipeline building recipe to try more input features than 8

    • Fixed cursor placement in live code editor for custom data recipe

    • Show correct number of cross-validation splits in pipeline visualization if there are more than 10 splits

    • Fixed parsing of datetime in MOJO for some datetime formats without ‘%d’ (day)

    • Various bug fixes

  • Backward/Forward compatibility:

    • Models built in 1.8.2 LTS will remain supported in upcoming versions 1.8.x LTS

    • Models built in 1.7.1/1.8.0/1.8.1 are not deprecated and should continue to work (best effort is made to preserve MOJO and Autoreport creation, MLI, scoring, etc.)

    • Models built in 1.7.0 or earlier will be deprecated

Version (Dec 21, 2019)

Available here

  • Bugfix for time series experiments with quarterly data when launched from GUI

Version 1.8.1 (Dec 10, 2019)

Available here

  • New Features:

    • Full set of scoring metrics and corresponding downloadable holdout predictions for experiments with single final models (time-series or i.i.d)

    • MLI Updates:

      • What-If (sensitivity) analysis

      • Interpretation of experiments on text data (NLP)

    • Custom Data Recipe BYOR:

      • BYOR (bring your own recipe) in Python: pandas, numpy, datatable, third-party libraries for fast prototyping of connectors and data preprocessing inside DAI

      • data connectors, cleaning, filtering, aggregation, augmentation, feature engineering, splits, etc.

      • can create one or multiple datasets from scratch or from existing datasets

      • interactive code editor with live preview

      • example code at https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.1/data

    • Visualization of final scoring pipeline (Experimental)

      • In-GUI display of graph of feature engineering, modeling and ensembling steps of entire machine learning pipeline

      • Addition to Autodoc

    • Time-Series:

      • Ability to specify which features will be unavailable at test time for time-series experiments

      • Custom user-provided train/validation splits (by start/end datetime for each split) for time-series experiments

      • Back-testing metrics for time-series experiments (regression and classification, with and without lags) based on rolling windows (configurable number of windows)

    • MOJO:

      • Java MOJO for FTRL

      • PyTorch MOJO (C++/Py/R) for custom recipes based on BERT/DistilBERT NLP models (available upon request)
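
The rolling-window back-testing mentioned under Time-Series above can be sketched in Python as follows. This is only an illustration of the general idea, not the scheme Driverless AI uses internally: the function name, the expanding training window, and the integer period indices are assumptions.

```python
def rolling_backtest_windows(n_periods, n_windows, horizon, gap=0):
    """Return (train_idx, test_idx) pairs, oldest window first.

    Each test window covers `horizon` consecutive periods. Training uses
    only periods that end at least `gap` periods before the test window.
    """
    splits = []
    for w in range(n_windows, 0, -1):
        test_end = n_periods - (w - 1) * horizon   # exclusive
        test_start = test_end - horizon
        train_end = test_start - gap               # exclusive
        if train_end <= 0:
            continue  # not enough history for this window
        splits.append((list(range(train_end)),
                       list(range(test_start, test_end))))
    return splits


splits = rolling_backtest_windows(n_periods=12, n_windows=3, horizon=2, gap=1)
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

A metric computed on each test window and then compared across windows gives a sense of how stable the model is over time, since every window is scored using training data that strictly precedes it.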

  • Improvements:

    • Accuracy:

      • Automatic pairwise interaction search (+,-,*,/) for numeric features (“magic feature” finder)

      • Improved accuracy for time series experiments with low interpretability

      • Improved leakage detection logic

      • Improved genetic algorithm heuristics for feature evolution (more exploration)

    • Time-Series Recipes:

      • Re-enable Test-time augmentation in Python scoring pipeline for time-series experiments

      • Reduce default number of time-series rolling holdout predictions to same number as validation splits (but configurable)

    • Computation:

      • Faster feature evolution part for non-time-series experiments with single final model

      • Faster binary imbalanced models for very high class imbalance by limiting internal number of re-sampling bags

      • Faster feature selection

      • Enable GPU support for ImbalancedXGBoostGBMModel

      • Improved speed for importing multiple files at once

      • Faster automatic determination of time series properties

      • Enable use of XGBoost models on large datasets when accuracy settings are low enough, expose dataset size limits in expert settings

      • Reduced memory usage for all experiments

      • Faster creation of holdout predictions for time-series experiments (Shapley values are now computed by MLI on demand by default)

    • UX Improvements:

      • Added ability to rename datasets

      • Added search bar for expert settings

      • Show traces for long-running experiments

      • All experiments create a MOJO (if possible, set to ‘auto’)

      • All experiments create a pipeline visualization

      • By default, all experiments (iid and time series) have holdout predictions on training data and a full set of metrics for final model
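
As an illustration of the pairwise interaction search (+,-,*,/) listed under Accuracy above, the following Python sketch builds every pairwise combination of numeric columns and ranks the candidates. Ranking by absolute Pearson correlation with the target is an assumption chosen for brevity; the source does not describe how Driverless AI scores candidate interactions, and all names here are hypothetical.

```python
import itertools
import math
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def search_interactions(columns, target):
    """Score each pairwise interaction by |correlation| with the target."""
    results = []
    for (na, a), (nb, b) in itertools.combinations(columns.items(), 2):
        for sym, op in OPS.items():
            try:
                feat = [op(u, v) for u, v in zip(a, b)]
            except ZeroDivisionError:
                continue  # skip ratios that are undefined for this pair
            results.append((abs(pearson(feat, target)), f"{na} {sym} {nb}"))
    return sorted(results, reverse=True)


cols = {"width": [1.0, 2.0, 3.0, 4.0], "height": [4.0, 1.0, 3.0, 2.0]}
area = [w * h for w, h in zip(cols["width"], cols["height"])]
best_score, best_name = search_interactions(cols, area)[0]
print(best_name)
```

Because the toy target is the product of the two columns, the multiplicative interaction ranks first. A real search would also need to guard against overfitting when many candidate features are tried.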

  • Documentation Updates:

    • Updated steps for enabling GPU persistence mode

    • Added information about deprecated NVIDIA functions

    • Improved documentation for enabling LDAP authentication

    • Added information about changing the column type in datasets

    • Updated list of experiment artifacts available in an experiment summary

    • Added steps describing how to expose ports on Docker for the REST service deployment within the Driverless AI Docker container

    • Added an example showing how to run an experiment with a custom transform recipe

    • Improved the FAQ for setting up TLS/SSL

    • Added FAQ describing issues that can occur when attempting Import Folder as File with a data connector on Windows

  • Bug Fixes:

    • Allow brain restart/refit to accept unscored previous pipelines

    • Fix actual vs predicted labeling for diagnostics of regression model

    • Fix MOJO for TensorFlow for non target transformers other than identity

    • Fix column type detection for Excel files

    • Allow experiments with default expert settings to have a MOJO

    • Various bug fixes

Version 1.8.0 (Oct 3, 2019)

Available here

  • Improve speed and memory usage for feature engineering

  • Improve speed of leakage and shift detection, and improve accuracy

  • Improve speed of AutoVis under high system load

  • Improve speed for experiments with large user-given validation data

  • Improve accuracy of ensembles for regression problems

  • Improve creation of Autoreport (only one background job per experiment)

  • Improve sampling techniques for ImbalancedXGBoost and ImbalancedLightGBM models, and disable them by default since they can be slower

  • Add Python/R/C++ MOJO support for FTRL and RandomForest

  • Add native categorical handling for LightGBM in CPU mode

  • Add monotonicity constraints support for LightGBM

  • Add Isolation Forest Anomaly Score transformer (outlier detection)

  • Re-enable One-Hot-Encoding for GLM models

  • Add lexicographical label encoding (disabled by default)

  • Add ability to further train user-provided pretrained embeddings for TensorFlow NLP transformers, in addition to fine-tuning the rest of the neural network graph

  • Add timeout for BYOR acceptance tests

  • Add log and notifications for large shifts in final model variable importances compared to tuning model

  • Add more expert control over time series feature engineering

  • Add ability for recipes to be uploaded in bulk as entire (or part of) github repository or as links to python files on page

  • Allow missing values in fold column

  • Add support for feature brain when starting “New Model With Same Parameters” of a model that was previously restarted

  • Add support for toggling whether additional features are to be included in pipeline during “Retrain Final Pipeline”

  • Limit experiment runtime to one day by default (approximately enforced, can be configured in Expert Settings -> Experiment or config.toml ‘max_runtime_minutes’)

  • Add support for importing pickled Pandas frames (.pkl)

  • MLI updates:

    • Show holdout predictions and test set predictions (if applicable) in MLI TS for both metric and actual vs. predicted charts

    • Add ability to download group metrics in MLI TS

    • Add ability to zoom into charts in MLI TS

    • Add ability to use column not used in DAI model as a k-LIME cluster column in MLI

    • Add ability to view original and transformed DAI model-based feature importance in MLI

    • Add ability to view Shapley importance for original features

    • Add ability to view permutation importance for a DAI model when the config option autodoc_include_permutation_feature_importance is set to on

    • Fixed bug in binary Disparate Impact Analysis, which caused incorrect calculations amongst several metrics (ones using false positives and true negatives in the numerator)

  • Disable NLP TensorFlow transformers by default (enable in NLP expert settings by switching to “on”)

  • Reorganize expert settings, add tab for feature engineering

  • Experiment now informs if aborted by user, system or server restart

  • Reduce load of all tasks launched by server, giving priority to experiments to use cores

  • Add experiment summary files to aborted experiment logs

  • Add warning when ensemble has models that reach the maximum iteration limit despite early stopping; learning rate controls in the expert panel can be used to address this

  • Improve progress reporting

  • Allow disabling of H2O recipe server for scoring if not using custom recipes (to avoid Java dependency)

  • Fix RMSPE scorer

  • Fix recipes error handling when uploading via URL

  • Fix Autoreport being spawned anytime GUI was on experiment page, overloading the system with forks from the server

  • Fix time-out for Autoreport PDP calculations, so that it completes more quickly

  • Fix certain config settings to be honored from GUI expert settings (woe_bin_list, ohe_bin_list, text_gene_max_ngram, text_gene_dim_reduction_choice, tensorflow_max_epochs_nlp, tensorflow_nlp_pretrained_embeddings_file_path, holiday_country); previously these were honored only when provided at startup time

  • Fix column type for additional columns during scored test set download

  • Fix GUI incorrectly converting time for forecast horizon in TS experiments

  • Fix calculation of correlation for string columns in AutoVis

  • Fix download for R MOJO runtime

  • Fix parameters for LightGBM RF mode

  • Fix dart parameters for LightGBM and XGBoost

  • Documentation updates:

    • Included more information in the Before You Begin Installing or Upgrading topic to help make installations and upgrades go more smoothly

    • Added topic describing how to choose between the AWS Community and AWS Marketplace AMIs

    • Added information describing how to retrieve the MOJO2 Javadoc

    • Updated Python client examples to work with Driverless AI 1.7.x releases

    • Updated documentation for new features, expert settings, MLI plots, etc.

  • Backward/Forward compatibility:

    • Models built in 1.8.0 will remain supported in versions 1.8.x

    • Models built in 1.7.1 are not deprecated and should continue to work (best effort is made to preserve MOJO and Autoreport creation, MLI, scoring, etc.)

    • 1.8.0 upgraded to scipy version 1.3.1 to support newer custom recipes. This might deprecate custom recipes that depend on scipy version 1.2.2 (and experiments using them) and might require re-import of those custom recipes. Previously built Python scoring pipelines will continue to work.

    • Models built in 1.7.0 or earlier will be deprecated

  • Various bug fixes

Version 1.7.1 (Aug 19, 2019)

Available here

  • Added two new models with internal sampling techniques for imbalanced binary classification problems: ImbalancedXGBoost and ImbalancedLightGBM

  • Added support for rolling-window based predictions for time-series experiments (2 options: test-time augmentation or re-fit)

  • Added support for setting logical column types for a dataset (to override type detection during experiments)

  • Added ability to set experiment name at start of experiment

  • Added leakage detection for time-series problems

  • Added JDBC connector

  • MOJO updates:

    • Added Python/R/C++ MOJO support for TensorFlow model

    • Added Python/R/C++ MOJO support for TensorFlow NLP transformers: TextCNN, CharCNN, BiGRU, including any pretrained embeddings if provided

    • Reduced memory usage for MOJO creation

    • Increased speed of MOJO creation

    • Configuration options for MOJO and Python scoring pipelines now have 3-way toggle: “on”/“off”/“auto”

  • MLI updates:

    • Added disparate impact analysis (DIA) for MLI

    • Allow MLI scoring pipeline to be built for datasets with column names that need to be sanitized

    • Date-aware binning for partial dependence and ICE in MLI

  • Improved generalization performance for time-series modeling with regularization techniques for lag-based features

  • Improved “predicted vs actual” plots for regression problems (using adaptive point sizes)

  • Fix bug in datatable for manipulations of string columns larger than 2GB

  • Fixed download of predictions on user-provided validation data

  • Fix bug in time-series test-time augmentation (work-around was to include entire training data in test set)

  • Honor the expert settings flag to enable detailed traces (disable again by default)

  • Various bug fixes

Version 1.6.4 LTS (Aug 19, 2019)

Available here

  • ML Core updates:

    • Speed up schema detection

    • DAI now drops rows with missing values when diagnosing regression problems

    • Speed up column type detection

    • Fixed growth of individuals

    • Fixed n_jobs for predict

    • Target column is no longer included in predictors for skewed datasets

    • Added an option to prevent users from downloading data files locally

    • Improved UI split functionality

    • A new “max_listing_items” config option to limit the number of items fetched in listing pages

  • Model Ops updates:

    • MOJO runtime upgraded to version 2.1.3 which supports perpetual MOJO pipeline

    • Upgraded deployment templates to version matching MOJO runtime version

  • MLI updates:

    • Fix to MLI schema builder

    • Fix parsing of categorical reason codes

    • Added ability to handle integer time column

  • Various bug fixes

Version 1.7.0 (Jul 7, 2019)

Available here

  • Support for Bring Your Own Recipe (BYOR) for transformers, models (algorithms) and scorers

  • Added protobuf-based MOJO scoring runtime libraries for Python, R and Java (standalone, low-latency)

  • Added local REST server as one-click deployment option for MOJO scoring pipeline, in addition to AWS Lambda endpoint

  • Added R client package, in addition to Python client

  • Added Project workspace to group datasets and experiments and to visually compare experiments and create leaderboards

  • Added download of imported datasets as .csv

  • Recommendations for columnar transformations in AutoViz

  • Improved scalability and performance

  • Ability to provide max. runtime for experiments

  • Create MOJO scoring pipeline by default if the experiment configuration allows (for convenience, enables local/cloud deployment options without user input)

  • Support for user provided pre-trained embeddings for TensorFlow NLP models

  • Support for holdout splits lacking some target classes (can happen when a fold column is provided)

  • MLI updates:

    • Added residual plot for regression problems (keeping all outliers intact)

    • Added confusion matrix as default metric display for multinomial problems

    • Added Partial Dependence (PD) and Individual Conditional Expectation (ICE) plots for Driverless AI models in MLI GUI

    • Added ability to search by ID column in MLI GUI

    • Added ability to run MLI PD/ICE on all features

    • Added ability to handle multiple observations for a single time column in MLI TS by taking the mean of the target and prediction where applicable

    • Added ability to handle integer time column in MLI TS

    • MLI TS will use train holdout predictions if there is no test set provided

  • Faster import of files with “%Y%m%d” and “%Y%m%d%H%M” time format strings, and files with lots of text strings

  • Fix units for RMSPE scorer to be a percentage (multiply by 100)

  • Allow non-positive outcomes for MAPE and SMAPE scorers

  • Improved listing in GUI

  • Allow zooming in GUI

  • Upgrade to TensorFlow 1.13.1 and CUDA 10 (and CUDA is part of the distribution now, to simplify installation)

  • Add CPU-support for TensorFlow on PPC

  • Documentation updates:

    • Added documentation for new features including

      • Projects

      • Custom Recipes

      • C++ MOJO Scoring Pipelines

      • R Client API

      • REST Server Deployment

    • Added information about variable importance values on the experiments page

    • Updated documentation for Expert Settings

    • Updated “Tips n Tricks” with new Scoring Pipeline tips

  • Various bug fixes
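
The MLI time-series item above (averaging the target and prediction when several observations share one time value) can be illustrated with a minimal sketch. The function name and the row layout are hypothetical and chosen only for this example; this is not the Driverless AI implementation.

```python
from collections import defaultdict


def mean_per_timestamp(rows):
    """Collapse rows that share a time value into one (mean target, mean prediction) pair.

    `rows` is an iterable of (time, target, prediction) tuples. This layout is
    an assumption made for the sketch, not a Driverless AI data structure.
    """
    grouped = defaultdict(list)
    for time, target, prediction in rows:
        grouped[time].append((target, prediction))

    result = {}
    for time in sorted(grouped):
        pairs = grouped[time]
        mean_target = sum(t for t, _ in pairs) / len(pairs)
        mean_prediction = sum(p for _, p in pairs) / len(pairs)
        result[time] = (mean_target, mean_prediction)
    return result


# Two observations at time 1, one observation at time 2.
rows = [(1, 10.0, 12.0), (1, 20.0, 18.0), (2, 5.0, 6.0)]
print(mean_per_timestamp(rows))
```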

Version 1.6.3 LTS (June 14, 2019)

Available here

  • Included an Audit log feature

  • Fixed support for decimal types for parquet files in MOJO

  • Autodoc can order PDP/ICE by feature importance

  • Session Management updates

  • Upgraded datatable

  • Improved reproducibility

  • Model diagnostics now uses a weight column

  • MLI can build surrogate models on all the original features or on all the transformed features that DAI uses

  • Internal server cache now respects usernames

  • Fixed an issue with time series settings

  • Fixed an out of memory error when loading a MOJO

  • Fixed Python scoring package for TensorFlow

  • Added OpenID configurations

  • Documentation updates:

    • Updated the list of artifacts available in the Experiment Summary

    • Clarified language in the documentation for unsupported (but available) features

    • For the Terraform requirement in deployments, clarified that only Terraform versions in the 0.11.x release are supported, and specifically 0.11.10 or greater

    • Fixed link to the Miniconda installation instructions

  • Various bug fixes

Version 1.6.2 LTS (May 10, 2019)

Available here

  • This version provides PPC64le artifacts

  • Improved stability of datatable

  • Improved path filtering in the file browser

  • Fixed units for RMSPE scorer to be a percentage (multiply by 100)

  • Fixed segmentation fault on Ubuntu 18 with installed font package

  • Fixed IBM Spectrum Conductor authentication

  • Fixed handling of EC2 machine credentials

  • Fixed Lag transformer configuration

  • Fixed KDB and Snowflake Error Reporting

  • Gradually reduce number of used workers for column statistics computation in case of failure.

  • Hide default Tornado header exposing used version of Tornado

  • Documentation updates:

    • Added instructions for installing via AWS Marketplace

    • Improved documentation for installing via Google Cloud

    • Improved FAQ documentation

    • Added Data Sampling documentation topic

  • Various bug fixes
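
As an illustration of the RMSPE unit fix listed above, the sketch below computes a root mean squared percentage error and multiplies it by 100 so the result reads as a percentage. The function name is hypothetical, and the formula is the commonly used RMSPE definition rather than a copy of the Driverless AI scorer.

```python
import math


def rmspe_percent(actual, predicted):
    """Root mean squared percentage error, scaled by 100 to read as a percentage.

    Assumes every actual value is non-zero; how the real scorer treats zeros
    is not covered by these notes.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    squared_relative_errors = [((a - p) / a) ** 2 for a, p in zip(actual, predicted)]
    return 100.0 * math.sqrt(sum(squared_relative_errors) / len(squared_relative_errors))


# Both predictions are off by 10% of the actual value, so the result is about 10.
print(rmspe_percent([100.0, 200.0], [110.0, 180.0]))
```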

Version LTS (Apr 24, 2019)

Available here

  • Fix in AWS role handling.

Version 1.6.1 LTS (Apr 18, 2019)

Available here

  • Several fixes for MLI (partial dependence plots, Shapley values)

  • Improved documentation for model deployment, time-series scoring, AutoVis and FAQs

Version 1.6.0 LTS (Apr 5, 2019)

Private build only.

  • Fixed import of string columns larger than 2GB

  • Fixed AutoViz crashes on Windows

  • Fixed quantile binning in MLI

  • Plot global absolute mean Shapley values instead of global mean Shapley values in MLI

  • Improvements to PDP/ICE plots in MLI

  • Validated Terraform version in AWS Lambda deployment

  • Added support for NULL variable importance in AutoDoc

  • Made Variable Importance table size configurable in AutoDoc

  • Improved support for various combinations of data import options being enabled/disabled

  • CUDA is now part of distribution for easier installation

  • Security updates:

    • Enforced SSL settings to be honored for all h2oai_client calls

    • Added config option to prevent using LocalStorage in the browser to cache information

    • Upgraded Tornado server version to 5.1.1

    • Improved session expiration and autologout functionality

    • Disabled access to Driverless AI data folder in file browser

    • Provided an option to filter content that is shown in the file browser

    • Use login name for HDFS impersonation instead of predefined name

    • Disabled autocomplete in login form

  • Various bug fixes

Version 1.5.4 (Feb 24, 2019)

Available here

  • Speed up calculation of column statistics for date/datetime columns using certain formats (now uses ‘max_rows_col_stats’ parameter)

  • Added computation of standard deviation for variable importances in experiment summary files

  • Added computation of shift of variable importances between feature evolution and final pipeline

  • Fix link to MLI Time-Series experiment

  • Fix display bug for iteration scores for long experiments

  • Fix display bug for early finish of experiment for GLM models

  • Fix display bug for k-LIME when target is skewed

  • Fix display bug for forecast horizon in MLI for Time-Series

  • Fix MLI for Time-Series for single time group column

  • Fix in-server scoring of time-series experiments created in 1.5.0 and 1.5.1

  • Fix OpenBLAS dependency

  • Detect disabled GPU persistence mode in Docker

  • Reduce disk usage during TensorFlow NLP experiments

  • Reduce disk usage of aborted experiments

  • Refresh reported size of experiments during start of application

  • Disable TensorFlow NLP transformers by default to speed up experiments (can enable in expert settings)

  • Improved progress percentage shown during experiment

  • Improved documentation (upgrade on Windows, how to create the simplest model, DTap connectors, etc.)

  • Various bug fixes

Version 1.5.3 (Feb 8, 2019)

Available here

  • Added support for splitting datasets by time via time column containing date, datetime or integer values

  • Added option to disable file upload

  • Require authentication to download experiment artifacts

  • Automatically drop predictor columns from training frame if not found in validation or test frame and warn

  • Improved performance by using physical CPU cores only (configurable in config.toml)

  • Added option to not show inactive data connectors

  • Various bug fixes
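
A time-based dataset split, as introduced in this release, can be sketched as follows: sort rows by the time column and cut once, so that every training row precedes every test row. The function name, the dictionary row format, and the `train_fraction` parameter are assumptions for illustration; the notes do not describe the actual splitting logic.

```python
def split_by_time(rows, time_key, train_fraction=0.8):
    """Split rows into (earlier, later) parts ordered by the time column.

    Works for any sortable time values, such as integers, datetime objects,
    or ISO-8601 date strings.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ordered = sorted(rows, key=lambda row: row[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


rows = [
    {"t": 4, "sales": 40},
    {"t": 1, "sales": 10},
    {"t": 3, "sales": 30},
    {"t": 2, "sales": 20},
]
train, test = split_by_time(rows, "t", train_fraction=0.5)
print([r["t"] for r in train], [r["t"] for r in test])
```

Note that this simple cut can place rows with an identical time value on both sides of the boundary; a production split would likely need to handle such ties.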

Version 1.5.2 (Feb 2, 2019)

Available here

  • Added word-level bidirectional GRU TensorFlow models for NLP features

  • Added character-level CNN TensorFlow models for NLP features

  • Added support to import multiple individual datasets at once

  • Added support for holdout predictions for time-series experiments

  • Added support for regression and multinomial classification for FTRL (in addition to binomial classification)

  • Improved scoring for time-series when test data contains actual target values (missing target values will be predicted)

  • Reduced memory usage for LightGBM models

  • Improved performance for feature engineering

  • Improved speed for TensorFlow models

  • Improved MLI GUI for time-series problems

  • Fix final model fold splits when fold_column is provided

  • Various bug fixes

Version 1.5.1 (Jan 22, 2019)

Available here

  • Fix MOJO for GLM

  • Add back .csv file of experiment summary

  • Improve collection of pipeline timing artifacts

  • Clean up Docker tag

Version 1.5.0 (Jan 18, 2019)

Available here

  • Added model diagnostics (interactive model metrics on new test data incl. residual analysis for regression)

  • Added FTRL model (Follow The Regularized Leader)

  • Added Kolmogorov-Smirnov metric (degree of separation between positives and negatives)

  • Added ability to retrain (only) the final model on new data

  • Added one-hot encoding for low-cardinality categorical features, for GLM

  • Added choice between 32-bit (now default) and 64-bit precision

  • Added system information (CPU, GPU, disk, memory, experiments)

  • Added support for time-series data with many more time gaps, and with weekday-only data

  • Added one-click deployment to Amazon Lambda

  • Added ability to split datasets randomly, with option to stratify by target column or group by fold column

  • Added support for OpenID authentication

  • Added connector for BlueData

  • Improved responsiveness of the GUI under heavy load situations

  • Improved speed and reduced memory footprint of feature engineering

  • Improved performance for RuleFit models and enabled GPU and multinomial support

  • Improved auto-detection of temporal frequency for time-series problems

  • Improved accuracy of final single model if external validation provided

  • Improved final pipeline if external validation data is provided (add ensembling)

  • Improved k-LIME in MLI by using original features deemed important by DAI instead of all original features

  • Improved MLI by using 3-fold CV by default for all surrogate models

  • Improved GUI for MLI time series (integrated help, better integration)

  • Added ability to view MLI time series logs while MLI time series experiment is running

  • PDF version of the Automatic Report (AutoDoc) is now replaced by a Word version

  • Various bug fixes (GLM accuracy, UI slowness, MLI UI, AutoVis)
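
The Kolmogorov-Smirnov metric added in this release measures the degree of separation between positives and negatives. A minimal sketch of the usual two-sample statistic follows: the largest gap between the empirical score distributions of the two classes. The function name is hypothetical, and the Driverless AI scorer may differ in details such as tie handling or weighting.

```python
import bisect


def ks_statistic(labels, scores):
    """Largest gap between the empirical CDFs of scores for positives and negatives.

    `labels` holds 0/1 class labels and `scores` the predicted scores.
    Returns a value in [0, 1]; 1 means the classes are fully separated.
    """
    positives = sorted(s for y, s in zip(labels, scores) if y == 1)
    negatives = sorted(s for y, s in zip(labels, scores) if y == 0)
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    largest_gap = 0.0
    for threshold in sorted(set(scores)):
        # Fraction of each class with a score at or below the threshold.
        cdf_pos = bisect.bisect_right(positives, threshold) / len(positives)
        cdf_neg = bisect.bisect_right(negatives, threshold) / len(negatives)
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap


print(ks_statistic([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # fully separated classes
print(ks_statistic([0, 1, 0, 1], [0.3, 0.3, 0.7, 0.7]))  # identical score distributions
```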

Version 1.4.2 (Dec 3, 2018)

Available here

  • Support for IBM Power architecture

  • Speed up training and reduce size of final pipeline

  • Reduced resource utilization during training of final pipeline

  • Display test set metrics (ROC, ROCPR, Gains, Lift) in GUI in addition to validation metrics (if test set provided)

  • Show location of best threshold for Accuracy, MCC and F1 in ROC curves

  • Add relative point sizing for scatter plots in AutoVis

  • Fix file upload and add model checkpointing in Python client API

  • Various bug fixes

Version 1.4.1 (Nov 11, 2018)

Available here

  • Improved integration of MLI for time-series

  • Reduced disk and memory usage during final ensemble

  • Allow scoring and transformations on previously imported datasets

  • Enable checkpoint restart for unfinished models

  • Add startup checks for OpenCL platforms for LightGBM on GPUs

  • Improved feature importances for ensembles

  • Faster dataset statistics for date/datetime columns

  • Faster MOJO batch scoring

  • Fix potential hangs

  • Fix ‘not in list’ error in MOJO

  • Fix NullPointerException in MLI

  • Fix outlier detection in AutoVis

  • Various bug fixes

Version 1.4.0 (Oct 27, 2018)

Available here

  • Enable LightGBM by default (now with MOJO)

  • LightGBM tuned for GBM decision trees, Random Forest (rf), and Dropouts meet Multiple Additive Regression Trees (dart)

  • Add ‘isHoliday’ feature for time columns

  • Add ‘time’ column type for date/datetime columns in data preview

  • Add support for binary datatable file ingest in .jay format

  • Improved final ensemble (each model has its own feature pipeline)

  • Automatic smart checkpointing (feature brain) from prior experiments

  • Add kdb+ connector

  • Feature selection of original columns for data with many columns to handle >>100 columns

  • Improved time-series recipe (multiple validation splits, better logic)

  • Improved performance of AutoVis

  • Improved date detection logic (now detects %Y%m%d and %Y-%m date formats)

  • Automatic fallback to CPU mode if GPU runs out of memory (for XGBoost, GLM and LightGBM)

  • No longer require header for validation and testing datasets if data types match

  • No longer include text columns for data shift detection

  • Add support for time-series models in MLI (including ability to select time-series groups)

  • Add ability to download MLI logs from MLI experiment page (includes both Python and Java logs)

  • Add ability to view MLI logs while MLI experiment is running (Python and Java logs)

  • Add ability to download LIME and Shapley reason codes from MLI page

  • Add ability to run MLI on transformed features

  • Display all variables for MLI variable importance for both DAI and surrogate models in MLI summary

  • Include variable definitions for DAI variable importance list in MLI summary

  • Fix Gains/Lift charts when observation weights are given

  • Various bug fixes

Version 1.3.1 (Sep 12, 2018)

Available here

  • Fix ‘Broken pipe’ failures for TensorFlow models

  • Fix time-series problems with categorical features and interpretability >= 8

  • Various bug fixes

Version 1.3.0 (Sep 4, 2018)

Available here

  • Added LightGBM models - now have [XGBoost, LightGBM, GLM, TensorFlow, RuleFit]

  • Added TensorFlow NLP recipe based on CNN deep learning models (sentiment analysis, document classification, etc.)

  • Added MOJO for GLM

  • Added detailed confusion matrix statistics

  • Added more expert settings

  • Improved data exploration (columnar statistics and row-based data preview)

  • Improved speed of feature evolution stage

  • Improved speed of GLM

  • Report single-pass score on external validation and test data (instead of bootstrap mean)

  • Reduced memory overhead for data processing

  • Reduced number of open files - fixes ‘Bad file descriptor’ error on Mac/Docker

  • Simplified Python client API

  • Query any data point in the MLI UI from the original dataset due to “on-demand” reason code generation

  • Enhanced k-means clustering in k-LIME by only using a subset of features. See The K-LIME Technique for more information.

  • Report k-means centers for k-LIME in MLI summary for better cluster interpretation

  • Improved MLI experiment listing details

  • Various bug fixes

Version 1.2.2 (July 5, 2018)

Available here

  • MOJO Java scoring pipeline for time-series problems

  • Multi-class confusion matrices

  • AUCMACRO Scorer: Multi-class AUC via macro-averaging (in addition to the default micro-averaging)

  • Expert settings (configuration override) for each experiment from GUI and client APIs.

  • Support for HTTPS

  • Improved downsampling logic for time-series problems (if enabled through accuracy knob settings)

  • LDAP readonly access to Active Directory

  • Snowflake data connector

  • Various bug fixes

Version 1.2.1 (June 26, 2018)

  • Added LIME-SUP (alpha) to MLI as alternative to k-LIME (local regions are defined by decision tree instead of k-means)

  • Added RuleFit model (alpha), now have [GBM, GLM, TensorFlow, RuleFit] - TensorFlow and RuleFit are disabled by default

  • Added Minio (private cloud storage) connector

  • Added support for importing folders from S3

  • Added ‘Upload File’ option to ‘Add Dataset’ (in addition to drag & drop)

  • Predictions for binary classification problems now have 2 columns (probabilities per class), for consistency with multi-class

  • Improved model parameter tuning

  • Improved feature engineering for time-series problems

  • Improved speed of MOJO generation and loading

  • Improved speed of time-series related automatic calculations in the GUI

  • Fixed potential rare hangs at end of experiment

  • No longer require internet to run MLI

  • Various bug fixes

Version 1.2.0 (June 11, 2018)

  • Time-Series recipe

  • Low-latency standalone MOJO Java scoring pipelines (now beta)

  • Enable Elastic Net Generalized Linear Modeling (GLM) with lambda search (and GPU support), for interpretability>=6 and accuracy<=5 by default (alpha)

  • Enable TensorFlow (TF) Deep Learning models (with GPU support) for interpretability=1 and/or multi-class models (alpha, enable via config.toml)

  • Support for pre-tuning of [GBM, GLM, TF] models for picking best feature evolution model parameters

  • Support for final ensemble consisting of mix of [GBM, GLM, TF] models

  • Automatic Report (AutoDoc) in PDF and Markdown format as part of summary zip file

  • Interactive tour (assistant) for first-time users

  • MLI now runs on experiments from previous releases

  • Surrogate models in MLI now use 3 folds by default

  • Improved small data recipe with up to 10 cross-validation folds

  • Improved accuracy for binary classification with imbalanced data

  • Additional time-series transformers for interactions and aggregations between lags and lagging of non-target columns

  • Faster creation of MOJOs

  • Progress report during data ingest

  • Normalize binarized multi-class confusion matrices by class count (global scaling factor)

  • Improved parsing of boolean environment variables for configuration

  • Various bug fixes

Version 1.1.6 (May 29, 2018)

  • Improved performance for large datasets

  • Improved speed and user interface for MLI

  • Improved accuracy for binary classification with imbalanced data

  • Improved generalization estimate for experiments with given validation data

  • Reduced size of experiment directories

  • Support for Parquet files

  • Support for bzip2 compressed files

  • Added Data preview in UI: ‘Describe’

  • No longer add ID column to holdout and test set predictions for simplicity

  • Various bug fixes

Version 1.1.4 (May 17, 2018)

  • Native builds (RPM/DEB) for 1.1.3

Version 1.1.3 (May 16, 2018)

  • Faster speed for systems with large CPU core counts

  • Faster and more robust handling of user-specified missing values for training and scoring

  • Same validation scheme for feature engineering and final ensemble for high enough accuracy

  • MOJO scoring pipeline for text transformers

  • Fixed single-row scoring in Python scoring pipeline (broken in 1.1.2)

  • Fixed default scorer when experiment is started too quickly

  • Improved responsiveness for time-series GUI

  • Improved responsiveness after experiment abort

  • Improved load balancing of memory usage for multi-GPU XGBoost

  • Improved UI for selection of columns to drop

  • Various bug fixes

Version 1.1.2 (May 8, 2018)

  • Support for automatic time-series recipe (alpha)

  • Now using Generalized Linear Model (GLM) instead of XGBoost (GBM) for interpretability 10

  • Added experiment preview with runtime and memory usage estimation

  • Added MER scorer (Median Error Rate, Median Abs. Percentage Error)

  • Added ability to use integer column as time column

  • Speed up type enforcement during scoring

  • Support for reading ARFF file format (alpha)

  • Quantile Binning for MLI

  • Various bug fixes
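
The MER scorer listed above is described as a median absolute percentage error. A minimal sketch of that quantity is shown below. The function name is hypothetical, and reporting the result as a percentage is an assumption, since these notes do not state the scorer's units.

```python
import statistics


def median_abs_percentage_error(actual, predicted):
    """Median of |actual - predicted| / |actual|, multiplied by 100.

    Assumes every actual value is non-zero. The median makes the score less
    sensitive to a few large errors than a mean-based percentage error.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    relative_errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * statistics.median(relative_errors)


# Relative errors are 10%, 10%, and 0%, so the median is about 10.
print(median_abs_percentage_error([100.0, 200.0, 400.0], [110.0, 180.0, 400.0]))
```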

Version 1.1.1 (April 23, 2018)

  • Support string columns larger than 2GB

Version 1.1.0 (April 19, 2018)

  • AWS/Azure integration (hourly cloud usage)

  • Bug fixes for MOJO pipeline scoring (now beta)

  • Google Cloud storage and BigQuery (alpha)

  • Speed up categorical column stats computation during data import

  • Further improved memory management on GPUs

  • Improved accuracy for MAE scorer

  • Ability to build scoring pipelines on demand (if not enabled by default)

  • Additional target transformer for regression problems sqrt(sqrt(x))

  • Add GLM models as candidates for interpretability=10 (alpha, disabled by default)

  • Improved performance of native builds (RPM/DEB)

  • Improved estimation of error bars

  • Various bug fixes
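
The sqrt(sqrt(x)) target transformer mentioned above compresses large regression targets before fitting and is undone at prediction time by raising to the fourth power. The sketch below shows the round trip. The function names are hypothetical, and rejecting negative targets is an assumption, since these notes do not say how negative values are handled.

```python
import math


def sqrt_sqrt_transform(target):
    """Apply sqrt(sqrt(x)), i.e. the fourth root, to a non-negative target."""
    if target < 0:
        raise ValueError("this sketch only supports non-negative targets")
    return math.sqrt(math.sqrt(target))


def sqrt_sqrt_inverse(transformed):
    """Map a value in transformed space back to the original target scale."""
    return transformed ** 4


original = 16.0
transformed = sqrt_sqrt_transform(original)
print(transformed, sqrt_sqrt_inverse(transformed))
```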

Version 1.0.30 (April 5, 2018)

  • Speed up MOJO pipeline creation and disable MOJO by default (still alpha)

  • Improved memory management on GPUs

  • Support for optional 32-bit floating-point precision for reduced memory footprint

  • Added logging of test set scoring and data transformations

  • Various bug fixes

Version 1.0.29 (April 4, 2018)

  • If MOJO fails to build, no MOJO will be available, but experiment can still succeed

Version 1.0.28 (April 3, 2018)

  • (Non-docker) RPM installers for RHEL7/CentOS7/SLES 12 with systemd support

Version 1.0.27 (March 31, 2018)

  • MOJO scoring pipeline for Java standalone cross-platform low-latency scoring (alpha)

  • Various bug fixes

Version 1.0.26 (March 28, 2018)

  • Improved performance and reduced memory usage for large datasets

  • Improved performance for F0.5, F2 and accuracy

  • Improved performance of MLI

  • Distribution shift detection now also between validation and test data

  • Batch scoring example using datatable

  • Various enhancements for AutoVis (outliers, parallel coordinates, log file)

  • Various bug fixes

Version 1.0.25 (March 22, 2018)

  • New scorers for binary/multinomial classification: F0.5, F2 and accuracy

  • Precision-recall curve for binary/multinomial classification models

  • Plot of actual vs predicted values for regression problems

  • Support for excluding feature transformations by operation type

  • Support for reading binary file formats: datatable and Feather

  • Improved multi-GPU memory load balancing

  • Improved display of initial tuning results

  • Reduced memory usage during creation of final model

  • Fixed several bugs in creation of final scoring pipeline

  • Various UI improvements (e.g., zooming on iteration scoreboard)

  • Various bug fixes

Version 1.0.24 (March 8, 2018)

  • Fix test set scoring bug for data with an ID column (introduced in 1.0.23)

  • Allow renaming of MLI experiments

  • Ability to limit maximum number of cores used for datatable

  • Print validation scores and error bars across final ensemble model CV folds in logs

  • Various UI improvements

  • Various bug fixes

Version 1.0.23 (March 7, 2018)

  • Support for Gains and Lift curves for binomial and multinomial classification

  • Support for multi-GPU single-model training for large datasets

  • Improved recipes for large datasets (faster and less memory/disk usage)

  • Improved recipes for text features

  • Increased sensitivity of interpretability setting for feature engineering complexity

  • Disable automatic time column detection by default to avoid confusion

  • Automatic column type conversion for test and validation data, and during scoring

  • Improved speed of MLI

  • Improved feature importances for MLI on transformed features

  • Added ability to download each MLI plot as a PNG file

  • Added support for dropped columns and weight column to MLI stand-alone page

  • Fix serialization of bytes objects larger than 4 GiB

  • Fix failure to build scoring pipeline with ‘command not found’ error

  • Various UI improvements

  • Various bug fixes

Version 1.0.22 (Feb 23, 2018)

  • Fix CPU-only mode

  • Improved robustness of datatable CSV parser

Version 1.0.21 (Feb 21, 2018)

  • Fix MLI GUI scaling issue on Mac

  • Work-around segfault in truncated SVD scipy backend

  • Various bug fixes

Version 1.0.20 (Feb 17, 2018)

  • HDFS/S3/Excel data connectors

  • LDAP/PAM/Kerberos authentication

  • Automatic setting of default values for accuracy / time / interpretability

  • Interpretability: per-observation and per-feature (signed) contributions to predicted values in scoring pipeline

  • Interpretability setting now affects feature engineering complexity and final model complexity

  • Standalone MLI scoring pipeline for Python

  • Time setting of 1 now runs for only 1 iteration

  • Early stopping of experiments if convergence is detected

  • ROC curve display for binomial and multinomial classification, with confusion matrices and threshold/F1/MCC display

  • Training/Validation/Test data shift detectors

  • Added AUCPR scorer for multinomial classification

  • Improved handling of imbalanced binary classification problems

  • Configuration file for runtime limits such as cores/memory/harddrive (for admins)

  • Various GUI improvements (ability to rename experiments, re-run experiments, logs)

  • Various bug fixes

Version 1.0.19 (Jan 28, 2018)

  • Fix hang during final ensemble (accuracy >= 5) for larger datasets

  • Allow scoring of all models built in older versions (>= 1.0.13) in GUI

  • More detailed progress messages in the GUI during experiments

  • Fix scoring pipeline to only use relative paths

  • Error bars in model summary are now +/- 1*stddev (instead of 2*stddev)

  • Added RMSPE scorer (RMS Percentage Error)

  • Added SMAPE scorer (Symmetric Mean Abs. Percentage Error)

  • Added AUCPR scorer (Area under Precision-Recall Curve)

  • Gracefully handle inf/-inf in data

  • Various UI improvements

  • Various bug fixes

Version 1.0.18 (Jan 24, 2018)

  • Fix migration from version 1.0.15 and earlier

  • Confirmation dialog for experiment abort and data/experiment deletion

  • Various UI improvements

  • Various AutoVis improvements

  • Various bug fixes

Version 1.0.17 (Jan 23, 2018)

  • Fix migration from version 1.0.15 and earlier (partial, for experiments only)

  • Added model summary download from GUI

  • Restructured and renamed logs archive, and added model summary to it

  • Fix regression in AutoVis in 1.0.16 that led to slowdown

  • Various bug fixes

Version 1.0.16 (Jan 22, 2018)

  • Added support for validation dataset (optional, instead of internal validation on training data)

  • Standard deviation estimates for model scores (+/- 1 std.dev.)

  • Computation of all applicable scores for final models (in logs only for now)

  • Standard deviation estimates for MLI reason codes (+/- 1 std.dev.) when running in stand-alone mode

  • Added ability to abort MLI job

  • Improved final ensemble performance

  • Improved outlier visualization

  • Updated H2O-3 to version

  • More readable experiment names

  • Various speedups

  • Various bug fixes

Version 1.0.15 (Jan 11, 2018)

  • Fix truncated per-experiment log file

  • Various bug fixes

Version 1.0.14 (Jan 11, 2018)

  • Improved performance

Version 1.0.13 (Jan 10, 2018)

  • Improved estimate of generalization performance for final ensemble by removing leakage from target encoding

  • Added API for re-fitting and applying feature engineering on new (potentially larger) data

  • Remove access to pre-transformed datasets to avoid unintended leakage issues downstream

  • Added mean absolute percentage error (MAPE) scorer

  • Enforce monotonicity constraints for binary classification and regression models if interpretability >= 6

  • Use squared Pearson correlation for R^2 metric (instead of coefficient of determination) to avoid negative values

  • Separated HTTP and TCP scoring pipeline examples

  • Reduced size of h2oai_client wheel

  • No longer require weight column for test data if it was provided for training data

  • Improved accuracy of final modeling pipeline

  • Include H2O-3 logs in downloadable logs.zip

  • Updated H2O-3 to version

  • Various bug fixes
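
The R^2 change above can be illustrated by comparing the two definitions on the same data. The coefficient of determination turns negative when predictions are worse than always predicting the mean, whereas the squared Pearson correlation always lies between 0 and 1. The function names below are hypothetical and the code is a sketch of the standard formulas, not the Driverless AI scorer.

```python
def coefficient_of_determination(actual, predicted):
    """Classic R^2 = 1 - SS_res / SS_tot; can be negative for poor models."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def squared_pearson(actual, predicted):
    """Squared Pearson correlation between actual and predicted values."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    covariance = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance ** 2 / (var_a * var_p)


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [4.0, 3.0, 2.0, 1.0]  # perfectly anti-correlated predictions
print(coefficient_of_determination(actual, predicted))  # negative
print(squared_pearson(actual, predicted))               # non-negative
```

The example also shows the trade-off of the squared-correlation definition: predictions that are perfectly anti-correlated with the target still receive the maximum score, because the sign of the correlation is discarded.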

Version 1.0.11 (Dec 12, 2017)

  • Faster multi-GPU training, especially for small data

  • Increase default amount of exploration of genetic algorithm for systems with fewer than 4 GPUs

  • Improved accuracy of generalization performance estimate for models on small data (< 100k rows)

  • Faster abort of experiment

  • Improved final ensemble meta-learner

  • More robust date parsing

  • Various bug fixes

Version 1.0.10 (Dec 4, 2017)

  • Tool tips and link to documentation in parameter settings screen

  • Faster training for multi-class problems with > 5 classes

  • Experiment summary displayed in GUI after experiment finishes

  • Python Client Library downloadable from the GUI

  • Speedup for Maxwell-based GPUs

  • Support for multinomial AUC and Gini scorers

  • Add MCC and F1 scorers for binomial and multinomial problems

  • Faster abort of experiment

  • Various bug fixes
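
For the MCC and F1 scorers added in this release, the binomial case can be sketched directly from confusion-matrix counts. The function names are hypothetical, returning 0.0 for an undefined score is an assumption, and the multinomial variants are not shown.

```python
import math


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient in [-1, 1]; 0 means no better than chance."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Counts taken from a hypothetical confusion matrix.
print(f1_score(tp=40, fp=5, fn=10))
print(matthews_corrcoef(tp=40, tn=45, fp=5, fn=10))
```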

Version 1.0.9 (Nov 29, 2017)

  • Support for time column for causal train/validation splits in time-series datasets

  • Automatic detection of the time column from temporal correlations in data

  • MLI improvements, dedicated page, selection of datasets and models

  • Improved final ensemble meta-learner

  • Test set score now displayed in experiment listing

  • Original response is preserved in exported datasets

  • Various bug fixes

Version 1.0.8 (Nov 21, 2017)

  • Various bug fixes

Version 1.0.7 (Nov 17, 2017)

  • Sharing of GPUs between experiments - can run multiple experiments at the same time while sharing GPU resources

  • Persistence of experiments and data - can stop and restart the application without loss of data

  • Support for weight column for optional user-specified per-row observation weights

  • Support for fold column for user-specified grouping of rows in train/validation splits

  • Higher accuracy through model tuning

  • Faster training - overall improvements and optimization in model training speed

  • Separate log file for each experiment

  • Ability to delete experiments and datasets from the GUI

  • Improved accuracy for regression tasks with very large response values

  • Faster test set scoring - Significant improvements in test set scoring in the GUI

  • Various bug fixes

Version 1.0.5 (Oct 24, 2017)

  • Only display scorers that are allowed

  • Various bug fixes

Version 1.0.4 (Oct 19, 2017)

  • Improved automatic type detection logic

  • Improved final ensemble accuracy

  • Various bug fixes

Version 1.0.3 (Oct 9, 2017)

  • Various speedups

  • Results are now reproducible

  • Various bug fixes

Version 1.0.2 (Oct 5, 2017)

  • Improved final ensemble accuracy

  • Weight of Evidence features added

  • Various bug fixes

Version 1.0.1 (Oct 4, 2017)

  • Improved speed of final ensemble

  • Various bug fixes

Version 1.0.0 (Sep 24, 2017)

  • Initial stable release