Change Log

Version 1.10.7 (January 19, 2024)

  • New Features:

    • Added support for configuring a host name and port for the Snowflake connector when running Driverless AI in Snowpark Container Services.

    • Added recipe support for the Feature Store data connector.

    • Added a new configuration that lets users change the timeout duration when importing data from Hive, HDFS, JDBC, and kdb+ connectors.

    • Added the ability to navigate to the linked project from the experiment details page.

    • Added support for Python 3.12 in h2oai_client.

    • MLI:

      • Added the 2-D Partial Dependence explainer.

      • Added the Friedman H-statistic explainer.

      • Added PDP percentile plot.

      • Added ICE curves in Partial Dependence explainer at every decile of predicted probabilities.
        • This gives an indication of local prediction behavior across the dataset.

  • Improvements:

    • In cases where a given MLI explainer is not available for a particular experiment, the new interpretation page now indicates why that MLI explainer is not available.

    • Expanded DAI Python Client features for MLI. The Python client now supports the generation of plots for the following MLI explainers:

      • Original feature importance

      • Transformed feature importance

      • Original Shapley

      • Transformed Shapley

      • Absolute permutation-based feature importance

      • Relative permutation-based feature importance

      • Random forest feature importance

      • Random forest LOCO

      • Random forest PDP

      • Kernel Shapley

      • Shapley summary

      • Decision Tree

      • NLP Tokenizer

      • NLP VLM

      • NLP LOCO

      • DAI PDP

      • NLP PDP

      • Friedman H-Statistic

  • Bug fixes:

    • Fixed an issue that could prevent GPUs from being detected in Google Cloud Platform.

    • Fixed the import of binary datatable files with date columns.

    • Fixed various UI issues across a subset of MLI explainers.

    • Numerous bug fixes.

Version 1.10.6.2 (October 31, 2023)

  • Improvements:

    • Added a new configuration that lets users use their own service account when connecting to Google BigQuery (GBQ).

    • Added a new configuration that lets users optionally select which service account to impersonate for the Google BigQuery (GBQ) data connector.

    • Added a new configuration to control the experiment Leaderboard access globally for all users.

    • DAI now reports why certain explainers are not enabled in the MLI explainer drop-down list. This is usually determined by the experiment problem type (e.g., image or multinomial).

    • Users are now warned when Shapley values are approximated in MLI.

  • Bug fixes:

    • Fixed the import of binary datatable files with date columns.

    • Fixed improper handling of invalid IDs in the MLI dashboard.

    • Fixed missing prediction values for the MLI time-series actual vs. predicted plot when the target is null or NaN.

    • Fixed an issue where the download of a PNG file in MLI only captured the first page.

    • Fixed an issue where zoom in MLI DT surrogate reset the row search number to undefined.

    • Fixed handling of long y-axis values in MLI Shapley plots.

    • Fixed text wrapping for various MLI explainer docstrings.

Version 1.10.6.1 (September 01, 2023)

  • Improvements:

    • Improved license checks and notification when using an invalid license.

  • Bug fixes:

    • Fixed an issue where progress spinners were not properly disposed of when an export to storage completed.

Version 1.10.6 (August 18, 2023)

  • New Features:

    • Added integration with h2oGPT that lets you optionally generate and view dataset or experiment summaries using a GPT model. For more information, see h2oGPT integration.

    • Added support for Okta SSO authenticator on the Snowflake connector.

    • Added disk usage quotas per user, defined by the new configuration users_disk_usage_quota.
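
A hedged sketch of what this quota might look like in config.toml (the value and its semantics are illustrative assumptions; consult the configuration reference for the exact units):

```toml
# Hypothetical example: cap each user's disk usage.
# The value shown is illustrative only.
users_disk_usage_quota = 0.25
```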

    • Added a time-series data splitter that automatically detects time column candidates.

    • Added support for custom telemetry recipes.

    • Added the ability for admins to see their user identity details when using OPENID authentication.

    • Added storage capacity and memory information to the health API. For more information, see Health API attribute definitions.

    • Added support for a PyTorch backend in Triton server.

    • Added support for sharing image datasets to H2O Storage. For more information, see H2O Storage (remote storage) integration.

    • You can now tag experiments inside Projects that are connected to H2O Storage (remote storage). These tags are also displayed in H2O MLOps. For more information, see Experiment tagging.

    • You can now disable the logout functionality.

    • The experiment setup comparison feature now allows for better side-by-side comparison of list and dictionary configs. For more information, see Experiment setup comparison.

    • The Project page now displays the current number of successfully linked experiments during their upload to H2O Storage (remote storage).

    • You can now connect to two distinct Driverless AI instances from the same browser.

    • MLI:

      • DAI Model Dashboard: You can now view a new DAI model dashboard that provides comprehensive insights into the performance of models created using DAI. For more information, see MLI Dashboard.

      • Performance Charts: You can now view performance charts specifically designed for Decision Tree (DT) and Random Forest (RF) surrogate models. These charts let you evaluate the performance of these models. Note that this feature is not supported for multinomial experiments. For more information, see Surrogate Decision Tree Plot and Random Forest surrogate model plots.

      • NLP LOCO Scoring Pipeline: Added the ability to construct an NLP LOCO scoring pipeline. Note that the NLP LOCO scoring pipeline is not built by default.

      • Expanded DAI Python Client Features. The Python client now supports the generation of plots for individual explainers. Supported explainers include:

        • Partial Dependence Plot.

        • Shapley Summary Plot for Original Features (Naive Shapley Method).

        • Shapley Values for Transformed Features.

        • Shapley Values for Original Features (Naive Method).

        • Surrogate Decision Tree.

      • Dataset Naming in Sensitivity Analysis: You can now assign names to datasets before saving them in Sensitivity Analysis, enhancing organization and usability.

  • Improvements:

    • Missing Value Handling in Partial Dependence: Enhanced partial dependence analysis by treating missing values as a distinct bin. This refinement contributes to more comprehensive insights into model behavior.

    • Categorical Representation in DT Surrogate Model: Improved the display of categorical features in the DT surrogate model when categorical encoding is set to one-hot encoding. This enhances the readability of model representations.

    • Enhanced Shapley Tooltip Documentation: Revamped tooltip documentation for Shapley plots to provide users with clearer and more informative explanations of the displayed information.

    • Improved Shapley Plot Sorting: Shapley plots now order local importances by their absolute value. This ensures that the most important features are prominently displayed, regardless of their positive or negative impact.

    • Run explainers sequentially: Added the ability to run explainers sequentially by setting the config mli_run_explainers_sequentially to True. The default value is False.
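
In config.toml, this setting looks like the following (default noted per the entry above):

```toml
# Run MLI explainers one at a time instead of concurrently.
# Defaults to false (explainers run in parallel).
mli_run_explainers_sequentially = true
```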

    • Show actual and prediction values above the NLP LOCO, VLM, and Tokenizer plots.

    • MLI Notification Center: Improved various UI components.

    • Added the ability to download Detailed View in Summary Shapley plot.

    • Added the ability to convert date features from categorical to numeric in Partial Dependence plot.

    • Improved the ability to run specific explainers in DAI Python client.

    • Improved view of the surrogate model dashboard.

  • Bug fixes:

    • Fixed an issue where time-series properties were not displayed in the Experiment Wizard after making a time-based split of training data.

    • Improved memory consumption when loading a large Parquet file from disk.

    • Improved H2O Storage integration to allow linking of large datasets and experiments in the background.

    • Improved OPENID authentication to add more resiliency in case of network issues.

    • Improved the Wizard for air-gapped environments by including some third-party scripts as static resources.

    • Fixed an issue where the count of original columns reflected in the experiment AutoDoc did not match the number of original columns in the training data when using arbitrary drop columns.

    • Fixed a missing axis label in model diagnostics charts.

    • Fixed several security vulnerabilities.

    • Fixed display of categoricals in global and cluster reason code view in k-LIME.

    • Fixed an issue that caused an MLI local search to block the DAI server.

    • Fixed an issue that could prevent downloading multiple datasets at once using the HDFS connector.

    • Fixed various UI issues across a subset of MLI explainers.

Version 1.10.5.1 (August 11, 2023)

  • Improvements:

    • Added support for Okta SSO authenticator on the Snowflake connector.

    • UI/UX improvements to datasets import panels.

  • Bug fixes:

    • Fixed an issue that prevented experiments from getting deleted if an Nvidia Triton inference server was not yet configured.

    • Fixed a warning message that is displayed in the Microsoft Word AutoReport document.

    • Broadened support for different H2O Feature Store backend versions.

    • Fixed a Hive connector JSON validation that could lead to errors connecting to Hive.

Version 1.10.5 (April 27, 2023)

  • New Features:

    • Added support for NVIDIA Triton Inference Server. Currently, only CPUs are supported. Deployments are not available in H2O MLOps. For more information, see Triton Inference Server deployment.

      • Lets you make zero-click or one-click deployments of experiments to Triton.

      • This feature is currently only available for models that support a C++ MOJO (that is, all models with a Java MOJO, as well as all default BERT NLP and TensorFlow image models).

      • Conveniently lists the estimated P99 latency, throughput, memory and disk usage of the MOJO deployment.

      • Supports both internal (built-in) and external Triton servers.

      • Added a Deployments wizard with list, load, unload, delete, export, and query operations for models deployed to Triton from the Python client and the Wizard.

      • The Deployments listing page now only displays Triton deployments. The previously available methods of creating local deployments have been removed.

    • Experiment Wizard:

      • You can now use interactive plots created with Plotly.

      • You can now split train and test data by a specific date or datetime. A visualization of the split is also displayed. For more information, see Dataset options.

      • The Experiment Wizard now provides detailed control of time-series validation splits. Visualizations of training data, the potential temporal gap until production, and the forecast horizon are now also provided. For more information, see Driverless AI Experiment Setup Wizard.

      • Added the option to select specific leaderboards for IID and time-series experiments.

    • Added support to transform a dataset with the experiment’s fitted pipeline (excluding any models). Note that this new option is distinct from the existing Fit & Transform option.

    • Shift detection is now performed on the final model’s transformed features and target to check generalization.

    • Added an Experiment Results Wizard (beta): Shows several details for a given finished experiment. For more information, see Completed Experiment Actions.

    • Added the Experiment Comparison Wizard for easy comparison of expert settings and knobs. For more information, see Experiment setup comparison.

    • Added training data column stats JSON file to scorers (MOJO and Python).

    • Control runtime more accurately using runtime estimation.

    • Added estimation of CPU memory usage during experiment preview, to help with instance sizing.

    • Added CPU memory usage of C++ MOJO to experiment summary and to Deploy Wizard.

    • Added a page that lets admin users view system logs.

    • Experiments can now be assigned to a project from the experiment page through the Deployments wizard.

    • You can now navigate to an experiment detail in the H2O MLOps app from the experiment page.

    • MLI:

      • Added the ability to filter input data to MLI. For more information, see Filter input data to MLI.

      • Added the ability to save and download filtered dataset(s) in Sensitivity Analysis.

      • Added notification center to MLI UI. For more information, see Understanding the Model Interpretation Page.

      • Added Python client.

      • Enabled Shapley Summary plot for time-series, which displays transformed features instead of original.

      • Added a link to a video tutorial for the following explainers:

        • PDP/ICE.

        • Shapley for transformed features.

        • Shapley for original features.

        • Shapley summary plot.

        • Disparate Impact Analysis.

        • Sensitivity analysis.

        • Surrogate decision tree.

  • Improvements:

    • The health API flag is_idle has been updated to account for large datasets being uploaded from a browser session.

    • Python scorers for lag-based time-series models now keep the target column in the frame to allow test-time augmentation.

    • Details about test-time augmentation are now provided in logs. (That is, the number of newly updated historic values for each time period.)

    • Prediction frames now contain the original target column name in case the target column name contains special characters that require sanitization.

    • Added the ability to ingest pandas sparse frames for pandas .pkl files.

    • Automatically toggle GPU ON/OFF in the experiment setup page based on whether models and transformers perform better on (or must use) GPUs.

    • Reduced memory usage when making test set predictions.

    • Allow control over early_stopping_threshold (relative min_delta) for LightGBM.

    • Added stronger overfit protection for recipe (more_overfit_protection).

    • Added support for unsupervised recipes that handle text columns.

    • Sped up BertModel and BertTransformer when data is text-dominated to avoid unnecessary validation repeats for small data.

    • Sped up MOJO for TextTransformer.

    • Improved C++ MOJO performance under high CPU load.

    • Added a new log tab to allow admin users access to internal services log files.

    • Added support for HTTPS SSL key file when encrypted with a passphrase.

    • When importing an experiment on the Project page, the user is automatically prompted to download its datasets.

    • Added explanatory videos for several MLI explainers.

    • Listing pages now retain previous values for search, pagination and sorting.

    • Optimized the number of default explainers in MLI to reduce runtime and increase clarity.

    • Added support for microseconds in MLI time-series explainer.

    • Enabled Shapley values for MLI TS when only a training dataset is used.

    • Improved categorical handling in the MLI decision tree surrogate model by using one-hot encoding to encode categoricals by default.

    • Improved legends and axis labels in MLI graphs by showing how data is ordered and what data is shown.

    • Added display of target transformation in relevant Shapley plots in MLI.

    • Added better indication of re-rendering cells after changing thresholds in MLI DIA.

    • Added new tile in MLI, 3rd Party Model, when DIA is calculated on an external dataset.

  • Bug fixes:

    • Enabled time-series test time augmentation (TTA) scoring in case target column name contains special characters that require sanitization.

    • Fixed the MOJO acceptance test for time-series experiments with non-target lag features, which was accidentally marking the MOJO as incorrect.

    • Fixed training time and scoring time on the Project page when many experiments are being trained or scored at once; previously, elapsed time since job submission (including wait time) was measured.

    • Fixed creation of MOJO for BERTTransformer with xlm-roberta-base, roberta-base, and camembert-base (non-default) variants.

    • Fixed Java/C++ MOJO for raw integer time columns.

    • Fixed Java MOJO Shapley prediction column names for multiclass problems when target column name contains “.” character.

    • Fixed an issue where columns unavailable at prediction time were not being used except as a rare new feature during evolution.

    • Fixed a race condition when importing a zipped file multiple times simultaneously by the same user.

    • Fixed reporting of fold ID frequencies.

    • Fixed the number of transformations for time-series leaderboard experiments that use target encoding.

    • Avoided excessive core usage for clustering and truncated SVD.

    • Fixed Parquet loading when the file includes an index.

    • Projects can now list more than 100 experiments.

    • Fixed handling of transformed feature names in MLI DT surrogate Python rules.

    • Changed default display of boolean columns from numeric to categorical in MLI PDP.

    • Updated MLI Morris Sensitivity recipe to work with the latest version of the interpret Python package.

    • Fixed re-computation feature of MLI PDP by only showing features that are capable of data type conversions.

Version 1.10.4.3 (January 27, 2023)

  • Improvements:

    • Added support to allow users to generate and use a MapR ticket with different identities.

  • Bug fixes:

    • Fixed a small data leak in the TensorFlowModel for regression where the minimum value of the target in the validation data was learned.

    • Fixed “ratio” time-series target transformer for zero-valued target column.

    • Fixed an issue that could cause multiple start attempts when Driverless AI is used in Spectrum Conductor.

    • Fixed excessive features made for target encode mode for TS leaderboard.

    • Fixed leakage and shift when switching from GLM default model to LightGBM.

    • Fixed the case where included_models is not specified at all in config_overrides.
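
For context, included_models is typically supplied inside an experiment's config_overrides; a hedged sketch (the model identifiers are illustrative and vary by version):

```toml
# Illustrative only -- restrict the experiment to specific model types.
included_models = ["LightGBM", "XGBoostGBM"]
```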

Version 1.10.4.2 (January 11, 2023)

  • Improvements:

    • Added the ability to use self-signed certificate or skip the certificate validation for OPENID authentication.

  • Security updates:

    • Upgrade Torch to version 1.13.1 to fix a security vulnerability.

Version 1.10.4.1 (December 19, 2022)

  • Improvements:

    • Added a change to the health API is_idle flag to account for large datasets being uploaded from a browser session.

    • Added a new routine to clean up old model predictions per user.

    • Added more examples of text handling for expert control of unsupervised models.

    • Added ability to ingest pandas sparse frames for pandas .pkl files.

    • For time series problems, added an option to the Driverless AI Experiment Setup Wizard to manually provide unavailable columns at prediction time.

  • Bug Fixes:

    • Fixed erroneous warning that validation frame has 100% duplicate rows with training data.

    • Fixed unsupervised transformer handling so that internal and recipe transformers can be used for unsupervised models.

    • Fixed race in Experiment Wizard when choosing columns to drop too quickly.

    • Fixed a path parsing issue on the GCS data connector.

    • Fixed several security vulnerabilities.

    • Added handling for “sklearn” dependency deprecation on Python scorers.

Version 1.10.4 (October 13, 2022)

  • New Features:

    • (Experimental) GUI-based Wizards:

      • Experiment Wizard. Configure and start experiments by clicking on a specific dataset and then clicking Predict Wizard. For more information, see Driverless AI Experiment Setup Wizard.

      • Dataset Join Wizard. Join two datasets by clicking on a specific dataset and then clicking Join Wizard. For more information, see Dataset Join Wizard.

      • Leaderboard Wizard. Perform a business value analysis for all models in a project by clicking the Analyze button on the Project page (only for classification experiments). For more information, see Leaderboard Wizard: Business value calculator.

    • Added repeated cross-validation for final ensembles for small data, resulting in improved accuracy.

    • Added the BinnerTransformer for one-dimensional binning of numeric features. Uses tree splits (default) or quantiles to create bins, and can automatically reduce the number of bins based on their predictive power. Given bins, a numeric column is converted into multiple output columns by using either piece-wise linear encoding or binary encoding. For cases where target encoding isn’t allowed due to higher interpretability requirements, the highly interpretable BinnerTransformer can help create more accurate models. Only enabled by default for GLM/TensorFlow/TorchGrowNet and FTRL models at high interpretability.
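
As a rough, self-contained illustration of piece-wise linear encoding over fixed bins (a sketch of the general technique, not DAI's actual BinnerTransformer implementation; the bin edges and values are made up):

```python
import numpy as np

def ple_encode(x, edges):
    """Piece-wise linear encoding: one output column per bin.

    Each column is 0 below its bin, 1 above it, and linear inside it,
    so the encoding is monotone in x and highly interpretable.
    """
    x = np.asarray(x, dtype=float)
    e = np.asarray(edges, dtype=float)
    lo, hi = e[:-1], e[1:]                 # per-bin lower/upper boundaries
    t = (x[:, None] - lo) / (hi - lo)      # relative position inside each bin
    return np.clip(t, 0.0, 1.0)

edges = np.array([0.0, 1.0, 3.0, 10.0])    # 3 bins: [0,1), [1,3), [3,10]
enc = ple_encode([0.5, 2.0, 7.0], edges)   # shape: (3 rows, 3 bins)
```

Binary encoding would instead emit a 0/1 indicator per bin; the piece-wise linear variant keeps within-bin ordering information.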

    • Added improved handling of duplicate rows in training data (after dropping columns to drop). Disabled by default. Option to either drop duplicate rows or convert them into single weighted rows.
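
Conceptually, converting duplicates into single weighted rows looks like this (a generic pandas sketch, not DAI's internal code):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2, 2, 2, 3], "y": [0, 0, 1, 1, 1, 0]})

# Collapse identical rows into one row each, with a weight column
# recording how many copies were merged.
weighted = (df.groupby(list(df.columns), as_index=False)
              .size()
              .rename(columns={"size": "weight"}))
```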

    • Added detection of rows shared between training, validation, and testing datasets after data preparation, before modeling. If undetected, such rows can mislead validation.

    • Added support for prediction intervals for regression experiments in Java MOJO scoring (for both C++ and Java MOJO runtimes).

    • Redesigned Expert Settings page. For details, see expert-settings-navigate.

    • Added user preferences section for per-user data connector setup. For details, see Driverless AI user settings.

    • Added ‘feature_store_mojo’ recipe type to create a MOJO to be used as a feature engineering pipeline in the H2O Feature Store.

    • Added the ability to run Disparate Impact Analysis on external datasets.

    • Improved Shapley plots. For local queries, row and Shapley data are now displayed in tandem.

    • Added the ability to navigate to dataset and experiment in MLI.

    • Consolidated all MLI explainer logs into one zip file for download.

    • Added support for bulk abort of multiple experiments in a project.

  • Improvements:

    • Significantly increased model accuracy for 1-hour runs using 39 datasets from the OpenML AutoML Benchmark, a variety of classification datasets with a range of columns, rows, and classes.

    • Moved execution of Diagnostics, Transform Data and Autoreport to worker node(s) for multinode configurations.

    • Reduced the number of features made during tuning and avoided duplication among tuning trials.

    • To improve the usage of time, poorly performing models are now automatically pruned during tuning, evolution, and before making the final model.

    • Improved usage of One-Hot Encoding, which was previously only used by GLM.

    • Improved usage of TensorFlow model for low interpretability settings. Note: Neither Java MOJO nor Shapley values are currently supported. Only C++ MOJO and Python scoring are supported.

    • Increased the accuracy of max_runtime_minutes by accounting for each model built and which models will be in the final model.

    • Enabled blending in link space for ensembles that include LightGBM with ‘xentropy’ binary objective, so that sum(Shapley values) = logit(preds).

    • Added mini acceptance tests for MOJOs for sum(Shapley values) = preds for regression, and sum(Shapley values) = logit(preds) for binary classification.
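
The identity being checked can be illustrated with a toy linear model, where exact Shapley contributions are simply weight times value (all numbers below are made up):

```python
import math

weights = [0.4, -1.2, 0.7]   # toy linear model
bias = 0.1                   # intercept ("bias" contribution)
row = [1.0, 2.0, 0.5]

# For a linear model, per-feature Shapley contributions are exact:
contribs = [w * v for w, v in zip(weights, row)]
margin = sum(contribs) + bias                 # sum(Shapley values)
pred = 1.0 / (1.0 + math.exp(-margin))        # sigmoid, binary classification
logit = math.log(pred / (1.0 - pred))         # logit(pred) recovers the margin
```

Here logit equals margin up to floating-point error, which is the sum(Shapley values) = logit(preds) relationship the acceptance tests verify.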

    • Increased the amount of model tuning and ensembling for small data for default settings.

    • Sped up small data handling by running more single-core tasks in parallel to use all cores more efficiently.

    • Sped up scoring in the final model by parallelizing over the selected metrics.

    • Sped up XGBoost fitting and predictions by using faster data handling, especially for many rows and columns.

    • Reduced the size of the MOJO zip file, which no longer includes TensorFlow unless needed by the pipeline.

    • Better automatic values for worker_remote_processors and max_cores to improve DAI performance on many-core (32+) systems and allow more experiments to run in parallel.

    • Improved DAI startup speed.

    • Improved the speed at which experiment previews are displayed when many custom recipes are present.

    • Improved reliability of multinode configuration.

    • Separated server logs into their own directory.

    • Faster type detection for text columns.

    • The row duplication check now samples data to avoid using too much memory, controlled by the config.toml value detect_duplicate_rows_max_rows_x_cols.
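
In config.toml this cap might be set as follows (the value is an illustrative assumption, not a recommended default):

```toml
# Max rows x cols sampled for the duplicate-row check (illustrative value).
detect_duplicate_rows_max_rows_x_cols = 10000000
```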

    • Shift detection is now always performed. Previously, shift detection was turned off for low accuracy settings.

    • Improved stability of sampling for leakage detection.

    • Fixed the unsupervised expert setup.

    • Reduced memory usage during MOJO creation.

    • Improved experiment runtime and MOJO size estimation.

    • Improved selection of leaderboard model parameters.

    • Improved speed of C++ MOJO runtime for Shapley predictions for tree algorithms.

    • The Recipe page code editor now shows the position of all instances of the searched string on the scroll bar.

    • Dataset pickers now contain links to the dataset detail page (opened in new tab). The same applies to custom recipes in recipe pickers.

    • When an experiment is created with the New with same settings option, Driverless AI now checks for activity of included recipes.

    • Significant performance improvements to MLI.

    • Disabled Shapley Summary plot for MLI Time Series. This plot only displays original features instead of engineered features, which are more useful for time series applications.

    • MLI compatibility check results are now visible in explainer progress bars.

    • Significantly sped up Shapley calculations when making predictions on the training dataset for MLI and AutoReport.

    • Added logistic regression option for LightGBM and XGBoost.

  • Bug Fixes:

    • Made Java MOJO Shapley for LightGBMDaskModel, XGBoostGBMDaskModel, XGBoostRFDaskModel and XGBoostRFModel consistent with new behavior in 1.10.2+, for unit_box, standardize and center target transformers, to be in target space for better interpretability.

    • Fixed scaling of Shapley values for LightGBM model in ‘rf’ boosting mode.

    • Avoid failing MOJO building when MOJO visualization times out.

    • Fixed several security vulnerabilities.

    • Gracefully handle case when dataset contains a column named ‘bias’, which conflicts with Shapley values.

    • Fixed handling of parent data schema in new, restarted, or refitted experiments to avoid wrong data types.

    • Allow avoidance of sampling of validation set via config.toml value of limit_validation_size.

    • Time-Series: Feature values for TTA-scoring are no longer ignored if they are constant.

    • Time-Series: TTA rows now always get a prediction value (these were previously NaN for scoring method “rolling”).

    • Time-Series: Fixed a bug where the “ratio” target transformer was used if “difference” was selected in expert settings.

    • Time-Series: Fixed a bug where downloaded test set predictions don't match scorer predictions on the same dataset (conditions: target values given for all rows, test set longer than 1 horizon, min lag size < horizon, fast_tta_test=true).

    • Fixed an apparent hang near the end of an experiment caused by a NaN score being passed to the server when a custom recipe scorer produces a bad score.

    • Fixed overly long AutoDoc appendix for certain time series experiments.

    • Fixed unsupervised “new with same settings” workflow.

    • Fixed unsupervised expert setup.

    • Fixed imported individual recipes MODEL ACTIONS -> USE IN NEW EXPERIMENT workflow.

    • Fixed several row querying scenarios in MLI.

    • Fixed blocking of main server tasks (which would affect any UI or client use) by making more MLI tasks asynchronous.

    • Fixed overuse of CPU cores by AutoReport, MLI, and other post-experiment actions that could leave the system and Driverless AI unusable.

    • Fixed deletion of models in multinode configurations to include the workers.

    • Fixed deletion of dataset logs when a dataset is deleted, to improve disk usage.

    • Fixed the experiment preview when reproducible mode is set during restart/refit.

    • Fixed duplicate features leading to gaps between features in UI variable importance panel.

    • Fixed an issue when some actions were not allowed on imported datasets from Storage.

    • No longer performs test-time augmentation (TTA) for MLI and AutoReport when predicting on the training dataset, making Shapley plots significantly faster.

  • Deprecation notice:

    • The Completed experiment -> Deploy functionality will be deprecated in the future.

Version 1.10.3.1 (June 15, 2022)

  • Improvements:

    • Temporary files created by Driverless AI (as, for example, during experiment export and import) are now cleaned up automatically.

    • Introduced a new configuration option to control which attributes are set for HTTP Cookies issued by the Driverless AI web server.

  • Bug Fixes:

    • Fixed a number of security vulnerabilities.

    • Fixed an issue that caused image model scorers to not work when using PyTorch.

    • Fixed an issue with MLI that caused user interface slowdowns when using very large data sets.

Version 1.10.3 (May 02, 2022)

  • New Features:

    • Original Shapley support (per-feature contributions in original feature space) for C++ MOJO runtime and its Python wrapper. For a Python code example, see C++ MOJO runtime Shapley values support. Note that Original Shapley values are already supported by the Python and Java MOJO scoring runtimes.

    • Fundamental change in how ensemble models are internally stored in memory and disk for fitting, predictions, MOJO injection, etc., resulting in overall lower memory footprints. This means that DAI can now handle larger models and/or data than previously possible on machines with the same amount of memory.

    • A new option to terminate a running experiment if DAI instance runs low on memory, instead of letting the system deal with a low-memory situation. For reference see: terminate_experiment_if_memory_low and memory_limit_gb_terminate. This option is off by default.
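
In config.toml, the two settings referenced above might be combined like this (the threshold is an illustrative assumption):

```toml
# Terminate a running experiment instead of riding out a low-memory situation.
terminate_experiment_if_memory_low = true   # off by default
memory_limit_gb_terminate = 8               # illustrative threshold, in GB
```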

    • A new option to control the data schema behavior for restart/refit of existing experiments (resume_data_schema in expert settings). This reuses the same data types for the columns as in the parent experiment. This is on for restart/refit/retrain of models but is set to off for new experiments with same settings option.

    • Support for importing multiple files (folder import) into one dataset that is larger than memory. Assumes that each file can be parsed with available memory, otherwise see option below.

    • Support for ingesting a single file with file size greater than memory by going through disk. For reference see: datatable_parse_max_memory_bytes. This option is off by default.

    • Support for importing datasets with native time columns (previously raised AssertionError ltype.time).

    • For Time series experiments, support for date columns that are unavailable at prediction time.

    • Support to add custom recipes through the GUI code editor.

    • Added a leaderboard to build all built-in unsupervised models on a dataset.

    • Added a new Target Encode Time Series model to the Time series leaderboard (this model obtained first place in March 2022 Kaggle Playground competition).

    • When integration with H2O MLOps is enabled, a new GO TO MLOPS button is now displayed for completed experiments. Clicking this button opens a new browser tab for the H2O MLOps app.

    • New version of OpenID Connect authentication method with automatic provider discovery. For information, see Driverless AI OpenID Connect Authentication.

    • Added ability to show queried/selected row in MLI across all relevant explainers.

    • Added ability to run either default or custom MLI from the Completed Experiment page.

    • Added ability to specify custom bins for MLI PDP.

  • Improvements:

    • Performance:

      • Reduced memory usage:

        • For Time Series experiments with very short prediction periods.

        • During final model building, prediction making, MOJO building, and Python scoring phases. Each base model or fold is now in memory one at a time instead of all together.

        • When importing large files.

      • Reduced the size of the final model artifact for certain cases like Text/Torch/TimeSeries.

      • Increased the speed of final model building by avoiding repeated loading of pickled base models and folds from disk.

      • Increased the speed of experiments on large datasets by pre-calculating dataset statistics at import time.

      • Added the k-nearest neighbors algorithm (k-NN) for GPUs using RAPIDS/cuML. This can be 100x faster than scikit-learn (e.g., as used in this recipe).
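
Conceptually, k-NN finds the k closest training rows for each query row; RAPIDS/cuML accelerates that search on the GPU. A minimal pure-Python brute-force sketch of the computation (illustrative only, not the DAI or cuML implementation):

```python
import math

def knn(train, query, k):
    """Brute-force k-nearest neighbors: for each query point, return the
    indices of the k closest training points by Euclidean distance."""
    results = []
    for q in query:
        dists = sorted((math.dist(q, t), i) for i, t in enumerate(train))
        results.append([i for _, i in dists[:k]])
    return results

train = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (5.1, 5.0)]
print(knn(train, [(5.0, 5.1)], k=2))  # [[2, 3]] -- the two points near (5, 5)
```

GPU libraries replace this quadratic distance loop with batched matrix operations, which is where the large speedups come from.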

      • Leaderboard now launches up to 10x faster when many recipes and experiments are present.

      • The experiment preview is now displayed up to 3x faster when many recipes are present.

      • Increased speed of leak/shift detection for GrowNet.

      • Increased the speed of server start by 100x when many recipes are present and the TOML option contrib_reload_and_recheck_server_start=false is used.
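
A sketch of the corresponding config.toml entry (option name taken from the entry above):

```toml
# config.toml -- skip reloading and re-checking custom recipes at server
# start, which speeds up start-up when many recipes are present.
contrib_reload_and_recheck_server_start = false
```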

      • Increased speed of original and transformed feature importance for MLI.

      • Increased the speed of MLI time series explainer Shapley predictions.

    • Accuracy:

      • Improved robustness of refit/restart of existing experiments against changes of column types in the data.

      • Always do stratified sampling for binary classification problems, even if the imbalance ratio is less than 1:100. This can improve model robustness for small data.
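
Stratified sampling draws from each class in proportion to its frequency, so rare positives are not lost by chance. A minimal pure-Python sketch (illustrative, not DAI's sampler):

```python
import random
from collections import defaultdict

def stratified_sample(labels, frac, seed=0):
    """Return row indices of a sample preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    picked = []
    for idx in by_class.values():
        n = max(1, round(len(idx) * frac))  # keep at least one row per class
        picked.extend(rng.sample(idx, n))
    return sorted(picked)

labels = [0] * 990 + [1] * 10            # 99:1 imbalance
sample = stratified_sample(labels, frac=0.5)
print(sum(labels[i] for i in sample))    # 5 -- half the minority rows survive
```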

      • Improved the numerical robustness of the InteractionTransformer with respect to the scale of the target column.

      • Improved the numerical accuracy of scoring of regression problems during feature evolution for cross-validated models with non-trivial target transformers.

      • Improved numerical precision for BERT models for inference on Ampere GPUs, such that results agree with C++ MOJO (i.e., in reproducible mode, use 32-bit precision instead of 16-bit).

      • For time series problems, the leakage check no longer fails if the test set has only one timestamp.

      • Monotonicity constraints are no longer enforced for multiclass.

      • When running MLI on time series experiments, integer time group columns are now treated as categoricals.

    • Leaderboard:

      • Added other options to the expert panel, such as ‘random’, which runs 10 experiments with different random seeds, and ‘line’, which scans through all transformers for a good model and through all models for the original transformers.

      • Added Unsupervised Models leaderboard.

      • Added a new Target Encode Time Series model to the Time series leaderboard.

    • GUI:

      • Expert Settings:

        • Added expert control over the image auto pipeline building recipe, allowing you to choose architectures and other options.

        • If only one transformer or one model is selected in the expert include lists, it is now allowed without extra TOML options having to be set.

        • You can now control mutations in the Expert Settings window with the mutation_dict TOML option.

      • Recipes:

        • The Individual recipe page now contains a link to the parent experiment.

        • Added the ability to control upgrade of packages for recipes with the new config TOML options swap_package_version and clobber_package_version.

        • On the GUI Recipes page, the in-code search feature can be accessed by pressing Control+F (or Command+F on Mac).

        • The Individual recipe page now has an action button (MORE ACTIONS > NEW EXPERIMENT) that sets up a new experiment and automatically includes the individual recipe.

        • Recipe pickers (expert settings) now contain links to custom recipe details.

      • Project Page:

        • The dataset section of the Project page is now hidden by default. You can toggle the visibility of the dataset section by clicking the Show / Hide Datasets button.

        • On the Project page, you can now link multiple experiments at the same time.

        • The link-experiments-by-dataset functionality on the Projects page is now available by clicking Link Experiments > By Selecting Dataset Used in Experiments. Users are no longer prompted about this when linking a dataset.

    • The main thread is now able to check whether text transformers exist for MLI NLP.

    • Model predictions are now shown in the Random Forest surrogate model PDP plot.

    • Upgraded to latest datatable version for several improvements and bug fixes.

    • Improved error message when importing multi-file datasets with conflicting types where upcasting numeric columns to string columns was required, and added an option to upcast automatically.

    • Improved how old experiments are imported from Storage. Previously imported experiments could still have missing data; in that case, they might need to be re-imported from Storage.

  • Bug Fixes:

    • Fixed target transformer inverse transform operation for multiple folds-repeats in tuning-evolution (does not affect final model).

    • Fixed support for recipes using “all_cols” for min_cols or max_cols. Features are no longer sampled; instead, all columns are passed in.

    • Fixed the expert TOML option cols_to_drop so that it functions again. This option makes it easy to copy-paste a large list of features to drop, or to copy-paste the feature selection report from the logs.

    • Large items are now avoided when logging model parameters.

    • Fixed shift detection to ensure that the chosen model is selected if possible. Previously, a disabled model could have been chosen.

    • Fixed an issue that caused an error to be raised when extra items were present in model hyperparameters.

    • Fixed confusion matrix counts when bootstrap sampling was applied.

    • Fixed rare bug in parsing a table with columns containing only missing values.

    • Exposed the recipe_dict TOML option so that options can be passed down to a recipe.

    • Fixed cross-validation of the meta learner (disabled by default), which led to wrong predictions for many classes or base models.

    • Fixed performance issues for wide data with more than 20k columns.

    • Fixed interpretability=10 models so that the final model does not lose as much accuracy when features have low importance but are important for the score.

    • Fixed predictions of the Constant model created in versions 1.10.1 and earlier. This only affects in-Driverless predictions; the MOJO and Python scorers were correct.

    • Fixed an issue that caused browser memory leaks.

    • Fixed an issue that caused the fast_approx Expert Setting to be ignored for MLI time series explainer Shapley predictions.

    • Fixed an “Invalid data access for a virtual column” error during parsing of files.

    • Fixed the “Operator - cannot be applied to columns of types str32 and float32” error when providing a test set target column of object type instead of numeric type for regression problems.

    • Fixed recipe uploading causing running experiments to fail.

    • Fixed scheduling of MLI tasks to avoid hangs in explainers.

    • Fixed server and worker threading-forking to avoid stalls.

    • Fixed monotonicity constraints for the individual recipe when a mix of automatic and chosen values is used.

    • The correct error message is now shown when a data recipe fails.

    • Fixed text transformer to avoid excessive memory use.

    • Fixed an issue that did not show the proper cluster name for LIME-SUP in MLI.

    • Fixed out-of-sample row queries for LIME-SUP by creating an intermediate scoring step for obtaining tree paths in MLI.

    • Fixed an issue that did not preserve per explainer settings when an MLI experiment is rerun with the same settings (New With Same Parameters).

    • Fixed an issue that caused server hang when a row query failed in MLI.

    • Fixed Naive and Transformed Shapley local search when the dataset has 25,000 or more rows.

    • Fixed Decision Tree surrogate model on residuals when the model has text transformers.

    • Fixed handling of null columns in NLP PD/ICE and LOCO row queries.

    • Fixed various issues related to interpreting predictions from an external model in MLI.

Version 1.10.2 (February 17, 2022)

Available here

  • New Features:

    • Auto-generated, editable Python code for the best models from any experiment pipeline, which can be run standalone or evolved/frozen with other models in the experiment pipeline. For more information, see Custom Individual Recipe:

      • Auto-generates Python code as an “Individual” custom recipe that can be edited and used in new, restarted, or retrained experiments.

      • Allows code-first full control over model type, model hyperparameters, feature types, feature parameters, data science types, forcing-in features, and monotonicity.

      • Supports expert config TOML control, independently from experiment values, to help control models and features.

      • Recipes can be edited and downloaded from custom recipe management.

      • Auto-generates an example Individual recipe that goes beyond the experiment result to show other possible feature choices.

      • Recipes are generated or downloaded from Tune Experiment at the end of the experiment page, or from the experiment listing page by clicking the 3-button widget on the right.

      • Supported for experiments made using DAI 1.7.2 and later.

    • Shapley values for C++ MOJO runtime:

      • The C++ MOJO scorer (CLI and Python wrapper) supports regular predictions and transformed Shapley values (per-feature contributions to predictions).
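
Transformed Shapley values are per-feature contributions that, together with a bias term, sum exactly to the model's raw prediction. A toy pure-Python check of that additivity property for a linear model, where the exact Shapley contribution of feature j is w_j * (x_j - mean_j) (illustrative only):

```python
weights = [2.0, -1.0, 0.5]   # linear model coefficients
means = [1.0, 4.0, 2.0]      # feature means over the training data
row = [3.0, 4.0, 0.0]        # one row to explain

bias = sum(w * m for w, m in zip(weights, means))         # expected prediction
contribs = [w * (x - m) for w, x, m in zip(weights, row, means)]
prediction = sum(w * x for w, x in zip(weights, row))

# The per-feature contributions plus the bias reconstruct the prediction.
print(abs(bias + sum(contribs) - prediction) < 1e-12)  # True
```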

      • Shapley values for original features will be available in an upcoming release.

      • Python scorer and in-DAI scoring of models (after the experiment is completed) use the C++ MOJO for acceleration under certain conditions, now also for transformed Shapley values.

    • Custom Recipe Management (for an in-depth explanation of these features, see Custom Recipe Management):

      • On the Recipe detail page, data recipes can now be applied to datasets (e.g. splitting a dataset) or as a standalone (e.g. to download a dataset). Additionally, preview works the same way as if developing with live code feature on dataset detail.

      • On the Recipes page, you can now apply data recipes to datasets or as a standalone by using the drop-down menu.

      • For the Dataset detail -> Modify by Recipe option, you can now use any data recipe that has already been added to DAI.

      • Custom recipes can now be downloaded from a private Bitbucket repository.

    • Allow download of pretrained BERT models from S3.

    • Added an Admin API for the DAI Python client that lets you manage entities created by other users by providing options for listing, deleting, or transferring them. For more information, see Python Client Admin API.

    • Added H2O Drive data connector for H2O AI Cloud.

    • Added H2O Feature Store data connector.

  • Improvements:

    • Improved Shapley values for linear target transformers such as unit_box, standardize, and center; they are now in the original target space for improved interpretability. Previously, Shapley contributions for those target transformers were in globally scaled/shifted space.

    • Expert settings pickers and multi-pickers can now be opened for running and completed experiments.

    • Improved accuracy for NLP models using the TextTransformer (default); single characters are now tokenized again, as in versions 1.9.x and earlier.

    • For the Download Predictions option, you can now change the name of the generated CSV file. The default file name now includes the dataset name that was used for the prediction. Additionally, the dataset name is shown in the side panel job.

    • Sped up the calculation of monotonicity constraints, and enabled it by default for wide datasets (for high enough interpretability settings).

    • Sped up the calculation of partial dependence for AutoDoc.

    • Sped up the import of binary image bundles when a .csv file is included.

    • Improved editing of lists and dictionaries in expert settings.

    • Improved model and feature tuning stage by ensuring that every model has one case with only default parameters and features (to avoid one or a few speculative features hurting model validation).

    • Support arbitrary isolated Python 3.6, 3.7, 3.8 environments for custom model recipes (Git repo: Model Template, Autogluon Example).

    • Improved the data connectors by exposing more configuration options to the end user.

    • Pass any dropped columns and/or fold and weight columns to custom scorers that require access to the full dataset.

    • Added support for exporting artifacts to HDFS (or MapR HDFS).

    • Added support for H2O MLOps Storage migration.

    • Machine Learning Interpretability (MLI):

      • In the decision tree surrogate model, sample size per split is now displayed.

      • When selecting a terminal node in the decision tree surrogate model, Python code and pseudocode are now displayed.

      • For the decision tree surrogate model, added the option to use different categorical encoding methods.

      • For MLI NLP explainers, in cases where the text does not contain a token of interest, a token is now appended to the text before calculating importance or effects. This methodology applies specifically to ICE and LOCO.

      • Improved local explanations highlighting in MLI NLP VLM.

      • For MLI Time Series surrogate models, transformed feature space is now used by default.

      • Improved MLI Time Series migration capabilities.

      • Enabled MLI decision tree surrogate model for time series models.

      • Added n-gram support for MLI NLP PD/ICE.

      • Added on-demand per-feature categorical and numeric binning calculation for MLI DAI PD/ICE.

      • Histograms in MLI DAI PD/ICE are now enabled by default.

      • Improved MLI DAI PD/ICE progress reporting.

      • Improved binning and out of range values for integer and single value features in MLI DAI PD/ICE.

      • Made various UI/UX improvements to MLI.

  • Bug Fixes:

    • Fixed pipeline visualization, which was sometimes missing connections from features to models.

    • Fixed the leaderboard when the [OFF] setting is selected for the time column; a time series leaderboard is no longer created.

    • Fixed the log_noclip target transformer, and disallowed it for zero-inflated targets to avoid extrapolation into negative values for certain models like GLM and Neural Nets.

    • On the Custom Recipes page, only active recipes can be deactivated.

    • No longer allow client interactions during custom recipe database synchronization.

    • Allow selection of only one unsupervised model in expert settings.

    • Fixed an experiment-level reproducibility issue affecting which features were selected during mutations.

    • Added the quantile evaluation metric for LightGBM in expert settings, to match the quantile regression objective option.

    • Fixed a Java class path issue that can prevent users from importing data from Hive using the JDBC connector.

    • Fixed a number of security vulnerabilities.

    • Fixed reproducibility when sampling transformer types and higher-order interactions.

    • Fixed scorer recipe acceptance testing, which was not running.

    • Fixed the import of binary image bundles with the .tgz extension.

  • Documentation:

    • Added info on enabling the Google BigQuery (GBQ) data connector by setting an environment variable or enabling Workload Identity. For more information, see Google BigQuery Setup.

Version 1.10.1.3 (January 7, 2022)

Available here

  • Bug Fixes:

    • Upgrade log4j-2 in some bundled java packages to version 2.17.1 to mitigate vulnerability discovered in CVE-2021-44832.

Version 1.10.1.2 (December 22, 2021)

Available here

  • Improvements:

    • Allow notification scripts to inherit environment variables from main server using a new config option.

  • Bug Fixes:

    • Upgrade log4j-2 in some bundled java packages to version 2.17.0 to mitigate vulnerability discovered in CVE-2021-45105.

Version 1.10.1.1 (December 14, 2021)

Available here

  • Bug Fixes:

    • Upgrade log4j-2 in some bundled java packages to version 2.16.0 to fully mitigate the risk of the arbitrary code execution vulnerability discovered in CVE-2021-44228.

Version 1.10.1 (November 10, 2021)

Available here

  • New Features:

    • (Experimental) PyTorch based Deep Learning model for tabular data based on a boosting approach (GrowNet).

    • Added the option to download NLP pretrained embeddings from S3.

    • Added MOJO size estimation to the preview.

    • Added ability to control the default knob settings for accuracy, time and interpretability via config.toml and expert settings.

    • Added control over which target transformers to include for target transformer tuning.

    • Added convergence-based early stopping for LightGBM-based models, which can reduce model size.

    • AutoViz recommendations can now be used as feature transformations for experiments.

  • Improvements:

    • Show low-cardinality categorical levels in transformed feature names for OneHotEncodingTransformer.

    • The non-lag-based time series recipe is now the same as the lag-based time series recipe, except that all lag-based transformers are disabled. This allows support for gaps in validation splits, provides improved validation through moving windows, and adds holdout predictions.

    • Automatically perform row sampling for the SILHOUETTE scorer if the dataset size is larger than a configurable threshold, to avoid slowness.

    • Improved experiment runtime estimation for preview.

    • Improved column type detection for the preview so that it is similar to type detection during the experiment, making the feature transformations shown in the preview more accurate.

    • Improved model-transformer detection during the preview so that it more accurately reflects what will happen in the experiment.

    • Improved text detection for Chinese/Korean/Japanese and other languages that use UTF-8 characters.

    • Improved feature engineering and feature evolution for time series.

    • Allow dataset column type override to categorical (‘cat’) independent of cardinality.

    • Improved LightGBM early stopping to stop earlier if validation score does not improve significantly (depending on accuracy dial).
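
Convergence-based early stopping halts boosting once the validation score stops improving meaningfully. A generic sketch of such a stopping rule (the patience and tolerance values are illustrative, not DAI's):

```python
def should_stop(history, patience=3, min_delta=1e-4):
    """Stop when the best validation loss has not improved by at least
    min_delta over the last `patience` rounds."""
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    recent_best = min(history[-patience:])
    return recent_best > best_before - min_delta

# Loss plateaus after round 3, so training would stop here.
print(should_stop([0.50, 0.40, 0.35, 0.34999, 0.34998, 0.34997]))  # True
print(should_stop([0.50, 0.40, 0.30, 0.20, 0.10, 0.05]))           # False
```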

    • Allow disabled custom recipes to be loaded and edited in custom recipe management.

    • Improved native support on RedHat and other platforms by being less dependent on system libraries.

    • Added capture of non-Python errors, so sharing server logs for support is no longer required for supervised or unsupervised experiments.

    • Added better small data support by only target encoding features with strong signal.

    • Added support for “more_overfit_protection” pipeline building recipe, which (for any data size) only target encodes features with strong signal.

    • Improved custom recipe acceptance testing to check for disallowed global imports (e.g. of XGBoost, LightGBM, Torch, CuPy, cuDF, etc.).

    • Support custom recipes via zip archives that contain a base wrapper as the main recipe, with support files in a sub-folder (e.g., Torch can be imported globally if the wrapper imports the sub-folder locally).

    • Avoid wrong date/datetime detection for columns containing strings and large integers.

    • Validate TOML options and no longer allow errors to be ignored.

    • Prevent deletion of datasets used as test or validation sets.

    • Upgraded Java to OpenJDK 10.

    • Updated XGBoost to support NVIDIA K80+ again (i.e. CUDA compute capability 3.5+).

    • Allow model recipes to be run in Python environment independent from DAI environment.

    • Added ability to specify the dataset location when using the Google BigQuery (GBQ) connector.

    • Fixed a number of common vulnerabilities (CVE and PRISMA).

  • Bug Fixes:

    • Fixed Python scoring by limiting packages installed. Works on base Ubuntu, UBI-8, or CentOS systems. See documentation for additional steps.

    • Fixed missing ImageVectorizer Transformer on systems with multiple GPUs.

    • Fixed the MOJO cloud deployment to AWS Lambda.

    • Fixed import of image archives created on Apple macOS systems.

    • Fixed DATA logging level anonymization.

    • Fixed custom recipe management activation choices for child experiments.

    • Fixed an issue where the presence of custom recipes led to the absence of internal recipes.

    • Fixed bootstrap sampling estimates.

    • Fixed Python scoring for Dask-based models.

    • Fixed OpenCL (for LightGBM) on native systems. See documentation for additional steps.

    • Fixed the pre-transformer list so that it is not reset to all transformers during an experiment.

    • Fixed LightGBMDask appearing when the model should be LightGBM when wide rules are triggered.

    • Fixed Shapley values for ConstantModel in Java MOJO runtime.

    • Fixed the %Y date format in the C++ MOJO runtime.

    • Fixed tab clicks in Internet Explorer 11.

    • Fixed pipeline visualization for tree models if features have numeric suffixes.

    • Fixed a shape mismatch for binary custom scorers.

    • Fixed ingestion of Parquet files with integer columns containing missing values.

Version 1.10.0 (September 29, 2021)

Available here

  • New Features:

    • Built on top of the latest stable versions of all major open-source packages.

      • Updated to Python 3.8, supporting faster pickle protocol 5.

      • Updated to Torch 1.9.0 and TensorFlow 2.4.2.

      • Updated to NVIDIA RAPIDS 21.08, supporting GPU-based target encoding, UMAP, TSNE, RF.

      • Updated to CUDA 11.2.2, supporting Ampere-based NVIDIA GPUs. Requires NVIDIA CUDA Driver 470 or later.

      • Updated XGBoost, LightGBM, datatable, Pandas, scikit-learn and many more.

      • Support Ubuntu 20.04 for DEB/TAR-SH deployments and CentOS 8 for RPM.

    • Custom Recipe Management

      • Custom recipes are versioned.

      • Lets you activate or deactivate custom recipes.

      • Lets you add a note to each recipe.

      • Adds a visual code editor.

      • Makes previous recipe versions accessible.

    • Experiment Export/Import

      • Experiments can be downloaded and uploaded as binary files.

      • Supports both new and migrated experiments from DAI 1.8.x and 1.9.x.

      • Supports experiments with custom recipes.

    • Support Shapley values for original features in Java MOJO runtime.

    • (Experimental) Automatic Unsupervised Machine Learning. Supports clustering, dimensionality reduction, outlier detection and full support for custom recipes. Includes automatic hyper-parameter optimization and feature selection for clustering, and visualizations for centroids.

    • Force-in-Feature control. Specific features can now be forced into the model without modification. For more information, see cols_to_force_in in feature expert settings.

    • Added fast approximation for regular predictions (in addition to fast approximation for Shapley values). Enabled by default for MLI/AutoDoc, disabled by default for other clients. Extent of approximation can be fully configured/disabled. Can result in significant speedup for large prediction tasks like creation of partial dependence plots and MLI in general. For more information, refer to the FAQ question on fast approximation.

    • Automatically create labels for predictions for classification problems; a predicted label column is appended at the end of the prediction frame.
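
For classification, the appended label is the class with the highest predicted probability; conceptually (the class names and probabilities here are illustrative):

```python
probs = [
    [0.7, 0.2, 0.1],   # per-class probabilities for row 1
    [0.1, 0.3, 0.6],   # per-class probabilities for row 2
]
classes = ["setosa", "versicolor", "virginica"]

# Append a predicted-label column: argmax over the per-class probabilities.
labels = [classes[max(range(len(p)), key=p.__getitem__)] for p in probs]
print(labels)  # ['setosa', 'virginica']
```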

    • Added the Health API, providing system metrics and resource utilization overview.

    • Added improved support for imbalanced multiclass problems with LightGBM. Can help when confusion-matrix based scorers are optimized (such as MacroF1).

    • Added new metrics for classification: MacroF1 and MacroMCC. Macro scorers average the per-class scores, while micro scorers average the per-row scores. MacroF1 is used by default for imbalanced multiclass problems. MacroF1/MCC is the same as F1/MCC for binary problems.
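
The macro/micro distinction matters for imbalanced data: a model that only predicts the majority class can still score well on micro-averaged metrics, while macro F1 exposes the failure. A pure-Python toy example (illustrative):

```python
def f1_for_class(y_true, y_pred, cls):
    """Per-class F1 from true positives, false positives, false negatives."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

y_true = [0] * 8 + [1, 2]          # imbalanced: class 0 dominates
y_pred = [0] * 10                  # model predicts only the majority class

classes = sorted(set(y_true))
macro_f1 = sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # accuracy

print(round(micro_f1, 3), round(macro_f1, 3))  # 0.8 0.296 -- macro reveals the gap
```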

    • Added more details about all models that were fitted during the experiment (in Scores tab and in experiment artifacts).

    • MLI Features:

      • Added new UI for MLI time series with a focus on UX.

      • Enabled Sensitivity Analysis for time series models.

      • Enabled Disparate Impact Analysis for time series models.

      • Enabled Surrogate models for time series models.

      • Enabled Partial Dependence/ICE for time series models.

      • Enabled original feature importance for time series models.

      • Added human-friendly descriptions of transformed features in MLI TS explainer UI.

      • Added MLI expert setting mli_fast_approx to speed up predictions with fast approximation.

      • Added Vectorizer + Linear Model (VLM) explainer for NLP experiments.

      • Added the ability to use Vectorizer + Linear Model (VLM) to create tokens for surrogate models in addition to TF-IDF.

      • Added partial dependence for NLP text tokens.

      • Added multinomial support for MLI NLP explainers.

      • Added text sample views for local NLP explanations in MLI. For more information, see NLP Plots.

      • Added English stop words for MLI NLP tokenizer and tokenizers used by NLP explainers.

      • Added ability to download TF-IDF matrices in MLI.

  • Improvements:

    • Improved default leaderboard of experiments. Covers a broader range of useful experiments.

    • Added another automatic leaderboard for time-series experiments to create a separate model for each prediction period (in addition to the diverse default leaderboard).

    • Use PyTorch Lightning framework for BERT Models and Transformers. Leads to faster training and better memory handling.

    • Improved parallelization of BERTTransformer on multi-GPU machines.

    • Reduced memory usage for text transformers.

    • Support arbitrary isolated Python 3.6, 3.7, 3.8 environments for custom recipes using wrap_create decorator (Git repo: Any Env).

    • Preview shows any input features not covered by chosen transformer-model combination.

    • Preview shows if MOJO is supported by chosen transformer-model combination.

    • Improved automatic handling of zero-inflated distributions.

    • Improved handling of time columns with %Y and %Y%m formats.

    • Improved splitting for datasets containing images; each split now has a copy of all local images (instead of just references).

    • Various improvements to the feature evolution algorithms.

    • More explanation tooltips added across the application.

    • Improved runtime estimation for experiment preview.

    • Improved heuristic for default experiment settings.

    • Improved sanitization of column names.

    • Optimized MLI partial dependence explainer in terms of speed.

    • Improved categorical handling in MLI’s Decision Tree surrogate model.

    • Various UI/UX/performance improvements to MLI.

    • Improved server performance and responsiveness when many tasks (experiments, MLI, etc.) are running.

    • Gracefully handle failures of custom scorers without failing the experiment.

    • Improved hardware utilization across the board.

    • Improved clean-up of experiment temporary files after an experiment finishes, so server start-up can avoid long clean-up on slow disks.

    • Made the storage gRPC message limit configurable.

  • Bug Fixes:

    • Fixed segfault during file import for datasets with many large similar strings.

    • Fixed slow MOJO generation for wide datasets.

    • Fixed MOJO for text-based transformers and models for UTF8 characters.

    • The max_feature_interaction_depth expert setting is now applicable for all transformers.

    • Fixed incorrect Shapley bias terms in Java MOJO runtime for XGBoost regression models with exponential link functions (Poisson/gamma/Tweedie/CoxPH).

    • Fixed number of cores used for prediction by XGBoost and LightGBM to avoid excessive core usage during scoring.

    • Allow One Hot Encoding to be used for any model.

    • Fixed exclusive_mode moderate and max modes, for use on isolated systems to maximize core usage.

    • Fixed runtime data recipes to be properly used by transform dataset and MLI.

    • Fixed use of model tuning (params_tune) and override parameters.

    • Fixed automatic type casting of integer columns into string columns during scoring (avoids conversion to float first).
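
The fix matters because routing an integer through float changes its string form; a tiny illustration:

```python
val = 42
print(str(float(val)))  # '42.0' -- casting through float corrupts the token
print(str(val))         # '42'   -- direct int -> str preserves the value
```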

    • Fixed the Optuna genetic algorithm choice for a larger variety of expert choices.

    • Fixed feature selection by permutation importance for wide data with categoricals.

Version 1.9.3.1 (August 5, 2021)

Available here

  • New Features:

    • Added support for storing sensitive or secure configuration information in a keystore. For more information, see Configuration Security.

  • Improvements:

    • Improved Hive connector to no longer require a jaas.conf file when using Kerberos authentication with impersonation. If the jaas.conf file is provided, the Hive connector will use it, otherwise it will construct the configuration details based on information provided in hive_app_configs.

  • Bug Fixes:

    • Fixed distribution shift detection for time series models.

    • For zero-inflated distribution, fixed log printout of non-zero count in the target column.

    • Fixed confusion matrices for very small observation weights (disabled rounding to nearest integer).

Version 1.9.3 (June 3, 2021)

Available here

  • Improvements:

    • Added ability to specify which project to use when using the Google BigQuery (GBQ) connector

    • Improved MOJO batch scoring to avoid excessive memory usage for text features

    • Updated equivalent public recipes repository to fix potential FBProphet package installation errors

    • Improved column type detection for text columns

    • Reduce memory usage

    • Added support for Ubuntu 20.04 LTS

    • Added support for IBM Power

  • Bug Fixes:

    • Fixed creation of MOJO pipeline for features only (make_mojo_scoring_pipeline_for_features_only=true)

    • Fixed a segfault issue that sometimes occurred when using MLOps storage

    • Fixed the computation results in custom Autoviz boxplots

    • Fixed a license not found issue when using a local rest server deployment

    • Fixed a login issue when using Internet Explorer 11

    • Fixed an issue where AWS Lambda resources were not cleaned when a Lambda deployment was deleted from the UI

    • Fixed an issue when importing a folder with empty directories in Azure Blob Storage connector

    • Fixed a CSV writing issue with JDBC and Hive connectors that sometimes happened when importing large datasets with large text fields

    • Fixed overly high CPU memory usage for BERTTransformer on datasets with a large number of rows when running on GPUs

Version 1.9.2.2 (April 7, 2021)

Available here

  • Bug Fixes:

    • Fix creation of MOJO pipeline for features only (make_mojo_scoring_pipeline_for_features_only=true)

Version 1.9.2.1 (April 2, 2021)

Available here

Version 1.9.2 (March 8, 2021)

  • New Features:

    • Optuna for model hyperparameter tuning as choice for genetic algorithm in expert options

    • Show size of largest transformers in logs

    • Optional stacking meta learner for final ensemble (“ExtraTrees” LightGBM model instead of linear blender), with optional cross-validation

    • Optimize fold splits and show Kolmogorov-Smirnov statistics for target variable across folds

  • Improvements:

    • Optimize for wide data with more columns than rows

    • Improve fold splits for regression problems

    • Reduce choice of tuned target transformers for regression problems with higher interpretability settings

    • Disable auto-tuning of target transformations for regression problems unless interpretability <= 5 and accuracy >= 5 (facilitates interpretation of Shapley values)

    • Use GPU(s) more for XGBoost algorithm, improve memory requirements estimation

    • Improved genetic algorithm for feature evolution

    • Updated the project page view

    • Expert settings items now show an additional description on mouse hover

    • Use MOJO for PDP / ICE calculations

    • No longer re-create MOJO when starting MLI experiments

  • Bug Fixes:

    • Disable GPU-based NLP and Image recipes for Ampere-based GPUs (due to software backward incompatibility of Ampere GPUs); automatically fall back to CPU for Image and BERT transformers and TensorFlow models

    • Fix C++ MOJO segfault

    • Remove temporary files left behind by AutoReport

    • Various bug fixes

Version 1.9.1.3 (Feb 27, 2021)

Available here

  • New Features:

    • Added support for Keycloak authentication on the Azure data connector

  • Bug Fixes:

    • Fixed a race condition when starting the docker container from Steam

Version 1.9.1.1 (Feb 21, 2021)

Available here

  • New Features:

    • Added support for the H2O.ai License Manager (beta)

    • Added support for unseen values in MLI partial dependence plots

    • Added ability to download the new Python client from a local path for air-gapped installs (also exposed the new Python client for download from the UI)

  • Improvements:

    • UI/UX improvements to MLI:

      • Removed abbreviations from MLI explainer tile names

      • Improved local explanation and row searches in MLI

      • Improved MLI explainers error handling

      • Changed MLI explainers log levels from DEBUG to INFO

    • Improved logging for BERT migration

    • Various documentation updates

  • Bug Fixes:

    • Fixed the “New with same params” option in MLI

    • Fixed feature selection for PD/ICE MLI explainer to not include categorical features

    • Fixed the MLI explainers log display to not be truncated

    • Fixed the MLI on-demand engine invocation so that it can reuse parent’s explainer artifacts

    • Various MLI UI fixes

    • Various MLI explainers fixes

    • Fixed the outliers display in Autoviz

    • Fixed the None values interpretation in parquet files

    • Various package vulnerabilities fixes (CVE)

    • Fixed creation of too small time-series validation splits for newly introduced validation scheme (Time-series expert settings)

Version 1.9.1 (Jan 15, 2021)

  • New Features:

    • Automatically use MOJO for predictions after experiment is completed (now uses MOJO scoring pipeline for Predictions, MLI, Autoreport, Diagnostics and Python scoring pipeline if available and applicable)

    • Added Shapley values for original features to Python scoring pipeline and GUI/client scoring (under Model Actions)

    • Ensemble blending is now performed in link space by default (such that the logistic or softmax of the Shapley sum equals the probabilities). For regression, the identity_noclip target transformation achieves the same.

    • Built-in recipe for monotonic GBM on original numeric features

    • Let the user drop features with weak correlation to the target when monotonicity constraints are enabled (monotonicity_constraints_drop_low_correlation_features)

    • Added ability to run and configure AutoDoc from the MLI recipe selection page with the option to include k-LIME and/or Decision Tree Surrogate explainers in the AutoDoc

    • Show first tree for all LightGBM and XGBoost models in MOJO visualization (not just for DecisionTree)

    • Show size of tree models in MOJO visualization

    • Allow creation of MOJO for engineered features only, which runs only the pipeline transform() without model predict() (experimental)

    • AutoDoc can now be configured to include information about Shapley Values for Original Features, Monotonicity Constraints, and Imbalanced Models

    • Implemented detection of string columns that contain a high percentage of numeric values, and added expert setting to enable auto-conversion

    • Global task list displaying all running jobs (Resources -> System Info -> Workers Activity -> CPU/GPU Experiments)

    • MLI Features:

      • Support for Bring Your Own Recipe (BYOR) / Custom Recipes for MLI (Git repo: Responsible ML)

      • Exposed sampling parameter for all explainers in MLI expert settings

      • Added MOJO support for k-LIME (with download option). See Download k-LIME MOJO Reason Code Pipeline

      • Added ability to download raw k-LIME data from MLI UI. See Download LIME Reason Codes

      • Added ability to change threshold for Disparate Impact Analysis in DIA expert settings

      • Added ability to run PDP on out of range data, which a user can specify in MLI recipe expert settings

      • Added max runtime parameter to Kernel Shapley in MLI expert settings. To access it, enable the Original Kernel SHAP recipe, enable Kernel Explainer to Obtain Shapley Values for Original Features, and toggle the max runtime from the MLI expert settings.

      • Added ability to run PD/ICE for multinomial models in DAI

      • Added ability to run MLI TS in typical MLI view (IID)

      • Added ability to see rules in Decision Tree surrogate model

    • Dask/RAPIDS multi-GPU/multi-node training (beta)

    • Time-Series:

      • Improved validation scheme for short forecast horizons (Time-series expert settings)

      • Greatly improved speed of creating back-testing holdout predictions

      • New DateTimeDiffTransformer for automatic feature engineering based on temporal differences between date/time columns

      • Improved dropout logic used for LagsTransformer

      • LagsTransformer is now aware of features that are known ahead of time. This allows creating smaller-than-horizon lags for them.

      • Added user-controllable pools of lag sizes for each of the following types of features: target, non-targets that are unknown ahead of time and non-targets that are known ahead of time

      • Expert setting value “[0]” for lag sizes can now be used to disable lags for the corresponding group of features

      • Added option for automatic selection of date/datetime transformations to avoid unseen values in the future (Time-series expert settings)

      • Added option to use fixed-size train timespans during internal validation (Time-series expert settings)

      • Added check for time invariance of lag features per sub-series to avoid redundancy
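
The numeric-string detection feature above (string columns containing a high percentage of numeric values) can be illustrated with a toy heuristic. This is only a hedged sketch of the idea, not DAI's actual implementation; the function name and the 0.9 threshold are made up for illustration:

```python
def mostly_numeric(values, threshold=0.9):
    """Toy heuristic: True if at least `threshold` of the non-missing
    string values parse as numbers (not DAI's actual implementation)."""
    non_missing = [v for v in values if v not in (None, "")]
    if not non_missing:
        return False
    numeric = sum(1 for v in non_missing if _parses_as_float(v))
    return numeric / len(non_missing) >= threshold

def _parses_as_float(v):
    # A value counts as numeric if float() accepts it (covers "3e2", "2.5", etc.)
    try:
        float(v)
        return True
    except (TypeError, ValueError):
        return False

print(mostly_numeric(["1", "2.5", "3e2", None]))  # True: 3 of 3 non-missing parse
print(mostly_numeric(["1", "x", "y", "z"]))       # False: only 1 of 4 parses
```

A column passing such a check would then be auto-converted to numeric when the corresponding expert setting is enabled.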

  • Improvements:

    • Significant performance improvements, reduced latency for subprocess communication and faster experiments for small data

    • Significant UI/UX improvements to MLI

    • Improved test coverage for custom recipe acceptance tests

    • Improved performance of TF-IDF based text transformers: lowered memory footprint, increased speed, and implemented user control for vocabulary size

    • Improved performance and accuracy of RuleFit model

    • Improved automatic time-series leaderboard (builds 10 experiments, and can be run iteratively to get interactions of optimal expert settings)

    • Improved performance of MLI by using MOJO for PDP and ICE etc.

    • Residuals in sensitivity analysis are now logloss residuals for binomial classification and squared residuals for regression

    • Improved MLI for NLP by adding the ability to backtrack all tokens to their respective column(s)

    • Disable zero-inflated models for regression when only constant non-zero target values are present

    • Improved handling of sparse target class distributions for experiments stratified by fold_column

    • Improved genetic algorithm tournament mode defaults

    • Disable feature brain by default for new experiments

    • Upgraded XGBoost to version 1.4.0

    • Upgraded datatable

    • Upgraded many Python packages

    • Improved logging of model fitting and predictions

    • UI/UX improvements to Dataset Details page, adding dataset actions, data recipe autosave and download

    • Various Web GUI UI/UX improvements and fixes

  • Bug Fixes:

    • Honor fast approximation settings (enabled by default) for LightGBM Shapley contributions; previously this led to slow final-model holdout predictions for time series

    • Fixed in-GUI/client scoring for experiments containing LightGBM models that were created in 1.7.1/1.8.0 (scoring artifacts not affected)

    • Fixed MOJO for (non-default) regression objectives for XGBoost and LightGBM: Gamma, Tweedie, Poisson, CoxPH

    • Various other migration fixes for models created in 1.7.1+

    • Honor expert settings for DecisionTreeModel (such as max depth etc.)

    • Feature brain related fixes: all imported models are freshly scored at start and more conservative selection for time-series experiments

    • Various Autoviz fixes, including incorrectly high correlation for categorical features and other small bugs

    • Various bug fixes

Version 1.9.0.6 (Dec 22, 2020)

Available here

  • Bug Fixes:

    • Fixed an issue that caused columns marked as being unavailable at prediction time to be dropped when leakage was detected

    • Fixed row querying on demand for out-of-sample data in MLI

    • Fixed failure during final test set scoring for time-series experiments with test set containing partially missing target values

Version 1.9.0.5 (Dec 09, 2020)

Available here

  • New features:

    • Added k-LIME MOJO

    • Added ability to copy/paste data from Shapley plots

    • Added ability to select features for PD/ICE in MLI expert settings

    • Added ability to select the feature type in MLI expert settings, i.e., to specify which features should be treated as categorical, numeric, etc.

    • Sensitivity analysis now calculates logloss residuals for classification and squared residuals for regression

  • Improvements:

    • Improvements in Shapley visualizations

    • DAI PDP features now preserve the order from the feature importance JSON file rather than alphabetical order

    • Improvement in DAI brain re-scoring determination

  • Bug Fixes:

    • Fix MOJO for ZeroInflated models when target transformer is not identity

    • Various MLI fixes

Version 1.9.0.4 (Oct 13, 2020)

  • Bug Fixes:

    • Sped up application startup by optimizing database integrity startup checks

    • Fix file system artifact export

Version 1.9.0.3 (Sep 28, 2020)

Available here

  • New Features:

    • Added holiday calendar for 24 more countries, allow user to select list of countries to create is-holiday features for time series experiments

    • Support rhel8-like systems

    • Introducing an option to log in using the JWT token injected by the reverse proxy

    • Allow user to specify data delimiter/separator from configuration (datatable_separator; see the config.toml file)

  • Improvements:

    • Added an option to skip https certificate verification on MinIO connector

    • Locales and language pack improvements

    • Improved logging for connectors

    • Improved logging of sensitive data from OIDC

  • Bug Fixes:

    • Various MLI fixes

Version 1.9.0.2 (Sep 8, 2020)

Available here

  • Improvements:

    • Enable GPU support for PyTorch (BERT) models on IBM Power

    • Allow specification of destination file path for downloads from Python client

    • Enable large data upload for R client

  • Bug Fixes:

    • Fix OpenID and TLS login redirection when deploying behind reverse proxy

Version 1.9.0.1 (Aug 10, 2020)

Available here

  • Bug Fixes:

    • Fix migration for certain time-series experiments

    • Fix missing files for automatic image model

    • Fix MLI job status for PDP/ICE

    • Fix handling of ID column for MLI kernel shapley

    • Fix exception handling for startup failures

    • Constrain Python environment for standalone scoring package

Version 1.9.0 (July 27, 2020)

Available here

  • New Features:

    • Multinode training (alpha)

    • Queuing of experiments to avoid system overload

    • Automatic Leaderboard: Single-button creation of a project with a series of diverse experiments

    • Multi-layer hierarchical feature engineering:

      • Allow optional pre-processing layer for specific custom data cleanup/conversions

      • Subsequent layers take each previous layer’s output as input (can be numeric or categorical/string)

    • PyTorch deep learning backend in addition to TensorFlow

    • Image classification and regression with pre-trained and fine-tuned state-of-the-art Deep Learning models:

      • Image data ingest from binary archives

        • Archives can contain (one) optional .csv file with mapping of image paths to target (regression/classification)

        • Automatic training dataset creation and label creation (from directory structure) if no .csv provided

      • Image Transformers (for converting image path columns into numeric features)

        • “densenet121”, “efficientnetb0”, “efficientnetb2”, “inception_v3”, “mobilenetv2”, “resnet34”, “resnet50”, “seresnet50”, “seresnext50”, “xception”

        • Optional fine-tuning

        • Optional GPU acceleration (strongly recommended when enabling fine-tuning)

        • Pretrained and fine-tuneable ImageVectorizer transformer with automatic dimensionality reduction

        • Images can be provided either as zipped archives, or as paths to local or remote locations (URIs)

        • Automatic image labeling when importing zipped archives of images (based on folder names and structure)

        • Can handle multiple image columns with URIs in a tabular dataset

        • Single experiment can combine image, NLP and tabular data

        • MOJO support (also for CPU-only systems)

      • Automatic Image model

        • End-to-end model training, no tuning needed

        • State-of-the-art results with grandmaster techniques

        • Neural architecture search based on pretrained and fine-tuned TensorFlow models

        • Multi-GPU training

        • Visual insights in GUI (losses, sample images, augmentation, Grad-CAM visual explanations)

      • MLI is not available for image experiments and is a work in progress

    • PyTorch BERT NLP pre-trained and fine-tuned state-of-the-art Deep Learning models:

      • “bert-base-uncased”, “distilbert-base-uncased”, “xlnet-base-cased”, “xlm-mlm-enfr-1024”, “roberta-base”, “albert-base-v2”, “camembert-base”, “xlm-roberta-base”

      • Optional GPU acceleration (strongly recommended)

      • MOJO support (also for CPU-only systems)

      • BERT transformers (for converting text columns into numeric features for other models like GBMs)

      • BERT models (when the dataset has only one text column)

    • AutoReport now includes the following:

      • Information about the time series validation strategy

      • Experiment lineage (model lineage plot)

      • NLP/Image architecture details

    • Zero-inflated regression models for insurance use cases (combination of classification + regression models)

    • Time series centering and de-trending transformations:

      • Inner ML model is trained on residuals after fitting and removing trend from target signal (per time-series group)

      • Support for constant (centering), linear and logistic trends

      • SEIRD model for epidemic modeling of (S)usceptible, (E)xposed, (I)nfected, (R)ecovered and (D)eceased, fully configurable lower/upper bounds for model parameters

    • Graphical config.toml editor for expert settings

    • Empiric prediction intervals for regression problems with user-defined confidence levels (based on holdout predictions)

    • Insights tab with helpful visualizations (currently only for time-series and image problems)

    • For binary classification problems with F05, F1, F2, MCC scorers, use the same metric for optimal threshold determination

    • Custom data recipes can now be part of the experiment’s modeling pipeline, and will be part of the Python scoring package

    • Custom visualizations in AutoViz following the Grammar of Graphics

    • Pass data to (custom) scorers so that they can access other columns, not only the actual and predicted values

    • Added many new scorers for common regression and classification metrics out of the box

    • Added holiday calendar for 24 more countries, allow user to select list of countries to create is-holiday features for time series experiments

    • Added identity_no_clip target transformer for regression problems that never clips the predictions to observed ranges and allows extrapolation

    • MLI:

      • New GUI/UX for MLI

      • Added Kernel Explainer for original feature Shapley importance

      • Added ability to download Shapley values for original features from UI as CSV file

      • Added intercept column to k-LIME output CSV file

      • Added ability to run surrogate models on DAI model residuals to help debug model errors

      • Added ability to export Decision Tree Surrogate model rules as text and Python code

      • Added Decision Tree Surrogate model for multinomial experiments

      • Added Leave One Covariate Out (LOCO) for multinomial experiments

      • Added two traditional fair lending metrics for Disparate Impact Analysis (DIA): Standardized Mean Difference (SMD) and Marginal Error (ME)

      • Added two interpretable model recipes to https://github.com/h2oai/driverlessai-recipes: GA2M and XNN (https://github.com/h2oai/driverlessai-recipes/tree/master/models/mli)

      • Display prediction label for binary classification experiments in MLI summary page

  • Improvements:

    • Improved parsability (machine readability) of log files

    • Custom recipes are now only visible to the user that created them; previously created custom recipes remain globally visible

    • Faster time-series experiments

    • Improve preview to show more details about the modeling part of the final pipeline

    • Improved notifications system

    • Reduced size of MOJO

    • Only allow imbalanced sampling techniques when data is larger than user controllable threshold

    • Upgraded to latest H2O-3 backend for custom recipes

    • Faster feature selection for large imbalanced datasets

  • Documentation updates:

    • Added animated GIFs

    • Added tabbed content

    • Added more details for imbalanced sampling methods for binary classification

    • New content (refer to above linked topics)

  • Bug fixes:

    • Various bug fixes

Version 1.8.10 LTS (Feb 19, 2021)

  • New Features:

    • Exposed new Python client for download in resources menu

    • Added support for .avro file format

    • Added option to generate multiple AutoDocs. This can be set using the option autodoc_template in config.toml and setting it to a list of AutoDoc file paths
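
For the multiple-AutoDoc option above, a config.toml entry might look like the following; the paths and the exact list syntax are assumptions, so consult the config.toml documentation for your release:

```toml
# Hypothetical example: generate two AutoDocs from different template files
autodoc_template = "['/opt/templates/standard_autodoc.docx', '/opt/templates/custom_autodoc.docx']"
```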

  • MOJO updates:

    • Upgraded MOJO runtime dependency to 2.5.10

    • Added MOJO support to compute Shapley values for tree- and linear-based boosting models

  • Improvements:

    • Added more verbosity to MLI logs

  • Bug Fixes:

    • Fixed a stall in LightGBM models on P2.8x Amazon EC2 instances

Version 1.8.9 LTS (Oct 19, 2020)

Available here

  • New Features:

    • Add configurable CSRF (Cross-site request forgery) protection on API endpoints

    • Add protection against concurrent sessions

  • Improvements:

    • Hide webserver technology info from all API endpoints

    • Improved BYOR security by introducing configurable static analysis of the code

    • Improved session verification and authenticity

    • Improved security for internals API handlers via encryption

  • Bug Fixes:

    • Fix user session autologout after session expiration

    • Fix for properly cleaning closed sessions

    • Fix invalid redirection to static artifacts when using reverse proxy and URL prefix

    • Fix import of files without extension

Version 1.8.8 LTS (Sep 30, 2020)

Available here

  • New Features:

    • Give user control over the number of saved variable importances, so Python and R clients can get more than 14 values back (max_varimp_to_save in the config.toml file)

    • Added holiday calendar for 24 more countries, allow user to select list of countries to create is-holiday features for time series experiments

    • Enable GPU support for LightGBM models on IBM Power

    • Support rhel8-like systems

    • Introducing an option to log in using the JWT token injected by the reverse proxy

    • Allow user to specify data delimiter/separator from configuration (datatable_separator; see the config.toml file)

    • Add support for an encrypted keystore for sensitive config.toml values. Currently only available for LTS releases (1.8.8 and later)

    • Save transformed column names for Shapley value computation in MOJO
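
Two of the config.toml settings named above, sketched with illustrative values (the values themselves are assumptions, not defaults):

```toml
# Save up to 100 variable importances instead of the default 14
max_varimp_to_save = 100

# Force a semicolon as the data delimiter/separator for file imports
datatable_separator = ";"
```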

  • Improvements:

    • Add more consistency in handling files without an extension

    • Improve web server request handling and disallow redirection outside of application

    • Improve log file formatting to facilitate parsing

    • Improve logging for connectors

    • Improve air-gapped support for custom recipes

    • Allow Snowflake Stage tables to be optional

  • Bug Fixes:

    • Fix OpenID and TLS login redirection when deploying behind reverse proxy

    • Fix Cgroup memory detection on IBM Power

    • Various MLI fixes

    • Various UI fixes

Version 1.8.7.2 LTS (July 13, 2020)

Available here

  • Bug Fixes:

    • Add and pass the authentication_method parameter so that the proper get_true_username and start_session are used

    • SQL-like connectors: strip the unnecessary semicolon from the end of the query

  • Documentation updates:

    • Document use of hive_app_jvm_args

Version 1.8.7.1 LTS (June 23, 2020)

Available here

  • New Features:

    • Add ability to push artifacts to a Bitbucket server

    • Add per-feature user control for monotonicity constraints for XGBoostGBM, LightGBM and DecisionTree models

  • Bug Fixes:

    • Fix Hive kerberos impersonation

    • Fix a DTap connector issue by using the proper login username for impersonation

    • Fix monotonicity constraints for XGBoostGBM, LightGBM and DecisionTree models

Version 1.8.7 LTS (June 15, 2020)

Available here

  • New Features:

    • Add intercept term to k-LIME CSV

    • Add control of default categorical & numeric feature rendering in DAI PD/ICE

    • Add ability to restrict custom recipe upload to a specific git repository and branch

    • Add translations for Korean and Chinese

    • Add ability to use multiple authentication methods simultaneously

  • Improvements:

    • Improve behavior of systemctl when Driverless AI fails to start

    • Improve logging behavior for JDBC and Hive connectors

    • Improve behavior of C++ scorer, fewer unnecessary files saved in tmp directory

    • Improve Docker image behavior in Kubernetes

    • Improve LDAP authentication to allow for anonymous binding

    • Improve speed of feature selection for experiments on large, wide, imbalanced datasets

    • Improve speed of data import on busy system

  • Bug fixes:

    • Fix automatic Kaggle submission and score retrieval

    • Fix intermittent Java exception seen by surrogate DRF model in MLI when several MLI jobs are run concurrently

    • Fix issue with deleting Deployments if linked Experiment was deleted

    • Fix issue causing Jupyter Notebooks to not work properly in Docker Image

    • Fix custom recipe scorers not being displayed on Diagnostics page

    • Fix issue with AWS Lambda Deployment not handling dropped columns properly

    • Fix issue with not being able to limit number of GPUs for specific experiment

    • Fix in-server scoring inaccuracies for certain models built in 1.7.1 and 1.8.0 (standalone scoring not affected)

    • Fix rare datatable type casting exception

  • Documentation updates:

    • The “Maximum Number of Rows to Perform Permutation-Based Feature Selection” expert setting now has a default value of 500,000

    • Improved Hive and Snowflake connector documentation

    • Updated the Main.java example in the Java Scoring Pipeline chapter

    • Added documentation describing how to change the language in the UI before starting the application

    • Added information about how custom recipes are described and documented in the Autoreport

    • Updated the LDAP authentication documentation

    • Improved the Linux DEB and RPM installation instructions

    • Improved the AWS Community AMI installation instructions

    • Improved documentation for the Reproducible button

Version 1.8.6 LTS (Apr 30, 2020)

Available here

  • New Features:

    • Add expert setting to reduce size of MOJO scoring pipelines (and hence reduce latency and memory usage for inference)

    • Enable Lambda deployment for IBM Power

    • Add restart button for Deployments

    • Add automatic Kaggle submission for supported datasets, show private/public scores (requires Kaggle API Username/Key)

    • Show warning if single final model is worse on back-testing splits (for time series) or cross-validation folds (for IID) than the fold models (indicates issue with signal or fit)

    • Update R client API to include autodoc, experiment preview, dataset download, autovis functions

    • Add button in expert settings that toggles some effective settings to make a small MOJO production pipeline

    • Add an option to upload artifacts to S3 or a Git repository

  • Improvements:

    • Improve experiment restart/refit robustness if model type is changed

    • Extra protection against dropping features

    • Improve implementation of Hive connector

  • Bug fixes:

    • Upgrade datatable to fix endless loop during stats calculation at file import

    • Web server and UI now respect dynamic base URL suffix

    • Fix incorrect min_rows in MLI when providing weight column with small values

    • Fix segfault in MOJO for TensorFlow/PyTorch models

    • Fix elapsed time for MLI

    • Enable GPU by default for R client

    • Fix Python scoring h2oai ModuleNotFound error

    • Update the no_drop_features toml setting and expert button to more generally avoid dropping features

    • Fix datatable mmap strategy
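
The no_drop_features setting referenced above is a config.toml flag; a minimal sketch (the value shown is illustrative):

```toml
# Avoid dropping any original columns during feature selection/engineering
no_drop_features = true
```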

  • Documentation updates:

    • Add documentation for enabling the Hive data connector

    • Add documentation for updating expired DAI licenses on AWS Lambda deployments using a script

    • Documentation for uploading artifacts now includes support for S3 and Git in the artifacts store

    • Improve documentation for one-hot encoding

    • Improve documentation for systemd logs/journalctl

    • Improve documentation for time series ‘unavailable columns at prediction time’

    • Improve documentation for Azure blob storage

    • Improve documentation for MOJO scoring pipeline

    • Add information about reducing the size of a MOJO using a new expert setting

Version 1.8.5 LTS (Mar 09, 2020)

Available here

  • New Features:

    • Handle large (up to 10k classes) multiclass problems, including GUI improvements in such cases

    • Detect class imbalance for binary problems where target class is non-rare

    • Add feature count to iteration panel

    • Add experiment lineage pdf in experiment summary zip file

    • Issue warnings if final pipeline scores are unstable across (cross-)validation folds

    • Issue warning if Constant Model is improving quality of final pipeline (sign of bad signal)

    • Report origin of leakage detection as from model fit (AUC/R2), Gini, or correlation

  • Improvements:

    • Improve handling of ID columns

    • Improve exception handling to improve stability of raising Python exceptions

    • Improve exception handling when any individual transformer or model throws an exception or segfaults

    • Improve robustness of restart and refit experiment to changes in experiment choices

    • Improve handling of missing values when transforming dataset

    • Improve robustness of custom recipe importing of modules

    • Improve documentation for installation instructions

    • Improve selection of initial lag sizes for time series

    • Improve LightGBM stability for regression problems for certain mutation parameters

  • Documentation updates:

    • Improved documentation for time-series experiments

    • Added topics describing how to re-enable the Data Recipe URL and Data Recipe File connectors

    • For users running older versions of the Standalone Python Scoring Pipeline, added information describing how to install upgraded versions of outdated dependencies

    • Improved the description for the “Sampling Method for Imbalanced Binary Classification Problems” expert setting

    • Added constraints related to the REST server deployments

    • Noted required vs optional parameters in the HDFS connector topics

    • Added an FAQ indicating that MOJOs are thread safe

    • On Windows 10, only Docker installs are supported

    • Added information about the Recommendations AutoViz graph

    • Added information to the Before you Begin Installing topic that master.db files are not backward compatible with earlier Driverless AI versions

  • Bug fixes:

    • Update LightGBM for bug fixes, including hangs and avoiding hard-coded library paths

    • Stabilize use of psutil package

    • Fix time-series experiments when test set has missing target values

    • Fix Python scoring to not depend upon original data_directory

    • Fix preview for custom time series validation splits and low accuracy

    • Fix ignored minimum lag size setting for single time series

    • Fix parsing of Excel files with datetime columns

    • Fix column type detection for columns with mostly missing values

    • Fix invalid display of 0.0000 score in iteration scores

    • Various MLI fixes (don’t show invalid graphs, fix PDP sort order, overlapping labels)

    • Various bug fixes

Version 1.8.4.1 LTS (Feb 4, 2020)

Available here

  • Add option for dynamic port allocation

  • Documentation for AWS community AMI

  • Various bug fixes (MLI UI)

Version 1.8.4 LTS (Jan 31, 2020)

Available here

  • New Features:

    • Added ‘Scores’ tab in experiment page to show detailed tuning tables and scores for models and folds

    • Added Constant Model (constant predictions) and use it as reference model by default

    • Show score of global constant predictions in experiment summary as reference

    • Added support for setting up mutual TLS for Driverless AI

    • Added option to use client/personal certificate as an authentication method

  • Documentation Updates:

    • Added sections for enabling mTLS and Client Certificate authentication

    • Constant Models is now included in the list of Supported Algorithms

    • Added a section describing the Model Scores page

    • Improved the C++ Scoring Pipeline documentation describing the process for importing datatable

    • Improved documentation for the Java Scoring Pipeline

  • Bug fixes:

    • Fix refitting of final pipeline when new features are added

    • Various bug fixes

Version 1.8.3 LTS (Jan 22, 2020)

Available here

  • Added option to upload experiment artifacts to a configured disk location

  • Various bug fixes (correct feature engineering from time column, migration for brain restart)

Version 1.8.2 LTS (Jan 17, 2020)

Available here

  • New Features:

    • Decision Tree model:

      • Automatically enabled for accuracy <= 7 and interpretability >= 7

      • Supports all problem types: regression/binary/multiclass

      • Uses the LightGBM GPU/CPU backend with MOJO

      • Visualization of tree splits and leaf node decisions as part of pipeline visualization

    • Per-Column Imputation Scheme (experimental):

      • Select one of [const, mean, median, min, max, quantile] imputation schemes at the start of the experiment

      • Select the method of calculating the imputation value: either on the entire dataset or inside each pipeline’s training data split

      • Disabled by default; must be enabled at startup time to be effective

    • Show MOJO size and scoring latency (for C++/R/Python runtime) in experiment summary

    • Automatically prune low weight base models in final ensemble (based on interpretability setting) to reduce final model complexity

    • Automatically convert non-raw github URLs for custom recipes to raw source code URLs

  • Improvements:

    • Speed up feature evolution for time-series and low-accuracy experiments

    • Improved accuracy of feature evolution algorithm

    • Feature transformer interpretability, total count, and importance accounted for in genetic algorithm’s model and feature selection

    • Binary confusion matrix in ROC curve of experiment page is made consistent with Diagnostics (flipped positions of TP/TN)

    • Only include custom recipes in Python scoring pipeline if the experiment uses any custom recipes

    • Additional documentation (New OpenID config options, JDBC data connector syntax)

    • Improved AutoReport’s transformer descriptions

    • Improved progress reporting during Autoreport creation

    • Improved speed of automatic interaction search for imbalanced multiclass problems

    • Improved accuracy of single final model for GLM and FTRL

    • Allow config_overrides to be a list/vector of parameters for R client API

    • Disable early stopping for Random Forest models by default, and expose new ‘rf_early_stopping’ mode (optional)

    • Create identical example data (again, as in 1.8.0 and before) for all scoring pipelines

    • Upgraded versions of datatable and Java

    • Installed graphviz in the Docker image, so the MOJO package and Autoreport now include a .png file of the pipeline visualization. Note: For RPM/DEB/TAR SH installs, users can install graphviz to get this optional functionality

  • Documentation Updates:

    • Added a simple example for modifying a dataset by recipe using live code

    • Added a section describing how to impute datasets (experimental)

    • Added Decision Trees to list of supported algorithms

    • Fixed examples for enabling JDBC connectors

    • Added information describing how to use a JDBC driver that is not tested in-house

    • Updated the Missing Values Handling topic to include sections for “Clustering in Transformers” and “Isolation Forest Anomaly Score Transformer”

    • Improved the “Fold Column” description

  • Bug Fixes:

    • Fix various reasons why final model score was too far off from best feature evolution score

    • Delete temporary files created during test set scoring

    • Fixed target transformer tuning (was potentially mixing up target transformers between feature evolution and final model)

    • Fixed tensorflow_nlp_have_gpus_in_production=true mode

    • Fixed partial dependence plots for missing datetime values and no longer show them for text columns

    • Fixed time-series GUI for quarterly data

    • Feature transformer exploration limited to no more than 1000 new features (small data at 10/10/1 settings would otherwise try too many features)

    • Fixed Kaggle pipeline building recipe to try more than 8 input features

    • Fixed cursor placement in live code editor for custom data recipe

    • Show correct number of cross-validation splits in pipeline visualization when there are more than 10 splits

    • Fixed parsing of datetime in MOJO for some datetime formats without ‘%d’ (day)

    • Various bug fixes

  • Backward/Forward compatibility:

    • Models built in 1.8.2 LTS will remain supported in upcoming versions 1.8.x LTS

    • Models built in 1.7.1/1.8.0/1.8.1 are not deprecated and should continue to work (best effort is made to preserve MOJO and Autoreport creation, MLI, scoring, etc.)

    • Models built in 1.7.0 or earlier will be deprecated

Version 1.8.1.1 (Dec 21, 2019)

Available here

  • Bugfix for time series experiments with quarterly data when launched from GUI

Version 1.8.1 (Dec 10, 2019)

Available here

  • New Features:

    • Full set of scoring metrics and corresponding downloadable holdout predictions for experiments with single final models (time-series or i.i.d)

    • MLI Updates:

      • What-If (sensitivity) analysis

      • Interpretation of experiments on text data (NLP)

    • Custom Data Recipe BYOR:

      • BYOR (bring your own recipe) in Python: pandas, numpy, datatable, third-party libraries for fast prototyping of connectors and data preprocessing inside DAI

      • Data connectors, cleaning, filtering, aggregation, augmentation, feature engineering, splits, etc.

      • Can create one or multiple datasets from scratch or from existing datasets

      • Interactive code editor with live preview

      • Example code at https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.1/data

    • Visualization of final scoring pipeline (Experimental)

      • In-GUI display of graph of feature engineering, modeling and ensembling steps of entire machine learning pipeline

      • Addition to Autodoc

    • Time-Series:

      • Ability to specify which features will be unavailable at test time for time-series experiments

      • Custom user-provided train/validation splits (by start/end datetime for each split) for time-series experiments

      • Back-testing metrics for time-series experiments (regression and classification, with and without lags) based on rolling windows (configurable number of windows)

    • MOJO:

      • Java MOJO for FTRL

      • PyTorch MOJO (C++/Py/R) for custom recipes based on BERT/DistilBERT NLP models (available upon request)

  • Improvements:

    • Accuracy:

      • Automatic pairwise interaction search (+,-,*,/) for numeric features (“magic feature” finder)

      • Improved accuracy for time series experiments with low interpretability

      • Improved leakage detection logic

      • Improved genetic algorithm heuristics for feature evolution (more exploration)

    • Time-Series Recipes:

      • Re-enable Test-time augmentation in Python scoring pipeline for time-series experiments

      • Reduce default number of time-series rolling holdout predictions to same number as validation splits (but configurable)

    • Computation:

      • Faster feature evolution part for non-time-series experiments with single final model

      • Faster binary imbalanced models for very high class imbalance by limiting internal number of re-sampling bags

      • Faster feature selection

      • Enable GPU support for ImbalancedXGBoostGBMModel

      • Improved speed for importing multiple files at once

      • Faster automatic determination of time series properties

      • Enable use of XGBoost models on large datasets if accuracy settings are low enough; dataset size limits are exposed in expert settings

      • Reduced memory usage for all experiments

      • Faster creation of holdout predictions for time-series experiments (Shapley values are now computed by MLI on demand by default)

    • UX Improvements:

      • Added ability to rename datasets

      • Added search bar for expert settings

      • Show traces for long-running experiments

      • All experiments create a MOJO when possible (setting defaults to ‘auto’)

      • All experiments create a pipeline visualization

      • By default, all experiments (iid and time series) have holdout predictions on training data and a full set of metrics for final model

  • Documentation Updates:

    • Updated steps for enabling GPU persistence mode

    • Added information about deprecated NVIDIA functions

    • Improved documentation for enabling LDAP authentication

    • Added information about changing the column type in datasets

    • Updated list of experiment artifacts available in an experiment summary

    • Added steps describing how to expose ports on Docker for the REST service deployment within the Driverless AI Docker container

    • Added an example showing how to run an experiment with a custom transform recipe

    • Improved the FAQ for setting up TLS/SSL

    • Added FAQ describing issues that can occur when attempting Import Folder as File with a data connector on Windows

  • Bug Fixes:

    • Allow brain restart/refit to accept unscored previous pipelines

    • Fix actual vs predicted labeling for diagnostics of regression model

    • Fix MOJO for TensorFlow with target transformers other than identity

    • Fix column type detection for Excel files

    • Allow experiments with default expert settings to have a MOJO

    • Various bug fixes
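
The custom data recipes introduced above are plain Python that builds or modifies datasets with pandas, numpy, or datatable. A minimal sketch of the data-preparation logic such a recipe might carry (shown as an ordinary pandas function; the function name and columns are hypothetical, and the recipe class wrapper expected by a given release is omitted):

```python
import pandas as pd

def prepare_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative data-recipe logic: filter rows, then aggregate per group.

    A real data recipe wraps logic like this in the recipe interface of the
    running Driverless AI version; the names used here are hypothetical.
    """
    # Keep only rows with a positive amount
    filtered = df[df["amount"] > 0]
    # Aggregate to one row per customer
    return filtered.groupby("customer_id", as_index=False).agg(
        total_amount=("amount", "sum"), n_orders=("amount", "size")
    )
```

See the linked driverlessai-recipes repository for complete, working recipe examples.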

Version 1.8.0 (Oct 3, 2019)

Available here

  • Improve speed and memory usage for feature engineering

  • Improve speed of leakage and shift detection, and improve accuracy

  • Improve speed of AutoVis under high system load

  • Improve speed for experiments with large user-given validation data

  • Improve accuracy of ensembles for regression problems

  • Improve creation of Autoreport (only one background job per experiment)

  • Improve sampling techniques for ImbalancedXGBoost and ImbalancedLightGBM models, and disable them by default since they can be slower

  • Add Python/R/C++ MOJO support for FTRL and RandomForest

  • Add native categorical handling for LightGBM in CPU mode

  • Add monotonicity constraints support for LightGBM

  • Add Isolation Forest Anomaly Score transformer (outlier detection)

  • Re-enable One-Hot-Encoding for GLM models

  • Add lexicographical label encoding (disabled by default)

  • Add ability to further train user-provided pretrained embeddings for TensorFlow NLP transformers, in addition to fine-tuning the rest of the neural network graph

  • Add timeout for BYOR acceptance tests

  • Add log and notifications for large shifts in final model variable importances compared to tuning model

  • Add more expert control over time series feature engineering

  • Add ability for recipes to be uploaded in bulk as an entire (or partial) GitHub repository or as links to Python files on the page

  • Allow missing values in fold column

  • Add support for feature brain when starting “New Model With Same Parameters” of a model that was previously restarted

  • Add support for toggling whether additional features are to be included in pipeline during “Retrain Final Pipeline”

  • Limit experiment runtime to one day by default (approximately enforced, can be configured in Expert Settings -> Experiment or config.toml ‘max_runtime_minutes’)

  • Add support for importing pickled Pandas frames (.pkl)

  • MLI updates:

    • Show holdout predictions and test set predictions (if applicable) in MLI TS for both metric and actual vs. predicted charts

    • Add ability to download group metrics in MLI TS

    • Add ability to zoom into charts in MLI TS

    • Add ability to use column not used in DAI model as a k-LIME cluster column in MLI

    • Add ability to view original and transformed DAI model-based feature importance in MLI

    • Add ability to view Shapley importance for original features

    • Add ability to view permutation importance for a DAI model when the config option autodoc_include_permutation_feature_importance is set to on

    • Fixed bug in binary Disparate Impact Analysis, which caused incorrect calculations amongst several metrics (ones using false positives and true negatives in the numerator)

  • Disable NLP TensorFlow transformers by default (enable in NLP expert settings by switching to “on”)

  • Reorganize expert settings, add tab for feature engineering

  • Experiments now indicate whether they were aborted by the user, the system, or a server restart

  • Reduce load of all tasks launched by the server, giving experiments priority in using cores

  • Add experiment summary files to aborted experiment logs

  • Add warning when ensemble has models that reach the maximum-iteration limit despite early stopping, with learning rate controls in the expert panel

  • Improve progress reporting

  • Allow disabling of H2O recipe server for scoring if not using custom recipes (to avoid Java dependency)

  • Fix RMSPE scorer

  • Fix recipes error handling when uploading via URL

  • Fix Autoreport being spawned anytime GUI was on experiment page, overloading the system with forks from the server

  • Fix time-out for Autoreport PDP calculations so they complete more quickly

  • Fix certain config settings to be honored from GUI expert settings (woe_bin_list, ohe_bin_list, text_gene_max_ngram, text_gene_dim_reduction_choice, tensorflow_max_epochs_nlp, tensorflow_nlp_pretrained_embeddings_file_path, holiday_country), previously were only honored when provided at startup time

  • Fix column type for additional columns during scored test set download

  • Fix GUI incorrectly converting time for forecast horizon in TS experiments

  • Fix calculation of correlation for string columns in AutoVis

  • Fix download for R MOJO runtime

  • Fix parameters for LightGBM RF mode

  • Fix dart parameters for LightGBM and XGBoost

  • Documentation updates:

    • Included more information in the Before You Begin Installing or Upgrading topic to help make installations and upgrades go more smoothly

    • Added topic describing how to choose between the AWS Community and AWS Marketplace AMIs

    • Added information describing how to retrieve the MOJO2 Javadoc

    • Updated Python client examples to work with Driverless AI 1.7.x releases

    • Updated documentation for new features, expert settings, MLI plots, etc.

  • Backward/Forward compatibility:

    • Models built in 1.8.0 will remain supported in versions 1.8.x

    • Models built in 1.7.1 are not deprecated and should continue to work (best effort is made to preserve MOJO and Autoreport creation, MLI, scoring, etc.)

    • 1.8.0 upgraded to scipy version 1.3.1 to support newer custom recipes. This might deprecate custom recipes that depend on scipy version 1.2.2 (and experiments using them) and might require re-import of those custom recipes. Previously built Python scoring pipelines will continue to work.

    • Models built in 1.7.0 or earlier will be deprecated

  • Various bug fixes
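
As a quick illustration of the new .pkl import path above, a frame prepared in Python can be pickled and then added through the normal dataset import. A minimal sketch (the file location is illustrative):

```python
import os
import tempfile
import pandas as pd

# Build a small frame and pickle it so it can be imported as a dataset
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "target": [0, 1, 0]})
path = os.path.join(tempfile.gettempdir(), "my_dataset.pkl")  # illustrative path
df.to_pickle(path)

# Round-trip check: the pickled frame loads back unchanged
restored = pd.read_pickle(path)
assert restored.equals(df)
```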

Version 1.7.1 (Aug 19, 2019)

Available here

  • Added two new models with internal sampling techniques for imbalanced binary classification problems: ImbalancedXGBoost and ImbalancedLightGBM

  • Added support for rolling-window based predictions for time-series experiments (2 options: test time augmentation or re-fit)

  • Added support for setting logical column types for a dataset (to override type detection during experiments)

  • Added ability to set experiment name at start of experiment

  • Added leakage detection for time-series problems

  • Added JDBC connector

  • MOJO updates:

    • Added Python/R/C++ MOJO support for TensorFlow model

    • Added Python/R/C++ MOJO support for TensorFlow NLP transformers: TextCNN, CharCNN, BiGRU, including any pretrained embeddings if provided

    • Reduced memory usage for MOJO creation

    • Increased speed of MOJO creation

    • Configuration options for MOJO and Python scoring pipelines now have 3-way toggle: “on”/”off”/”auto”

  • MLI updates:

    • Added disparate impact analysis (DIA) for MLI

    • Allow MLI scoring pipeline to be built for datasets with column names that need to be sanitized

    • Date-aware binning for partial dependence and ICE in MLI

  • Improved generalization performance for time-series modeling with regularization techniques for lag-based features

  • Improved “predicted vs actual” plots for regression problems (using adaptive point sizes)

  • Fix bug in datatable for manipulations of string columns larger than 2GB

  • Fixed download of predictions on user-provided validation data

  • Fix bug in time-series test-time augmentation (the workaround was to include the entire training data in the test set)

  • Honor the expert settings flag to enable detailed traces (disable again by default)

  • Various bug fixes

Version 1.6.4 LTS (Aug 19, 2019)

Available here

  • ML Core updates:

    • Speed up schema detection

    • DAI now drops rows with missing values when diagnosing regression problems

    • Speed up column type detection

    • Fixed growth of individuals

    • Fixed n_jobs for predict

    • Target column is no longer included in predictors for skewed datasets

    • Added an option to prevent users from downloading data files locally

    • Improved UI split functionality

    • A new “max_listing_items” config option to limit the number of items fetched in listing pages

  • Model Ops updates:

    • MOJO runtime upgraded to version 2.1.3 which supports perpetual MOJO pipeline

    • Upgraded deployment templates to version matching MOJO runtime version

  • MLI updates:

    • Fix to MLI schema builder

    • Fix parsing of categorical reason codes

    • Added ability to handle integer time column

  • Various bug fixes

Version 1.7.0 (Jul 7, 2019)

Available here

  • Support for Bring Your Own Recipe (BYOR) for transformers, models (algorithms) and scorers

  • Added protobuf-based MOJO scoring runtime libraries for Python, R and Java (standalone, low-latency)

  • Added local REST server as one-click deployment option for MOJO scoring pipeline, in addition to AWS Lambda endpoint

  • Added R client package, in addition to Python client

  • Added Project workspace to group datasets and experiments and to visually compare experiments and create leaderboards

  • Added download of imported datasets as .csv

  • Recommendations for columnar transformations in AutoViz

  • Improved scalability and performance

  • Ability to provide a maximum runtime for experiments

  • Create MOJO scoring pipeline by default if the experiment configuration allows (for convenience, enables local/cloud deployment options without user input)

  • Support for user provided pre-trained embeddings for TensorFlow NLP models

  • Support for holdout splits lacking some target classes (can happen when a fold column is provided)

  • MLI updates:

    • Added residual plot for regression problems (keeping all outliers intact)

    • Added confusion matrix as default metric display for multinomial problems

    • Added Partial Dependence (PD) and Individual Conditional Expectation (ICE) plots for Driverless AI models in MLI GUI

    • Added ability to search by ID column in MLI GUI

    • Added ability to run MLI PD/ICE on all features

    • Added ability to handle multiple observations for a single time column in MLI TS by taking the mean of the target and prediction where applicable

    • Added ability to handle integer time column in MLI TS

    • MLI TS will use train holdout predictions if there is no test set provided

  • Faster import of files with “%Y%m%d” and “%Y%m%d%H%M” time format strings, and files with lots of text strings

  • Fix units for RMSPE scorer to be a percentage (multiply by 100)

  • Allow non-positive outcomes for MAPE and SMAPE scorers

  • Improved listing in GUI

  • Allow zooming in GUI

  • Upgrade to TensorFlow 1.13.1 and CUDA 10 (and CUDA is part of the distribution now, to simplify installation)

  • Add CPU-support for TensorFlow on PPC

  • Documentation updates:

    • Added documentation for new features including

      • Projects

      • Custom Recipes

      • C++ MOJO Scoring Pipelines

      • R Client API

      • REST Server Deployment

    • Added information about variable importance values on the experiments page

    • Updated documentation for Expert Settings

    • Updated “Tips n Tricks” with new Scoring Pipeline tips

  • Various bug fixes
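
The RMSPE unit fix above makes the scorer report a percentage. A hedged sketch of the metric itself (textbook definition; Driverless AI's exact handling of edge cases may differ):

```python
import numpy as np

def rmspe_percent(actual, predicted):
    """Root mean squared percentage error, reported as a percentage.

    Standard definition; assumes no zero values in `actual`.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ratio = (actual - predicted) / actual
    # Multiply by 100 so the score is a percentage, per the fix above
    return float(np.sqrt(np.mean(ratio ** 2)) * 100.0)
```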

Version 1.6.3 LTS (June 14, 2019)

Available here

  • Included an Audit log feature

  • Fixed support for decimal types for parquet files in MOJO

  • Autodoc can order PDP/ICE by feature importance

  • Session Management updates

  • Upgraded datatable

  • Improved reproducibility

  • Model diagnostics now uses a weight column

  • MLI can build surrogate models on all the original features or on all the transformed features that DAI uses

  • Internal server cache now respects usernames

  • Fixed an issue with time series settings

  • Fixed an out of memory error when loading a MOJO

  • Fixed Python scoring package for TensorFlow

  • Added OpenID configurations

  • Documentation updates:

    • Updated the list of artifacts available in the Experiment Summary

    • Clarified language in the documentation for unsupported (but available) features

    • For the Terraform requirement in deployments, clarified that only Terraform versions in the 0.11.x release are supported, and specifically 0.11.10 or greater

    • Fixed link to the Miniconda installation instructions

  • Various bug fixes

Version 1.6.2 LTS (May 10, 2019)

Available here

  • This version provides PPC64le artifacts

  • Improved stability of datatable

  • Improved path filtering in the file browser

  • Fixed units for RMSPE scorer to be a percentage (multiply by 100)

  • Fixed segmentation fault on Ubuntu 18 with installed font package

  • Fixed IBM Spectrum Conductor authentication

  • Fixed handling of EC2 machine credentials

  • Fixed Lag transformer configuration

  • Fixed kdb+ and Snowflake error reporting

  • Gradually reduce the number of workers used for column statistics computation in case of failure

  • Hide the default Tornado header that exposes the Tornado version in use

  • Documentation updates:

    • Added instructions for installing via AWS Marketplace

    • Improved documentation for installing via Google Cloud

    • Improved FAQ documentation

    • Added Data Sampling documentation topic

  • Various bug fixes

Version 1.6.1.1 LTS (Apr 24, 2019)

Available here

  • Fix in AWS role handling.

Version 1.6.1 LTS (Apr 18, 2019)

Available here

  • Several fixes for MLI (partial dependence plots, Shapley values)

  • Improved documentation for model deployment, time-series scoring, AutoVis and FAQs

Version 1.6.0 LTS (Apr 5, 2019)

Private build only.

  • Fixed import of string columns larger than 2GB

  • Fixed AutoViz crashes on Windows

  • Fixed quantile binning in MLI

  • Plot global absolute mean Shapley values instead of global mean Shapley values in MLI

  • Improvements to PDP/ICE plots in MLI

  • Validated Terraform version in AWS Lambda deployment

  • Added support for NULL variable importance in AutoDoc

  • Made Variable Importance table size configurable in AutoDoc

  • Improved support for various combinations of data import options being enabled/disabled

  • CUDA is now part of distribution for easier installation

  • Security updates:

    • Enforced SSL settings to be honored for all h2oai_client calls

    • Added config option to prevent using LocalStorage in the browser to cache information

    • Upgraded Tornado server version to 5.1.1

    • Improved session expiration and autologout functionality

    • Disabled access to Driverless AI data folder in file browser

    • Provided an option to filter content that is shown in the file browser

    • Use login name for HDFS impersonation instead of predefined name

    • Disabled autocomplete in login form

  • Various bug fixes

Version 1.5.4 (Feb 24, 2019)

Available here

  • Speed up calculation of column statistics for date/datetime columns using certain formats (now uses ‘max_rows_col_stats’ parameter)

  • Added computation of standard deviation for variable importances in experiment summary files

  • Added computation of shift of variable importances between feature evolution and final pipeline

  • Fix link to MLI Time-Series experiment

  • Fix display bug for iteration scores for long experiments

  • Fix display bug for early finish of experiment for GLM models

  • Fix display bug for k-LIME when target is skewed

  • Fix display bug for forecast horizon in MLI for Time-Series

  • Fix MLI for Time-Series for single time group column

  • Fix in-server scoring of time-series experiments created in 1.5.0 and 1.5.1

  • Fix OpenBLAS dependency

  • Detect disabled GPU persistence mode in Docker

  • Reduce disk usage during TensorFlow NLP experiments

  • Reduce disk usage of aborted experiments

  • Refresh reported size of experiments during start of application

  • Disable TensorFlow NLP transformers by default to speed up experiments (can enable in expert settings)

  • Improved progress percentage shown during experiment

  • Improved documentation (upgrade on Windows, how to create the simplest model, DTap connectors, etc.)

  • Various bug fixes

Version 1.5.3 (Feb 8, 2019)

Available here

  • Added support for splitting datasets by time via time column containing date, datetime or integer values

  • Added option to disable file upload

  • Require authentication to download experiment artifacts

  • Automatically drop predictor columns from training frame if not found in validation or test frame and warn

  • Improved performance by using physical CPU cores only (configurable in config.toml)

  • Added option to not show inactive data connectors

  • Various bug fixes

Version 1.5.2 (Feb 2, 2019)

Available here

  • Added word-level bidirectional GRU TensorFlow models for NLP features

  • Added character-level CNN TensorFlow models for NLP features

  • Added support to import multiple individual datasets at once

  • Added support for holdout predictions for time-series experiments

  • Added support for regression and multinomial classification for FTRL (in addition to binomial classification)

  • Improved scoring for time-series when test data contains actual target values (missing target values will be predicted)

  • Reduced memory usage for LightGBM models

  • Improved performance for feature engineering

  • Improved speed for TensorFlow models

  • Improved MLI GUI for time-series problems

  • Fix final model fold splits when fold_column is provided

  • Various bug fixes

Version 1.5.1 (Jan 22, 2019)

Available here

  • Fix MOJO for GLM

  • Add back .csv file of experiment summary

  • Improve collection of pipeline timing artifacts

  • Clean up Docker tag

Version 1.5.0 (Jan 18, 2019)

Available here

  • Added model diagnostics (interactive model metrics on new test data incl. residual analysis for regression)

  • Added FTRL model (Follow The Regularized Leader)

  • Added Kolmogorov-Smirnov metric (degree of separation between positives and negatives)

  • Added ability to retrain (only) the final model on new data

  • Added one-hot encoding for low-cardinality categorical features, for GLM

  • Added choice between 32-bit (now default) and 64-bit precision

  • Added system information (CPU, GPU, disk, memory, experiments)

  • Added support for time-series data with many more time gaps, and with weekday-only data

  • Added one-click deployment to Amazon Lambda

  • Added ability to split datasets randomly, with option to stratify by target column or group by fold column

  • Added support for OpenID authentication

  • Added connector for BlueData

  • Improved responsiveness of the GUI under heavy load situations

  • Improved speed and reduce memory footprint of feature engineering

  • Improved performance for RuleFit models and enable GPU and multinomial support

  • Improved auto-detection of temporal frequency for time-series problems

  • Improved accuracy of final single model if external validation provided

  • Improved final pipeline if external validation data is provided (add ensembling)

  • Improved k-LIME in MLI by using original features deemed important by DAI instead of all original features

  • Improved MLI by using 3-fold CV by default for all surrogate models

  • Improved GUI for MLI time series (integrated help, better integration)

  • Added ability to view MLI time series logs while MLI time series experiment is running

  • PDF version of the Automatic Report (AutoDoc) is now replaced by a Word version

  • Various bug fixes (GLM accuracy, UI slowness, MLI UI, AutoVis)
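

The Kolmogorov-Smirnov metric added above measures the maximum distance between the cumulative score distributions of positives and negatives. A hedged sketch of the computation (the standard KS separation statistic; the shipped scorer may handle ties and weights differently):

```python
import numpy as np

def ks_statistic(y_true, scores):
    """Max gap between the positive- and negative-class score CDFs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    best = 0.0
    for t in np.unique(scores):
        # Fraction of each class scoring at or below the threshold
        cdf_pos = np.mean(pos <= t)
        cdf_neg = np.mean(neg <= t)
        best = max(best, abs(cdf_pos - cdf_neg))
    return float(best)
```

A KS of 1.0 means the two classes' score distributions are perfectly separated; 0.0 means they are indistinguishable.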

Version 1.4.2 (Dec 3, 2018)

Available here

  • Support for IBM Power architecture

  • Speed up training and reduce size of final pipeline

  • Reduced resource utilization during training of final pipeline

  • Display test set metrics (ROC, ROCPR, Gains, Lift) in GUI in addition to validation metrics (if test set provided)

  • Show location of best threshold for Accuracy, MCC and F1 in ROC curves

  • Add relative point sizing for scatter plots in AutoVis

  • Fix file upload and add model checkpointing in Python client API

  • Various bug fixes

Version 1.4.1 (Nov 11, 2018)

Available here

  • Improved integration of MLI for time-series

  • Reduced disk and memory usage during final ensemble

  • Allow scoring and transformations on previously imported datasets

  • Enable checkpoint restart for unfinished models

  • Add startup checks for OpenCL platforms for LightGBM on GPUs

  • Improved feature importances for ensembles

  • Faster dataset statistics for date/datetime columns

  • Faster MOJO batch scoring

  • Fix potential hangs

  • Fix ‘not in list’ error in MOJO

  • Fix NullPointerException in MLI

  • Fix outlier detection in AutoVis

  • Various bug fixes

Version 1.4.0 (Oct 27, 2018)

Available here

  • Enable LightGBM by default (now with MOJO)

  • LightGBM tuned for GBM decision trees, Random Forest (rf), and Dropouts meet Multiple Additive Regression Trees (dart)

  • Add ‘isHoliday’ feature for time columns

  • Add ‘time’ column type for date/datetime columns in data preview

  • Add support for binary datatable file ingest in .jay format

  • Improved final ensemble (each model has its own feature pipeline)

  • Automatic smart checkpointing (feature brain) from prior experiments

  • Add kdb+ connector

  • Feature selection of original columns for wide data (handles well over 100 columns)

  • Improved time-series recipe (multiple validation splits, better logic)

  • Improved performance of AutoVis

  • Improved date detection logic (now detects %Y%m%d and %Y-%m date formats)

  • Automatic fallback to CPU mode if GPU runs out of memory (for XGBoost, GLM and LightGBM)

  • No longer require header for validation and testing datasets if data types match

  • No longer include text columns for data shift detection

  • Add support for time-series models in MLI (including ability to select time-series groups)

  • Add ability to download MLI logs from MLI experiment page (includes both Python and Java logs)

  • Add ability to view MLI logs while MLI experiment is running (Python and Java logs)

  • Add ability to download LIME and Shapley reason codes from MLI page

  • Add ability to run MLI on transformed features

  • Display all variables for MLI variable importance for both DAI and surrogate models in MLI summary

  • Include variable definitions for DAI variable importance list in MLI summary

  • Fix Gains/Lift charts when observations weights are given

  • Various bug fixes

Version 1.3.1 (Sep 12, 2018)

Available here

  • Fix ‘Broken pipe’ failures for TensorFlow models

  • Fix time-series problems with categorical features and interpretability >= 8

  • Various bug fixes

Version 1.3.0 (Sep 4, 2018)

Available here

  • Added LightGBM models - now have [XGBoost, LightGBM, GLM, TensorFlow, RuleFit]

  • Added TensorFlow NLP recipe based on CNN Deeplearning models (sentiment analysis, document classification, etc.)

  • Added MOJO for GLM

  • Added detailed confusion matrix statistics

  • Added more expert settings

  • Improved data exploration (columnar statistics and row-based data preview)

  • Improved speed of feature evolution stage

  • Improved speed of GLM

  • Report single-pass score on external validation and test data (instead of bootstrap mean)

  • Reduced memory overhead for data processing

  • Reduced number of open files - fixes ‘Bad file descriptor’ error on Mac/Docker

  • Simplified Python client API

  • Query any data point in the original dataset from the MLI UI, thanks to “on-demand” reason code generation

  • Enhanced k-means clustering in k-LIME by only using a subset of features. See The k-LIME Technique for more information.

  • Report k-means centers for k-LIME in MLI summary for better cluster interpretation

  • Improved MLI experiment listing details

  • Various bug fixes

Version 1.2.2 (July 5, 2018)

  • MOJO Java scoring pipeline for time-series problems

  • Multi-class confusion matrices

  • AUCMACRO Scorer: Multi-class AUC via macro-averaging (in addition to the default micro-averaging)

  • Expert settings (configuration override) for each experiment from GUI and client APIs

  • Support for HTTPS

  • Improved downsampling logic for time-series problems (if enabled through accuracy knob settings)

  • LDAP readonly access to Active Directory

  • Snowflake data connector

  • Various bug fixes
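
Macro-averaging computes one one-vs-rest AUC per class and averages them, rather than pooling all decisions as micro-averaging does. A minimal NumPy sketch using the rank-sum (Mann-Whitney) formulation of AUC; the helper names are illustrative, not DAI's implementation:

```python
import numpy as np

def binary_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation, with tie handling."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # Average the ranks of tied scores.
    for s in np.unique(scores):
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def macro_auc(y_true, proba):
    """Macro-averaged multi-class AUC: mean of one-vs-rest binary AUCs.
    Assumes proba columns are ordered by sorted class label."""
    classes = np.unique(y_true)
    aucs = [binary_auc((y_true == c).astype(int), proba[:, i])
            for i, c in enumerate(classes)]
    return float(np.mean(aucs))
```

Because each class contributes equally to the average, macro-AUC weights rare classes the same as frequent ones, which is the main practical difference from the micro-averaged default.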

Version 1.2.1 (June 26, 2018)

  • Added LIME-SUP (alpha) to MLI as alternative to k-LIME (local regions are defined by decision tree instead of k-means)

  • Added RuleFit model (alpha), now have [GBM, GLM, TensorFlow, RuleFit] - TensorFlow and RuleFit are disabled by default

  • Added Minio (private cloud storage) connector

  • Added support for importing folders from S3

  • Added ‘Upload File’ option to ‘Add Dataset’ (in addition to drag & drop)

  • Predictions for binary classification problems now have 2 columns (probabilities per class), for consistency with multi-class

  • Improved model parameter tuning

  • Improved feature engineering for time-series problems

  • Improved speed of MOJO generation and loading

  • Improved speed of time-series related automatic calculations in the GUI

  • Fixed potential rare hangs at end of experiment

  • No longer require internet to run MLI

  • Various bug fixes
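
The new two-column layout simply pairs each class-1 probability with its complement, so binary predictions have the same shape as multi-class ones. A minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def to_two_columns(p_class1):
    """Expand P(class=1) into [P(class=0), P(class=1)] columns,
    matching the multi-class prediction layout."""
    p = np.asarray(p_class1, dtype=float)
    return np.column_stack([1.0 - p, p])
```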

Version 1.2.0 (June 11, 2018)

  • Time-Series recipe

  • Low-latency standalone MOJO Java scoring pipelines (now beta)

  • Enable Elastic Net Generalized Linear Modeling (GLM) with lambda search (and GPU support), for interpretability>=6 and accuracy<=5 by default (alpha)

  • Enable TensorFlow (TF) Deep Learning models (with GPU support) for interpretability=1 and/or multi-class models (alpha, enable via config.toml)

  • Support for pre-tuning of [GBM, GLM, TF] models for picking best feature evolution model parameters

  • Support for final ensemble consisting of mix of [GBM, GLM, TF] models

  • Automatic Report (AutoDoc) in PDF and Markdown format as part of summary zip file

  • Interactive tour (assistant) for first-time users

  • MLI now runs on experiments from previous releases

  • Surrogate models in MLI now use 3 folds by default

  • Improved small data recipe with up to 10 cross-validation folds

  • Improved accuracy for binary classification with imbalanced data

  • Additional time-series transformers for interactions and aggregations between lags, and for lagging of non-target columns

  • Faster creation of MOJOs

  • Progress report during data ingest

  • Normalize binarized multi-class confusion matrices by class count (global scaling factor)

  • Improved parsing of boolean environment variables for configuration

  • Various bug fixes

Version 1.1.6 (May 29, 2018)

  • Improved performance for large datasets

  • Improved speed and user interface for MLI

  • Improved accuracy for binary classification with imbalanced data

  • Improved generalization estimate for experiments with given validation data

  • Reduced size of experiment directories

  • Support for Parquet files

  • Support for bzip2 compressed files

  • Added Data preview in UI: ‘Describe’

  • No longer add ID column to holdout and test set predictions for simplicity

  • Various bug fixes

Version 1.1.4 (May 17, 2018)

  • Native builds (RPM/DEB) for 1.1.3

Version 1.1.3 (May 16, 2018)

  • Faster speed for systems with large CPU core counts

  • Faster and more robust handling of user-specified missing values for training and scoring

  • Use the same validation scheme for feature engineering and the final ensemble at sufficiently high accuracy settings

  • MOJO scoring pipeline for text transformers

  • Fixed single-row scoring in Python scoring pipeline (broken in 1.1.2)

  • Fixed default scorer when experiment is started too quickly

  • Improved responsiveness for time-series GUI

  • Improved responsiveness after experiment abort

  • Improved load balancing of memory usage for multi-GPU XGBoost

  • Improved UI for selection of columns to drop

  • Various bug fixes

Version 1.1.2 (May 8, 2018)

  • Support for automatic time-series recipe (alpha)

  • Now using Generalized Linear Model (GLM) instead of XGBoost (GBM) for interpretability 10

  • Added experiment preview with runtime and memory usage estimation

  • Added MER scorer (Median Error Rate, Median Abs. Percentage Error)

  • Added ability to use integer column as time column

  • Speed up type enforcement during scoring

  • Support for reading ARFF file format (alpha)

  • Quantile Binning for MLI

  • Various bug fixes
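
The MER scorer's definition can be written down directly; the percent scaling below is an assumption based on the "Median Abs. Percentage Error" description:

```python
import numpy as np

def median_abs_percentage_error(y_true, y_pred):
    """MER: median of |actual - predicted| / |actual|, expressed in percent.
    Being a median, it is robust to a few large outlier errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.median(np.abs((y_true - y_pred) / y_true)) * 100)
```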

Version 1.1.1 (April 23, 2018)

  • Support string columns larger than 2GB

Version 1.1.0 (April 19, 2018)

  • AWS/Azure integration (hourly cloud usage)

  • Bug fixes for MOJO pipeline scoring (now beta)

  • Google Cloud storage and BigQuery (alpha)

  • Speed up categorical column stats computation during data import

  • Further improved memory management on GPUs

  • Improved accuracy for MAE scorer

  • Ability to build scoring pipelines on demand (if not enabled by default)

  • Additional target transformer for regression problems: sqrt(sqrt(x))

  • Add GLM models as candidates for interpretability=10 (alpha, disabled by default)

  • Improved performance of native builds (RPM/DEB)

  • Improved estimation of error bars

  • Various bug fixes
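
The sqrt(sqrt(x)) target transform is a fourth root, which compresses large non-negative response values before training; predictions are raised back to the fourth power for reporting. A sketch of the transform pair (function names are illustrative):

```python
import numpy as np

def transform(y):
    """Forward target transform: fourth root (sqrt of sqrt), for y >= 0."""
    return np.sqrt(np.sqrt(y))

def inverse_transform(z):
    """Invert the transform so predictions return to the original scale."""
    return z ** 4
```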

Version 1.0.30 (April 5, 2018)

  • Speed up MOJO pipeline creation and disable MOJO by default (still alpha)

  • Improved memory management on GPUs

  • Support for optional 32-bit floating-point precision for reduced memory footprint

  • Added logging of test set scoring and data transformations

  • Various bug fixes

Version 1.0.29 (April 4, 2018)

  • If the MOJO fails to build, no MOJO will be available, but the experiment can still succeed

Version 1.0.28 (April 3, 2018)

  • (Non-docker) RPM installers for RHEL7/CentOS7/SLES 12 with systemd support

Version 1.0.27 (March 31, 2018)

  • MOJO scoring pipeline for Java standalone cross-platform low-latency scoring (alpha)

  • Various bug fixes

Version 1.0.26 (March 28, 2018)

  • Improved performance and reduced memory usage for large datasets

  • Improved performance for the F0.5, F2 and accuracy scorers

  • Improved performance of MLI

  • Distribution shift detection now also between validation and test data

  • Batch scoring example using datatable

  • Various enhancements for AutoVis (outliers, parallel coordinates, log file)

  • Various bug fixes

Version 1.0.25 (March 22, 2018)

  • New scorers for binary/multinomial classification: F0.5, F2 and accuracy

  • Precision-recall curve for binary/multinomial classification models

  • Plot of actual vs predicted values for regression problems

  • Support for excluding feature transformations by operation type

  • Support for reading binary file formats: datatable and Feather

  • Improved multi-GPU memory load balancing

  • Improved display of initial tuning results

  • Reduced memory usage during creation of final model

  • Fixed several bugs in creation of final scoring pipeline

  • Various UI improvements (e.g., zooming on iteration scoreboard)

  • Various bug fixes

Version 1.0.24 (March 8, 2018)

  • Fix test set scoring bug for data with an ID column (introduced in 1.0.23)

  • Allow renaming of MLI experiments

  • Ability to limit maximum number of cores used for datatable

  • Print validation scores and error bars across final ensemble model CV folds in logs

  • Various UI improvements

  • Various bug fixes

Version 1.0.23 (March 7, 2018)

  • Support for Gains and Lift curves for binomial and multinomial classification

  • Support for multi-GPU single-model training for large datasets

  • Improved recipes for large datasets (faster and less memory/disk usage)

  • Improved recipes for text features

  • Increased sensitivity of interpretability setting for feature engineering complexity

  • Disable automatic time column detection by default to avoid confusion

  • Automatic column type conversion for test and validation data, and during scoring

  • Improved speed of MLI

  • Improved feature importances for MLI on transformed features

  • Added ability to download each MLI plot as a PNG file

  • Added support for dropped columns and weight column to MLI stand-alone page

  • Fix serialization of bytes objects larger than 4 GiB

  • Fix failure to build scoring pipeline with ‘command not found’ error

  • Various UI improvements

  • Various bug fixes

Version 1.0.22 (Feb 23, 2018)

  • Fix CPU-only mode

  • Improved robustness of datatable CSV parser

Version 1.0.21 (Feb 21, 2018)

  • Fix MLI GUI scaling issue on Mac

  • Work-around segfault in truncated SVD scipy backend

  • Various bug fixes

Version 1.0.20 (Feb 17, 2018)

  • HDFS/S3/Excel data connectors

  • LDAP/PAM/Kerberos authentication

  • Automatic setting of default values for accuracy / time / interpretability

  • Interpretability: per-observation and per-feature (signed) contributions to predicted values in scoring pipeline

  • Interpretability setting now affects feature engineering complexity and final model complexity

  • Standalone MLI scoring pipeline for Python

  • Time setting of 1 now runs for only 1 iteration

  • Early stopping of experiments if convergence is detected

  • ROC curve display for binomial and multinomial classification, with confusion matrices and threshold/F1/MCC display

  • Training/Validation/Test data shift detectors

  • Added AUCPR scorer for multinomial classification

  • Improved handling of imbalanced binary classification problems

  • Configuration file for runtime limits such as cores/memory/harddrive (for admins)

  • Various GUI improvements (ability to rename experiments, re-run experiments, logs)

  • Various bug fixes

Version 1.0.19 (Jan 28, 2018)

  • Fix hang during final ensemble (accuracy >= 5) for larger datasets

  • Allow scoring of all models built in older versions (>= 1.0.13) in GUI

  • More detailed progress messages in the GUI during experiments

  • Fix scoring pipeline to only use relative paths

  • Error bars in model summary are now +/- 1*stddev (instead of 2*stddev)

  • Added RMSPE scorer (RMS Percentage Error)

  • Added SMAPE scorer (Symmetric Mean Abs. Percentage Error)

  • Added AUCPR scorer (Area under Precision-Recall Curve)

  • Gracefully handle inf/-inf in data

  • Various UI improvements

  • Various bug fixes
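
The two new percentage-error scorers can be sketched directly; the exact denominator conventions and percent scaling here are assumptions based on the standard definitions:

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)) * 100)

def smape(y_true, y_pred):
    """Symmetric MAPE: |error| relative to the mean of |actual| and |predicted|."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return float(np.mean(np.abs(y_true - y_pred) / denom) * 100)
```

Unlike plain MAPE, SMAPE's symmetric denominator bounds the per-row error, so over- and under-predictions are penalized more evenly.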

Version 1.0.18 (Jan 24, 2018)

  • Fix migration from version 1.0.15 and earlier

  • Confirmation dialog for experiment abort and data/experiment deletion

  • Various UI improvements

  • Various AutoVis improvements

  • Various bug fixes

Version 1.0.17 (Jan 23, 2018)

  • Fix migration from version 1.0.15 and earlier (partial, for experiments only)

  • Added model summary download from GUI

  • Restructured and renamed logs archive, and add model summary to it

  • Fix regression in AutoVis in 1.0.16 that led to a slowdown

  • Various bug fixes

Version 1.0.16 (Jan 22, 2018)

  • Added support for validation dataset (optional, instead of internal validation on training data)

  • Standard deviation estimates for model scores (+/- 1 std.dev.)

  • Computation of all applicable scores for final models (in logs only for now)

  • Standard deviation estimates for MLI reason codes (+/- 1 std.dev.) when running in stand-alone mode

  • Added ability to abort MLI job

  • Improved final ensemble performance

  • Improved outlier visualization

  • Updated H2O-3 to version 3.16.0.4

  • More readable experiment names

  • Various speedups

  • Various bug fixes

Version 1.0.15 (Jan 11, 2018)

  • Fix truncated per-experiment log file

  • Various bug fixes

Version 1.0.14 (Jan 11, 2018)

  • Improved performance

Version 1.0.13 (Jan 10, 2018)

  • Improved estimate of generalization performance for final ensemble by removing leakage from target encoding

  • Added API for re-fitting and applying feature engineering on new (potentially larger) data

  • Remove access to pre-transformed datasets to avoid unintended leakage issues downstream

  • Added mean absolute percentage error (MAPE) scorer

  • Enforce monotonicity constraints for binary classification and regression models if interpretability >= 6

  • Use squared Pearson correlation for R^2 metric (instead of coefficient of determination) to avoid negative values

  • Separated HTTP and TCP scoring pipeline examples

  • Reduced size of h2oai_client wheel

  • No longer require weight column for test data if it was provided for training data

  • Improved accuracy of final modeling pipeline

  • Include H2O-3 logs in downloadable logs.zip

  • Updated H2O-3 to version 3.16.0.2

  • Various bug fixes
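
The difference between the two R^2 definitions is easy to demonstrate: with a systematically biased prediction, the coefficient of determination can go negative, while the squared Pearson correlation always stays in [0, 1]. A minimal sketch:

```python
import numpy as np

def r2_pearson(y_true, y_pred):
    """Squared Pearson correlation: always in [0, 1]."""
    return float(np.corrcoef(y_true, y_pred)[0, 1] ** 2)

def r2_determination(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot; can be negative
    when predictions fit worse than the mean of the target."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)
```

For example, predictions that are perfectly correlated with the target but offset and rescaled score 1.0 under squared Pearson correlation, yet strongly negative under the coefficient of determination.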

Version 1.0.11 (Dec 12, 2017)

  • Faster multi-GPU training, especially for small data

  • Increase default amount of exploration of genetic algorithm for systems with fewer than 4 GPUs

  • Improved accuracy of generalization performance estimate for models on small data (< 100k rows)

  • Faster abort of experiment

  • Improved final ensemble meta-learner

  • More robust date parsing

  • Various bug fixes

Version 1.0.10 (Dec 4, 2017)

  • Tool tips and link to documentation in parameter settings screen

  • Faster training for multi-class problems with > 5 classes

  • Experiment summary displayed in GUI after experiment finishes

  • Python Client Library downloadable from the GUI

  • Speedup for Maxwell-based GPUs

  • Support for multinomial AUC and Gini scorers

  • Add MCC and F1 scorers for binomial and multinomial problems

  • Faster abort of experiment

  • Various bug fixes

Version 1.0.9 (Nov 29, 2017)

  • Support for time column for causal train/validation splits in time-series datasets

  • Automatic detection of the time column from temporal correlations in data

  • MLI improvements, dedicated page, selection of datasets and models

  • Improved final ensemble meta-learner

  • Test set score now displayed in experiment listing

  • Original response is preserved in exported datasets

  • Various bug fixes

Version 1.0.8 (Nov 21, 2017)

  • Various bug fixes

Version 1.0.7 (Nov 17, 2017)

  • Sharing of GPUs between experiments - can run multiple experiments at the same time while sharing GPU resources

  • Persistence of experiments and data - can stop and restart the application without loss of data

  • Support for weight column for optional user-specified per-row observation weights

  • Support for fold column for user-specified grouping of rows in train/validation splits

  • Higher accuracy through model tuning

  • Faster training - overall improvements and optimization in model training speed

  • Separate log file for each experiment

  • Ability to delete experiments and datasets from the GUI

  • Improved accuracy for regression tasks with very large response values

  • Faster test set scoring - significant speed improvements in the GUI

  • Various bug fixes

Version 1.0.5 (Oct 24, 2017)

  • Only display scorers that are allowed

  • Various bug fixes

Version 1.0.4 (Oct 19, 2017)

  • Improved automatic type detection logic

  • Improved final ensemble accuracy

  • Various bug fixes

Version 1.0.3 (Oct 9, 2017)

  • Various speedups

  • Results are now reproducible

  • Various bug fixes

Version 1.0.2 (Oct 5, 2017)

  • Improved final ensemble accuracy

  • Weight of Evidence features added

  • Various bug fixes

Version 1.0.1 (Oct 4, 2017)

  • Improved speed of final ensemble

  • Various bug fixes

Version 1.0.0 (Sep 24, 2017)

  • Initial stable release