Tips ‘n Tricks
This section includes Arno’s tips for running Driverless AI.
Given training data and a target column to predict, H2O Driverless AI produces an end-to-end pipeline tuned for high predictive performance (and/or high interpretability) for general classification and regression tasks. The pipeline has only one purpose: to take a test set, row by row, and turn its feature values into predictions.
A typical pipeline creates dozens or even hundreds of derived features from the user-given dataset. Those transformations are often based on precomputed lookup tables and parameterized mathematical operations that were selected and optimized during training. The pipeline then feeds all these derived features to one or more machine learning algorithms such as linear models, deep learning models, or gradient boosting models (and several more derived models). If there are multiple models, their outputs are post-processed to form the final prediction (either probabilities or target values). The pipeline is a directed acyclic graph.
It is important to note that the training dataset is processed as a whole for better results (e.g., aggregate statistics). For scoring, however, every row of the test dataset must be processed independently to mimic the actual production scenario.
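The contrast between whole-dataset training and row-by-row scoring can be sketched with a single lookup-table transformation. This is only an illustration under simplifying assumptions, not the transformer Driverless AI actually uses: the function names, the `store` and `sales` columns, and the fallback to a global mean for unseen categories are all hypothetical.

```python
from collections import defaultdict


def fit_mean_target_lookup(rows, feature, target):
    """Training time: scan the whole dataset once and keep one aggregate per category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[feature]] += row[target]
        counts[row[feature]] += 1
    lookup = {key: sums[key] / counts[key] for key in sums}
    global_mean = sum(sums.values()) / sum(counts.values())
    return lookup, global_mean


def transform_row(row, feature, lookup, default):
    """Scoring time: derive the feature for one row using only the stored table."""
    return lookup.get(row[feature], default)


# Hypothetical training data: the aggregate needs all rows at once.
train = [
    {"store": "A", "sales": 10.0},
    {"store": "A", "sales": 14.0},
    {"store": "B", "sales": 3.0},
]
lookup, global_mean = fit_mean_target_lookup(train, "store", "sales")

# Each test row is transformed independently; store "C" was never seen in training.
test = [{"store": "A"}, {"store": "C"}]
derived = [transform_row(row, "store", lookup, global_mean) for row in test]
print(derived)
```

Once the table is fitted, no test row needs to know about any other test row, which is what makes the same logic usable in a production scoring service.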
To facilitate deployment to various production environments, there are multiple ways to obtain predictions from a completed Driverless AI experiment: from the GUI, from the R or Python client API, or from a standalone pipeline.
From the GUI:
Score on Another Dataset - Convenient, parallelized, ideal for imported data
Download Predictions - Available if a test set was provided during training
Deploy - Creates an Amazon Lambda endpoint (more endpoints coming soon)
Diagnostics - Useful if the test set includes a target column
From the client API:
Python client - Use the make_prediction_sync() method. An optional argument can be used to get per-row and per-feature ‘Shapley’ prediction contributions. (Pass …)
R client - Use the predict() method. An optional argument can be used to get per-row and per-feature ‘Shapley’ prediction contributions. (Pass …)
From a standalone pipeline:
Python - Supports all models and transformers, and supports ‘Shapley’ prediction contributions and MLI reason codes
Java - Most portable, low latency, supports all models and transformers that are enabled by default (except TensorFlow NLP transformers), can be used in Spark/H2O-3/SparklingWater for scale
C++ - Highly portable, low latency, standalone runtime with a convenient Python and R wrapper
Time Series Tips
H2O Driverless AI handles time-series forecasting problems out of the box.
All you need to do when starting a time-series experiment is to provide a regular columnar dataset containing your features. Then pick a target column and also pick a “time column” - a designated column containing time stamps for every record (row) such as “April 10 2019 09:13:41” or “2019/04/10”. If you have a test set for which you want predictions for every record, make sure to provide future time stamps and features as well.
In most cases, that’s it. You can launch the experiment and let Driverless AI do the rest. It will even auto-detect multiple time series in the same dataset for different groups, such as weekly sales for stores and departments (by finding the columns that identify stores and departments to group by). Driverless AI also auto-detects the time period (including potential gaps during weekends), the forecast horizon, and a possible time gap between the training and testing time periods (to optimize for deployment delay), and it keeps track of holiday calendars. It automatically creates multiple causal time-based validation splits (sliding time windows) for proper validation, and incorporates many other related grand-master recipes such as automatic target and non-target lag feature generation, interactions between lags, first and second derivatives, and exponential smoothing.
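To make the idea of lag features per group concrete, here is a minimal sketch. It is not the Driverless AI recipe: the function name, the `store`, `week` and `sales` columns, and the assumption that each group has regularly spaced rows (so that "lag 1" means "previous row") are all illustrative.

```python
from collections import defaultdict


def add_target_lags(rows, group_cols, time_col, target_col, lags):
    """Return copies of rows with lagged target values computed within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in group_cols)].append(row)

    out = []
    for members in groups.values():
        # ISO-formatted dates sort correctly as strings.
        members = sorted(members, key=lambda r: r[time_col])
        for i, row in enumerate(members):
            new_row = dict(row)
            for lag in lags:
                # Only look backwards in time; None marks "no history yet".
                new_row[f"{target_col}_lag{lag}"] = (
                    members[i - lag][target_col] if i - lag >= 0 else None
                )
            out.append(new_row)
    return out


# Hypothetical weekly sales for two stores (two separate time series).
rows = [
    {"store": "A", "week": "2019-04-01", "sales": 10},
    {"store": "B", "week": "2019-04-01", "sales": 5},
    {"store": "A", "week": "2019-04-08", "sales": 12},
    {"store": "B", "week": "2019-04-08", "sales": 7},
    {"store": "A", "week": "2019-04-15", "sales": 11},
]
lagged = add_target_lags(rows, ["store"], "week", "sales", lags=[1, 2])
for row in lagged:
    print(row)
```

Grouping before lagging matters: without it, the lag of store B would pick up a value from store A.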
If you find that the automatic lag-based time-series recipe isn’t performing well for your dataset, we recommend disabling the creation of lag-based features by turning off “Time-series lag-based recipe” in the expert settings. This leads to regular feature engineering, but with time-based causal validation splits. Especially for small datasets and short forecast periods, this can lead to better results.
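A causal, time-based validation split can be sketched as follows. This is an assumption-laden illustration rather than Driverless AI's internal logic: the function name and the `n_splits`, `horizon` and `gap` parameters are hypothetical, and the horizon and gap are counted in rows instead of calendar time.

```python
def sliding_time_splits(timestamps, n_splits, horizon, gap=0):
    """Yield (train_idx, valid_idx) pairs where validation always lies after training."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n = len(order)
    for k in range(n_splits, 0, -1):
        # The validation window slides towards the end of the data as k decreases.
        valid_end = n - (k - 1) * horizon
        valid_start = valid_end - horizon
        # Leave a gap between training and validation to mimic deployment delay.
        train_end = valid_start - gap
        if train_end <= 0:
            continue  # not enough history for this window
        yield order[:train_end], order[valid_start:valid_end]


# Ten hypothetical daily timestamps.
timestamps = ["2019-04-%02d" % day for day in range(1, 11)]
splits = list(sliding_time_splits(timestamps, n_splits=3, horizon=2, gap=1))
for train_idx, valid_idx in splits:
    print(train_idx, valid_idx)
```

Every validation row is later than every training row in its split, so a model is never scored on data that precedes what it was trained on.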
If the target column is present in the test set and has partially filled information (non-missing values), then Driverless AI will automatically augment the model with those future target values to make better predictions. This can be used to extend the usable lifetime of the model into the future without the need for retraining by providing past known outcomes. Contact us if you’re interested in learning more about test-time augmentation.
For now, training and test datasets should have the same input features available, so think about which of the predictors (input features) will be available during production time and drop the rest (or create your own lag features that can be available to both train and test sets).
For datasets that are non-stationary in time, create a test set from the last temporal portion of data, and create time-based features. This allows the model to be optimized for your production scenario.
We are working on further improving many aspects of our time-series recipe. For example, we will add support to automatically generate lags for features that are only available in the training set, but not in the test set, such as environmental or economic factors. We’ll also improve the performance of back-testing using rolling windows.
A core capability of H2O Driverless AI is the creation of automatic machine learning modeling pipelines for supervised problems. In addition to the data and the target column to be predicted, the user can pick a scorer. A scorer is a function that takes actual and predicted values for a dataset and returns a number. Looking at this single number is the most common way to estimate the generalization performance of a predictive model on unseen data by comparing the model’s predictions on the dataset with its actual values. There are more detailed ways to estimate the performance of a machine learning model such as residual plots (available on the Diagnostics page in Driverless AI), but we will focus on scorers here.
For a given scorer, Driverless AI optimizes the pipeline to end up with the best possible score for this scorer. The default scorer for regression problems is RMSE (root mean squared error), where 0 is the best possible value. For example, for a dataset containing 4 rows, if actual target values are [1, 1, 10, 0], but predictions are [2, 3, 4, -1], then the RMSE is sqrt((1+4+36+1)/4) ≈ 3.24 and the largest misprediction dominates the overall score (quadratically). Driverless AI will focus on improving the predictions for the third data point, which can be very difficult when hard-to-predict outliers are present in the data. If outliers are not that important to get right, a metric like the MAE (mean absolute error) can lead to better results. For this case, the MAE is (1+2+6+1)/4 = 2.5 and the optimization process will consider all errors equally (linearly). Another scorer that is robust to outliers is RMSLE (root mean squared logarithmic error), which is like RMSE but after taking the logarithm of actual and predicted values - however, it is restricted to positive values. For price predictions, scorers such as MAPE (mean absolute percentage error) or MER (median absolute percentage error) are useful, but have problems with zero or small positive values. SMAPE (symmetric mean absolute percentage error) is designed to improve upon that.
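The arithmetic of the four-row example above can be checked with a few lines of Python. The helper names are illustrative and this is not how Driverless AI implements its scorers; the sketch only shows how much of each score comes from the third data point.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: large errors are penalized quadratically."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [1, 1, 10, 0]
predicted = [2, 3, 4, -1]

print(round(rmse(actual, predicted), 3))  # 3.24
print(mae(actual, predicted))             # 2.5

# Share of the total error caused by the third row (actual 10, predicted 4).
squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
absolute = [abs(a - p) for a, p in zip(actual, predicted)]
share_squared = squared[2] / sum(squared)     # 36 / 42
share_absolute = absolute[2] / sum(absolute)  # 6 / 10
print(share_squared, share_absolute)
```

The third row accounts for 36 of the 42 units of squared error but only 6 of the 10 units of absolute error, which is why an RMSE-driven optimization concentrates on it far more than an MAE-driven one.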
For classification problems, the default scorer is either the AUC (area under the receiver operating characteristic curve) or LOGLOSS (logarithmic loss) for imbalanced problems. LOGLOSS focuses on getting the probabilities right (strongly penalizes wrong probabilities), while AUC is designed for ranking problems. Gini is similar to the AUC, but measures the quality of ranking (inequality) for regression problems. For general imbalanced classification problems, AUCPR and MCC are good choices, while F05, F1 and F2 are designed to balance recall against precision.
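The difference between a ranking scorer and a probability scorer can be illustrated with two sets of predictions that rank the rows identically but differ in how confident they are. The functions below are simplified textbook definitions written for this illustration (pairwise AUC, binary log loss with clipping), not the Driverless AI scorer code, and the data is made up.

```python
import math


def logloss(actual, prob, eps=1e-15):
    """Binary logarithmic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(actual, prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actual)


def auc(actual, prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(actual, prob) if y == 1]
    neg = [p for y, p in zip(actual, prob) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))


actual = [0, 0, 1, 1]
confident = [0.1, 0.2, 0.8, 0.9]   # correct ranking, probabilities near 0 and 1
timid = [0.45, 0.48, 0.52, 0.55]   # same ranking, probabilities near 0.5

print(auc(actual, confident), auc(actual, timid))
print(logloss(actual, confident), logloss(actual, timid))
```

Both prediction sets order the rows perfectly, so AUC cannot tell them apart, while LOGLOSS rewards the set whose probabilities are closer to the actual outcomes.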
We highly suggest experimenting with different scorers and studying their impact on the resulting models. Using the Diagnostics page in Driverless AI, all applicable scores can be computed for any given model, no matter which scorer was used during training.
Knob Settings Tips
H2O Driverless AI lets you customize every experiment in great detail via the expert settings. The most important controls, however, are the three knobs for accuracy, time and interpretability. A higher accuracy setting results in a better estimate of the model generalization performance, usually through using more data, more holdout sets, more parameter tuning rounds and other advanced techniques. Higher time settings mean the experiment is given more time to converge to an optimal solution. Higher interpretability settings reduce the model’s complexity through less feature engineering and simpler models. In general, a setting of 1/1/10 will lead to the simplest and usually least accurate modeling pipeline, while a setting of 10/10/1 will lead to the most complex and most time-consuming experiment possible. Generally, it is sufficient to use settings of 7/5/5 or similar, and we recommend starting with the default settings. We highly recommend studying the experiment preview on the left-hand side of the GUI before each experiment - it can help you fine-tune the settings and save time overall.
Note that you can always finish an experiment early, either by clicking ‘Finish’ to get the deployable final pipeline out, or by clicking ‘Abort’ to instantly terminate the experiment. In either case, the experiment can be continued seamlessly at a later time with ‘Restart from last Checkpoint’ or ‘Retrain Final Pipeline’, and you can always turn the knobs (or modify the expert settings) to adapt to your requirements.
Tips for Running an Experiment
H2O Driverless AI is an automatic machine learning platform designed to create highly accurate modeling pipelines from tabular training data. The predictive performance of the pipeline is a function of both the training data and the parameters of the pipeline (details of feature engineering and modeling). During an experiment, Driverless AI automatically tunes these parameters by scoring candidate pipelines on held-out (“validation”) data. This important validation data is either provided by the user (for experts) or automatically created (random, time-based or fold-based) by Driverless AI. Once a final pipeline has been created, it should be scored on yet another held-out dataset (“test data”) to estimate its generalization performance. Understanding the origin of the training, validation and test datasets (“the validation scheme”) is critical for success with machine learning, and we welcome your feedback and suggestions to help us create the right validation schemes for your use cases.
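The roles of the three datasets can be sketched with a toy example in which the "candidate pipelines" are just two constant predictors. Everything here is an assumption made for illustration (the random split, the split fractions, the candidates, the MAE scorer and the data); it only mirrors the general scheme described above, not what Driverless AI does internally.

```python
import random
import statistics


def three_way_split(rows, valid_frac=0.2, test_frac=0.2, seed=0):
    """Randomly split rows into training, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_valid = int(len(rows) * valid_frac)
    test = rows[:n_test]
    valid = rows[n_test:n_test + n_valid]
    train = rows[n_test + n_valid:]
    return train, valid, test


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical target values with a few outliers.
targets = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100] * 3
train, valid, test = three_way_split(targets)

# Candidate "pipelines": predict a constant fitted on the training data only.
candidates = {"mean": statistics.mean, "median": statistics.median}
fitted = {name: fit(train) for name, fit in candidates.items()}

# Validation data picks the winner ...
valid_scores = {name: mae(valid, [c] * len(valid)) for name, c in fitted.items()}
best = min(valid_scores, key=valid_scores.get)

# ... and the untouched test data estimates how well the winner generalizes.
test_score = mae(test, [fitted[best]] * len(test))
print(best, valid_scores, test_score)
```

A random split like this one is only appropriate when rows are independent; for time-ordered data the split would have to respect the time order.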
Expert Settings Tips
H2O Driverless AI offers a range of ‘Expert Settings’ that let you customize each experiment. For example, you can limit the amount of feature engineering by reducing the value for ‘Feature engineering effort’ or ‘Max. feature interaction depth’ or by disabling ‘Target Encoding’. You can also select the model types to be used for training on the engineered features (such as XGBoost, LightGBM, GLM, TensorFlow, FTRL, or RuleFit). For time-series problems where the selected time_column leads to an error message (this can currently happen if the time structure is not regular enough - we are working on an improved version), you can disable the ‘Time-series lag-based recipe’ and Driverless AI will create train/validation splits based on the time order instead, which can increase the model’s performance if the time column is important.
Driverless AI provides the option to checkpoint experiments to speed up feature engineering and model tuning when running multiple experiments on the same dataset. By default, H2O Driverless AI automatically scans all prior experiments (including aborted ones) for an optimal checkpoint to restart from. You can select a specific prior experiment to restart a new experiment from with “Restart from Last Checkpoint” in the experiment listing page (click on the 3 yellow bars on the right). You can disable checkpointing by setting ‘Feature Brain Level’ in the expert settings (or feature_brain_level in the configuration file) to 0 to force the experiment to start from scratch.
Text Data Tips
For datasets that contain text (string) columns - where each value can be a few words, a paragraph or an entire document - Driverless AI automatically creates NLP features based on bag of words, tf-idf, singular value decomposition and out-of-fold likelihood estimates. In versions 1.3 and above, you can enable TensorFlow in the expert settings to see how CNN (convolutional neural net) based learned word embeddings can improve predictive accuracy even more. Try this for sentiment analysis, document classification, and generic text-enriched datasets.
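As a rough illustration of the bag-of-words and tf-idf features mentioned above, the following sketch computes them by hand. It assumes whitespace tokenization and the plain weighting tf * log(N / df); real implementations, including whatever Driverless AI uses, typically differ in tokenization, smoothing and normalization, and the example documents are made up.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # the bag of words for this document
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["great product great price", "terrible product", "great service"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

Terms that occur in few documents (such as "price") receive a higher weight than equally frequent terms that occur in many documents (such as "product"), which is what makes tf-idf features more informative than raw counts.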