Time Series in Driverless AI

Time-series forecasting is one of the most common and important tasks in business analytics. There are many real-world applications like sales, weather, stock market, and energy demand, just to name a few. At H2O, we believe that automation can help our users deliver business value in a timely manner. Therefore, we combined advanced time series analysis and our Kaggle Grand Masters’ time-series recipes into Driverless AI.

The key features/recipes that make automation possible are:

  • Automatic handling of time groups (e.g., different stores and departments)
  • Robust time-series validation
    • Accounts for gaps and forecast horizon
    • Uses past information only (i.e., no data leakage)
  • Time-series-specific feature engineering recipes
    • Date features like day of week, day of month, etc.
    • AutoRegressive features, like optimal lag and lag-features interaction
    • Different types of exponentially weighted moving averages
    • Aggregation of past information (different time groups and time intervals)
    • Target transformations and differentiation
  • Integration with existing feature engineering functions (recipes and optimization)
  • Rolling-window based predictions for time series experiments with test-time augmentation or re-fit
  • Automatic pipeline generation (See “From Kaggle Grand Masters’ Recipes to Production Ready in a Few Clicks” blog post.)

Understanding Time Series

Modeling Approach

Driverless AI uses GBMs, GLMs and neural networks with a focus on time-series-specific feature engineering. The feature engineering includes:

  • Autoregressive elements: creating lag variables
  • Aggregated features on lagged variables: moving averages, exponential smoothing descriptive statistics, correlations
  • Date-specific features: week number, day of week, month, year
  • Target transformations: Integration/Differentiation, univariate transforms (like logs, square roots)

This approach is combined with AutoDL features as part of the genetic algorithm. The selection is still based on validation accuracy. In other words, the same transformations/genes apply; plus there are new transformations that come from time series. Some transformations (like target encoding) are deactivated.

When running a time-series experiment, Driverless AI builds multiple models by rolling the validation window back in time (and potentially using less and less training data).

User-Configurable Options


The guiding principle for properly modeling a time series forecasting problem is to use the historical data in the model training dataset such that it mimics the data/information environment at scoring time (i.e. deployed predictions). Specifically, you want to partition the training set to account for: 1) the information available to the model when making predictions and 2) the number of units out that the model should be optimized to predict.

Given a training dataset, the gap and forecast horizon are parameters that determine how to split the training dataset into training samples and validation samples.

Gap is the amount of missing time bins between the end of a training set and the start of test set (with regards to time). For example:

  • Assume there are daily data with days 1/1/2019, 2/1/2019, 3/1/2019, 4/1/2019 in train. There are 4 days in total for training.
  • In addition, the test data will start from 6/1/2019. There is only 1 day in the test data.
  • The previous day (5/1/2019) does not belong to the train data. It is a day that cannot be used for training (i.e because information from that day may not be available at scoring time). This day cannot be used to derive information (such as historical lags) for the test data either.
  • Here the time bin (or time unit) is 1 day. This is the time interval that separates the different samples/rows in the data.
  • In summary, there are 4 time bins/units for the train data and 1 time bin/unit for the test data plus the Gap.
  • In order to estimate the Gap between the end of the train data and the beginning of the test data, the following formula is applied.
  • Gap = min(time bin test) - max(time bin train) - 1.
  • In this case min(time bin test) is 6 (or 6/1/2019). This is the earliest (and only) day in the test data.
  • max(time bin train) is 4 (or 4/1/2019). This is the latest (or the most recent) day in the train data.
  • Therefore the GAP is 1 time bin (or 1 day in this case), because Gap = 6 - 4 - 1 or Gap = 1

Forecast Horizon

Quite often, it is not possible to have the most recent data available when applying a model (or it is costly to update the data table too often); hence models need to be built accounting for a “future gap”. For example, if it takes a week to update a certain data table, ideally we would like to predict “7 days ahead” with the data as it is “today”; hence a gap of 7 days would be sensible. Not specifying a gap and predicting 7 days ahead with the data as it is is unrealistic (and cannot happen, as we update the data on a weekly basis in this example). Similarly, gap can be used for those who want to forecast further in advance. For example, users want to know what will happen 7 days in the future, they will set the gap to 7 days.

Forecast Horizon (or prediction length) is the period that the test data spans for (for example, one day, one week, etc.). In other words it is the future period that the model can make predictions for (or the number of units out that the model should be optimized to predict). Forecast horizon is used in feature selection and engineering and in model selection. Note that forecast horizon might not equal the number of predictions. The actual predictions are determined by the test dataset.


The periodicity of updating the data may require model predictions to account for significant time in the future. In an ideal world where data can be updated very quickly, predictions can always be made having the most recent data available. In this scenario there is no need for a model to be able to predict cases that are well into the future, but rather focus on maximizing its ability to predict short term. However this is not always the case, and a model needs to be able to make predictions that span deep into the future because it may be too costly to make predictions every single day after the data gets updated.

In addition, each future data point is not the same. For example, predicting tomorrow with today’s data is easier than predicting 2 days ahead with today’s data. Hence specifying the forecast horizon can facilitate building models that optimize prediction accuracy for these future time intervals.


Note: This is only available in the Python and R clients. Time period in seconds cannot be specified in the UI.

In Driverless AI, the forecast horizon (a.k.a., num_prediction_periods) needs to be in periods, and the size is unknown. To overcome this, you can use the optional time_period_in_seconds parameter when running start_experiment_sync (in Python) or train (in R). This is used to specify the forecast horizon in real time units (as well as for gap.) If this parameter is not specified, then Driverless AI will automatically detect the period size in the experiment, and the forecast horizon value will respect this period. I.e., if you are sure that your data has a 1 week period, you can say num_prediction_periods=14; otherwise it is possible that the model will not work correctly.


Groups are categorical columns in the data that can significantly help predict the target variable in time series problems. For example, one may need to predict sales given information about stores and products. Being able to identify that the combination of store and products can lead to very different sales is key for predicting the target variable, as a big store or a popular product will have higher sales than a small store and/or with unpopular products.

For example, if we don’t know that the store is available in the data, and we try to see the distribution of sales along time (with all stores mixed together), it may look like that:

Sales for all stores

The same graph grouped by store gives a much clearer view of what the sales look like for different stores.

Sales by store


The primary generated time series features are lag features, which are a variable’s past values. At a given sample with time stamp \(t\), features at some time difference \(T\) (lag) in the past are considered. For example, if the sales today are 300, and sales of yesterday are 250, then the lag of one day for sales is 250. Lags can be created on any feature as well as on the target.


As previously noted, the training dataset is appropriately split such that the amount of validation data samples equals that of the testing dataset samples. If we want to determine valid lags, we must consider what happens when we will evaluate our model on the testing dataset. Essentially, the minimum lag size must be greater than the gap size.

Aside from the minimum useable lag, Driverless AI attemps to to discover predictive lag sizes based on auto-correlation.

“Lagging” variables are important in time series because knowing what happened in different time periods in the past can greatly facilitate predictions for the future. Consider the following example to see the lag of 1 and 2 days:

Date Sales Lag1 Lag2
1/1/2018 100 - -
2/1/2018 150 100 -
3/1/2018 160 150 100
4/1/2018 200 160 150
5/1/2018 210 200 160
6/1/2018 150 210 200
7/1/2018 160 150 210
8/1/2018 120 160 150
9/1/2018 80 120 160
10/1/2018 70 80 120

Settings Determined by Driverless AI

Window/Moving Average

Using the above Lag table, a moving average of 2 would constitute the average of Lag1 and Lag2:

Date Sales Lag1 Lag2 MA2
1/1/2018 100 - - -
2/1/2018 150 100 - -
3/1/2018 160 150 100 125
4/1/2018 200 160 150 155
5/1/2018 210 200 160 180
6/1/2018 150 210 200 205
7/1/2018 160 150 210 180
8/1/2018 120 160 150 155
9/1/2018 80 120 160 140
10/1/2018 70 80 120 100

Aggregating multiple lags together (instead of just one) can facilitate stability for defining the target variable. It may include various lags values, for example lags [1-30] or lags [20-40] or lags [7-70 by 7].

Exponential Weighting

Exponential weighting is a form of weighted moving average where more recent values have higher weight than less recent values. That weight is exponentially decreased over time based on an alpha (a) (hyper) parameter (0,1), which is normally within the range of [0.9 - 0.99]. For example:

  • Exponential Weight = a**(time)
  • If sales 1 day ago = 3.0 and 2 days ago =4.5 and a=0.95:
  • Exp. smooth = 3.0*(0.95**1) + 4.5*(0.95**2) / ((0.95**1) + (0.95**2)) =3.73 approx.

Rolling-Window-Based Predictions

Driverless AI supports rolling-window-based predictions for time-series experiments with two options: Test Time Augmentation (TTA) or re-fit.

This process is automated when the test set spans for a longer period than the forecast horizon. If the user does not provide a test set, but then scores one after the experiment is finished, rolling predictions will still be applied as long as the selected horizon is shorter than the test set.

When using Rolling Windows and TTA, Driverless AI takes into account the Prediction Duration and the Rolling Duration.

  • Prediction Duration (PD): This is the duration configured as forecaset horizon while training the Driverless AI experiment. If you don’t want to predict beyond the horizon configured during experiment training using the experiment’s scoring pipeline, then in that case, PD may be the same as Test Data Duration/Horizon and the situation is shown in the previous Horizon image (above).
Note: When using TTA, the prediction duration represents the forecast horizon during experiment training. During scoring, the prediction duration will be the duration of data passed to score for each invocation of the score_batch method of the scoring module.
  • Rolling Duration (RD): This is the amount of duration by which we move ahead (roll) in time before we score again for the next prediction duration data.
Gap with rolling window

Time Series Constraints

Dataset Size

Usually, the forecast horizon (prediction length) \(H\) equals the number of time periods in the testing data \(N_{TEST}\) (i.e. \(N_{TEST} = H\)). You want to have enough training data time periods \(N_{TRAIN}\) to score well on the testing dataset. At a minimum, the training dataset should contain at least three times as many time periods as the testing dataset (i.e. \(N_{TRAIN} >= 3 × N_{TEST}\)). This allows for the training dataset to be split into a validation set with the same amount of time periods as the testing dataset while maintaining enough historical data for feature engineering.

Time Series Use Case: Sales Forecasting

Below is a typical example of sales forecasting based on the Walmart competition on Kaggle. In order to frame it as a machine learning problem, we formulate the historical sales data and additional attributes as shown below:

Raw data

Raw data

Data formulated for machine learning

Machine learning data

The additional attributes are attributes that we will know at time of scoring. In this example, we want to forecast the next week of sales. Therefore, all of the attributes included in our data must be known at least one week in advance. In this case, we assume that we will know whether or not a Store and Department will be running a promotional markdown. We will not use features like the temperature of the Week since we will not have that information at the time of scoring.

Once you have your data prepared in tabular format (see raw data above), Driverless AI can formulate it for machine learning and sort out the rest. If this is your very first session, the Driverless AI assistant will guide you through the journey.

First-time user tour

Similar to previous Driverless AI examples, you need to select the dataset for training/test and define the target. For time-series, you need to define the time column (by choosing AUTO or selecting the date column manually). If weighted scoring is required (like the Walmart Kaggle competition), you can select the column with specific weights for different samples.

Time series experiment settings

If you prefer to use automatic handling of time groups, you can leave the setting for time groups columns as AUTO, or you can define specific time groups. You can also specify the columns that will be unavailable at prediction time, the forecast horizon (in weeks), and the gap (in weeks) between the train and test periods.

Once the experiment is finished, you can make new predictions and download the scoring pipeline just like any other Driverless AI experiments.

Time Series Expert Settings

The user may further configure the time series experiments with a dedicated set of options available through the EXPERT SETTINGS. The EXPERT SETTINGS panel is available from within the experiment page right above the Scorer knob.

Refer to Time Series Settings for information about the available Time Series Settings options.

Time series option from within Expert Settings

Using a Driverless AI Time Series Model to Forecast

When you set the experiment’s forecast horizon, you are telling the Driverless AI experiment the dates this model will be asked to forecast for. In the Walmart Sales example, we set the Driverless AI forecast horizon to 1 (1 week in the future). This means that Driverless AI expects this model to be used to forecast 1 week after training ends. Since the training data ends on 2012-10-26, then this model should be used to score for the week of 2012-11-02.

What should the user do once the 2012-11-02 week has passed?

There are two options:

Option 1: Trigger a Driverless AI experiment to be trained once the forecast horizon ends. A Driverless AI experiment will need to be re-trained every week.

Option 2: Use Test Time Augmentation to update historical features so that we can use the same model to forecast outside of the forecast horizon.

Test Time Augmentation refers to the process where the model stays the same but the features are refreshed using the latest data. In our Walmart Sales Forecasting example, a feature that may be very important is the Weekly Sales from the previous week. Once we move outside of the forecast horizon, our model no longer knows the Weekly Sales from the previous week. By performing Test Time Augmentation, Driverless AI will automatically generate these historical features if new data is provided.

In Option 1, we would launch a new Driverless AI experiment every week with the latest data and use the resulting model to forecast the next week. In Option 2, we would continue using the same Driverless AI experiment outside of the forecast horizon by using Test Time Augmentation.

Both options have their advantages and disadvantages. By re-training an experiment with the latest data, Driverless AI has the ability to possibly improve the model by changing the features used, choosing a different algorithm, and/or selecting different parameters. As the data changes over time, for example, Driverless AI may find that the best algorithm for this use case has changed.

There may be clear advantages for retraining an experiment after each forecast horizon or for using Test Time Augmentation. Refer to this example to see how to use the scoring pipeline to predict future data instead of using the prediction endpoint on the Driverless AI server.

Using Test Time Augmentation to be able to continue using the same experiment over a longer period of time means there would be no need to continually repeat a model review process. The model may become out of date, however, and the MOJO scoring pipeline is not supported.

Scoring Supported Retraining Model Test Time Augmentation
Driverless AI Scoring Supported Supported
Python Scoring Pipeline Supported Supported
MOJO Scoring Pipeline Supported Not Supported

For different use cases, there may be clear advantages for retraining an experiment after each forecast horizon or for using Test Time Augmentation. In this notebook, we show how to perform both and compare the performance: Time Series Model Rolling Window.

How to trigger Test Time Augmentation?

To tell Driverless AI to perform Test Time Augmentation, simply create your forecast data to include any data that occurred after the training data ended up to the date you want a forecast for. The date which you want Driverless AI to forecast should have NA where the target column is. Here is an example of forecasting 2012-11-09.

Date Store Dept Mark Down 1 Mark Down 2 Weekly_Sales
2012-11-02 1 1 -1 -1 $40,000
2012-11-09 1 1 -1 -1 NA

If we do not include an NA in the Target column for the date we are interested in forecasting, then Test Time Augmentation will not be triggered.

Additional Resources

Refer to the following for examples showing how to run Time Series examples in Driverless AI: