Time series in Driverless AI

Time series forecasting is one of the most common and important tasks in business analytics. Many real-world applications exist, including forecasting sales, weather, the stock market, and energy demand, to name a few. You can use the advanced time series analysis capabilities of Driverless AI in combination with time series recipes from H2O’s Kaggle Grandmasters to efficiently deliver business value.

The key features and recipes that make automation possible are:

  • Automatic handling of time groups (for example, different stores and departments).

  • Robust time series validation.

    • Accounts for gaps and forecast horizon.

    • Uses past information only (that is, no data leakage occurs).

  • Time series-specific feature engineering recipes.

    • Date features like day of week and day of month.

    • AutoRegressive features like optimal lag and lag-features interaction.

    • Different types of exponentially weighted moving averages.

    • Aggregation of past information (that is, different time groups and time intervals).

    • Target transformations and differentiation.

  • Integration with existing feature engineering functions (recipes and optimization).

  • Rolling-window based predictions for time series experiments with test-time augmentation or re-fit.

  • Automatic pipeline generation. (For more information, see the “From Kaggle Grandmasters’ Recipes to Production Ready in a Few Clicks” blog post.)

Note: Locale-dependent datetime formats may cause issues in Driverless AI. Converting datetime to a locale-independent format prior to running experiments is recommended. For information on how to convert datetime formats so that they are accepted in DAI, refer to the final note in the Modify by custom data recipe section.

Understanding time series

The following is an in-depth description of time series in Driverless AI. For an overview of best practices when running time series experiments, see Time Series Best Practices.

Modeling approach

Driverless AI uses GBMs, GLMs, and neural networks with a focus on time series-specific feature engineering. This feature engineering includes the following:

  • Autoregressive elements: creating lag variables

  • Aggregated features on lagged variables: moving averages, exponential smoothing, descriptive statistics, correlations

  • Date-specific features: week number, day of week, month, year

  • Target transformations: Integration/Differentiation, univariate transforms (such as logs and square roots)

This approach is combined with AutoDL features as part of the genetic algorithm: selection is still based on validation accuracy, and the same transformations and genes apply. Additionally, new transformations specific to time series are introduced, while some transformations, like target encoding, are deactivated.

When running a time series experiment, Driverless AI builds multiple models by rolling the validation window back in time (and potentially using less and less training data).

User-configurable options

The following sections describe user-configurable options for time series experiments.

Gap

The guiding principle for correctly modeling a time series forecasting problem is to use the historical data in the model training dataset so that it mimics the data or information environment at scoring time (that is, deployed predictions). Specifically, the training set must be partitioned to account for the following:

  1. The information available to the model when making predictions.

  2. The number of units out that the model should be optimized to predict.

Given a training dataset, the gap and forecast horizon are parameters that determine how to split the training dataset into training samples and validation samples.

Gap is the number of missing time bins between the end of the training set and the start of the test set (with regard to time). For example:

  1. Assume there is daily data with days 1/1/2022, 1/2/2022, 1/3/2022, and 1/4/2022 in train. There are 4 days in total for training.

  2. The test data starts from 1/6/2022. There is only 1 day in the test data.

  • The previous day (1/5/2022) does not belong to the train data. This day cannot be used for training, since information from that day may not be available at scoring time. This day cannot be used to derive information (such as historical lags) for the test data either.

  • The time bin (or time unit) in this example is 1 day. This is the time interval that separates the different samples/rows in the data.

  • In summary, there are 4 time bins/units for the train data and 1 time bin/unit for the test data, plus the Gap.

  • To estimate the Gap between the end of the train data and the beginning of the test data, the following formula is applied: Gap = min(time bin test) - max(time bin train) - 1.

  • In this case, min(time bin test) is 6 (or 1/6/2022), the earliest (and only) day in the test data, and max(time bin train) is 4 (or 1/4/2022), the latest (most recent) day in the train data.

  • Therefore, the Gap is 1 time bin (or 1 day in this case), because Gap = 6 - 4 - 1 = 1.
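The gap formula above can be written out in a few lines of Python. The helper function below is purely illustrative (it is not part of the Driverless AI API); time bins are expressed as integer day indices:

```python
def compute_gap(train_bins, test_bins):
    """Gap = min(time bin test) - max(time bin train) - 1."""
    return min(test_bins) - max(train_bins) - 1

train_days = [1, 2, 3, 4]   # 1/1/2022 .. 1/4/2022
test_days = [6]             # 1/6/2022
print(compute_gap(train_days, test_days))  # -> 1
```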

Time series gaps

Forecast horizon

It’s often not possible to have the most recent data available when applying a model (or it’s costly to update the data table too often); therefore, some models need to be built accounting for a “future gap.” For example, if it takes a week to update a specific data table, you ideally want to predict 7 days ahead with the data as it is “today”; therefore, a gap of 6 days is recommended. Not specifying a gap and predicting 7 days ahead with the data as it is would be unrealistic (and cannot happen, as the data is updated on a weekly basis in this example). Similarly, the gap can be used if you want to forecast further in advance. For example, if you want to know what will happen 7 days in the future, then set the gap to 6 days.

Forecast Horizon (or prediction length) is the period that the test data spans for (for example, one day, one week, etc.). In other words, it is the future period that the model can make predictions for (or the number of units out that the model should be optimized to predict). Forecast horizon is used in feature selection and engineering and in model selection. Note that forecast horizon might not equal the number of predictions. The actual predictions are determined by the test dataset.

Horizon

The periodicity of updating the data may require model predictions to account for significant time in the future. In an ideal world where data can be updated very quickly, predictions can always be made having the most recent data available. In this scenario there is no need for a model to be able to predict cases that are well into the future, but rather focus on maximizing its ability to predict short term. However, this is not always the case, and a model needs to be able to make predictions that span deep into the future because it may be too costly to make predictions every single day after the data gets updated.

In addition, each future data point is not the same. For example, predicting tomorrow with today’s data is easier than predicting 2 days ahead with today’s data. Hence specifying the forecast horizon can facilitate building models that optimize prediction accuracy for these future time intervals.

Prediction intervals

For regression problems, enable the prediction_intervals expert setting to have Driverless AI provide two additional columns y.lower and y.upper in the prediction frame. The true target value y for a predicted sample is expected to lie within [y.lower, y.upper] with a certain probability. The default value for this confidence level can be specified with the prediction_intervals_alpha expert setting, which has a default value of 0.9.

Driverless AI uses holdout predictions to determine intervals empirically (Williams, W.H. and Goodman, M.L. “A Simple Method for the Construction of Empirical Confidence Limits for Economic Forecasts.” Journal of the American Statistical Association, 66, 752-754. 1971). This method makes no assumption about the underlying model or the distribution of error and has been shown to outperform many other approaches (Lee, Yun Shin and Scholtes, Stefan. “Empirical prediction intervals revisited.” International Journal of Forecasting, 30, 217-234. 2014).
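The empirical-interval idea can be illustrated with a minimal sketch: the quantiles of the holdout residuals define an interval around each new prediction. This is an illustration of the general method (Williams & Goodman, 1971), not Driverless AI's internal implementation:

```python
def empirical_interval(holdout_actual, holdout_pred, new_pred, alpha=0.9):
    """Return (y.lower, y.upper) from holdout residual quantiles."""
    residuals = sorted(a - p for a, p in zip(holdout_actual, holdout_pred))
    n = len(residuals)
    # Symmetric tails: (1 - alpha) / 2 probability mass on each side.
    lo_idx = int(((1 - alpha) / 2) * (n - 1))
    hi_idx = int((1 - (1 - alpha) / 2) * (n - 1))
    return new_pred + residuals[lo_idx], new_pred + residuals[hi_idx]

y_lower, y_upper = empirical_interval(
    holdout_actual=[10, 12, 9, 11, 13],
    holdout_pred=[11, 11, 10, 11, 12],
    new_pred=20.0)
print(y_lower, y_upper)  # -> 19.0 21.0
```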

Notes:

  • This feature applies to regression tasks (i.i.d. and time series).

  • This feature works with all model types.

  • MOJO support is not currently implemented for this feature.

  • Prediction intervals are computed for each individual time group.

time_period_in_seconds

In Driverless AI, the forecast horizon (a.k.a. num_prediction_periods) must be expressed in periods, and the period size may not be known in advance. To overcome this, you can use the optional time_period_in_seconds parameter when running start_experiment_sync (in Python) or train (in R). This lets you specify the forecast horizon (as well as the gap) in real time units. If this parameter is not specified, then Driverless AI automatically detects the period size in the experiment, and the forecast horizon value respects this period. That is, if you are sure that your data has a 1 week period, you can say num_prediction_periods=14; otherwise it is possible that the model will not work correctly.
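To make the idea of a "period size" concrete, the sketch below infers one from raw timestamps by taking the most common interval between consecutive records. This mimics the concept of automatic period detection but is not Driverless AI's own detection logic:

```python
from collections import Counter
from datetime import datetime

def detect_period_seconds(timestamps):
    """Most common interval (in seconds) between consecutive timestamps."""
    ts = sorted(timestamps)
    deltas = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    return Counter(deltas).most_common(1)[0][0]

# Daily data with a gap on 1/5/2022 (as in the Gap example above).
days = [datetime(2022, 1, d) for d in (1, 2, 3, 4, 6)]
print(detect_period_seconds(days))  # -> 86400.0 (one day)
```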

Groups

Groups are categorical columns in the data that can significantly help predict the target variable in time series problems. For example, given information about stores and products, you may need to forecast sales. Recognizing that the combination of store and products can result in very different sales is critical for predicting the target variable, as a large store or a popular product will have higher sales than a small store and/or unpopular products.

For example, if you don’t know that the store is available in the data and you attempt to see the distribution of sales over time (with all stores mixed together), it may look as follows:

Sales for all stores

The same graph grouped by store gives a much clearer view of what the sales look like for different stores:

Sales by store

Lag

The primary generated time series features are lag features, which are a variable’s past values. At a given sample with time stamp \(t\), features at some time difference \(T\) (lag) in the past are considered. For example, if the sales today are 300, and sales of yesterday are 250, then the lag of one day for sales is 250. Lags can be created on any feature as well as on the target.
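Lag creation for a single series can be sketched in plain Python (Driverless AI does this automatically, per time group; the helper below is illustrative only):

```python
def make_lag(values, lag):
    """Shift a series forward by `lag` steps; the first `lag` entries are unknown."""
    return [None] * lag + values[:-lag]

sales = [250, 300]          # yesterday, today
lag1 = make_lag(sales, 1)   # today's Lag1 is yesterday's sales: 250
print(lag1)  # -> [None, 250]
```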

Lag

As previously noted, the training dataset is appropriately split such that the number of validation data samples equals that of the testing dataset samples. If you want to determine valid lags, you must consider what happens when you evaluate your model on the testing dataset. Essentially, the minimum lag size must be greater than the gap size.

Aside from the minimum usable lag, Driverless AI attempts to discover predictive lag sizes based on auto-correlation.

“Lagging” variables are important in time series because knowing what happened in different time periods in the past can greatly facilitate predictions for the future. Consider the following example to see the lag of 1 and 2 days:

Date        Sales   Lag1   Lag2
1/1/2020    100     -      -
2/1/2020    150     100    -
3/1/2020    160     150    100
4/1/2020    200     160    150
5/1/2020    210     200    160
6/1/2020    150     210    200
7/1/2020    160     150    210
8/1/2020    120     160    150
9/1/2020    80      120    160
10/1/2020   70      80     120

Settings determined by Driverless AI

Using the preceding Lag table, a moving average of 2 would constitute the average of Lag1 and Lag2:

Date        Sales   Lag1   Lag2   MA2
1/1/2020    100     -      -      -
2/1/2020    150     100    -      -
3/1/2020    160     150    100    125
4/1/2020    200     160    150    155
5/1/2020    210     200    160    180
6/1/2020    150     210    200    205
7/1/2020    160     150    210    180
8/1/2020    120     160    150    155
9/1/2020    80      120    160    140
10/1/2020   70      80     120    100

Aggregating multiple lags together (instead of just one) can facilitate stability for defining the target variable. It may include various lags values, for example lags [1-30] or lags [20-40] or lags [7-70 by 7].
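The MA2 column above can be reproduced with a short sketch that averages Lag1 and Lag2 wherever both exist (illustrative helpers, not the Driverless AI transformer):

```python
def make_lag(values, lag):
    return [None] * lag + values[:-lag]

def moving_average(values, window=2):
    """Average of lags 1..window; None where any lag is unknown."""
    lags = [make_lag(values, k) for k in range(1, window + 1)]
    return [sum(col) / window if None not in col else None
            for col in zip(*lags)]

sales = [100, 150, 160, 200, 210, 150, 160, 120, 80, 70]
print(moving_average(sales))
# -> [None, None, 125.0, 155.0, 180.0, 205.0, 180.0, 155.0, 140.0, 100.0]
```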

Exponential weighting

Exponential weighting is a form of weighted moving average where more recent values have higher weight than less recent values. That weight is exponentially decreased over time based on an alpha (a) hyperparameter in (0, 1), which is normally within the range [0.9, 0.99]. For example:

  • Exponential Weight = a**(time)

  • If sales 1 day ago = 3.0 and 2 days ago =4.5 and a=0.95:

  • Exp. smooth = (3.0*(0.95**1) + 4.5*(0.95**2)) / ((0.95**1) + (0.95**2)) = 3.73 approx.
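The worked example above can be written out as a small function (illustrative only; `past_values[0]` is one step back, `past_values[1]` two steps back, and so on):

```python
def exp_weighted_mean(past_values, a=0.95):
    """Exponentially weighted average of past values, most recent first."""
    weights = [a ** (t + 1) for t in range(len(past_values))]
    return sum(v * w for v, w in zip(past_values, weights)) / sum(weights)

result = exp_weighted_mean([3.0, 4.5])  # sales 1 day ago = 3.0, 2 days ago = 4.5
print(round(result, 2))  # -> 3.73
```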

Time series constraints

Dataset size

Usually, the forecast horizon (prediction length) \(H\) equals the number of time periods in the testing data \(N_{TEST}\) (i.e. \(N_{TEST} = H\)). You want to have enough training data time periods \(N_{TRAIN}\) to score well on the testing dataset. At a minimum, the training dataset should contain at least three times as many time periods as the testing dataset (i.e. \(N_{TRAIN} >= 3 × N_{TEST}\)). This allows for the training dataset to be split into a validation set with the same amount of time periods as the testing dataset while maintaining enough historical data for feature engineering.

Missing values in time series experiments

The DAI time series recipe does not allow missing values in the target column during training; rows with a missing target are dropped. If there are missing rows or missing feature values, DAI does not perform zero imputation and simply uses the available information. Some of the time series transformers, such as EWMA, tend to regress toward the mean when values are missing. If DAI detects duplicate timestamps, the user receives a warning.

Use a Driverless AI time series model to forecast

Rolling-window-based predictions

When you set the experiment’s forecast horizon, you are telling the Driverless AI experiment the dates this model will be asked to forecast for. Driverless AI supports rolling-window-based predictions for time series experiments with two options: Test Time Augmentation (TTA) or re-fit. Both options are useful to assess the performance of the pipeline for predicting not just a single forecast horizon, but many in succession.

  • Option 1: Trigger a Driverless AI experiment to be re-trained once the forecast horizon ends (for example, every week for weekly data).

  • Option 2: Use Test Time Augmentation (TTA) to update historical features so that you can use the same model to forecast outside of the forecast horizon.

Both options have their advantages and disadvantages. TTA simulates the process where the model stays the same, but the features are refreshed using newly available data. Re-fit simulates the process of re-fitting the entire pipeline (including the model) once new data is available. Re-fit is only applicable for test sets provided during an experiment. On the other hand, by retraining an experiment with the latest data, Driverless AI has the opportunity to improve the model by changing the features used, choosing a different algorithm, and/or selecting different parameters. As the data changes over time, for example, Driverless AI may find that the best algorithm for the use case has changed.

Notes:

  • Scorers cannot refit or retrain a model.

  • To specify a method for creating rolling test set predictions, use the Method to Create Rolling Test Set Predictions expert setting. Note that refitting performed with this expert setting is only applied to the test set that is provided by the user during an experiment. The final scoring pipeline always uses TTA.

Test Time Augmentation (TTA)

TTA simulates the process where the model stays the same but the features are refreshed using newly available data. This process is automated when the test set spans for a longer period than the forecast horizon and if the target values of the test set are known. If the user scores a test set that meets these conditions after the experiment is finished, rolling predictions with TTA will be applied. For example, in forecasting weekly sales data a feature that may be very important is the weekly sales from the previous week. Once you move outside of the forecast horizon, your model no longer knows the sales from the previous week. By performing TTA, Driverless AI will automatically generate these historical features if new data is provided.

Using TTA to continue using the same experiment over a longer period of time means there is no longer any need to continually repeat a model review process. However, it is possible for the model to become out of date.

Rolling window with TTA
Rolling window with re-fit

TTA is the default option and can be changed with the Method to Create Rolling Test Set Predictions expert setting.

The following table lists several scoring methods and whether they support TTA:

Scoring Method            Test Time Augmentation Support
Driverless AI Scorer      Supported
Python Scoring Pipeline   Supported
MOJO Scoring Pipeline     Not Supported

Fast TTA

Fast TTA lets Driverless AI score datasets that are longer than the horizon in a single pass. For this reason, it is much faster than performing rolling predictions. This performance increase comes at the expense of added randomness: to account for values that are missing in the validation or test datasets (but available in the training dataset), DAI randomly drops values in the training dataset to match the same proportion. This balances the leakage introduced by augmenting the data with feature values that were not available when the model was trained. Therefore, Fast TTA is good if you’re only interested in the score, but not as good if you want to look at individual predictions.

Fast TTA
Fast TTA continued

Triggering Test Time Augmentation

To perform Test Time Augmentation, create your forecast data to include any data that occurred after the training data ended, up to the dates you want a forecast for. The dates that you want Driverless AI to forecast should have missing values (NAs) in the target column. Target values for the remaining dates must be filled in.

The following is an example of forecasting for 2020-11-23 and 2020-11-30 with the remaining dates being used for TTA:

Date         Store   Dept   Mark Down 1   Mark Down 2   Weekly_Sales
2020-11-02   1       1      -1            -1            $35,000
2020-11-09   1       1      -1            -1            $40,000
2020-11-16   1       1      -1            -1            $45,000
2020-11-23   1       1      -1            -1            NA
2020-11-30   1       1      -1            -1            NA

Notes:

  • Although TTA can span any length of time into the future, the dates that are being predicted cannot exceed the horizon.

  • If the date being forecasted contains any non-missing value in the target column, then TTA is not triggered for that row.
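A scoring frame like the one in this example can be assembled programmatically; the sketch below uses the same illustrative column names, with `None` standing in for the NA targets on the dates to forecast:

```python
# Rows after training ended carry known targets; forecast dates carry None (NA).
rows = [
    {"Date": "2020-11-02", "Store": 1, "Dept": 1, "Weekly_Sales": 35000},
    {"Date": "2020-11-09", "Store": 1, "Dept": 1, "Weekly_Sales": 40000},
    {"Date": "2020-11-16", "Store": 1, "Dept": 1, "Weekly_Sales": 45000},
    {"Date": "2020-11-23", "Store": 1, "Dept": 1, "Weekly_Sales": None},  # forecast
    {"Date": "2020-11-30", "Store": 1, "Dept": 1, "Weekly_Sales": None},  # forecast
]

forecast_dates = [r["Date"] for r in rows if r["Weekly_Sales"] is None]
print(forecast_dates)  # -> ['2020-11-23', '2020-11-30']
```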

Forecasting future dates

To forecast or predict future dates, upload a dataset that contains the future dates of interest and provide additional information such as group IDs or features known in the future. The dataset can then be used to run and score your predictions.

The following is an example of a model that was trained up to 2020-05-31:

Date         Group_ID   Known_Feature_1   Known_Feature_2
2020-06-01   A          3                 1
2020-06-02   A          2                 2
2020-06-03   A          4                 1
2020-06-01   B          3                 0
2020-06-02   B          2                 1
2020-06-03   B          4                 0

Refer to this example to see how to use the scoring pipeline to predict future data instead of using the prediction endpoint on the Driverless AI server.

More about Unavailable Columns at Time of Prediction

The Unavailable Columns at Prediction Time (UCAPT) option is a way to mark features that will not be available in the test dataset or at the time of prediction but might still be predictive when looking at historical values. These features will only be used in historical feature engineering recipes, such as Lagging or Exponential Weighted Moving Average.

For example, if you’re predicting the sales amount each day, you may have the number of customers each day as a feature in your training dataset. In the future, you won’t know how many customers will be coming into the store, making this a leaky feature to use. However, the average number of customers last week may be predictive and is something that you can calculate ahead of time. In this case, looking at the historical values would be better than just dropping the feature.

The default value for this setting is often –, meaning that all features can be used as they are. If you include a test dataset before selecting a time column, and that test dataset is missing any columns, then the default for Unavailable Columns at Prediction Time is a number: the count of columns that are in the training dataset but not in the test dataset. All of these features are only looked at historically, and you can see a list of them by clicking on this setting.

Time series Expert Settings

You can further configure time series experiments with a dedicated set of options available through the Expert Settings panel. This panel is available from within the experiment page right above the Scorer knob.

For more information on these settings, see Time Series Settings.

Time series option from within Expert Settings

Time series use case: Sales forecasting

H2O Driverless AI handles time-series forecasting problems out of the box.

You can start an individual time series experiment from the Experiment Setup page. To do this, provide a regular columnar dataset containing your features and then pick a target column and time column. (A time column is a designated column containing time stamps for every record (row), such as “April 10 2019 09:13:41” or “2019/04/10”.) If you have a test set for which you want predictions for every record, make sure to provide future time stamps and features as well.

In most cases, this is all you have to do. You can launch the experiment and let Driverless AI do the rest. It will even auto-detect multiple time series in the same dataset for different groups such as weekly sales for stores and departments (by finding the columns that identify stores and departments to group by). Driverless AI will also auto-detect the time period including potential gaps during weekends, as well as the forecast horizon, a possible time gap between training and testing time periods (to optimize for deployment delay) and even keeps track of holiday calendars. Of course, it automatically creates multiple causal time-based validation splits (sliding time windows) for proper validation, and incorporates many other related grand-master recipes such as automatic target and non-target lag feature generation as well as interactions between lags, first and second derivatives and exponential smoothing.

The following is a typical example of sales forecasting based on the Walmart competition on Kaggle. To frame it as a machine learning problem, the historical sales data and additional attributes are formulated as follows:

Raw data

Raw data

Data formulated for machine learning

Machine learning data

The additional attributes are attributes that are known at the time of scoring. In this example, the goal is to forecast the next week’s sales. All the attributes included in the data must be known at least one week in advance. In this case, you can assume that you will know if a store and department will be running a promotional markdown. Features like the temperature of the week are not used because that information is not available at the time of scoring.

Once you’ve prepared your data in tabular format (see raw data above), Driverless AI can formulate it for machine learning and sort out the rest. If this is your first session, the Driverless AI assistant walks you through the process.

Similar to previous Driverless AI examples, you need to select the dataset for training/test and define the target. For time series, you need to define the time column (by choosing AUTO or selecting the date column manually). If weighted scoring is required (as in the Walmart Kaggle competition), you can select the column with specific weights for different samples.

If you prefer to use automatic handling of time groups, you can leave the setting for time group columns as AUTO, or you can define specific time groups. You can also specify the columns that will be unavailable at prediction time (see More about Unavailable Columns at Time of Prediction above for more information), the forecast horizon (in weeks), and the gap (in weeks) between the train and test periods.

Once the experiment is finished, you can make new predictions and download the scoring pipeline just like any other Driverless AI experiment.

Additional resources

For more information on running time series experiments in Driverless AI, refer to the following resources: