# Time Series in Driverless AI

Time-series forecasting is one of the most common and important tasks in business analytics, with real-world applications such as sales, weather, stock market, and energy demand forecasting. At H2O, we believe that automation can help our users deliver business value in a timely manner. Therefore, we translated our Kaggle Grand Masters’ time-series recipes into Driverless AI.

The key features/recipes that make automation possible are:

• Automatic handling of time groups (e.g., different stores and departments)
• Robust time-series validation
  • Accounts for gaps and forecast horizon
  • Uses past information only (i.e., no data leakage)
• Time-series-specific feature engineering recipes
  • Date features like day of week, day of month, etc.
  • AutoRegressive features, like optimal lag and lag-features interaction
  • Different types of exponentially weighted moving averages
  • Aggregation of past information (different time groups and time intervals)
  • Target transformations and differentiation
• Integration with existing feature engineering functions (recipes and optimization)
• Automatic pipeline generation (See “From Kaggle Grand Masters’ Recipes to Production Ready in a Few Clicks” blog post.)

## Understanding Time Series

### Modeling Approach

Driverless AI uses GBMs with a focus on time-series-specific feature engineering. The feature engineering includes:

• Autoregressive elements: creating lag variables
• Aggregated features on lagged variables: moving averages, exponential smoothing, descriptive statistics, correlations
• Date-specific features: week number, day of week, month, year
• Target transformations: Integration/Differentiation, univariate transforms (like logs, square roots)
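As a rough illustration of the date-specific features and target transformations listed above, the following minimal sketch uses only the Python standard library. The function names (`date_features`, `difference`, `log_transform`) are hypothetical and are not part of Driverless AI; the product's own transformers may compute these features differently.

```python
from datetime import date
import math


def date_features(d: date) -> dict:
    """Derive simple calendar features from a single date."""
    _iso_year, iso_week, iso_weekday = d.isocalendar()
    return {
        "year": d.year,
        "month": d.month,
        "week_number": iso_week,     # ISO 8601 week number
        "day_of_week": iso_weekday,  # 1 = Monday ... 7 = Sunday
    }


def difference(values):
    """First-order differencing: y[t] - y[t-1] (one element shorter)."""
    return [b - a for a, b in zip(values, values[1:])]


def log_transform(values):
    """Univariate target transform: log(1 + y), defined for y >= 0."""
    return [math.log1p(v) for v in values]


features = date_features(date(2018, 1, 10))
diffs = difference([100, 150, 160, 200])
logged = log_transform([0, 100])
```

Differencing removes a level or trend from the target, while the log transform compresses large values; both are applied before modeling and inverted when producing predictions.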

This approach is combined with AutoDL features as part of the genetic algorithm. The selection is still based on validation accuracy. In other words, the same transformations/genes apply, plus new transformations that come from time series. Some transformations (like target encoding) are deactivated.

### User-Configurable Options

#### Gap and Horizon

The guiding principle for properly modeling a time series forecasting problem is to use the historical data in the model training dataset such that it mimics the data/information environment at scoring time (i.e. deployed predictions). Specifically, you want to partition the training set to account for: 1) the information available to the model when making predictions and 2) the length of predictions to make.

Given a training dataset, gap and prediction length are parameters that determine how to split the training dataset into training samples and validation samples.

Gap is the number of missing time bins between the end of the training set and the start of the test set (with respect to time). For example:

• Assume you have daily data with days 1/1, 1/2, 1/3, 1/4 in train.
• The corresponding time bins would be 1, 2, 3, 4 for a time period of 1 day.
• Given that, the first valid time bin to predict is 5.
• As a result, Gap = min(time bin test) - max(time bin train) - 1, so a test set that starts at time bin 5 has a gap of 0.
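The time-bin example can be sketched in a few lines of Python. This sketch takes the gap to be the number of whole time bins strictly between the last training bin and the first test bin; the helper names `time_bin` and `gap_in_bins` are hypothetical and only serve the illustration.

```python
from datetime import date


def time_bin(d: date, origin: date, period_days: int = 1) -> int:
    """Map a date to a 1-based time bin of width `period_days`."""
    return (d - origin).days // period_days + 1


def gap_in_bins(train_dates, test_dates, period_days: int = 1) -> int:
    """Number of missing time bins between the end of train and the start of test."""
    origin = min(train_dates)
    last_train = max(time_bin(d, origin, period_days) for d in train_dates)
    first_test = min(time_bin(d, origin, period_days) for d in test_dates)
    return first_test - last_train - 1


# Daily data: 1/1 to 1/4 in train, i.e. time bins 1, 2, 3, 4.
train = [date(2018, 1, day) for day in (1, 2, 3, 4)]

# Test starts on 1/5 (time bin 5): no missing bins.
gap_none = gap_in_bins(train, [date(2018, 1, 5), date(2018, 1, 6)])

# Test starts on 1/12 (time bin 12): bins 5 through 11 are missing.
gap_week = gap_in_bins(train, [date(2018, 1, 12)])
```

Here `gap_none` is 0 and `gap_week` is 7.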

Quite often, the most recent data is not available when a model is applied (or it is costly to update the data table too often); hence models need to be built to account for a “future gap”. For example, if it takes a week to update a certain data table, ideally we would like to predict “7 days ahead” with the data as it is “today”; hence a gap of 7 days would be sensible. Not specifying a gap and predicting 7 days ahead with the data as it will be 7 days from now is unrealistic (and cannot happen, because the data is updated only weekly in this example).

Similarly, gap can be used by those who want to forecast further in advance. For example, users who want to know what will happen 7 days in the future can set the gap to 7 days.

Horizon (or prediction length) is the period that the test data spans (for example, one day, one week, etc.). In other words, it is the future period that the model can make predictions for.

The periodicity of updating the data may require model predictions to account for significant time in the future. In an ideal world where data can be updated very quickly, predictions can always be made with the most recent data available. In this scenario, there is no need for a model to predict cases that are well into the future; it can instead focus on maximizing its ability to predict the short term. However, this is not always the case, and a model may need to make predictions that span deep into the future because it may be too costly to make predictions every single day after the data gets updated.

In addition, not all future data points are equally easy to predict. For example, predicting tomorrow with today’s data is easier than predicting 2 days ahead with today’s data. Hence, specifying the horizon can facilitate building models that optimize prediction accuracy for these future time intervals.
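To make the roles of gap and horizon concrete, here is a minimal sketch of how a history of time bins could be partitioned into training and validation parts. It is an assumption-laden illustration, not the splitting logic Driverless AI actually uses; `split_with_gap` is a hypothetical helper.

```python
def split_with_gap(time_bins, gap: int, horizon: int):
    """Split time bins into (train, validation).

    The validation part is the last `horizon` bins. The `gap` bins
    immediately before it are dropped, mimicking data that would not
    yet be available at scoring time.
    """
    if gap < 0 or horizon < 1:
        raise ValueError("gap must be >= 0 and horizon must be >= 1")
    bins = sorted(set(time_bins))
    if len(bins) < gap + horizon + 1:
        raise ValueError("not enough time bins for this gap and horizon")
    validation = bins[-horizon:]
    train = bins[: len(bins) - horizon - gap]
    return train, validation


# 30 daily bins, data arrives a week late, and we forecast 3 days at a time.
train_bins, valid_bins = split_with_gap(range(1, 31), gap=7, horizon=3)
```

With these inputs, training ends at bin 20 and validation covers bins 28 through 30, so the model is evaluated under the same information delay it will face when deployed.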

#### Groups

Groups are categorical columns in the data that can significantly help predict the target variable in time series problems. For example, suppose you need to predict sales and have information on stores and products. Recognizing that different combinations of store and product can lead to very different sales is key to predicting the target variable, because a big store or a popular product will have higher sales than a small store or an unpopular product.

For example, if the store column is ignored and sales are plotted over time with all stores mixed together, the result may look like this:

The same graph grouped by store gives a much clearer view of what the sales look like for different stores.
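The effect of grouping can also be seen numerically. The sketch below uses made-up sales figures and hypothetical store names (`big_store`, `small_store`); it only illustrates why one mixed series is harder to read than one series per group.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical rows: (date, store, sales)
rows = [
    ("2018-01-01", "big_store", 1000),
    ("2018-01-01", "small_store", 100),
    ("2018-01-02", "big_store", 1100),
    ("2018-01-02", "small_store", 90),
    ("2018-01-03", "big_store", 1050),
    ("2018-01-03", "small_store", 110),
]


def series_by_group(rows):
    """Return one time-ordered sales series per store."""
    groups = defaultdict(list)
    for _day, store, sales in sorted(rows):
        groups[store].append(sales)
    return dict(groups)


# All stores mixed together: the values jump between two very different levels.
mixed = [sales for _day, _store, sales in sorted(rows)]
mixed_spread = pstdev(mixed)

# One series per store: each series is comparatively stable.
per_store = series_by_group(rows)
store_means = {store: mean(values) for store, values in per_store.items()}
store_spreads = {store: pstdev(values) for store, values in per_store.items()}
```

In this toy data the spread of the mixed series is dominated by the difference between stores, while each per-store series varies only slightly around its own level.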

#### Lag

The primary generated time series features are lag features, which are a variable’s past values. At a given sample with time stamp $$t$$, features at some time difference $$T$$ (lag) in the past are considered. For example, if sales today are 300 and sales yesterday were 250, then the one-day lag of sales is 250. Lags can be created on any feature as well as on the target.

As previously noted, the training dataset is split so that the number of validation samples equals the number of testing dataset samples. To determine valid lags, we must consider what happens when the model is evaluated on the testing dataset: values that fall inside the gap are not available at scoring time. Essentially, the minimum lag size must be greater than the gap size.

Aside from the minimum usable lag, Driverless AI attempts to discover predictive lag sizes based on auto-correlation.
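One simple way to combine both ideas (lags must exceed the gap, and useful lags tend to show strong auto-correlation) is sketched below. This is an assumption for illustration only: the source does not describe the actual lag search in Driverless AI, and `autocorrelation` and `candidate_lags` are hypothetical helpers.

```python
def autocorrelation(series, lag: int) -> float:
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    mu = sum(series) / n
    denom = sum((x - mu) ** 2 for x in series)
    num = sum((series[t] - mu) * (series[t - lag] - mu) for t in range(lag, n))
    return num / denom


def candidate_lags(series, gap: int, max_lag: int, top_k: int = 3):
    """Rank lags in (gap, max_lag] by absolute autocorrelation.

    Lags less than or equal to the gap are excluded because those
    values would not be available at scoring time.
    """
    lags = range(gap + 1, max_lag + 1)
    ranked = sorted(lags, key=lambda k: abs(autocorrelation(series, k)), reverse=True)
    return ranked[:top_k]


# Made-up daily series with a weekly pattern (two high days per week).
weekly = [10, 12, 11, 13, 12, 30, 28] * 6
best = candidate_lags(weekly, gap=2, max_lag=14, top_k=2)
```

For this series the weekly lags 7 and 14 rank highest, and lags 1 and 2 are never considered because they fall inside the gap.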

“Lagging” variables are important in time series because knowing what happened in different time periods in the past can greatly facilitate predictions for the future. Consider the following example to see the lag of 1 and 2 days:

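| Date | Sales | Lag1 | Lag2 |
|---|---|---|---|
| 1/1/2018 | 100 | - | - |
| 2/1/2018 | 150 | 100 | - |
| 3/1/2018 | 160 | 150 | 100 |
| 4/1/2018 | 200 | 160 | 150 |
| 5/1/2018 | 210 | 200 | 160 |
| 6/1/2018 | 150 | 210 | 200 |
| 7/1/2018 | 160 | 150 | 210 |
| 8/1/2018 | 120 | 160 | 150 |
| 9/1/2018 | 80 | 120 | 160 |
| 10/1/2018 | 70 | 80 | 120 |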
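The Lag1 and Lag2 columns can be reproduced by shifting the sales series. The following is a minimal sketch with a hypothetical `lag` helper; missing history is represented as `None` rather than "-".

```python
def lag(values, k: int):
    """Shift `values` k steps into the past; None where no history exists."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return ([None] * k + list(values))[: len(values)]


sales = [100, 150, 160, 200, 210, 150, 160, 120, 80, 70]
lag1 = lag(sales, 1)
lag2 = lag(sales, 2)
```

When the data contains several groups (for example, stores), the shift would be applied within each group's own series so that values do not leak from one group into another.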

### Settings Determined by Driverless AI

#### Window/Moving Average

Using the above Lag table, a moving average of 2 would constitute the average of Lag1 and Lag2:

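| Date | Sales | Lag1 | Lag2 | MA2 |
|---|---|---|---|---|
| 1/1/2018 | 100 | - | - | - |
| 2/1/2018 | 150 | 100 | - | - |
| 3/1/2018 | 160 | 150 | 100 | 125 |
| 4/1/2018 | 200 | 160 | 150 | 155 |
| 5/1/2018 | 210 | 200 | 160 | 180 |
| 6/1/2018 | 150 | 210 | 200 | 205 |
| 7/1/2018 | 160 | 150 | 210 | 180 |
| 8/1/2018 | 120 | 160 | 150 | 155 |
| 9/1/2018 | 80 | 120 | 160 | 140 |
| 10/1/2018 | 70 | 80 | 120 | 100 |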

Aggregating multiple lags together (instead of using just one) can make the resulting feature more stable for predicting the target variable. The aggregation may span various lag values, for example lags [1-30], lags [20-40], or lags [7-70 by 7].
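A moving average over an arbitrary set of lags can be sketched as follows. The helpers `lag` and `lagged_mean` are hypothetical names used only for this illustration; with lags 1 and 2 the result matches the MA2 column of the example table.

```python
def lag(values, k: int):
    """Shift `values` k steps into the past; None where no history exists."""
    return ([None] * k + list(values))[: len(values)]


def lagged_mean(values, lags):
    """Average the given lags row by row; None until every lag is available."""
    columns = [lag(values, k) for k in lags]
    result = []
    for row in zip(*columns):
        if any(v is None for v in row):
            result.append(None)
        else:
            result.append(sum(row) / len(row))
    return result


sales = [100, 150, 160, 200, 210, 150, 160, 120, 80, 70]
ma2 = lagged_mean(sales, lags=[1, 2])
```

The same function covers wider windows: `range(1, 31)` corresponds to lags [1-30] and `range(7, 71, 7)` to lags [7-70 by 7], given a long enough history.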

#### Exponential Weighting

Exponential weighting is a form of weighted moving average in which more recent values receive higher weight than older ones. The weight decays exponentially with age according to a hyperparameter alpha (a) in the open interval (0, 1), which is typically chosen between 0.9 and 0.99. For example:

• Exponential weight = a**(time), where time is the number of periods in the past
• If sales 1 day ago = 3.0, sales 2 days ago = 4.5, and a = 0.95:
• Exp. smooth = (3.0*(0.95**1) + 4.5*(0.95**2)) / ((0.95**1) + (0.95**2)) = 3.73 approx.
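The worked example can be checked with a short function. This is a minimal sketch of a normalized exponentially weighted average; `exp_weighted_average` is a hypothetical name, and the normalization by the sum of weights is what the example's denominator expresses.

```python
def exp_weighted_average(past_values, alpha: float) -> float:
    """Exponentially weighted average of past values.

    past_values[0] is the value 1 period ago, past_values[1] is the
    value 2 periods ago, and so on. The value i periods ago gets
    weight alpha**i, and the weights are normalized to sum to 1.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in the open interval (0, 1)")
    weights = [alpha ** (i + 1) for i in range(len(past_values))]
    weighted_sum = sum(w * v for w, v in zip(weights, past_values))
    return weighted_sum / sum(weights)


# Sales 1 day ago = 3.0, 2 days ago = 4.5, alpha = 0.95
smoothed = exp_weighted_average([3.0, 4.5], alpha=0.95)
```

Here `smoothed` is approximately 3.73, slightly below the plain average of 3.75 because the more recent (lower) value carries more weight.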

## Time Series Constraints

### Dataset Size

For a desired horizon (prediction length) $$H$$, the testing dataset should span at least that many time periods $$N_{TEST}$$ (i.e. $$N_{TEST} \geq H$$). You also want enough training time periods $$N_{TRAIN}$$ to score well on the testing dataset. At a minimum, the training dataset should contain twice as many time periods as the testing dataset (i.e. $$N_{TRAIN} \geq 2 \times N_{TEST}$$). This allows the training dataset to be split into a validation set of the same size as the testing dataset while leaving at least as many training time periods as validation time periods. Remember, though, that all datasets used for model development are historical data.
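These size guidelines can be turned into a quick sanity check before starting an experiment. The sketch below reads "at least" as a greater-than-or-equal comparison, which is an interpretation; `check_dataset_sizes` is a hypothetical helper and not a Driverless AI function.

```python
def check_dataset_sizes(n_train: int, n_test: int, horizon: int):
    """Return a list of guideline violations (empty if all guidelines hold).

    All arguments are counts of time periods, not rows.
    """
    problems = []
    if n_test < horizon:
        problems.append(
            f"test set spans {n_test} periods, fewer than the horizon of {horizon}"
        )
    if n_train < 2 * n_test:
        problems.append(
            f"training set spans {n_train} periods, fewer than 2 x {n_test} test periods"
        )
    return problems


ok = check_dataset_sizes(n_train=104, n_test=39, horizon=39)
too_short = check_dataset_sizes(n_train=60, n_test=39, horizon=39)
```

In the first call both guidelines hold, so the list is empty; in the second, 60 training periods cannot be split into a 39-period validation set plus at least 39 remaining training periods.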

## Time Series Use Case: Sales Forecasting

Below is a typical example of sales forecasting based on the Walmart competition on Kaggle. In order to frame it as a machine learning problem, we formulate the historical sales data and additional attributes as shown below:

Raw data

Data formulated for machine learning

Once you have your data prepared in tabular format (see raw data above), Driverless AI can formulate it for machine learning and sort out the rest. If this is your very first session, the Driverless AI assistant will guide you through the journey.

As in previous Driverless AI examples, you need to select the datasets for training and testing and define the target. For time series, you also need to define the time column (by choosing AUTO or selecting the date column manually). If weighted scoring is required (as in the Walmart Kaggle competition), you can select the column with specific weights for different samples.

If you prefer to use automatic handling of time groups, you can leave the setting for time groups columns as AUTO.

Expert users can define specific time groups and change other settings as shown below. The Driverless AI time series expert settings provide a matrix of gap and prediction length combinations to choose from. The options for prediction length are based on quantiles of valid training dataset splits. Also, notice that the maximum prediction length (39 weeks) is set to exactly the size of the testing dataset. Driverless AI attempts to auto-detect the gap from the training and testing datasets, and it indicates better splits (i.e. gap and prediction length combinations) through the brightness of the matrix cells.

Once the experiment is finished, you can make new predictions and download the scoring pipeline, as with any other Driverless AI experiment.