Driverless AI - Time Series Recipes with Rolling Window¶

The purpose of this notebook is to show an example of using Driverless AI to train experiments on different subsets of data. This would result in a collection of forecasted values that can be evaluated. The data used in this notebook is a public dataset: S+P 500 Stock Data. In this example, we are using the all_stocks_5yr.csv dataset.

Workflow¶

Import data into Python
Create function that slices data by index
For each slice of data:
- import data into Driverless AI
- train an experiment
- combine test predictions

Import Data¶

We will begin by importing our data using pandas.

In [1]:

import pandas as pd

stock_data = pd.read_csv("./all_stocks_5yr.csv")
stock_data.head()

Out[1]:

	date	open	high	low	close	volume	Name
0	2013-02-08	15.07	15.12	14.63	14.75	8407500	AAL
1	2013-02-11	14.89	15.01	14.26	14.46	8882000	AAL
2	2013-02-12	14.45	14.51	14.10	14.27	8126000	AAL
3	2013-02-13	14.30	14.94	14.25	14.66	10259500	AAL
4	2013-02-14	14.94	14.96	13.16	13.99	31879900	AAL

In [2]:

# Convert Date column to datetime
stock_data["date"] = pd.to_datetime(stock_data["date"], format="%Y-%m-%d")

We will add a new column which is the index. We will use this later on to do a rolling window of training and testing. We will use this index instead of the actual date because this data only occurs on weekdays (when the stock market is opened). When you use Driverless AI to perform a forecast, it will forecast the next n days. In this particular case, we never want to forecast Saturday’s and Sunday’s. We will instead treat our time column as the index of the record.

In [3]:

dates_index = pd.DataFrame(sorted(stock_data["date"].unique()), columns = ["date"])
dates_index["index"] = range(len(dates_index))
stock_data = pd.merge(stock_data, dates_index, on = "date")

stock_data.head()

Out[3]:

	date	open	high	low	close	volume	Name
0	2013-02-08	15.0700	15.1200	14.6300	14.7500	8407500	AAL
1	2013-02-08	67.7142	68.4014	66.8928	67.8542	158168416	AAPL
2	2013-02-08	78.3400	79.7200	78.0100	78.9000	1298137	AAP
3	2013-02-08	36.3700	36.4200	35.8250	36.2500	13858795	ABBV
4	2013-02-08	46.5200	46.8950	46.4600	46.8900	1232802	ABC

Create Moving Window Function¶

Now we will create a function that can split our data by time to create multiple experiments.

We will start by first logging into Driverless AI.

In [4]:

import h2oai_client
import numpy as np
import pandas as pd
# import h2o
import requests
import math
from h2oai_client import Client, ModelParameters

In [10]:

address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI

Our function will split the data into training and testing based on the training length and testing length specified by the user. It will then run an experiment in Driverless AI and download the test predictions.

In [11]:

def dai_moving_window(dataset, train_len, test_len, target, predictors, index_col, time_group_cols,
                      accuracy, time, interpretability):

    # Calculate windows for the training and testing data based on the train_len and test_len arguments
    num_dates = max(dataset[index_col])
    num_windows = (num_dates - train_len) // test_len

    windows = []
    for i in range(num_windows):
        train_start_id = i*test_len
        train_end_id = train_start_id + (train_len - 1)
        test_start_id = train_end_id + 1
        test_end_id = test_start_id + (test_len - 1)

        window = {'train_start_index': train_start_id,
                  'train_end_index': train_end_id,
                  'test_start_index': test_start_id,
                  'test_end_index': test_end_id}
        windows.append(window)


    # Split the data by the window
    forecast_predictions = pd.DataFrame([])
    for window in windows:
        train_data = dataset[(dataset[index_col] >= window.get("train_start_index")) &
                             (dataset[index_col] <= window.get("train_end_index"))]

        test_data = dataset[(dataset[index_col] >= window.get("test_start_index")) &
                            (dataset[index_col] <= window.get("test_end_index"))]

        # Get the Driverless AI forecast predictions
        window_preds = dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                                        accuracy, time, interpretability)
        forecast_predictions = forecast_predictions.append(window_preds)

    return forecast_predictions

In [12]:

def dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                     accuracy, time, interpretability):

    # Save dataset
    train_path = "./train_data.csv"
    test_path = "./test_data.csv"
    keep_cols = predictors + [target, index_col] + time_group_cols
    train_data[keep_cols].to_csv(train_path)
    test_data[keep_cols].to_csv(test_path)

    # Add datasets to Driverless AI
    train_dai = h2oai.upload_dataset_sync(train_path)
    test_dai = h2oai.upload_dataset_sync(test_path)

    # Run Driverless AI Experiment
    experiment = h2oai.start_experiment_sync(dataset_key = train_dai.key,
                                             testset_key = test_dai.key,
                                             target_col = target,
                                             cols_to_drop = [],
                                             is_classification = False,
                                             accuracy = accuracy,
                                             time = time,
                                             interpretability = interpretability,
                                             scorer = "RMSE",
                                             seed = 1234,
                                             time_col = index_col,
                                             time_groups_columns = time_group_cols,
                                             num_prediction_periods = test_data[index_col].nunique(),
                                             num_gap_periods = 0)

    # Download the predictions on the test dataset
    test_predictions_path = h2oai.download(experiment.test_predictions_path, "./")
    test_predictions = pd.read_csv(test_predictions_path)
    test_predictions.columns = ["Prediction"]

    # Add predictions to original test data
    keep_cols = [target, index_col] + time_group_cols
    test_predictions = pd.concat([test_data[keep_cols].reset_index(drop=True), test_predictions], axis = 1)

    return test_predictions

In [13]:

predictors = ["Name", "index"]
target = "close"
index_col = "index"
time_group_cols = ["Name"]

In [ ]:

# We will filter the dataset to the first 1030 dates for demo purposes
filtered_stock_data = stock_data[stock_data["index"] <= 1029]
forecast_predictions = dai_moving_window(filtered_stock_data, 1000, 3, target, predictors, index_col, time_group_cols,
                                         accuracy = 1, time = 1, interpretability = 1)

In [25]:

forecast_predictions.head()

Out[25]:

	close	index	Name	Prediction
0	44.90	1000	AAL	48.050527
1	121.63	1000	AAPL	119.485352
2	164.63	1000	AAP	167.960700
3	60.43	1000	ABBV	60.784213
4	83.62	1000	ABC	86.939174

In [26]:

# Calculate some error metric
mae = (forecast_predictions[target] - forecast_predictions["Prediction"]).abs().mean()
print("Mean Absolute Error: ${:,.2f}".format(mae))

Mean Absolute Error: $6.79

In [ ]: