Driverless AI - Time Series Recipes with Rolling Window

This notebook shows how to use Driverless AI to train experiments on different subsets of data, producing a collection of forecasted values that can then be evaluated. The data is a public dataset, S&P 500 Stock Data. This example uses the all_stocks_5yr.csv file.

Here is the Python Client Documentation.

Workflow

  1. Import data into Python

  2. Create function that slices data by index

  3. For each slice of data:

    • import data into Driverless AI

    • train an experiment

    • combine test predictions

Import Data

We will begin by importing our data using pandas.

[18]:
import pandas as pd
import wget

file = wget.download('https://h2o-public-test-data.s3.amazonaws.com/dai_release_testing/datasets/s&p_all_stocks_5yr.csv')

stock_data = pd.read_csv(file)
stock_data.head()
[18]:
date open high low close volume Name
0 2013-02-08 15.07 15.12 14.63 14.75 8407500 AAL
1 2013-02-11 14.89 15.01 14.26 14.46 8882000 AAL
2 2013-02-12 14.45 14.51 14.10 14.27 8126000 AAL
3 2013-02-13 14.30 14.94 14.25 14.66 10259500 AAL
4 2013-02-14 14.94 14.96 13.16 13.99 31879900 AAL
[19]:
# Convert Date column to datetime
stock_data["date"] = pd.to_datetime(stock_data["date"], format="%Y-%m-%d")

We will add a new column that holds each record's index: the position of its date among the sorted distinct dates. We will use this index later for the rolling window of training and testing, instead of the actual date, because this data only contains weekdays (when the stock market is open). When you use Driverless AI to perform a forecast, it forecasts the next n days. In this case we never want to forecast Saturdays and Sundays, so we treat the index of the record as our time column.

[20]:
dates_index = pd.DataFrame(sorted(stock_data["date"].unique()), columns = ["date"])
dates_index["index"] = range(len(dates_index))
stock_data = pd.merge(stock_data, dates_index, on = "date")

stock_data.head()
[20]:
date open high low close volume Name index
0 2013-02-08 15.0700 15.1200 14.6300 14.7500 8407500 AAL 0
1 2013-02-08 67.7142 68.4014 66.8928 67.8542 158168416 AAPL 0
2 2013-02-08 78.3400 79.7200 78.0100 78.9000 1298137 AAP 0
3 2013-02-08 36.3700 36.4200 35.8250 36.2500 13858795 ABBV 0
4 2013-02-08 46.5200 46.8950 46.4600 46.8900 1232802 ABC 0
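
The effect of the index can be seen on a small example. The sketch below uses a made-up three-day calendar (a Thursday, a Friday, and the following Monday) for two hypothetical tickers, and builds the index with pd.factorize rather than the merge used above. Both the toy data and the helper name add_trading_day_index are assumptions for illustration.

```python
import pandas as pd

# Toy trading calendar: Thursday, Friday, then Monday (the weekend is absent).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-02-07", "2013-02-08", "2013-02-11",
                            "2013-02-08", "2013-02-11", "2013-02-07"]),
    "Name": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],  # hypothetical tickers
})

def add_trading_day_index(df, date_col="date", index_name="index"):
    """Number each distinct date by its position among the sorted distinct dates."""
    out = df.copy()
    # sort=True makes the integer codes follow chronological order.
    codes, _ = pd.factorize(out[date_col], sort=True)
    out[index_name] = codes
    return out

toy_indexed = add_trading_day_index(toy)

# Friday to Monday is three calendar days apart but only one index step.
calendar_gap = (toy["date"].max() - toy["date"].min()).days
index_gap = toy_indexed["index"].max() - toy_indexed["index"].min()
print(toy_indexed)
print(calendar_gap, index_gap)
```

Every ticker shares the same index value on a given date, and consecutive trading days always differ by exactly one, so a forecast horizon of n index steps means n trading days.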

Create Moving Window Function

Now we will create a function that can split our data by time to create multiple experiments.

We will start by logging in to Driverless AI.

[21]:
import driverlessai
import pandas as pd
[22]:
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
dai = driverlessai.Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI

Our function will split the data into training and testing based on the training length and testing length specified by the user. It will then run an experiment in Driverless AI and download the test predictions.

[23]:
def dai_moving_window(dataset, train_len, test_len, target, predictors, index_col, time_group_cols,
                      accuracy, time, interpretability):

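    # Calculate windows for the training and testing data based on the train_len and test_len arguments.
    # Note: index_col is zero-based, so max() is the last index rather than the number of dates,
    # and the window count can come out one short of what the data could support.
    num_dates = dataset[index_col].max()
    num_windows = (num_dates - train_len) // test_len
    print(num_windows)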

    windows = []
    for i in range(num_windows):
        train_start_id = i*test_len
        train_end_id = train_start_id + (train_len - 1)
        test_start_id = train_end_id + 1
        test_end_id = test_start_id + (test_len - 1)

        window = {'train_start_index': train_start_id,
                  'train_end_index': train_end_id,
                  'test_start_index': test_start_id,
                  'test_end_index': test_end_id}
        windows.append(window)


    # Split the data by the window and collect the predictions for each window
    window_predictions = []
    for window in windows:
        train_data = dataset[(dataset[index_col] >= window["train_start_index"]) &
                             (dataset[index_col] <= window["train_end_index"])]

        test_data = dataset[(dataset[index_col] >= window["test_start_index"]) &
                            (dataset[index_col] <= window["test_end_index"])]

        # Get the Driverless AI forecast predictions
        window_preds = dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                                        accuracy, time, interpretability)
        window_predictions.append(window_preds)

    # DataFrame.append was removed in pandas 2.0, so combine the windows with pd.concat
    forecast_predictions = pd.concat(window_predictions) if window_predictions else pd.DataFrame()

    return forecast_predictions
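
The window arithmetic can be checked on its own, without a Driverless AI connection. The sketch below repeats the same index calculations in a standalone helper. The name make_windows and the tuple layout are assumptions made for illustration and are not part of the notebook's function.

```python
def make_windows(max_index, train_len, test_len):
    """Return (train_start, train_end, test_start, test_end) tuples, all inclusive."""
    num_windows = (max_index - train_len) // test_len
    windows = []
    for i in range(num_windows):
        # Each window starts test_len steps after the previous one.
        train_start = i * test_len
        train_end = train_start + train_len - 1
        test_start = train_end + 1
        test_end = test_start + test_len - 1
        windows.append((train_start, train_end, test_start, test_end))
    return windows

# Same settings as the demo run: indices 0..1029, 1000 training days, 3 test days.
windows = make_windows(max_index=1029, train_len=1000, test_len=3)
print(len(windows))   # 9
print(windows[0])     # (0, 999, 1000, 1002)
print(windows[-1])    # (24, 1023, 1024, 1026)
```

Each window trains on 1000 consecutive indices and tests on the next 3, and consecutive windows are shifted by the test length so that no test index is forecast twice. With these settings the test windows cover indices 1000 through 1026, and indices 1027 through 1029 are not forecast.
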
[24]:
def dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                     accuracy, time, interpretability):

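    # Save dataset
    train_path = "./train_data.csv"
    test_path = "./test_data.csv"
    # Drop duplicate column names (e.g. "Name" is both a predictor and a time group column)
    keep_cols = list(dict.fromkeys(predictors + [target, index_col] + time_group_cols))
    # index=False keeps the pandas row labels from being written as an extra, unnamed column
    train_data[keep_cols].to_csv(train_path, index=False)
    test_data[keep_cols].to_csv(test_path, index=False)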

    # Add datasets to Driverless AI
    train_dai = dai.datasets.create(train_path, force=True)
    test_dai = dai.datasets.create(test_path, force=True)
    ids = [c for c in train_dai.columns]

    # Run Driverless AI Experiment
    model = dai.experiments.create(train_dataset=train_dai,
                                   target_column=target, task='regression',
                                   accuracy=accuracy,
                                   time=time,
                                   interpretability=interpretability,
                                   name='stock_timeseries_beta', scorer="RMSE",
                                   time_column=index_col,
                                   time_groups_columns=time_group_cols,
                                   num_prediction_periods=test_data[index_col].nunique(),
                                   num_gap_periods=0,
                                   force=True)

    # Download the predictions on the test dataset
    test_predictions_path = model.predict(dataset = test_dai).download(dst_dir = '.', overwrite = True)
    test_predictions = pd.read_csv(test_predictions_path)
    test_predictions.columns = ["Prediction"]

    # Add predictions to original test data
    keep_cols = [target, index_col] + time_group_cols
    test_predictions = pd.concat([test_data[keep_cols].reset_index(drop=True), test_predictions], axis = 1)


    return test_predictions
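
The final step of dai_get_forecast joins the predictions to the test rows by position, which is why the test data's index is reset first. The sketch below shows what happens with and without the reset. The row labels and rounded values are made up for illustration, and the approach assumes the downloaded predictions are in the same row order as the uploaded test file.

```python
import pandas as pd

# A filtered test frame keeps the row labels of the full dataset.
test_data = pd.DataFrame(
    {"close": [44.90, 121.63], "index": [1000, 1000], "Name": ["AAL", "AAPL"]},
    index=[503500, 503501],  # hypothetical row labels left over from filtering
)

# Predictions read from a CSV get fresh labels 0, 1, ...
preds = pd.DataFrame({"Prediction": [48.13, 122.26]})

# Without a reset, concat aligns on labels and no row finds its partner.
misaligned = pd.concat([test_data, preds], axis=1)

# With a reset, both frames are labelled 0..n-1 and pair up row by row.
aligned = pd.concat([test_data.reset_index(drop=True), preds], axis=1)

print(misaligned)
print(aligned)
```

The misaligned result has four rows, each with missing values on one side, while the aligned result has one complete row per test record.
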
[25]:
predictors = ["Name", "index"]
target = "close"
index_col = "index"
time_group_cols = ["Name"]
[26]:
# We will filter the dataset to the first 1030 dates for demo purposes
filtered_stock_data = stock_data[stock_data["index"] <= 1029]
forecast_predictions = dai_moving_window(filtered_stock_data, 1000, 3, target, predictors, index_col, time_group_cols, accuracy = 1, time = 1, interpretability = 1)
9
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=82c893e2-e305-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './82c893e2-e305-11ea-9088-0242ac110002_preds_82e2cb12.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=c807ea1a-e306-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './c807ea1a-e306-11ea-9088-0242ac110002_preds_823bdb9c.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=11cb3dc2-e308-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './11cb3dc2-e308-11ea-9088-0242ac110002_preds_f3ce319b.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=7229d0c4-e309-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './7229d0c4-e309-11ea-9088-0242ac110002_preds_3ee4f4ea.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=d27db76e-e30a-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './d27db76e-e30a-11ea-9088-0242ac110002_preds_dfc77e64.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=33418a16-e30c-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './33418a16-e30c-11ea-9088-0242ac110002_preds_12878b69.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=c18d9548-e30d-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './c18d9548-e30d-11ea-9088-0242ac110002_preds_8b801977.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=44f0b31a-e30f-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './44f0b31a-e30f-11ea-9088-0242ac110002_preds_7dc3726c.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=b34c1736-e310-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './b34c1736-e310-11ea-9088-0242ac110002_preds_139e07a1.csv'
[27]:
forecast_predictions.head()
[27]:
close index Name Prediction
0 44.90 1000 AAL 48.131153
1 121.63 1000 AAPL 122.264595
2 164.63 1000 AAP 167.594040
3 60.43 1000 ABBV 61.546516
4 83.62 1000 ABC 85.815880
[28]:
# Calculate the mean absolute error (MAE) across all forecast windows
mae = (forecast_predictions[target] - forecast_predictions["Prediction"]).abs().mean()
print("Mean Absolute Error: ${:,.2f}".format(mae))
Mean Absolute Error: $2.43
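
A single overall number can hide differences between stocks or between forecast days. As one possible next step, the sketch below breaks the same metric down by group. It runs on a small made-up frame with the same columns as forecast_predictions, and the helper name mae_by is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for forecast_predictions (values are invented).
toy_predictions = pd.DataFrame({
    "close":      [10.0, 20.0, 11.0, 21.0],
    "index":      [1000, 1000, 1001, 1001],
    "Name":       ["AAA", "BBB", "AAA", "BBB"],
    "Prediction": [10.5, 19.0, 11.0, 23.0],
})

def mae_by(df, group_col, target="close", pred_col="Prediction"):
    """Mean absolute error of pred_col against target within each group."""
    abs_err = (df[target] - df[pred_col]).abs()
    return abs_err.groupby(df[group_col]).mean()

mae_per_stock = mae_by(toy_predictions, "Name")
mae_per_day = mae_by(toy_predictions, "index")
print(mae_per_stock)
print(mae_per_day)
```

Calling mae_by(forecast_predictions, "Name") or mae_by(forecast_predictions, "index") on the real results would apply the same breakdown per stock or per forecast day.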