Driverless AI - Time Series Recipes with Rolling Window
This notebook shows how to use Driverless AI to train experiments on successive subsets of a dataset. Each experiment forecasts the period that follows its training window, and the forecasts are combined into a single collection that can be evaluated. The data is a public dataset, S&P 500 Stock Data; this example uses the all_stocks_5yr.csv file.
Workflow
- Import data into Python
- Create function that slices data by index
- For each slice of data:
  - import data into Driverless AI
  - train an experiment
  - combine test predictions
Import Data
We will begin by importing our data using pandas.
In [1]:
import pandas as pd
stock_data = pd.read_csv("./all_stocks_5yr.csv")
stock_data.head()
Out[1]:
| | date | open | high | low | close | volume | Name |
---|---|---|---|---|---|---|---|
0 | 2013-02-08 | 15.07 | 15.12 | 14.63 | 14.75 | 8407500 | AAL |
1 | 2013-02-11 | 14.89 | 15.01 | 14.26 | 14.46 | 8882000 | AAL |
2 | 2013-02-12 | 14.45 | 14.51 | 14.10 | 14.27 | 8126000 | AAL |
3 | 2013-02-13 | 14.30 | 14.94 | 14.25 | 14.66 | 10259500 | AAL |
4 | 2013-02-14 | 14.94 | 14.96 | 13.16 | 13.99 | 31879900 | AAL |
In [2]:
# Convert Date column to datetime
stock_data["date"] = pd.to_datetime(stock_data["date"], format="%Y-%m-%d")
We will add a new column that holds an integer index for each date. We will use it later to build a rolling window of training and testing data. We use this index instead of the actual date because the data only contains weekdays (when the stock market is open). When Driverless AI produces a forecast, it forecasts the next n days, and in this case we never want to forecast Saturdays and Sundays. We will therefore use the index of the record as our time column.
In [3]:
dates_index = pd.DataFrame(sorted(stock_data["date"].unique()), columns = ["date"])
dates_index["index"] = range(len(dates_index))
stock_data = pd.merge(stock_data, dates_index, on = "date")
stock_data.head()
Out[3]:
| | date | open | high | low | close | volume | Name | index |
---|---|---|---|---|---|---|---|---|
0 | 2013-02-08 | 15.0700 | 15.1200 | 14.6300 | 14.7500 | 8407500 | AAL | 0 |
1 | 2013-02-08 | 67.7142 | 68.4014 | 66.8928 | 67.8542 | 158168416 | AAPL | 0 |
2 | 2013-02-08 | 78.3400 | 79.7200 | 78.0100 | 78.9000 | 1298137 | AAP | 0 |
3 | 2013-02-08 | 36.3700 | 36.4200 | 35.8250 | 36.2500 | 13858795 | ABBV | 0 |
4 | 2013-02-08 | 46.5200 | 46.8950 | 46.4600 | 46.8900 | 1232802 | ABC | 0 |
Create Moving Window Function
Now we will create a function that can split our data by time to create multiple experiments.
We will start by logging in to Driverless AI.
In [4]:
import h2oai_client
import numpy as np
import pandas as pd
# import h2o
import requests
import math
from h2oai_client import Client, ModelParameters
In [10]:
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI
Our function will split the data into training and testing based on the training length and testing length specified by the user. It will then run an experiment in Driverless AI and download the test predictions.
In [11]:
def dai_moving_window(dataset, train_len, test_len, target, predictors, index_col, time_group_cols,
                      accuracy, time, interpretability):
    # Calculate windows for the training and testing data based on the train_len and test_len arguments.
    # The index is assumed to start at 0, and all window bounds are inclusive.
    # Note: max() returns the largest index, which is one less than the number of dates,
    # so a final window that would end exactly on the last date is skipped.
    num_dates = max(dataset[index_col])
    num_windows = (num_dates - train_len) // test_len

    windows = []
    for i in range(num_windows):
        train_start_id = i * test_len
        train_end_id = train_start_id + (train_len - 1)
        test_start_id = train_end_id + 1
        test_end_id = test_start_id + (test_len - 1)

        window = {'train_start_index': train_start_id,
                  'train_end_index': train_end_id,
                  'test_start_index': test_start_id,
                  'test_end_index': test_end_id}
        windows.append(window)

    # Split the data by the window
    window_predictions = []
    for window in windows:
        train_data = dataset[(dataset[index_col] >= window["train_start_index"]) &
                             (dataset[index_col] <= window["train_end_index"])]
        test_data = dataset[(dataset[index_col] >= window["test_start_index"]) &
                            (dataset[index_col] <= window["test_end_index"])]

        # Get the Driverless AI forecast predictions
        window_preds = dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                                        accuracy, time, interpretability)
        window_predictions.append(window_preds)

    # DataFrame.append was removed in pandas 2.0, so the windows are combined with pd.concat
    if not window_predictions:
        return pd.DataFrame([])
    return pd.concat(window_predictions)
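The window arithmetic in the function above can be checked on its own, without Driverless AI. The sketch below uses a hypothetical helper, `rolling_windows`, that repeats the same calculation and returns plain tuples; the inputs are the values used later in this notebook (largest index 1029, 1000 training dates, 3 testing dates).

```python
def rolling_windows(max_index, train_len, test_len):
    """Return (train_start, train_end, test_start, test_end) tuples, all bounds inclusive."""
    num_windows = (max_index - train_len) // test_len
    windows = []
    for i in range(num_windows):
        train_start = i * test_len               # each window slides forward by test_len
        train_end = train_start + train_len - 1
        test_start = train_end + 1               # testing begins right after training ends
        test_end = test_start + test_len - 1
        windows.append((train_start, train_end, test_start, test_end))
    return windows


demo_windows = rolling_windows(max_index=1029, train_len=1000, test_len=3)
print(len(demo_windows))  # 9
print(demo_windows[0])    # (0, 999, 1000, 1002)
print(demo_windows[-1])   # (24, 1023, 1024, 1026)
```

Since each window advances by `test_len`, the testing ranges never overlap, so every date in the combined predictions is forecast exactly once.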
In [12]:
def dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                     accuracy, time, interpretability):
    # Uses the h2oai client created in the login cell above

    # Save dataset
    train_path = "./train_data.csv"
    test_path = "./test_data.csv"
    # Drop repeated names (e.g. when index_col or a group column is also a predictor), keeping order
    keep_cols = list(dict.fromkeys(predictors + [target, index_col] + time_group_cols))
    # index=False keeps the pandas row labels out of the uploaded files
    train_data[keep_cols].to_csv(train_path, index=False)
    test_data[keep_cols].to_csv(test_path, index=False)

    # Add datasets to Driverless AI
    train_dai = h2oai.upload_dataset_sync(train_path)
    test_dai = h2oai.upload_dataset_sync(test_path)

    # Run Driverless AI Experiment
    experiment = h2oai.start_experiment_sync(dataset_key = train_dai.key,
                                             testset_key = test_dai.key,
                                             target_col = target,
                                             cols_to_drop = [],
                                             is_classification = False,
                                             accuracy = accuracy,
                                             time = time,
                                             interpretability = interpretability,
                                             scorer = "RMSE",
                                             seed = 1234,
                                             time_col = index_col,
                                             time_groups_columns = time_group_cols,
                                             num_prediction_periods = test_data[index_col].nunique(),
                                             num_gap_periods = 0)

    # Download the predictions on the test dataset
    test_predictions_path = h2oai.download(experiment.test_predictions_path, "./")
    test_predictions = pd.read_csv(test_predictions_path)
    test_predictions.columns = ["Prediction"]

    # Add predictions to original test data (rows are paired by position)
    keep_cols = [target, index_col] + time_group_cols
    test_predictions = pd.concat([test_data[keep_cols].reset_index(drop=True), test_predictions], axis = 1)

    return test_predictions
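The last step of the function pairs each test row with its prediction by position, which is why the test data's row labels are reset before `pd.concat`. The sketch below illustrates this with hypothetical row labels and rounded prediction values; it assumes, as the function does, that predictions come back in the same row order as the uploaded test file.

```python
import pandas as pd

# Hypothetical test rows whose labels are not 0..n-1, as happens after filtering a larger frame
test_rows = pd.DataFrame(
    {"close": [44.90, 121.63], "index": [1000, 1000], "Name": ["AAL", "AAPL"]},
    index=[500, 501],
)
# Hypothetical predictions as read back from a CSV file: labels start again at 0
preds = pd.DataFrame({"Prediction": [48.05, 119.49]})

# Without resetting, concat aligns on row labels, so no test row meets its prediction
misaligned = pd.concat([test_rows, preds], axis=1)

# With reset_index(drop=True), both frames are labeled 0..n-1 and rows pair up by position
aligned = pd.concat([test_rows.reset_index(drop=True), preds], axis=1)

print(misaligned.shape)  # (4, 4)
print(aligned.shape)     # (2, 4)
```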
In [13]:
predictors = ["Name", "index"]
target = "close"
index_col = "index"
time_group_cols = ["Name"]
In [ ]:
# We will filter the dataset to the first 1030 dates for demo purposes
filtered_stock_data = stock_data[stock_data["index"] <= 1029]
forecast_predictions = dai_moving_window(filtered_stock_data, 1000, 3, target, predictors, index_col, time_group_cols,
accuracy = 1, time = 1, interpretability = 1)
In [25]:
forecast_predictions.head()
Out[25]:
| | close | index | Name | Prediction |
---|---|---|---|---|
0 | 44.90 | 1000 | AAL | 48.050527 |
1 | 121.63 | 1000 | AAPL | 119.485352 |
2 | 164.63 | 1000 | AAP | 167.960700 |
3 | 60.43 | 1000 | ABBV | 60.784213 |
4 | 83.62 | 1000 | ABC | 86.939174 |
In [26]:
# Calculate the mean absolute error (MAE) across all forecast windows
mae = (forecast_predictions[target] - forecast_predictions["Prediction"]).abs().mean()
print("Mean Absolute Error: ${:,.2f}".format(mae))
Mean Absolute Error: $6.79
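A single overall number hides where the error comes from. As a sketch, the same absolute error can be grouped by stock or by time index. The frame below contains only the five rows shown in the `forecast_predictions.head()` output above, so the resulting figures describe those rows and not the full set of predictions.

```python
import pandas as pd

# The five rows shown in the forecast_predictions.head() output above
sample = pd.DataFrame({
    "close": [44.90, 121.63, 164.63, 60.43, 83.62],
    "index": [1000] * 5,
    "Name": ["AAL", "AAPL", "AAP", "ABBV", "ABC"],
    "Prediction": [48.050527, 119.485352, 167.960700, 60.784213, 86.939174],
})

sample["abs_error"] = (sample["close"] - sample["Prediction"]).abs()

# MAE per stock, largest first, and MAE per forecasted time index
mae_by_name = sample.groupby("Name")["abs_error"].mean().sort_values(ascending=False)
mae_by_index = sample.groupby("index")["abs_error"].mean()
print(mae_by_name)
print(mae_by_index)
```

Applied to the full `forecast_predictions` frame, the same two `groupby` calls would show which stocks and which forecast dates contribute most to the overall error.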