Driverless AI - Time Series Recipes with Rolling Window¶
The purpose of this notebook is to show an example of using Driverless AI to train experiments on different subsets of data, producing a collection of forecasts over rolling windows that can be evaluated together. The data used in this notebook is a public dataset: S&P 500 Stock Data. In this example, we use the all_stocks_5yr.csv dataset.
Here is the Python Client Documentation.
Workflow¶
- Import data into Python
- Create a function that slices the data by index
- For each slice of data:
  - import the data into Driverless AI
  - train an experiment
- Combine the test predictions
Import Data¶
We will begin by importing our data using pandas.
[18]:
import pandas as pd
import wget
file = wget.download('https://h2o-public-test-data.s3.amazonaws.com/dai_release_testing/datasets/s&p_all_stocks_5yr.csv')
stock_data = pd.read_csv(file)
stock_data.head()
[18]:
|   | date | open | high | low | close | volume | Name |
|---|------|------|------|-----|-------|--------|------|
| 0 | 2013-02-08 | 15.07 | 15.12 | 14.63 | 14.75 | 8407500 | AAL |
| 1 | 2013-02-11 | 14.89 | 15.01 | 14.26 | 14.46 | 8882000 | AAL |
| 2 | 2013-02-12 | 14.45 | 14.51 | 14.10 | 14.27 | 8126000 | AAL |
| 3 | 2013-02-13 | 14.30 | 14.94 | 14.25 | 14.66 | 10259500 | AAL |
| 4 | 2013-02-14 | 14.94 | 14.96 | 13.16 | 13.99 | 31879900 | AAL |
[19]:
# Convert the 'date' column to datetime
stock_data["date"] = pd.to_datetime(stock_data["date"], format="%Y-%m-%d")
We will add a new column that is the index of each date. We will use this later on to do a rolling window of training and testing. We use this index instead of the actual date because the data only occurs on weekdays (when the stock market is open). When you use Driverless AI to perform a forecast, it forecasts the next n periods; in this particular case, we never want to forecast Saturdays or Sundays, so we treat the index of the record as our time column instead.
[20]:
dates_index = pd.DataFrame(sorted(stock_data["date"].unique()), columns = ["date"])
dates_index["index"] = range(len(dates_index))
stock_data = pd.merge(stock_data, dates_index, on = "date")
stock_data.head()
[20]:
|   | date | open | high | low | close | volume | Name | index |
|---|------|------|------|-----|-------|--------|------|-------|
| 0 | 2013-02-08 | 15.0700 | 15.1200 | 14.6300 | 14.7500 | 8407500 | AAL | 0 |
| 1 | 2013-02-08 | 67.7142 | 68.4014 | 66.8928 | 67.8542 | 158168416 | AAPL | 0 |
| 2 | 2013-02-08 | 78.3400 | 79.7200 | 78.0100 | 78.9000 | 1298137 | AAP | 0 |
| 3 | 2013-02-08 | 36.3700 | 36.4200 | 35.8250 | 36.2500 | 13858795 | ABBV | 0 |
| 4 | 2013-02-08 | 46.5200 | 46.8950 | 46.4600 | 46.8900 | 1232802 | ABC | 0 |
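As a quick sanity check (an illustrative addition, not part of the original workflow), we can confirm that consecutive index values sometimes span more than one calendar day, which is exactly the weekend and holiday gap the index is meant to hide:

[ ]:
# Count the calendar-day gap between consecutive trading dates;
# expect mostly 1-day gaps plus 3-day gaps around weekends.
print(dates_index["date"].diff().dt.days.value_counts())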
Create Moving Window Function¶
Now we will create a function that splits our data by time so we can run multiple experiments.
We will start by logging into Driverless AI.
[21]:
import driverlessai
import numpy as np
import pandas as pd
import requests
import math
[22]:
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
dai = driverlessai.Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI
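Optionally, you can confirm the connection before proceeding. This quick check is an illustrative addition and assumes the driverlessai client exposes server metadata via dai.server (as recent client versions do):

[ ]:
# Optional sanity check: confirm which server the client connected to
print(dai.server.address, dai.server.version)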
Our function will split the data into training and testing based on the training length and testing length specified by the user. It will then run an experiment in Driverless AI and download the test predictions.
[23]:
def dai_moving_window(dataset, train_len, test_len, target, predictors, index_col, time_group_cols,
accuracy, time, interpretability):
# Calculate windows for the training and testing data based on the train_len and test_len arguments
    # Highest (0-based) index in the data; indices run from 0 to max_index
    max_index = max(dataset[index_col])
    num_windows = (max_index - train_len) // test_len
print(num_windows)
windows = []
for i in range(num_windows):
train_start_id = i*test_len
train_end_id = train_start_id + (train_len - 1)
test_start_id = train_end_id + 1
test_end_id = test_start_id + (test_len - 1)
window = {'train_start_index': train_start_id,
'train_end_index': train_end_id,
'test_start_index': test_start_id,
'test_end_index': test_end_id}
windows.append(window)
# Split the data by the window
forecast_predictions = pd.DataFrame([])
for window in windows:
train_data = dataset[(dataset[index_col] >= window.get("train_start_index")) &
(dataset[index_col] <= window.get("train_end_index"))]
test_data = dataset[(dataset[index_col] >= window.get("test_start_index")) &
(dataset[index_col] <= window.get("test_end_index"))]
# Get the Driverless AI forecast predictions
window_preds = dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
accuracy, time, interpretability)
        # DataFrame.append was removed in pandas 2.x; use pd.concat instead
        forecast_predictions = pd.concat([forecast_predictions, window_preds])
return forecast_predictions
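To make the window arithmetic concrete, here is a small standalone sketch (illustrative only, using the demo settings from the cells below) that reproduces the index ranges each window covers:

[ ]:
# Reproduce the window arithmetic for train_len=1000, test_len=3,
# with indices 0 through 1029 (1030 trading days)
train_len, test_len, max_index = 1000, 3, 1029
num_windows = (max_index - train_len) // test_len  # (1029 - 1000) // 3 = 9
for i in range(num_windows):
    train_start = i * test_len
    train_end = train_start + train_len - 1
    print(f"window {i}: train [{train_start}, {train_end}], "
          f"test [{train_end + 1}, {train_end + test_len}]")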
[24]:
def dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
accuracy, time, interpretability):
# Save dataset
train_path = "./train_data.csv"
test_path = "./test_data.csv"
keep_cols = predictors + [target, index_col] + time_group_cols
    # index=False so the pandas row index is not uploaded as an extra column
    train_data[keep_cols].to_csv(train_path, index=False)
    test_data[keep_cols].to_csv(test_path, index=False)
# Add datasets to Driverless AI
train_dai = dai.datasets.create(train_path, force=True)
test_dai = dai.datasets.create(test_path, force=True)
# Run Driverless AI Experiment
model = dai.experiments.create(train_dataset=train_dai,
target_column=target, task='regression',
accuracy = accuracy,
time = time,
interpretability = interpretability,
name='stock_timeseries_beta', scorer = "RMSE",
time_column = index_col,
time_groups_columns = time_group_cols,
num_prediction_periods = test_data[index_col].nunique(),
num_gap_periods = 0,
force=True)
# Download the predictions on the test dataset
test_predictions_path = model.predict(dataset = test_dai).download(dst_dir = '.', overwrite = True)
test_predictions = pd.read_csv(test_predictions_path)
test_predictions.columns = ["Prediction"]
# Add predictions to original test data
keep_cols = [target, index_col] + time_group_cols
test_predictions = pd.concat([test_data[keep_cols].reset_index(drop=True), test_predictions], axis = 1)
return test_predictions
[25]:
predictors = ["Name", "index"]
target = "close"
index_col = "index"
time_group_cols = ["Name"]
[26]:
# We will filter the dataset to the first 1030 dates for demo purposes
filtered_stock_data = stock_data[stock_data["index"] <= 1029]
forecast_predictions = dai_moving_window(filtered_stock_data, 1000, 3, target, predictors, index_col, time_group_cols, accuracy = 1, time = 1, interpretability = 1)
9
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=82c893e2-e305-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './82c893e2-e305-11ea-9088-0242ac110002_preds_82e2cb12.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=c807ea1a-e306-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './c807ea1a-e306-11ea-9088-0242ac110002_preds_823bdb9c.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=11cb3dc2-e308-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './11cb3dc2-e308-11ea-9088-0242ac110002_preds_f3ce319b.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=7229d0c4-e309-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './7229d0c4-e309-11ea-9088-0242ac110002_preds_3ee4f4ea.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=d27db76e-e30a-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './d27db76e-e30a-11ea-9088-0242ac110002_preds_dfc77e64.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=33418a16-e30c-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './33418a16-e30c-11ea-9088-0242ac110002_preds_12878b69.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=c18d9548-e30d-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './c18d9548-e30d-11ea-9088-0242ac110002_preds_8b801977.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=44f0b31a-e30f-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './44f0b31a-e30f-11ea-9088-0242ac110002_preds_7dc3726c.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=b34c1736-e310-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './b34c1736-e310-11ea-9088-0242ac110002_preds_139e07a1.csv'
[27]:
forecast_predictions.head()
[27]:
|   | close | index | Name | Prediction |
|---|-------|-------|------|------------|
| 0 | 44.90 | 1000 | AAL | 48.131153 |
| 1 | 121.63 | 1000 | AAPL | 122.264595 |
| 2 | 164.63 | 1000 | AAP | 167.594040 |
| 3 | 60.43 | 1000 | ABBV | 61.546516 |
| 4 | 83.62 | 1000 | ABC | 85.815880 |
[28]:
# Calculate the mean absolute error over all forecasted windows
mae = (forecast_predictions[target] - forecast_predictions["Prediction"]).abs().mean()
print("Mean Absolute Error: ${:,.2f}".format(mae))
Mean Absolute Error: $2.43
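Beyond a single aggregate number, it can be useful to break the error down, for example per stock (an illustrative extension, not part of the original notebook):

[ ]:
# Mean absolute error per stock, worst first
per_stock_mae = (forecast_predictions
                 .assign(abs_err=lambda df: (df[target] - df["Prediction"]).abs())
                 .groupby("Name")["abs_err"].mean()
                 .sort_values(ascending=False))
print(per_stock_mae.head())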