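Driverless AI - Time Series Recipe with Rolling Window
The purpose of this notebook is to show an example of using Driverless AI to train experiments on different subsets of data. This produces a collection of forecasted values that can then be evaluated. The data used in this notebook is a public data set: S+P 500 Stock Data. In this example, we use the all_stocks_5yr.csv data set.
See also the Python Client Documentation.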
Workflow
1. Import data into Python
2. Create a function that splits the data by index
3. For each fold of data:
   - Import the data into Driverless AI
   - Train an experiment
   - Combine the test predictions
Import Data
We start by importing the data with pandas.
[18]:
import pandas as pd
import wget
file = wget.download('https://h2o-public-test-data.s3.amazonaws.com/dai_release_testing/datasets/s&p_all_stocks_5yr.csv')
stock_data = pd.read_csv(file)
stock_data.head()
[18]:
| | date | open | high | low | close | volume | Name |
|---|---|---|---|---|---|---|---|
| 0 | 2013-02-08 | 15.07 | 15.12 | 14.63 | 14.75 | 8407500 | AAL |
| 1 | 2013-02-11 | 14.89 | 15.01 | 14.26 | 14.46 | 8882000 | AAL |
| 2 | 2013-02-12 | 14.45 | 14.51 | 14.10 | 14.27 | 8126000 | AAL |
| 3 | 2013-02-13 | 14.30 | 14.94 | 14.25 | 14.66 | 10259500 | AAL |
| 4 | 2013-02-14 | 14.94 | 14.96 | 13.16 | 13.99 | 31879900 | AAL |
[19]:
# Convert Date column to datetime
stock_data["date"] = pd.to_datetime(stock_data["date"], format="%Y-%m-%d")
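Next we add a new column that holds an index. We will use it later to build the rolling windows of training and test data. We use this index instead of the actual date because the data only exists for weekdays, when the stock market is open. When Driverless AI forecasts, it predicts the next n days, and in this case we never want to forecast Saturdays or Sundays. Instead, we treat the index of each record as the time column.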
[20]:
dates_index = pd.DataFrame(sorted(stock_data["date"].unique()), columns = ["date"])
dates_index["index"] = range(len(dates_index))
stock_data = pd.merge(stock_data, dates_index, on = "date")
stock_data.head()
[20]:
| | date | open | high | low | close | volume | Name | index |
|---|---|---|---|---|---|---|---|---|
| 0 | 2013-02-08 | 15.0700 | 15.1200 | 14.6300 | 14.7500 | 8407500 | AAL | 0 |
| 1 | 2013-02-08 | 67.7142 | 68.4014 | 66.8928 | 67.8542 | 158168416 | AAPL | 0 |
| 2 | 2013-02-08 | 78.3400 | 79.7200 | 78.0100 | 78.9000 | 1298137 | AAP | 0 |
| 3 | 2013-02-08 | 36.3700 | 36.4200 | 35.8250 | 36.2500 | 13858795 | ABBV | 0 |
| 4 | 2013-02-08 | 46.5200 | 46.8950 | 46.4600 | 46.8900 | 1232802 | ABC | 0 |
Create Moving Window Function
Now we create a function that splits the data by time so that we can run multiple experiments.
We start by logging in to Driverless AI.
[21]:
import driverlessai
import pandas as pd
[22]:
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
dai = driverlessai.Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI
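Hard-coded credentials are easy to leak when a notebook is shared. One possible alternative, sketched below, reads the connection settings from environment variables. The variable names `DAI_ADDRESS`, `DAI_USERNAME`, and `DAI_PASSWORD`, the default address, and the helper `dai_client_kwargs` are all hypothetical and chosen only for this example.

```python
import os

def dai_client_kwargs(env=os.environ):
    """Build keyword arguments for driverlessai.Client from environment variables.

    The variable names and the default address are assumptions; adapt them to your setup.
    """
    return {
        "address": env.get("DAI_ADDRESS", "http://localhost:12345"),
        "username": env["DAI_USERNAME"],   # raises KeyError if the variable is missing
        "password": env["DAI_PASSWORD"],
    }

# Usage (requires a running Driverless AI server):
# dai = driverlessai.Client(**dai_client_kwargs())
```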
Our function splits the data into training and test sets based on the training and test lengths specified by the user. It then runs an experiment in Driverless AI and downloads the test predictions.
[23]:
def dai_moving_window(dataset, train_len, test_len, target, predictors, index_col, time_group_cols,
                      accuracy, time, interpretability):
    # Calculate windows for the training and testing data based on the train_len and test_len arguments.
    # Note: num_dates is the largest index value, which is one less than the
    # number of dates when the index starts at 0.
    num_dates = max(dataset[index_col])
    num_windows = (num_dates - train_len) // test_len
    print(num_windows)

    windows = []
    for i in range(num_windows):
        train_start_id = i * test_len
        train_end_id = train_start_id + (train_len - 1)
        test_start_id = train_end_id + 1
        test_end_id = test_start_id + (test_len - 1)
        window = {'train_start_index': train_start_id,
                  'train_end_index': train_end_id,
                  'test_start_index': test_start_id,
                  'test_end_index': test_end_id}
        windows.append(window)

    # Split the data by the window
    window_predictions = []
    for window in windows:
        train_data = dataset[(dataset[index_col] >= window["train_start_index"]) &
                             (dataset[index_col] <= window["train_end_index"])]
        test_data = dataset[(dataset[index_col] >= window["test_start_index"]) &
                            (dataset[index_col] <= window["test_end_index"])]

        # Get the Driverless AI forecast predictions
        window_preds = dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                                        accuracy, time, interpretability)
        window_predictions.append(window_preds)

    # DataFrame.append was removed in pandas 2.0, so combine the windows with pd.concat
    if not window_predictions:
        return pd.DataFrame([])
    return pd.concat(window_predictions)
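The window arithmetic can be checked without a Driverless AI server. The sketch below isolates it in a standalone helper (`rolling_windows` is a name introduced here for illustration) and applies it to the demo settings used later in this notebook: index values 0 through 1029, a training length of 1000, and a test length of 3.

```python
def rolling_windows(max_index, train_len, test_len):
    """Return (train_start, train_end, test_start, test_end) tuples; all bounds are inclusive."""
    num_windows = (max_index - train_len) // test_len
    windows = []
    for i in range(num_windows):
        train_start = i * test_len
        train_end = train_start + train_len - 1
        test_start = train_end + 1
        test_end = test_start + test_len - 1
        windows.append((train_start, train_end, test_start, test_end))
    return windows

windows = rolling_windows(max_index=1029, train_len=1000, test_len=3)
print(len(windows))   # 9
print(windows[0])     # (0, 999, 1000, 1002)
print(windows[-1])    # (24, 1023, 1024, 1026)
```

Each window slides forward by `test_len`, so the test ranges do not overlap. Because the window count is derived from the largest index value rather than from the number of dates, index values 1027 through 1029 are not used in this configuration.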
[24]:
def dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                     accuracy, time, interpretability):
    # Save dataset
    train_path = "./train_data.csv"
    test_path = "./test_data.csv"
    # Drop repeated column names (index_col and the time groups may also be listed
    # in predictors) while keeping the original order
    keep_cols = list(dict.fromkeys(predictors + [target, index_col] + time_group_cols))
    # index=False keeps the pandas row labels out of the CSV, so they are not uploaded as an extra column
    train_data[keep_cols].to_csv(train_path, index=False)
    test_data[keep_cols].to_csv(test_path, index=False)

    # Add datasets to Driverless AI
    train_dai = dai.datasets.create(train_path, force=True)
    test_dai = dai.datasets.create(test_path, force=True)

    # Run Driverless AI Experiment
    model = dai.experiments.create(train_dataset=train_dai,
                                   target_column=target, task='regression',
                                   accuracy=accuracy,
                                   time=time,
                                   interpretability=interpretability,
                                   name='stock_timeseries_beta', scorer="RMSE", seed=1234,
                                   time_column=index_col,
                                   time_groups_columns=time_group_cols,
                                   num_prediction_periods=test_data[index_col].nunique(),
                                   num_gap_periods=0,
                                   force=True)

    # Download the predictions on the test dataset
    test_predictions_path = model.predict(dataset=test_dai).download(dst_dir='.', overwrite=True)
    test_predictions = pd.read_csv(test_predictions_path)
    test_predictions.columns = ["Prediction"]

    # Add predictions to original test data
    result_cols = [target, index_col] + time_group_cols
    test_predictions = pd.concat([test_data[result_cols].reset_index(drop=True), test_predictions], axis=1)
    return test_predictions
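The `reset_index(drop=True)` call in the last step matters because `pd.concat(..., axis=1)` aligns rows by index label, and the filtered test rows keep their labels from the full data set. The small illustration below uses made-up values and assumes, as the function above does, that predictions come back in the same row order as the test data.

```python
import pandas as pd

# Hypothetical test rows that kept labels 5 and 6 from a larger frame.
test_rows = pd.DataFrame({"close": [10.0, 20.0], "Name": ["AAL", "AAPL"]}, index=[5, 6])
# Predictions read back from a CSV start at label 0.
preds = pd.DataFrame({"Prediction": [10.5, 19.5]})

# Without resetting, labels 0, 1 and 5, 6 do not match, so rows are padded with NaN.
misaligned = pd.concat([test_rows, preds], axis=1)
# After resetting, both frames use labels 0 and 1 and line up row by row.
aligned = pd.concat([test_rows.reset_index(drop=True), preds], axis=1)

print(misaligned.shape)  # (4, 3)
print(aligned.shape)     # (2, 3)
```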
[25]:
predictors = ["Name", "index"]
target = "close"
index_col = "index"
time_group_cols = ["Name"]
[26]:
# We will filter the dataset to the first 1030 dates for demo purposes
filtered_stock_data = stock_data[stock_data["index"] <= 1029]
forecast_predictions = dai_moving_window(filtered_stock_data, 1000, 3, target, predictors, index_col, time_group_cols, accuracy = 1, time = 1, interpretability = 1)
9
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=82c893e2-e305-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './82c893e2-e305-11ea-9088-0242ac110002_preds_82e2cb12.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=c807ea1a-e306-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './c807ea1a-e306-11ea-9088-0242ac110002_preds_823bdb9c.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=11cb3dc2-e308-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './11cb3dc2-e308-11ea-9088-0242ac110002_preds_f3ce319b.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=7229d0c4-e309-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './7229d0c4-e309-11ea-9088-0242ac110002_preds_3ee4f4ea.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=d27db76e-e30a-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './d27db76e-e30a-11ea-9088-0242ac110002_preds_dfc77e64.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=33418a16-e30c-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './33418a16-e30c-11ea-9088-0242ac110002_preds_12878b69.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=c18d9548-e30d-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './c18d9548-e30d-11ea-9088-0242ac110002_preds_8b801977.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=44f0b31a-e30f-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './44f0b31a-e30f-11ea-9088-0242ac110002_preds_7dc3726c.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=b34c1736-e310-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './b34c1736-e310-11ea-9088-0242ac110002_preds_139e07a1.csv'
[27]:
forecast_predictions.head()
[27]:
| | close | index | Name | Prediction |
|---|---|---|---|---|
| 0 | 44.90 | 1000 | AAL | 48.131153 |
| 1 | 121.63 | 1000 | AAPL | 122.264595 |
| 2 | 164.63 | 1000 | AAP | 167.594040 |
| 3 | 60.43 | 1000 | ABBV | 61.546516 |
| 4 | 83.62 | 1000 | ABC | 85.815880 |
[28]:
# Calculate some error metric
mae = (forecast_predictions[target] - forecast_predictions["Prediction"]).abs().mean()
print("Mean Absolute Error: ${:,.2f}".format(mae))
Mean Absolute Error: $2.43
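The experiments above were scored with RMSE, so it can be useful to report that metric alongside the MAE and to break the error down per stock. The sketch below does this on made-up numbers that use the same column names as `forecast_predictions`. The `demo` frame and its values are purely illustrative.

```python
import pandas as pd

# Hypothetical predictions in the same layout as forecast_predictions.
demo = pd.DataFrame({
    "close":      [10.0, 12.0, 100.0, 104.0],
    "Name":       ["AAL", "AAL", "AAPL", "AAPL"],
    "Prediction": [11.0, 11.0, 103.0, 100.0],
})

error = demo["close"] - demo["Prediction"]
mae = error.abs().mean()
rmse = (error ** 2).mean() ** 0.5

# Mean absolute error per stock.
mae_by_name = error.abs().groupby(demo["Name"]).mean()

print("MAE:  {:.2f}".format(mae))
print("RMSE: {:.2f}".format(rmse))
print(mae_by_name)
```

To apply the same calculation to the real results, replace `demo` with `forecast_predictions`.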