Driverless AI - Time Series Recipes with Rolling Window

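This notebook shows an example of using Driverless AI to train experiments on different subsets of data. The result is a collection of forecasted values that can be evaluated. The data used in this notebook is a public dataset: S&P 500 stock data. In this example, we use the all_stocks_5yr.csv dataset.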

For more information, see the Driverless AI Python Client documentation.

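Workflow

Import data into Python
Create a function that slices the data by index
For each slice of data:
- import the data into Driverless AI
- train an experiment
- combine the test predictions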

Import Data

We begin by importing the data using pandas.

[18]:

import pandas as pd
import wget

file = wget.download('https://h2o-public-test-data.s3.amazonaws.com/dai_release_testing/datasets/s&p_all_stocks_5yr.csv')

stock_data = pd.read_csv(file)
stock_data.head()

[18]:

	date	open	high	low	close	volume	Name
0	2013-02-08	15.07	15.12	14.63	14.75	8407500	AAL
1	2013-02-11	14.89	15.01	14.26	14.46	8882000	AAL
2	2013-02-12	14.45	14.51	14.10	14.27	8126000	AAL
3	2013-02-13	14.30	14.94	14.25	14.66	10259500	AAL
4	2013-02-14	14.94	14.96	13.16	13.99	31879900	AAL

[19]:

# Convert Date column to datetime
stock_data["date"] = pd.to_datetime(stock_data["date"], format="%Y-%m-%d")

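We add a new column, an index column, which we use later to build a rolling window of training and testing data. Because this data only exists for weekdays (when the stock market is open), we use this index instead of the actual date. When Driverless AI forecasts, it predicts the next n days, and in this case we do not want to forecast Saturdays and Sundays. We therefore replace the time column with the index of the record.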

[20]:

dates_index = pd.DataFrame(sorted(stock_data["date"].unique()), columns = ["date"])
dates_index["index"] = range(len(dates_index))
stock_data = pd.merge(stock_data, dates_index, on = "date")

stock_data.head()

[20]:

	date	open	high	low	close	volume	Name
0	2013-02-08	15.0700	15.1200	14.6300	14.7500	8407500	AAL
1	2013-02-08	67.7142	68.4014	66.8928	67.8542	158168416	AAPL
2	2013-02-08	78.3400	79.7200	78.0100	78.9000	1298137	AAP
3	2013-02-08	36.3700	36.4200	35.8250	36.2500	13858795	ABBV
4	2013-02-08	46.5200	46.8950	46.4600	46.8900	1232802	ABC
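To make the effect of the index column concrete, here is a minimal sketch on a toy frame. The tickers `AAA` and `BBB` and their prices are made up for illustration; only the two dates (a Friday and the following Monday) come from the data above. It applies the same three steps as the cell above and compares the calendar gap with the index gap.

```python
import pandas as pd

# Hypothetical toy data: two made-up tickers on a Friday and the following Monday
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-02-08", "2013-02-08", "2013-02-11", "2013-02-11"]),
    "Name": ["AAA", "BBB", "AAA", "BBB"],
    "close": [10.0, 20.0, 11.0, 21.0],
})

# Same steps as above: one row per unique trading date, numbered consecutively
toy_dates = pd.DataFrame(sorted(toy["date"].unique()), columns=["date"])
toy_dates["index"] = range(len(toy_dates))
toy = pd.merge(toy, toy_dates, on="date")

# The weekend adds two calendar days, but the trading-day index advances by one
calendar_gap = (toy["date"].max() - toy["date"].min()).days
index_gap = int(toy["index"].max() - toy["index"].min())
print(calendar_gap, index_gap)
```

The two dates are three calendar days apart but only one index step apart, so a forecast horizon expressed in index units counts trading days only.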

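Create a Moving Window Function

Now we create a function that splits the data by time so that we can run multiple experiments.

We start by logging in to Driverless AI.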

[21]:

import driverlessai
import numpy as np
import pandas as pd
# import h2o
import requests
import math

[22]:

address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
dai = driverlessai.Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI

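The following function splits the data into training and testing sets based on the training length and testing length specified by the user. It then runs an experiment in Driverless AI for each split and downloads the test predictions.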

[23]:

def dai_moving_window(dataset, train_len, test_len, target, predictors, index_col, time_group_cols,
                      accuracy, time, interpretability):

    # Calculate windows for the training and testing data based on the train_len and test_len arguments
    num_dates = max(dataset[index_col])
    num_windows = (num_dates - train_len) // test_len
    print(num_windows)

    windows = []
    for i in range(num_windows):
        train_start_id = i*test_len
        train_end_id = train_start_id + (train_len - 1)
        test_start_id = train_end_id + 1
        test_end_id = test_start_id + (test_len - 1)

        window = {'train_start_index': train_start_id,
                  'train_end_index': train_end_id,
                  'test_start_index': test_start_id,
                  'test_end_index': test_end_id}
        windows.append(window)


    # Split the data by the window
    window_predictions = []
    for window in windows:
        train_data = dataset[(dataset[index_col] >= window.get("train_start_index")) &
                             (dataset[index_col] <= window.get("train_end_index"))]

        test_data = dataset[(dataset[index_col] >= window.get("test_start_index")) &
                            (dataset[index_col] <= window.get("test_end_index"))]

        # Get the Driverless AI forecast predictions
        window_preds = dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                                        accuracy, time, interpretability)
        window_predictions.append(window_preds)

    # DataFrame.append was removed in pandas 2.0, so combine the per-window results with pd.concat
    forecast_predictions = pd.concat(window_predictions) if window_predictions else pd.DataFrame([])

    return forecast_predictions
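
To check which windows the function above generates without launching any experiments, the window arithmetic can be reproduced on its own. The following is a minimal sketch: `compute_windows` is a hypothetical helper, not part of the notebook or of the Driverless AI client, that mirrors the first loop of `dai_moving_window` and takes the maximum index value directly.

```python
def compute_windows(max_index, train_len, test_len):
    """Return (train_start, train_end, test_start, test_end) tuples.

    Hypothetical helper that mirrors the window loop in dai_moving_window.
    """
    num_windows = (max_index - train_len) // test_len
    windows = []
    for i in range(num_windows):
        train_start = i * test_len                # each window shifts by test_len
        train_end = train_start + train_len - 1   # inclusive end of training slice
        test_start = train_end + 1                # test slice starts right after
        test_end = test_start + test_len - 1      # inclusive end of test slice
        windows.append((train_start, train_end, test_start, test_end))
    return windows


# Same settings as the demo run later in this notebook: max index 1029, 1000 train, 3 test
windows = compute_windows(max_index=1029, train_len=1000, test_len=3)
print(len(windows))   # 9
print(windows[0])     # (0, 999, 1000, 1002)
print(windows[-1])    # (24, 1023, 1024, 1026)
```

With these settings the test windows cover indices 1000 through 1026. Indices 1027 to 1029 are not used by any window, because the window count is derived from the maximum index (1029) rather than from the number of dates (1030).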

[24]:

def dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
                     accuracy, time, interpretability):

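    # Save the train and test slices as CSV files.
    # keep_cols is de-duplicated because predictors may overlap with index_col and time_group_cols;
    # index=False keeps the pandas row index out of the uploaded files.
    train_path = "./train_data.csv"
    test_path = "./test_data.csv"
    keep_cols = list(dict.fromkeys(predictors + [target, index_col] + time_group_cols))
    train_data[keep_cols].to_csv(train_path, index=False)
    test_data[keep_cols].to_csv(test_path, index=False)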

    # Add datasets to Driverless AI
    train_dai = dai.datasets.create(train_path, force=True)
    test_dai = dai.datasets.create(test_path, force=True)
    ids = [c for c in train_dai.columns]

    # Run Driverless AI Experiment
    model = dai.experiments.create(train_dataset=train_dai,
                                   target_column=target, task='regression',
                                   accuracy=accuracy,
                                   time=time,
                                   interpretability=interpretability,
                                   name='stock_timeseries_beta', scorer="RMSE", seed=1234,
                                   time_column=index_col,
                                   time_groups_columns=time_group_cols,
                                   num_prediction_periods=test_data[index_col].nunique(),
                                   num_gap_periods=0,
                                   force=True)

    # Download the predictions on the test dataset
    test_predictions_path = model.predict(dataset = test_dai).download(dst_dir = '.', overwrite = True)
    test_predictions = pd.read_csv(test_predictions_path)
    test_predictions.columns = ["Prediction"]

    # Add predictions to original test data
    keep_cols = [target, index_col] + time_group_cols
    test_predictions = pd.concat([test_data[keep_cols].reset_index(drop=True), test_predictions], axis = 1)


    return test_predictions

[25]:

predictors = ["Name", "index"]
target = "close"
index_col = "index"
time_group_cols = ["Name"]

[26]:

# We will filter the dataset to the first 1030 dates for demo purposes
filtered_stock_data = stock_data[stock_data["index"] <= 1029]
forecast_predictions = dai_moving_window(filtered_stock_data, 1000, 3, target, predictors, index_col, time_group_cols, accuracy = 1, time = 1, interpretability = 1)

9
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=82c893e2-e305-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './82c893e2-e305-11ea-9088-0242ac110002_preds_82e2cb12.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=c807ea1a-e306-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './c807ea1a-e306-11ea-9088-0242ac110002_preds_823bdb9c.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=11cb3dc2-e308-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './11cb3dc2-e308-11ea-9088-0242ac110002_preds_f3ce319b.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=7229d0c4-e309-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './7229d0c4-e309-11ea-9088-0242ac110002_preds_3ee4f4ea.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=d27db76e-e30a-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './d27db76e-e30a-11ea-9088-0242ac110002_preds_dfc77e64.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=33418a16-e30c-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './33418a16-e30c-11ea-9088-0242ac110002_preds_12878b69.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=c18d9548-e30d-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './c18d9548-e30d-11ea-9088-0242ac110002_preds_8b801977.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=44f0b31a-e30f-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './44f0b31a-e30f-11ea-9088-0242ac110002_preds_7dc3726c.csv'
Complete 100.00% - [4/4] Computing column statistics
Complete 100.00% - [4/4] Computing column statistics
Experiment launched at: http://localhost:12345/#experiment?key=b34c1736-e310-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
Complete
Downloaded './b34c1736-e310-11ea-9088-0242ac110002_preds_139e07a1.csv'

[27]:

forecast_predictions.head()

[27]:

	close	index	Name	Prediction
0	44.90	1000	AAL	48.131153
1	121.63	1000	AAPL	122.264595
2	164.63	1000	AAP	167.594040
3	60.43	1000	ABBV	61.546516
4	83.62	1000	ABC	85.815880

[28]:

# Calculate some error metric
mae = (forecast_predictions[target] - forecast_predictions["Prediction"]).abs().mean()
print("Mean Absolute Error: ${:,.2f}".format(mae))

Mean Absolute Error: $2.43
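
A single overall MAE hides how the error is distributed. As a minimal sketch, the snippet below computes RMSE (the scorer used in the experiments above) and breaks MAE down by forecast index and by ticker. The `preds` frame and its values are made up for illustration; it only assumes the same four columns as `forecast_predictions`.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions with the same columns as forecast_predictions
preds = pd.DataFrame({
    "close":      [10.0, 20.0, 11.0, 21.0],
    "index":      [1000, 1000, 1001, 1001],
    "Name":       ["AAA", "BBB", "AAA", "BBB"],
    "Prediction": [11.0, 18.0, 11.0, 24.0],
})

errors = preds["close"] - preds["Prediction"]

# Root mean squared error over all rows
rmse = float(np.sqrt((errors ** 2).mean()))

# Mean absolute error per forecast index and per ticker
mae_by_index = errors.abs().groupby(preds["index"]).mean()
mae_by_name = errors.abs().groupby(preds["Name"]).mean()

print("RMSE: {:.4f}".format(rmse))
print(mae_by_index)
print(mae_by_name)
```

Applied to the real `forecast_predictions`, the per-index view would show whether error grows toward the end of each test window, and the per-ticker view would show which stocks contribute most to the overall error.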
