R Client Tutorial

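This tutorial describes how to use the Driverless AI R client package to work with and control the Driverless AI platform. It covers the main predictive data science workflow, including: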

  1. Loading data

  2. Automated feature engineering and model tuning

  3. Model inspection

  4. Predicting on new data

  5. Managing datasets and models

Note: These steps assume that you have already entered your license key in the Driverless AI UI.

Loading data

Before we can start working with the Driverless AI platform (DAI), we have to import the package and initialize the connection:

library(dai)
dai.connect(uri = 'http://localhost:12345', username = 'h2oai', password = 'h2oai')

creditcard <- dai.create_dataset('/data/smalldata/kaggle/CreditCard/creditcard_train_cat.csv')
#>
  |
  |                                                                 |   0%
  |
  |================                                                 |  24%
  |
  |=================================================================| 100%

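The function dai.create_dataset() loads data that is located on the machine hosting DAI. The command above assumes that creditcard_train_cat.csv is under the /data folder on the machine running Driverless AI. The file can be downloaded from https://s3.amazonaws.com/h2o-public-test-data/smalldata/kaggle/CreditCard/creditcard_train_cat.csv.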

If you want to upload data from your workstation instead, use dai.upload_dataset().
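As a rough sketch of the upload path, the snippet below writes a local data.frame to a temporary CSV file and wraps the upload in a small helper. The helper name upload_local_csv and the file name iris_local.csv are made up for illustration, and the assumption that dai.upload_dataset() accepts the local file path as its first argument should be checked against the package help before relying on it.

```r
# Write a data.frame that lives on the workstation to a temporary CSV file.
local_csv <- file.path(tempdir(), "iris_local.csv")
write.csv(iris, local_csv, row.names = FALSE)

# Hypothetical helper: upload a local CSV file to the DAI server.
# Assumes the dai package is installed, dai.connect() has already been called,
# and dai.upload_dataset() takes the local file path as its first argument.
upload_local_csv <- function(path) {
  stopifnot(file.exists(path))
  dai::dai.upload_dataset(path)
}

# With a live connection, this would be the actual upload:
# iris.uploaded <- upload_local_csv(local_csv)
```

Defining the helper does not contact the server; only calling it does, so the file preparation can be checked on its own first.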

If you have already loaded the data into an R data.frame, you can convert it to a DAIFrame. For example:

iris.dai <- as.DAIFrame(iris)
#>
  |
  |                                                                 |   0%
  |
  |=================================================================| 100%

print(iris.dai)
#> DAI frame '7c38cb84-5baa-11e9-a50b-b938de969cdb': 150 obs. of 5 variables
#> File path: ./tmp/7c38cb84-5baa-11e9-a50b-b938de969cdb/iris9e1f15d2df00.csv.1554912339.9424415.bin

Wherever a progress bar is displayed, you can switch it off by setting progress = FALSE.

Once the dataset has been created, you can display its basic information and summary statistics by calling the generics print and summary:

print(creditcard)
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv

summary(creditcard)
#>                      variable num_classes is_numeric count
#> 1                          ID           0       TRUE 23999
#> 2                   LIMIT_BAL          79       TRUE 23999
#> 3                         SEX           2      FALSE 23999
#> 4                   EDUCATION           4      FALSE 23999
#> 5                    MARRIAGE           4      FALSE 23999
#> 6                         AGE          55       TRUE 23999
#> 7                       PAY_1          11       TRUE 23999
#> 8                       PAY_2          11       TRUE 23999
#> 9                       PAY_3          11       TRUE 23999
#> 10                      PAY_4          11       TRUE 23999
#> 11                      PAY_5          10       TRUE 23999
#> 12                      PAY_6          10       TRUE 23999
#> 13                  BILL_AMT1           0       TRUE 23999
#> 14                  BILL_AMT2           0       TRUE 23999
#> 15                  BILL_AMT3           0       TRUE 23999
#> 16                  BILL_AMT4           0       TRUE 23999
#> 17                  BILL_AMT5           0       TRUE 23999
#> 18                  BILL_AMT6           0       TRUE 23999
#> 19                   PAY_AMT1           0       TRUE 23999
#> 20                   PAY_AMT2           0       TRUE 23999
#> 21                   PAY_AMT3           0       TRUE 23999
#> 22                   PAY_AMT4           0       TRUE 23999
#> 23                   PAY_AMT5           0       TRUE 23999
#> 24                   PAY_AMT6           0       TRUE 23999
#> 25 DEFAULT_PAYMENT_NEXT_MONTH           2       TRUE 23999
#>                    mean              std     min     max unique  freq
#> 1                 12000 6928.05889120466       1   23999  23999     1
#> 2      165498.715779824 129130.743065318   10000 1000000     79  2740
#> 3                                                             2  8921
#> 4                                                             4 11360
#> 5                                                             4 12876
#> 6      35.3808492020501  9.2710457493384      21      79     55  1284
#> 7  -0.00312513021375891 1.12344874325651      -2       8     11 11738
#> 8    -0.123463477644902 1.20059118344043      -2       8     11 12543
#> 9    -0.154756448185341 1.20405796618856      -2       8     11 12576
#> 10   -0.211675486478603 1.16657279943005      -2       8     11 13250
#> 11   -0.252885536897371    1.13700672904      -2       8     10 13520
#> 12   -0.278011583815992  1.1581916495226      -2       8     10 12876
#> 13     50598.9286636943 72650.1978092856 -165580  964511  18717  1607
#> 14     48648.0474186424 70365.3956426641  -69777  983931  18367  2049
#> 15     46368.9035376474 68194.7195202748 -157264 1664089  18131  2325
#> 16     42369.8728280345 63071.4551670874 -170000  891586  17719  2547
#> 17     40002.3330972124 60345.7282797424  -81334  927171  17284  2840
#> 18     38565.2666361098 59156.5011434754 -339603  961664  16906  3258
#> 19     5543.09804575191   15068.86272958       0  505000   6918  4270
#> 20     5815.52852202175  20797.443884891       0 1684259   6839  4362
#> 21     4969.43139297471 16095.9292948255       0  896040   6424  4853
#> 22     4743.65686070253 14883.5548720259       0  497000   6028  5200
#> 23     4783.64369348723 15270.7039035392       0  417990   5984  5407
#> 24     5189.57360723363 17630.7185745277       0  528666   5988  5846
#> 25    0.223717654902288 0.41674368928609   FALSE    TRUE      2  5369
#>                                                                                                                                                             num_hist_ticks
#> 1                                               1.0, 2400.8, 4800.6, 7200.400000000001, 9600.2, 12000.0, 14399.800000000001, 16799.600000000002, 19199.4, 21599.2, 23999.0
#> 2                                                             10000.0, 109000.0, 208000.0, 307000.0, 406000.0, 505000.0, 604000.0, 703000.0, 802000.0, 901000.0, 1000000.0
#> 3
#> 4
#> 5
#> 6                                                                                            21.0, 26.8, 32.6, 38.4, 44.2, 50.0, 55.8, 61.6, 67.4, 73.19999999999999, 79.0
#> 7                                                                                                                                        -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8
#> 8                                                                                                                                        -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8
#> 9                                                                                                                                        -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8
#> 10                                                                                                                                       -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8
#> 11                                                                                                                                          -2, -1, 0, 2, 3, 4, 5, 6, 7, 8
#> 12                                                                                                                                          -2, -1, 0, 2, 3, 4, 5, 6, 7, 8
#> 13           -165580.0, -52570.899999999994, 60438.20000000001, 173447.30000000005, 286456.4, 399465.5, 512474.6000000001, 625483.7000000001, 738492.8, 851501.9, 964511.0
#> 14                                          -69777.0, 35593.8, 140964.6, 246335.40000000002, 351706.2, 457077.0, 562447.8, 667818.6, 773189.4, 878560.2000000001, 983931.0
#> 15         -157264.0, 24871.29999999999, 207006.59999999998, 389141.8999999999, 571277.2, 753412.5, 935547.7999999998, 1117683.0999999999, 1299818.4, 1481953.7, 1664089.0
#> 16 -170000.0, -63841.399999999994, 42317.20000000001, 148475.80000000005, 254634.40000000002, 360793.0, 466951.6000000001, 573110.2000000001, 679268.8, 785427.4, 891586.0
#> 17                                                             -81334.0, 19516.5, 120367.0, 221217.5, 322068.0, 422918.5, 523769.0, 624619.5, 725470.0, 826320.5, 927171.0
#> 18                                       -339603.0, -209476.3, -79349.6, 50777.09999999998, 180903.8, 311030.5, 441157.19999999995, 571283.9, 701410.6, 831537.3, 961664.0
#> 19                                                                  0.0, 50500.0, 101000.0, 151500.0, 202000.0, 252500.0, 303000.0, 353500.0, 404000.0, 454500.0, 505000.0
#> 20                                0.0, 168425.9, 336851.8, 505277.69999999995, 673703.6, 842129.5, 1010555.3999999999, 1178981.3, 1347407.2, 1515833.0999999999, 1684259.0
#> 21                                                                  0.0, 89604.0, 179208.0, 268812.0, 358416.0, 448020.0, 537624.0, 627228.0, 716832.0, 806436.0, 896040.0
#> 22                                                                   0.0, 49700.0, 99400.0, 149100.0, 198800.0, 248500.0, 298200.0, 347900.0, 397600.0, 447300.0, 497000.0
#> 23                                                                   0.0, 41799.0, 83598.0, 125397.0, 167196.0, 208995.0, 250794.0, 292593.0, 334392.0, 376191.0, 417990.0
#> 24                                                        0.0, 52866.6, 105733.2, 158599.8, 211466.4, 264333.0, 317199.6, 370066.2, 422932.8, 475799.39999999997, 528666.0
#> 25                                                                                                                                                             False, True
#>                                               num_hist_counts        top
#> 1  2400, 2400, 2400, 2400, 2399, 2400, 2400, 2400, 2400, 2400
#> 2             10151, 6327, 3965, 2149, 1251, 96, 44, 15, 0, 1
#> 3                                                                 female
#> 4                                                             university
#> 5                                                                 single
#> 6         4285, 6546, 5187, 3780, 2048, 1469, 501, 147, 34, 2
#> 7        2086, 4625, 11738, 2994, 2185, 254, 66, 17, 9, 7, 18
#> 8          2953, 4886, 12543, 20, 3204, 268, 76, 21, 9, 18, 1
#> 9          3197, 4787, 12576, 4, 3121, 183, 64, 17, 21, 27, 2
#> 10          3382, 4555, 13250, 2, 2515, 158, 55, 29, 5, 46, 2
#> 11             3539, 4482, 13520, 2178, 147, 71, 11, 3, 47, 1
#> 12             3818, 4722, 12876, 2324, 158, 37, 9, 16, 37, 2
#> 13                2, 17603, 4754, 1193, 316, 111, 18, 1, 0, 1
#> 14                14571, 7214, 1578, 429, 155, 43, 7, 1, 0, 1
#> 15                    12977, 10150, 767, 99, 5, 0, 0, 0, 0, 1
#> 16                 2, 16619, 5775, 1181, 311, 89, 20, 1, 0, 1
#> 17                12722, 9033, 1720, 374, 113, 31, 4, 0, 1, 1
#> 18                   1, 1, 18312, 4788, 745, 131, 19, 1, 0, 1
#> 19                      23643, 249, 56, 26, 14, 8, 0, 1, 1, 1
#> 20                         23936, 50, 11, 1, 0, 0, 0, 0, 0, 1
#> 21                        23836, 130, 20, 9, 3, 0, 0, 0, 0, 1
#> 22                      23647, 235, 65, 29, 11, 5, 4, 0, 2, 1
#> 23                      23588, 234, 94, 40, 22, 7, 3, 8, 0, 3
#> 24                      23605, 235, 77, 56, 15, 5, 1, 3, 0, 2
#> 25                                                18630, 5369
#>              nonnum_hist_ticks nonnum_hist_counts
#> 1
#> 2
#> 3          female, male, Other     15078, 8921, 0
#> 4  university, graduate, Other  11360, 8442, 4197
#> 5       single, married, Other  12876, 10813, 310
#> 6
#> 7
#> 8
#> 9
#> 10
#> 11
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17
#> 18
#> 19
#> 20
#> 21
#> 22
#> 23
#> 24
#> 25

Several other generics also work on a DAIFrame: dim, head, and format.

dim(creditcard)
#> [1] 23999    25

head(creditcard, 10)
#>    ID LIMIT_BAL    SEX  EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4
#> 1   1     20000 female university  married  24     2     2    -1    -1
#> 2   2    120000 female university   single  26    -1     2     0     0
#> 3   3     90000 female university   single  34     0     0     0     0
#> 4   4     50000 female university  married  37     0     0     0     0
#> 5   5     50000   male university  married  57    -1     0    -1     0
#> 6   6     50000   male   graduate   single  37     0     0     0     0
#> 7   7    500000   male   graduate   single  29     0     0     0     0
#> 8   8    100000 female university   single  23     0    -1    -1     0
#> 9   9    140000 female highschool  married  28     0     0     2     0
#> 10 10     20000   male highschool   single  35    -2    -2    -2    -2
#>    PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
#> 1     -2    -2      3913      3102       689         0         0         0
#> 2      0     2      2682      1725      2682      3272      3455      3261
#> 3      0     0     29239     14027     13559     14331     14948     15549
#> 4      0     0     46990     48233     49291     28314     28959     29547
#> 5      0     0      8617      5670     35835     20940     19146     19131
#> 6      0     0     64400     57069     57608     19394     19619     20024
#> 7      0     0    367965    412023    445007    542653    483003    473944
#> 8      0    -1     11876       380       601       221      -159       567
#> 9      0     0     11285     14096     12108     12211     11793      3719
#> 10    -1    -1         0         0         0         0     13007     13912
#>    PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
#> 1         0      689        0        0        0        0
#> 2         0     1000     1000     1000        0     2000
#> 3      1518     1500     1000     1000     1000     5000
#> 4      2000     2019     1200     1100     1069     1000
#> 5      2000    36681    10000     9000      689      679
#> 6      2500     1815      657     1000     1000      800
#> 7     55000    40000    38000    20239    13750    13770
#> 8       380      601        0      581     1687     1542
#> 9      3329        0      432     1000     1000     1000
#> 10        0        0        0    13007     1122        0
#>    DEFAULT_PAYMENT_NEXT_MONTH
#> 1                        TRUE
#> 2                        TRUE
#> 3                       FALSE
#> 4                       FALSE
#> 5                       FALSE
#> 6                       FALSE
#> 7                       FALSE
#> 8                       FALSE
#> 9                       FALSE
#> 10                      FALSE

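You cannot, however, use a DAIFrame to access all of its data, nor can you use it to modify the data. It only represents the dataset that is loaded into the DAI platform. The head function gives access to the example data only: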

creditcard$example_data[1:10, ]
#>    ID LIMIT_BAL    SEX  EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4
#> 1   1     20000 female university  married  24     2     2    -1    -1
#> 2   2    120000 female university   single  26    -1     2     0     0
#> 3   3     90000 female university   single  34     0     0     0     0
#> 4   4     50000 female university  married  37     0     0     0     0
#> 5   5     50000   male university  married  57    -1     0    -1     0
#> 6   6     50000   male   graduate   single  37     0     0     0     0
#> 7   7    500000   male   graduate   single  29     0     0     0     0
#> 8   8    100000 female university   single  23     0    -1    -1     0
#> 9   9    140000 female highschool  married  28     0     0     2     0
#> 10 10     20000   male highschool   single  35    -2    -2    -2    -2
#>    PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
#> 1     -2    -2      3913      3102       689         0         0         0
#> 2      0     2      2682      1725      2682      3272      3455      3261
#> 3      0     0     29239     14027     13559     14331     14948     15549
#> 4      0     0     46990     48233     49291     28314     28959     29547
#> 5      0     0      8617      5670     35835     20940     19146     19131
#> 6      0     0     64400     57069     57608     19394     19619     20024
#> 7      0     0    367965    412023    445007    542653    483003    473944
#> 8      0    -1     11876       380       601       221      -159       567
#> 9      0     0     11285     14096     12108     12211     11793      3719
#> 10    -1    -1         0         0         0         0     13007     13912
#>    PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
#> 1         0      689        0        0        0        0
#> 2         0     1000     1000     1000        0     2000
#> 3      1518     1500     1000     1000     1000     5000
#> 4      2000     2019     1200     1100     1069     1000
#> 5      2000    36681    10000     9000      689      679
#> 6      2500     1815      657     1000     1000      800
#> 7     55000    40000    38000    20239    13750    13770
#> 8       380      601        0      581     1687     1542
#> 9      3329        0      432     1000     1000     1000
#> 10        0        0        0    13007     1122        0
#>    DEFAULT_PAYMENT_NEXT_MONTH
#> 1                        TRUE
#> 2                        TRUE
#> 3                       FALSE
#> 4                       FALSE
#> 5                       FALSE
#> 6                       FALSE
#> 7                       FALSE
#> 8                       FALSE
#> 9                       FALSE
#> 10                      FALSE

A dataset can be split into a training set and a test set directly from R:

creditcard.splits <- dai.split_dataset(creditcard,
                                       output_name1 = 'train',
                                       output_name2 = 'test',
                                       ratio = .8,
                                       seed = 25,
                                       progress = FALSE)

In this example, creditcard.splits is a list with two elements named "train" and "test", where 80% of the data goes into the training set and 20% into the test set.

creditcard.splits$train
#> DAI frame '7cf3024c-5baa-11e9-a50b-b938de969cdb': 19199 obs. of 25 variables
#> File path: ./tmp/7cf3024c-5baa-11e9-a50b-b938de969cdb/train.1554912341.0864356.bin

creditcard.splits$test
#> DAI frame '7cf613a6-5baa-11e9-a50b-b938de969cdb': 4800 obs. of 25 variables
#> File path: ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin
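A quick arithmetic check of the split, using only the row counts printed above, might look like the following. The counts are typed in by hand so that the snippet runs without a server; with a live connection the same numbers could presumably be read with dim() on each frame.

```r
# Row counts copied from the printed frames above.
n_total <- 23999
n_train <- 19199
n_test  <- 4800

# The two parts should add up to the original dataset.
stopifnot(n_train + n_test == n_total)

# Fraction of rows that ended up in each part.
train_share <- n_train / n_total
test_share  <- n_test / n_total
round(c(train = train_share, test = test_share), 3)
```

The shares are not exactly 0.8 and 0.2 because 23999 rows cannot be divided in that ratio without a remainder.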

By default a random sample is drawn, but you can also split with stratification or by time. See the function's documentation for more details.

Automated feature engineering and model tuning

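One of the main strengths of Driverless AI is fully automated feature engineering, together with hyperparameter tuning, model selection, and ensembling. The function dai.train() runs the experiment and returns a DAIModel instance that represents the resulting model.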

model <- dai.train(training_frame = creditcard.splits$train,
                   testing_frame = creditcard.splits$test,
                   target_col = 'DEFAULT_PAYMENT_NEXT_MONTH',
                   is_classification = TRUE,
                   is_timeseries = FALSE,
                   accuracy = 1, time = 1, interpretability = 10,
                   seed = 25)
#>
  |
  |                                                                 |   0%
  |
  |==========================                                       |  40%
  |
  |===============================================                  |  73%
  |
  |===========================================================      |  91%
  |
  |=================================================================| 100%

If you do not specify accuracy, time, or interpretability, the DAI platform suggests values for them. (See dai.suggest_model_params.)

Model inspection

As with DAIFrame, generic methods such as print, format, summary, and predict also work on a DAIModel:

print(model)
#> Status: Complete
#> Experiment: 7e2b70ae-5baa-11e9-a50b-b938de969cdb, 2019-04-10 18:06, 1.7.0+local_0c7d019-dirty
#>   Settings: 1/1/10, seed=25, GPUs enabled
#>   Train data: train (19199, 25)
#>   Validation data: N/A
#>   Test data: test (4800, 24)
#>   Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 2/2 GPUs
#>   Max memory usage: 0.406 GB, 0.167 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#>   Validation scheme: stratified, 1 internal holdout
#>   Feature engineering: 33 features scored (18 selected)
#> Timing:
#>   Data preparation: 4.94 secs
#>   Model and feature tuning: 10.13 secs (3 models trained)
#>   Feature evolution: 5.54 secs (1 of 3 model trained)
#>   Final pipeline training: 7.85 secs (1 model trained)
#>   Python / MOJO scorer building: 42.05 secs / 0.00 secs
#> Validation score: AUC = 0.77802 +/- 0.0077539 (baseline)
#> Validation score: AUC = 0.77802 +/- 0.0077539 (final pipeline)
#> Test score:       AUC = 0.7861 +/- 0.0064711 (final pipeline)

summary(model)$score
#> [1] 0.7780229

Predicting on new data

New data can be scored in two different ways:

  • Call predict() directly on the model in the R session.

  • Download a scoring pipeline and embed it in a Python or Java workflow.

Predicting in R

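The generic predict() either returns an R data.frame with the results directly (the default) or, with return_df = FALSE, returns a URL pointing to a CSV file that contains the results. The latter can be useful when you predict on a large dataset.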

predictions <- predict(model, newdata = creditcard.splits$test)
#>
  |
  |                                                                 |   0%
  |
  |=================================================================| 100%
#> Loading required package: bitops

head(predictions)
#>   DEFAULT_PAYMENT_NEXT_MONTH.0 DEFAULT_PAYMENT_NEXT_MONTH.1
#> 1                    0.8879988                   0.11200116
#> 2                    0.9289870                   0.07101299
#> 3                    0.9550328                   0.04496716
#> 4                    0.3513577                   0.64864230
#> 5                    0.9183724                   0.08162758
#> 6                    0.9154425                   0.08455751
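The two columns hold class probabilities, so turning them into class labels is left to the caller. A minimal sketch, assuming a cutoff of 0.5 (a choice made here for illustration, not something the package prescribes) and re-typing the six rows printed above so that it runs without a server:

```r
# The six rows shown by head(predictions), re-typed by hand.
preds <- data.frame(
  DEFAULT_PAYMENT_NEXT_MONTH.0 = c(0.8879988, 0.9289870, 0.9550328,
                                   0.3513577, 0.9183724, 0.9154425),
  DEFAULT_PAYMENT_NEXT_MONTH.1 = c(0.11200116, 0.07101299, 0.04496716,
                                   0.64864230, 0.08162758, 0.08455751)
)

# Assumed cutoff; tune it to the cost of false positives vs. false negatives.
cutoff <- 0.5

# Label a row as a predicted default when P(class 1) reaches the cutoff.
preds$predicted_default <- preds$DEFAULT_PAYMENT_NEXT_MONTH.1 >= cutoff

# Row indices of the predicted defaults.
which(preds$predicted_default)
```

With these six rows only the fourth one crosses the cutoff.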

predict(model, newdata = creditcard.splits$test, return_df = FALSE)
#>
  |
  |                                                                 |   0%
  |
  |=================================================================| 100%
#> [1] "h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/7e2b70ae-5baa-11e9-a50b-b938de969cdb_preds_f854b49f.csv"

Downloading the Python or MOJO scoring pipeline

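To productionize your model in Python or Java, you can download the full Python or MOJO pipeline, respectively. For more information about how to use the pipelines, refer to the R client documentation.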

dai.download_mojo(model, path = tempdir(), force = TRUE)
#>
  |
  |                                                                 |   0%
  |
  |=================================================================| 100%
#> Downloading the pipeline:
#> [1] "/tmp/RtmppsLTZ9/mojo-7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip"

dai.download_python_pipeline(model, path = tempdir(), force = TRUE)
#>
  |
  |                                                                 |   0%
  |
  |=================================================================| 100%
#> Downloading the pipeline:
#> [1] "/tmp/RtmppsLTZ9/python-pipeline-7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip"

Managing datasets and models

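After some time, you may have multiple datasets and models on your DAI server. The dai package offers a few utility functions to find, reuse, and remove existing datasets and models.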

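If the dataset is already loaded into DAI, you can get its DAIFrame object with either dai.get_frame (if you know the frame's key) or dai.find_dataset (if you know the original path or at least part of it).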

dai.get_frame(creditcard$key)
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv

dai.find_dataset('creditcard')
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv

The latter returns the frame directly if there is only one match. Otherwise, it lets you choose which of the matching frames to return.

You can also get a list of datasets or models:

datasets <- dai.list_datasets()
head(datasets)
#>                                    key                     name
#> 1 7cf613a6-5baa-11e9-a50b-b938de969cdb                     test
#> 2 7cf3024c-5baa-11e9-a50b-b938de969cdb                    train
#> 3 7c38cb84-5baa-11e9-a50b-b938de969cdb     iris9e1f15d2df00.csv
#> 4 7abe28b2-5baa-11e9-a50b-b938de969cdb creditcard_train_cat.csv
#>                                                                                file_path
#> 1                 ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin
#> 2                ./tmp/7cf3024c-5baa-11e9-a50b-b938de969cdb/train.1554912341.0864356.bin
#> 3 ./tmp/7c38cb84-5baa-11e9-a50b-b938de969cdb/iris9e1f15d2df00.csv.1554912339.9424415.bin
#> 4                             tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
#>   file_size data_source row_count column_count import_status import_error
#> 1    567584      upload      4800           25             0
#> 2   2265952      upload     19199           25             0
#> 3      7064      upload       150            5             0
#> 4   2832040        file     23999           25             0
#>   aggregation_status aggregation_error aggregated_frame mapping_frame
#> 1                 -1
#> 2                 -1
#> 3                 -1
#> 4                 -1
#>   uploaded
#> 1     TRUE
#> 2     TRUE
#> 3     TRUE
#> 4    FALSE
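Because the result of dai.list_datasets() prints as an ordinary data.frame, base R subsetting should be enough to narrow the list down. The sketch below assumes that, and rebuilds a few columns of the listing above by hand (as datasets_demo, a name made up here) so that it runs without a server; with a live connection you would subset the datasets object itself.

```r
# A hand-typed subset of the columns printed above (stand-in for `datasets`).
datasets_demo <- data.frame(
  key = c("7cf613a6-5baa-11e9-a50b-b938de969cdb",
          "7cf3024c-5baa-11e9-a50b-b938de969cdb",
          "7c38cb84-5baa-11e9-a50b-b938de969cdb",
          "7abe28b2-5baa-11e9-a50b-b938de969cdb"),
  name = c("test", "train", "iris9e1f15d2df00.csv", "creditcard_train_cat.csv"),
  data_source = c("upload", "upload", "upload", "file"),
  row_count = c(4800, 19199, 150, 23999),
  stringsAsFactors = FALSE
)

# Keep only datasets that were imported from a file on the server.
from_file <- datasets_demo[datasets_demo$data_source == "file", ]

# Keys of frames with more than 10,000 rows, largest first.
big <- datasets_demo[datasets_demo$row_count > 10000, ]
big_keys <- big$key[order(big$row_count, decreasing = TRUE)]
```

A key selected this way can then be passed to dai.get_frame().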

models <- dai.list_models()
head(models)
#>                                    key description
#> 1 7e2b70ae-5baa-11e9-a50b-b938de969cdb    mupulori
#>                   dataset_name               parameters.dataset_key
#> 1 train.1554912341.0864356.bin 7cf3024c-5baa-11e9-a50b-b938de969cdb
#>   parameters.resumed_model_key      parameters.target_col
#> 1                              DEFAULT_PAYMENT_NEXT_MONTH
#>   parameters.weight_col parameters.fold_col parameters.orig_time_col
#> 1
#>   parameters.time_col parameters.is_classification parameters.cols_to_drop
#> 1               [OFF]                         TRUE                    NULL
#>   parameters.validset_key               parameters.testset_key
#> 1                         7cf613a6-5baa-11e9-a50b-b938de969cdb
#>   parameters.enable_gpus parameters.seed parameters.accuracy
#> 1                   TRUE              25                   1
#>   parameters.time parameters.interpretability parameters.scorer
#> 1               1                          10               AUC
#>   parameters.time_groups_columns parameters.time_period_in_seconds
#> 1                           NULL                                NA
#>   parameters.num_prediction_periods parameters.num_gap_periods
#> 1                                NA                         NA
#>   parameters.is_timeseries parameters.config_overrides
#> 1                    FALSE                          NA
#>                                                                                                          log_file_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/h2oai_experiment_logs_7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip
#>                                                                    pickle_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/best_individual.pickle
#>                                                                                                              summary_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/h2oai_experiment_summary_7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip
#>   train_predictions_path valid_predictions_path
#> 1
#>                                                  test_predictions_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/test_preds.csv
#>   progress status training_duration scorer     score test_score deprecated
#> 1        1      0          71.43582    AUC 0.7780229     0.7861      FALSE
#>   model_file_size diagnostic_keys
#> 1       695996094            NULL

If you know the key of a dataset or model, you can obtain the corresponding DAIFrame or DAIModel instance with dai.get_frame or dai.get_model, respectively:

dai.get_model(models$key[1])
#> Status: Complete
#> Experiment: 7e2b70ae-5baa-11e9-a50b-b938de969cdb, 2019-04-10 18:06, 1.7.0+local_0c7d019-dirty
#>   Settings: 1/1/10, seed=25, GPUs enabled
#>   Train data: train (19199, 25)
#>   Validation data: N/A
#>   Test data: test (4800, 24)
#>   Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 2/2 GPUs
#>   Max memory usage: 0.406 GB, 0.167 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#>   Validation scheme: stratified, 1 internal holdout
#>   Feature engineering: 33 features scored (18 selected)
#> Timing:
#>   Data preparation: 4.94 secs
#>   Model and feature tuning: 10.13 secs (3 models trained)
#>   Feature evolution: 5.54 secs (1 of 3 model trained)
#>   Final pipeline training: 7.85 secs (1 model trained)
#>   Python / MOJO scorer building: 42.05 secs / 0.00 secs
#> Validation score: AUC = 0.77802 +/- 0.0077539 (baseline)
#> Validation score: AUC = 0.77802 +/- 0.0077539 (final pipeline)
#> Test score:       AUC = 0.7861 +/- 0.0064711 (final pipeline)
dai.get_frame(datasets$key[1])
#> DAI frame '7cf613a6-5baa-11e9-a50b-b938de969cdb': 4800 obs. of 25 variables
#> File path: ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin

Finally, datasets and models can be removed with dai.rm:

dai.rm(model, creditcard, creditcard.splits$train, creditcard.splits$test)
#> Model 7e2b70ae-5baa-11e9-a50b-b938de969cdb removed
#> Dataset 7abe28b2-5baa-11e9-a50b-b938de969cdb removed
#> Dataset 7cf3024c-5baa-11e9-a50b-b938de969cdb removed
#> Dataset 7cf613a6-5baa-11e9-a50b-b938de969cdb removed

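By default, dai.rm deletes the objects from both the server and the R session. If you wish to remove them from the server only, set from_session=FALSE. Note that only objects can be removed from the session. In the example above, creditcard.splits$train and creditcard.splits$test will not be removed from the R session because they are actually function calls (recall that $ is a function).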
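The point about $ can be seen in plain R without any DAI objects. In the sketch below a named list called splits stands in for creditcard.splits; the element values are placeholders, and base rm() is used only to show why an expression such as splits$train is not something that can be removed by name.

```r
# A plain named list standing in for creditcard.splits.
splits <- list(train = "train frame", test = "test frame")

# `$` is an ordinary function: both forms extract the same element.
same <- identical(splits$train, `$`(splits, "train"))

# rm() only accepts variable names, so passing a `$` call fails.
rm_result <- tryCatch(rm(splits$train), error = function(e) e)
failed <- inherits(rm_result, "error")

# To drop the stale element from the session, assign NULL to it.
splits$train <- NULL
names(splits)
```

Assigning NULL only tidies up the R session; it has no effect on what is stored on the server.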