R Client Tutorial
This tutorial describes how to use the Driverless AI R client package to work with and control the Driverless AI platform. It covers the main predictive data-science workflow, including:
Data load
Automated feature engineering and model tuning
Model inspection
Predicting on new data
Managing the datasets and models
Note: These steps assume that you have entered your license key in the Driverless AI UI.
Loading the Data
Before we can start working with the Driverless AI platform (DAI), we have to import the package and initialize the connection:
library(dai)
dai.connect(uri = 'http://localhost:12345', username = 'h2oai', password = 'h2oai')
creditcard <- dai.create_dataset('/data/smalldata/kaggle/CreditCard/creditcard_train_cat.csv')
#>
|
| | 0%
|
|================ | 24%
|
|=================================================================| 100%
The function dai.create_dataset() loads data located on the machine that hosts DAI. The above command assumes that creditcard_train_cat.csv is available under the /data folder (at /data/smalldata/kaggle/CreditCard/) on the machine running Driverless AI. This file is available at https://s3.amazonaws.com/h2o-public-test-data/smalldata/kaggle/CreditCard/creditcard_train_cat.csv.
If you want to upload data located on your workstation, use dai.upload_dataset() instead.
If you already have the data loaded into an R data.frame, you can simply convert it into a DAIFrame. For example:
iris.dai <- as.DAIFrame(iris)
#>
|
| | 0%
|
|=================================================================| 100%
print(iris.dai)
#> DAI frame '7c38cb84-5baa-11e9-a50b-b938de969cdb': 150 obs. of 5 variables
#> File path: ./tmp/7c38cb84-5baa-11e9-a50b-b938de969cdb/iris9e1f15d2df00.csv.1554912339.9424415.bin
You can switch off the progress bar in any function that displays one by setting progress = FALSE.
Upon creation of the dataset, you can display basic information and summary statistics by calling the generics print and summary:
print(creditcard)
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
summary(creditcard)
#> variable num_classes is_numeric count
#> 1 ID 0 TRUE 23999
#> 2 LIMIT_BAL 79 TRUE 23999
#> 3 SEX 2 FALSE 23999
#> 4 EDUCATION 4 FALSE 23999
#> 5 MARRIAGE 4 FALSE 23999
#> 6 AGE 55 TRUE 23999
#> 7 PAY_1 11 TRUE 23999
#> 8 PAY_2 11 TRUE 23999
#> 9 PAY_3 11 TRUE 23999
#> 10 PAY_4 11 TRUE 23999
#> 11 PAY_5 10 TRUE 23999
#> 12 PAY_6 10 TRUE 23999
#> 13 BILL_AMT1 0 TRUE 23999
#> 14 BILL_AMT2 0 TRUE 23999
#> 15 BILL_AMT3 0 TRUE 23999
#> 16 BILL_AMT4 0 TRUE 23999
#> 17 BILL_AMT5 0 TRUE 23999
#> 18 BILL_AMT6 0 TRUE 23999
#> 19 PAY_AMT1 0 TRUE 23999
#> 20 PAY_AMT2 0 TRUE 23999
#> 21 PAY_AMT3 0 TRUE 23999
#> 22 PAY_AMT4 0 TRUE 23999
#> 23 PAY_AMT5 0 TRUE 23999
#> 24 PAY_AMT6 0 TRUE 23999
#> 25 DEFAULT_PAYMENT_NEXT_MONTH 2 TRUE 23999
#> mean std min max unique freq
#> 1 12000 6928.05889120466 1 23999 23999 1
#> 2 165498.715779824 129130.743065318 10000 1000000 79 2740
#> 3 2 8921
#> 4 4 11360
#> 5 4 12876
#> 6 35.3808492020501 9.2710457493384 21 79 55 1284
#> 7 -0.00312513021375891 1.12344874325651 -2 8 11 11738
#> 8 -0.123463477644902 1.20059118344043 -2 8 11 12543
#> 9 -0.154756448185341 1.20405796618856 -2 8 11 12576
#> 10 -0.211675486478603 1.16657279943005 -2 8 11 13250
#> 11 -0.252885536897371 1.13700672904 -2 8 10 13520
#> 12 -0.278011583815992 1.1581916495226 -2 8 10 12876
#> 13 50598.9286636943 72650.1978092856 -165580 964511 18717 1607
#> 14 48648.0474186424 70365.3956426641 -69777 983931 18367 2049
#> 15 46368.9035376474 68194.7195202748 -157264 1664089 18131 2325
#> 16 42369.8728280345 63071.4551670874 -170000 891586 17719 2547
#> 17 40002.3330972124 60345.7282797424 -81334 927171 17284 2840
#> 18 38565.2666361098 59156.5011434754 -339603 961664 16906 3258
#> 19 5543.09804575191 15068.86272958 0 505000 6918 4270
#> 20 5815.52852202175 20797.443884891 0 1684259 6839 4362
#> 21 4969.43139297471 16095.9292948255 0 896040 6424 4853
#> 22 4743.65686070253 14883.5548720259 0 497000 6028 5200
#> 23 4783.64369348723 15270.7039035392 0 417990 5984 5407
#> 24 5189.57360723363 17630.7185745277 0 528666 5988 5846
#> 25 0.223717654902288 0.41674368928609 FALSE TRUE 2 5369
#> num_hist_ticks
#> 1 1.0, 2400.8, 4800.6, 7200.400000000001, 9600.2, 12000.0, 14399.800000000001, 16799.600000000002, 19199.4, 21599.2, 23999.0
#> 2 10000.0, 109000.0, 208000.0, 307000.0, 406000.0, 505000.0, 604000.0, 703000.0, 802000.0, 901000.0, 1000000.0
#> 3
#> 4
#> 5
#> 6 21.0, 26.8, 32.6, 38.4, 44.2, 50.0, 55.8, 61.6, 67.4, 73.19999999999999, 79.0
#> 7 -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8
#> 8 -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8
#> 9 -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8
#> 10 -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8
#> 11 -2, -1, 0, 2, 3, 4, 5, 6, 7, 8
#> 12 -2, -1, 0, 2, 3, 4, 5, 6, 7, 8
#> 13 -165580.0, -52570.899999999994, 60438.20000000001, 173447.30000000005, 286456.4, 399465.5, 512474.6000000001, 625483.7000000001, 738492.8, 851501.9, 964511.0
#> 14 -69777.0, 35593.8, 140964.6, 246335.40000000002, 351706.2, 457077.0, 562447.8, 667818.6, 773189.4, 878560.2000000001, 983931.0
#> 15 -157264.0, 24871.29999999999, 207006.59999999998, 389141.8999999999, 571277.2, 753412.5, 935547.7999999998, 1117683.0999999999, 1299818.4, 1481953.7, 1664089.0
#> 16 -170000.0, -63841.399999999994, 42317.20000000001, 148475.80000000005, 254634.40000000002, 360793.0, 466951.6000000001, 573110.2000000001, 679268.8, 785427.4, 891586.0
#> 17 -81334.0, 19516.5, 120367.0, 221217.5, 322068.0, 422918.5, 523769.0, 624619.5, 725470.0, 826320.5, 927171.0
#> 18 -339603.0, -209476.3, -79349.6, 50777.09999999998, 180903.8, 311030.5, 441157.19999999995, 571283.9, 701410.6, 831537.3, 961664.0
#> 19 0.0, 50500.0, 101000.0, 151500.0, 202000.0, 252500.0, 303000.0, 353500.0, 404000.0, 454500.0, 505000.0
#> 20 0.0, 168425.9, 336851.8, 505277.69999999995, 673703.6, 842129.5, 1010555.3999999999, 1178981.3, 1347407.2, 1515833.0999999999, 1684259.0
#> 21 0.0, 89604.0, 179208.0, 268812.0, 358416.0, 448020.0, 537624.0, 627228.0, 716832.0, 806436.0, 896040.0
#> 22 0.0, 49700.0, 99400.0, 149100.0, 198800.0, 248500.0, 298200.0, 347900.0, 397600.0, 447300.0, 497000.0
#> 23 0.0, 41799.0, 83598.0, 125397.0, 167196.0, 208995.0, 250794.0, 292593.0, 334392.0, 376191.0, 417990.0
#> 24 0.0, 52866.6, 105733.2, 158599.8, 211466.4, 264333.0, 317199.6, 370066.2, 422932.8, 475799.39999999997, 528666.0
#> 25 False, True
#> num_hist_counts top
#> 1 2400, 2400, 2400, 2400, 2399, 2400, 2400, 2400, 2400, 2400
#> 2 10151, 6327, 3965, 2149, 1251, 96, 44, 15, 0, 1
#> 3 female
#> 4 university
#> 5 single
#> 6 4285, 6546, 5187, 3780, 2048, 1469, 501, 147, 34, 2
#> 7 2086, 4625, 11738, 2994, 2185, 254, 66, 17, 9, 7, 18
#> 8 2953, 4886, 12543, 20, 3204, 268, 76, 21, 9, 18, 1
#> 9 3197, 4787, 12576, 4, 3121, 183, 64, 17, 21, 27, 2
#> 10 3382, 4555, 13250, 2, 2515, 158, 55, 29, 5, 46, 2
#> 11 3539, 4482, 13520, 2178, 147, 71, 11, 3, 47, 1
#> 12 3818, 4722, 12876, 2324, 158, 37, 9, 16, 37, 2
#> 13 2, 17603, 4754, 1193, 316, 111, 18, 1, 0, 1
#> 14 14571, 7214, 1578, 429, 155, 43, 7, 1, 0, 1
#> 15 12977, 10150, 767, 99, 5, 0, 0, 0, 0, 1
#> 16 2, 16619, 5775, 1181, 311, 89, 20, 1, 0, 1
#> 17 12722, 9033, 1720, 374, 113, 31, 4, 0, 1, 1
#> 18 1, 1, 18312, 4788, 745, 131, 19, 1, 0, 1
#> 19 23643, 249, 56, 26, 14, 8, 0, 1, 1, 1
#> 20 23936, 50, 11, 1, 0, 0, 0, 0, 0, 1
#> 21 23836, 130, 20, 9, 3, 0, 0, 0, 0, 1
#> 22 23647, 235, 65, 29, 11, 5, 4, 0, 2, 1
#> 23 23588, 234, 94, 40, 22, 7, 3, 8, 0, 3
#> 24 23605, 235, 77, 56, 15, 5, 1, 3, 0, 2
#> 25 18630, 5369
#> nonnum_hist_ticks nonnum_hist_counts
#> 1
#> 2
#> 3 female, male, Other 15078, 8921, 0
#> 4 university, graduate, Other 11360, 8442, 4197
#> 5 single, married, Other 12876, 10813, 310
#> 6
#> 7
#> 8
#> 9
#> 10
#> 11
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17
#> 18
#> 19
#> 20
#> 21
#> 22
#> 23
#> 24
#> 25
A few other generics work as usual on a DAIFrame: dim, head, and format.
dim(creditcard)
#> [1] 23999 25
head(creditcard, 10)
#> ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4
#> 1 1 20000 female university married 24 2 2 -1 -1
#> 2 2 120000 female university single 26 -1 2 0 0
#> 3 3 90000 female university single 34 0 0 0 0
#> 4 4 50000 female university married 37 0 0 0 0
#> 5 5 50000 male university married 57 -1 0 -1 0
#> 6 6 50000 male graduate single 37 0 0 0 0
#> 7 7 500000 male graduate single 29 0 0 0 0
#> 8 8 100000 female university single 23 0 -1 -1 0
#> 9 9 140000 female highschool married 28 0 0 2 0
#> 10 10 20000 male highschool single 35 -2 -2 -2 -2
#> PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
#> 1 -2 -2 3913 3102 689 0 0 0
#> 2 0 2 2682 1725 2682 3272 3455 3261
#> 3 0 0 29239 14027 13559 14331 14948 15549
#> 4 0 0 46990 48233 49291 28314 28959 29547
#> 5 0 0 8617 5670 35835 20940 19146 19131
#> 6 0 0 64400 57069 57608 19394 19619 20024
#> 7 0 0 367965 412023 445007 542653 483003 473944
#> 8 0 -1 11876 380 601 221 -159 567
#> 9 0 0 11285 14096 12108 12211 11793 3719
#> 10 -1 -1 0 0 0 0 13007 13912
#> PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
#> 1 0 689 0 0 0 0
#> 2 0 1000 1000 1000 0 2000
#> 3 1518 1500 1000 1000 1000 5000
#> 4 2000 2019 1200 1100 1069 1000
#> 5 2000 36681 10000 9000 689 679
#> 6 2500 1815 657 1000 1000 800
#> 7 55000 40000 38000 20239 13750 13770
#> 8 380 601 0 581 1687 1542
#> 9 3329 0 432 1000 1000 1000
#> 10 0 0 0 13007 1122 0
#> DEFAULT_PAYMENT_NEXT_MONTH
#> 1 TRUE
#> 2 TRUE
#> 3 FALSE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE
#> 7 FALSE
#> 8 FALSE
#> 9 FALSE
#> 10 FALSE
You cannot, however, use a DAIFrame to access all of its data, nor can you use it to modify the data. It only represents the dataset loaded into the DAI platform. The head function gives access only to example data:
creditcard$example_data[1:10, ]
#> ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4
#> 1 1 20000 female university married 24 2 2 -1 -1
#> 2 2 120000 female university single 26 -1 2 0 0
#> 3 3 90000 female university single 34 0 0 0 0
#> 4 4 50000 female university married 37 0 0 0 0
#> 5 5 50000 male university married 57 -1 0 -1 0
#> 6 6 50000 male graduate single 37 0 0 0 0
#> 7 7 500000 male graduate single 29 0 0 0 0
#> 8 8 100000 female university single 23 0 -1 -1 0
#> 9 9 140000 female highschool married 28 0 0 2 0
#> 10 10 20000 male highschool single 35 -2 -2 -2 -2
#> PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
#> 1 -2 -2 3913 3102 689 0 0 0
#> 2 0 2 2682 1725 2682 3272 3455 3261
#> 3 0 0 29239 14027 13559 14331 14948 15549
#> 4 0 0 46990 48233 49291 28314 28959 29547
#> 5 0 0 8617 5670 35835 20940 19146 19131
#> 6 0 0 64400 57069 57608 19394 19619 20024
#> 7 0 0 367965 412023 445007 542653 483003 473944
#> 8 0 -1 11876 380 601 221 -159 567
#> 9 0 0 11285 14096 12108 12211 11793 3719
#> 10 -1 -1 0 0 0 0 13007 13912
#> PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
#> 1 0 689 0 0 0 0
#> 2 0 1000 1000 1000 0 2000
#> 3 1518 1500 1000 1000 1000 5000
#> 4 2000 2019 1200 1100 1069 1000
#> 5 2000 36681 10000 9000 689 679
#> 6 2500 1815 657 1000 1000 800
#> 7 55000 40000 38000 20239 13750 13770
#> 8 380 601 0 581 1687 1542
#> 9 3329 0 432 1000 1000 1000
#> 10 0 0 0 13007 1122 0
#> DEFAULT_PAYMENT_NEXT_MONTH
#> 1 TRUE
#> 2 TRUE
#> 3 FALSE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE
#> 7 FALSE
#> 8 FALSE
#> 9 FALSE
#> 10 FALSE
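To make the idea of a client-side handle concrete, here is a minimal base R sketch of an S3 class whose generics answer from stored metadata and a small preview rather than from the full data. The class name ToyFrame, its fields, and its values are made up for illustration; this is not how the dai package implements DAIFrame.

```r
# Hypothetical stand-in for a server-side dataset handle.
# ToyFrame is made up for this sketch; it is not the dai package's DAIFrame class.
toy <- structure(
  list(
    key = "toy-key",
    rows = 23999L,
    columns = 25L,
    # Only a small preview is stored locally.
    example_data = data.frame(ID = 1:3, LIMIT_BAL = c(20000, 120000, 90000))
  ),
  class = "ToyFrame"
)

# S3 methods: each generic answers from the stored metadata or the preview.
dim.ToyFrame <- function(x) c(x$rows, x$columns)
head.ToyFrame <- function(x, n = 6L, ...) utils::head(x$example_data, n)
print.ToyFrame <- function(x, ...) {
  cat(sprintf("Toy frame '%s': %d obs. of %d variables\n", x$key, x$rows, x$columns))
  invisible(x)
}

print(toy)          # one-line description built from metadata
dim(toy)            # reported size of the remote data, not of the preview
nrow(head(toy, 2))  # head() can only ever return rows from the preview
```

In this sketch, dim() reports the size of the remote dataset while head() is limited to the locally stored preview, which mirrors the behavior described above.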
A dataset can be split into, for example, training and test sets directly from R:
creditcard.splits <- dai.split_dataset(creditcard,
                                       output_name1 = 'train',
                                       output_name2 = 'test',
                                       ratio = .8,
                                       seed = 25,
                                       progress = FALSE)
In this case, creditcard.splits is a list with two elements named “train” and “test”, where 80% of the data went into train and 20% went into test.
creditcard.splits$train
#> DAI frame '7cf3024c-5baa-11e9-a50b-b938de969cdb': 19199 obs. of 25 variables
#> File path: ./tmp/7cf3024c-5baa-11e9-a50b-b938de969cdb/train.1554912341.0864356.bin
creditcard.splits$test
#> DAI frame '7cf613a6-5baa-11e9-a50b-b938de969cdb': 4800 obs. of 25 variables
#> File path: ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin
By default, dai.split_dataset() takes a simple random sample, but stratified and time-based splits are also available. See the function’s documentation for more details.
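The row counts reported for the two splits (19199 and 4800) follow from applying the 0.8 ratio to 23999 rows. The base R sketch below reproduces that arithmetic with a simple random sample of row indices. It is an illustration only: the actual split runs on the DAI server, the rounding rule used here (floor) is an assumption that happens to match the counts shown, and the local seed does not reproduce the server-side sample.

```r
# Illustration only: a simple random 80/20 split of row indices in base R.
# Not the implementation of dai.split_dataset(); that split runs on the DAI server.
n <- 23999      # number of rows in the creditcard dataset
ratio <- 0.8

set.seed(25)    # local seed; does not reproduce the server-side split
train_idx <- sort(sample.int(n, size = floor(ratio * n)))  # rounding rule is an assumption
test_idx <- setdiff(seq_len(n), train_idx)

length(train_idx)
#> [1] 19199
length(test_idx)
#> [1] 4800
```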
Automated Feature Engineering and Model Tuning
One of the main strengths of Driverless AI is its fully automated feature engineering, along with hyperparameter tuning, model selection, and ensembling. The function dai.train() runs an experiment and returns a DAIModel instance that represents the resulting model.
model <- dai.train(training_frame = creditcard.splits$train,
                   testing_frame = creditcard.splits$test,
                   target_col = 'DEFAULT_PAYMENT_NEXT_MONTH',
                   is_classification = TRUE,
                   is_timeseries = FALSE,
                   accuracy = 1, time = 1, interpretability = 10,
                   seed = 25)
#>
|
| | 0%
|
|========================== | 40%
|
|=============================================== | 73%
|
|=========================================================== | 91%
|
|=================================================================| 100%
If you do not specify accuracy, time, or interpretability, the DAI platform suggests values for them. (See dai.suggest_model_params.)
Model Inspection
As with DAIFrame, generic methods such as print, format, summary, and predict work with DAIModel:
print(model)
#> Status: Complete
#> Experiment: 7e2b70ae-5baa-11e9-a50b-b938de969cdb, 2019-04-10 18:06, 1.7.0+local_0c7d019-dirty
#> Settings: 1/1/10, seed=25, GPUs enabled
#> Train data: train (19199, 25)
#> Validation data: N/A
#> Test data: test (4800, 24)
#> Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 2/2 GPUs
#> Max memory usage: 0.406 GB, 0.167 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#> Validation scheme: stratified, 1 internal holdout
#> Feature engineering: 33 features scored (18 selected)
#> Timing:
#> Data preparation: 4.94 secs
#> Model and feature tuning: 10.13 secs (3 models trained)
#> Feature evolution: 5.54 secs (1 of 3 model trained)
#> Final pipeline training: 7.85 secs (1 model trained)
#> Python / MOJO scorer building: 42.05 secs / 0.00 secs
#> Validation score: AUC = 0.77802 +/- 0.0077539 (baseline)
#> Validation score: AUC = 0.77802 +/- 0.0077539 (final pipeline)
#> Test score: AUC = 0.7861 +/- 0.0064711 (final pipeline)
summary(model)$score
#> [1] 0.7780229
Predicting on New Data
New data can be scored in two different ways:
Call predict() directly on the model in the R session.
Download a scoring pipeline and embed it into your Python or Java workflow.
Predicting in R
The generic predict() either returns an R data.frame with the results directly (the default) or returns a URL pointing to a CSV file with the results (return_df=FALSE). The latter option may be useful when you predict on a large dataset.
predictions <- predict(model, newdata = creditcard.splits$test)
#>
|
| | 0%
|
|=================================================================| 100%
#> Loading required package: bitops
head(predictions)
#> DEFAULT_PAYMENT_NEXT_MONTH.0 DEFAULT_PAYMENT_NEXT_MONTH.1
#> 1 0.8879988 0.11200116
#> 2 0.9289870 0.07101299
#> 3 0.9550328 0.04496716
#> 4 0.3513577 0.64864230
#> 5 0.9183724 0.08162758
#> 6 0.9154425 0.08455751
predict(model, newdata = creditcard.splits$test, return_df = FALSE)
#>
|
| | 0%
|
|=================================================================| 100%
#> [1] "h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/7e2b70ae-5baa-11e9-a50b-b938de969cdb_preds_f854b49f.csv"
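The data.frame returned by predict() holds one probability column per class. As a minimal sketch of a typical next step, the code below turns those probabilities into class labels. It re-types the six rows printed by head(predictions) above so that it runs without a DAI server, and the 0.5 cutoff is an assumption chosen for illustration, not a DAI default.

```r
# The six rows shown by head(predictions), typed in by hand for a standalone example.
preds <- data.frame(
  DEFAULT_PAYMENT_NEXT_MONTH.0 = c(0.8879988, 0.9289870, 0.9550328,
                                   0.3513577, 0.9183724, 0.9154425),
  DEFAULT_PAYMENT_NEXT_MONTH.1 = c(0.11200116, 0.07101299, 0.04496716,
                                   0.64864230, 0.08162758, 0.08455751)
)

threshold <- 0.5  # assumed cutoff; choose one that fits your costs of errors

# Flag a row as a predicted default when the probability of class 1 exceeds the cutoff.
predicted_default <- preds$DEFAULT_PAYMENT_NEXT_MONTH.1 > threshold
predicted_default
#> [1] FALSE FALSE FALSE  TRUE FALSE FALSE
```

Only the fourth row crosses the cutoff here. The two columns sum to 1 in each row (up to rounding in the printed values), so thresholding either column is equivalent.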
Downloading Python or MOJO Scoring Pipelines
To productionize your model in Python or Java, you can download the full Python or MOJO pipeline, respectively. For more information about how to use the pipelines, please see the documentation.
dai.download_mojo(model, path = tempdir(), force = TRUE)
#>
|
| | 0%
|
|=================================================================| 100%
#> Downloading the pipeline:
#> [1] "/tmp/RtmppsLTZ9/mojo-7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip"
dai.download_python_pipeline(model, path = tempdir(), force = TRUE)
#>
|
| | 0%
|
|=================================================================| 100%
#> Downloading the pipeline:
#> [1] "/tmp/RtmppsLTZ9/python-pipeline-7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip"
Managing the Datasets and Models
After some time, you may have multiple datasets and models on your DAI server. The dai package offers a few utility functions to find, reuse, and remove the existing datasets and models.
If you already have the dataset loaded into DAI, you can get the DAIFrame object with either dai.get_frame (if you know the frame’s key) or dai.find_dataset (if you know the original path or at least a part of it):
dai.get_frame(creditcard$key)
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
dai.find_dataset('creditcard')
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
The latter returns the frame directly if there is only one match. Otherwise, it lets you select which frame to return from all matching candidates.
Furthermore, you can get a list of datasets or models:
datasets <- dai.list_datasets()
head(datasets)
#> key name
#> 1 7cf613a6-5baa-11e9-a50b-b938de969cdb test
#> 2 7cf3024c-5baa-11e9-a50b-b938de969cdb train
#> 3 7c38cb84-5baa-11e9-a50b-b938de969cdb iris9e1f15d2df00.csv
#> 4 7abe28b2-5baa-11e9-a50b-b938de969cdb creditcard_train_cat.csv
#> file_path
#> 1 ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin
#> 2 ./tmp/7cf3024c-5baa-11e9-a50b-b938de969cdb/train.1554912341.0864356.bin
#> 3 ./tmp/7c38cb84-5baa-11e9-a50b-b938de969cdb/iris9e1f15d2df00.csv.1554912339.9424415.bin
#> 4 tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
#> file_size data_source row_count column_count import_status import_error
#> 1 567584 upload 4800 25 0
#> 2 2265952 upload 19199 25 0
#> 3 7064 upload 150 5 0
#> 4 2832040 file 23999 25 0
#> aggregation_status aggregation_error aggregated_frame mapping_frame
#> 1 -1
#> 2 -1
#> 3 -1
#> 4 -1
#> uploaded
#> 1 TRUE
#> 2 TRUE
#> 3 TRUE
#> 4 FALSE
models <- dai.list_models()
head(models)
#> key description
#> 1 7e2b70ae-5baa-11e9-a50b-b938de969cdb mupulori
#> dataset_name parameters.dataset_key
#> 1 train.1554912341.0864356.bin 7cf3024c-5baa-11e9-a50b-b938de969cdb
#> parameters.resumed_model_key parameters.target_col
#> 1 DEFAULT_PAYMENT_NEXT_MONTH
#> parameters.weight_col parameters.fold_col parameters.orig_time_col
#> 1
#> parameters.time_col parameters.is_classification parameters.cols_to_drop
#> 1 [OFF] TRUE NULL
#> parameters.validset_key parameters.testset_key
#> 1 7cf613a6-5baa-11e9-a50b-b938de969cdb
#> parameters.enable_gpus parameters.seed parameters.accuracy
#> 1 TRUE 25 1
#> parameters.time parameters.interpretability parameters.scorer
#> 1 1 10 AUC
#> parameters.time_groups_columns parameters.time_period_in_seconds
#> 1 NULL NA
#> parameters.num_prediction_periods parameters.num_gap_periods
#> 1 NA NA
#> parameters.is_timeseries parameters.config_overrides
#> 1 FALSE NA
#> log_file_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/h2oai_experiment_logs_7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip
#> pickle_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/best_individual.pickle
#> summary_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/h2oai_experiment_summary_7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip
#> train_predictions_path valid_predictions_path
#> 1
#> test_predictions_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/test_preds.csv
#> progress status training_duration scorer score test_score deprecated
#> 1 1 0 71.43582 AUC 0.7780229 0.7861 FALSE
#> model_file_size diagnostic_keys
#> 1 695996094 NULL
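The listings printed above look like ordinary data.frames, so standard base R subsetting should apply to them. The sketch below assumes that is the case and rebuilds a cut-down copy of the dataset listing by hand (four of the printed columns only) so that it runs without a DAI server. The variable names listing, uploaded, and largest_key are made up for this example.

```r
# Hand-typed subset of the dai.list_datasets() output shown above.
listing <- data.frame(
  key = c("7cf613a6-5baa-11e9-a50b-b938de969cdb",
          "7cf3024c-5baa-11e9-a50b-b938de969cdb",
          "7c38cb84-5baa-11e9-a50b-b938de969cdb",
          "7abe28b2-5baa-11e9-a50b-b938de969cdb"),
  name = c("test", "train", "iris9e1f15d2df00.csv", "creditcard_train_cat.csv"),
  data_source = c("upload", "upload", "upload", "file"),
  row_count = c(4800, 19199, 150, 23999),
  stringsAsFactors = FALSE
)

# Keep only datasets that were uploaded rather than read from a server-side file.
uploaded <- listing[listing$data_source == "upload", ]
nrow(uploaded)
#> [1] 3

# Key of the largest dataset, e.g. to pass on to dai.get_frame().
largest_key <- listing$key[which.max(listing$row_count)]
largest_key
#> [1] "7abe28b2-5baa-11e9-a50b-b938de969cdb"
```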
If you know the key of the dataset or model, you can obtain the corresponding DAIFrame or DAIModel instance with dai.get_frame or dai.get_model, respectively:
dai.get_model(models$key[1])
#> Status: Complete
#> Experiment: 7e2b70ae-5baa-11e9-a50b-b938de969cdb, 2019-04-10 18:06, 1.7.0+local_0c7d019-dirty
#> Settings: 1/1/10, seed=25, GPUs enabled
#> Train data: train (19199, 25)
#> Validation data: N/A
#> Test data: test (4800, 24)
#> Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 2/2 GPUs
#> Max memory usage: 0.406 GB, 0.167 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#> Validation scheme: stratified, 1 internal holdout
#> Feature engineering: 33 features scored (18 selected)
#> Timing:
#> Data preparation: 4.94 secs
#> Model and feature tuning: 10.13 secs (3 models trained)
#> Feature evolution: 5.54 secs (1 of 3 model trained)
#> Final pipeline training: 7.85 secs (1 model trained)
#> Python / MOJO scorer building: 42.05 secs / 0.00 secs
#> Validation score: AUC = 0.77802 +/- 0.0077539 (baseline)
#> Validation score: AUC = 0.77802 +/- 0.0077539 (final pipeline)
#> Test score: AUC = 0.7861 +/- 0.0064711 (final pipeline)
dai.get_frame(datasets$key[1])
#> DAI frame '7cf613a6-5baa-11e9-a50b-b938de969cdb': 4800 obs. of 25 variables
#> File path: ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin
Finally, datasets and models can be removed with dai.rm:
dai.rm(model, creditcard, creditcard.splits$train, creditcard.splits$test)
#> Model 7e2b70ae-5baa-11e9-a50b-b938de969cdb removed
#> Dataset 7abe28b2-5baa-11e9-a50b-b938de969cdb removed
#> Dataset 7cf3024c-5baa-11e9-a50b-b938de969cdb removed
#> Dataset 7cf613a6-5baa-11e9-a50b-b938de969cdb removed
By default, the function dai.rm deletes the objects from both the server and the R session. If you wish to remove them only from the server, set from_session=FALSE. Please note that only objects passed by name can be removed from the session: in the example above, creditcard.splits$train and creditcard.splits$test will not be removed from the R session because they are actually function calls (recall that $ is a function).
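The point about $ can be checked in plain R. The sketch below uses a made-up list named splits and base R's rm() as an analogy. How dai.rm itself removes objects from the session is not shown in this tutorial, so treat the rm() part as an illustration of the general limitation, not of the package's internals.

```r
# A made-up list standing in for creditcard.splits.
splits <- list(train = "train frame", test = "test frame")

# splits$train is just another way of writing a call to the function `$`.
same <- identical(splits$train, `$`(splits, train))
same
#> [1] TRUE

# Base R's rm() accepts only names or character strings, so an expression
# such as splits$train is rejected with an error.
outcome <- tryCatch(
  {
    rm(splits$train)
    "removed"
  },
  error = function(e) "rm() rejected the expression"
)
outcome
#> [1] "rm() rejected the expression"

# Removing the whole list by its name works as usual.
rm(splits)
exists("splits", inherits = FALSE)
#> [1] FALSE
```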