Getting Started¶
Connecting to a Server¶
First we’ll initialize a client with our server credentials and store it in the variable dai.
[1]:
import driverlessai
dai = driverlessai.Client(address='http://localhost:12345', username='py', password='py')
Loading Data¶
Here we import the file iris.csv from S3 to the Driverless AI server and name the dataset ‘iris-getting-started’.
[2]:
ds = dai.datasets.create(
    data='s3://h2o-public-test-data/smalldata/iris/iris.csv',
    data_source='s3',
    name='iris-getting-started'
)
Complete 100.00% - [4/4] Computing column statistics
This creates a dataset object which we store in the variable ds. Dataset objects give you attributes and methods for interacting with the corresponding dataset on the Driverless AI server.
[3]:
print(ds.name, "|", ds.key)
print("Columns:", ds.columns)
print('Shape:', ds.shape)
print("Head:")
print(ds.head())
print("Tail:")
print(ds.tail())
print("Summary:")
print(ds.column_summaries()[0:2])
iris-getting-started | 98c0da08-a947-11ea-9bb7-0242ac110002
Columns: ['C1', 'C2', 'C3', 'C4', 'C5']
Shape: (150, 5)
Head:
C1 | C2 | C3 | C4 | C5
------+------+------+------+-------------
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa
4.9 | 3 | 1.4 | 0.2 | Iris-setosa
4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa
4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa
5 | 3.6 | 1.4 | 0.2 | Iris-setosa
Tail:
C1 | C2 | C3 | C4 | C5
------+------+------+------+----------------
6.7 | 3 | 5.2 | 2.3 | Iris-virginica
6.3 | 2.5 | 5 | 1.9 | Iris-virginica
6.5 | 3 | 5.2 | 2 | Iris-virginica
6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica
5.9 | 3 | 5.1 | 1.8 | Iris-virginica
Summary:
--- C1 ---
4.3|███████
|█████████████████
|██████████
|████████████████████
|████████████
|███████████████████
|█████████████
|████
|████
7.9|████
Data Type: real
Logical Types: []
Datetime Format:
Count: 150
Missing: 0
Mean: 5.84
SD: 0.828
Min: 4.3
Max: 7.9
Unique: 35
Freq: 10
--- C2 ---
2|██
|████
|████████████
|█████████████
|████████████████████
|████████████████
|█████
|██████
|█
4.4|█
Data Type: real
Logical Types: []
Datetime Format:
Count: 150
Missing: 0
Mean: 3.05
SD: 0.434
Min: 2
Max: 4.4
Unique: 23
Freq: 26
We can also list all datasets on the server that are available to the user. Here we see the ‘iris-getting-started’ dataset we just created.
[4]:
dai.datasets.list()
[4]:
[<class 'driverlessai._datasets.Dataset'> 98c0da08-a947-11ea-9bb7-0242ac110002 iris-getting-started]
Next, we’ll split the data into train and test sets on the Driverless AI server.
[5]:
ds_split = ds.split_to_train_test(train_size=0.7)
Complete
This gives us a dictionary with the keys train_dataset and test_dataset. This dictionary can be unpacked directly when running experiments, which we’ll see in a bit.
[6]:
print(ds_split)
{'train_dataset': <class 'driverlessai._datasets.Dataset'> 9a18a0a2-a947-11ea-9bb7-0242ac110002 riparedi, 'test_dataset': <class 'driverlessai._datasets.Dataset'> 9a1b8bd2-a947-11ea-9bb7-0242ac110002 vasusise}
We can also see our train and test datasets on the server.
[7]:
dai.datasets.list()
[7]:
[<class 'driverlessai._datasets.Dataset'> 9a1b8bd2-a947-11ea-9bb7-0242ac110002 vasusise,
<class 'driverlessai._datasets.Dataset'> 9a18a0a2-a947-11ea-9bb7-0242ac110002 riparedi,
<class 'driverlessai._datasets.Dataset'> 98c0da08-a947-11ea-9bb7-0242ac110002 iris-getting-started]
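Since dai.datasets.list() returns ordinary Python objects with key and name attributes (as printed above), the list can be filtered with a plain comprehension. A minimal sketch, using stand-in objects rather than a live server connection:

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """Stand-in for driverlessai._datasets.Dataset (key and name only)."""
    key: str
    name: str

# Shortened keys here are illustrative, not real server output.
datasets = [
    Dataset('9a1b8bd2', 'vasusise'),
    Dataset('9a18a0a2', 'riparedi'),
    Dataset('98c0da08', 'iris-getting-started'),
]

# Find a dataset by name, as you might after dai.datasets.list().
iris = [d for d in datasets if d.name == 'iris-getting-started'][0]
print(iris.key)  # prints 98c0da08
```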
Running an Experiment¶
Let’s define the experiment settings we want to use in a dictionary. Using a dictionary allows us to define our settings once and reuse them in different functions.
We need to specify whether the problem is classification or regression. Usually the goal for the iris dataset is to predict the class of the iris plant, so we’ll specify the task as 'classification'. We also need to specify the target column, which conveniently is the last column in this dataset. Optionally, we’ll set accuracy and time to 1 to get a quick baseline.
[8]:
settings = {
    'task': 'classification',
    'target_column': ds.columns[-1],
    'accuracy': 1,
    'time': 1
}
If we unpack our dataset and settings dictionaries as arguments to dai.experiments.preview(), we can get a look at how an experiment would behave.
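The ** unpacking itself is plain Python: each key in the dictionary becomes a keyword argument. A minimal sketch with an ordinary function (the function and values below are stand-ins, not part of the driverlessai API):

```python
def preview(train_dataset, test_dataset, task, target_column, accuracy, time):
    # Receives one keyword argument per dictionary key.
    return f"{task} on {target_column} (accuracy={accuracy}, time={time})"

ds_split = {'train_dataset': 'train', 'test_dataset': 'test'}
settings = {'task': 'classification', 'target_column': 'C5',
            'accuracy': 1, 'time': 1}

# **ds_split, **settings expands both dictionaries into keyword arguments,
# just as in dai.experiments.preview(**ds_split, **settings).
print(preview(**ds_split, **settings))
# prints: classification on C5 (accuracy=1, time=1)
```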
[9]:
dai.experiments.preview(**ds_split, **settings)
ACCURACY [1/10]:
- Training data size: *105 rows, 5 cols*
- Feature evolution: *[Constant, DecisionTree, GLM, LightGBM, XGBoostGBM]*, *1/4 validation split*
- Final pipeline: *CV+single model (9 models), 4-fold CV*
TIME [1/10]:
- Feature evolution: *2 individuals*, up to *3 iterations*
- Early stopping: disabled
INTERPRETABILITY [8/10]:
- Feature pre-pruning strategy: None
- Monotonicity constraints: enabled
- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, CatOriginal, Cat, Frequent, Interactions, NumCatTE, OneHotEncoding, Original, Text]
[Constant, DecisionTree, GLM, LightGBM, XGBoostGBM] models to train:
- Model and feature tuning: *2*
- Feature evolution: *3*
- Final pipeline: *9*
Estimated runtime: *minutes*
Auto-click Finish/Abort if not done in: *1 day*/*7 days*
Assuming we’re happy with the experiment preview, we can now run an experiment with dai.experiments.create(). This function waits for the experiment to complete, then returns an experiment object that can be used to get information and artifacts from the completed experiment.
[10]:
ex = dai.experiments.create(**ds_split, **settings, name='iris-getting-started')
Experiment launched at: http://localhost:12345/#experiment?key=9d0042ca-a947-11ea-9bb7-0242ac110002
Complete 100.00% - Status: Complete
Here we get a summary of the experiment results.
[11]:
ex.summary()
Status: Complete
Experiment: iris-getting-started (9d0042ca-a947-11ea-9bb7-0242ac110002)
Version: 1.8.6, 2020-06-08 05:21
Settings: 1/1/8, seed=970309338, GPUs disabled
Train data: riparedi (105, 5)
Validation data: N/A
Test data: vasusise (45, 4)
Target column: C5 (3-class)
System specs: Docker/Darwin, 10 GB, 6 CPU cores, 0/0 GPU
Max memory usage: 0.492 GB, 0 GB GPU
Recipe: AutoDL (6 iterations, 2 individuals)
Validation scheme: stratified, 1 internal holdout
Feature engineering: 19 features scored (4 selected)
Timing: MOJO latency: 0.04136 millis (2.2KB)
Data preparation: 7.74 secs
Shift/Leakage detection: 1.88 secs
Model and feature tuning: 23.90 secs (7 models trained)
Feature evolution: 1.45 secs (0 of 3 model trained)
Final pipeline training: 32.98 secs (9 models trained)
Python / MOJO scorer building: 57.61 secs / 2.67 secs
Validation score: LOGLOSS = 1.076322 (constant preds)
Validation score: LOGLOSS = 0.3246407 +/- 0.06908537 (baseline)
Validation score: LOGLOSS = 0.1103104 +/- 0.02503069 (final pipeline)
Test score: LOGLOSS = 0.05808781 +/- 0.02410646 (final pipeline)
Variable Importance:
1.00 | 2_C3 | C3 (Orig)
0.77 | 3_C4 | C4 (Orig)
0.29 | 1_C2 | C2 (Orig)
0.21 | 0_C1 | C1 (Orig)
Here we download all the artifacts created by the experiment. Artifacts include things like scoring pipelines, auto-documentation, prediction CSVs, and logs. The function ex.artifacts.download() also returns a dictionary of paths to the artifacts it downloads.
[12]:
artifacts = ex.artifacts.download(overwrite=True)
Downloaded './h2oai_experiment_logs_9d0042ca-a947-11ea-9bb7-0242ac110002.zip'
Downloaded './mojo.zip'
Downloaded './scorer.zip'
Downloaded './h2oai_experiment_summary_9d0042ca-a947-11ea-9bb7-0242ac110002.zip'
Downloaded './test_preds.csv'
Downloaded './train_preds.csv'
Using the dictionary of artifact paths, we’ll read the test set predictions CSV into a pandas DataFrame.
[13]:
import pandas as pd
pd.read_csv(artifacts['test_predictions']).head()
[13]:
   | C5.Iris-setosa | C5.Iris-versicolor | C5.Iris-virginica
---+----------------+--------------------+-------------------
 0 | 9.941657e-01   | 0.005834           | 9.627899e-13
 1 | 9.884536e-01   | 0.011546           | 1.065597e-13
 2 | 1.878662e-03   | 0.995781           | 2.339911e-03
 3 | 9.963759e-01   | 0.003624           | 4.196660e-15
 4 | 5.640818e-09   | 0.001772           | 9.982275e-01
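Each row of this CSV holds one probability per class, so the predicted class is simply the column with the largest value. A sketch with pandas, using a small hand-built frame in place of the downloaded file:

```python
import pandas as pd

# A few rows in the same shape as test_preds.csv (values are illustrative).
preds = pd.DataFrame({
    'C5.Iris-setosa':     [0.994, 0.002, 0.000],
    'C5.Iris-versicolor': [0.006, 0.996, 0.002],
    'C5.Iris-virginica':  [0.000, 0.002, 0.998],
})

# idxmax(axis=1) picks the column name with the highest probability per row;
# stripping the 'C5.' prefix recovers the class label.
labels = preds.idxmax(axis=1).str.replace('C5.', '', regex=False)
print(labels.tolist())
# prints: ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```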