Getting Started

Connecting to a Server

First we’ll initialize a client with our server credentials and store it in the variable dai.

[1]:
import driverlessai

dai = driverlessai.Client(address='http://localhost:12345', username='py', password='py')
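
Optionally, we can confirm the connection before going further. The server attribute used below is an assumption based on recent versions of the client, which expose basic server information this way.

# Sanity check: print basic info about the server we connected to.
# `dai.server` and its attributes are assumed from recent client
# versions -- verify against your installed driverlessai release.
print(dai.server.address)
print(dai.server.version)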

Loading Data

Here we import the file iris.csv from S3 to the Driverless AI server and name the dataset ‘iris-getting-started’.

[2]:
ds = dai.datasets.create(
    data='s3://h2o-public-test-data/smalldata/iris/iris.csv',
    data_source='s3',
    name='iris-getting-started'
)
Complete 100.00% - [4/4] Computing column statistics

This creates a dataset object which we store in the variable ds. Dataset objects give you attributes and methods for interacting with the corresponding dataset on the Driverless AI server.

[3]:
print(ds.name, "|", ds.key)
print("Columns:", ds.columns)
print('Shape:', ds.shape)
print("Head:")
print(ds.head())
print("Tail:")
print(ds.tail())
print("Summary:")
print(ds.column_summaries()[0:2])
iris-getting-started | 98c0da08-a947-11ea-9bb7-0242ac110002
Columns: ['C1', 'C2', 'C3', 'C4', 'C5']
Shape: (150, 5)
Head:
   C1 |   C2 |   C3 |   C4 | C5
------+------+------+------+-------------
  5.1 |  3.5 |  1.4 |  0.2 | Iris-setosa
  4.9 |  3   |  1.4 |  0.2 | Iris-setosa
  4.7 |  3.2 |  1.3 |  0.2 | Iris-setosa
  4.6 |  3.1 |  1.5 |  0.2 | Iris-setosa
  5   |  3.6 |  1.4 |  0.2 | Iris-setosa
Tail:
   C1 |   C2 |   C3 |   C4 | C5
------+------+------+------+----------------
  6.7 |  3   |  5.2 |  2.3 | Iris-virginica
  6.3 |  2.5 |  5   |  1.9 | Iris-virginica
  6.5 |  3   |  5.2 |  2   | Iris-virginica
  6.2 |  3.4 |  5.4 |  2.3 | Iris-virginica
  5.9 |  3   |  5.1 |  1.8 | Iris-virginica
Summary:
--- C1 ---

 4.3|███████
    |█████████████████
    |██████████
    |████████████████████
    |████████████
    |███████████████████
    |█████████████
    |████
    |████
 7.9|████

Data Type: real
Logical Types: []
Datetime Format:
Count: 150
Missing: 0
Mean: 5.84
SD: 0.828
Min: 4.3
Max: 7.9
Unique: 35
Freq: 10
--- C2 ---

   2|██
    |████
    |████████████
    |█████████████
    |████████████████████
    |████████████████
    |█████
    |██████
    |█
 4.4|█

Data Type: real
Logical Types: []
Datetime Format:
Count: 150
Missing: 0
Mean: 3.05
SD: 0.434
Min: 2
Max: 4.4
Unique: 23
Freq: 26
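
Dataset objects offer more than the inspection helpers used above. For example, we can pull the raw file back down from the server. This is a sketch: the download() signature, its dst_dir parameter, and its return value are assumptions based on recent client versions.

# Download the dataset file from the server into the local working
# directory. `dst_dir` and the returned local path are assumed --
# check the signature in your client version.
path = ds.download(dst_dir='.')
print(path)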

We can also list all datasets on the server that are available to the user. Here we see the ‘iris-getting-started’ dataset we just created.

[4]:
dai.datasets.list()
[4]:
[<class 'driverlessai._datasets.Dataset'> 98c0da08-a947-11ea-9bb7-0242ac110002 iris-getting-started]

Next, we’ll split the data into train and test sets on the Driverless AI server.

[5]:
ds_split = ds.split_to_train_test(train_size=0.7)
Complete

This gives us a dictionary with the keys train_dataset and test_dataset, which can be unpacked directly when running experiments, as we’ll see shortly.

[6]:
print(ds_split)
{'train_dataset': <class 'driverlessai._datasets.Dataset'> 9a18a0a2-a947-11ea-9bb7-0242ac110002 riparedi, 'test_dataset': <class 'driverlessai._datasets.Dataset'> 9a1b8bd2-a947-11ea-9bb7-0242ac110002 vasusise}
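
The dictionary values are ordinary dataset objects, so we can also work with them individually:

# Unpack the split manually; each value supports the same attributes
# and methods as any other dataset object (name, shape, head, etc.).
train_ds = ds_split['train_dataset']
test_ds = ds_split['test_dataset']
print(train_ds.shape, test_ds.shape)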

We can also see our train and test datasets on the server.

[7]:
dai.datasets.list()
[7]:
[<class 'driverlessai._datasets.Dataset'> 9a1b8bd2-a947-11ea-9bb7-0242ac110002 vasusise,
 <class 'driverlessai._datasets.Dataset'> 9a18a0a2-a947-11ea-9bb7-0242ac110002 riparedi,
 <class 'driverlessai._datasets.Dataset'> 98c0da08-a947-11ea-9bb7-0242ac110002 iris-getting-started]
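
Note that the split datasets received auto-generated names (riparedi and vasusise above). As a sketch, recent client versions let you name the splits at creation time; the train_name and test_name parameters below are assumptions, so verify them against your version’s signature.

# Hypothetical: name the splits at creation time. `train_name` and
# `test_name` are assumed parameters -- confirm before relying on them.
named_split = ds.split_to_train_test(
    train_size=0.7,
    train_name='iris-train',
    test_name='iris-test',
)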

Running an Experiment

Let’s define the experiment settings we want to use in a dictionary. This lets us define the settings once and reuse them across functions.

We need to specify whether the problem is classification or regression. The usual goal for the iris dataset is to predict the class of iris plant, so we’ll set the task to 'classification'. We also need to specify the target column, which conveniently is the last column in this dataset.

Optionally, we’ll set accuracy and time to 1 to get a quick baseline.

[8]:
settings = {
    'task': 'classification',
    'target_column': ds.columns[-1],
    'accuracy': 1,
    'time': 1
}
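
The same dictionary can hold other experiment dials. For instance, the interpretability dial (shown as INTERPRETABILITY [8/10] in the preview below) could be pinned explicitly; the 'interpretability' key here is an assumption based on the standard accuracy/time/interpretability triad, so verify it is accepted by your client version.

# Sketch: extend the baseline settings with an explicit interpretability
# dial. The 'interpretability' key is assumed -- verify it against your
# client version before using it.
settings_explicit = dict(settings, interpretability=8)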

If we unpack our dataset and settings dictionaries as arguments to dai.experiments.preview(), we can see how an experiment with these settings would behave.

[9]:
dai.experiments.preview(**ds_split, **settings)
ACCURACY [1/10]:
- Training data size: *105 rows, 5 cols*
- Feature evolution: *[Constant, DecisionTree, GLM, LightGBM, XGBoostGBM]*, *1/4 validation split*
- Final pipeline: *CV+single model (9 models), 4-fold CV*

TIME [1/10]:
- Feature evolution: *2 individuals*, up to *3 iterations*
- Early stopping: disabled

INTERPRETABILITY [8/10]:
- Feature pre-pruning strategy: None
- Monotonicity constraints: enabled
- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, CatOriginal, Cat, Frequent, Interactions, NumCatTE, OneHotEncoding, Original, Text]

[Constant, DecisionTree, GLM, LightGBM, XGBoostGBM] models to train:
- Model and feature tuning: *2*
- Feature evolution: *3*
- Final pipeline: *9*

Estimated runtime: *minutes*
Auto-click Finish/Abort if not done in: *1 day*/*7 days*

Assuming we’re happy with the experiment preview, we can now run the experiment with dai.experiments.create(). This function waits for the experiment to complete, then returns an experiment object that can be used to retrieve information and artifacts from the completed experiment.

[10]:
ex = dai.experiments.create(**ds_split, **settings, name='iris-getting-started')
Experiment launched at: http://localhost:12345/#experiment?key=9d0042ca-a947-11ea-9bb7-0242ac110002
Complete 100.00% - Status: Complete
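
If we’d rather not block while the experiment runs, the client also offers a non-blocking launch. The create_async() method below is an assumption based on recent client versions; it is expected to return the experiment object immediately so we can keep working and watch progress in the web UI.

# Hypothetical non-blocking launch -- verify that `create_async` exists
# in your client version. Progress can be watched at the URL printed
# on launch.
ex_async = dai.experiments.create_async(**ds_split, **settings, name='iris-async')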

Here we get a summary of the experiment results.

[11]:
ex.summary()
Status: Complete
Experiment: iris-getting-started (9d0042ca-a947-11ea-9bb7-0242ac110002)
  Version: 1.8.6, 2020-06-08 05:21
  Settings: 1/1/8, seed=970309338, GPUs disabled
  Train data: riparedi (105, 5)
  Validation data: N/A
  Test data: vasusise (45, 4)
  Target column: C5 (3-class)
System specs: Docker/Darwin, 10 GB, 6 CPU cores, 0/0 GPU
  Max memory usage: 0.492 GB, 0 GB GPU
Recipe: AutoDL (6 iterations, 2 individuals)
  Validation scheme: stratified, 1 internal holdout
  Feature engineering: 19 features scored (4 selected)
Timing: MOJO latency: 0.04136 millis (2.2KB)
  Data preparation: 7.74 secs
  Shift/Leakage detection: 1.88 secs
  Model and feature tuning: 23.90 secs (7 models trained)
  Feature evolution: 1.45 secs (0 of 3 model trained)
  Final pipeline training: 32.98 secs (9 models trained)
  Python / MOJO scorer building: 57.61 secs / 2.67 secs
Validation score: LOGLOSS = 1.076322 (constant preds)
Validation score: LOGLOSS = 0.3246407 +/- 0.06908537 (baseline)
Validation score: LOGLOSS = 0.1103104 +/- 0.02503069 (final pipeline)
Test score:       LOGLOSS = 0.05808781 +/- 0.02410646 (final pipeline)
Variable Importance:
  1.00 | 2_C3 | C3 (Orig)
  0.77 | 3_C4 | C4 (Orig)
  0.29 | 1_C2 | C2 (Orig)
  0.21 | 0_C1 | C1 (Orig)
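
The experiment object can also score datasets directly on the server. The predict() call below, returning a downloadable prediction object, is an assumption based on recent client versions; check your version before relying on it.

# Assumed API: score the test split on the server, then download the
# resulting CSV of per-class probabilities to the working directory.
preds = ex.predict(ds_split['test_dataset'])
preds_path = preds.download(dst_dir='.')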

Here we download all the artifacts created by the experiment. Artifacts include scoring pipelines, auto-documentation, prediction CSVs, and logs. The function ex.artifacts.download() also returns a dictionary of paths to the artifacts it downloads.

[12]:
artifacts = ex.artifacts.download(overwrite=True)
Downloaded './h2oai_experiment_logs_9d0042ca-a947-11ea-9bb7-0242ac110002.zip'
Downloaded './mojo.zip'
Downloaded './scorer.zip'
Downloaded './h2oai_experiment_summary_9d0042ca-a947-11ea-9bb7-0242ac110002.zip'
Downloaded './test_preds.csv'
Downloaded './train_preds.csv'

Using the dictionary of artifact paths, we’ll read the test set predictions CSV into a pandas DataFrame.

[13]:
import pandas as pd

pd.read_csv(artifacts['test_predictions']).head()
[13]:
   C5.Iris-setosa  C5.Iris-versicolor  C5.Iris-virginica
0    9.941657e-01            0.005834       9.627899e-13
1    9.884536e-01            0.011546       1.065597e-13
2    1.878662e-03            0.995781       2.339911e-03
3    9.963759e-01            0.003624       4.196660e-15
4    5.640818e-09            0.001772       9.982275e-01
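
Since each row holds per-class probabilities, a hard class label is simply the column with the highest probability:

# Convert per-class probabilities to predicted labels: take the argmax
# across the probability columns and strip the 'C5.' prefix.
preds_df = pd.read_csv(artifacts['test_predictions'])
predicted = preds_df.idxmax(axis=1).str.replace('C5.', '', regex=False)
print(predicted.head())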