Driverless AI NLP Demo - Airline Sentiment Dataset

In this notebook, we use the Driverless AI Python client to build text classification models on the Twitter airline sentiment dataset.

Import the necessary Python modules, including the Driverless AI client. If the client is not already installed, download it from the Driverless AI GUI and install it.

In [1]:

import h2oai_client
import numpy as np
import pandas as pd
from sklearn import model_selection
from h2oai_client import Client

The code below downloads the Twitter airline sentiment dataset and saves it in the current folder.

In [2]:

! wget https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv

--2018-09-11 12:12:42--  https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv
Resolving www.figure-eight.com (www.figure-eight.com)... 52.0.208.137, 52.3.39.167
Connecting to www.figure-eight.com (www.figure-eight.com)|52.0.208.137|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3704908 (3.5M) [application/octet-stream]
Saving to: ‘Airline-Sentiment-2-w-AA.csv’

Airline-Sentiment-2 100%[===================>]   3.53M   660KB/s    in 5.5s

2018-09-11 12:12:50 (660 KB/s) - ‘Airline-Sentiment-2-w-AA.csv’ saved [3704908/3704908]

We can now split the dataset into train and test files for model building.

In [2]:

al = pd.read_csv("Airline-Sentiment-2-w-AA.csv", encoding='ISO-8859-1')
train_al, test_al = model_selection.train_test_split(al, test_size=0.2, random_state=2018)
train_al.to_csv("train_airline_sentiment.csv", index=False)
test_al.to_csv("test_airline_sentiment.csv", index=False)
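Before modeling, it can be worth checking that the sentiment classes are similarly distributed in both splits. The sketch below uses only the standard library; label_shares is a hypothetical helper and the label lists are illustrative, not taken from the dataset. In the notebook the inputs would be the airline_sentiment columns of train_al and test_al.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total, rounded to 3 decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(n / total, 3) for label, n in counts.items()}


# Illustrative labels only; in the notebook these would come from
# train_al["airline_sentiment"] and test_al["airline_sentiment"].
train_labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
test_labels = ["negative"] * 3 + ["neutral"] * 1 + ["positive"] * 1

train_shares = label_shares(train_labels)
test_shares = label_shares(test_labels)
print(train_shares)
print(test_shares)
```

If the shares differ noticeably between the two splits, passing stratify=al["airline_sentiment"] to train_test_split is one way to keep them aligned.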

The first step is to establish a connection to Driverless AI using Client. Enter your own credentials and the server URL.

In [3]:

h2o = Client(address='http://localhost:12345', username='h2oai', password='h2oai')
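To avoid hard-coding credentials in a shared notebook, the connection settings could be read from environment variables. This is a minimal sketch: connection_settings and the variable names DAI_ADDRESS, DAI_USERNAME and DAI_PASSWORD are assumptions made for illustration, not part of the Driverless AI client. The defaults are the values used in the cell above.

```python
import os


def connection_settings(env=None):
    """Build Client keyword arguments from environment variables.

    The variable names are hypothetical; the fallbacks are the values
    used in this notebook.
    """
    env = os.environ if env is None else env
    return {
        "address": env.get("DAI_ADDRESS", "http://localhost:12345"),
        "username": env.get("DAI_USERNAME", "h2oai"),
        "password": env.get("DAI_PASSWORD", "h2oai"),
    }


# A plain dict stands in for os.environ so the example is reproducible.
settings = connection_settings({"DAI_ADDRESS": "http://dai.example.com:12345"})
print(settings["address"])

# In the notebook, the settings would be unpacked into the client:
# h2o = Client(**connection_settings())
```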

Read the train and test files into Driverless AI using the create_dataset_sync command.

In [5]:

train_path = './train_airline_sentiment.csv'
test_path = './test_airline_sentiment.csv'

train = h2o.create_dataset_sync(train_path)
test = h2o.create_dataset_sync(test_path)

Now let us look at some basic information about the dataset, starting with the number of columns and rows.

In [6]:

print('Train Dataset: ', len(train.columns), 'x', train.row_count)
print('Test Dataset: ', len(test.columns), 'x', test.row_count)

Train Dataset:  20 x 11712
Test Dataset:  20 x 2928
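The row counts above add up to 14,640 and match the test_size=0.2 split made earlier. The sketch below reproduces that arithmetic. expected_split_sizes is a hypothetical helper, and it assumes the test count is the fraction of rows rounded up, which is how train_test_split is understood to treat a float test_size.

```python
import math


def expected_split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up and the rest goes to train.
    """
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


# Total rows reported by Driverless AI for the two uploaded files.
total_rows = 11712 + 2928
n_train, n_test = expected_split_sizes(total_rows, 0.2)
print(n_train, n_test)
```

A mismatch here would suggest that rows were lost or altered during the CSV export or the upload.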

Next, get the names of the columns in the training set.

In [7]:

[c.name for c in train.columns]

Out[7]:

['_unit_id',
 '_golden',
 '_unit_state',
 '_trusted_judgments',
 '_last_judgment_at',
 'airline_sentiment',
 'airline_sentiment:confidence',
 'negativereason',
 'negativereason:confidence',
 'airline',
 'airline_sentiment_gold',
 'name',
 'negativereason_gold',
 'retweet_count',
 'text',
 'tweet_coord',
 'tweet_created',
 'tweet_id',
 'tweet_location',
 'user_timezone']

We need only two columns for our experiment: text, which contains the text of the tweet, and airline_sentiment, which contains the sentiment of the tweet (the target column). The remaining columns can be dropped for this experiment. Let us first get a preview of the experiment.

If you have a GPU, enable the TensorFlow models by setting enable_tensorflow="on". This allows Driverless AI to create the CNN-based text features.
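The setting is passed to the client as a string through config_overrides. If several settings are needed, the string could be built from a dictionary. This is only a sketch: to_config_overrides is a hypothetical helper, and it assumes the overrides are TOML-style key=value pairs, one per line.

```python
def to_config_overrides(settings):
    """Render a dict as key=value lines (assumed TOML-style format)."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"  # lowercase booleans
        elif isinstance(value, (int, float)):
            rendered = str(value)                    # bare numbers
        else:
            rendered = '"{}"'.format(value)          # quoted strings
        lines.append("{}={}".format(key, rendered))
    return "\n".join(lines)


overrides = to_config_overrides({"enable_tensorflow": "on"})
print(overrides)
```

For the single setting used in this notebook, the result is identical to the literal string passed in the cells that follow.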

In [8]:

exp_preview = h2o.get_experiment_preview_sync(
    dataset_key=train.key,
    validset_key='',
    target_col='airline_sentiment',
    classification=True,
    dropped_cols=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                  "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                  "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
                  "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
    accuracy=6,
    time=4,
    interpretability=8,
    time_col='',
    enable_gpus=True,
    config_overrides='enable_tensorflow="on"'
)
exp_preview

Out[8]:

['ACCURACY [6/10]:',
 '- Training data size: *11,712 rows, 2 cols*',
 '- Feature evolution: *XGBoost*, *time-based validation*',
 '- Final pipeline: *XGBoost*',
 '',
 'TIME [4/10]:',
 '- Feature evolution: *4 individuals*, up to *52 iterations*',
 '- Early stopping: After *5* iterations of no improvement',
 '',
 'INTERPRETABILITY [8/10]:',
 '- Feature pre-pruning strategy: FS',
 '- Monotonicity constraints: enabled',
 '- Feature engineering search space (where applicable): [Date, Identity, Interactions, Lags, Text, TextCNN, WeightOfEvidence]',
 '',
 'XGBoost models to train:',
 '- Model and feature tuning: *72*',
 '- Feature evolution: *252*',
 '- Final pipeline: *1*',
 '',
 'Estimated max. total memory usage:',
 '- Feature engineering: *144.0MB*',
 '- GPU XGBoost: *8.0MB*']

Please note that the Text and TextCNN features are enabled for this experiment.

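Now we can start the experiment. Note that this run uses time=2, whereas the preview above was generated with time=4, so the iteration counts shown in the preview may not apply exactly.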

In [53]:

model = h2o.start_experiment_sync(
    dataset_key=train.key,
    testset_key=test.key,
    target_col='airline_sentiment',
    scorer=None,
    is_classification=True,
    cols_to_drop=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                  "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                  "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
                  "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
    accuracy=6,
    time=2,
    interpretability=8,
    time_col='',
    enable_gpus=True,
    config_overrides='enable_tensorflow="on"'
)
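Both the preview and the experiment call pass the same hand-typed list of 18 columns to drop. One way to avoid maintaining that list by hand is to derive it from the column names and the two columns we keep. The sketch below copies the column names from the output shown earlier; the variable names are illustrative.

```python
# Column names as listed for the training set earlier in this notebook.
all_columns = [
    "_unit_id", "_golden", "_unit_state", "_trusted_judgments",
    "_last_judgment_at", "airline_sentiment", "airline_sentiment:confidence",
    "negativereason", "negativereason:confidence", "airline",
    "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
    "text", "tweet_coord", "tweet_created", "tweet_id", "tweet_location",
    "user_timezone",
]

# Keep only the tweet text (feature) and the sentiment (target).
keep = {"text", "airline_sentiment"}

# Everything else is dropped; the original column order is preserved.
cols_to_drop = [c for c in all_columns if c not in keep]
print(len(cols_to_drop))
```

In the notebook, all_columns could be replaced by [c.name for c in train.columns], and cols_to_drop passed as dropped_cols to the preview and as cols_to_drop to the experiment.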

In [89]:

print('Modeling completed for model ' + model.key)

Modeling completed for model pakimeto

In [56]:

print('Logs available at', model.log_file_path)

Logs available at h2oai_experiment_pakimeto/h2oai_experiment_logs_pakimeto.zip

We can download the predictions to the current folder.

In [58]:

test_preds = h2o.download(model.test_predictions_path, '.')
print('Test set predictions available at', test_preds)

Test set predictions available at ./test_preds.csv
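Once the predictions are downloaded, they can be compared with the true labels in the test file. The sketch below assumes that the predictions file has one probability column per class, named airline_sentiment.<class>; that layout is an assumption and should be checked against the actual header of test_preds.csv. The sample rows and labels are made up for illustration.

```python
import csv
import io


def predicted_label(row, prefix):
    """Pick the class whose probability column has the highest value."""
    probs = {
        name[len(prefix):]: float(value)
        for name, value in row.items()
        if name.startswith(prefix)
    }
    return max(probs, key=probs.get)


def accuracy(pred_rows, actual_labels, prefix="airline_sentiment."):
    """Share of rows where the most probable class matches the label."""
    predicted = [predicted_label(row, prefix) for row in pred_rows]
    hits = sum(p == a for p, a in zip(predicted, actual_labels))
    return hits / len(actual_labels)


# Made-up predictions in the assumed layout; in the notebook this would
# be open("test_preds.csv") instead of an in-memory string.
sample_preds = io.StringIO(
    "airline_sentiment.negative,airline_sentiment.neutral,airline_sentiment.positive\n"
    "0.80,0.15,0.05\n"
    "0.10,0.30,0.60\n"
    "0.40,0.35,0.25\n"
    "0.20,0.70,0.10\n"
)
pred_rows = list(csv.DictReader(sample_preds))

# Made-up true labels; in the notebook these would be test_al["airline_sentiment"].
actual = ["negative", "positive", "neutral", "neutral"]
sample_accuracy = accuracy(pred_rows, actual)
print(sample_accuracy)
```

This relies on the rows of the predictions file being in the same order as the rows of the uploaded test file.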