Driverless AI NLP Demo - Airline Sentiment Dataset
In this notebook, we will see how to use the Driverless AI Python client to build text classification models using the Twitter airline sentiment dataset.
Import the necessary Python modules to get started, including the Driverless AI client. If it is not already installed, please download the Python client from the Driverless AI GUI and install it.
In [1]:
import h2oai_client
import numpy as np
import pandas as pd
from sklearn import model_selection
from h2oai_client import Client
The code below downloads the Twitter airline sentiment dataset and saves it in the current folder.
In [2]:
! wget https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv
--2018-09-11 12:12:42-- https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv
Resolving www.figure-eight.com (www.figure-eight.com)... 52.0.208.137, 52.3.39.167
Connecting to www.figure-eight.com (www.figure-eight.com)|52.0.208.137|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3704908 (3.5M) [application/octet-stream]
Saving to: ‘Airline-Sentiment-2-w-AA.csv’
Airline-Sentiment-2 100%[===================>] 3.53M 660KB/s in 5.5s
2018-09-11 12:12:50 (660 KB/s) - ‘Airline-Sentiment-2-w-AA.csv’ saved [3704908/3704908]
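Where wget is not available, the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name fetch_if_missing is hypothetical (it is not part of the notebook or of the Driverless AI client), and the sketch assumes the URL above is still reachable.

```python
import os
import urllib.request

DATA_URL = "https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv"


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists, then return dest.

    Hypothetical helper for illustration only.
    """
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# Usage (requires network access, so it is left commented out here):
# fetch_if_missing(DATA_URL, "Airline-Sentiment-2-w-AA.csv")
```

Skipping the download when the file already exists makes it safe to re-run the notebook from the top.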
We can now split the dataset into train and test files so as to build models.
In [2]:
al = pd.read_csv("Airline-Sentiment-2-w-AA.csv", encoding='ISO-8859-1')
train_al, test_al = model_selection.train_test_split(al, test_size=0.2, random_state=2018)
train_al.to_csv("train_airline_sentiment.csv", index=False)
test_al.to_csv("test_airline_sentiment.csv", index=False)
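The split above is purely random. If the sentiment classes turn out to be imbalanced, a stratified split keeps the class proportions the same in both files. The sketch below illustrates this on a small made-up DataFrame (the toy data and its class counts are assumptions for illustration, not the real dataset); in the notebook, the equivalent change would be passing stratify=al['airline_sentiment'] to train_test_split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the airline data: an imbalanced three-class target column.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "airline_sentiment": ["negative"] * 60 + ["neutral"] * 25 + ["positive"] * 15,
})

# stratify= asks scikit-learn to preserve the class proportions in both parts.
train_toy, test_toy = train_test_split(
    toy,
    test_size=0.2,
    random_state=2018,
    stratify=toy["airline_sentiment"],
)

# Compare class shares in the two parts.
print(train_toy["airline_sentiment"].value_counts(normalize=True))
print(test_toy["airline_sentiment"].value_counts(normalize=True))
```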
The first step is to establish a connection to Driverless AI using Client. Please enter your credentials and the URL of the Driverless AI server.
In [3]:
h2o = Client(address='http://localhost:12345', username='h2oai', password='h2oai')
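To avoid hard-coding credentials in a shared notebook, the connection settings could be read from environment variables instead. This is only a sketch: the variable names DAI_ADDRESS, DAI_USERNAME and DAI_PASSWORD and the helper connection_settings are hypothetical, and the fallbacks are simply the values used in the cell above.

```python
import os


def connection_settings(env=os.environ):
    """Collect connection settings, falling back to the notebook's defaults.

    The environment variable names are hypothetical, chosen for illustration.
    """
    return {
        "address": env.get("DAI_ADDRESS", "http://localhost:12345"),
        "username": env.get("DAI_USERNAME", "h2oai"),
        "password": env.get("DAI_PASSWORD", "h2oai"),
    }


settings = connection_settings()
# The resulting dict uses the same keyword names as the Client call above:
# h2o = Client(**settings)
```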
Read the train and test files into Driverless AI using the create_dataset_sync command.
In [5]:
train_path = './train_airline_sentiment.csv'
test_path = './test_airline_sentiment.csv'
train = h2o.create_dataset_sync(train_path)
test = h2o.create_dataset_sync(test_path)
Now let us look at some basic information about the dataset, starting with the number of columns and rows.
In [6]:
print('Train Dataset: ', len(train.columns), 'x', train.row_count)
print('Test Dataset: ', len(test.columns), 'x', test.row_count)
Train Dataset: 20 x 11712
Test Dataset: 20 x 2928
The names of the columns in the training set can be listed as follows.
In [7]:
[c.name for c in train.columns]
Out[7]:
['_unit_id',
'_golden',
'_unit_state',
'_trusted_judgments',
'_last_judgment_at',
'airline_sentiment',
'airline_sentiment:confidence',
'negativereason',
'negativereason:confidence',
'airline',
'airline_sentiment_gold',
'name',
'negativereason_gold',
'retweet_count',
'text',
'tweet_coord',
'tweet_created',
'tweet_id',
'tweet_location',
'user_timezone']
We need only two columns for our experiment: text, which contains the text of the tweet, and airline_sentiment, which contains the sentiment of the tweet (the target column). We can drop the remaining columns for this experiment. Let us get a preview of the experiment with these settings.
Also, if you have a GPU, please enable the TensorFlow models by setting enable_tensorflow="on". This will help in creating the CNN-based text features.
In [8]:
exp_preview = h2o.get_experiment_preview_sync(
    dataset_key=train.key,
    validset_key='',
    target_col='airline_sentiment',
    classification=True,
    dropped_cols=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                  "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                  "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
                  "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
    accuracy=6,
    time=4,
    interpretability=8,
    time_col='',
    enable_gpus=True,
    config_overrides='enable_tensorflow="on"'
)
exp_preview
Out[8]:
['ACCURACY [6/10]:',
'- Training data size: *11,712 rows, 2 cols*',
'- Feature evolution: *XGBoost*, *time-based validation*',
'- Final pipeline: *XGBoost*',
'',
'TIME [4/10]:',
'- Feature evolution: *4 individuals*, up to *52 iterations*',
'- Early stopping: After *5* iterations of no improvement',
'',
'INTERPRETABILITY [8/10]:',
'- Feature pre-pruning strategy: FS',
'- Monotonicity constraints: enabled',
'- Feature engineering search space (where applicable): [Date, Identity, Interactions, Lags, Text, TextCNN, WeightOfEvidence]',
'',
'XGBoost models to train:',
'- Model and feature tuning: *72*',
'- Feature evolution: *252*',
'- Final pipeline: *1*',
'',
'Estimated max. total memory usage:',
'- Feature engineering: *144.0MB*',
'- GPU XGBoost: *8.0MB*']
Please note that the Text and TextCNN features are enabled for this experiment.
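The list of dropped columns is typed out by hand in the preview call. As a sketch, it can instead be derived from the column names shown earlier by excluding the two columns we keep. The names all_columns, keep and cols_to_drop are introduced here for illustration; in the notebook, all_columns would come from [c.name for c in train.columns].

```python
# Column names as listed by [c.name for c in train.columns] earlier.
all_columns = [
    "_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
    "airline_sentiment", "airline_sentiment:confidence", "negativereason",
    "negativereason:confidence", "airline", "airline_sentiment_gold", "name",
    "negativereason_gold", "retweet_count", "text", "tweet_coord", "tweet_created",
    "tweet_id", "tweet_location", "user_timezone",
]

# The feature column and the target column are the only ones we keep.
keep = {"text", "airline_sentiment"}

# Everything else is dropped; the original column order is preserved.
cols_to_drop = [c for c in all_columns if c not in keep]

print(len(cols_to_drop))  # 18
```

Building the list once and reusing it keeps the preview call and the experiment call from drifting apart.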
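Now we can start the experiment. Note that this call passes time=2, whereas the preview above was generated with time=4, so the iteration estimates in the preview may not match this run exactly.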
In [53]:
model = h2o.start_experiment_sync(
    dataset_key=train.key,
    testset_key=test.key,
    target_col='airline_sentiment',
    scorer=None,
    is_classification=True,
    cols_to_drop=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                  "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                  "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
                  "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
    accuracy=6,
    time=2,
    interpretability=8,
    time_col='',
    enable_gpus=True,
    config_overrides='enable_tensorflow="on"'
)
In [89]:
print('Modeling completed for model ' + model.key)
Modeling completed for model pakimeto
In [56]:
print('Logs available at', model.log_file_path)
Logs available at h2oai_experiment_pakimeto/h2oai_experiment_logs_pakimeto.zip
We can download the predictions to the current folder.
In [58]:
test_preds = h2o.download(model.test_predictions_path, '.')
print('Test set predictions available at', test_preds)
Test set predictions available at ./test_preds.csv
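Once the predictions are downloaded, they can be compared with the true labels in the test file. The sketch below assumes that the predictions file holds one probability column per class, named like airline_sentiment.negative, and that its rows are in the same order as test_airline_sentiment.csv; both are assumptions to verify against the actual file. A small made-up DataFrame stands in for the real predictions so that the example runs on its own.

```python
import pandas as pd

# Made-up stand-in for pd.read_csv("./test_preds.csv"); the column naming
# scheme is an assumption and should be checked against the real file.
preds = pd.DataFrame({
    "airline_sentiment.negative": [0.8, 0.2, 0.1, 0.5],
    "airline_sentiment.neutral": [0.1, 0.7, 0.2, 0.3],
    "airline_sentiment.positive": [0.1, 0.1, 0.7, 0.2],
})

# Made-up stand-in for the airline_sentiment column of the test file.
actual = pd.Series(["negative", "neutral", "positive", "neutral"])

# Pick the most probable class per row and strip the column-name prefix.
predicted = preds.idxmax(axis=1).str.replace("airline_sentiment.", "", regex=False)

# Share of rows where the predicted class equals the true class.
accuracy = (predicted == actual).mean()
print(f"Accuracy: {accuracy:.2f}")
```

With the real files, preds and actual would be loaded with pd.read_csv, and the comparison only makes sense if the row order of the two files matches.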