Driverless AI NLP Demo - Airline Sentiment Dataset

In this notebook, we will see how to use the Driverless AI Python client to build text classification models using the Twitter airline sentiment dataset.

Import the necessary Python modules to get started, including the Driverless AI client. If it is not already installed, download and install the Python client from the Driverless AI GUI.

This notebook was tested in Driverless AI version 1.8.2.

[1]:
import pandas as pd
from sklearn import model_selection
from h2oai_client import Client

The code below downloads the Twitter airline sentiment dataset and saves it in the current folder.

[2]:
! wget https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv
--2020-01-17 09:38:39--  https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv
Resolving www.figure-eight.com (www.figure-eight.com)... 54.86.123.68, 35.169.155.50
Connecting to www.figure-eight.com (www.figure-eight.com)|54.86.123.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3704908 (3.5M) [application/octet-stream]
Saving to: ‘Airline-Sentiment-2-w-AA.csv.1’

Airline-Sentiment-2 100%[===================>]   3.53M  2.26MB/s    in 1.6s

2020-01-17 09:38:41 (2.26 MB/s) - ‘Airline-Sentiment-2-w-AA.csv.1’ saved [3704908/3704908]
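
In the output above, wget saved the file as `Airline-Sentiment-2-w-AA.csv.1`, which is what it does when a file with the requested name already exists, while the next cell reads the name without the suffix. A minimal sketch of a download step that reuses an existing copy instead is shown below. The helper name `download_if_missing` is hypothetical, and the sketch assumes the URL is still reachable when a download is actually needed.

```python
import os
import urllib.request

URL = "https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv"


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest."""
    if os.path.exists(dest):
        # Reuse the existing copy rather than creating a '.1' duplicate.
        return dest
    urllib.request.urlretrieve(url, dest)
    return dest


# Usage in the notebook (performs a network request if the file is absent):
# csv_path = download_if_missing(URL, "Airline-Sentiment-2-w-AA.csv")
```

With this approach, re-running the cell leaves a single file in the current folder, so the name passed to `pd.read_csv` always matches the downloaded copy.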

We can now split the data into training and testing datasets.

[3]:
al = pd.read_csv("Airline-Sentiment-2-w-AA.csv", encoding='ISO-8859-1')
train_al, test_al = model_selection.train_test_split(al, test_size=0.2, random_state=2018)
train_al.to_csv("train_airline_sentiment.csv", index=False)
test_al.to_csv("test_airline_sentiment.csv", index=False)

The first step is to establish a connection to Driverless AI using Client. Enter your credentials and the URL of your Driverless AI instance.

[4]:
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# Make sure to use the same username and password as when signing in through the GUI
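
To avoid hard-coding credentials in a shared notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `DAI_ADDRESS`, `DAI_USERNAME`, and `DAI_PASSWORD` and the helper `load_dai_credentials` are hypothetical and not part of the Driverless AI client.

```python
import os


def load_dai_credentials(env=os.environ):
    """Return (address, username, password) read from a mapping.

    The variable names are hypothetical; pick whatever fits your setup.
    A missing username or password raises KeyError instead of silently
    falling back to a default.
    """
    address = env.get("DAI_ADDRESS", "http://ip_where_driverless_is_running:12345")
    username = env["DAI_USERNAME"]
    password = env["DAI_PASSWORD"]
    return address, username, password


# Usage in the notebook:
# address, username, password = load_dai_credentials()
# h2oai = Client(address=address, username=username, password=password)
```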

Read the train and test files into Driverless AI using the upload_dataset_sync command.

[5]:
train_path = './train_airline_sentiment.csv'
test_path = './test_airline_sentiment.csv'

train = h2oai.upload_dataset_sync(train_path)
test = h2oai.upload_dataset_sync(test_path)

Now let us look at some basic information about the dataset.

[6]:
print('Train Dataset: ', len(train.columns), 'x', train.row_count)
print('Test Dataset: ', len(test.columns), 'x', test.row_count)

[c.name for c in train.columns]
Train Dataset:  20 x 11712
Test Dataset:  20 x 2928
[6]:
['_unit_id',
 '_golden',
 '_unit_state',
 '_trusted_judgments',
 '_last_judgment_at',
 'airline_sentiment',
 'airline_sentiment:confidence',
 'negativereason',
 'negativereason:confidence',
 'airline',
 'airline_sentiment_gold',
 'name',
 'negativereason_gold',
 'retweet_count',
 'text',
 'tweet_coord',
 'tweet_created',
 'tweet_id',
 'tweet_location',
 'user_timezone']

We need only two columns for our experiment: text, which contains the text of the tweet, and airline_sentiment, which contains the sentiment of the tweet (the target column). We can drop the remaining columns for this experiment.

We will enable TensorFlow models and transformations to take advantage of CNN-based text features.
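
Rather than typing the list of dropped columns by hand, it can be derived from the column names printed above. The sketch below copies those names into a plain list so that it runs on its own; in the notebook, `all_columns` could instead be `[c.name for c in train.columns]`.

```python
# Column names as printed by the previous cell.
all_columns = [
    "_unit_id", "_golden", "_unit_state", "_trusted_judgments",
    "_last_judgment_at", "airline_sentiment", "airline_sentiment:confidence",
    "negativereason", "negativereason:confidence", "airline",
    "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
    "text", "tweet_coord", "tweet_created", "tweet_id", "tweet_location",
    "user_timezone",
]

# Keep only the tweet text and the target column.
keep = {"text", "airline_sentiment"}
cols_to_drop = [c for c in all_columns if c not in keep]

print(len(cols_to_drop))  # 18
```

The resulting list preserves the original column order and can be passed wherever the experiment expects the columns to drop.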

[7]:
exp_preview = h2oai.get_experiment_preview_sync(
    dataset_key=train.key
    , validset_key=''
    , target_col='airline_sentiment'
    , classification=True
    , dropped_cols=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                  "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                  "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
                  "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"]
    , accuracy=6
    , time=4
    , interpretability=5
    , is_time_series=False
    , time_col=''
    , enable_gpus=True
    , reproducible=False
    , resumed_experiment_id=''
    , config_overrides="""
        enable_tensorflow='on'
        enable_tensorflow_charcnn='on'
        enable_tensorflow_textcnn='on'
        enable_tensorflow_textbigru='on'
    """
)
exp_preview
[7]:
['ACCURACY [6/10]:',
 '- Training data size: *11,712 rows, 2 cols*',
 '- Feature evolution: *[LightGBM, TensorFlow, XGBoostGBM]*, *3-fold CV**, 2 reps*',
 '- Final pipeline: *Ensemble (6 models), 3-fold CV*',
 '',
 'TIME [4/10]:',
 '- Feature evolution: *4 individuals*, up to *46 iterations*',
 '- Early stopping: After *5* iterations of no improvement',
 '',
 'INTERPRETABILITY [5/10]:',
 '- Feature pre-pruning strategy: None',
 '- Monotonicity constraints: disabled',
 '- Feature engineering search space: [CVTargetEncode, Frequent, TextBiGRU, TextCNN, TextCharCNN, Text]',
 '',
 '[LightGBM, TensorFlow, XGBoostGBM] models to train:',
 '- Model and feature tuning: *144*',
 '- Feature evolution: *504*',
 '- Final pipeline: *6*',
 '',
 'Estimated runtime: *minutes*',
 'Auto-click Finish/Abort if not done in: *1 day*/*7 days*']

Note that the text features (Text, TextCNN, TextCharCNN, and TextBiGRU) appear in the feature engineering search space, which confirms that they are enabled for this experiment.
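
The same `config_overrides` string is needed both for the preview and for the experiment itself, so it can help to build it once from a dictionary. The helper `build_config_overrides` below is a hypothetical convenience, not part of the client, and it assumes every setting is a string value written as `key='value'`, as in the cell above.

```python
def build_config_overrides(settings):
    """Render a dict of string settings as newline-separated key='value' lines."""
    return "\n".join(f"{key}='{value}'" for key, value in settings.items())


# The four settings used in this notebook.
TEXT_SETTINGS = {
    "enable_tensorflow": "on",
    "enable_tensorflow_charcnn": "on",
    "enable_tensorflow_textcnn": "on",
    "enable_tensorflow_textbigru": "on",
}

config_overrides = build_config_overrides(TEXT_SETTINGS)
print(config_overrides)
```

Passing the same `config_overrides` variable to both calls keeps the preview and the experiment in sync when a setting changes.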

Now we can start the experiment.

[8]:
model = h2oai.start_experiment_sync(
    dataset_key=train.key,
    testset_key=test.key,
    target_col='airline_sentiment',
    scorer='F1',
    is_classification=True,
    cols_to_drop=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                  "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                  "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
                  "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
    accuracy=6,
    time=2,
    interpretability=5,
    enable_gpus=True,
    config_overrides="""
        enable_tensorflow='on'
        enable_tensorflow_charcnn='on'
        enable_tensorflow_textcnn='on'
        enable_tensorflow_textbigru='on'
    """
)
[9]:
print('Modeling completed for model ' + model.key)
Modeling completed for model ce5935e6-3950-11ea-9465-0242ac110002
[10]:
logs = h2oai.download(model.log_file_path, '.')
print('Logs available at', logs)

We can download the predictions to the current folder.

[11]:
test_preds = h2oai.download(model.test_predictions_path, '.')
print('Test set predictions available at', test_preds)
Test set predictions available at ./test_preds.csv
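
Once downloaded, the predictions can be turned into one predicted label per row by taking the class with the highest probability. The sketch below assumes the file holds one probability column per class, named `airline_sentiment.negative`, `airline_sentiment.neutral`, and `airline_sentiment.positive`; these names are an assumption, so check the header of your own `test_preds.csv`. The sample rows are made up for illustration only.

```python
import csv
import io


def predicted_labels(csv_file, prefix="airline_sentiment."):
    """Yield the class with the highest probability for each row.

    Assumes one probability column per class whose name starts with prefix.
    """
    for row in csv.DictReader(csv_file):
        probs = {
            name[len(prefix):]: float(value)
            for name, value in row.items()
            if name.startswith(prefix)
        }
        yield max(probs, key=probs.get)


# Made-up sample standing in for ./test_preds.csv.
sample = io.StringIO(
    "airline_sentiment.negative,airline_sentiment.neutral,airline_sentiment.positive\n"
    "0.80,0.15,0.05\n"
    "0.10,0.20,0.70\n"
)
labels = list(predicted_labels(sample))
print(labels)  # ['negative', 'positive']
```

To apply this to the real file, open it with `open(test_preds, newline="")` and pass the file object to `predicted_labels`.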