Driverless AI NLP Demo - Airline Sentiment Dataset

In this notebook, we use the Driverless AI Python client to build text classification models on the airline sentiment Twitter dataset.

Import the necessary Python modules, including the Driverless AI client. If the client is not already installed, download and install it from the Driverless AI GUI.

Here is the Python Client Documentation.

[1]:
import pandas as pd
from sklearn import model_selection
import driverlessai

The first step is to establish a connection to Driverless AI using Client. Enter your credentials and the URL of the running Driverless AI instance.

[2]:
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
dai = driverlessai.Client(address=address, username=username, password=password)
# Use the same username and password that you use to sign in through the GUI.
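
Credentials typed directly into a notebook are easy to leak. As a minimal sketch, the connection settings can instead be read from environment variables, falling back to the placeholders used above. The `DAI_ADDRESS`, `DAI_USERNAME`, and `DAI_PASSWORD` variable names and the `load_dai_credentials` helper are hypothetical and not defined by Driverless AI.

```python
import os


def load_dai_credentials(env=os.environ):
    """Collect the keyword arguments for driverlessai.Client from a mapping.

    The DAI_* variable names are hypothetical; pick whatever fits your setup.
    """
    return {
        'address': env.get('DAI_ADDRESS', 'http://ip_where_driverless_is_running:12345'),
        'username': env.get('DAI_USERNAME', 'username'),
        'password': env.get('DAI_PASSWORD', 'password'),
    }


creds = load_dai_credentials()
# dai = driverlessai.Client(**creds)  # same keyword arguments as in the cell above
```

Passing the mapping explicitly, as in `load_dai_credentials({'DAI_USERNAME': 'alice'})`, makes the helper easy to check without touching the real environment.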

Upload the airlines file into Driverless AI using the datasets.create command. Then, split the data into train and test sets.

[3]:
airlines = dai.datasets.create(data='https://h2o-public-test-data.s3.amazonaws.com/dai_release_testing/datasets/airline_sentiment_tweets.csv',
                               data_source='s3')

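ds_split = airlines.split_to_train_test(train_size=0.7,
                                        train_name='train',
                                        test_name='test')

# split_to_train_test returns a dict of datasets; unpack it for the cells below.
train = ds_split['train_dataset']
test = ds_split['test_dataset']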
Complete 100.00% - [4/4] Computing column statistics
Complete

Now let us look at some basic information about the dataset.

[9]:
print('Train Dataset: ', train.shape)
print('Test Dataset: ', test.shape)

ids = list(train.columns)  # column names, reused later as include_columns
print(ids)
Train Dataset:  (11712, 15)
Test Dataset:  (2928, 15)
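
The printed shapes allow a quick sanity check of the split. The sketch below only redoes the arithmetic on the row counts shown above. It suggests that the recorded output comes from an 80/20 split, whereas `train_size=0.7` would give roughly 10,248 training rows, so the output was presumably captured from a run with a different setting.

```python
train_rows, test_rows = 11712, 2928  # row counts printed above

total_rows = train_rows + test_rows
train_fraction = train_rows / total_rows

# Training rows one would expect from train_size=0.7 on the same total.
expected_at_70 = round(0.7 * total_rows)

print(total_rows)       # 14640
print(train_fraction)   # 0.8
print(expected_at_70)   # 10248
```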

We need only two columns for our experiment: text, which contains the text of the tweet, and airline_sentiment, which contains the sentiment of the tweet (the target column). We can drop the remaining columns for this experiment.
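
Rather than typing the drop list by hand, it can be derived from the column names. The sketch below assumes a column list inferred from the drop list in the experiment cell of this notebook; in practice the list would come from `train.columns`, and the actual names in your copy of the dataset may differ.

```python
# Assumed column names, for illustration only; use train.columns in the notebook.
columns = [
    'tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
    'negativereason', 'negativereason_confidence', 'airline',
    'airline_sentiment_gold', 'name', 'negativereason_gold', 'retweet_count',
    'text', 'tweet_coord', 'tweet_created', 'tweet_location', 'user_timezone',
]

# Keep the tweet text (feature) and the sentiment label (target); drop the rest.
keep = {'text', 'airline_sentiment'}
drop_columns = [c for c in columns if c not in keep]

print(len(columns), len(drop_columns))  # 15 13
```

Building the list this way keeps it in step with the dataset if a column is renamed or added.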

We will enable TensorFlow models and transformations to take advantage of CNN-based text features.

[12]:
exp_preview = dai.experiments.preview(train_dataset=train,
                                      target_column='airline_sentiment',
                                      task='classification',
                                      drop_columns=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                                      "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                                      "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
                                      "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
                                      config_overrides="""
                                      enable_tensorflow='on'
                                      enable_tensorflow_charcnn='on'
                                      enable_tensorflow_textcnn='on'
                                      enable_tensorflow_textbigru='on'
                                      """)
ACCURACY [7/10]:
- Training data size: *11,712 rows, 4 cols*
- Feature evolution: *[Constant, DecisionTree, LightGBM, TensorFlow, XGBoostGBM]*, *3-fold CV**, 2 reps*
- Final pipeline: *Ensemble (6 models), 3-fold CV*

TIME [2/10]:
- Feature evolution: *8 individuals*, up to *42 iterations*
- Early stopping: After *5* iterations of no improvement

INTERPRETABILITY [8/10]:
- Feature pre-pruning strategy: Permutation Importance FS
- Monotonicity constraints: enabled
- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, CatOriginal, Cat, Frequent, Interactions, NumCatTE, Original, TextBiGRU, TextCNN, TextCharCNN, Text]

[Constant, DecisionTree, LightGBM, TensorFlow, XGBoostGBM] models to train:
- Model and feature tuning: *192*
- Feature evolution: *288*
- Final pipeline: *6*

Estimated runtime: *minutes*
Auto-click Finish/Abort if not done in: *1 day*/*7 days*

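Note that the feature engineering search space in the preview includes the Text, TextCNN, TextCharCNN, and TextBiGRU transformers, so the text features are enabled for this experiment.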
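
The config_overrides value used in the preview is a plain string with one key='value' pair per line. As a minimal sketch, it can be generated from a dictionary so that the same settings are easy to reuse. The `build_config_overrides` helper is hypothetical and not part of the Driverless AI client.

```python
def build_config_overrides(settings):
    """Render a dict of settings as one key='value' pair per line."""
    return '\n'.join(f"{key}='{value}'" for key, value in settings.items())


# The same four TensorFlow switches that the preview cell sets.
tf_settings = {
    'enable_tensorflow': 'on',
    'enable_tensorflow_charcnn': 'on',
    'enable_tensorflow_textcnn': 'on',
    'enable_tensorflow_textbigru': 'on',
}

overrides = build_config_overrides(tf_settings)
print(overrides)
```

The resulting string can be passed as `config_overrides=overrides` in the preview call. The experiment cell in this notebook does not pass config_overrides, so supplying the same string there is presumably needed if the experiment should run with the settings shown in the preview.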

Now we can start the experiment.

[13]:
model = dai.experiments.create(train_dataset=train,
                               target_column='airline_sentiment',
                               task='classification',
                               name="nlp_airline_sentiment_beta",
                               scorer='F1',
                               drop_columns=["tweet_id", "airline_sentiment_confidence", "negativereason", "negativereason_confidence", "airline", "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count", "tweet_coord", "tweet_created", "tweet_location", "user_timezone", "airline_sentiment.negative", "airline_sentiment.neutral", "airline_sentiment.positive"],
                               accuracy=6,
                               time=2,
                               interpretability=5)
Experiment launched at: http://localhost:12345/#experiment?key=b971fe8a-e317-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete
[14]:
print('Modeling completed for model ' + model.key)
Modeling completed for model b971fe8a-e317-11ea-9088-0242ac110002
[15]:
logs = model.log.download(dst_dir='.', overwrite=True)
#logs = dai.datasets.download(model.log_file_path, '.')
print('Logs available at', logs)
Downloaded './h2oai_experiment_logs_b971fe8a-e317-11ea-9088-0242ac110002.zip'
Logs available at ./h2oai_experiment_logs_b971fe8a-e317-11ea-9088-0242ac110002.zip

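Finally, we score the test set and download the predictions to the current folder. Passing include_columns keeps the original columns alongside the predictions in the downloaded file.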

[16]:
test_preds = model.predict(dataset=test, include_columns=ids).download(dst_dir='.', overwrite=True)
print('Test set predictions available at', test_preds)
Complete
Downloaded './b971fe8a-e317-11ea-9088-0242ac110002_preds_9f438fac.csv'
Test set predictions available at ./b971fe8a-e317-11ea-9088-0242ac110002_preds_9f438fac.csv
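
Once the predictions file is on disk, it can be checked locally. The sketch below assumes that the file holds the true label in an `airline_sentiment` column (carried over through include_columns) and one probability column per class named `airline_sentiment.negative`, `airline_sentiment.neutral`, and `airline_sentiment.positive`. These column names are an assumption, so inspect the header of the downloaded file first. The sample rows are made up for illustration and are not results from the experiment.

```python
import csv
import io

CLASSES = ['negative', 'neutral', 'positive']


def accuracy_from_predictions(csv_file, target='airline_sentiment'):
    """Share of rows whose highest-probability class equals the true label."""
    correct = total = 0
    for row in csv.DictReader(csv_file):
        # Pick the class whose probability column holds the largest value.
        predicted = max(CLASSES, key=lambda c: float(row[f'{target}.{c}']))
        correct += predicted == row[target]
        total += 1
    return correct / total if total else float('nan')


# Made-up rows, for illustration only.
SAMPLE_CSV = (
    'airline_sentiment,airline_sentiment.negative,'
    'airline_sentiment.neutral,airline_sentiment.positive\n'
    'negative,0.80,0.15,0.05\n'
    'positive,0.10,0.20,0.70\n'
    'neutral,0.60,0.30,0.10\n'
    'negative,0.55,0.25,0.20\n'
)

print(accuracy_from_predictions(io.StringIO(SAMPLE_CSV)))  # 0.75
```

For the real file, the same function can be applied with `with open(test_preds, newline='') as f: accuracy_from_predictions(f)`.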