NLP in Driverless AI

Driverless AI version 1.3 introduced support for Natural Language Processing (NLP) experiments for text classification and regression problems. The platform supports both standalone text and text combined with other numerical values as predictive features. In particular, Driverless AI implements the following recipes and models.

Text-specific feature engineering recipes:

  • TFIDF, Frequency of n-grams
  • Truncated SVD
  • Word embeddings

Text-specific models to extract features from text:

  • Convolutional neural network models on word embeddings
  • Linear models on TFIDF vectors
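
To make the TFIDF and n-gram recipes listed above more concrete, here is a minimal sketch in plain Python. It is an illustration only and not the Driverless AI implementation: the function names (ngrams, tfidf), the whitespace tokenizer, the choice of unigrams plus bigrams, and the smoothed IDF formula are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(documents, n_max=2):
    """Compute one {n-gram: TF-IDF weight} dict per document."""
    # Naive lowercase/whitespace tokenization, chosen only for brevity.
    doc_grams = [ngrams(doc.lower().split(), n_max) for doc in documents]

    # Document frequency: number of documents that contain each n-gram.
    df = Counter()
    for grams in doc_grams:
        df.update(set(grams))

    n_docs = len(documents)
    vectors = []
    for grams in doc_grams:
        counts = Counter(grams)
        total = len(grams) or 1  # avoid dividing by zero on empty text
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            gram: (count / total) * (math.log((1 + n_docs) / (1 + df[gram])) + 1)
            for gram, count in counts.items()
        })
    return vectors


# Made-up tweets, used only to exercise the functions.
tweets = [
    "flight was late again",
    "great flight great crew",
    "late bags and late crew",
]
vectors = tfidf(tweets)
```

In this sketch, an n-gram that appears in only one tweet (such as "again") receives a higher weight than one that appears in several (such as "flight"), which is the property that makes TFIDF vectors useful as inputs to linear models.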

A Typical NLP Example: Sentiment Analysis

The following section provides an NLP example. This information is based on the Automatic Feature Engineering for Text Analytics blog post. A similar example using the Python Client is available in Appendix A: The Python Client.

This example performs sentiment analysis on tweets using the US Airline Sentiment dataset from Figure Eight’s Data for Everyone library. We can split the dataset into training and test sets with a simple script. For this demo, we will use only the tweets in the ‘text’ column and the sentiment (positive, negative, or neutral) in the ‘airline_sentiment’ column. Here are some samples from the dataset:

Example text in dataset
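
The training/test split mentioned above can be sketched as follows. This is not the script the original post refers to; it is a minimal stand-in that assumes a CSV with ‘airline_sentiment’ and ‘text’ columns. The helper names (split_rows, write_csv), the 80/20 ratio, the seed, and the in-memory sample rows are all hypothetical.

```python
import csv
import io
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(rows, fieldnames, handle):
    """Write dict rows with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Tiny in-memory stand-in for the real dataset file (made-up rows).
sample = io.StringIO(
    "airline_sentiment,text\n"
    "negative,flight was late again\n"
    "positive,great flight great crew\n"
    "neutral,what time does boarding start\n"
    "negative,lost my bags\n"
    "positive,thanks for the upgrade\n"
)
reader = csv.DictReader(sample)
rows = list(reader)

train, test = split_rows(rows)

# Write the training split; a real script would write to files instead.
train_out = io.StringIO()
write_csv(train, reader.fieldnames, train_out)
```

To run this against the real dataset, replace the in-memory buffers with files opened via open(path, newline="", encoding="utf-8"). Fixing the seed keeps the split reproducible, so the same rows land in the test set on every run.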

Once the dataset is ready in tabular format, we are all set to use Driverless AI. As with other problems in Driverless AI, we need to choose the dataset and then specify the target column (‘airline_sentiment’).

Example experiment settings

Because we don’t want to use any other columns in the dataset, we need to click on Dropped Cols and then drop every column except text, as shown below:

Dropping columns in the dataset

Next, we will need to make sure TensorFlow is enabled for the experiment. We can go to Expert Settings and enable TensorFlow Models.

Enable TensorFlow models

At this point, we are ready to launch an experiment. Text features will be automatically generated and evaluated during the feature engineering process. Note that some features, such as TextCNN, rely on TensorFlow models. We recommend using GPU(s) to leverage the power of TensorFlow and accelerate the feature engineering process.


Once the experiment is done, users can make new predictions and download the scoring pipeline, just as with any other Driverless AI experiment.