NLP in Driverless AI¶
Driverless AI version 1.3 introduced support for Natural Language Processing (NLP) experiments for text classification and regression problems. The Driverless AI platform has the ability to support both standalone text and text with other numerical values as predictive features.
The following is the set of features created by the NLP recipe for a given text column:
- N-gram frequency / TFIDF followed by Truncated SVD
- N-gram frequency / TFIDF followed by Linear / Logistic regression
- Word embeddings followed by CNN model
- Word embeddings followed by BiGRU model
- Character embeddings followed by CNN model
An n-gram is a contiguous sequence of n items from a given sample of text or speech.
Frequency-based features represent the count of each word from a given text in the form of vectors. These are created for different n-gram values. For example, a one-gram is equivalent to a single word, a two-gram is equivalent to two consecutive words paired together, and so on.
Words and n-grams that occur more often will receive a higher weightage. The ones that are rare will receive a lower weightage.
TFIDF of n-grams¶
Frequency-based features can be multiplied with the inverse document frequency to get term frequency–inverse document frequency (TFIDF) vectors. Doing so also gives importance to the rare terms that occur in the corpus, which may be helpful in certain classification tasks.
Truncated SVD Features¶
TFIDF and the frequency of n-grams both result in higher dimensions of the representational vectors. To counteract this, Truncated SVD is commonly used to decompose the vectorized arrays into lower dimensions.
Linear Models for TFIDF Vectors¶
Linear models are also available in the Driverless AI NLP recipe. These capture linear dependencies that are crucial to the process of achieving high accuracy rates.
Word embeddings is the term for a collective set of feature engineering techniques for text where words or phrases from the vocabulary are mapped to vectors of real numbers. Representations are made so that words with similar meanings are placed close to or equidistant from one another. For example, the word “king” is closely associated with the word “queen” in this kind of vector representation.
TFIDF and frequency-based models represent counts and significant word information, but they lack the semantic context for these words. Word embedding techniques are used to make up for this lack of semantic information.
CNN Models for Word Embedding¶
Although Convolutional Neural Network (CNN) models are primarily used on image-level machine learning tasks, their use case on representing text as information has proven to be quite efficient and faster compared to RNN models. In Driverless AI, we pass word embeddings as input to CNN models, which return cross validated predictions that can be used as a new set of features.
Bi-direction GRU Models for Word Embedding¶
Recurrent neural networks like long short-term memory units (LSTM) and gated recurrent units (GRU) are state-of-the-art algorithms for NLP problems. In Driverless AI, we implement bi-directional GRU features for previous word steps and for later steps to predict the current state. For example, in the sentence “John is walking on the golf course,” a unidirectional model would represent states that represent “golf” based on “John is walking on,” but would not represent “course.” Using a bi-directional model, the representation would also account the later representations, giving the model more predictive power.
In simple terms, a bi-directional GRU model combines two independent RNN models into a single model. A GRU architecture provides high speeds and accuracy rates similar to a LSTM architecture. As with CNN models, we pass word embeddings as input to these models, which return cross validated predictions that can be used as a new set of features.
CNN Models for Character Embedding¶
For languages like Japanese and Mandarin Chinese, where characters play a major role, character level embedding is available in the NLP recipe.
In character embedding, each character is represented in the form of vectors rather than words. Driverless AI uses character level embedding as the input to CNN models and later extracts class probabilities to feed as features for downstream models.
The image below represents the overall set of features created by our NLP recipes:
NLP Naming Conventions¶
The naming conventions of the NLP features help to understand the type of feature that has been created.
The syntax for the feature names is as follows:
- [FEAT TYPE] represents one of the following:
- Txt – Frequency / TFIDF of N-grams followed by SVD
- TxtTE - Frequency / TFIDF of N-grams followed by linear model
- TextCNN_TE – Word embeddings followed by CNN model
- TextBiGRU_TE – Word embeddings followed by Bidirectional GRU model
- TextCharCNN_TE – Character embeddings followed by CNN model
- [COL] represents the name of the text column.
- [TARGET_CLASS] represents the target class for which the model predictions are made.
For example, TxtTE:text.0 equates to class 0 predictions for the text column “text” using Frequency / TFIDF of n-grams followed by a linear model.
NLP Expert Settings¶
A number of configurable settings are available for NLP in Driverless AI. Refer to NLP Settings in the Expert Settings topic for more information.
A Typical NLP Example: Sentiment Analysis¶
The following section provides an NLP example. This information is based on the Automatic Feature Engineering for Text Analytics blog post. A similar example using the Python Client is available in The Python Client.
This example uses a classical example of sentiment analysis on tweets using the US Airline Sentiment dataset from Figure Eight’s Data for Everyone library. We can split the dataset into training and test with this simple script. We will just use the tweets in the ‘text’ column and the sentiment (positive, negative or neutural) in the ‘airline_sentiment’ column for this demo. Here are some samples from the dataset:
Once we have our dataset ready in the tabular format, we are all set to use the Driverless AI. Similar to other problems in the Driverless AI setup, we need to choose the dataset, and then specify the target column (‘airline_sentiment’).
Because we don’t want to use any other columns in the dataset, we need to click on Dropped Cols, and then exclude everything but text as shown below:
Next, we will need to make sure TensorFlow is enabled for the experiment. We can go to Expert Settings and enable TensorFlow Models.
At this point, we are ready to launch an experiment. Text features will be automatically generated and evaluated during the feature engineering process. Note that some features such as TextCNN rely on TensorFlow models. We recommend using GPU(s) to leverage the power of TensorFlow and accelerate the feature engineering process.
Once the experiment is done, users can make new predictions and download the scoring pipeline just like any other Driverless AI experiments.