.. _Transformations:
Driverless AI Transformations
=============================
Transformations in Driverless AI are applied to columns in the data. The transformers create the engineered features in experiments.
Driverless AI provides a number of transformers. The downloaded experiment logs include the transformations that were applied to your experiment. Note that you can blacklist transformations in the **config.toml** file, and the list of blacklisted transformers will also be available in the experiment log.
Available Transformers
----------------------
The following transformers are available for classification (multiclass and binary) and regression experiments.
- FilterTransformer
The Filter Transformer counts each numeric value in the dataset.
- FrequentTransformer
The Frequent Transformer calculates the frequency for each value in categorical column(s) and uses this as a new feature. This count can be either the raw count or the normalized count.
- BulkInteractionsTransformer
The Bulk Interactions Transformer adds, divides, multiplies, or subtracts two numeric columns in the data to create a new feature.
- ClusterTETransformer
The Cluster Target Encoding Transformer clusters selected numeric columns and calculates the mean of the response column for each cluster. The mean of the response is used as a new feature. Cross Validation is used to calculate mean response to prevent overfitting.
- TruncSVDNumTransformer
The Truncated SVD Numeric Transformer trains a Truncated SVD model on selected numeric columns and uses the components of the truncated SVD matrix as new features.
- CVTargetEncodeF
The Cross Validation Target Encoding Transformer calculates the mean of the response column for each value in a categorical column and uses this as a new feature. Cross Validation is used to calculate mean response to prevent overfitting.
- CVCatNumEncodeF
The Cross Validation Categorical to Numeric Encoding (Fit) Transformer converts a categorical column to a numeric column. Cross validation target encoding is done on the categorical column.
- CVCatNumEncodeDT
The Cross Validation Categorical to Numeric Encoding Transformer calculates an aggregation of a numeric column for each value in a categorical column (for example, the mean Temperature for each City) and uses this aggregation as a new feature.
- NumToCatTETransformer
The Numeric to Categorical Target Encoding Transformer converts a numeric column to categorical by binning and then calculates the mean of the response column for each bin. The mean of the response for the bin is used as a new feature. Cross Validation is used to calculate mean response to prevent overfitting.
- NumCatTETransformer
The Numeric Categorical Target Encoding Transformer calculates the mean of the response column for several selected columns. If one of the selected columns is numeric, it is first converted to categorical by binning. The mean of the response column is used as a new feature. Cross Validation is used to calculate mean response to prevent overfitting.
- DatesTransformer
The Dates Transformer retrieves any date values, including:
- Year
- Quarter
- Month
- Day
- Day of year
- Week
- Week day
- Hour
- Minute
- Second
- TextTransformer
The Text Transformer tokenizes a text column and creates a TF-IDF (term frequency-inverse document frequency) matrix or a count (count of the word) matrix. This may be followed by dimensionality reduction using truncated SVD. Selected components of the TF-IDF/Count matrix are used as new features.
- ClusterDistTransformer
The Cluster Distance Transformer clusters selected numeric columns and uses the distance to a specific cluster as a new feature.
- WeightOfEvidenceTransformer
The Weight of Evidence Transformer calculates Weight of Evidence for each value in categorical column(s). The Weight of Evidence is used as a new feature. Weight of Evidence measures the “strength” of a grouping for separating good and bad risk and is calculated by taking the log of the ratio of distributions for a binary response column.
.. figure:: images/woe.png
This only works with a binary target variable. The likelihood needs to be created within a stratified k-fold if a ``fit_transform`` method is used. More information can be found here: http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/.
- NumToCatWoETransformer
The Numeric to Categorical Weight of Evidence Transformer converts a numeric column to categorical by binning and then calculates Weight of Evidence for each bin. The Weight of Evidence is used as a new feature. Weight of Evidence measures the “strength” of a grouping for separating good and bad risk and is calculated by taking the log of the ratio of distributions for a binary response column.
- LagsTransformer
The Lags Transformer creates target/feature lags, possibly over groups. Each lag is used as a new feature.
- LagsInteractionTransformer
The Lags Interaction Transformer creates target/feature lags and calculates interactions between the lags (lag2 - lag1, for instance). The interaction is used as a new feature.
- LagsAggregatesTransformer
The Lags Aggregates Transformer calculates aggregations of target/feature lags like mean(lag7, lag14, lag21) with support for mean, min, max, median, sum, skew, kurtosis, std. The aggregation is used as a new feature.
- IsHolidayTransformer
The Is Holiday Transformer determines if a date column is a holiday. A boolean column indicating if the date is a holiday is added as a new feature.
- NumToCatWoEMonotonicTransformer
The Numeric to Categorical Weight of Evidence Monotonic Transformer converts a numeric column to categorical by binning and then calculates Weight of Evidence for each bin. The monotonic constraint ensures the bins of values are monotonically related to the Weight of Evidence value. The Weight of Evidence is used as a new feature. Weight of Evidence measures the “strength” of a grouping for separating good and bad risk and is calculated by taking the log of the ratio of distributions for a binary response column.
- TextLinModelTransformer
The Text Linear Model Transformer trains a linear model on a TF-IDF matrix created from a text feature to predict the response column. The linear model prediction is used as a new feature. Cross Validation is used when training the linear model to prevent overfitting.
- TextCNNTransformer
The Text CNN Transformer trains a CNN TensorFlow model on word embeddings created from a text feature to predict the response column. The CNN prediction is used as a new feature. Cross Validation is used when training the CNN model to prevent overfitting.
- OHETransformer
The One-Hot Encoding Transformer converts a categorical column to a series of boolean features by performing one-hot encoding. The boolean features are used as new features.
- SortedLETransformer
The Sorted Label Encoding Transformer sorts a categorical column by the response column and uses the order index created as a new feature.
- LexiLabelEncoder
The Lexi Label Encoder sorts a categorical column in lexicographical order and uses the order index created as a new feature.
- EwmaLagsTransformer
The Exponentially Weighted Moving Average (EWMA) Transformer calculates the exponentially weighted moving average temporal lag of some target/feature.
- TextClustDistTransformer
The Text Cluster Distance Transformer clusters a TF-IDF matrix created from a text feature and uses the distance to a specific cluster as a new feature.
- TextClustTETransformer
The Text Cluster Target Encoding Transformer clusters a TF-IDF matrix created from a text feature. The mean of the response is calculated for each cluster and this is used as a new feature. Cross Validation is used to calculate mean response to prevent overfitting.
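
To make the Weight of Evidence calculation used by the WoE transformers listed above more concrete, the following is a minimal sketch in plain Python. It is not the Driverless AI implementation. The function name ``weight_of_evidence``, the sample data, and the sign convention (log of the share of positives over the share of negatives) are assumptions made for illustration; other sources use the inverse ratio. The stratified k-fold step is omitted.

.. code-block:: python

    import math
    from collections import Counter

    def weight_of_evidence(categories, target):
        """Return {category: WoE} for a binary (0/1) target.

        Assumed convention: WoE = ln(share of positives / share of negatives).
        Categories that lack one of the two classes get None because the
        ratio is undefined without smoothing.
        """
        pos = Counter(c for c, y in zip(categories, target) if y == 1)
        neg = Counter(c for c, y in zip(categories, target) if y == 0)
        total_pos = sum(pos.values())
        total_neg = sum(neg.values())
        woe = {}
        for c in set(categories):
            if pos[c] == 0 or neg[c] == 0:
                woe[c] = None
            else:
                woe[c] = math.log((pos[c] / total_pos) / (neg[c] / total_neg))
        return woe

    # Hypothetical categorical column and binary response.
    categories = ["A", "A", "A", "B", "B", "B"]
    target = [1, 1, 0, 1, 0, 0]
    woe = weight_of_evidence(categories, target)

    # The new feature is the WoE value looked up for each row's category.
    woe_feature = [woe[c] for c in categories]

In this sample, category ``A`` holds two of the three positives and one of the three negatives, so its value is the log of 2, and category ``B`` receives the same magnitude with the opposite sign.
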
Example Transformations
-----------------------
This section describes some of the available transformations, using a house price prediction dataset as the example.
+--------------+------------------+------------+-------------+---------+---------+
| Date Built | Square Footage | Num Beds | Num Baths | State | Price |
+==============+==================+============+=============+=========+=========+
| 01/01/1920 | 1700 | 3 | 2 | NY | $700K |
+--------------+------------------+------------+-------------+---------+---------+
Frequent Transformer
~~~~~~~~~~~~~~~~~~~~
- the count of each categorical value in the dataset
- the count can be either the raw count or the normalized count
+--------------+------------------+------------+-------------+---------+-----------+---------------+
| Date Built | Square Footage | Num Beds | Num Baths | State | Price | Freq\_State |
+==============+==================+============+=============+=========+===========+===============+
| 01/01/1920 | 1700 | 3 | 2 | NY | 700,000 | 4,500 |
+--------------+------------------+------------+-------------+---------+-----------+---------------+
There are 4,500 properties in this dataset with state = NY.
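
The raw and normalized counts described above can be sketched in plain Python as follows. This is an illustration only, not the Driverless AI implementation; the function name ``frequency_encode`` and the sample values are hypothetical.

.. code-block:: python

    from collections import Counter

    def frequency_encode(values, normalize=False):
        """Replace each categorical value with its count in the column.

        With normalize=True the count is divided by the number of rows.
        """
        counts = Counter(values)
        total = len(values)
        if normalize:
            return [counts[v] / total for v in values]
        return [counts[v] for v in values]

    # Hypothetical sample of the State column.
    states = ["NY", "CA", "NY", "TX", "NY", "CA"]

    freq_raw = frequency_encode(states)
    freq_norm = frequency_encode(states, normalize=True)

Here ``NY`` appears three times in six rows, so its raw count is 3 and its normalized count is 0.5.
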
Bulk Interactions Transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- add, divide, multiply, and subtract two columns in the data
+--------------+------------------+------------+-------------+---------+-----------+---------------------------------------+
| Date Built | Square Footage | Num Beds | Num Baths | State | Price | Interaction_NumBeds#subtract#NumBaths |
+==============+==================+============+=============+=========+===========+=======================================+
| 01/01/1920 | 1700 | 3 | 2 | NY | 700,000 | 1 |
+--------------+------------------+------------+-------------+---------+-----------+---------------------------------------+
This property has one more bedroom than it has bathrooms.
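
A minimal sketch of the four pairwise interactions in plain Python is shown below. It is not the Driverless AI implementation; the ``interaction`` helper, the sample columns, and the choice to return ``None`` when dividing by zero are assumptions made for illustration.

.. code-block:: python

    import operator

    # The four operations named in the transformer description.
    OPS = {
        "add": operator.add,
        "subtract": operator.sub,
        "multiply": operator.mul,
        "divide": operator.truediv,
    }

    def interaction(left, right, op):
        """Combine two numeric columns element-wise.

        Division by zero yields None (an assumption of this sketch).
        """
        fn = OPS[op]
        result = []
        for a, b in zip(left, right):
            if op == "divide" and b == 0:
                result.append(None)
            else:
                result.append(fn(a, b))
        return result

    # Hypothetical sample columns.
    num_beds = [3, 4, 2]
    num_baths = [2, 2, 1]

    beds_minus_baths = interaction(num_beds, num_baths, "subtract")

For the first row, 3 bedrooms minus 2 bathrooms gives 1, which matches the value in the table above.
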
Truncated SVD Numeric Transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- truncated SVD trained on selected numeric columns of the data
- the components of the truncated SVD will be new features
+--------------+------------------+------------+-------------+---------+-----------+------------------------------------+
| Date Built | Square Footage | Num Beds | Num Baths | State | Price | TruncSVD_Price_NumBeds_NumBaths_1 |
+==============+==================+============+=============+=========+===========+====================================+
| 01/01/1920 | 1700 | 3 | 2 | NY | 700,000 | 0.632 |
+--------------+------------------+------------+-------------+---------+-----------+------------------------------------+
This feature is the first component of the truncated SVD of the columns Price, Number of Beds, and Number of Baths.
Dates Transformer
~~~~~~~~~~~~~~~~~
- get year, get quarter, get month, get day, get day of year, get week, get week day, get hour, get minute, get second
+--------------+------------------+------------+-------------+---------+-----------+--------------------+
| Date Built | Square Footage | Num Beds | Num Baths | State | Price | DateBuilt\_Month |
+==============+==================+============+=============+=========+===========+====================+
| 01/01/1920 | 1700 | 3 | 2 | NY | 700,000 | 1 |
+--------------+------------------+------------+-------------+---------+-----------+--------------------+
The home was built in January.
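
The date parts listed above can be extracted with the Python standard library, as in the following sketch. It is an illustration only. The ``date_parts`` helper is hypothetical, and the ``%m/%d/%Y`` format is an assumption, since the sample value ``01/01/1920`` does not reveal whether the month or the day comes first.

.. code-block:: python

    from datetime import datetime

    def date_parts(text, fmt="%m/%d/%Y"):
        """Parse a date string and return the parts named in the list above."""
        dt = datetime.strptime(text, fmt)
        return {
            "year": dt.year,
            "quarter": (dt.month - 1) // 3 + 1,
            "month": dt.month,
            "day": dt.day,
            "day_of_year": dt.timetuple().tm_yday,
            "week": dt.isocalendar()[1],   # ISO week number
            "week_day": dt.weekday(),      # Monday is 0
            "hour": dt.hour,
            "minute": dt.minute,
            "second": dt.second,
        }

    parts = date_parts("01/01/1920")

The ``month`` entry of ``parts`` is 1, which corresponds to the ``DateBuilt_Month`` value in the table above.
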
Text Transformer
~~~~~~~~~~~~~~~~
- transform a text column using one of two methods: TF-IDF or count (count of the word)
- this may be followed by dimensionality reduction using truncated SVD
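
The example dataset has no text column, so the sketch below uses a hypothetical list of property descriptions to show what a count matrix and a TF-IDF matrix look like. It is plain Python, not the Driverless AI implementation. The whitespace tokenizer and the unsmoothed ``log(N / df)`` inverse document frequency are assumptions; libraries commonly use smoothed variants. The optional truncated SVD step is omitted.

.. code-block:: python

    import math
    from collections import Counter

    def tokenize(text):
        """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
        return text.lower().split()

    def count_matrix(docs):
        """Return (vocabulary, rows) where rows[i][j] counts term j in doc i."""
        tokenized = [tokenize(d) for d in docs]
        vocab = sorted({tok for doc in tokenized for tok in doc})
        rows = []
        for doc in tokenized:
            counts = Counter(doc)
            rows.append([counts[term] for term in vocab])
        return vocab, rows

    def tfidf_matrix(docs):
        """Weight each count by log(N / document frequency) of its term."""
        vocab, counts = count_matrix(docs)
        n_docs = len(docs)
        doc_freq = [sum(1 for row in counts if row[j] > 0)
                    for j in range(len(vocab))]
        idf = [math.log(n_docs / df) for df in doc_freq]
        weighted = [[row[j] * idf[j] for j in range(len(vocab))]
                    for row in counts]
        return vocab, weighted

    # Hypothetical text column.
    descriptions = ["large sunny kitchen", "sunny garden", "large garage"]

    vocab, counts = count_matrix(descriptions)
    _, tfidf = tfidf_matrix(descriptions)

Terms that appear in fewer documents, such as ``kitchen``, receive a larger weight than terms shared by several documents, such as ``sunny``.
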
Categorical Target Encoding Transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- cross validation target encoding done on a categorical column
+--------------+------------------+------------+-------------+---------+-----------+-----------------+
| Date Built | Square Footage | Num Beds | Num Baths | State | Price | CV\_TE\_State |
+==============+==================+============+=============+=========+===========+=================+
| 01/01/1920 | 1700 | 3 | 2 | NY | 700,000 | 550,000 |
+--------------+------------------+------------+-------------+---------+-----------+-----------------+
The average price of properties in NY state is $550,000\*.
\*In order to prevent overfitting, Driverless AI calculates this average on out-of-fold data using cross validation.
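
The out-of-fold calculation mentioned in the note above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Driverless AI implementation: the function name ``cv_target_encode``, the round-robin fold assignment, the fallback to the training-fold mean for unseen categories, and the sample data are all hypothetical.

.. code-block:: python

    def cv_target_encode(categories, target, n_folds=3):
        """Encode each row with the mean target of its category,
        computed only from rows in the other folds."""
        n = len(categories)
        fold_of = [i % n_folds for i in range(n)]  # round-robin folds
        encoded = [0.0] * n
        for fold in range(n_folds):
            sums, counts = {}, {}
            train_sum, train_count = 0.0, 0
            # Accumulate statistics from every row outside the held-out fold.
            for i in range(n):
                if fold_of[i] != fold:
                    c = categories[i]
                    sums[c] = sums.get(c, 0.0) + target[i]
                    counts[c] = counts.get(c, 0) + 1
                    train_sum += target[i]
                    train_count += 1
            fallback = train_sum / train_count
            # Encode the held-out rows using only those statistics.
            for i in range(n):
                if fold_of[i] == fold:
                    c = categories[i]
                    encoded[i] = sums[c] / counts[c] if c in counts else fallback
        return encoded

    # Hypothetical sample: State and Price (in thousands of dollars).
    states = ["NY", "NY", "NY", "CA", "CA", "NY"]
    prices = [700, 500, 600, 300, 400, 400]

    cv_te_state = cv_target_encode(states, prices)

The first row is encoded as 500.0, the mean of the three NY prices in the other folds, rather than 550.0, the mean over all four NY rows. Excluding the row's own price is what keeps the feature from leaking the target.
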
Numeric to Categorical Target Encoding Transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- numeric column converted to categorical by binning
- cross validation target encoding done on the binned numeric column
+--------------+------------------+------------+-------------+---------+-----------+-------------------------+
| Date Built | Square Footage | Num Beds | Num Baths | State | Price | CV\_TE\_SquareFootage |
+==============+==================+============+=============+=========+===========+=========================+
| 01/01/1920 | 1700 | 3 | 2 | NY | 700,000 | 345,000 |
+--------------+------------------+------------+-------------+---------+-----------+-------------------------+
The column ``Square Footage`` has been bucketed into 10 equally populated bins. This property lies in the ``Square Footage`` bucket 1,572 to 1,749. The average price of properties with this range of square footage is $345,000\*.
\*In order to prevent overfitting, Driverless AI calculates this average on out-of-fold data using cross validation.
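
The binning step described above can be sketched in plain Python as follows. It is an illustration only, not the Driverless AI implementation; the helper names, the cut-point rule, and the sample values are assumptions, and the sketch uses 4 bins rather than 10 to keep the data small. The resulting bin index can then be target encoded on out-of-fold data like any other categorical column.

.. code-block:: python

    from bisect import bisect_right

    def quantile_bin_edges(values, n_bins):
        """Return n_bins - 1 cut points that split the values into
        roughly equally populated bins."""
        ordered = sorted(values)
        n = len(ordered)
        return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

    def assign_bin(value, edges):
        """Return the index of the bin that contains value."""
        return bisect_right(edges, value)

    # Hypothetical sample of the Square Footage column.
    square_footage = [900, 1200, 1500, 1700, 2100, 2600, 3000, 3400]

    edges = quantile_bin_edges(square_footage, n_bins=4)
    bins = [assign_bin(v, edges) for v in square_footage]

With these eight values and four bins, each bin receives two rows, and the property with 1,700 square feet falls into bin 1.
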
Cluster Target Encoding Transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- selected columns in the data are clustered
- target encoding is done on the cluster ID
+--------------+------------------+------------+-------------+---------+-----------+--------------------------------------------+
| Date Built | Square Footage | Num Beds | Num Baths | State | Price | ClusterTE_4_NumBeds_NumBaths_SquareFootage |
+==============+==================+============+=============+=========+===========+============================================+
| 01/01/1920 | 1700 | 3 | 2 | NY | 700,000 | 450,000 |
+--------------+------------------+------------+-------------+---------+-----------+--------------------------------------------+
The columns ``Num Beds``, ``Num Baths``, and ``Square Footage`` have been segmented into 4 clusters. The average price of properties in the same cluster as the selected property is $450,000\*.
\*In order to prevent overfitting, Driverless AI calculates this average on out-of-fold data using cross validation.
Cluster Distance Transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- selected columns in the data are clustered
- the distance to a chosen cluster center is calculated
+--------------+------------------+------------+-------------+---------+-----------+------------------------------------------------+
| Date Built | Square Footage | Num Beds | Num Baths | State | Price | ClusterDist_4_NumBeds_NumBaths_SquareFootage_1 |
+==============+==================+============+=============+=========+===========+================================================+
| 01/01/1920 | 1700 | 3 | 2 | NY | 700,000 | 0.83 |
+--------------+------------------+------------+-------------+---------+-----------+------------------------------------------------+
The columns ``Num Beds``, ``Num Baths``, and ``Square Footage`` have been segmented into 4 clusters. The distance from this record to Cluster 1 is 0.83.
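
The distance calculation can be sketched in plain Python as follows. This is an illustration only, not the Driverless AI implementation. It assumes Euclidean distance, assumes that the cluster centers have already been found by a clustering step that is not shown, and uses hypothetical, already scaled values for the row and the centers.

.. code-block:: python

    import math

    def distance_to_center(row, center):
        """Euclidean distance between one record and one cluster center."""
        return math.sqrt(sum((x - c) ** 2 for x, c in zip(row, center)))

    # Hypothetical cluster centers over three scaled numeric columns.
    centers = {
        1: (0.0, 0.0, 0.0),
        2: (0.0, 3.0, 0.0),
    }

    # Hypothetical scaled record.
    row = (0.0, 3.0, 4.0)

    # One candidate feature per cluster: the distance to that cluster's center.
    distances = {cid: distance_to_center(row, c) for cid, c in centers.items()}

In this sample the record lies 5.0 from the center of Cluster 1 and 4.0 from the center of Cluster 2. A feature such as the one in the table above corresponds to keeping the distance to a single chosen cluster.
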