Version: v0.14.0

Adversarial similarity

Adversarial Similarity is a validation test that helps you identify similar or dissimilar segments of two datasets. Comparing the feature distributions of two datasets can indicate how similar they are, but instead of inspecting every feature individually, you can run an Adversarial Similarity test. The test uses decision tree algorithms to find similar or dissimilar rows between the training dataset and any other dataset that has the same training columns.

An Adversarial Similarity test fits Gradient Boosted Decision Trees (GBDT) on a dataset obtained by concatenating the training dataset with another dataset (often one with similar training columns). In the concatenated dataset, a new target column is created in which rows from the training dataset are assigned 0 and rows from the other dataset are assigned 1. For example, the image below shows the training dataset concatenated with a test dataset.

[Image: concatenating-datasets.png, showing the training dataset concatenated with a test dataset]
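The concatenation and labeling step can be sketched in a few lines of pandas. This is only an illustration under assumptions, not the product's implementation: the helper name build_adversarial_frame, the target column name is_other, and the toy columns age and income are all hypothetical.

```python
import pandas as pd


def build_adversarial_frame(train: pd.DataFrame, other: pd.DataFrame,
                            target_col: str = "is_other") -> pd.DataFrame:
    """Stack `train` and `other`, labeling their rows 0 and 1 respectively.

    `target_col` is a hypothetical name for the new target column.
    """
    # Keep only the columns the two datasets share, in training order.
    shared = [col for col in train.columns if col in other.columns]
    return pd.concat(
        [
            train[shared].assign(**{target_col: 0}),  # training rows -> 0
            other[shared].assign(**{target_col: 1}),  # other rows -> 1
        ],
        ignore_index=True,
    )


# Toy data standing in for a training dataset and a test dataset.
train = pd.DataFrame({"age": [25, 32, 47], "income": [40_000, 52_000, 88_000]})
test = pd.DataFrame({"age": [29, 61], "income": [45_000, 30_000]})

combined = build_adversarial_frame(train, test)
print(combined)
```

The resulting frame has one row per original row plus the new target column, which is what the GBDT model is then trained to predict.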

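The test generates a predicted score for each row of the concatenated dataset. Each score is the model's prediction for the new target column, so the higher the score, the more dissimilar the row is to the training dataset. These scores can be used to further analyze the most different or most similar rows. How well the scores separate the two datasets is summarized by the Area Under the Curve (AUC): a value close to 0.5 means the model cannot tell the datasets apart, while a value close to 1 means they are easy to distinguish.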
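As a rough sketch of how such row-level scores and an overall AUC could be produced, the example below uses scikit-learn's GradientBoostingClassifier as a stand-in GBDT. The synthetic data, the use of 5-fold out-of-fold predictions, and all variable names are assumptions made for illustration. They are not taken from the product.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins: the "other" dataset has a shifted first feature.
train_X = rng.normal(size=(300, 3))
other_X = rng.normal(size=(300, 3))
other_X[:, 0] += 2.0

# Concatenate and label: 0 = training rows, 1 = rows of the other dataset.
X = np.vstack([train_X, other_X])
y = np.concatenate([
    np.zeros(len(train_X), dtype=int),
    np.ones(len(other_X), dtype=int),
])

model = GradientBoostingClassifier(random_state=0)

# Out-of-fold probabilities (an assumption of this sketch), so every row is
# scored by a model that did not see that row during fitting.
scores = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

# Overall separability of the two datasets.
auc = roc_auc_score(y, scores)
print(f"AUC: {auc:.3f}")

# Rows with the highest scores are the least similar to the training dataset.
most_dissimilar = np.argsort(-scores)[:5]
print("Most dissimilar row indices:", most_dissimilar)
```

Because one feature was deliberately shifted, the AUC here should land well above 0.5. With two samples drawn from the same distribution it would be expected to stay close to 0.5.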

