Skip to main content
Version: v0.15.0

Adversarial similarity

Adversarial similarity refers to a validation test that assists in observing similar or dissimilar segments of two different datasets. Observing the feature distribution of two different datasets can indicate similarity or dissimilarity. An adversarial similarity test can be performed rather than going over all the features individually to observe the differences. During an adversarial similarity test, decision tree algorithms are leveraged to find similar or dissimilar rows between the train dataset and any dataset with the same train columns.

An adversarial similarity test sets Gradient Boosted Decision Trees (GBDT) on a dataset obtained by concatenating the train dataset with another dataset (often with similar train columns). In this concatenated dataset, a new target column is created where the rows of the train dataset and the other dataset are assigned 0s and 1s, respectively. For example, consider the following image, which concatenates the train dataset with the test dataset.


The test generates a predicted score for each row of the concatenated dataset. These scores can be used to further analyze the most different or similar rows of the dataset. The return scores refer to Area Under the Curve (AUC) values. The higher the score, the more dissimilar the row is to the train dataset.