TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical measure that aims to reflect how important a word is to a document in a collection of documents (also known as a corpus).
TF-IDF, as its name suggests, is composed of two statistical measures: it is the product of TF (term frequency) and IDF (inverse document frequency). The following terms are used in the equations below:
\(D\) - a collection of documents (corpus)
\(d\) - a document from corpus \(D\)
\(w\) - a word from some document \(d\)
TF (Term Frequency)
TF is a statistical measure expressing how frequently a word appears in a document. The implementation used in H2O TF-IDF is as follows:
\(TF(w,d)\) - number of occurrences of a word \(w\) in a document \(d\)
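As a minimal sketch of the raw-count definition above (not H2O's own code), TF for a single document can be computed with the standard library. The helper name `term_frequency` and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter


def term_frequency(words):
    """Return TF(w, d) as raw occurrence counts for one document's words.

    `words` is the list of words of a single document d (hypothetical helper).
    """
    return Counter(words)


# Example document, tokenized on whitespace (an assumption for this sketch).
doc = "a b a c a b".split()
tf = term_frequency(doc)
```

A `Counter` returns 0 for words that do not occur in the document, which matches the definition of TF as a plain occurrence count.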
IDF (Inverse Document Frequency)
IDF is a statistical measure expressing how much information a word provides. Put simply, it expresses whether a word is common or rare across all documents. IDF is computed using a statistical measure named DF (Document Frequency). The implementation of DF used in H2O TF-IDF is as follows:
\(DF(w)\) - number of documents from \(D\) which contain a word \(w\)
The implementation of IDF used in H2O TF-IDF is as follows:
\(IDF(w) = \log\frac{|D| + 1}{DF(w) + 1}\)
Here \(|D|\) is the number of documents in the corpus \(D\), and the natural logarithm is used (i.e. the logarithm has base \(e\)). Based on the equation above, the IDF of a word present in all documents of the corpus is equal to 0, and the fewer documents contain the word, the higher its IDF value.
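The two properties stated above (an IDF of 0 for a word found in every document, and a higher IDF for rarer words) can be checked with a small sketch. It assumes the smoothed form \(\log\frac{|D| + 1}{DF(w) + 1}\) with the natural logarithm; the function names and the toy corpus are hypothetical and not part of H2O.

```python
import math


def document_frequency(docs):
    """DF(w): number of documents in the corpus that contain word w."""
    df = {}
    for words in docs.values():
        for w in set(words):  # count each word at most once per document
            df[w] = df.get(w, 0) + 1
    return df


def inverse_document_frequency(docs):
    """IDF(w) = ln((|D| + 1) / (DF(w) + 1)), the form assumed in this sketch."""
    n_docs = len(docs)
    return {
        w: math.log((n_docs + 1) / (count + 1))
        for w, count in document_frequency(docs).items()
    }


# Toy corpus: document ID -> list of words.
docs = {0: ["a", "b"], 1: ["a", "c"], 2: ["a", "b", "b"]}
idf = inverse_document_frequency(docs)
```

In this corpus `a` occurs in all three documents, `b` in two and `c` in one, so their IDF values increase in that order, starting from 0 for `a`.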
TF-IDF is defined as a product of the TF and IDF measures explained above:
\(\text{TF-IDF}(w,d) = TF(w,d) \cdot IDF(w)\)
frame: Documents or words frame for which TF-IDF values should be computed.
document_id_col: Index or name of a column containing document IDs.
text_col: Index or name of the column containing the text: whole documents if the data should be pre-processed, or individual words if the input data is already pre-processed (as set by the preprocess parameter).
preprocess: (Optional) A flag specifying whether input text data should be pre-processed. By default, data is pre-processed.
case_sensitive: (Optional) A flag specifying whether input data should be treated as case sensitive. By default, input data is treated as case sensitive.
Output is an H2OFrame whose rows consist of a document ID, a word, and the corresponding TF, IDF and TF-IDF values.
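To illustrate how the parameters relate to the output rows, here is a plain-Python sketch that mimics the described behaviour on a list of (document ID, text) pairs. It is not the H2O implementation: the function name `tf_idf_rows` is hypothetical, tokenization by whitespace is an assumption, and the IDF uses the smoothed natural-log form \(\log\frac{|D| + 1}{DF(w) + 1}\).

```python
import math
from collections import Counter


def tf_idf_rows(rows, preprocess=True, case_sensitive=True):
    """Return (document_id, word, TF, IDF, TF-IDF) tuples.

    rows: iterable of (document_id, text) pairs. With preprocess=True each
    text is a whole document and is split on whitespace (an assumption of
    this sketch); with preprocess=False each text is already a single word.
    """
    words_by_doc = {}
    for doc_id, text in rows:
        if not case_sensitive:
            text = text.lower()
        tokens = text.split() if preprocess else [text]
        words_by_doc.setdefault(doc_id, []).extend(tokens)

    n_docs = len(words_by_doc)
    # DF(w): each document contributes at most 1 per distinct word.
    df = Counter(w for words in words_by_doc.values() for w in set(words))

    out = []
    for doc_id, words in words_by_doc.items():
        for word, tf in Counter(words).items():
            idf = math.log((n_docs + 1) / (df[word] + 1))
            out.append((doc_id, word, tf, idf, tf * idf))
    return out


# Two raw documents, compared case-insensitively.
result = tf_idf_rows([(0, "A a b"), (1, "a c")], case_sensitive=False)

# The same text treated as case sensitive keeps "A" and "a" apart.
result_cs = tf_idf_rows([(0, "A a b"), (1, "a c")])

# Already pre-processed input: one word per row.
result_words = tf_idf_rows([(0, "a"), (0, "a"), (1, "a")], preprocess=False)
```

Each tuple corresponds to one row of the output frame described above. With case-insensitive matching, `A` and `a` are counted as the same word, so the first document yields a TF of 2 for `a`.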