What is Term Frequency-Inverse Document Frequency?
Term Frequency-Inverse Document Frequency, or TF-IDF, is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
This is done by multiplying two metrics:
- How many times a word appears in a document
- The inverse document frequency of the word across a set of documents
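In symbols, one common formulation looks like this (notation varies across references; t is a term, d is a document, and D is the collection of N documents):

```latex
% One common (unsmoothed) formulation; N is the number of documents in D
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D),
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{\, d \in D : t \in d \,\} \rvert}
```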
It has many uses, most importantly in automated text analysis, and it is particularly valuable for scoring words in machine learning algorithms for Natural Language Processing (NLP).
TF-IDF was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as “this,” “what,” and “if,” rank low even though they may appear many times, since they don’t mean much to that document in particular.
How is Term Frequency-Inverse Document Frequency calculated?
TF-IDF is frequently used in machine learning algorithms in various capacities, including stop-word removal. Stop words are common words like “a, the, an, it” that occur frequently but hold little informational value. TF-IDF consists of two components: term frequency and inverse document frequency.
Term frequency can be determined by counting the number of occurrences of a term in a document.
IDF is calculated by dividing the total number of documents by the number of documents in the collection that contain the term, then taking the logarithm of that ratio; the log dampens the effect of IDF. This is useful for reducing the weight of terms that are common within a collection of documents.
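As a minimal sketch in Python of this unsmoothed variant (raw counts for TF, log of N over document frequency for IDF; production libraries typically add smoothing and normalization on top):

```python
import math
from collections import Counter

def tf(term, document):
    # Raw count of the term in the document (one common TF variant).
    return Counter(document.lower().split())[term]

def idf(term, corpus):
    # log(total documents / documents containing the term);
    # dampens the weight of terms common across the collection.
    n_containing = sum(1 for doc in corpus if term in doc.lower().split())
    if n_containing == 0:
        return 0.0
    return math.log(len(corpus) / n_containing)

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are the best pets",
]
print(tfidf("cat", corpus[0], corpus))  # ~0.405: "cat" is in 2 of 3 documents
print(tfidf("the", corpus[0], corpus))  # 0.0: "the" appears in every document
```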
Why is Term Frequency-Inverse Document Frequency used in Machine Learning?
Machine learning with natural language faces one major hurdle: its algorithms usually deal with numbers, and natural language is, well, text. So we need to transform that text into numbers, a process known as text vectorization. It’s a fundamental step in machine learning for analyzing text data, and different vectorization algorithms will drastically affect end results, so you need to choose one that delivers the results you’re hoping for.
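As a short sketch of that vectorization step, here is scikit-learn’s TfidfVectorizer with default parameters (this assumes scikit-learn 1.0 or newer is installed; it applies smoothing and normalization on top of the basic formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are the best pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(X.shape)                             # (3 documents, vocabulary size)
```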
Once you’ve transformed words into numbers in a way that machine learning algorithms can understand, the TF-IDF score can be fed to algorithms such as Naive Bayes and Support Vector Machines, greatly improving the results of more basic methods like word counts.
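As a hedged sketch of that hand-off, here is a scikit-learn Pipeline that feeds TF-IDF features into a Naive Bayes classifier (the tiny labeled dataset is invented purely for illustration):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled data, invented for illustration only.
texts = [
    "great match last night",
    "stocks fell sharply today",
    "the team won the championship",
    "markets rallied on strong earnings",
]
labels = ["sports", "finance", "sports", "finance"]

# Vectorize with TF-IDF, then classify with Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["the championship match"]))  # expected: ['sports']
```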
What are the applications of Term Frequency-Inverse Document Frequency?
Determining how relevant a word is to a document, or TF-IDF, is useful in many ways, for example:
1. Information Retrieval
TF-IDF was invented for document search and can be used to deliver results that are most relevant to what you’re searching for. Imagine you have a search engine and somebody looks for “LeBron”. The results will be displayed in order of relevance. That is to say, the most relevant sports articles will be ranked higher because TF-IDF gives the word “LeBron” a higher score.
It’s likely that every search engine you have ever encountered uses TF-IDF scores in its algorithm.
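As a rough sketch of that ranking idea (a real search engine layers far more signals on top of this), using the same TfidfVectorizer as above to score documents for a query term:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "lebron james scored 40 points in the playoff game",
    "the senate passed the budget bill yesterday",
    "lebron led the lakers to another win",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Rank documents by the TF-IDF weight of the query term "lebron".
col = vectorizer.vocabulary_["lebron"]
scores = X[:, col].toarray().ravel()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {docs[i]}")
```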
2. Keyword Extraction
TF-IDF is also useful for extracting keywords from text. How? The highest scoring words of a document are the most relevant to that document, and therefore they can be considered keywords for that document. Pretty straightforward.
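A minimal sketch of that idea in the same vein (the top-3 cutoff is arbitrary, chosen only for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat while the dog slept",
    "the stock market rallied after the earnings report",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
words = vectorizer.get_feature_names_out()

for row in range(X.shape[0]):
    scores = X[row].toarray().ravel()
    top = scores.argsort()[::-1][:3]  # indices of the 3 highest-scoring words
    print([words[i] for i in top])    # candidate keywords for this document
```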
Why does this work? Simply put, a word vector represents a document as a list of numbers, with one number for each possible word in the corpus. Vectorizing a document means taking its text and creating one of these vectors, and the numbers of the vector represent the content of the text. TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is in that document. Then, documents with similar, relevant words will have similar vectors, which is what we are looking for in a machine learning algorithm.
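To make the “similar vectors” point concrete, here is a small sketch comparing TF-IDF vectors with cosine similarity (the documents are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat chased the mouse",
    "a cat and a mouse played",     # similar topic to the first document
    "the stock market fell today",  # different topic
]

X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)  # pairwise cosine similarity of document vectors

print(round(sim[0, 1], 3))  # higher: both documents are about a cat and a mouse
print(round(sim[0, 2], 3))  # lower: little shared, relevant vocabulary
```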