What is statistical language modeling in NLP?
Statistical language modeling, often shortened to language modeling or LM, is the development of probabilistic models that can predict the next word in a sequence given the words that precede it.
A statistical language model learns the probability of word occurrence from examples of text. Simpler models may consider only a short window of preceding words, whereas larger models may work at the level of sentences or paragraphs. Most commonly, language models operate at the level of words.
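At its core, such a model assigns a probability to an entire sequence by factoring it with the chain rule of probability (a standard formulation, shown here for reference):

```latex
% Chain-rule factorization that a language model estimates:
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```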
You could develop a language model and use it on its own, for example to generate new sequences of text that appear to have come from the training corpus.
Language modeling is a core component of a wide range of natural language processing tasks. Language models are generally used on the front end or back end of a more sophisticated model for a task that requires language understanding.
What are the types of statistical language models?
Statistical language models assign probabilities to word sequences so that the next word can be predicted from the words that precede it. A number of statistical language models are already in use. Let's take a look at some of the popular ones:
1. N-Gram
This is one of the simplest approaches to language modeling. Here, a probability distribution is created over sequences of n words, where n can be any number and defines the size of the gram (the sequence of words being assigned a probability). If n = 4, a gram may look like: "can you help me". Essentially, n is the amount of context the model is trained to consider. Depending on n, these models are called unigram, bigram, trigram, and so on.
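As a minimal sketch (the sentence and values of n below are invented for illustration), here is how n-grams can be extracted from a tokenized sentence in Python:

```python
# Minimal sketch: extracting n-grams from a tokenized sentence.
# The example sentence and choices of n are illustrative only.

def ngrams(tokens, n):
    """Return the list of n-grams (tuples of n consecutive tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "can you help me with this".split()
print(ngrams(tokens, 2))  # bigrams: ('can', 'you'), ('you', 'help'), ...
print(ngrams(tokens, 4))  # 4-grams: ('can', 'you', 'help', 'me'), ...
```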
2. Exponential
This type of statistical model evaluates text using an equation that combines n-grams with feature functions. Here the features and parameters of the desired results are specified in advance. The model is based on the principle of maximum entropy, which states that, among the distributions consistent with the observed features, the probability distribution with the most entropy is the best choice. Exponential models make fewer statistical assumptions, which means their results are more likely to be accurate.
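To make the "combination of n-grams and feature functions" concrete, here is a hedged sketch of the log-linear (maximum-entropy) form such models take; the feature functions, weights, and vocabulary below are invented for illustration, not a trained model:

```python
import math

# Sketch of an exponential (log-linear / maximum-entropy) language model:
#   P(w | h) = exp(sum_i lambda_i * f_i(h, w)) / Z(h)
# The feature functions and weights are toy examples.

def features(history, word):
    """Binary feature functions over (history, word) pairs."""
    return [
        1.0 if history[-1:] == ["help"] and word == "me" else 0.0,  # bigram feature
        1.0 if word == "me" else 0.0,                                # unigram feature
    ]

weights = [1.5, 0.3]  # illustrative lambdas
vocab = ["me", "you", "him"]

def prob(history, word):
    scores = {w: math.exp(sum(l * f for l, f in zip(weights, features(history, w))))
              for w in vocab}
    z = sum(scores.values())  # partition function Z(h)
    return scores[word] / z

print(prob(["can", "you", "help"], "me"))
```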
3. Continuous Space
In this type of statistical model, words are represented as vectors of weights combined non-linearly in a neural network. The process of mapping a word to such a vector of weights is known as word embedding. This type of model proves helpful in scenarios where the vocabulary keeps growing and acquiring unique words.
In cases where the data set is large and contains rarely used or unique words, count-based models such as n-grams struggle. This is because, as the vocabulary grows, the number of possible word sequences explodes, so the patterns predicting the next word become sparser and weaker.
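As a toy sketch of the continuous-space idea (the dimensions, vocabulary, and random weights below are invented; a real model learns these parameters from data), each word is a dense vector, and a composition of context vectors is projected to a probability over the next word:

```python
import numpy as np

# Toy sketch of a continuous-space model: each word is a dense vector
# (an embedding), and the next word is scored by a learned projection.
# Random weights stand in for parameters a real model would learn.

rng = np.random.default_rng(0)
vocab = ["the", "white", "black", "horse"]
dim = 8
embeddings = rng.normal(size=(len(vocab), dim))    # one vector per word
output_weights = rng.normal(size=(dim, len(vocab)))

def next_word_probs(context_words):
    # Average the context embeddings (a simple composition choice),
    # project to vocabulary scores, and normalize with a softmax.
    idxs = [vocab.index(w) for w in context_words]
    h = embeddings[idxs].mean(axis=0)
    logits = h @ output_weights
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

print(dict(zip(vocab, next_word_probs(["the", "white"]))))
```

Because related words end up with nearby vectors, such a model can assign a reasonable probability to a phrase it never saw in training, which is exactly where discrete n-gram counts break down.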
How do you build a simple Statistical Language Model?
Language models start with a Markov assumption: a simplifying assumption that the next word depends only on the previous k words rather than on the entire history. Taking k = 1, so that each word depends only on the word immediately before it, yields a bigram model. The models are trained using Maximum Likelihood Estimation (MLE) on an existing corpus; the MLE estimate is then simply a ratio of word counts, P(wᵢ | wᵢ₋₁) = count(wᵢ₋₁, wᵢ) / count(wᵢ₋₁).
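As a minimal sketch (the tiny corpus below is invented for illustration), here is how such a bigram MLE model might be built in Python:

```python
from collections import Counter

# Minimal sketch of a bigram model trained with Maximum Likelihood
# Estimation: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
# The tiny "corpus" is illustrative only.

corpus = "the white horse ran and the black dog ran".split()
unigrams = Counter(corpus[:-1])             # counts of words that have a successor
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs

def mle(prev, word):
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(mle("the", "white"))  # 0.5: 'the' is followed by 'white' once out of twice
```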
There are some advantages to using traditional n-gram language models:
- They are easy to train on a large corpus.
- They work surprisingly well on most tasks!
However, they also have some disadvantages, which are covered in the drawbacks section below.
What are the applications of statistical language modeling?
Statistical language models are used to generate or score text in many natural language processing tasks, such as:
- Speech Recognition
Voice assistants such as Siri and Alexa are examples of how language models help machines process speech audio.
- Machine Translation
Google Translate and Microsoft Translator are examples of how NLP models can help translate one language into another.
- Sentiment Analysis
This helps in analyzing the sentiment behind a phrase. Businesses use this kind of NLP model in products that help them understand the intent behind the opinions or attitudes customers express in text. HubSpot's Service Hub is an example of how language models can help with sentiment analysis.
- Text Suggestions
Google services such as Gmail and Google Docs use language models to offer text suggestions while users compose an email or write long documents.
- Parsing Tools
Parsing involves analyzing whether sentences or words comply with syntax and grammar rules. Spell-checking tools are a familiar example of language modeling and parsing in action.
Language models are also used to generate text in other language processing tasks such as optical character recognition, handwriting recognition, and image captioning.
What are the drawbacks of statistical language modeling?
1. Zero probabilities
Suppose we have a tri-gram language model that conditions on two words and has a vocabulary of 10,000 words. Then there are 10¹² possible triplets. If our training data has 10¹⁰ words, many triplets will never be observed in the training data, so basic MLE will assign those events zero probability. And a zero probability translates to infinite perplexity. To overcome this issue, many techniques have been developed under the family of smoothing techniques; the simplest is sketched below. A good overview of these techniques is presented in this paper.
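As a minimal sketch of the simplest such technique, add-one (Laplace) smoothing increments every count by one so that unseen pairs receive a small nonzero probability; the toy corpus below is invented for illustration:

```python
from collections import Counter

# Sketch of add-one (Laplace) smoothing for a bigram model: every count
# is incremented by 1, so unseen pairs get a small nonzero probability
# instead of zero. The corpus is toy data.

corpus = "the white horse ran and the white horse jumped".split()
V = len(set(corpus))  # vocabulary size
unigrams = Counter(corpus[:-1])
bigrams = Counter(zip(corpus, corpus[1:]))

def smoothed(prev, word):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(smoothed("white", "horse"))  # seen pair: relatively high probability
print(smoothed("horse", "the"))   # unseen pair: small but nonzero
```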
2. Exponential Growth
The second challenge is that the number of possible n-grams grows as the vocabulary size raised to the nth power. A 10,000-word vocabulary yields 10¹² possible tri-grams, and a 100,000-word vocabulary yields 10¹⁵.
3. Generalization
The last issue with MLE techniques is the lack of generalization. If the model sees the term 'white horse' in the training data but never sees 'black horse', MLE will assign zero probability to 'black horse'. (Thankfully, it will assign zero probability to 'purple horse' as well.)