What is language detection?
In natural language processing, language detection determines which natural language the given content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods.
Most NLP applications tend to be language-specific and therefore require monolingual data. To build an application in your target language, you may need to apply a preprocessing technique that filters out text written in non-target languages. This requires proper identification of the language of each input example.
What are the applications of language detection?
In Natural Language Processing (NLP), one may need to work with data sets that contain documents in various languages. Many NLP algorithms only work with specific languages because the training data is usually in a single language. It can be a valuable time saver to determine which language your data set is in before running more algorithms on it.
An example of language detection in action is web search. A web crawler will hit pages that are potentially written in any of many different languages. If this data is to be used by a search engine, the results will be most helpful to the end-user when the language of the results matches the language of the query. It is easy to see why a web developer who must work with content in multiple languages would want to build language detection into search functionality.
Spam filtering services that support multiple languages must identify the language that emails, online comments, and other input are written in before applying the actual spam filtering algorithms. Without such detection, content originating from specific countries, regions, or areas suspected of generating spam cannot be adequately filtered out of online platforms.
Language detection is commonly used to identify the language of business texts such as emails and chats. The technique identifies the language of a text and the points within the text where the language changes, all the way down to the word level. This matters because business texts (chats, emails, etc.) can be written in various languages. Identifying the main language is a vital step in natural language processing pipelines, so that each text can be processed by the relevant language-specific steps.
In some situations, people switch the language they are conversing in within a chat to avoid monitoring or to conceal illicit activity. Detecting the points at which the language switches can be useful in figuring out whether any suspicious activity is taking place.
How does language detection work?
Language classification relies on a reference body of specialized text called a 'corpus.' There is one corpus for each language the algorithm can identify. The input text is compared to each corpus, and pattern matching is used to identify the corpus with the strongest correlation.
Because there are so many potential words to profile in every language, computer scientists use algorithms called 'profiling algorithms' to create a subset of words for each language to be used for the corpus. The most common strategy is to choose very common words. For example, in English, we might choose words like "the," "and," "of," and "or."
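As a rough illustration, here is a minimal sketch of this idea in Python. The hand-picked common-word lists and the sample sentences are placeholders, not the output of a real profiling algorithm:

```python
# Minimal sketch: score input text against hand-picked common-word lists.
# The word lists below are illustrative placeholders, not full language profiles.
COMMON_WORDS = {
    "en": {"the", "and", "of", "or", "to", "in"},
    "de": {"der", "die", "und", "von", "oder", "zu"},
    "fr": {"le", "la", "et", "de", "ou", "dans"},
}

def detect_by_common_words(text: str) -> str:
    """Return the language whose common-word list matches the text best."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in COMMON_WORDS.items()
    }
    return max(scores, key=scores.get)

print(detect_by_common_words("the cat and the dog"))    # -> "en"
print(detect_by_common_words("der Hund und die Katze")) # -> "de"
```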
This approach works well when the input text is relatively long. The shorter the input phrase, the less likely these common words are to appear, and the less likely the algorithm is to classify it correctly. Moreover, some languages do not put spaces between written words, which makes isolating words in this way impossible.
To address this problem, researchers turned to working with sequences of characters directly, rather than relying on the text being split into words. Even when words are separated by spaces, depending on whole words alone often causes problems when analyzing short phrases.
There is no single way to carry out language identification or language detection. There are multiple statistical approaches to the task, each using different techniques to classify the data.
One approach involves comparing the compressibility of the text with the compressibility of texts in a set of known languages, a technique known as the mutual-information-based distance measure. The same approach has been used to empirically construct family trees of languages that closely correspond to the trees constructed using historical methods. The mutual-information-based distance measure is essentially equivalent to conventional model-based methods and is not usually considered either novel or better than simpler techniques.
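As a rough illustration of the compression idea (not the exact mutual-information formulation), the sketch below measures how many extra bytes an input adds when it is compressed together with a reference text in each candidate language. The reference texts here are tiny placeholders; a real system would use much larger corpora:

```python
import zlib

# Tiny placeholder reference texts; a real system would use much larger corpora.
REFERENCE_TEXTS = {
    "en": "the quick brown fox jumps over the lazy dog and runs into the forest",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und laeuft in den wald",
}

def compressed_size(text: str) -> int:
    return len(zlib.compress(text.encode("utf-8")))

def extra_bytes(sample: str, reference: str) -> int:
    # How much does the sample add to the compressed reference?
    # The more patterns the sample shares with the reference, the smaller this is.
    return compressed_size(reference + " " + sample) - compressed_size(reference)

def detect_by_compression(sample: str) -> str:
    return min(REFERENCE_TEXTS, key=lambda lang: extra_bytes(sample, REFERENCE_TEXTS[lang]))

print(detect_by_compression("the dog runs over the fox"))  # likely "en"
```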
Cavnar and Trenkle (1994) and Dunning (1994) described another approach: building an n-gram language model from a training text for each language. Cavnar and Trenkle suggested basing these models on characters, while Dunning suggested basing them on encoded bytes; in the byte-based case, language identification and character-encoding detection are integrated. To identify a piece of text, a similar model is built from it and compared to every stored language model; the language whose model is most similar to the model built from the input is the most likely one. This technique runs into trouble when the input text is in a language for which there is no model, in which case it simply returns some other 'most similar' language as the result. It also struggles when the input text is composed of multiple languages.
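Below is a minimal sketch of the character n-gram profile idea, in the spirit of Cavnar and Trenkle's rank-based 'out-of-place' measure. The training texts are placeholders and far too small for real use:

```python
from collections import Counter

def ngram_profile(text: str, n_max: int = 3, top_k: int = 300) -> list:
    """Rank the text's character n-grams (1..n_max) by frequency, most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile: list, lang_profile: list) -> int:
    """Sum of rank differences between a document profile and a language profile."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - ranks[gram]) if gram in ranks else max_penalty
        for i, gram in enumerate(doc_profile)
    )

# Placeholder training texts; real profiles are built from much larger corpora.
TRAINING_TEXTS = {
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}
LANG_PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING_TEXTS.items()}

def detect_by_ngrams(text: str) -> str:
    doc_profile = ngram_profile(text)
    return min(LANG_PROFILES, key=lambda lang: out_of_place(doc_profile, LANG_PROFILES[lang]))

print(detect_by_ngrams("the lazy dog"))  # likely "en"
```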
Řehůřek and Kolkus (2009) put forth a method that can detect several languages in an unstructured piece of text and performs robustly on short texts containing only a few words, a case that the n-gram techniques tend to struggle with.
Accuracy and limitations of language detection
Accuracy can be increased by training a new model and doing the following:
1. Adding more varied training set data
2. Increasing training set size
3. Modifying the following fastText hyperparameters (a sketch follows this list)
- Iterations
- Learning rate
- Sub-word length
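For example, with a fastText-based model these hyperparameters map onto training options roughly as follows. This is a minimal sketch assuming the `fasttext` Python package and a labelled training file; the file name and the parameter values are placeholders:

```python
import fasttext

# Hypothetical training file in fastText's supervised format, one example per line,
# e.g. "__label__en this is an english sentence".
model = fasttext.train_supervised(
    input="lang_train.txt",
    epoch=25,   # iterations over the training set
    lr=0.5,     # learning rate
    minn=2,     # shortest character sub-word (n-gram) length
    maxn=5,     # longest character sub-word (n-gram) length
)

labels, probabilities = model.predict("ceci est une phrase en francais")
print(labels, probabilities)
```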
The model does not give perfect 100% accuracy, due to issues that include:
1. The model may not perform well on texts that have lists of proper names or part numbers, that is, specific words that did not appear in the training set.
2. The accuracy of the model depends on it having been trained on texts similar to the input.
3. Accuracy can be affected by the structure and length of the sentence.
4. There can be confusion between similar languages, such as Portuguese and Spanish or French.
5. There can also be problems with indexing if the languages are altered.
What are the language modelling methods?
The first stage in language detection is the modelling stage, where language models are developed. These models consist of entities representing distinct characteristics of a language; developers use multiple techniques to define these entities, which are simply words or N-grams together with their occurrence counts in the training set. A language model is determined for each language included in the training corpus. A document model, on the other hand, is a similar model produced from an input document whose language is to be determined.
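As a rough illustration of comparing a document model with stored language models, the sketch below represents each model as word counts and uses cosine similarity as the comparison measure; the training texts are placeholders:

```python
import math
from collections import Counter

def build_model(text: str) -> Counter:
    """A toy model: word entities with their occurrence counts."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Placeholder language models built from tiny training texts.
LANGUAGE_MODELS = {
    "en": build_model("the cat sat on the mat and the dog ran"),
    "es": build_model("el gato se sento en la alfombra y el perro corrio"),
}

document_model = build_model("the dog and the cat")
best = max(LANGUAGE_MODELS, key=lambda lang: cosine_similarity(document_model, LANGUAGE_MODELS[lang]))
print(best)  # likely "en"
```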
The different approaches are:
1. Short Word Based Approach
2. Frequent Words Method
3. N-Gram Based Approach
Short Word Based Approach
The short word-based approach is similar to the frequent words method, but it focuses on common words that tend to be short, such as determiners, conjunctions, and prepositions. The length limit for these words is typically 4 to 5 letters.
Frequent Words Method
One of the simplest ways of generating a language model is to include the words that occur most frequently in the training text for each language. According to Zipf's law, a word's frequency is inversely proportional to its rank, so a small set of the highest-frequency words covers a large share of any text; choosing these words keeps the model compact and processing simple.
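Here is a minimal sketch of building such a frequent-words model from placeholder training text; restricting candidate words to at most five letters would turn this into the short word-based variant described above:

```python
from collections import Counter

def frequent_words_model(training_text, top_k=100, max_len=None):
    """Keep the top_k most frequent words; set max_len=5 for the short-word variant."""
    words = training_text.lower().split()
    if max_len is not None:
        words = [w for w in words if len(w) <= max_len]
    return [word for word, _ in Counter(words).most_common(top_k)]

# Placeholder training text; real models are built from much larger corpora.
english_model = frequent_words_model("the cat and the dog sat on the mat and looked at the door")
print(english_model[:5])  # most frequent words first, e.g. ['the', 'and', ...]
```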
N-Gram Based Approach
N-grams are another technique for generating language models. In this method, a language model is generated from a collection of documents using character N-grams instead of the complete words used in the first two approaches. The beginning and end of each word are often marked with an underscore or a space before the N-grams are created.
For example, the word data, surrounded by underscores (_data_), results in the following N-grams:
unigrams: _, d, a, t, a, _
bigrams: _d, da, at, ta, a_
trigrams: _da, dat, ata, ta_
quadgrams: _dat, data, ata_
5-grams: _data, data_
6-grams: _data_
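A small sketch that reproduces the listing above:

```python
def char_ngrams(word: str, n: int) -> list:
    """Return all character n-grams of a word padded with underscores."""
    padded = f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

for n in range(1, 7):
    print(n, char_ngrams("data", n))
# 1 ['_', 'd', 'a', 't', 'a', '_']
# 2 ['_d', 'da', 'at', 'ta', 'a_']
# ...
# 6 ['_data_']
```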