LASER (Language-Agnostic Sentence Representations) is a method, released by Facebook, for generating pre-trained language representations in multiple languages.
In Part 1 of this post, I will give an overview of LASER, and in Part 2, I will discuss an implementation of it.
Language Models
A language model is a probability distribution over sequences of words. It is used to estimate the probability of a word appearing in a given context, and it represents text in a form that a machine can work with.
Various language models released in recent years have performed excellently, but most are trained in English or a few local languages, while there is a real need to handle multiple languages in the same model. The problem with traditional language models is that they must be trained separately in each language to carry out their tasks. Moreover, the lack of adequate datasets for many languages is a major constraint on training models for specific tasks. The release of multilingual language models, e.g. BERT by Google and LASER by Facebook, aims to solve these issues. For multilingual support, I have chosen LASER to carry out the NLP tasks.
LASER (Language-Agnostic Sentence Representations) was released by Facebook on Jan 22, 2019, to provide multilingual support. It provides multilingual sentence representations for carrying out various NLP tasks and works with more than 90 languages, written in 28 different alphabets. LASER achieves this by embedding all languages jointly in a single shared space (rather than having a separate model for each).
LASER's approach
LASER's approach is based on zero-shot transfer (zero-shot learning means leveraging deep learning networks already trained by supervised learning in new ways, without additional supervised learning) of NLP models from one language, such as English, to scores of others, including languages where training data is extremely limited. LASER uses a single model to handle this variety of languages, including very low-resource ones. This is extremely helpful: NLP features such as sentiment analysis can be built in one language and then deployed easily in more than 100 other languages, without separate training for each language.
LASER's vector representations
LASER’s vector representations of sentences are generic with respect to both the input language and the NLP task. LASER maps a sentence in any language to a point in a high-dimensional space, with the goal that the same statement in any language ends up in the same neighborhood. This representation can be seen as a universal language in a semantic vector space, in which distance correlates closely with the semantic similarity of the sentences.
For example, the English sentence "the dog is brown" and its French translation "Le chien est brun" are placed very close together, as they have the same meaning.
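As a quick illustration, here is a minimal sketch (using the laserembeddings package covered in Part 2) that embeds both sentences and measures how close they are with cosine similarity:

```python
import numpy as np
from laserembeddings import Laser

laser = Laser()

# Embed the same sentence in English and French (one 1,024-dim vector each)
en = laser.embed_sentences(['the dog is brown'], lang='en')[0]
fr = laser.embed_sentences(['Le chien est brun'], lang='fr')[0]

# Cosine similarity: close to 1.0 for semantically equivalent sentences
similarity = np.dot(en, fr) / (np.linalg.norm(en) * np.linalg.norm(fr))
print(f'cosine similarity: {similarity:.3f}')
```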
LASER's architecture is the same as in neural machine translation: an encoder/decoder approach. It uses a single shared encoder for all input languages and a shared decoder to generate the output language. The encoder is a five-layer bidirectional LSTM network. It does not use an attention mechanism; instead, it represents the input sentence with a fixed-size 1,024-dimension vector, obtained by max-pooling over the last states of the BiLSTM. This makes it possible to compare sentence representations and feed them directly into a classifier, and it is enabled by a shared BPE vocabulary trained on the concatenation of all languages.
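To make the encoder idea concrete, here is a rough conceptual sketch in Keras, not Facebook's actual implementation: a stack of bidirectional LSTM layers whose hidden states are max-pooled into a fixed-size sentence vector. The vocabulary size and embedding width below are assumptions; the per-direction hidden size of 512 is chosen so the pooled output is 1,024-dimensional:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 50000    # shared BPE vocabulary size (assumed for illustration)

inputs = layers.Input(shape=(None,))               # sequence of BPE token ids
x = layers.Embedding(VOCAB_SIZE, 320)(inputs)      # token embeddings (width assumed)
for _ in range(5):                                 # five BiLSTM layers
    x = layers.Bidirectional(layers.LSTM(512, return_sequences=True))(x)
sentence_vector = layers.GlobalMaxPooling1D()(x)   # max-pool -> 2 * 512 = 1,024 dims

encoder = models.Model(inputs, sentence_vector)
```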
Long Short-Term Memory (LSTM) networks
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. A bidirectional LSTM (BiLSTM) layer learns bidirectional long-term dependencies between time steps of time-series or sequence data; these dependencies are useful when you want the network to learn from the complete sequence at each time step.
For more details, you can find the academic paper in the LASER library.
LASER opens the door to performing zero-shot transfer of NLP models from one language, such as English, to scores of others — including languages where training data is extremely limited.
- FB engineering.
LASER for a multi-class classification task
Dataset Preparation
Clean your data to ensure there are no empty rows or NA values in your dataset. Split the dataset into training, dev, and test sets, where train.tsv and dev.tsv contain the labels and test.tsv does not, as sketched below. I had around 31k English sentences in my training dataset after cleaning.
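A minimal sketch of this preparation with pandas (the file and column names dataset.tsv, sentence, and label are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input file with 'sentence' and 'label' columns
df = pd.read_csv('dataset.tsv', sep='\t')

# Drop NA values and empty rows
df = df.dropna(subset=['sentence', 'label'])
df = df[df['sentence'].str.strip() != '']

# Split into train / dev / test (80 / 10 / 10)
train, rest = train_test_split(df, test_size=0.2, random_state=42)
dev, test = train_test_split(rest, test_size=0.5, random_state=42)

train.to_csv('train.tsv', sep='\t', index=False)
dev.to_csv('dev.tsv', sep='\t', index=False)
test.drop(columns=['label']).to_csv('test.tsv', sep='\t', index=False)  # no labels
```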
Setup and installation
For getting the sentence embeddings, you can refer to this GitHub link to set up the Docker image for LASER. Alternatively, there is a package named laserembeddings, a production-ready port of Facebook Research's LASER (Language-Agnostic Sentence Representations) for computing multilingual sentence embeddings. I used the laserembeddings package to get the embeddings.
Run the following command to install the package and test it:
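```bash
pip install laserembeddings
python -m laserembeddings download-models   # fetch the pre-trained encoder, BPE codes, and vocab

# Quick test: embed a sentence and check the (1, 1024) output shape
python -c "from laserembeddings import Laser; print(Laser().embed_sentences(['hello'], lang='en').shape)"
```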
Calculate the embeddings for all the sentences in the dataset using the laserembeddings package and store the embedding vectors (1,024-dimensional) in a NumPy file.
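A sketch of this step, with the file and column names carried over from the assumed preparation above:

```python
import numpy as np
import pandas as pd
from laserembeddings import Laser

laser = Laser()

# Hypothetical file and column names from the dataset-preparation step
train = pd.read_csv('train.tsv', sep='\t')
dev = pd.read_csv('dev.tsv', sep='\t')

# embed_sentences returns an array of shape (n_sentences, 1024)
X1 = laser.embed_sentences(train['sentence'].tolist(), lang='en')
X2 = laser.embed_sentences(dev['sentence'].tolist(), lang='en')

np.save('train_embeddings.npy', X1)
np.save('dev_embeddings.npy', X2)
```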
Classification Model Training
Since we already have the embeddings for all the sentences, LASER is acting as the encoder in our model: it provides the embeddings for the input sentences. We now need to build a classifier network, acting as the decoder, to classify each sentence as positive, negative, or neutral sentiment:
For building the model:
Do the following imports:
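A minimal set for the classifier sketched below, assuming a TensorFlow/Keras setup:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
```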
For modeling:
(I will explain the parameters given below)
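A sketch consistent with the description that follows; the hidden-layer sizes, dropout rate, batch size, and label mapping are my assumptions:

```python
# One-hot encode the integer labels (assumed mapping: 0 = negative, 1 = neutral, 2 = positive)
Y1 = to_categorical(train['label'], num_classes=3)
Y2 = to_categorical(dev['label'], num_classes=3)

# A simple feed-forward "decoder" over the 1,024-dim LASER embeddings
model = Sequential([
    Dense(256, activation='relu', input_shape=(1024,)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dense(3, activation='softmax'),   # 3 classes: positive, negative, neutral
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# X1/Y1: training embeddings and labels; X2/Y2: validation embeddings and labels
model.fit(X1, Y1, validation_data=(X2, Y2), epochs=7, batch_size=32)
```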
Here, we have a sequential model. The model above works as the decoder, and you can tweak its layers; it is a very simplistic architecture. I also experimented with adding a global average pooling layer, but the results were about the same. The '3' in the final Dense layer indicates that we have three classes to predict. I used the adam optimizer and categorical_crossentropy loss, since this is a multi-class classification problem (positive, negative, neutral). X1 holds the embeddings for all the sentences in the training dataset and Y1 the corresponding labels, while X2 holds the embeddings for the validation dataset and Y2 its labels. I ran it for around 7 epochs, for which val_accuracy came to around 92%. You can decide the number of epochs based on your data.
Finally, I trained the model on my English dataset (consisting of 31k sentences).
After you train the model by running the above steps, make sure you save it:
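```python
# 'laser_model.h5' is an assumed filename; it matches the model loaded during inference below
model.save('laser_model.h5')
```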
Inference
Now, with the saved model, I carried out the inference task by loading laser_model.
I carried out inference on the test dataset using the loaded model with model.predict_classes, obtaining an accuracy of around 90%. I then translated my test dataset into various languages (Hindi, German, French, Arabic, Tamil, Indonesian) using the Google Translate package and, after getting the embeddings with laserembeddings, used the trained model to carry out the inferences.
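A sketch of this inference flow; the translated test file name is a hypothetical placeholder:

```python
import pandas as pd
from laserembeddings import Laser
from tensorflow.keras.models import load_model

laser = Laser()
model = load_model('laser_model.h5')

# Hypothetical: the test set translated into French, with a 'sentence' column
test_fr = pd.read_csv('test_fr.tsv', sep='\t')
X_test = laser.embed_sentences(test_fr['sentence'].tolist(), lang='fr')

# predict_classes exists on Sequential models in older Keras releases;
# on newer versions, use model.predict(X_test).argmax(axis=-1) instead
predictions = model.predict_classes(X_test)
```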
Analysis of results
The accuracy scores on the various languages came out as:
The properties of the multilingual semantic space can be used for paraphrasing a sentence or searching for sentences with similar meaning — either in the same language or in any of the 93 others now supported by LASER.
- FB engineering article on LASER.
Conclusion
In this article, we have learned to use LASER for a multi-class classification task. LASER can also be used for other Natural Language Processing tasks beyond classification; for example, I implemented a multilingual FAQ with it. With adequate data, the results are surprisingly good, even with the simple architecture of the decoder model.
However, it has some issues when data is scarce; I noticed this when I tried the same approach on data from another domain, which was comparatively smaller. You should have adequate data to achieve very good results. I am also exploring Multilingual BERT to see if it outperforms LASER and will come back with the results. Thanks for reading, and have a great day ahead. See you again in the next article!