What is semi-supervised learning?
Semi-supervised learning is a branch of machine learning. It refers to a learning problem (and to the algorithms designed for it) in which a model must learn from a small portion of labeled examples and a large number of unlabeled examples, and then make predictions on new examples.
As such, it is a learning problem that sits between supervised learning and unsupervised learning. We require semi-supervised learning algorithms when working with data where labeling examples is challenging or expensive. The sign of an effective semi-supervised learning algorithm is that it can achieve better performance than a supervised learning algorithm fit only on the labeled training examples.
Semi-supervised learning algorithms are generally able to clear this low bar. Finally, semi-supervised learning may be used in either an inductive or a transductive setting.
Generally, inductive learning refers to a learning algorithm that learns from labeled training data and generalizes to new data, such as a test dataset. Transductive learning refers to learning from labeled training data and generalizing to the available unlabeled (training) data. Both types of learning task may be performed by a semi-supervised learning algorithm.
What is an example of semi-supervised learning?
A common application of semi-supervised learning is a text document classifier. This is the type of situation where semi-supervised learning is ideal, because it would be nearly impossible to obtain a large number of labeled text documents: it is simply not time-efficient to have a person read through entire documents just to assign each one a classification.
So, semi-supervised learning lets the algorithm learn from a small number of labeled text documents while still classifying the large number of unlabeled text documents in the training data. A minimal sketch of such a classifier follows.
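For instance, here is a minimal sketch using scikit-learn, where unlabeled documents carry the label -1 (the library's convention) and a probabilistic classifier is wrapped in scikit-learn's built-in self-training loop. The tiny corpus and its two categories are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Invented corpus: two labeled documents, four unlabeled ones.
docs = [
    "the team won the match",            # labeled: 0 (sports)
    "stocks fell on weak earnings",      # labeled: 1 (finance)
    "the striker scored a late goal",
    "the market rallied after the report",
    "fans cheered the home team",
    "investors sold shares in the bank",
]
labels = np.array([0, 1, -1, -1, -1, -1])  # -1 marks an unlabeled document

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Confident predictions on the unlabeled documents become pseudo-labels.
clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
clf.fit(X, labels)

print(clf.predict(vec.transform(["goal scored in the final"])))
```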
How does semi-supervised learning work?
The way that semi-supervised learning manages to train a model with less labeled training data than supervised learning is by using pseudo-labeling, which can be combined with many neural network models and training methods. Here’s how it works (a code sketch of the full loop follows the list):
- Train the model on the small amount of labeled training data, just as you would in supervised learning, until it gives you good results.
- Then use the trained model to predict outputs for the unlabeled training data. These predictions are pseudo-labels, since they may not be entirely accurate.
- Concatenate the labels from the labeled training data with the pseudo-labels created in the previous step.
- Concatenate the inputs in the labeled training data with the inputs in the unlabeled data.
- Finally, train the model on this combined dataset the same way as you did with the labeled set at the beginning, in order to decrease the error and improve the model’s accuracy.
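Here is a minimal sketch of those five steps with scikit-learn; the dataset, the 20/480 labeled/unlabeled split, and the logistic-regression model are all placeholder choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 20 labeled examples, 480 unlabeled ones.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_lab, y_lab = X[:20], y[:20]
X_unlab = X[20:]

# Step 1: train on the small labeled set.
model = LogisticRegression().fit(X_lab, y_lab)

# Step 2: predict pseudo-labels for the unlabeled set.
pseudo = model.predict(X_unlab)

# Steps 3-4: concatenate labels with pseudo-labels, and inputs with inputs.
y_all = np.concatenate([y_lab, pseudo])
X_all = np.vstack([X_lab, X_unlab])

# Step 5: retrain on the combined data.
model = LogisticRegression().fit(X_all, y_all)
```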
What is the difference between semi-supervised and unsupervised learning?
The biggest difference between supervised and unsupervised machine learning is this: supervised machine learning algorithms are trained on datasets that include labels, added by a machine learning engineer or data scientist, which guide the algorithm to understand which features are important to the problem at hand. This is a very costly process, especially when dealing with large volumes of data. Unsupervised machine learning algorithms, on the other hand, are trained on unlabeled data and must determine feature importance on their own, based on inherent patterns in the data. The most basic disadvantage of unsupervised learning is that its range of applications is limited.
To counter these disadvantages, the concept of semi-supervised learning was introduced. In this type of learning, the algorithm is trained on a combination of labeled and unlabeled data. Typically, this combination contains a very small amount of labeled data and a very large amount of unlabeled data. This is useful for a few reasons. First, the process of labeling massive amounts of data for supervised learning is often prohibitively time-consuming and expensive. What’s more, too much labeling can impose human biases on the model. That means including lots of unlabeled data during the training process actually tends to improve the accuracy of the final model while reducing the time and cost spent building it. You can use unsupervised learning techniques to discover and learn the structure in the input variables. You can also use supervised learning techniques to make best-guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data, and use the model to make predictions on new, unseen data. A short sketch of the first idea follows.
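As one hedged illustration of that first idea, the sketch below fits k-means on all inputs (labels are not needed) and feeds the resulting cluster-distance features to a classifier trained only on the labeled subset. The data, the choice of k = 8, and the feature construction are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_lab, y_lab = X[:25], y[:25]          # small labeled subset

# Unsupervised step: learn structure from ALL inputs; labels not needed.
km = KMeans(n_clusters=8, n_init=10, random_state=1).fit(X)

# Use distances to the learned cluster centers as extra features.
features = lambda A: np.hstack([A, km.transform(A)])

# Supervised step: train only on the labeled examples, in the new space.
clf = LogisticRegression(max_iter=1000).fit(features(X_lab), y_lab)
print(clf.predict(features(X[:5])))
```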
What are the types of semi-supervised learning?
Most traditional SSL methods assume that the classes of the unlabeled data are contained in the set of classes of the labeled data.
Likewise, these techniques do not filter out unhelpful unlabeled examples but instead use all of the unlabeled data for training, which is not appropriate in realistic circumstances.
Let’s discuss various types of semi-supervised learning algorithms.
Self-Training
Self-training techniques have long been used for semi-supervised learning. Self-training is a resampling method that repeatedly labels unlabeled training samples based on the model’s confidence scores and retrains the model on the selected pseudo-annotated data.
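Here is a minimal sketch of that loop, assuming a scikit-learn-style classifier with predict_proba; the 0.9 confidence threshold and the round limit are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 15 labeled examples, the rest form an unlabeled pool.
X, y = make_classification(n_samples=400, n_features=8, random_state=2)
X_lab, y_lab = list(X[:15]), list(y[:15])
pool = list(X[15:])

THRESHOLD = 0.9
for _ in range(10):  # a few self-training rounds
    model = LogisticRegression().fit(np.array(X_lab), np.array(y_lab))
    if not pool:
        break
    proba = model.predict_proba(np.array(pool))
    conf = proba.max(axis=1)
    keep = conf >= THRESHOLD  # only trust confident predictions
    if not keep.any():
        break                 # nothing confident left to pseudo-label
    preds = model.classes_[proba.argmax(axis=1)]
    # Move confidently predicted samples from the pool to the labeled set.
    X_lab += [x for x, k in zip(pool, keep) if k]
    y_lab += [p for p, k in zip(preds, keep) if k]
    pool = [x for x, k in zip(pool, keep) if not k]
```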
Graph-based semi-supervised machine learning
Graph-based SSL algorithms are a significant subclass of SSL algorithms that have received a great deal of attention in recent years.
Here, one assumes that the data (both labeled and unlabeled) is embedded within a low-dimensional manifold that can be reasonably expressed by a graph.
Each data sample is represented by a vertex in a weighted graph, with the weights giving a measure of similarity between vertices. Hence, adopting a graph-based strategy for tackling an SSL problem involves the following steps (a code sketch follows the list):
- Graph construction (if no input graph exists),
- Injecting seed labels on a subset of the nodes, and
- Inferring labels for the unlabeled nodes in the graph.
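scikit-learn’s graph-based implementations (LabelPropagation and LabelSpreading) follow exactly this recipe: they build a similarity graph from the inputs and then diffuse the seed labels along its edges. A minimal sketch on invented two-moons data:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=200, noise=0.1, random_state=3)

# Seed labels: 3 per class; -1 marks an unlabeled node.
y = np.full(len(y_true), -1)
for c in (0, 1):
    y[np.where(y_true == c)[0][:3]] = c

# The RBF kernel defines the weighted similarity graph over all samples;
# fitting diffuses the seed labels to the unlabeled vertices.
model = LabelSpreading(kernel="rbf", gamma=20).fit(X, y)

# transduction_ holds the inferred label for every node in the graph.
print(model.transduction_[:10])
```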
Low-density Separation
Low-density separation methods assume that the decision boundary should lie in a low-density region. The analogy with the cluster assumption is easy to see: a decision boundary in a high-density region would cut a cluster into two different classes, and having many objects of different classes in the same cluster would require the boundary to cut through the cluster, i.e., to pass through a high-density region.
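One concrete low-density-separation technique is entropy minimization (in the style of Grandvalet and Bengio), which penalizes uncertain predictions on the unlabeled points and so pushes the boundary out of dense regions. Below is a minimal numpy sketch for binary logistic regression; the two-cluster data, the penalty weight, and the step size are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dense clusters; only one point in each is labeled.
X_unlab = np.vstack([rng.normal(-2.0, 0.5, (100, 2)),
                     rng.normal(+2.0, 0.5, (100, 2))])
X_lab = np.array([[-2.0, -2.0], [2.0, 2.0]])
y_lab = np.array([0.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(2), 0.0
lam, lr = 0.1, 0.1  # entropy weight and step size (arbitrary)

for _ in range(500):
    # Supervised term: cross-entropy gradient on the labeled points.
    p_l = sigmoid(X_lab @ w + b)
    gw = X_lab.T @ (p_l - y_lab) / len(y_lab)
    gb = np.mean(p_l - y_lab)

    # Unsupervised term: gradient of the prediction entropy H(p) over the
    # unlabeled points; per sample, dH/dw = -z * p * (1 - p) * x for z = w.x + b.
    z_u = X_unlab @ w + b
    p_u = sigmoid(z_u)
    coef = -z_u * p_u * (1.0 - p_u)
    gw += lam * X_unlab.T @ coef / len(z_u)
    gb += lam * np.mean(coef)

    w -= lr * gw
    b -= lr * gb

# The boundary w.x + b = 0 settles in the low-density gap between clusters.
print("w =", w, "b =", b)
```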