What is anomaly detection in AI?
Anomaly detection is a process in data mining that involves identifying outlier values in a series of data. It assumes that the data we possess falls within a particular understood range, which may be based on historical data, and that values are very rarely found outside of that range.
This process is also known as outlier analysis. In supervised anomaly detection, the data can be labeled as ‘normal’ or ‘abnormal’ so that models can be trained on those labels and then applied to new data.
In unsupervised machine learning, anomaly detection can be done on unlabeled data by making use of historical data, analyzing the probability distribution of values, and then flagging a new value as an anomaly if it is sufficiently unlikely under that distribution.
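For instance, here is a minimal sketch of this idea, assuming the historical values are roughly Gaussian; the sample values and the likelihood threshold epsilon are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np
from scipy.stats import norm

# Historical values, assumed (for this sketch) to be roughly Gaussian.
history = np.array([10.2, 9.8, 10.1, 9.9, 10.3, 10.0, 9.7, 10.4])

# Estimate the distribution's parameters from the historical data.
mu, sigma = history.mean(), history.std()

def is_anomaly(value, epsilon=0.01):
    """Flag a value whose likelihood under the fitted distribution
    falls below the threshold epsilon."""
    return norm.pdf(value, loc=mu, scale=sigma) < epsilon

print(is_anomaly(10.1))  # False: close to the historical mean
print(is_anomaly(14.0))  # True: far out in the tail
```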
Anomaly detection can be conducted on single variables (known as Univariate Anomaly Detection) as well as on combinations of variables (known as Multivariate Anomaly Detection).
There are two fundamental assumptions in anomaly detection:
- Anomalies occur very rarely in the data.
- The features of anomalies differ significantly from those of normal instances.
What is a data anomaly?
Within datasets, certain patterns show that everything is normal. These patterns represent business as usual.
However, there can be unexpected changes in these data patterns. Events that do not conform to the expected pattern are known as anomalies. They are essentially deviations from business as usual.
Anomalies are also known as outliers, discordant observations, exceptions, novelties, noise, peculiarities, aberrations, deviations, surprises, or contaminants, depending on the application domain. The terms ‘anomaly’ and ‘outlier’ are the most widely used ones, and they are generally used interchangeably. An anomaly isn’t necessarily a good or bad thing; it is simply a deviation from the usual state of things.
What are the types of anomalies?
In the context of databases, there are essentially three types of anomalies: update anomalies, deletion anomalies, and insertion anomalies.
- Update anomalies result from data redundancies and partial updates.
- Deletion anomalies refer to unintentionally losing data because of deleting other data.
- Insertion anomalies arise when you cannot add data to the database because some other data is not present in the database.
What is anomaly detection used for?
Anomaly detection can be used for a variety of purposes. Here are a few of them:
Improving application performance
Traditional approaches to application performance monitoring only really help you react to issues after they arise. That means you have to deal with the fallout of anomalies before you can take any action to fix their causes.
Using anomaly detection techniques with artificial intelligence and machine learning would help you identify anomalies and fix potential problems before they surface and affect your users and their experience with your application.
It can help you correlate data with the appropriate application performance metrics so that your team knows what to take action on.
It helps you proactively increase the performance of your applications.
Enhancing product quality
From the time you create a product and launch it to every single time you release an update, you need to be sure that your product will function seamlessly, as you intended.
Any change that you make to your purchase funnel, any update to your features, and any new version of your product can potentially result in behavioral anomalies.
If you don’t proactively look for these anomalies and take the initiative to fix them, your business could end up losing massive amounts of revenue.
Augmenting user experience
Any lapses in your customer experience can cause your customers to get frustrated. This could cause a spike in customer attrition and a drop in your revenues.
Anomaly detection can help you catch such issues in advance, give your customers and users a seamless experience, and keep them loyal to you, thus helping you increase your customer lifetime value.
How do you identify an anomaly?
There are many techniques that can be used for anomaly detection. These include:
Simple Statistical Methods
The simplest approach involves flagging data points that deviate from the common statistical properties of the distribution, such as the mean, median, mode, and quantiles.
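As a brief sketch of one such method, here is the interquartile range (IQR) rule; the sample data and the conventional 1.5 multiplier are illustrative choices:

```python
import numpy as np

# Illustrative sample with one obvious outlier.
data = np.array([12, 14, 13, 15, 14, 13, 12, 95, 14, 13])

# IQR rule: flag points more than 1.5 * IQR outside the first/third
# quartiles. The 1.5 multiplier is the conventional choice, not a
# universal constant.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```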
Approaches based on machine learning
- Density-Based Anomaly Detection:
This approach is based on the k-nearest neighbors algorithm. It assumes that normal data points occur in dense neighborhoods while abnormalities lie far away from them. The nearest set of data points is evaluated using a score, which could be the Euclidean distance or a similar measure, depending on whether the data is categorical or numerical. Two common algorithms of this kind are k-nearest neighbors (k-NN) and the relative density of data, also known as the Local Outlier Factor (LOF). (A combined sketch of all three machine learning approaches follows this list.)
- Clustering-Based Anomaly Detection:
Clustering is an extremely popular concept in unsupervised learning. It assumes that similar data points tend to belong to similar clusters (groups), determined by their distance from local centroids. The k-means clustering algorithm creates ‘k’ similar clusters of data points, and instances that fall outside these clusters can be considered anomalies.
- Support Vector Machine-Based Anomaly Detection:
A support vector machine (SVM) can also be used for anomaly detection. Extensions such as the One-Class SVM can be applied to unsupervised problems. The algorithm learns a soft boundary around the normal data points in the training set and then flags testing instances that fall outside the learned region as anomalies.
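To make these three approaches concrete, here is a minimal sketch using scikit-learn; the library choice, the synthetic data, and the hyperparameters (n_neighbors, n_clusters, nu, and the 95th-percentile distance threshold) are illustrative assumptions rather than part of any standard recipe:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Mostly "normal" 2-D points plus a few far-away anomalies.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, anomalies])

# 1. Density-based: Local Outlier Factor (LOF).
#    fit_predict returns -1 for outliers and 1 for inliers.
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

# 2. Clustering-based: k-means, flagging points far from their nearest
#    centroid (here, beyond the 95th percentile of distances).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.min(kmeans.transform(X), axis=1)
threshold = np.percentile(dist_to_centroid, 95)
km_labels = np.where(dist_to_centroid > threshold, -1, 1)

# 3. One-Class SVM: learn a soft boundary around the normal training
#    data, then flag points outside the learned region.
ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(normal)
svm_labels = ocsvm.predict(X)

for name, labels in [("LOF", lof_labels), ("k-means", km_labels),
                     ("One-Class SVM", svm_labels)]:
    print(f"{name}: {np.sum(labels == -1)} points flagged as anomalies")
```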
How do you resolve anomalies?
Database anomalies can be prevented and resolved by normalizing the database. Normalization is essentially a systematic approach to decomposing tables and eliminating redundancy, thus getting rid of insertion, update, and deletion anomalies. Here are the goals of normalization:
- Eliminating all the redundant or repeated data from the database.
- Getting rid of undesirable insertion, update, and deletion dependencies.
- Minimizing the need to restructure the database whenever new fields are added.
- Increasing the usefulness and understandability of the relationships between tables.
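As a hedged illustration of what this decomposition looks like, here is a minimal sketch using pandas; the orders table, its column names, and the customer/order split are hypothetical examples invented for this sketch:

```python
import pandas as pd

# A denormalized orders table: customer details are repeated on every
# order row, which invites update, deletion, and insertion anomalies.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 101, 102],
    "customer_name": ["Ada", "Ada", "Grace"],
    "customer_city": ["London", "London", "New York"],
    "amount": [250.0, 99.0, 480.0],
})

# Decompose: customer attributes move to their own table, stored once.
customers = (
    orders[["customer_id", "customer_name", "customer_city"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
orders_normalized = orders[["order_id", "customer_id", "amount"]]

# Updating a customer's city now touches exactly one row, and deleting
# an order no longer risks losing the customer's details.
print(customers)
print(orders_normalized)
```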