What is DBSCAN?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular learning method utilized in model building and machine learning algorithms. This is a clustering method that is used in machine learning to separate clusters of high density from clusters of low density. Clustering analysis or simply Clustering is an unsupervised learning method that divides the data points into several specific batches or groups, such that the data points in the same groups have similar properties and data points in different groups have different properties in some sense.
DBSCAN can also sort data into clusters of varying shapes, another strong advantage. DBSCAN works as such:
- Divide the dataset into n dimensions
- For each point in the dataset, DBSCAN forms an n-dimensional shape around that data point and then counts how many data points fall within that shape.
- DBSCAN counts this shape as a cluster. DBSCAN iteratively expands the cluster, by going through each individual point within the cluster and counting the number of other data points nearby.
What are the components of DBSCAN?
The parameters of DBSCAN
The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.There are two key parameters of DBSCAN:
- eps: The distance that specifies the neighbourhoods. Two points are considered to be neighbours if the distance between them is less than or equal to eps.
- minPts: Minimum number of data points to define a cluster.
These parameters can be understood if we explore two concepts called Density Reachability and Density Connectivity.
- Reachability in terms of density establishes a point to be reachable from another if it lies within a particular distance (eps) from it.
- Connectivity, on the other hand, involves a transitivity-based chaining approach to determine whether points are located in a particular cluster. For example, p and q points could be connected if p->r->s->t->q, where a->b means b is in the neighbourhood of a.
The points of DBSCAN
Based on these two parameters, points are classified as core points, border points, or outliers:
- Core point: A point is a core point if there are at least minPts number of points (including the point itself) in its surrounding area with radius eps.
- Border point: A point is a border point if it is reachable from a core point and there are less than minPts number of points within its surrounding area.
- Outlier (also known as noise): A point is an outlier if it is not a core point and not reachable from any core points.
What are the algorithmic steps of DBSCAN?
Algorithms start by picking a point(one record) x from your dataset at random and assign it to cluster 1. Then it counts how many points are located within the ε (epsilon) distance from x. If this quantity is greater than or equal to minPoints (n), then consider it as core point, which then pulls out all the ε-neighbours to the same cluster 1.
It will then examine each member of cluster 1 and find their respective ε -neighbours. If some member of cluster 1 has n or moreε-neighbours, it will expand cluster 1 by putting those ε-neighbours into the cluster. It will continue expanding cluster 1 until there are no more examples to put in it.
In the latter case, it will pick another point from the dataset not belonging to any cluster and put it into cluster 2. It will continue like this until all examples belong to some cluster or are marked as outliers.
If there are at least ‘minPoint’ points within a radius of ‘ε’ to the point then we consider all these points to be part of the same cluster.
The clusters are then expanded by recursively repeating the neighbourhood calculation for each neighbouring point.
What is DBSCAN used for?
DBSCAN is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing. Clustering can also help marketers discover distinct groups in their customer base. And they can characterize their customer groups based on the purchasing patterns. Density-based spatial clustering of applications with noise (DBSCAN) is a well-known data clustering algorithm that is commonly used in data mining and machine learning.
The easier-to-set parameter of DBSCAN is the minPts parameter. Its purpose is to smooth the density estimate, and for many datasets, it can be kept at the default value of minPts = 4 (for two-dimensional data). The advantage of this is great at separating clusters of high density versus clusters of low density within a given dataset and is great at handling outliers within the dataset.
What are the advantages and disadvantages of DBSCAN?
Advantages of DBSCAN
- Is great at separating clusters of high density versus clusters of low density within a given dataset.
- Is great for handling outliers within the dataset.
Disadvantages of DBSCAN
- While DBSCAN is great at separating high-density clusters from low-density clusters, DBSCAN struggles with clusters of similar density.
- Struggles with high dimensionality data. If given data with too many dimensions, DBSCAN suffers
What are the metrics for measuring DBSCAN’s performance?
Silhouette Score
The silhouette score is calculated utilizing the mean intra-cluster distance between points, AND the mean nearest-cluster distance. A silhouette score ranges from -1 to 1, with -1 being the worst score possible and 1 being the best score. Silhouette scores of 0 suggest overlapping clusters.
Inertia
Inertia measures the internal cluster sum of squares (the sum of squares is the sum of all residuals). Inertia is utilized to measure how related clusters are amongst themselves, the lower the inertia score the better. However, it is important to note that inertia heavily relies on the assumption that the clusters are convex (of spherical shape). DBSCAN does not necessarily divide data into spherical clusters, therefore inertia is not a good metric to use for evaluating DBSCAN models (which is why I did not include inertia in the code above). Inertia is more often used in other clustering methods, such as K-means clustering.