Birch

Table of contents

Automate your business at $5/day with Engati

REQUEST A DEMO
Switch to Engati: Smarter choice for WhatsApp Campaigns 🚀
TRY NOW
Birch

What does BIRCH mean?

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm. 

An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database.

Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively", beating DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10 year test of time award in 2006.

birch
Source: Ques10

What is the key idea of BIRCH in data mining?

BIRCH attempts to minimize the memory requirements of large datasets by summarizing the information contained in dense regions as CF, or Clustering Feature entries.

BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other clustering algorithm can now use. However, BIRCH has one major drawback – it can only process metric attributes. A metric attribute is any attribute whose values can be represented in Euclidean space i.e., no categorical attributes should be present.

What is CF tree?

BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered triple, (N, LS, SS) where ‘N’ is the number of data points in the cluster, ‘LS’ is the linear sum of the data points and ‘SS’ is the squared sum of the data points in the cluster. It is possible for a CF entry to be composed of other CF entries.

The CF tree is the actual compact representation that we have been speaking of so far. A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node and a CF entry made up of the sum of CF entries in the child nodes. There is a maximum number of entries in each leaf node. This maximum number is called the threshold.

What are the stages of BIRCH?

The BIRCH clustering algorithm consists of two stages:

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the the large dataset that retains as much information as possible. This smaller summary is then clustered instead of clustering the larger dataset. BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other clustering algorithm can now use. However, BIRCH has one major drawback — it can only process metric attributes. A metric attribute is any attribute whose values can be represented in Euclidean space i.e., no categorical attributes should be present.

The BIRCH clustering algorithm consists of two stages:

Building the CF Tree

BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered triple, (N, LS, SS) where ’N’ is the number of data points in the cluster, ‘LS’ is the linear sum of the data points and ‘SS’ is the squared sum of the data points in the cluster. It is possible for a CF entry to be composed of other CF entries. Optionally, we can condense this initial CF tree into a smaller CF.

Global Clustering

Applies an existing clustering algorithm on the leaves of the CF tree. A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node and a CF entry made up of the sum of CF entries in the child nodes. Optionally, we can refine these clusters.

What is the significance of CF in BIRCH algorithm?

BIRCH attempts to minimize the memory requirements of large datasets by summarizing the information contained in dense regions as Clustering Feature (CF) entries. The CF-tree is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster. Each nonleaf node contains at most B entries. 

In this context, a single entry contains a pointer to a child node and a CF made up of the sum of the CFs in the child (subclusters of subclusters). On the other hand, a leaf node contains at most L entries, and each entry is a CF (subclusters of data points). All entries in a leaf node must satisfy a threshold requirement. That is to say, the diameter of each leaf entry has to be less than Threshold. In addition, every leaf node has two pointers, prev and next, which are used to chain all leaf nodes together for efficient scans.

Where do we use Birch algorithm?

As the size of datasets increases, the harder it it becomes to scale. Memory requirements, running time, and result quality dwindle, making it difficult to cluster and segregate the data. We primarily use BIRCH in these scenarios where the dataset is colossal. 

BIRCH is also applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression

What are the statistics used in the BIRCH clustering?

Clustering features used by BIRCH are simple summary statistics that can easily be updated with new data: 

  • The number of points, 
  • The linear sums, 
  • The sum of squared values. 

Unfortunately, how the sum of squares is then used in BIRCH is prone to catastrophic cancellation.

Close Icon
Request a Demo!
Get started on Engati with the help of a personalised demo.
This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
*only for sharing demo link on WhatsApp
Thanks for the information.
We will be shortly getting in touch with you.
Oops! something went wrong!
For any query reach out to us on contact@engati.com
Close Icon
Congratulations! Your demo is recorded.

Select an option on how Engati can help you.

I am looking for a conversational AI engagement solution for the web and other channels.

I would like for a conversational AI engagement solution for WhatsApp as the primary channel

I am an e-commerce store with Shopify. I am looking for a conversational AI engagement solution for my business

I am looking to partner with Engati to build conversational AI solutions for other businesses

continue
Finish
Close Icon
You're a step away from building your Al chatbot

How many customers do you expect to engage in a month?

Less Than 2000

2000-5000

More than 5000

Finish
Close Icon
Thanks for the information.

We will be shortly getting in touch with you.

Close Icon
Close Icon

Contact Us

Please fill in your details and we will contact you shortly.

This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
Thanks for the information.
We will be shortly getting in touch with you.
Oops! Looks like there is a problem.
Never mind, drop us a mail at contact@engati.com