What is Hashing or Hash function?
In simple words, the term Hashing refers to a computing function that divides & mix datasets and efficiently stores them in memory/computer/servers by assigning them hash codes or integers. This helps in the retrieval of data based on data queries anywhere on the network. To start with the process, firstly we design a Hash Function where Hash Strings are assigned to a range of data inputs to match the data outputs.
For example, the string "Apple" is assigned to the number 67, and "Internet" to the number 29, and any other possible string to some number within that range, this process in the Hash Function is called Mapping, wherein we assign integers to key inputs or objects. Since there are way more possible inputs than outputs, any given number in the range will have many strings mapped (keys/objects) to it.
In simple words, hash functions generate a random number (Hash) mapped to the storage (Servers) by taking mod (modulus of the operation), using the total number of servers.
Distribution depends on the size of the array (servers) by performing a modulo operation on the hash to get the array index. So, the index would be index = hash(object) mod N, where N is the size of the array. To add an object, we need to calculate its hash modulo N, and check the bucket (a further division of servers). Similarly to search an object in the index system looks into the bucket to check if the object (key) is there. This structure is called a hash table.
A properly sized hash table should have a reasonably small number of objects per bucket, to have constant time access (an average complexity of O(N/k), where k is the number of buckets) to maintain a balanced load on all the servers.
For instance, we have a series of inputs (string keys) which is as follows:
London, United Kingdom;
Cairo, Eygpt;
Texas, Dallas;
Delhi, India;
Tokyo, Japan,
and have 4 servers to store this data with their hashes:
Now when you retrieve this set of data through value keys it should contact the right server where this data is saved.
Where the distribution would look like this on all 4 servers:
This sums up the hashing functions and what it means. Now, let’s understand why do we need Consistent hashing?
Why do we need Consistent Hashing?
As you see that data restoration and hashing function is been designed to work in a linear array system where the keys are assigned to each server are based on the modulo system in the unique pairs or matches.
What if we discard or add a server or if there’s a server failure due to a technical glitch, how system would retrieve the data from the deleted server? In order to maintain the load balancing, we again need to work around the hash functions to distribute the data uniformly. This means re-assigning servers to data inputs (keys) based on the hash modulo that would take a lot of time and looks close to impossible. To solve this problem programmers or developers use consistent hashing, where the distribution happens on a Hash Ring rather than a linear array.
What is Consistent Hashing?
Consistent Hashing is a system that operates independently of the number of servers or objects in a distributed hash table by assigning them a position on an abstract circle, or hash ring. Where the positions are also known as "Nodes" which helps in equal mapping of the data on the hash ring. Another interesting thing about Consistent Hashing is that it puts servers, data, and requests on the same hash ring and creates a system where each node interacts with the closest query or data request or to the hash keys to create a balanced distribution.
Continuing with the previous example; we’ve placed queries (keys) and servers on the same hash ring at particular angles. If a query gets triggered in the system it tries to look for the information in the nearest server allocated to it. This circulation or arrangement could be adjusted as per the system design. Each of these servers and memory units is called “Nodes” in system design, which here are represented as A, B, C, and D, which are placed in the clockwise direction followed by the keys.
Now, if a system gets a request for “Cairo, Eygpt”, it’ll automatically look for that piece of information in the corresponding node i.e “A”. Similarly, for keys “London, UK & Tokyo, Japan” the nearest corresponding location or node is “D” in the clockwise direction, hence it’ll interact with that particular node to retrieve the data.
Unlike traditional hashing, if a system faces an issue of server failure, addition, or removals, the request or data query connects or is assigned automatically to the closest servers or node.
The traditional hashing method is very insufficient to use and handle requests over a network in case of server issues or breakdowns. As it assumes that we have a fixed number of servers, and mapping of keys and servers happens at once. In the case of the addition of a server, it needs heavy computation and re-mapping and hashing of the objects to the new server.
On the other hand, under Consistent Hashing non-linear placement of nodes allow them to interact with each other in case of any changes on the system.
Although, sometimes it happens that the distribution or load doesn’t remain the same or proportional for all the nodes present on the Hash Ring, which results in unbalanced distribution. Also, how would a consistent hashing system respond to the Addition or Removal of Servers/Nodes, and without creating an imbalance on the system?
How does consistent hashing respond to the failure or removal of a server?
In the below example, if the Node 3 or 3rd Server fails, the data & hash for key “Jem” would be shifted to Node 1. Initially, Node 1 was only serving to the key “John”, but now there’ll be an additional load of key “Jem” also on the same server. At the same time, the rest of the hash ring and request-node assignments will remain unaffected, unlike traditional hashing. But load bearing of the corresponding server after the failed server would be high as it'll handle all the previous requests from the hash ring.
Hotspot
A node that bears the load of unevenly distributed data requests becomes a Hotspot. As it’ll cater to all the preceding requests from the hash ring. To solve this problem system engineers use Virtual Nodes and enable the hash ring to distribute the request evenly among all the active nodes.
How does consistent hashing respond to the addition of a new server?
Continuing with the previous example (failure of a node), if we add a new server to the ring, say between the keys “Srushtoka & Freddie”. Initially, Node 5 was catering to both the keys as given in the picture above. Now, after a new server (Node 6), the query or assignment of the key “Freddie” would be assigned/mapped to Node 6, and not Node 5. However, the assignment of the Key “Srushtika” would remain mapped to Node 5.
Hash ring creates the ease to the entire process of adding or removing a server or in case of node failure. The re-assignment doesn’t take much re-work and time as compared to the traditional hashing mechanism.
How does Consistent Hashing respond to non-uniform distribution?
To avoid non-uniform distribution on the hash ring, system engineers use something called “Virtual Nodes/Vnodes”.
Virtual Nodes
These Vnodes are the representatives of physical nodes present on the system ring and are allocated between 2 physical nodes. They help to distribute the load of physical nodes more efficiently & evenly and fasten the process of remapping or re-hashing due to the addition or removal of nodes. Vnodes are non-contagious which means no two adjoining Vnodes are assigned to the same physical node or servers on the hash ring.
As shown in the above diagram, there are 3 Nodes (A, B & C) and in between, there are around 27 Vnodes equally distributed on the hash ring. Each of these Vnodes are assigned to physical nodes and will accordingly coordinate with the nodes in case of a new query or data distribution.
Let's suppose, server B is removed or facing a failure. Hence, the Vnodes assigned to Node B would also be removed or would be non-functional. This will automatically trigger the re-hashing or re-mapping of the keys to the remaining nodes i.e A & C. Now you would think, what would happen to the keys assigned to A & C Nodes? So, the assignment of these nodes will remain intact, and object keys of node B would be assigned to A & C nodes evenly. Removal or absence of these nodes/labels does not affect those keys in any way.
Therefore, implementing a consistent hashing system helps the organization to optimize the data processing and saves time, resources, and heads for sure.