
# Collision Handling and Resolving Clustering using Double Hashing

With an increase in the number of keys, it becomes common for two different keys to hash to the same value. This causes collisions.

If we know the keys in advance, we can write a hash function which perfectly avoids collisions. But what if we don’t know the keys in advance? For that, we have two different approaches:

• Chaining
• Open Addressing

The idea of chaining is to connect each cell of the Hash Table to a separate Array or Linked List. Keys that hash to the same value are stored in the list linked to that index of the Hash Table. This requires additional memory outside the table, as we need m (the size of the Hash Table) Linked Lists or Arrays in addition to the table itself.

The idea of Open Addressing is to store the keys inside the Hash Table itself, checking whether the slot at a key’s hash value is occupied. If an index already holds a key, we probe for the next empty slot. During the search for a specific key, we keep probing until the key is found; if it cannot be found, the search function returns false.

Open Addressing is further divided into three parts:

• Linear Probing
• Quadratic Probing
• Double Hashing

Chaining:

Chaining stores keys that hash to the same value in a separate list linked to each cell of the Hash Table.

We have used an Array of Linked List heads, though we could also use an Array of Vectors or ArrayLists, depending on the performance requirements of our program.

As an illustration, take 10 keys and the hash function key % 7; keys that produce the same value after the hash function is applied are stored in the LinkedList of the corresponding cell.

For example, 21, 49 and 56 all hash to 0 (21 % 7 = 49 % 7 = 56 % 7 = 0), so to avoid collisions we store all three in the LinkedList at index 0.
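The arithmetic above can be checked directly; a tiny sketch (the class name is ours, not from the article):

```java
// Quick check that 21, 49 and 56 all fall into bucket 0
// under the hash function h(key) = key % 7.
public class BucketCheck {
    public static void main(String[] args) {
        int m = 7; // table size
        int[] keys = {21, 49, 56};
        for (int key : keys) {
            System.out.println(key + " -> bucket " + (key % m));
        }
        // All three keys map to bucket 0, so they share one chain.
    }
}
```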

Performance of Chaining:

m = Number of Slots in Hash Table

n = Number of Keys to be inserted

Whenever you create a HashMap or HashSet in Java, or an unordered_set or unordered_map in C++, they provide an option to set the load factor.

The load factor defines how big your Hash Table should be; in other words, it is a trade-off between space and time.

If your Hash Table size (m) is small, or your load factor is large, there are going to be a lot of collisions, so we try to keep the load factor low.

Load Factor (α) = n/m

Expected Chain Length = α

The expected chain length is the average length of the chain, i.e., the LinkedList, ArrayList or Vector used to store keys that hash to the same value.

Expected Time to Search/Insert/Delete = O(1 + α)

Here, 1 accounts for computing the hash and indexing into the Hash Table, and α for traversing the chain.
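In Java this knob is exposed directly on the constructors; a minimal sketch:

```java
import java.util.HashMap;

// HashMap(initialCapacity, loadFactor): the table starts with 16 slots
// and rehashes to a larger size once n/m exceeds the given load factor.
public class LoadFactorDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> smallAlpha = new HashMap<>(16, 0.5f); // more space, fewer collisions
        HashMap<String, Integer> largeAlpha = new HashMap<>(16, 0.9f); // less space, more collisions
        smallAlpha.put("key", 1);
        largeAlpha.put("key", 1);
        System.out.println(smallAlpha.get("key") + " " + largeAlpha.get("key"));
    }
}
```

In C++, the analogous knob is `max_load_factor()` on `unordered_map`/`unordered_set`.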

Data Structures for Storing Chains:

1. Linked Lists

The time taken for Search, Insert and Delete would be O(l), where l is the length of the Linked List.

The disadvantage of Linked Lists is that they are not cache friendly, and they take some extra space for storing the references to the next nodes.

2. Dynamic Sized Arrays

Vectors in C++

ArrayList in Java

List in Python

The time taken for Search, Insert and Delete would be O(l), where l is the size of the dynamic array.

The advantage of dynamic sized arrays is that they are cache friendly.

3. Self Balancing BST (AVL Tree, Red Black Tree)

The time taken for Search, Insert and Delete would be O(log(l)), where l is the size of the chain.

One disadvantage of Self Balancing BSTs is that they are not cache friendly.

Since version 8, Java’s HashMap converts a bucket’s chain into a Self Balancing BST (a Red Black Tree) once the chain grows past a threshold, because of the O(log(l)) time complexity.

Implementation of Chaining:

We can implement the chaining in Java and C++ by creating an Array of LinkedList Headers.

In C++, the chains can be stored as a vector of lists; in Java, as an array of LinkedList objects. The insert, search and remove functions follow the same pattern in both languages.
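A minimal sketch in Java, assuming non-negative integer keys and h(key) = key % m (the class and method names are ours, not from the original article):

```java
import java.util.LinkedList;

// Chaining: one LinkedList per slot; colliding keys share a chain.
class ChainedHashTable {
    private final int m;                       // number of slots
    private final LinkedList<Integer>[] table; // one chain per slot

    @SuppressWarnings("unchecked")
    ChainedHashTable(int m) {
        this.m = m;
        this.table = new LinkedList[m];
        for (int i = 0; i < m; i++) {
            table[i] = new LinkedList<>();
        }
    }

    void insert(int key) {
        table[key % m].add(key);               // append to the chain at h(key)
    }

    boolean search(int key) {
        return table[key % m].contains(key);   // walk the chain at h(key)
    }

    void remove(int key) {
        table[key % m].remove(Integer.valueOf(key)); // remove by value, not index
    }
}
```

With m = 7, inserting 21, 49 and 56 puts all three into the chain at index 0, matching the example above.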

Open Addressing:

Open Addressing is an approach where we store all of the keys inside the Hash Table itself. Whenever a collision occurs, we handle it so that the key can still be inserted somewhere in the table. For this, we need a Hash Table of size larger than or equal to the number of elements to be stored.

Open Addressing is done in the following ways:

1. Linear Probing:

Linear probing is a method by which we search for the next empty slot one by one, linearly. If we find a slot already filled, we go to the next index, i.e., increment the index to be checked by one.

Let us consider a simple hash function as “key mod 7” and a sequence of keys as 50, 700, 76, 85, 92, 73, 101.

When we inserted the key 50, it went to index 1 (as 50%7 = 1).

Then we inserted 700 at index 0 and 76 at index 6, according to their hash values.

Then we inserted 85; it collided with the element at index 1 (85%7 = 1), so we searched for the next free slot, found index 2 free, and inserted 85 at index 2.

We continue to insert the elements at their indices, and whenever a collision occurs, we insert the element at the next free slot.

We search for the next empty slot in a circular way: for example, if an element collides at index 6, we search for an empty slot from index 0 up to index 5, because there might be free slots anywhere in the Hash Table.

Let’s see how actually search works in open addressing:

We compute the hash function, go to that index and compare; if we find the required element, we return true, otherwise we keep searching linearly.

We stop when one of the following three cases occurs:

When we found an empty slot

When we found the key

When we have traversed through the whole table circularly.

If we find the key, we return true; in the other two cases, we return false.

How to handle deletion in Open Addressing?

We cannot simply delete an element from its index, because we have defined the search function in such a way that it stops searching whenever it encounters an empty cell and returns false, even if the key is present beyond that empty cell.

So, to handle this, we mark the deleted slot as ‘Deleted’ instead of making the cell empty. The search function then continues its search even when it encounters a deleted slot.
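The insert, search and delete rules above can be sketched as follows, assuming non-negative integer keys, h(key) = key % m, and two sentinel values for empty and deleted slots (all names here are ours):

```java
import java.util.Arrays;

// Linear probing with "deleted" markers (tombstones).
class LinearProbingTable {
    private static final int EMPTY = -1;   // never-used slot: search stops here
    private static final int DELETED = -2; // tombstone: search continues past it
    private final int m;
    private final int[] table;

    LinearProbingTable(int m) {
        this.m = m;
        this.table = new int[m];
        Arrays.fill(table, EMPTY);
    }

    boolean insert(int key) {
        for (int i = 0; i < m; i++) {
            int idx = (key % m + i) % m;   // probe circularly, one step at a time
            if (table[idx] == EMPTY || table[idx] == DELETED) {
                table[idx] = key;
                return true;
            }
        }
        return false; // table is full
    }

    boolean search(int key) {
        for (int i = 0; i < m; i++) {
            int idx = (key % m + i) % m;
            if (table[idx] == EMPTY) return false; // truly empty: key cannot be further on
            if (table[idx] == key) return true;    // DELETED slots are skipped over
        }
        return false; // traversed the whole table circularly
    }

    boolean remove(int key) {
        for (int i = 0; i < m; i++) {
            int idx = (key % m + i) % m;
            if (table[idx] == EMPTY) return false;
            if (table[idx] == key) {
                table[idx] = DELETED; // mark as deleted; do not empty the slot
                return true;
            }
        }
        return false;
    }
}
```

Running the article’s sequence 50, 700, 76, 85, 92, 73, 101 with m = 7 reproduces the placements described above, and deleting 85 still leaves 92 findable because the tombstone keeps the probe chain intact.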

There’s a problem with the Linear Probing, and that is ‘Clustering’.

Clustering:

This problem arises from the linear search for an empty slot in Linear Probing.

Due to this, the elements are clustered, or grouped together contiguously. To resolve this issue, we use other types of probing such as Quadratic Probing or Double Hashing.

Let us compare the Linear Probing, Quadratic Probing and Double Hashing to understand the things better.

In Linear Probing, we use the following equation on the i’th probe. For example, when we want to insert a key and fail because the cell is not empty, we start probing with i = 1 and increment i by 1 each time the insertion fails.

hash(key, i) = (h(key) + i)%m

where, h(key) = key%m

In Quadratic Probing, we use the following probing equation when we are on the i’th probe.

hash(key, i) = (h(key) + i²)%m

where, h(key) = key%m

We can observe that in Linear Probing we increment the probed index by a linear term, while in Quadratic Probing we increment it by a quadratic term.

This quadratic term minimizes the problem of clustering. Because of it, most of the keys tend to occupy empty space uniformly across the table, and not in a contiguous block.

We still have the problem of clustering in Quadratic Probing, but we have managed to minimize it.

The remaining clustering comes from the fact that keys with the same initial hash value follow the same fixed probe sequence; this creates secondary clusters, which are milder than the primary clusters of Linear Probing.

Another problem with Quadratic Probing, which is not present in Linear Probing, is that it can fail to find an empty slot even if empty slots are present in the Hash Table. This was not a problem with Linear Probing, because there we search for an empty slot linearly.

It’s a mathematically proven fact that only when the Load Factor (α) < 0.5, and the value of ‘m’ is a prime number, can we ensure that Quadratic Probing will find an empty slot.

A Load Factor below 0.5 guarantees that more than half the slots are empty, i.e., that the number of empty slots exceeds the number of keys present.
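For comparison with the linear probing sketch above, only the probe step changes; a minimal sketch, again with our own names and h(key) = key % m:

```java
import java.util.Arrays;

// Quadratic probing: the i'th probe jumps by i*i instead of i.
// With m prime and load factor below 0.5, insertion is guaranteed
// to find a free slot; otherwise it may fail even if slots remain.
class QuadraticProbingTable {
    private static final int EMPTY = -1;
    private final int m;
    private final int[] table;

    QuadraticProbingTable(int m) {
        this.m = m;
        this.table = new int[m];
        Arrays.fill(table, EMPTY);
    }

    boolean insert(int key) {
        for (int i = 0; i < m; i++) {
            int idx = (key % m + i * i) % m; // quadratic term i²
            if (table[idx] == EMPTY) {
                table[idx] = key;
                return true;
            }
        }
        return false; // can fail even when empty slots remain
    }

    boolean search(int key) {
        for (int i = 0; i < m; i++) {
            int idx = (key % m + i * i) % m;
            if (table[idx] == EMPTY) return false;
            if (table[idx] == key) return true;
        }
        return false;
    }
}
```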

Double Hashing:

Double Hashing is a technique of probing where we use two hash functions simultaneously.

It uses a function such as,

hash(key, i) = (h1(key) + i*h2(key))%m

• If h2(key) is relatively prime to m, it always finds a free slot if one exists.
• It distributes keys more uniformly than Linear Probing and Quadratic Probing.
• No clustering.

Let us consider an example of Double Hashing with h1(key) = key%7 and h2(key) = 6 – (key%6).

We have designed the h2 hash function in such a way that it never returns 0, because if it did, hash(key, i) would probe the same location indefinitely.

During the first iteration, we want to insert 49 into the Hash Table, so we computed the hash value using the h1 hash function.

49%7 = 0, gave us a Hash Value of 0, so we inserted the key 49 at index 0.

In the second iteration, we want to insert 63 into the Hash Table. Its hash value under h1 is 0, which is already occupied by the key 49. Due to this collision, we find the next location using the double hash function.

h1(key) = (key%7) = (63%7) = 0

h2(key) = (6 – (63%6)) = (6 – 3) = 3

This is the first time we are handling a collision for this key, so the value of ‘i’ is 1.

Putting all these values in the double hash function,

hash(key, i) = (h1(key) + i*h2(key))%m

Here, m = 7

hash(63, 1) = (0 + 1*3)%7 = 3

We have got a value of 3, so 63 will be inserted at index 3.

In the third iteration, we want to insert 56 into the Hash Table. Its hash value under h1 is 0, which is already occupied by the key 49. Due to this collision, we find the next location using the double hash function.

h1(key) = (key%7) = (56%7) = 0

h2(key) = (6 – (56%6)) = (6 – 2) = 4

This is the first time we are handling a collision for this key, so the value of ‘i’ is 1.

Putting all these values in the double hash function,

hash(key, i) = (h1(key) + i*h2(key))%m

Here, m = 7

hash(56, 1) = (0 + 1*4)%7 = 4

We have got a value of 4, so 56 will be inserted at index 4.

In the fourth iteration, we want to insert 52 into the Hash Table. Its hash value under h1 is 3, which is already occupied by the key 63. Due to this collision, we find the next location using the double hash function.

h1(key) = (key%7) = (52%7) = 3

h2(key) = (6 – (52%6)) = (6 – 4) = 2

This is the first time we are handling a collision for this key, so the value of ‘i’ is 1.

Putting all these values in the double hash function,

hash(key, i) = (h1(key) + i*h2(key))%m

Here, m = 7

hash(52, 1) = (3 + 1*2)%7 = 5

We have got a value of 5, so 52 will be inserted at index 5.

In the fifth iteration, we want to insert 54 into the Hash Table. Its hash value under h1 is 5, which is already occupied by the key 52. Due to this collision, we find the next location using the double hash function.

h1(key) = (key%7) = (54%7) = 5

h2(key) = (6 – (54%6)) = (6 – 0) = 6

This is the first time we are handling a collision for this key, so the value of ‘i’ is 1.

Putting all these values in the double hash function,

hash(key, i) = (h1(key) + i*h2(key))%m

Here, m = 7

hash(54, 1) = (5 + 1*6)%7 = 4

We got a value of 4, so we try inserting 54 at index 4, but it is again filled, with 56. Due to this collision, we compute the hash value again using the double hash function.

This is the second time we are handling a collision for this key, so the value of ‘i’ is 2.

Putting all these values in the double hash function,

hash(key, i) = (h1(key) + i*h2(key))%m

Here, m = 7

hash(54, 2) = (5 + 2*6)%7 = 3

We got a value of 3, so we try inserting 54 at index 3, but it is again filled, with 63. Due to this collision, we compute the hash value again using the double hash function.

This is the third time we are handling a collision for this key, so the value of ‘i’ is 3.

Putting all these values in the double hash function,

hash(key, i) = (h1(key) + i*h2(key))%m

Here, m = 7

hash(54, 3) = (5 + 3*6)%7 = 2

We have got a value of 2, so 54 will be inserted at index 2.

During the sixth iteration, we want to insert 48 into the Hash Table, so we computed the hash value using the h1 hash function.

48%7 = 6, gave us a Hash Value of 6, so we inserted the key 48 at index 6.

The major benefit of the double hash function is that it causes fewer collisions with other keys because of its randomness. This reduces the time spent probing the table for empty slots.

Here’s a pseudo code for the Double Hashing,
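A sketch in Java, assuming h1(key) = key%m and h2(key) = (m − 1) − (key%(m − 1)) as in the worked example above (with m = 7); the class and method names are illustrative:

```java
import java.util.Arrays;

// Double hashing: the i'th probe jumps by i * h2(key).
class DoubleHashingTable {
    private static final int EMPTY = -1;
    private final int m;
    private final int[] table;

    DoubleHashingTable(int m) {
        this.m = m;
        this.table = new int[m];
        Arrays.fill(table, EMPTY);
    }

    private int h1(int key) { return key % m; }

    // Designed so it never returns 0, which would probe one slot forever.
    private int h2(int key) { return (m - 1) - (key % (m - 1)); }

    boolean insert(int key) {
        for (int i = 0; i < m; i++) {
            int idx = (h1(key) + i * h2(key)) % m;
            if (table[idx] == EMPTY) {
                table[idx] = key;
                return true;
            }
        }
        return false; // no free slot found along this probe sequence
    }

    boolean search(int key) {
        for (int i = 0; i < m; i++) {
            int idx = (h1(key) + i * h2(key)) % m;
            if (table[idx] == EMPTY) return false;
            if (table[idx] == key) return true;
        }
        return false;
    }

    int slotAt(int index) { return table[index]; } // for inspection
}
```

Inserting the sequence 49, 63, 56, 52, 54, 48 reproduces the placements derived in the worked example: 49 at index 0, 54 at 2, 63 at 3, 56 at 4, 52 at 5, 48 at 6.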

In this code, we simply add an offset to the probe position each time we find an occupied cell. This is similar to Linear Probing, where the offset was fixed at 1. In other words, Double Hashing is a generalised form of Linear Probing.
