Data structures and Algorithms in Javascript

Introduction to HashTables

What is a Hash Table?

Overview of Hash Tables

A hash table, also known as a hash map, is a widely used data structure in computer science and software development. It is designed to store and retrieve data in an efficient manner. Hash tables are built on the principle of key-value pairs, making them particularly suitable for situations where you need to associate some data (the value) with a unique identifier (the key).

Key-Value Pairs

At the core of a hash table are key-value pairs. Each key is a unique identifier, while the associated value is the data you want to store or retrieve. Keys can be any data type but are typically strings or numbers. In this sense, hash tables resemble arrays, except that their keys are not ordered.

Hash Functions

Hash tables rely on a crucial component known as a hash function. This function takes an input (in this case, the key) and transforms it into a fixed-size numerical value, which is used to determine the location within the hash table where the associated value should be stored. The primary goal of a good hash function is to ensure that keys are distributed evenly across the hash table’s storage buckets. An even distribution minimizes collisions, which can slow down retrieval.

Hashtable Example

Hash tables in different languages

  • Python has Dictionaries
  • Javascript has Maps and Objects (with some restrictions)
  • Java, Go & Scala have Maps
  • Ruby has Hashes
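For instance, JavaScript's built-in Map already behaves as a hash table, so the basic operations can be tried directly (the inventory example below is purely illustrative):

```javascript
// JavaScript's built-in Map is a hash-table-backed key-value store.
const inventory = new Map();
inventory.set('apples', 5);   // insert a key-value pair
inventory.set('oranges', 3);
inventory.set('apples', 7);   // setting an existing key updates its value

console.log(inventory.get('apples'));  // 7
console.log(inventory.has('pears'));   // false
inventory.delete('oranges');
console.log(inventory.size);           // 1
```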

Importance in Modern Computing

Here are five real-world applications of hash tables in modern computing:

DNS Resolution: Hash tables play a pivotal role in Domain Name System (DNS) resolution, enabling the efficient mapping of domain names to IP addresses. By storing and retrieving this information quickly, hash tables ensure the seamless translation of human-readable domain names into machine-readable IP addresses, a cornerstone of internet connectivity.

Checking Duplicates: Hash tables are instrumental in identifying and eliminating duplicate values in data. By storing unique values in the table and efficiently comparing new data against existing entries, hash tables help ensure data integrity and prevent redundant information, crucial in databases, contact lists, and numerous data-driven applications.

Databases and Indexing: Hash tables are frequently used in database management systems (DBMS) to implement hash-based indexing. They accelerate query performance by allowing for quick and direct access to data records based on their unique keys. Hash indexing is commonly employed in scenarios where exact match lookups are prevalent.

Caches and Memory Management: Hash tables are a fundamental component of caching mechanisms in various software applications, including web browsers and operating systems. They help manage the cache contents, ensuring that frequently accessed data is readily available in memory. This reduces the need for time-consuming data retrieval operations from slower storage media.

Symbol Tables in Compilers: Compilers use hash tables to store and retrieve variable and function names. This speeds up the compilation process, enables the resolution of references, and detects errors in the source code. Symbol tables are essential for correct and efficient code compilation.

Implementation of Hash Tables

Anatomy of a Hash Table

A hash table is a data structure composed of an array, with each element known as a “bucket.” These buckets serve as storage units for organizing and managing data. The fundamental principle of a hash table is to map unique keys to specific bucket locations through a hash function. This mapping ensures efficient data retrieval, reducing the need for time-consuming searches through the entire dataset.

Buckets and Linked Lists

Each bucket can hold one or more key-value pairs. In practice, collision handling mechanisms are often employed, such as using linked lists within buckets. When multiple keys hash to the same bucket, a linked list is used to manage and store these key-value pairs, allowing for efficient access and management of the data.

Key-Value Pair Storage

The key-value pairs stored within buckets are the heart of the hash table. Keys serve as unique identifiers, while values represent associated data. This combination allows for rapid data retrieval, where given a key, the corresponding value can be quickly located. Key-value storage is the key to the hash table’s versatility and wide-ranging applications across software development and data management.

The Role of Hash Functions

Hash functions are the engine behind hash tables. They take an input, typically a key, and transform it into a fixed-size numerical value, known as a hash code or hash value. This value determines the location within the hash table where the associated value should be stored or retrieved. The primary role of hash functions is to ensure an even distribution of keys across the table, minimizing collisions and facilitating efficient data access.

Characteristics of a Good Hash Function

A good hash function exhibits several critical characteristics, including uniformity, determinism, efficiency, and resistance to collisions.

  1. Fast (i.e., constant time)
  2. Distributes outputs uniformly across indices rather than clustering them
  3. Deterministic (the same input always yields the same output)
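To illustrate these characteristics, here is a minimal djb2-style string hash; the seed 5381 and multiplier 33 are conventional but arbitrary choices, so treat this as a sketch rather than a production hash:

```javascript
// A minimal djb2-style string hash: fast, deterministic,
// and reasonably uniform for short keys (illustrative only).
function simpleHash(key, tableSize) {
  let hash = 5381;                       // arbitrary seed
  for (let i = 0; i < key.length; i++) {
    // multiply-and-add, reduced modulo the table size each step
    hash = (hash * 33 + key.charCodeAt(i)) % tableSize;
  }
  return hash;
}

console.log(simpleHash('pink', 13));                             // 8
console.log(simpleHash('pink', 13) === simpleHash('pink', 13));  // true (deterministic)
```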

Building Custom Hash Functions

In certain applications, developers create custom hash functions tailored to specific data types or requirements. Custom hash functions should maintain the essential characteristics of good hash functions while addressing the unique needs of the application, such as handling special data structures or optimizing performance for specific use cases. Custom hash functions can significantly enhance the efficiency and effectiveness of hash tables in various contexts.

Collision Resolution Methods

Collisions occur when different keys generate the same hash value, necessitating effective collision resolution methods. Two primary techniques are commonly employed in hash tables: chaining and open addressing.

Separate Chaining

In the chaining approach, each bucket in the hash table contains a linked list or another data structure. When a collision occurs, the new key-value pair is added to the linked list within the corresponding bucket. Chaining is easy to implement and accommodates multiple values for the same hash code.

In the example below, multiple items with the hash key of 4 are stored next to each other in the linked list for that bucket.

hashtable chaining – collision resolution method

Pros:

  1. Simplicity: It’s relatively straightforward to implement separate chaining using linked lists or other data structures.
  2. Efficiency in Handling Collisions: Separate chaining effectively manages collisions by storing multiple key-value pairs in the same slot.
  3. Dynamic Size: It allows for dynamic resizing, so the table can grow or shrink as needed, optimizing space.

Cons:

  1. Increased Memory Usage: Separate chaining can consume more memory due to the storage of linked lists or other data structures in each slot.
  2. Less Predictable Performance: While it’s generally efficient, the performance can be less predictable, particularly when dealing with long linked lists.
  3. Slower Iteration: Traversing elements in separate chains can be slower than other methods.

Open Addressing

Open addressing resolves collisions by searching for the next available slot in the hash table. This method includes linear probing, quadratic probing, and double hashing, which determine the next location for the key-value pair based on a predefined algorithm. Open addressing directly stores data within the hash table, making it suitable for scenarios where space efficiency is a concern. It’s efficient in minimizing storage overhead and handling collisions.

In the example below, the salmon, having the same hash key value of 4 as the previous element (tomato), is positioned in the next available slot, index 5.

hashtable linear probing
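The probing step can be sketched as a small function; the forced hash and the tomato/salmon keys below are only for demonstration:

```javascript
// Minimal linear-probing insert (illustrative sketch).
// Slots hold [key, value] pairs; null means empty.
function probeInsert(table, key, value, hash) {
  const start = hash(key) % table.length;
  for (let i = 0; i < table.length; i++) {
    const index = (start + i) % table.length;  // wrap around the array
    const slot = table[index];
    if (slot === null || slot[0] === key) {
      table[index] = [key, value];             // first free (or matching) slot
      return index;
    }
  }
  throw new Error('hash table is full');
}

// Two keys that collide at index 4 end up in adjacent slots.
const table = new Array(8).fill(null);
const hash = () => 4;                     // forced collision for demonstration
probeInsert(table, 'tomato', 1, hash);    // lands at index 4
probeInsert(table, 'salmon', 2, hash);    // probes to index 5
```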

Pros:

  1. Space Efficiency: Open addressing typically consumes less memory as it doesn’t store additional data structures.
  2. Predictable Performance: Open addressing can offer more predictable performance because it avoids data structures like linked lists.
  3. Cache-Friendly: It can be more cache-friendly since elements are stored contiguously.

Cons:

  1. Complex Deletion: Deleting elements in open addressing can be more complex, especially when dealing with probing and maintaining clustering.
  2. Load Factor Sensitivity: Open addressing is more sensitive to load factors, which can affect its efficiency.
  3. Difficulty in Resizing: Resizing open addressing tables can be challenging and less straightforward than separate chaining.

Both chaining and open addressing offer effective means to manage collisions, ensuring that hash tables maintain their fast and efficient data retrieval capabilities even in the presence of duplicate hash values. The choice between these methods depends on the specific use case and the performance considerations of the application.

Hash Table Operations

For hash table operations, we will implement the following:

  • Set to insert a key-value pair
  • Get to retrieve a value given its key
  • Remove to delete a key-value pair
  • Hashing function to convert a string key to a numeric index

We will start with a HashTable class and implement its properties and methods. In the constructor, we initialize a fixed-length array (keyMap) whose elements serve as the buckets. The constructor receives the size as a parameter.

class HashTable {
  constructor(size) {
    this.keyMap = new Array(size);
    this.size = size;
  }
  // rest of the methods below
}

Insertion

Insertion in a hash table involves using the hash function to find the index for a new key-value pair.

The code snippet below shows how to handle hash collisions in a set method for a hash table. It calculates the hash index, checks whether a bucket (an array for storing key-value pairs) exists at that index, and appends the new key-value pair to the bucket. If an item with the same key already exists, the method simply updates its value.

This ensures efficient collision management for storing and retrieving key-value pairs in the hash table.

/**
 * @param {string} key
 * @param {string|number} value
 */ 
set(key, value) {
    const index = this.hash(key);
    const bucket = this.keyMap[index];

    if (!bucket) {
      this.keyMap[index] = [[key, value]];
    } else {
      const existingItemWithSameKey = bucket.find(([k]) => k === key);

      if (existingItemWithSameKey) {
        existingItemWithSameKey[1] = value;
      } else {
        bucket.push([key, value]);
      }
    }
  }
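To see the collision handling in action, here is a self-contained sketch pairing the constructor and set method with a stub hash that sends every key to the same index (purely for demonstration):

```javascript
// Minimal HashTable with the constructor and set from this lesson,
// plus a stub hash so the collision behaviour is visible.
class HashTable {
  constructor(size) {
    this.keyMap = new Array(size);
    this.size = size;
  }
  hash() { return 2; }  // stub: send every key to index 2 for demonstration
  set(key, value) {
    const index = this.hash(key);
    const bucket = this.keyMap[index];
    if (!bucket) {
      this.keyMap[index] = [[key, value]];
    } else {
      const existing = bucket.find(([k]) => k === key);
      if (existing) existing[1] = value;  // same key: update in place
      else bucket.push([key, value]);     // new key: chain in the bucket
    }
  }
}

const ht = new HashTable(4);
ht.set('pink', '#ff69b4');
ht.set('cyan', '#00ffff');
ht.set('pink', '#ffc0cb');   // same key: value is updated, not duplicated
console.log(ht.keyMap[2]);
// [ [ 'pink', '#ffc0cb' ], [ 'cyan', '#00ffff' ] ]
```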

Retrieval

Given a key, the hash function generates its index, enabling rapid access to the associated value. The below code calculates the hash index, checks if a bucket exists at that index, and looks for an existing key-value pair within the bucket, returning the associated value if found.

/**
* @param {string} key
* @return {string|undefined}
*/
get(key) {
  const index = this.hash(key);
  const bucket = this.keyMap[index];
  if (bucket) {
    const existingItemWithSameKey = bucket.find(item => item[0] === key);
    if (existingItemWithSameKey) {
      return existingItemWithSameKey[1]; // return the value, not the [key, value] pair
    }
  }
}

The efficiency of retrieval is often close to O(1), making hash tables invaluable for quick data access in various applications, from dictionaries to databases.

Deletion

Deletion involves removing a key-value pair from the hash table.

The below code computes the hash index, verifies the existence of a bucket at that index, and searches for an existing key-value pair within the bucket. If found, it removes the pair from the bucket.

/**
 * @param {string} key
 */
remove(key) {
  const index = this.hash(key);
  const bucket = this.keyMap[index];
  if (bucket) {
    const existingItemWithSameKey = bucket.find(item => item[0] === key);
    if (existingItemWithSameKey) {
      // splice(startIndex, deleteCount): remove exactly one pair
      bucket.splice(bucket.indexOf(existingItemWithSameKey), 1);
    }
  }
}

Hashing

The hashing function converts a string key to a numeric index.

The code below iterates through the characters in the key, converting each one to a numeric code and combining it with the running total using a prime multiplier.

It uses a prime number (31), which helps spread the keys out more evenly and reduces the likelihood of collisions. It is also helpful if the array you are putting values into has a prime length.

The value 100 caps how many characters of the key are processed, preventing excessive work on very long keys. This balance between efficiency and hash quality is arbitrary and can be adjusted as needed.

The table below displays the frequency of collisions for two candidate table lengths. It is evident that the prime length (8191) has far fewer collisions than the non-prime length (8192).

Hash function    8191 (prime)    8192
a + c            1.92            3510
/**
 * Converts a string key to a numeric index within the table
 * @param {string} key
 * @return {number} index
 */
hash(key) {
  let total = 0;
  const RANDOM_PRIME = 31;
  // process at most 100 characters to keep hashing fast for long keys
  for (let i = 0; i < Math.min(key.length, 100); i++) {
    const currentChar = key[i];
    // map lowercase letters to their alphabet position (e.g. 'a' -> 1)
    const charValue = currentChar.charCodeAt(0) - 96;
    total = (total * RANDOM_PRIME + charValue) % this.size;
  }
  return total;
}
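The prime-versus-non-prime effect can also be explored empirically. The sketch below counts slot collisions for synthetic keys; the key set and multiplier are arbitrary, so exact counts will differ from the table above:

```javascript
// Count how many keys land in an already-occupied slot for a given table size.
function countCollisions(keys, size) {
  const seen = new Array(size).fill(false);
  let collisions = 0;
  for (const key of keys) {
    let h = 0;
    for (let i = 0; i < key.length; i++) {
      h = (h * 31 + key.charCodeAt(i)) % size;  // same prime multiplier as the lesson
    }
    if (seen[h]) collisions++;
    seen[h] = true;
  }
  return collisions;
}

// Synthetic keys; with a reasonable hash, a prime table size tends to
// spread keys at least as well as a nearby non-prime size.
const keys = Array.from({ length: 1000 }, (_, i) => 'key' + i);
console.log(countCollisions(keys, 8191)); // prime length
console.log(countCollisions(keys, 8192)); // non-prime length
```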

Display

You can also display all the key-value pairs in the hash table with the code below. Note that it iterates over the keyMap buckets created in the constructor.

display() {
  for (const bucket of this.keyMap) {
    if (bucket) {
      for (const [key, value] of bucket) {
        console.log(key, value);
      }
    }
  }
}

Find the full code here.
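Assembled into a single runnable class, the operations from this lesson look roughly like the sketch below. The hash here uses raw character codes rather than the letter-position offset, which is an implementation choice for this sketch:

```javascript
// A runnable sketch of the complete HashTable from this lesson.
class HashTable {
  constructor(size = 53) {
    this.keyMap = new Array(size);
    this.size = size;
  }

  // Convert a string key to an index (raw char codes; illustrative choice)
  hash(key) {
    let total = 0;
    const PRIME = 31;
    for (let i = 0; i < Math.min(key.length, 100); i++) {
      total = (total * PRIME + key.charCodeAt(i)) % this.size;
    }
    return total;
  }

  set(key, value) {
    const index = this.hash(key);
    if (!this.keyMap[index]) this.keyMap[index] = [];
    const bucket = this.keyMap[index];
    const existing = bucket.find(([k]) => k === key);
    if (existing) existing[1] = value;   // update in place
    else bucket.push([key, value]);      // chain in the same bucket
  }

  get(key) {
    const bucket = this.keyMap[this.hash(key)];
    const pair = bucket && bucket.find(([k]) => k === key);
    return pair ? pair[1] : undefined;
  }

  remove(key) {
    const bucket = this.keyMap[this.hash(key)];
    if (!bucket) return;
    const i = bucket.findIndex(([k]) => k === key);
    if (i !== -1) bucket.splice(i, 1);
  }
}

const colors = new HashTable(17);
colors.set('pink', '#ff69b4');
colors.set('cyan', '#00ffff');
console.log(colors.get('pink'));  // '#ff69b4'
colors.remove('pink');
console.log(colors.get('pink'));  // undefined
```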

Hash table operations are the backbone of efficient data management. Their speed and reliability make them a cornerstone of countless applications, from managing caches and dictionaries to powering database systems.

Time and Space Complexity of Hash Tables

Time Complexity

In hash tables, common operations such as insertion, retrieval, deletion, and updating are designed for efficiency. These operations often have an average time complexity of O(1) due to the hash function’s ability to distribute keys uniformly. However, the efficiency of these operations can vary based on factors like the quality of the hash function, collision handling, and load factor. Analyzing these operations helps in understanding and optimizing hash table performance.

Best and Worst Case Scenarios

Hash table performance is assessed in terms of best and worst-case scenarios. The best case is when there are no collisions, and operations are consistently fast, close to O(1).

Hashtable functions    Average case
insert                 O(1)
deletion               O(1)
access                 O(1)

In the worst case, numerous collisions occur, and operations can degrade to O(n), where n is the number of elements.

Understanding these extremes guides developers in selecting the right hash function, collision resolution method, and table size to ensure optimal performance under various conditions.

Space Complexity

Hash tables use memory for both the data itself (key-value pairs) and the underlying data structure, including buckets, linked lists, or arrays. Efficient memory usage is essential, particularly when working with large datasets or in memory-constrained environments.

Factors Influencing Space Complexity

Several factors affect the space complexity of hash tables. The primary factors include the size of the hash table (number of buckets), the number of key-value pairs stored, and the chosen collision resolution method. The distribution of keys and the load factor (ratio of stored data to total capacity) also impact space efficiency.

Strategies for Space Optimization

To optimize space usage, developers can consider strategies such as dynamic resizing (increasing or decreasing the number of buckets as needed), load factor management, and selecting an appropriate collision resolution method. Efficient memory management ensures that a hash table consumes the least amount of memory while maintaining its core functionality. Space optimization is critical in resource-constrained systems and when working with extensive datasets, ensuring that memory is used efficiently.

Performance Considerations

Load Factor

The load factor, a crucial performance metric, represents the ratio of stored data to the total capacity of the hash table.

Load factor = (number of items in the hash table) / (total number of slots)

For example, take a case of 5 buckets with 2 buckets filled. This hash table has a load factor of 2/5, or 0.4.
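That arithmetic can be written directly as a small helper:

```javascript
// Load factor = occupied items / total slots.
function loadFactor(itemCount, slotCount) {
  return itemCount / slotCount;
}

console.log(loadFactor(2, 5)); // 0.4
```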

When the load factor exceeds 1, it indicates an overflow of items compared to available array slots. When this happens, it’s essential to perform resizing, which involves creating a new array approximately twice the size of the original. All items must be re-inserted into this expanded hash table using the hash function. Although resizing is costly, it’s typically infrequent, and on average, hash tables maintain an O(1) time complexity, even when resizing is considered.

A well-managed load factor ensures that the hash table remains efficient by minimizing collisions. Maintaining a load factor within an acceptable range (often around 0.7) prevents excessive collisions and supports consistent O(1) retrieval times.

Rehashing

Rehashing is a process triggered when the load factor exceeds a predefined threshold. It involves resizing the hash table to accommodate more elements. Rehashing ensures the table remains efficient as it grows, preventing performance degradation. It’s a dynamic approach that optimizes memory usage and maintains fast data retrieval.
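A resize method can be layered on this lesson's design. The sketch below bundles a minimal table so it runs on its own; method and property names follow the lesson, but the details are illustrative:

```javascript
// Minimal table with the shape used in this lesson (keyMap, size, hash, set).
class HashTable {
  constructor(size) { this.keyMap = new Array(size); this.size = size; }
  hash(key) {
    let total = 0;
    for (let i = 0; i < key.length; i++) {
      total = (total * 31 + key.charCodeAt(i)) % this.size;
    }
    return total;
  }
  set(key, value) {
    const index = this.hash(key);
    if (!this.keyMap[index]) this.keyMap[index] = [];
    const bucket = this.keyMap[index];
    const existing = bucket.find(([k]) => k === key);
    if (existing) existing[1] = value;
    else bucket.push([key, value]);
  }
  get(key) {
    const bucket = this.keyMap[this.hash(key)];
    const pair = bucket && bucket.find(([k]) => k === key);
    return pair && pair[1];
  }

  // Rehashing: allocate a larger bucket array and re-insert every pair,
  // since each key's index depends on the table size.
  resize(newSize) {
    const oldBuckets = this.keyMap;
    this.keyMap = new Array(newSize);
    this.size = newSize;  // hash() depends on size, so update it first
    for (const bucket of oldBuckets) {
      if (!bucket) continue;
      for (const [key, value] of bucket) this.set(key, value);
    }
  }
}

const ht = new HashTable(3);
ht.set('a', 1); ht.set('b', 2); ht.set('c', 3);
ht.resize(7);              // all keys re-hashed into the larger array
console.log(ht.get('b'));  // 2
```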

Choosing the Right Hash Table Size

Selecting the initial size of a hash table is essential. It should strike a balance between space and time efficiency. A size too small may lead to frequent resizing, while one too large results in wasted memory. The right size depends on the expected number of elements and the desired load factor.

Comparison of Hashtables with Arrays, LinkedLists

Hash tables offer array-like speed when looking up values by key and are as efficient as linked lists for insertions and deletions. This dual advantage makes them a versatile data structure. However, it’s crucial to prevent worst-case scenarios, where hash tables can become sluggish in all these operations. The key to avoiding these issues lies in sidestepping collisions, and this necessitates maintaining:

  • A low load factor
  • An effective hash function
           Hash Tables (Average)    Hash Tables (Worst)    Arrays    Linked Lists
Search     O(1)                     O(n)                   O(1)      O(n)
Insert     O(1)                     O(n)                   O(n)      O(1)
Delete     O(1)                     O(n)                   O(n)      O(1)