Hash Tables: Implementation, Collision Handling, and Hash Algorithms

Hash Table Implementation

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 20

typedef struct {
    int key;
    char* value;
} KeyValue;

typedef struct {
    void* data;
    int count;
} Collection;

typedef struct {
    KeyValue* slots[TABLE_SIZE];
} SimpleHashMap;

SimpleHashMap* createMap() {
    return (SimpleHashMap*)malloc(sizeof(SimpleHashMap));
}

void destroyMap(SimpleHashMap* map) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (map->slots[i] != NULL) {
            free(map->slots[i]->value);
            free(map->slots[i]);
        }
    }
    free(map);
}

int computeIndex(int key) {
    return abs(key) % TABLE_SIZE;
}

void insert(SimpleHashMap* map, int key, const char* value) {
    KeyValue* entry = (KeyValue*)malloc(sizeof(KeyValue));
    entry->key = key;
    entry->value = (char*)malloc(strlen(value) + 1);
    strcpy(entry->value, value);
    int idx = computeIndex(key);
    map->slots[idx] = entry;
}

void erase(SimpleHashMap* map, int key) {
    int idx = computeIndex(key);
    if (map->slots[idx] != NULL) {
        free(map->slots[idx]->value);
        free(map->slots[idx]);
        map->slots[idx] = NULL;
    }
}

void collectAll(SimpleHashMap* map, Collection* result) {
    int count = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (map->slots[i] != NULL) count++;
    }
    
    KeyValue* entries = (KeyValue*)malloc(sizeof(KeyValue) * count);
    int pos = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (map->slots[i] != NULL) {
            entries[pos].key = map->slots[i]->key;
            entries[pos].value = (char*)malloc(strlen(map->slots[i]->value) + 1);
            strcpy(entries[pos].value, map->slots[i]->value);
            pos++;
        }
    }
    result->data = entries;
    result->count = count;
}

void collectKeys(SimpleHashMap* map, Collection* result) {
    int count = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (map->slots[i] != NULL) count++;
    }
    
    int* keys = (int*)malloc(sizeof(int) * count);
    int pos = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (map->slots[i] != NULL) {
            keys[pos++] = map->slots[i]->key;
        }
    }
    result->data = keys;
    result->count = count;
}

void collectValues(SimpleHashMap* map, Collection* result) {
    int count = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (map->slots[i] != NULL) count++;
    }
    
    char** vals = (char**)malloc(sizeof(char*) * count);
    int pos = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (map->slots[i] != NULL) {
            vals[pos++] = map->slots[i]->value;
        }
    }
    result->data = vals;
    result->count = count;
}

void display(SimpleHashMap* map) {
    Collection set;
    collectAll(map, &set);
    KeyValue* entries = (KeyValue*)set.data;
    for (int i = 0; i < set.count; i++) {
        printf("%d -> %s\n", entries[i].key, entries[i].value);
    }
    free(set.data);
}

A typical approach combines an array with linked structures to build hash tables. The process involves:

  • Creating a fixed-size array as the underlying bucket array
  • Converting keys to array indices through a hash function
  • Handling collisions (when different keys map to the same index) by storing multiple entries in the same bucket using linked lists or alternative structures
  • Implementing insert, search, and delete operations that traverse the bucket's linked structure when necessary

Hash Collisions

Multiple distinct inputs producing identical outputs is inevitable in hash tables.

A larger table capacity reduces the probability of multiple keys landing in the same bucket. Therefore, expanding the hash table mitigates collision frequency.

Similar to array resizing, hash table expansion requires migrating all entries, which is expenssive. Additionally, since the capacity changes, recalculating storage positions for all entries via the hash function further increases computational overhead. Consequently, most programming languages pre-allocate sufficient capacity to avoid frequent resizing operations.

The load factor measures collision severity and serves as the resize trigger. It is defined as the number of elements divided by bucket count. For instance, Java triggers a resize when the load factor exceeds 0.75, doubling the table capacity.

Since the hash function's input space typically vastly exceeds its output space, collisions are mathematically unavoidable. When mapping all integers to an array of fixed size, multiple integers inevitably resolve to identical bucket indices.

Collisions compromise correctness and degrade performance. A naive solution—resizing until collisions disappear—is too inefficient. Improved strategies include:

  1. Modifying the hash table structure to function correctly despite collisions
  2. Resizing only when collisions become severe

The primary structural improvements are separate chaining and open addressing.

Separate Chaining

typedef struct ListNode {
    KeyValue* kv;
    struct ListNode* next;
} ListNode;

typedef struct {
    int entryCount;
    int slotCount;
    double loadThreshold;
    int growthFactor;
    ListNode** buckets;
} ChainedHashMap;

ChainedHashMap* createChainedMap() {
    ChainedHashMap* map = (ChainedHashMap*)malloc(sizeof(ChainedHashMap));
    map->entryCount = 0;
    map->slotCount = 4;
    map->loadThreshold = 0.667;
    map->growthFactor = 2;
    map->buckets = (ListNode**)malloc(map->slotCount * sizeof(ListNode*));
    for (int i = 0; i < map->slotCount; i++) {
        map->buckets[i] = NULL;
    }
    return map;
}

void freeChainedMap(ChainedHashMap* map) {
    for (int i = 0; i < map->slotCount; i++) {
        ListNode* current = map->buckets[i];
        while (current) {
            ListNode* temp = current;
            current = current->next;
            free(temp->kv);
            free(temp);
        }
    }
    free(map->buckets);
    free(map);
}

int computeSlot(ChainedHashMap* map, int key) {
    return key % map->slotCount;
}

double currentLoad(ChainedHashMap* map) {
    return (double)map->entryCount / (double)map->slotCount;
}

char* lookup(ChainedHashMap* map, int key) {
    int idx = computeSlot(map, key);
    ListNode* current = map->buckets[idx];
    while (current) {
        if (current->kv->key == key) {
            return current->kv->value;
        }
        current = current->next;
    }
    return "";
}

void insertEntry(ChainedHashMap* map, int key, const char* value) {
    if (currentLoad(map) > map->loadThreshold) {
        rehash(map);
    }
    
    int idx = computeSlot(map, key);
    ListNode* current = map->buckets[idx];
    while (current) {
        if (current->kv->key == key) {
            free(current->kv->value);
            current->kv->value = (char*)malloc(strlen(value) + 1);
            strcpy(current->kv->value, value);
            return;
        }
        current = current->next;
    }
    
    KeyValue* kv = (KeyValue*)malloc(sizeof(KeyValue));
    kv->key = key;
    kv->value = (char*)malloc(strlen(value) + 1);
    strcpy(kv->value, value);
    
    ListNode* newNode = (ListNode*)malloc(sizeof(ListNode));
    newNode->kv = kv;
    newNode->next = map->buckets[idx];
    map->buckets[idx] = newNode;
    map->entryCount++;
}

void rehash(ChainedHashMap* map) {
    ListNode** oldBuckets = map->buckets;
    int oldSize = map->slotCount;
    
    map->slotCount *= map->growthFactor;
    map->buckets = (ListNode**)malloc(map->slotCount * sizeof(ListNode*));
    for (int i = 0; i < map->slotCount; i++) {
        map->buckets[i] = NULL;
    }
    map->entryCount = 0;
    
    for (int i = 0; i < oldSize; i++) {
        ListNode* current = oldBuckets[i];
        while (current) {
            insertEntry(map, current->kv->key, current->kv->value);
            ListNode* temp = current;
            current = current->next;
            free(temp->kv);
            free(temp);
        }
    }
    free(oldBuckets);
}

void deleteEntry(ChainedHashMap* map, int key) {
    int idx = computeSlot(map, key);
    ListNode* current = map->buckets[idx];
    ListNode* previous = NULL;
    
    while (current) {
        if (current->kv->key == key) {
            if (previous) {
                previous->next = current->next;
            } else {
                map->buckets[idx] = current->next;
            }
            free(current->kv);
            free(current);
            map->entryCount--;
            return;
        }
        previous = current;
        current = current->next;
    }
}

void showMap(ChainedHashMap* map) {
    for (int i = 0; i < map->slotCount; i++) {
        printf("[");
        ListNode* current = map->buckets[i];
        while (current) {
            printf("%d -> %s, ", current->kv->key, current->kv->value);
            current = current->next;
        }
        printf("]\n");
    }
}

In a basic hash table, each bucket holds exactly one entry. Separate chaining transforms each bucket into a linked list, storing all colliding entries in the same list.

Operations change as follows:

  • Search: Hash the key to determine the bucket, then traverse the linked list to locate the target entry
  • Insert: Navigate to the bucket's linked list and append the new node
  • Delete: Locate the bucket's linked list, find the target node, and remove it

Limitations of separate chaining:

  • Memory overhead: Linked list nodes require addditional pointers, consuming more memory than simple arrays
  • Search degradation: Linear traversal through the list becomes increasingly slow as chains grow longer

Note: When lists grow excessively long, consider converting them to self-balancing trees (AVL or Red-Black trees) to achieve O(log n) search complexity instead of O(n).

Open Addressing

Open addressing avoids auxiliary data structures and resolves collisions through repeated probing. Common strategies include linear probing, quadratic probing, and double hashing.

Linear Probing

Linear probing uses fixed-step sequential search for collision resolution.

  • Insert: Compute the bucket index. If occupied, linearly search forward (step = 1) until finding a empty slot
  • Search: On collision, continue linear traversal with the same step. Return the value if found; return null if encountering an empty bucket

This approach tends to produce clustering, where consecutive occupied slots grow longer, increasing collision probability for subsequent insertions and degrading all operations.

Direct deletion is problematic in open addressing. Removing an entry creates an empty bucket, causing subsequent searches to terminate prematurely and incorrectly report missing entries.

Lazy deletion solves this by marking deleted slots with a special TOMBSTONE constant. Both null and tombstone represent empty buckets, but searches must continue past tombstones since entries may exist beyond them.

However, lazy deletion accelerates performance degradation as tombstone accumulation increases search time. An optimization tracks the first tombstone encountered and swaps found entries toward the ideal position, keeping elements closer to their hash positions and improving search efficiency.

typedef struct {
    int entryCount;
    int slotCount;
    double loadThreshold;
    int growthFactor;
    KeyValue** slots;
    KeyValue* tombstone;
} OpenAddressMap;

OpenAddressMap* createOpenMap() {
    OpenAddressMap* map = (OpenAddressMap*)malloc(sizeof(OpenAddressMap));
    map->entryCount = 0;
    map->slotCount = 4;
    map->loadThreshold = 0.667;
    map->growthFactor = 2;
    map->slots = (KeyValue**)malloc(sizeof(KeyValue*) * map->slotCount);
    map->tombstone = (KeyValue*)malloc(sizeof(KeyValue));
    map->tombstone->key = -1;
    map->tombstone->value = "";
    return map;
}

void freeOpenMap(OpenAddressMap* map) {
    for (int i = 0; i < map->slotCount; i++) {
        KeyValue* entry = map->slots[i];
        if (entry != NULL && entry != map->tombstone) {
            free(entry->value);
            free(entry);
        }
    }
    free(map->slots);
    free(map->tombstone);
    free(map);
}

int computeSlotIndex(OpenAddressMap* map, int key) {
    return key % map->slotCount;
}

double getLoadFactor(OpenAddressMap* map) {
    return (double)map->entryCount / (double)map->slotCount;
}

int locateSlot(OpenAddressMap* map, int key) {
    int idx = computeSlotIndex(map, key);
    int firstTombstone = -1;
    
    while (map->slots[idx] != NULL) {
        if (map->slots[idx]->key == key) {
            if (firstTombstone != -1) {
                map->slots[firstTombstone] = map->slots[idx];
                map->slots[idx] = map->tombstone;
                return firstTombstone;
            }
            return idx;
        }
        
        if (firstTombstone == -1 && map->slots[idx] == map->tombstone) {
            firstTombstone = idx;
        }
        
        idx = (idx + 1) % map->slotCount;
    }
    
    return firstTombstone == -1 ? idx : firstTombstone;
}

char* search(OpenAddressMap* map, int key) {
    int idx = locateSlot(map, key);
    if (map->slots[idx] != NULL && map->slots[idx] != map->tombstone) {
        return map->slots[idx]->value;
    }
    return "";
}

void put(OpenAddressMap* map, int key, char* value) {
    if (getLoadFactor(map) > map->loadThreshold) {
        expand(map);
    }
    
    int idx = locateSlot(map, key);
    if (map->slots[idx] != NULL && map->slots[idx] != map->tombstone) {
        free(map->slots[idx]->value);
        map->slots[idx]->value = (char*)malloc(strlen(value) + 1);
        strcpy(map->slots[idx]->value, value);
        return;
    }
    
    KeyValue* entry = (KeyValue*)malloc(sizeof(KeyValue));
    entry->key = key;
    entry->value = (char*)malloc(strlen(value) + 1);
    strcpy(entry->value, value);
    map->slots[idx] = entry;
    map->entryCount++;
}

void remove(OpenAddressMap* map, int key) {
    int idx = locateSlot(map, key);
    if (map->slots[idx] != NULL && map->slots[idx] != map->tombstone) {
        free(map->slots[idx]->value);
        free(map->slots[idx]);
        map->slots[idx] = map->tombstone;
        map->entryCount--;
    }
}

void expand(OpenAddressMap* map) {
    KeyValue** oldSlots = map->slots;
    int oldCap = map->slotCount;
    
    map->slotCount *= map->growthFactor;
    map->slots = (KeyValue**)malloc(sizeof(KeyValue*) * map->slotCount);
    map->entryCount = 0;
    
    for (int i = 0; i < oldCap; i++) {
        KeyValue* entry = oldSlots[i];
        if (entry != NULL && entry != map->tombstone) {
            put(map, entry->key, entry->value);
            free(entry->value);
            free(entry);
        }
    }
    free(oldSlots);
}

void dump(OpenAddressMap* map) {
    for (int i = 0; i < map->slotCount; i++) {
        KeyValue* entry = map->slots[i];
        if (entry == NULL) {
            printf("NULL\n");
        } else if (entry == map->tombstone) {
            printf("TOMBSTONE\n");
        } else {
            printf("%d -> %s\n", entry->key, entry->value);
        }
    }
}

Quadratic Probing

Quadratic probing offsets collisions by squared probe counts: 1, 4, 9, 16, ... steps.

  • Attempts to mitigate linear probing's clustering by skipping quadratic distances
  • Helps achieve more uniform data distribution across the table

However, quadratic probing is imperfect:

  • Clustering still occurs; certain positions remain more favorable than others
  • Quadratic growth may not explore the entire table, potentially missing empty slots

Double Hashing

Double hashing employs multiple hash functions f1(x), f2(x), f3(x), ... for probing.

  • Insert: If f1(x) collides, try f2(x), continuing until finding an empty slot
  • Search: Follow the same function sequence until locating the target or exhausting all probes

Compared to linear probing, double hashing distributes entries more uniformly but incurs higher computational overhead from evaluating multiple hash functions.

Hash Algorithms

The previous sections covered hash table mechanics and collision resolution. However, both open addressing and separate chaining merely ensure functionality under collisions—they cannot reduce collision frequency.

Excessive collisions severely degrade performance. In separate chaining, ideal distribution spreads entries evenly across buckets for optimal O(1) search. Worst case places all entries in a single bucket, degrading to O(n).

Entry distribution depends on the hash function. Recalling the mapping formula:

index = hash(key) % capacity

With fixed capacity, the hash algorithm determines output distribution. Therefore, reducing collisions requires careful hash algorithm design.

Hash Algorithm Goals

A robust hash table demands these properties from the hash function:

  • Determinism: Identical inputs always produce identical outputs
  • Efficiency: Hash computation should be fast; lower overhead improves practicality
  • Uniformity: Entries should distribute evenly across the table, minimizing collision probability

Hash algorithms apply beyond data structures:

  • Password storage: Systems store password hashes rather than plaintext; input passwords are hashed and compared against stored hashes for authentication
  • Data integrity verification: Senders compute and attach checksums; receivers recompute and compare to detect corruption

Cryptographic applications require stronger properties:

  • One-wayness: Hash values reveal nothing about original input
  • Collision resistance: Finding two different inputs with identical hashes should be computationally infeasible
  • Avalanche effect: Small input changes should produce drastically different, unpredictable outputs

Note that uniformity and collision resistance are independent requirements. A function like key % 100 produces uniform distribution for random keys but trivially reveals keys from hash values since all keys with identical last two digits map to the same output.

Simple Hash Functions

int additiveHash(char* key) {
    long long result = 0;
    const int MOD = 1000000007;
    for (int i = 0; i < strlen(key); i++) {
        result = (result + (unsigned char)key[i]) % MOD;
    }
    return (int)result;
}

int multiplicativeHash(char* key) {
    long long result = 0;
    const int MOD = 1000000007;
    for (int i = 0; i < strlen(key); i++) {
        result = (31 * result + (unsigned char)key[i]) % MOD;
    }
    return (int)result;
}

int xorHash(char* key) {
    int result = 0;
    const int MOD = 1000000007;
    for (int i = 0; i < strlen(key); i++) {
        result ^= (unsigned char)key[i];
    }
    return result & MOD;
}

int rotationalHash(char* key) {
    long long result = 0;
    const int MOD = 1000000007;
    for (int i = 0; i < strlen(key); i++) {
        result = ((result << 4) ^ (result >> 28) ^ (unsigned char)key[i]) % MOD;
    }
    return (int)result;
}

Notice the final modulo operation against 1000000007 (a large prime) in every function. Why use prime moduli?

Large primes maximize uniform hash value distribution because primes share no common factors with most numbers, reducing periodic patterns from the modulo operation and minimizing collisions.

For example, choosing composite 9 as the modulus means all keys divisible by 3 map to only three hash values (0, 3, 6). If input keys exhibit such arithmetic progression, hash values cluster severely.

Replacing the modulus with prime 13 eliminates shared factors between keys and modulus, producing better distribution.

If keys are truly random and uniformly distributed, prime or composite moduli both yield uniform outputs. However, when key distributions contain periodicity, composite moduli more readily exhibit clustering.

Therefore, selecting a large prime as the modulus enhances hash algorithm robustness by diminishing periodic patterns.

Common Hash Algorithms

Production hash algorithms include MD5, SHA-1, SHA-2 family, and SHA-3. MD5 validates file integrity but is cryptographically broken. SHA-1 is deprecated for security applications. SHA-2 (including SHA-256) powers security protocols like TLS and digital signatures. SHA-3 offers alternative construction with different security properties.

Programming languages provide built-in hash functions for their native types, computing bucket indices in standard library hash tables. Typically, only immutable objects support hashing.

Tags: Hash Table Data Structures collision handling chaining open addressing

Posted on Fri, 05 Jun 2026 19:01:27 +0000 by bguzel