Implementing Bloom Filters and Hash Function Applications

Bloom Filter Implementation

Fundamentals of Bloom Filters

Scenario: An unsafe webpage blacklist contains 100 billion URLs, each occupying up to 64 bytes. Design a filtering system to check if a URL exists in the blacklist.

Requirements:
1) Allow false positive rate below 0.01%
2) Additional space must not exceed 30GB (≈30×10^9 bytes)

Analysis: Storing all URLs requires 64×10^10 bytes (640GB).

Key characteristics of hash functions:

1) Infinite input domain
2) Finite output range S, where identical inputs yield identical outputs (collisions may occur)
3) High-quality hash functions ensure uniform output distribution

Bloom filter components:

1) Bit array of size m
2) k independent high-quality hash functions with output range ≥m

Design Process

Step 1: Calculate bit array size m based on input count n and error rate p
Step 2: Determine hash function count k using n and m
Step 3: Verify actual error rate meets requirements

Bit array size formula:

[m=-\frac{n×\ln p}{(\ln 2)^2}]

Example claculation:

n=100 billion, p=0.0001
m≈19.19n → 2000 billion bits (25GB)

Hash function count formula:

[k=\ln 2×\frac{m}{n}≈0.7×\frac{m}{n}]

Example: k≈14 hash functions

Actual error rate:

[\left(1-e^{-\frac{nk}{m}}\right)^k≈0.006%]

Hash Function Applications

Large-Scale Data Analysis with Memory Constraints

Scenario: Find the most frequent 32-bit integer in a 2 billion-entry file (4B per integer) with 2GB memory limit.

Solution approach:
1) Split file into partitions using hash functions
2) Process each partition within memory limits
3) Handle data skew through multiple passes

Top-K Query Processing Straetgies

General solution pattern:
1) Partition data using hash functions
2) Use trie trees or hash maps for frequency counting
3) Apply min-heaps for local top-K extraction
4) Merge results across partitions

Implementation variants:
- Single machine with sufficient memory: Direct in-memory processing
- Multi-core systems: Parallel processing with load balancing
- Memory-constrained environments: Hierarchical partitioning
- Distributed systems: MapReduce framework with fault tolerance

Tags: bloom-filter hash-functions data-structures algorithm-design distributed-systems

Posted on Wed, 24 Jun 2026 17:07:19 +0000 by BrandonK