Bloom Filter Implementation
Fundamentals of Bloom Filters
Scenario: An unsafe webpage blacklist contains 100 billion URLs, each occupying up to 64 bytes. Design a filtering system to check if a URL exists in the blacklist.
Requirements:
1) Allow false positive rate below 0.01%
2) Additional space must not exceed 30GB (≈30×10^9 bytes)
Analysis: Storing all URLs requires 64×10^10 bytes (640GB).
Key characteristics of hash functions:
1) Infinite input domain
2) Finite output range S, where identical inputs yield identical outputs (collisions may occur)
3) High-quality hash functions ensure uniform output distribution
Bloom filter components:
1) Bit array of size m
2) k independent high-quality hash functions with output range ≥m
Design Process
Step 1: Calculate bit array size m based on input count n and error rate p
Step 2: Determine hash function count k using n and m
Step 3: Verify actual error rate meets requirements
Bit array size formula:
[m=-\frac{n×\ln p}{(\ln 2)^2}]
Example claculation:
n=100 billion, p=0.0001
m≈19.19n → 2000 billion bits (25GB)
Hash function count formula:
[k=\ln 2×\frac{m}{n}≈0.7×\frac{m}{n}]
Example: k≈14 hash functions
Actual error rate:
[\left(1-e^{-\frac{nk}{m}}\right)^k≈0.006%]
Hash Function Applications
Large-Scale Data Analysis with Memory Constraints
Scenario: Find the most frequent 32-bit integer in a 2 billion-entry file (4B per integer) with 2GB memory limit.
Solution approach:
1) Split file into partitions using hash functions
2) Process each partition within memory limits
3) Handle data skew through multiple passes
Top-K Query Processing Straetgies
General solution pattern:
1) Partition data using hash functions
2) Use trie trees or hash maps for frequency counting
3) Apply min-heaps for local top-K extraction
4) Merge results across partitions
Implementation variants:
- Single machine with sufficient memory: Direct in-memory processing
- Multi-core systems: Parallel processing with load balancing
- Memory-constrained environments: Hierarchical partitioning
- Distributed systems: MapReduce framework with fault tolerance