- Solving Cache Penetration
Consider a typical product lookup service:
public Product getProductById(Long id) {
Product product = cache.get(id);
if (product != null) {
return product;
}
product = database.query(id);
if (product != null) {
cache.put(id, product);
}
return product;
}
If a product ID does not exist in either the cache or database, the system repeatedly queries the database for non-existent keys — a scenario known as cache penetration. A naive fix is to store empty placeholders with short TTLs, but this wastes memory at scale.
The core challenge: How can we quickly and efficiently determine whether an item is likely absent from a dataset, using minimal resources?
Bloom filters answer this by trading a small probability of false positives for drastic reductions in memory and latency.
- Core Principles
A Bloom filter consists of:
- A fixed-length bit array (initially all 0s)
- K independent hash functions that map input elements to positions in the array
Insertion: When adding an element, apply all K hash functions. Each produces an index in the bit array; set those bits to 1.
Query: To check membership, hash the element using the same K functions. If any bit at the resulting positions is 0, the element is definitely absent. If all bits are 1, the element is possibly present — but there’s a chance of false positives.
For example, with a 16-bit array and 3 hash functions:
- Element A hashes to positions 2, 7, 13 → set bits 2, 7, 13 to 1
- Element B hashes to positions 5, 7, 11 → set bits 5, 7, 11 to 1
Now, querying for element C that hashes to positions 2, 5, 13: all bits are 1. The filter returns "possibly present," even if C was never inserted — a false positive.
False Positive Rate
The probability of a false positive depends on four parameters:
- m: size of the bit array
- n: number of inserted elements
- k: number of hash functions
- p: target false positive rate
The optimal bit array size and number of hash functions are derived as:
m = - (n * ln p) / (ln 2)²
k = (m / n) * ln 2
Increasing m reduces false positives. Increasing k improves accuracy but increases computation cost. There’s a mathematical optimum for each n and p.
Deletion Limitation
Bloom filters do not support deletion. Removing a bit might affect other elements that hash to the same position. This is an intentional design trade-off to preserve space efficiency.
Performance Characteristics
- Space complexity: O(m)
- Insert and query time: O(k)
- No growth in memory or latency with more elements
Hash Function Selection
Non-cryptographic hashes are preferred for speed and uniform distribution:
- MurmurHash3 — fast, low collision, widely adopted
- FNV-1a — simple and efficient
- Jenkins Hash — good statistical properties
- Implementation with Guava
Google Guava provides a simple, in-JVM Bloom filter implementation:
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>31.0.1-jre</version>
</dependency>
BloomFilter<Long> filter = BloomFilter.create(
Funnels.longFunnel(),
10000, // expected insertions
0.001 // target false positive rate
);
// Insert elements
filter.put(1L);
filter.put(2L);
filter.put(3L);
// Query
boolean mightExist = filter.mightContain(2L); // true
boolean definitelyMissing = filter.mightContain(9999L); // likely false
Guava’s internal logic computes optimal m and k using the same formulas above. The mightContain() method returns false only if at least one hash position is unset — guaranteeing no false negatives.
- Implementation with Redisson
Redisson offers distributed Bloom filters backed by Redis bitmaps:
<dependency>
<groupId>org.redisson</groupId>
<artifactId>redisson</artifactId>
<version>3.16.1</version>
</dependency>
RBloomFilter<Long> bloomFilter = redissonClient.getBloomFilter("productIds");
bloomFilter.tryInit(10000, 0.001);
bloomFilter.add(1L);
bloomFilter.add(2L);
boolean exists = bloomFilter.contains(1L); // true
Under the hood, Redisson uses Redis’s BITSET structure. Each Bloom filter is stored as a Redis string (up to 512MB), with individual bits manipulated via SETBIT and GETBIT commands. Configuration (size, hash count) is stored in a Redis hash.
Insertion logic:
- Compute K hash values from the input
- Map each to a bit index within the Redis bit array
- Use pipelined
SETBIToperations to set bits atomically
- Practical Considerations
Cache Penetration Mitigation
Integrate the Bloom filter as a pre-check before querying cache or database:
- Query Bloom filter for ID
- If
mightContain(id) == false→ return 404 immediately - If
true→ proceed to cache → if cache miss → query DB → cache result
This eliminates >99% of non-existent key queries from reaching the database.
Handling Element Deletion
Since standard Bloom filters don’t support deletion, two practical approaches exist:
Counting Bloom Filter: Replace bits with small counters. Increment on insert, decrement on delete. However, this increases memory usage significant and risks counter overflow.
Rebuild-on-Update: Periodically regenerate the filter from the authoritative dataset:
- Triggger a background job to fetch all active product IDs
- Build a new Bloom filter with fresh data
- Atomically switch application references from old to new filter
- Garbage collect the old filter after a grace period
This approach is simple, reliable, and avoids the complexity of counter-based variants.
- Why Bloom Filters Are Widely Adopted
Bloom filters exemplify elegant engineering: they provide near-constant time membership checks with minimal memory, at the cost of a tunable false positive rate. In systems where false positives are acceptable (e.g., avoiding unnecessary disk reads), the trade-off is overwhelmingly beneficial.
They are not a silver bullet — but they are a powerful tool in the performance engineer’s toolkit.