Understanding Bloom Filters: Efficient Probabilistic Set Membership

Solving Cache Penetration

Consider a typical product lookup service:

public Product getProductById(Long id) {
    Product product = cache.get(id);
    if (product != null) {
        return product;
    }
    product = database.query(id);
    if (product != null) {
        cache.put(id, product);
    }
    return product;
}

If a product ID does not exist in either the cache or database, the system repeatedly queries the database for non-existent keys — a scenario known as cache penetration. A naive fix is to store empty placeholders with short TTLs, but this wastes memory at scale.

The core challenge: How can we quickly and efficiently determine whether an item is likely absent from a dataset, using minimal resources?

Bloom filters answer this by trading a small probability of false positives for drastic reductions in memory and latency.

Core Principles

A Bloom filter consists of:

A fixed-length bit array (initially all 0s)
K independent hash functions that map input elements to positions in the array

Insertion: When adding an element, apply all K hash functions. Each produces an index in the bit array; set those bits to 1.

Query: To check membership, hash the element using the same K functions. If any bit at the resulting positions is 0, the element is definitely absent. If all bits are 1, the element is possibly present — but there’s a chance of false positives.

For example, with a 16-bit array and 3 hash functions:

Element A hashes to positions 2, 7, 13 → set bits 2, 7, 13 to 1
Element B hashes to positions 5, 7, 11 → set bits 5, 7, 11 to 1

Now, querying for element C that hashes to positions 2, 5, 13: all bits are 1. The filter returns "possibly present," even if C was never inserted — a false positive.

False Positive Rate

The probability of a false positive depends on four parameters:

m: size of the bit array
n: number of inserted elements
k: number of hash functions
p: target false positive rate

The optimal bit array size and number of hash functions are derived as:

m = - (n * ln p) / (ln 2)²

k = (m / n) * ln 2

Increasing m reduces false positives. Increasing k improves accuracy but increases computation cost. There’s a mathematical optimum for each n and p.

Deletion Limitation

Bloom filters do not support deletion. Removing a bit might affect other elements that hash to the same position. This is an intentional design trade-off to preserve space efficiency.

Performance Characteristics

Space complexity: O(m)
Insert and query time: O(k)
No growth in memory or latency with more elements

Hash Function Selection

Non-cryptographic hashes are preferred for speed and uniform distribution:

MurmurHash3 — fast, low collision, widely adopted
FNV-1a — simple and efficient
Jenkins Hash — good statistical properties

Implementation with Guava

Google Guava provides a simple, in-JVM Bloom filter implementation:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>31.0.1-jre</version>
</dependency>

BloomFilter<Long> filter = BloomFilter.create(
    Funnels.longFunnel(),
    10000,     // expected insertions
    0.001      // target false positive rate
);

// Insert elements
filter.put(1L);
filter.put(2L);
filter.put(3L);

// Query
boolean mightExist = filter.mightContain(2L); // true
boolean definitelyMissing = filter.mightContain(9999L); // likely false

Guava’s internal logic computes optimal m and k using the same formulas above. The mightContain() method returns false only if at least one hash position is unset — guaranteeing no false negatives.

Implementation with Redisson

Redisson offers distributed Bloom filters backed by Redis bitmaps:

<dependency>
    <groupId>org.redisson</groupId>
    <artifactId>redisson</artifactId>
    <version>3.16.1</version>
</dependency>

RBloomFilter<Long> bloomFilter = redissonClient.getBloomFilter("productIds");
bloomFilter.tryInit(10000, 0.001);

bloomFilter.add(1L);
bloomFilter.add(2L);

boolean exists = bloomFilter.contains(1L); // true

Under the hood, Redisson uses Redis’s BITSET structure. Each Bloom filter is stored as a Redis string (up to 512MB), with individual bits manipulated via SETBIT and GETBIT commands. Configuration (size, hash count) is stored in a Redis hash.

Insertion logic:

Compute K hash values from the input
Map each to a bit index within the Redis bit array
Use pipelined SETBIT operations to set bits atomically

Practical Considerations

Cache Penetration Mitigation

Integrate the Bloom filter as a pre-check before querying cache or database:

Query Bloom filter for ID
If mightContain(id) == false → return 404 immediately
If true → proceed to cache → if cache miss → query DB → cache result

This eliminates >99% of non-existent key queries from reaching the database.

Handling Element Deletion

Since standard Bloom filters don’t support deletion, two practical approaches exist:

Counting Bloom Filter: Replace bits with small counters. Increment on insert, decrement on delete. However, this increases memory usage significant and risks counter overflow.

Rebuild-on-Update: Periodically regenerate the filter from the authoritative dataset:

Triggger a background job to fetch all active product IDs
Build a new Bloom filter with fresh data
Atomically switch application references from old to new filter
Garbage collect the old filter after a grace period

This approach is simple, reliable, and avoids the complexity of counter-based variants.

Why Bloom Filters Are Widely Adopted

Bloom filters exemplify elegant engineering: they provide near-constant time membership checks with minimal memory, at the cost of a tunable false positive rate. In systems where false positives are acceptable (e.g., avoiding unnecessary disk reads), the trade-off is overwhelmingly beneficial.

They are not a silver bullet — but they are a powerful tool in the performance engineer’s toolkit.

Tags: BloomFilter Redisson guava cache-penetration bitmask

Posted on Sun, 24 May 2026 19:43:02 +0000 by facets

Freaks City