Bitfilters DuckDB Extension
A high-performance DuckDB extension providing probabilistic data structures for fast set membership testing and approximate duplicate detection. This extension implements state-of-the-art filter algorithms including Quotient filters, XOR filters, Binary Fuse filters, and soon Bloom filters.
bitfilters
provides space-efficient probabilistic data structures that can answer βIs element X in set S?β with:
No false negatives: If the filter says an element is not present, itβs definitely not there
Possible false positives: If the filter says an element is present, it might be there (with configurable probability)
You may find it useful to use this module in combination with the hashfuncs
extension to produce the hash outputs that these filters require. Most filters require a UBIGINT
or unsigned 64-bit integer value as their input type. You can use the functions provided by hashfuncs
or the DuckDB hash()
function to produce those values from most DuckDB data types.
Installation
bitfilters
is a DuckDB Community Extension.
You can install and use this extension with the following SQL commands:
FROM community;
INSTALL bitfilters LOAD bitfilters;
For more information about DuckDB extensions, see the official documentation.
What are bitfilters?
Bitfilters are probabilistic data structures that provide fast, memory-efficient approximate set membership testing. They are designed to answer the question βIs element X in set S?β with:
- Space efficiency: Use significantly less memory than storing the actual set
- Speed: Extremely fast lookups (typically O(1) or O(k) where k is small)
- No false negatives: If a filter says βNOβ, the element is definitely not in the set
- Possible false positives: If a filter says βYESβ, the element might be in the set
Common Use Cases
- Pre-filtering expensive operations: Avoid costly disk I/O or network calls for non-existent data
- Duplicate detection: Quickly identify potential duplicates in large datasets
- Cache optimization: Determine if data might be in cache before expensive lookups
- Data skipping: Skip irrelevant data partitions in analytical queries
- Set operations: Approximate set intersections and unions on massive datasets
- Database join optimization: Pre-filter join candidates to reduce computation
Performance Benefits
-- Without bitfilters: Expensive operation on every row
SELECT expensive_function(data)
FROM large_table
WHERE complex_condition(data);
-- With bitfilters: Pre-filter to reduce expensive operations by 90%+
SELECT expensive_function(lt.data)
FROM large_table lt
JOIN precomputed_filters pf ON lt.partition = pf.partition
WHERE filter_contains(pf.filter, lt.key) -- Fast filter check
AND complex_condition(lt.data); -- Expensive check only when needed
Available Filters
1. Quotient Filters
Space-efficient filters that support deletion and resizing operations. Learn more about Quotient Filters.
Functions:
quotient_filter(q, r, hash_value)
- Create filter from hash valuesquotient_filter_contains(filter, hash_value)
- Test membership
Characteristics:
- Use case: Dynamic datasets, applications requiring deletion
- Pros: Supports deletion, better cache locality, resizable
- Cons: More complex implementation, slightly slower than Bloom filters
- Memory: Similar to Bloom filters but with better cache performance
-- Create a quotient filter with q=16 and r=4
CREATE TABLE quotient_filters AS (
SELECT id % 2 AS remainder,
16, 4, hash(id)) AS filter
quotient_filter(FROM series_data
GROUP BY id % 2
);
-- Test membership
SELECT quotient_filter_contains(filter, hash(12345)) AS might_exist
FROM quotient_filters WHERE remainder = 1;
2. XOR Filters
Modern high-performance filters with better space efficiency than Bloom filters. Read the XOR Filters paper for technical details.
Functions:
xor16_filter(hash_value)
- Create 16-bit XOR filter from hash valuesxor8_filter(hash_value)
- Create 8-bit XOR filter from hash valuesxor16_filter_contains(filter, hash_value)
- Test membership in 16-bit filterxor8_filter_contains(filter, hash_value)
- Test membership in 8-bit filter
Characteristics:
- Use case: Read-heavy workloads, static datasets
- Pros: Better space efficiency (~20% less memory), faster queries
- Cons: Static size, more complex construction, no incremental updates
- Memory: ~1.23 bits per element for 1% false positive rate
-- Create XOR filters (both 8-bit and 16-bit versions)
CREATE TABLE xor_filters AS (
SELECT id % 2 AS remainder,
hash(id)) AS xor16_filter,
xor16_filter(hash(id)) AS xor8_filter
xor8_filter(FROM series_data
GROUP BY id % 2
);
-- Test membership
SELECT
hash(12345)) AS in_xor16,
xor16_filter_contains(xor16_filter, hash(12345)) AS in_xor8
xor8_filter_contains(xor8_filter, FROM xor_filters WHERE remainder = 1;
3. Binary Fuse Filters
Latest generation filters with optimal space usage and excellent performance. See the Binary Fuse Filters paper for implementation details.
Functions:
binary_fuse16_filter(hash_value)
- Create 16-bit Binary Fuse filter from hash valuesbinary_fuse8_filter(hash_value)
- Create 8-bit Binary Fuse filter from hash valuesbinary_fuse16_filter_contains(filter, hash_value)
- Test membership in 16-bit filterbinary_fuse8_filter_contains(filter, hash_value)
- Test membership in 8-bit filter
Characteristics:
- Use case: Applications requiring minimal memory footprint
- Pros: Best space efficiency, fast construction and queries
- Cons: Static size, newer algorithm with less production experience
- Memory: Significantly more space-efficient than other filters
-- Create Binary Fuse filters (both 8-bit and 16-bit versions)
CREATE TABLE binary_fuse_filters AS (
SELECT id % 2 AS remainder,
hash(id)) AS binary_fuse16_filter,
binary_fuse16_filter(hash(id)) AS binary_fuse8_filter
binary_fuse8_filter(FROM series_data
GROUP BY id % 2
);
-- Test membership
SELECT
hash(12345)) AS in_fuse16,
binary_fuse16_filter_contains(binary_fuse16_filter, hash(12345)) AS in_fuse8
binary_fuse8_filter_contains(binary_fuse8_filter, FROM binary_fuse_filters WHERE remainder = 1;
Usage Examples
Basic Filter Operations
-- Create test data
CREATE TABLE series_data AS (
SELECT * AS id FROM generate_series(1, 100000) AS id
);
-- Create quotient filters with q=16 and r=4
CREATE TABLE quotient_filters AS (
SELECT id % 2 AS remainder,
16, 4, hash(id)) AS filter
quotient_filter(FROM series_data
GROUP BY id % 2
);
-- Test membership (should find all matching elements)
SELECT remainder,
COUNT(CASE WHEN quotient_filter_contains(filter, hash(id)) THEN 1 ELSE NULL END) AS matches
FROM series_data, quotient_filters
WHERE series_data.id % 2 = quotient_filters.remainder
GROUP BY remainder;
βββββββββββββ¬ββββββββββremainder β matches β
β
β int64 β int64 β
βββββββββββββΌββββββββββ€0 β 50000 β
β 1 β 50000 β
β
βββββββββββββ΄ββββββββββ
-- Check false positives (elements not in the filter that test positive)
SELECT remainder,
COUNT(CASE WHEN quotient_filter_contains(filter, hash(id)) THEN 1 ELSE NULL END) AS false_positives
FROM series_data, quotient_filters
WHERE series_data.id % 2 != quotient_filters.remainder
GROUP BY remainder;
βββββββββββββ¬ββββββββββββββββββremainder β false_positives β
β
β int64 β int64 β
βββββββββββββΌββββββββββββββββββ€0 β 2264 β
β 1 β 2273 β
β βββββββββββββ΄ββββββββββββββββββ
XOR Filter Examples
-- Create XOR filters (both 8-bit and 16-bit)
CREATE TABLE xor_filters AS (
SELECT id % 2 AS remainder,
hash(id)) AS xor16_filter,
xor16_filter(hash(id)) AS xor8_filter
xor8_filter(FROM series_data
GROUP BY id % 2
);
-- Verify all elements are found (no false negatives)
SELECT remainder,
COUNT(CASE WHEN xor16_filter_contains(xor16_filter, hash(id)) THEN 1 ELSE NULL END) AS xor16_matches,
COUNT(CASE WHEN xor8_filter_contains(xor8_filter, hash(id)) THEN 1 ELSE NULL END) AS xor8_matches
FROM series_data, xor_filters
WHERE series_data.id % 2 = xor_filters.remainder
GROUP BY remainder;
βββββββββββββ¬ββββββββββββββββ¬βββββββββββββββremainder β xor16_matches β xor8_matches β
β
β int64 β int64 β int64 β
βββββββββββββΌββββββββββββββββΌβββββββββββββββ€0 β 50000 β 50000 β
β 1 β 50000 β 50000 β
β
βββββββββββββ΄ββββββββββββββββ΄βββββββββββββββ
-- Compare filter performance
SELECT
'XOR16' AS filter_type,
AS size_bytes
octet_length(xor16_filter) FROM xor_filters WHERE remainder = 0
UNION ALL
SELECT
'XOR8' AS filter_type,
AS size_bytes
octet_length(xor8_filter) FROM xor_filters WHERE remainder = 0;
βββββββββββββββ¬βββββββββββββ
β filter_type β size_bytes βvarchar β int64 β
β
βββββββββββββββΌβββββββββββββ€123076 β
β XOR16 β 61546 β
β XOR8 β βββββββββββββββ΄βββββββββββββ
Binary Fuse Filter Examples
-- Create Binary Fuse filters
CREATE TABLE binary_fuse_filters AS (
SELECT id % 2 AS remainder,
hash(id)) AS binary_fuse16_filter,
binary_fuse16_filter(hash(id)) AS binary_fuse8_filter
binary_fuse8_filter(FROM series_data
GROUP BY id % 2
);
-- Test all elements are found
SELECT remainder,
COUNT(CASE WHEN binary_fuse16_filter_contains(binary_fuse16_filter, hash(id)) THEN 1 ELSE NULL END) AS fuse16_matches,
COUNT(CASE WHEN binary_fuse8_filter_contains(binary_fuse8_filter, hash(id)) THEN 1 ELSE NULL END) AS fuse8_matches
FROM series_data, binary_fuse_filters
WHERE series_data.id % 2 = binary_fuse_filters.remainder
GROUP BY remainder;
βββββββββββββ¬βββββββββββββββββ¬ββββββββββββββββremainder β fuse16_matches β fuse8_matches β
β
β int64 β int64 β int64 β
βββββββββββββΌβββββββββββββββββΌββββββββββββββββ€0 β 50000 β 50000 β
β 1 β 50000 β 50000 β
β
βββββββββββββ΄βββββββββββββββββ΄ββββββββββββββββ
-- Check false positive rates
SELECT remainder,
COUNT(CASE WHEN binary_fuse16_filter_contains(binary_fuse16_filter, hash(id)) THEN 1 ELSE NULL END) AS fuse16_false_positives,
COUNT(CASE WHEN binary_fuse8_filter_contains(binary_fuse8_filter, hash(id)) THEN 1 ELSE NULL END) AS fuse8_false_positives
FROM series_data, binary_fuse_filters
WHERE series_data.id % 2 != binary_fuse_filters.remainder
GROUP BY remainder;
βββββββββββββ¬βββββββββββββββββββββββββ¬ββββββββββββββββββββββββremainder β fuse16_false_positives β fuse8_false_positives β
β
β int64 β int64 β int64 β
βββββββββββββΌβββββββββββββββββββββββββΌββββββββββββββββββββββββ€0 β 1 β 171 β
β 1 β 1 β 199 β
β βββββββββββββ΄βββββββββββββββββββββββββ΄ββββββββββββββββββββββββ
Practical Use Case: User Activity Tracking
-- Track active users by day using quotient filters
CREATE TABLE user_activity AS (
SELECT
CURRENT_DATE - (random() * 30)::INTEGER) AS activity_date,
(random() * 1000000)::INTEGER AS user_id
(FROM generate_series(1, 5000000)
);
-- Create daily quotient filters for active users
CREATE TABLE daily_user_filters AS (
SELECT
activity_date,16, 4, hash(user_id)) AS user_filter,
quotient_filter(COUNT(DISTINCT user_id) AS actual_unique_users
FROM user_activity
GROUP BY activity_date
);
-- Check if specific users were active on specific dates
SELECT
activity_date,hash(12345)) AS user_12345_active,
quotient_filter_contains(user_filter, hash(67890)) AS user_67890_active,
quotient_filter_contains(user_filter,
actual_unique_usersFROM daily_user_filters
ORDER BY activity_date;
-- Find days when specific users might have been active
WITH target_users AS (
SELECT unnest([12345, 67890, 11111, 99999]) AS user_id
)SELECT
tu.user_id,
duf.activity_date,hash(tu.user_id)) AS possibly_active,
quotient_filter_contains(duf.user_filter, EXISTS(
SELECT 1 FROM user_activity ua
WHERE ua.user_id = tu.user_id
AND ua.activity_date = duf.activity_date
AS actually_active
) FROM target_users tu
CROSS JOIN daily_user_filters duf
WHERE quotient_filter_contains(duf.user_filter, hash(tu.user_id))
ORDER BY tu.user_id, duf.activity_date;
Filter Comparison Example
-- Compare all filter types side by side
WITH sample_data AS (
SELECT hash(id) AS hash_value
FROM generate_series(1, 1000000) AS id
),AS (
all_filters SELECT
20, 4, hash_value) AS qf,
quotient_filter(AS xor16,
xor16_filter(hash_value) AS xor8,
xor8_filter(hash_value) AS bf16,
binary_fuse16_filter(hash_value) AS bf8
binary_fuse8_filter(hash_value) FROM sample_data
)SELECT
'Quotient Filter' AS filter_type,
AS size_bytes,
octet_length(qf) / 1000000.0 AS bytes_per_element
octet_length(qf) FROM all_filters
UNION ALL
SELECT
'XOR16 Filter',
octet_length(xor16),/ 1000000.0
octet_length(xor16) FROM all_filters
UNION ALL
SELECT
'XOR8 Filter',
octet_length(xor8),/ 1000000.0
octet_length(xor8) FROM all_filters
UNION ALL
SELECT
'Binary Fuse16 Filter',
octet_length(bf16),/ 1000000.0
octet_length(bf16) FROM all_filters
UNION ALL
SELECT
'Binary Fuse8 Filter',
octet_length(bf8),/ 1000000.0
octet_length(bf8) FROM all_filters;
ββββββββββββββββββββββββ¬βββββββββββββ¬ββββββββββββββββββββ
β filter_type β size_bytes β bytes_per_element βvarchar β int64 β double β
β
ββββββββββββββββββββββββΌβββββββββββββΌββββββββββββββββββββ€Filter β 917544 β 0.917544 β
β Quotient Filter β 2460076 β 2.460076 β
β XOR16 Filter β 1230046 β 1.230046 β
β XOR8 Filter β 2261024 β 2.261024 β
β Binary Fuse16 Filter β 1130524 β 1.130524 β
β Binary Fuse8 ββββββββββββββββββββββββ΄βββββββββββββ΄ββββββββββββββββββββ
Best Practices
β Doβs
- Use a hash function to create consistent hash values for filter operations
- Store filters in materialized views for reuse across queries
- Combine filters with exact checks for final results
- Monitor actual false positive rates in production
- Choose appropriate bit-width (8 vs 16) based on memory/accuracy tradeoffs
β Donβts
- Donβt rely on filters for exact results without confirmation
- Donβt mix hash values from different sources in the same filter
- Donβt rebuild filters frequently for dynamic data
- Donβt use filters for very small datasets (overhead not worth it)
- Donβt forget that quotient filter parameters (q, r) affect capacity and accuracy
-- Good: Use filter to pre-screen, then exact check
SELECT expensive_operation(data)
FROM large_table lt
JOIN precomputed_filter pf ON lt.partition = pf.partition
WHERE quotient_filter_contains(pf.filter, hash(lt.key)) -- Fast pre-filter
AND exact_expensive_condition(lt.data); -- Exact check
-- Bad: Using filter as final arbiter
SELECT data
FROM large_table lt
JOIN precomputed_filter pf ON lt.partition = pf.partition
WHERE quotient_filter_contains(pf.filter, hash(lt.key)); -- May have false positives!
API Reference
Quotient Filter Functions
quotient_filter(q, r, hash_value)
Creates a quotient filter with 2^q slots and r remainder bits.
Parameters:
q
(INTEGER
): Log2 of the number of slots (capacity = 2^q)r
(INTEGER
): Number of remainder bits (affects accuracy)hash_value
(UBIGINT
): Hash values to insert into the filter
Returns:
BLOB containing the serialized quotient filter
Example
-- Create filter with 2^16 = 65536 slots and 4 remainder bits
SELECT quotient_filter(16, 4, hash(user_id)) FROM users;
quotient_filter_contains(filter, hash_value)
Tests if a quotient filter may contain a hash value.
Parameters:
filter
(BLOB): Serialized quotient filterhash_value
(UBIGINT): Hash value to test
Returns:
BOOLEAN
true
: Hash value might be in the set (possible false positive)false
: Hash value is definitely not in the set (no false negatives)
XOR Filter Functions
xor16_filter(hash_value)
/ xor8_filter(hash_value)
Creates XOR filters with 16-bit or 8-bit fingerprints.
Parameters:
hash_value
(UBIGINT
): Hash values to insert into the filter
Returns:
BLOB
containing the serialized XOR filter
Example:
SELECT xor16_filter(hash(product_id)) FROM products;
SELECT xor8_filter(hash(product_id)) FROM products; -- Smaller but higher FPR
xor16_filter_contains(filter, hash_value)
/ xor8_filter_contains(filter, hash_value)
Tests if an XOR filter may contain a hash value.
Parameters:
filter
(BLOB
): Serialized XOR filterhash_value
(UBIGINT
): Hash value to test
Returns:
BOOLEAN
(same semantics as quotient filters)
Binary Fuse Filter Functions
binary_fuse16_filter(hash_value)
/ binary_fuse8_filter(hash_value)
Creates Binary Fuse filters with 16-bit or 8-bit fingerprints.
Parameters:
hash_value
(UBIGINT
): Hash values to insert into the filter
Returns:
BLOB
containing the serialized Binary Fuse filter
Example:
SELECT binary_fuse16_filter(hash(transaction_id)) FROM transactions;
SELECT binary_fuse8_filter(hash(transaction_id)) FROM transactions;
binary_fuse16_filter_contains(filter, hash_value)
/ binary_fuse8_filter_contains(filter, hash_value)
Tests if a Binary Fuse filter may contain a hash value.
Parameters:
filter
(BLOB
): Serialized Binary Fuse filterhash_value
(BIGINT
): Hash value to test
Returns:
BOOLEAN
(same semantics as other filters)
Filter Characteristics Summary
Filter Type | Bits per Element | False Positive Rate | Notes | ||
---|---|---|---|---|---|
Quotient | Variable | Depends on q,r | Supports deletion | ||
XOR16 | ~9-10 bits | ~0.4% | High performance | ||
XOR8 | ~9-10 bits | ~0.4% | Smaller size | ||
Binary Fuse16 | ~9-10 bits | ~0.4% | Best space efficiency | ||
Binary Fuse8 | ~9-10 bits | ~1.5% | Smallest size |
Integration with Other Extensions
Using with hashfuncs
Extension
The hashfuncs
extension provides additional hash functions that can improve filter performance and distribution.
-- Load both extensions
LOAD hashfuncs;
LOAD bitfilters;
-- Use specialized hash functions for better distribution
SELECT quotient_filter(16, 4, xxh64(complex_key || salt))
FROM my_table;
Hash Function Recommendations
For optimal performance, consider these hash functions from the hashfuncs
extension:
Limitations
- Hash-based Input: All filters require hash values as input
- Static Size: XOR and Binary Fuse filters cannot be resized after creation
- No Direct Deletion: Only quotient filters support element removal
- False Positives: All filters may return false positives but never false negatives
- Type Consistency: Hash values must be consistent between creation and testing
- Memory Usage: Larger filters provide better accuracy but use more memory
Performance Characteristics
For detailed performance analysis, see the respective research papers: Quotient Filters, XOR Filters, and Binary Fuse Filters.
Operation | Quotient Filter | XOR Filter | Binary Fuse Filter |
---|---|---|---|
Construction | O(n) | O(n) | O(n) |
Query | O(1) average | O(1) | O(1) |
Space | Variable | ~9.84 bits/element | ~9.1 bits/element |
False Positive Rate | Configurable | ~0.39% | 8-bit: ~1.56%, 16-bit: ~0.39% |
Supports Deletion | β | β | β |
Resizable | β | β | β |