Apache DataSketches
Bring [Apache DataSketches](https://datasketches.apache.org/) into DuckDB. Approximate distinct-counts (HLL, CPC, Theta), quantiles (KLL, TDigest, REQ, classic Quantiles), and frequent items — every sketch serializes to a portable BLOB you can store in a column, ship between processes, and merge later without rescanning.
Install
-- Install the extension
INSTALL datasketches FROM community;
-- Load it into your session
LOAD datasketches;
-- Approximate distinct users with HLL (lg_k = 12 → ~4 KB sketch, ~1.6% error)
SELECT datasketch_hll_estimate(datasketch_hll(12, user_id)) AS distinct_users
FROM events;
-- p50 / p95 / p99 latency from a KLL quantile sketch
WITH agg AS (
SELECT datasketch_kll(200, latency_ms) AS sketch FROM requests
)
SELECT
datasketch_kll_quantile(sketch, 0.50, true) AS p50,
datasketch_kll_quantile(sketch, 0.95, true) AS p95,
datasketch_kll_quantile(sketch, 0.99, true) AS p99
FROM agg; Technical Overview
Why use DataSketches in DuckDB?
Bring Apache DataSketches — the production-grade C++ library for streaming approximate aggregates — into DuckDB SQL. Every sketch serializes to a portable BLOB you can store in a column, ship between processes, and merge later without rescanning the source data. The value prop is billion-row aggregates with bounded memory and merge-friendly state, not exact answers.
🎯 Cardinality — distinct counts that merge
Three sketch families estimate COUNT(DISTINCT x) in fixed memory and let you union per-partition / per-day sketches together later. Pick one based on whether you optimize for serialization speed, on-disk size, or set algebra.
- • HLL — speed and interop: HyperLogLog is the industry-standard distinct-count sketch. Build with
datasketch_hll, estimate withdatasketch_hll_estimate, merge withdatasketch_hll_union. Default choice when you need cross-system compatibility (Druid, Pinot, BigQuery, Spark) or fastest read paths. - • CPC — same accuracy, ~40% less storage: Compressed Probability Counting trades slower serialization for smaller sketches at the same accuracy. Build with
datasketch_cpc, merge withdatasketch_cpc_union. Reach for this when you store millions of per-partition sketches at rest. - • Theta — union, intersect, A-not-B: Theta sketches are the only family that supports set operations beyond merge.
datasketch_theta_union,datasketch_theta_intersect, anddatasketch_theta_a_not_bcompute approximate|A ∪ B|,|A ∩ B|, and|A \ B|directly from sketch state — funnel analysis, retention cohorts, churn — without touching the underlying rows.
📈 Quantiles — p50 / p95 / p99 in fixed memory
Four mergeable quantile sketches. They differ in error behavior and input-type support; the right pick depends on whether you care about the median, the tail, or skewed distributions.
- • KLL — modern default: KLL gives the best balance of accuracy, speed, and size. Build with
datasketch_kll; read percentiles withdatasketch_kll_quantile, the inverse withdatasketch_kll_rank, full distribution withdatasketch_kll_cdf/datasketch_kll_pmf. Start here unless you have a specific reason to pick something else. - • TDigest — tail-accurate: t-digest is most accurate at the tails (p99, p999) — exactly where SLOs live. FLOAT/DOUBLE input only. Build with
datasketch_tdigest, read withdatasketch_tdigest_quantile. - • REQ — relative-error guarantees: Relative-error quantiles — error scales with rank rather than being fixed. Predictable accuracy on highly skewed data. Build with
datasketch_req, read withdatasketch_req_quantile. - • Quantiles — classic, broadly compatible: The original mergeable quantile sketch. Solid and well-tested; KLL supersedes it for new work. Build with
datasketch_quantiles.
🔝 Frequency — top-K heavy hitters
Frequent Items finds the most-frequent items in a stream and returns each candidate with a lower-bound, upper-bound, and estimate. Use for top-N over high-cardinality streams where exact GROUP BY count() ORDER BY is too expensive — error-page sources, abuse detection, top searches, hot keys.
- • Build the sketch:
datasketch_frequent_items(map_size, value)—map_sizeis a power of two; bigger sketches keep more candidates. - • Pull the top-K:
datasketch_frequent_items_get_frequentwith'NO_FALSE_POSITIVES'for items that are definitely heavy, or'NO_FALSE_NEGATIVES'for the union of all candidates that might be. - • Inspect bounds: Per-item
datasketch_frequent_items_estimate,datasketch_frequent_items_lower_bound, anddatasketch_frequent_items_upper_boundgive the confidence interval around each frequency.
🧭 When to reach for sketches vs. exact aggregates
Sketches are approximate by design. The trade-off is bounded memory, mergeable state, and order-of-magnitude faster aggregation against a small, predictable error. They are not a replacement for COUNT(DISTINCT) when you need exact answers.
Reach for sketches when
Input is too large for exact aggregation, you want to persist state and merge it later (per-day → rolling 7d/30d, per-shard → global), or you need cross-system interop (Druid, Pinot, Spark, BigQuery) on the same sketch BLOB.
Stick with exact aggregates when
Your data fits and exact answers are policy (billing, audits, reconciliation). DuckDB's built-in COUNT(DISTINCT) and quantile_cont are vectorized and very fast — measure before reaching for a sketch.
Reach for this extension over DuckDB's built-ins when
DuckDB ships with approx_count_distinct (HLL) and approx_quantile (TDigest), but those expose only the final estimate — not the sketch state. Use this extension when you need to persist the sketch as a column, merge sketches later, tune sketch parameters (lg_k, K, map_size), or interop with the wider Apache DataSketches ecosystem.
Deep Dive
Technical Details
DuckDB ↔ Apache DataSketches
Mergeable, persistable streaming sketches — distinct counts, quantiles, and heavy hitters — from the Apache DataSketches library.
A billion-row distinct count, persisted
A naive COUNT(DISTINCT user_id) over a billion-row event table is a sort plus a hash table — minutes of work, gigabytes of memory, a fresh scan every time someone asks again. With a Theta sketch, it’s a fixed-memory aggregate you build once per partition and roll up on demand:
-- One pass, per-day sketch state, persisted as a typed BLOB column
CREATE TABLE daily_uniques AS
SELECT date_trunc('day', ts) AS day,
datasketch_theta(12, user_id) AS sketch -- ~64 KB per day
FROM events
GROUP BY 1;
-- Rolling 7-day uniques: union the per-day sketches, no rescan
SELECT datasketch_theta_estimate(
datasketch_theta_union(sketch)
) AS uniques_last_7d
FROM daily_uniques
WHERE day >= CURRENT_DATE - 7;
-- Retention: users active in BOTH last week AND this week
WITH last_week AS (SELECT datasketch_theta_union(sketch) AS s
FROM daily_uniques
WHERE day BETWEEN CURRENT_DATE - 14 AND CURRENT_DATE - 7),
this_week AS (SELECT datasketch_theta_union(sketch) AS s
FROM daily_uniques
WHERE day >= CURRENT_DATE - 7)
SELECT datasketch_theta_estimate(
datasketch_theta_intersect(last_week.s, this_week.s)
) AS retained
FROM last_week, this_week;
The first query touches the source rows once. The next two — and any other window, set difference, or intersection you care to ask — operate on the sketch state alone.
Sketches are not a faster way to get exact answers. They give bounded-memory, mergeable, approximately correct aggregates with a small, predictable error. The default HLL sketch (lg_k = 12) is roughly ±1.6% on distinct counts; the default KLL (K = 200) puts quantile error at roughly ±1.5% rank. Pick the size knob to match your accuracy budget — and don’t reach for sketches when policy demands exact (billing, audits, reconciliation).
Architecture
This extension is a thin DuckDB wrapper around the Apache DataSketches C++ library. Every sketch family registers:
- An aggregate (
datasketch_kll,datasketch_hll,datasketch_theta, …) that builds the sketch in one pass over input values. - A typed BLOB column type (
sketch_kll_double,sketch_hll,sketch_theta, …) — a strict superset ofBLOBcarrying the serialized sketch. - Reader scalars (
datasketch_*_quantile,datasketch_*_estimate,datasketch_*_cdf, …) that decode the BLOB and return values without rebuilding from rows. - Mergers (
datasketch_hll_union,datasketch_theta_union, …) — aggregates over a sketch column that produce a rolled-up sketch.
The serialization format is the same one the Java reference implementation writes, which is the format Druid, Pinot, BigQuery, and Spark integrations consume. A sketch built in DuckDB and written to Parquet can be merged inside Druid or vice versa — no re-aggregation across systems.
Choosing a sketch
The right sketch depends on what you’re estimating and which constraint binds first — read latency, sketch size, or set algebra.
Distinct counts (cardinality)
| Family | When to pick | Notes |
|---|---|---|
| HLL | Default — fastest reads, broad cross-system interop | lg_k 4–21; 12 is a good default (~4 KB sketch) |
| CPC | Storage-bound — many sketches at rest | ~40% smaller than HLL at the same accuracy |
| Theta | You need union, intersection, or A-not-B (funnel, retention, churn) | Only family with full set algebra |
Quantiles
| Family | When to pick | Notes |
|---|---|---|
| KLL | Default — best accuracy/size balance | K = 200 is a sensible production starting point |
| TDigest | p99 / p999 SLO reporting | FLOAT/DOUBLE only; tail-accurate |
| REQ | Skewed distributions | Error scales with rank, not fixed |
| Quantiles | Compatibility with classic pipelines | Use KLL for new work |
Frequency
A single family — Frequent Items — for top-K heavy hitters with confidence-bounded frequencies. Use 'NO_FALSE_POSITIVES' to get items that are definitely heavy, 'NO_FALSE_NEGATIVES' to get every candidate that might be.
Compared to alternatives
- DuckDB’s built-in
approx_count_distinctandapprox_quantile— both ship in DuckDB and use HLL / TDigest internally, but expose only the final estimate. Reach for this extension when you need the sketch as state — to persist it as a column, merge later, tune sketch parameters (lg_k,K,map_size), or interoperate with other Apache DataSketches consumers. - Exact aggregates (
COUNT(DISTINCT),quantile_cont,mode) — the right answer when policy demands exact and your data fits. DuckDB’s vectorized exact aggregates are fast at most analytical scales; measure before reaching for an approximate sketch. - Reservoir / streaming aggregates rolled by hand — sketches give you the same bounded-memory streaming property plus mathematically grounded error bounds plus mergeability across processes. For anything that needs to survive a process restart or roll up across shards, the sketch is the better primitive.
- Druid / Pinot / BigQuery sketch columns — those systems use the same serialization format. If you’re already producing sketches there, this extension lets DuckDB read and merge them; if you’re building them in DuckDB, downstream OLAP systems can consume the BLOBs directly.
Persistence and portability
Every sketch round-trips through a typed BLOB column (sketch_kll_double, sketch_hll, sketch_theta, etc.) that’s a strict superset of BLOB. Sketches written into a Parquet file can be read back into any DuckDB instance with this extension loaded, or into any other DataSketches-aware system, and merged without a re-scan. Aggregate functions accept either raw input values or an existing sketch column, so incremental rollup is just another SELECT.
Reference
Extension Contents
Quick reference to all available functions and settings organized by category.
| Name | Description | |
|---|---|---|
| CPC Compressed Probability Counting — distinct-counting at roughly 40% the size of HLL at the same accuracy, traded against slower serialization. Choose this when you store many sketches at rest. | ||
| datasketch_cpc() | Aggregate input values into a Compressed Probability Counting sketch — a distinct-count sketch that's roughly 40% smaller than HLL at the same accuracy, traded against slower serialization | |
| datasketch_cpc_describe() | Return a string representation of the sketch | |
| datasketch_cpc_estimate() | Read the estimated distinct count from a CPC sketch | |
| datasketch_cpc_is_empty() | Return a boolean indicating if the sketch is empty | |
| datasketch_cpc_lower_bound() | Return the lower bound of the number of distinct items seen by the sketch | |
| datasketch_cpc_union() | Merge a column of sketch_cpc BLOBs into one rollup sketch — the standard per-partition / per-day rollup pattern | |
| datasketch_cpc_upper_bound() | Return the upper bound of the number of distinct items seen by the sketch | |
| Frequent Items Heavy-hitter sketch that identifies the most-frequent items in a stream along with confidence-bounded frequency estimates. Use for top-N analysis on high-cardinality streams where exact | ||
| datasketch_frequent_items() | Aggregate input values into a Frequent Items (heavy-hitter) sketch | |
| datasketch_frequent_items_epsilon() | Returns the epsilon value (relative error) of the sketch | |
| datasketch_frequent_items_estimate() | Estimated frequency for a specific item | |
| datasketch_frequent_items_get_frequent() | Return the heavy-hitter candidates with per-item estimate, lower bound, and upper bound | |
| datasketch_frequent_items_is_empty() | Returns true if the sketch is empty | |
| datasketch_frequent_items_lower_bound() | Returns the lower bound frequency estimate for a specific item | |
| datasketch_frequent_items_num_active() | Returns the number of active items currently tracked by the sketch | |
| datasketch_frequent_items_total_weight() | Returns the total weight (sum of all item counts) processed by the sketch | |
| datasketch_frequent_items_upper_bound() | Returns the upper bound frequency estimate for a specific item | |
| HLL HyperLogLog distinct-counting sketch — the industry standard. Fast serialize/deserialize and broad cross-system compatibility. Choose this when speed and interop matter more than storage. | ||
| datasketch_hll() | Aggregate input values into a HyperLogLog sketch for distinct counting | |
| datasketch_hll_describe() | Return a string representation of the sketch | |
| datasketch_hll_estimate() | Read the estimated distinct count from an HLL sketch | |
| datasketch_hll_is_compact() | Return whether the sketch is in compact form | |
| datasketch_hll_is_empty() | Return a boolean indicating if the sketch is empty | |
| datasketch_hll_lg_config_k() | Return the value of log base 2 K for this sketch | |
| datasketch_hll_lower_bound() | Lower bound of the HLL distinct-count estimate at a given number of standard deviations | |
| datasketch_hll_union() | Merge multiple HLL sketches into one | |
| datasketch_hll_upper_bound() | Upper bound of the HLL distinct-count estimate at a given number of standard deviations | |
| KLL KLL quantile sketch — modern mergeable quantile estimator. Best balance of accuracy, speed, and size for general-purpose quantile work. Default choice unless you have a specific reason to pick TDigest or REQ. | ||
| datasketch_kll() | Aggregate input values into a KLL quantile sketch | |
| datasketch_kll_cdf() | CDF over a list of split points — one call returns the cumulative rank at each | |
| datasketch_kll_describe() | Return a description of this sketch | |
| datasketch_kll_is_empty() | Return a boolean indicating if the sketch is empty | |
| datasketch_kll_is_estimation_mode() | Return a boolean indicating if the sketch is in estimation mode | |
| datasketch_kll_k() | Return the value of K for this sketch | |
| datasketch_kll_max_item() | Return the maxium item in the sketch | |
| datasketch_kll_min_item() | Return the minimum item in the sketch | |
| datasketch_kll_n() | Return the number of items contained in the sketch | |
| datasketch_kll_normalized_rank_error() | Return the normalized rank error of the sketch | |
| datasketch_kll_num_retained() | Return the number of retained items in the sketch | |
| datasketch_kll_pmf() | PMF (probability mass) over a list of split points — fraction of the distribution falling in each bucket | |
| datasketch_kll_quantile() | Approximate quantile at a given rank — given a sketch and r ∈ [0, 1], returns the value at that rank in the sorted distribution | |
| datasketch_kll_rank() | Inverse of datasketch_kll_quantile — given a value, return its approximate rank r ∈ [0, 1] in the sorted distribution | |
| Quantiles Classic mergeable quantile sketch from the original DataSketches paper. Solid general-purpose choice supporting all numeric types; KLL is now preferred for new projects. | ||
| datasketch_quantiles() | Aggregate input values into the classic mergeable quantiles sketch from the original DataSketches paper | |
| datasketch_quantiles_cdf() | Return the Cumulative Distribution Function (CDF) of the sketch for a series of points | |
| datasketch_quantiles_describe() | Return a description of this sketch | |
| datasketch_quantiles_is_empty() | Return a boolean indicating if the sketch is empty | |
| datasketch_quantiles_is_estimation_mode() | Return a boolean indicating if the sketch is in estimation mode | |
| datasketch_quantiles_k() | Return the value of K for this sketch | |
| datasketch_quantiles_max_item() | Return the maxium item in the sketch | |
| datasketch_quantiles_min_item() | Return the minimum item in the sketch | |
| datasketch_quantiles_n() | Return the number of items contained in the sketch | |
| datasketch_quantiles_normalized_rank_error() | Return the normalized rank error of the sketch | |
| datasketch_quantiles_num_retained() | Return the number of retained items in the sketch | |
| datasketch_quantiles_pmf() | Return the Probability Mass Function (PMF) of the sketch for a series of points | |
| datasketch_quantiles_quantile() | Approximate quantile at a given rank from a classic Quantiles sketch | |
| datasketch_quantiles_rank() | Approximate rank of a value within the classic Quantiles sketch | |
| REQ Relative-error quantile sketch. Error scales with rank rather than being fixed across the distribution — predictable accuracy on highly skewed data. | ||
| datasketch_req() | Aggregate input values into a Relative Error Quantile sketch | |
| datasketch_req_cdf() | Return the Cumulative Distribution Function (CDF) of the sketch for a series of points | |
| datasketch_req_describe() | Return a description of this sketch | |
| datasketch_req_is_empty() | Return a boolean indicating if the sketch is empty | |
| datasketch_req_is_estimation_mode() | Return a boolean indicating if the sketch is in estimation mode | |
| datasketch_req_k() | Return the value of K for this sketch | |
| datasketch_req_max_item() | Return the maxium item in the sketch | |
| datasketch_req_min_item() | Return the minimum item in the sketch | |
| datasketch_req_n() | Return the number of items contained in the sketch | |
| datasketch_req_num_retained() | Return the number of retained items in the sketch | |
| datasketch_req_pmf() | Return the Probability Mass Function (PMF) of the sketch for a series of points | |
| datasketch_req_quantile() | Approximate quantile at a given rank from a REQ sketch | |
| datasketch_req_rank() | Approximate rank of a value within the REQ sketch | |
| TDigest t-digest — quantile sketch optimized for the tails (p99, p999). Most accurate where SLOs live; FLOAT/DOUBLE input only. Use when tail latencies or outliers matter more than median accuracy. | ||
| datasketch_tdigest() | Aggregate input values into a t-digest quantile sketch — most accurate at the tails (p99, p999), exactly where SLOs live | |
| datasketch_tdigest_cdf() | Return the Cumulative Distribution Function (CDF) of the sketch for a series of points | |
| datasketch_tdigest_describe() | Return a description of this sketch | |
| datasketch_tdigest_is_empty() | Return a boolean indicating if the sketch is empty | |
| datasketch_tdigest_k() | Return the value of K for this sketch | |
| datasketch_tdigest_pmf() | Return the Probability Mass Function (PMF) of the sketch for a series of points | |
| datasketch_tdigest_quantile() | Approximate quantile at a given rank from a t-digest sketch | |
| datasketch_tdigest_rank() | Approximate rank of a value within the t-digest sketch | |
| datasketch_tdigest_total_weight() | Return the total weight of this sketch | |
| Theta Theta sketch — the only distinct-count family that supports set operations beyond union. Combine cohorts with | ||
| datasketch_theta() | Aggregate input values into a Theta sketch — the distinct-count family that supports set operations (union, intersect, A-not-B) beyond simple merge | |
| datasketch_theta_a_not_b() | Approximate |A \ B| — distinct items present in A but not in B | |
| datasketch_theta_describe() | Returns a human-readable description of the Theta sketch | |
| datasketch_theta_estimate() | Read the estimated distinct count from a Theta sketch | |
| datasketch_theta_get_seed() | Returns the seed hash used by the sketch | |
| datasketch_theta_get_theta() | Returns the theta value of the sketch (sampling probability) | |
| datasketch_theta_intersect() | Approximate |A ∩ B| — distinct items present in both cohorts | |
| datasketch_theta_is_empty() | Returns true if the Theta sketch is empty | |
| datasketch_theta_is_estimation_mode() | Returns true if the sketch is in estimation mode (has exceeded exact counting capacity) | |
| datasketch_theta_lower_bound() | Returns the lower bound estimate at the given number of standard deviations (1, 2, or 3) | |
| datasketch_theta_num_retained() | Returns the number of hash values retained in the sketch | |
| datasketch_theta_union() | Approximate |A ∪ B| — distinct count across the union of two cohorts | |
| datasketch_theta_upper_bound() | Returns the upper bound estimate at the given number of standard deviations (1, 2, or 3) | |
API Reference
Function Documentation
Detailed documentation for each function including signatures, parameters, and examples.
datasketch_cpc
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | INTEGER | Positional | |
col1 | FLOAT | BLOB | DOUBLE | … 12 concrete typesBIGINTBLOBDOUBLEFLOATINTEGERSMALLINTTINYINTUBIGINTUINTEGERUSMALLINTUTINYINTVARCHAR | Positional |
Returns
Description
Aggregate input values into a Compressed Probability Counting sketch — a distinct-count sketch that's roughly 40% smaller than HLL at the same accuracy, traded against slower serialization. Choose CPC when you store many sketches at rest and reads are infrequent. The leading lg_k controls the size/accuracy trade-off the same way it does for HLL.
Examples
Distinct-count with CPC
SELECT datasketch_cpc_estimate(datasketch_cpc(12, user_id)) AS distinct_users
FROM events; Related Functions
- datasketch_cpc_estimate() — Read the estimated distinct count from a CPC sketch
- datasketch_cpc_union() — Merge a column of `sketch_cpc` BLOBs into one rollup sketch — the standard per-partition / per-day rollup pattern
- datasketch_cpc_lower_bound() — Return the lower bound of the number of distinct items seen by the sketch
- datasketch_cpc_upper_bound() — Return the upper bound of the number of distinct items seen by the sketch
datasketch_cpc_describe
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | sketch_cpc | Positional |
Returns
Description
Return a string representation of the sketch
Examples
SELECT datasketch_cpc_describe(sketch); datasketch_cpc_estimate
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | sketch_cpc | Positional |
Returns
Description
Read the estimated distinct count from a CPC sketch.
Examples
SELECT datasketch_cpc_estimate(sketch); Related Functions
- datasketch_cpc() — Aggregate input values into a Compressed Probability Counting sketch — a distinct-count sketch that's roughly 40% smaller than HLL at the same accuracy, traded against slower serialization
- datasketch_cpc_union() — Merge a column of `sketch_cpc` BLOBs into one rollup sketch — the standard per-partition / per-day rollup pattern
datasketch_cpc_is_empty
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | sketch_cpc | Positional |
Returns
Description
Return a boolean indicating if the sketch is empty
Examples
SELECT datasketch_cpc_is_empty(sketch); datasketch_cpc_lower_bound
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | sketch_cpc | Positional | |
col1 | UTINYINT | Positional |
Returns
Description
Return the lower bound of the number of distinct items seen by the sketch
Examples
SELECT datasketch_cpc_lower_bound(sketch, std_dev); Related Functions
- datasketch_cpc() — Aggregate input values into a Compressed Probability Counting sketch — a distinct-count sketch that's roughly 40% smaller than HLL at the same accuracy, traded against slower serialization
- datasketch_cpc_upper_bound() — Return the upper bound of the number of distinct items seen by the sketch
datasketch_cpc_union
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | INTEGER | Positional | |
col1 | sketch_cpc | Positional |
Returns
Description
Merge a column of sketch_cpc BLOBs into one rollup sketch — the standard per-partition / per-day rollup pattern.
Examples
SELECT datasketch_cpc_union(k, data); Related Functions
datasketch_cpc_upper_bound
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | sketch_cpc | Positional | |
col1 | UTINYINT | Positional |
Returns
Description
Return the upper bound of the number of distinct items seen by the sketch
Examples
SELECT datasketch_cpc_upper_bound(sketch, std_dev); Related Functions
- datasketch_cpc() — Aggregate input values into a Compressed Probability Counting sketch — a distinct-count sketch that's roughly 40% smaller than HLL at the same accuracy, traded against slower serialization
- datasketch_cpc_lower_bound() — Return the lower bound of the number of distinct items seen by the sketch
datasketch_frequent_items
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | FLOAT | UINTEGER | SMALLINT | … 12 concrete typesBIGINTDOUBLEFLOATINTEGERSMALLINTTINYINTUBIGINTUINTEGERUSMALLINTUTINYINTVARCHARsketch_frequent_items | Positional |
Returns
Description
Aggregate input values into a Frequent Items (heavy-hitter) sketch. The first argument is lg_max_map_size — log₂ of the maximum number of tracked items, so 8 → up to 256 candidates, 12 → up to 4096. Bigger values track more candidates and tighten the per-item error bound. Typical values run 4–12.
Examples
Top countries by page views
SELECT datasketch_frequent_items_get_frequent(
datasketch_frequent_items(8, country_code),
'NO_FALSE_POSITIVES'
) AS heavy_hitters
FROM page_views; Related Functions
- datasketch_frequent_items_get_frequent() — Return the heavy-hitter candidates with per-item estimate, lower bound, and upper bound
- datasketch_frequent_items_estimate() — Estimated frequency for a specific item
- datasketch_frequent_items_lower_bound() — Returns the lower bound frequency estimate for a specific item
- datasketch_frequent_items_upper_bound() — Returns the upper bound frequency estimate for a specific item
datasketch_frequent_items_epsilon
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | sketch_frequent_items | Positional |
Returns
Description
Returns the epsilon value (relative error) of the sketch
Examples
SELECT datasketch_frequent_items_epsilon(sketch); Related Functions
datasketch_frequent_items_estimate
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | sketch_frequent_items | Positional | |
col1 | UBIGINT | BIGINT | SMALLINT | … 11 concrete typesBIGINTDOUBLEFLOATINTEGERSMALLINTTINYINTUBIGINTUINTEGERUSMALLINTUTINYINTVARCHAR | Positional |
Returns
Description
Estimated frequency for a specific item.
Examples
SELECT datasketch_frequent_items_estimate(sketch, 'item'); Related Functions
datasketch_frequent_items_get_frequent
Return the heavy-hitter candidates with per-item estimate, lower bound, and upper bound.
datasketch_frequent_items_is_empty
Returns true if the sketch is empty
datasketch_frequent_items_lower_bound
Returns the lower bound frequency estimate for a specific item
datasketch_frequent_items_num_active
Returns the number of active items currently tracked by the sketch
datasketch_frequent_items_total_weight
Returns the total weight (sum of all item counts) processed by the sketch
datasketch_frequent_items_upper_bound
Returns the upper bound frequency estimate for a specific item
datasketch_hll
Aggregate input values into a HyperLogLog sketch for distinct counting.
datasketch_hll_describe
Return a string representation of the sketch
datasketch_hll_estimate
Read the estimated distinct count from an HLL sketch.
datasketch_hll_is_compact
Return whether the sketch is in compact form
datasketch_hll_is_empty
Return a boolean indicating if the sketch is empty
datasketch_hll_lg_config_k
Return the value of log base 2 K for this sketch
datasketch_hll_lower_bound
Lower bound of the HLL distinct-count estimate at a given number of standard deviations.
datasketch_hll_union
Merge multiple HLL sketches into one.
datasketch_hll_upper_bound
Upper bound of the HLL distinct-count estimate at a given number of standard deviations.
datasketch_kll
Aggregate input values into a KLL quantile sketch.
datasketch_kll_cdf
CDF over a list of split points — one call returns the cumulative rank at each.
datasketch_kll_describe
Return a description of this sketch
datasketch_kll_is_empty
Return a boolean indicating if the sketch is empty
datasketch_kll_is_estimation_mode
Return a boolean indicating if the sketch is in estimation mode
datasketch_kll_k
Return the value of K for this sketch
datasketch_kll_max_item
Return the maxium item in the sketch
datasketch_kll_min_item
Return the minimum item in the sketch
datasketch_kll_n
Return the number of items contained in the sketch
datasketch_kll_normalized_rank_error
Return the normalized rank error of the sketch
datasketch_kll_num_retained
Return the number of retained items in the sketch
datasketch_kll_pmf
PMF (probability mass) over a list of split points — fraction of the distribution falling in each bucket.
datasketch_kll_quantile
Approximate quantile at a given rank — given a sketch and `r ∈ [0, 1]`, returns the value at that rank in the sorted distribution.
datasketch_kll_rank
Inverse of `datasketch_kll_quantile` — given a value, return its approximate rank `r ∈ [0, 1]` in the sorted distribution.
datasketch_quantiles
Aggregate input values into the classic mergeable quantiles sketch from the original DataSketches paper.
datasketch_quantiles_cdf
Return the Cumulative Distribution Function (CDF) of the sketch for a series of points
datasketch_quantiles_describe
Return a description of this sketch
datasketch_quantiles_is_empty
Return a boolean indicating if the sketch is empty
datasketch_quantiles_is_estimation_mode
Return a boolean indicating if the sketch is in estimation mode
datasketch_quantiles_k
Return the value of K for this sketch
datasketch_quantiles_max_item
Return the maxium item in the sketch
datasketch_quantiles_min_item
Return the minimum item in the sketch
datasketch_quantiles_n
Return the number of items contained in the sketch
datasketch_quantiles_normalized_rank_error
Return the normalized rank error of the sketch
datasketch_quantiles_num_retained
Return the number of retained items in the sketch
datasketch_quantiles_pmf
Return the Probability Mass Function (PMF) of the sketch for a series of points
datasketch_quantiles_quantile
Approximate quantile at a given rank from a classic Quantiles sketch.
datasketch_quantiles_rank
Approximate rank of a value within the classic Quantiles sketch.
datasketch_req
Aggregate input values into a Relative Error Quantile sketch.
datasketch_req_cdf
Return the Cumulative Distribution Function (CDF) of the sketch for a series of points
datasketch_req_describe
Return a description of this sketch
datasketch_req_is_empty
Return a boolean indicating if the sketch is empty
datasketch_req_is_estimation_mode
Return a boolean indicating if the sketch is in estimation mode
datasketch_req_k
Return the value of K for this sketch
datasketch_req_max_item
Return the maxium item in the sketch
datasketch_req_min_item
Return the minimum item in the sketch
datasketch_req_n
Return the number of items contained in the sketch
datasketch_req_num_retained
Return the number of retained items in the sketch
datasketch_req_pmf
Return the Probability Mass Function (PMF) of the sketch for a series of points
datasketch_req_quantile
Approximate quantile at a given rank from a REQ sketch.
datasketch_req_rank
Approximate rank of a value within the REQ sketch.
datasketch_tdigest
Aggregate input values into a t-digest quantile sketch — most accurate at the tails (p99, p999), exactly where SLOs live.
datasketch_tdigest_cdf
Return the Cumulative Distribution Function (CDF) of the sketch for a series of points
datasketch_tdigest_describe
Return a description of this sketch
datasketch_tdigest_is_empty
Return a boolean indicating if the sketch is empty
datasketch_tdigest_k
Return the value of K for this sketch
datasketch_tdigest_pmf
Return the Probability Mass Function (PMF) of the sketch for a series of points
datasketch_tdigest_quantile
Approximate quantile at a given rank from a t-digest sketch.
datasketch_tdigest_rank
Approximate rank of a value within the t-digest sketch.
datasketch_tdigest_total_weight
Return the total weight of this sketch
datasketch_theta
Aggregate input values into a Theta sketch — the distinct-count family that supports set operations (union, intersect, A-not-B) beyond simple merge.
datasketch_theta_a_not_b
Approximate `|A \ B|` — distinct items present in A but not in B.
datasketch_theta_describe
Returns a human-readable description of the Theta sketch
datasketch_theta_estimate
Read the estimated distinct count from a Theta sketch.
datasketch_theta_get_seed
Returns the seed hash used by the sketch
datasketch_theta_get_theta
Returns the theta value of the sketch (sampling probability)
datasketch_theta_intersect
Approximate `|A ∩ B|` — distinct items present in both cohorts.
datasketch_theta_is_empty
Returns true if the Theta sketch is empty
datasketch_theta_is_estimation_mode
Returns true if the sketch is in estimation mode (has exceeded exact counting capacity)
datasketch_theta_lower_bound
Returns the lower bound estimate at the given number of standard deviations (1, 2, or 3)
datasketch_theta_num_retained
Returns the number of hash values retained in the sketch
datasketch_theta_union
Approximate `|A ∪ B|` — distinct count across the union of two cohorts.
datasketch_theta_upper_bound
Returns the upper bound estimate at the given number of standard deviations (1, 2, or 3)
Practical Examples
Cookbook
Real-world recipes and patterns for common use cases.
Recipes for distinct counts, quantiles, and heavy hitters with Apache DataSketches.
Distinct count with HLL
The bread-and-butter sketch — approximate COUNT(DISTINCT) in a few KB of memory.
SELECT datasketch_hll_estimate(datasketch_hll(12, user_id)) AS distinct_users
FROM events;
The 12 is lg_k — base-2 log of the bucket count. lg_k = 12 (4096 buckets, ~4 KB sketch, ~1.6% standard error) is a common default. Bracket the estimate with datasketch_hll_lower_bound / datasketch_hll_upper_bound when you want a confidence interval.
Per-day HLL sketches, rolled up on demand
Build the sketch state once; query any window without rescanning.
CREATE TABLE daily_uniques (day DATE, hll sketch_hll);
INSERT INTO daily_uniques
SELECT date_trunc('day', ts) AS day,
datasketch_hll(12, user_id) AS hll
FROM events
GROUP BY 1;
-- Rolling 7-day uniques — reads sketch state, not source events
SELECT datasketch_hll_estimate(datasketch_hll_union(12, hll)) AS uniques_last_7d
FROM daily_uniques
WHERE day >= CURRENT_DATE - 7;
The same pattern works for datasketch_cpc (smaller on disk, slower to write) and datasketch_theta (supports set operations beyond union).
Set operations with Theta — funnel, retention, churn
Theta is the only distinct-count family with full set algebra.
WITH a AS (SELECT datasketch_theta(12, user_id) AS s FROM events_jan),
b AS (SELECT datasketch_theta(12, user_id) AS s FROM events_feb)
SELECT
datasketch_theta_estimate(datasketch_theta_union(a.s, b.s)) AS jan_or_feb,
datasketch_theta_estimate(datasketch_theta_intersect(a.s, b.s)) AS retained, -- in both
datasketch_theta_estimate(datasketch_theta_a_not_b(a.s, b.s)) AS churned, -- jan not feb
datasketch_theta_estimate(datasketch_theta_a_not_b(b.s, a.s)) AS new_users -- feb not jan
FROM a, b;
This is the headline use case for Theta — funnel and cohort analysis directly off sketch state.
p50 / p95 / p99 with KLL
KLL is the recommended general-purpose quantile sketch.
WITH agg AS (
SELECT datasketch_kll(200, latency_ms) AS sketch FROM requests
)
SELECT
datasketch_kll_quantile(sketch, 0.50, true) AS p50,
datasketch_kll_quantile(sketch, 0.95, true) AS p95,
datasketch_kll_quantile(sketch, 0.99, true) AS p99
FROM agg;
K = 200 is a sensible production starting point. Higher K → smaller error, larger sketch. The trailing true is the inclusive flag — true for inclusive search (P[X ≤ q]), false for exclusive (P[X < q]). Use datasketch_kll_normalized_rank_error to inspect the actual rank-error guarantee for the configured K.
Histograms in one query — KLL CDF / PMF
Get the cumulative distribution at any set of split points without a second pass.
SELECT datasketch_kll_cdf(
datasketch_kll(200, latency_ms),
[10, 50, 100, 250, 500, 1000],
/* inclusive */ true
) AS cdf
FROM requests;
Returns one row of cumulative ranks at each split point — directly usable for histograms. Swap cdf for datasketch_kll_pmf to get per-bucket probability mass instead.
Tail-accurate quantiles with TDigest
When p99 / p999 matters more than the median (SLO reporting), reach for t-digest.
CREATE TABLE readings(temp DOUBLE);
INSERT INTO readings(temp) SELECT unnest(generate_series(1, 10))::DOUBLE;
-- Rank of value 5 within the aggregated sketch
SELECT datasketch_tdigest_rank(datasketch_tdigest(100, temp), 5) AS rank_of_5
FROM readings;
Output
| rank_of_5 |
|---|
| 0.45 |
CDF over multiple split points in one call:
SELECT datasketch_tdigest_cdf(datasketch_tdigest(100, temp), [1, 5, 9]) AS cdf
FROM readings;
Output
| cdf |
|---|
| [0.0357, 0.3214, 0.6071, 1.0] |
TDigest accepts FLOAT/DOUBLE only. For integer or skewed-rank quantile work, see datasketch_kll or datasketch_req.
Top-K heavy hitters with Frequent Items
Find the most frequent items in a stream — with confidence bounds on each estimate.
SELECT datasketch_frequent_items_get_frequent(
datasketch_frequent_items(8, country_code),
'NO_FALSE_POSITIVES'
) AS heavy_hitters
FROM page_views;
The result is a list of structs — (item, estimate, lower_bound, upper_bound). Use 'NO_FALSE_POSITIVES' for items that are definitely heavy hitters, 'NO_FALSE_NEGATIVES' for the union of all candidates that might be. The leading 8 is lg_max_map_size — log₂ of the maximum number of tracked items (so 8 → up to 256 candidates). Bigger values track more candidates and tighten the per-item error bound; typical values run 4–12.
Skew-aware quantiles with REQ
When the distribution is heavily skewed and you want predictable accuracy across the whole rank space, REQ scales its error with rank rather than fixing it.
SELECT
datasketch_req_quantile(datasketch_req(12, value), 0.001, true) AS p001,
datasketch_req_quantile(datasketch_req(12, value), 0.5, true) AS p50,
datasketch_req_quantile(datasketch_req(12, value), 0.999, true) AS p999
FROM measurements;
Persist sketches to Parquet, merge in another DuckDB
Sketch BLOBs are portable across processes and across DuckDB instances:
COPY (SELECT day, hll FROM daily_uniques)
TO 'daily_uniques.parquet' (FORMAT PARQUET);
-- Later, anywhere with this extension loaded:
SELECT datasketch_hll_estimate(datasketch_hll_union(12, hll))
FROM read_parquet('daily_uniques.parquet')
WHERE day BETWEEN DATE '2026-01-01' AND DATE '2026-01-31';
The serialization format is the same one used by Druid, Pinot, BigQuery, and the Apache DataSketches Java library — sketches built here can be merged in those systems and vice versa.
Platform Support
Compatibility
Extension availability may vary by platform and DuckDB version. Check below to ensure this extension supports your environment before installation.
Quick Facts
Platforms
Supported platform architectures
Compiled binary sizes
| Platform | Architecture | Size |
|---|---|---|
| Linux | aarch64 | 3.64 MB |
| Linux | x86_64 | 4.15 MB |
| Linux (musl) | x86_64 | 4.08 MB |
| macOS | Apple Silicon | 2.81 MB |
| macOS | Intel | 3.19 MB |
| Windows | x86_64 | 8.24 MB |
| WASM | eh | 273.1 KB |
| WASM | mvp | 266.8 KB |
| WASM | threads | 265.7 KB |
Gzipped download size from the DuckDB community-extensions registry.
Billion-row aggregates with bounded memory
Install the DataSketches extension to compute distinct counts, quantiles, and heavy hitters as mergeable sketch state — persistable, shippable, and interoperable with the broader Apache DataSketches ecosystem.