Apache DataSketches icon

Apache DataSketches

Bring [Apache DataSketches](https://datasketches.apache.org/) into DuckDB. Approximate distinct-counts (HLL, CPC, Theta), quantiles (KLL, TDigest, REQ, classic Quantiles), and frequent items — every sketch serializes to a portable BLOB you can store in a column, ship between processes, and merge later without rescanning.

Install

-- Install the extension
INSTALL datasketches FROM community;

-- Load it into your session
LOAD datasketches;

-- Approximate distinct users with HLL (lg_k = 12 → ~4 KB sketch, ~1.6% error)
SELECT datasketch_hll_estimate(datasketch_hll(12, user_id)) AS distinct_users
FROM events;

-- p50 / p95 / p99 latency from a KLL quantile sketch
WITH agg AS (
  SELECT datasketch_kll(200, latency_ms) AS sketch FROM requests
)
SELECT
  datasketch_kll_quantile(sketch, 0.50, true) AS p50,
  datasketch_kll_quantile(sketch, 0.95, true) AS p95,
  datasketch_kll_quantile(sketch, 0.99, true) AS p99
FROM agg;

Technical Overview

Why use DataSketches in DuckDB?

Bring Apache DataSketches — the production-grade C++ library for streaming approximate aggregates — into DuckDB SQL. Every sketch serializes to a portable BLOB you can store in a column, ship between processes, and merge later without rescanning the source data. The value prop is billion-row aggregates with bounded memory and merge-friendly state, not exact answers.

🎯 Cardinality — distinct counts that merge

Three sketch families estimate COUNT(DISTINCT x) in fixed memory and let you union per-partition / per-day sketches together later. Pick one based on whether you optimize for serialization speed, on-disk size, or set algebra.

  • HLL — speed and interop: HyperLogLog is the industry-standard distinct-count sketch. Build with datasketch_hll, estimate with datasketch_hll_estimate, merge with datasketch_hll_union. Default choice when you need cross-system compatibility (Druid, Pinot, BigQuery, Spark) or fastest read paths.
  • CPC — same accuracy, ~40% less storage: Compressed Probability Counting trades slower serialization for smaller sketches at the same accuracy. Build with datasketch_cpc, merge with datasketch_cpc_union. Reach for this when you store millions of per-partition sketches at rest.
  • Theta — union, intersect, A-not-B: Theta sketches are the only family that supports set operations beyond merge. datasketch_theta_union, datasketch_theta_intersect, and datasketch_theta_a_not_b compute approximate |A ∪ B|, |A ∩ B|, and |A \ B| directly from sketch state — funnel analysis, retention cohorts, churn — without touching the underlying rows.

📈 Quantiles — p50 / p95 / p99 in fixed memory

Four mergeable quantile sketches. They differ in error behavior and input-type support; the right pick depends on whether you care about the median, the tail, or skewed distributions.

  • KLL — modern default: KLL gives the best balance of accuracy, speed, and size. Build with datasketch_kll; read percentiles with datasketch_kll_quantile, the inverse with datasketch_kll_rank, full distribution with datasketch_kll_cdf / datasketch_kll_pmf. Start here unless you have a specific reason to pick something else.
  • TDigest — tail-accurate: t-digest is most accurate at the tails (p99, p999) — exactly where SLOs live. FLOAT/DOUBLE input only. Build with datasketch_tdigest, read with datasketch_tdigest_quantile.
  • REQ — relative-error guarantees: Relative-error quantiles — error scales with rank rather than being fixed. Predictable accuracy on highly skewed data. Build with datasketch_req, read with datasketch_req_quantile.
  • Quantiles — classic, broadly compatible: The original mergeable quantile sketch. Solid and well-tested; KLL supersedes it for new work. Build with datasketch_quantiles.

🔝 Frequency — top-K heavy hitters

Frequent Items finds the most-frequent items in a stream and returns each candidate with a lower-bound, upper-bound, and estimate. Use for top-N over high-cardinality streams where exact GROUP BY count() ORDER BY is too expensive — error-page sources, abuse detection, top searches, hot keys.

  • Build the sketch: datasketch_frequent_items(map_size, value)map_size is a power of two; bigger sketches keep more candidates.
  • Pull the top-K: datasketch_frequent_items_get_frequent with 'NO_FALSE_POSITIVES' for items that are definitely heavy, or 'NO_FALSE_NEGATIVES' for the union of all candidates that might be.
  • Inspect bounds: Per-item datasketch_frequent_items_estimate, datasketch_frequent_items_lower_bound, and datasketch_frequent_items_upper_bound give the confidence interval around each frequency.

🧭 When to reach for sketches vs. exact aggregates

Sketches are approximate by design. The trade-off is bounded memory, mergeable state, and order-of-magnitude faster aggregation against a small, predictable error. They are not a replacement for COUNT(DISTINCT) when you need exact answers.

Reach for sketches when

Input is too large for exact aggregation, you want to persist state and merge it later (per-day → rolling 7d/30d, per-shard → global), or you need cross-system interop (Druid, Pinot, Spark, BigQuery) on the same sketch BLOB.

Stick with exact aggregates when

Your data fits and exact answers are policy (billing, audits, reconciliation). DuckDB's built-in COUNT(DISTINCT) and quantile_cont are vectorized and very fast — measure before reaching for a sketch.

Reach for this extension over DuckDB's built-ins when

DuckDB ships with approx_count_distinct (HLL) and approx_quantile (TDigest), but those expose only the final estimate — not the sketch state. Use this extension when you need to persist the sketch as a column, merge sketches later, tune sketch parameters (lg_k, K, map_size), or interop with the wider Apache DataSketches ecosystem.

Deep Dive

Technical Details

DuckDB ↔ Apache DataSketches

Mergeable, persistable streaming sketches — distinct counts, quantiles, and heavy hitters — from the Apache DataSketches library.

A billion-row distinct count, persisted

A naive COUNT(DISTINCT user_id) over a billion-row event table is a sort plus a hash table — minutes of work, gigabytes of memory, a fresh scan every time someone asks again. With a Theta sketch, it’s a fixed-memory aggregate you build once per partition and roll up on demand:

-- One pass, per-day sketch state, persisted as a typed BLOB column
CREATE TABLE daily_uniques AS
SELECT date_trunc('day', ts)        AS day,
       datasketch_theta(12, user_id) AS sketch        -- ~64 KB per day
FROM events
GROUP BY 1;

-- Rolling 7-day uniques: union the per-day sketches, no rescan
SELECT datasketch_theta_estimate(
         datasketch_theta_union(sketch)
       ) AS uniques_last_7d
FROM daily_uniques
WHERE day >= CURRENT_DATE - 7;

-- Retention: users active in BOTH last week AND this week
WITH last_week AS (SELECT datasketch_theta_union(sketch) AS s
                   FROM daily_uniques
                   WHERE day BETWEEN CURRENT_DATE - 14 AND CURRENT_DATE - 7),
     this_week AS (SELECT datasketch_theta_union(sketch) AS s
                   FROM daily_uniques
                   WHERE day >= CURRENT_DATE - 7)
SELECT datasketch_theta_estimate(
         datasketch_theta_intersect(last_week.s, this_week.s)
       ) AS retained
FROM last_week, this_week;

The first query touches the source rows once. The next two — and any other window, set difference, or intersection you care to ask — operate on the sketch state alone.

Approximate by design — that's the value prop

Sketches are not a faster way to get exact answers. They give bounded-memory, mergeable, approximately correct aggregates with a small, predictable error. The default HLL sketch (lg_k = 12) is roughly ±1.6% on distinct counts; the default KLL (K = 200) puts quantile error at roughly ±1.5% rank. Pick the size knob to match your accuracy budget — and don’t reach for sketches when policy demands exact (billing, audits, reconciliation).

Architecture

This extension is a thin DuckDB wrapper around the Apache DataSketches C++ library. Every sketch family registers:

  • An aggregate (datasketch_kll, datasketch_hll, datasketch_theta, …) that builds the sketch in one pass over input values.
  • A typed BLOB column type (sketch_kll_double, sketch_hll, sketch_theta, …) — a strict superset of BLOB carrying the serialized sketch.
  • Reader scalars (datasketch_*_quantile, datasketch_*_estimate, datasketch_*_cdf, …) that decode the BLOB and return values without rebuilding from rows.
  • Mergers (datasketch_hll_union, datasketch_theta_union, …) — aggregates over a sketch column that produce a rolled-up sketch.

The serialization format is the same one the Java reference implementation writes, which is the format Druid, Pinot, BigQuery, and Spark integrations consume. A sketch built in DuckDB and written to Parquet can be merged inside Druid or vice versa — no re-aggregation across systems.

Choosing a sketch

The right sketch depends on what you’re estimating and which constraint binds first — read latency, sketch size, or set algebra.

Distinct counts (cardinality)

FamilyWhen to pickNotes
HLLDefault — fastest reads, broad cross-system interoplg_k 4–21; 12 is a good default (~4 KB sketch)
CPCStorage-bound — many sketches at rest~40% smaller than HLL at the same accuracy
ThetaYou need union, intersection, or A-not-B (funnel, retention, churn)Only family with full set algebra

Quantiles

FamilyWhen to pickNotes
KLLDefault — best accuracy/size balanceK = 200 is a sensible production starting point
TDigestp99 / p999 SLO reportingFLOAT/DOUBLE only; tail-accurate
REQSkewed distributionsError scales with rank, not fixed
QuantilesCompatibility with classic pipelinesUse KLL for new work

Frequency

A single family — Frequent Items — for top-K heavy hitters with confidence-bounded frequencies. Use 'NO_FALSE_POSITIVES' to get items that are definitely heavy, 'NO_FALSE_NEGATIVES' to get every candidate that might be.

Compared to alternatives

  • DuckDB’s built-in approx_count_distinct and approx_quantile — both ship in DuckDB and use HLL / TDigest internally, but expose only the final estimate. Reach for this extension when you need the sketch as state — to persist it as a column, merge later, tune sketch parameters (lg_k, K, map_size), or interoperate with other Apache DataSketches consumers.
  • Exact aggregates (COUNT(DISTINCT), quantile_cont, mode) — the right answer when policy demands exact and your data fits. DuckDB’s vectorized exact aggregates are fast at most analytical scales; measure before reaching for an approximate sketch.
  • Reservoir / streaming aggregates rolled by hand — sketches give you the same bounded-memory streaming property plus mathematically grounded error bounds plus mergeability across processes. For anything that needs to survive a process restart or roll up across shards, the sketch is the better primitive.
  • Druid / Pinot / BigQuery sketch columns — those systems use the same serialization format. If you’re already producing sketches there, this extension lets DuckDB read and merge them; if you’re building them in DuckDB, downstream OLAP systems can consume the BLOBs directly.

Persistence and portability

Every sketch round-trips through a typed BLOB column (sketch_kll_double, sketch_hll, sketch_theta, etc.) that’s a strict superset of BLOB. Sketches written into a Parquet file can be read back into any DuckDB instance with this extension loaded, or into any other DataSketches-aware system, and merged without a re-scan. Aggregate functions accept either raw input values or an existing sketch column, so incremental rollup is just another SELECT.

Reference

Extension Contents

Quick reference to all available functions and settings organized by category.

Name Description
CPC

Compressed Probability Counting — distinct-counting at roughly 40% the size of HLL at the same accuracy, traded against slower serialization. Choose this when you store many sketches at rest.

datasketch_cpc() Aggregate input values into a Compressed Probability Counting sketch — a distinct-count sketch that's roughly 40% smaller than HLL at the same accuracy, traded against slower serialization
datasketch_cpc_describe() Return a string representation of the sketch
datasketch_cpc_estimate() Read the estimated distinct count from a CPC sketch
datasketch_cpc_is_empty() Return a boolean indicating if the sketch is empty
datasketch_cpc_lower_bound() Return the lower bound of the number of distinct items seen by the sketch
datasketch_cpc_union() Merge a column of sketch_cpc BLOBs into one rollup sketch — the standard per-partition / per-day rollup pattern
datasketch_cpc_upper_bound() Return the upper bound of the number of distinct items seen by the sketch
Frequent Items

Heavy-hitter sketch that identifies the most-frequent items in a stream along with confidence-bounded frequency estimates. Use for top-N analysis on high-cardinality streams where exact GROUP BY count() is too expensive.

datasketch_frequent_items() Aggregate input values into a Frequent Items (heavy-hitter) sketch
datasketch_frequent_items_epsilon() Returns the epsilon value (relative error) of the sketch
datasketch_frequent_items_estimate() Estimated frequency for a specific item
datasketch_frequent_items_get_frequent() Return the heavy-hitter candidates with per-item estimate, lower bound, and upper bound
datasketch_frequent_items_is_empty() Returns true if the sketch is empty
datasketch_frequent_items_lower_bound() Returns the lower bound frequency estimate for a specific item
datasketch_frequent_items_num_active() Returns the number of active items currently tracked by the sketch
datasketch_frequent_items_total_weight() Returns the total weight (sum of all item counts) processed by the sketch
datasketch_frequent_items_upper_bound() Returns the upper bound frequency estimate for a specific item
HLL

HyperLogLog distinct-counting sketch — the industry standard. Fast serialize/deserialize and broad cross-system compatibility. Choose this when speed and interop matter more than storage.

datasketch_hll() Aggregate input values into a HyperLogLog sketch for distinct counting
datasketch_hll_describe() Return a string representation of the sketch
datasketch_hll_estimate() Read the estimated distinct count from an HLL sketch
datasketch_hll_is_compact() Return whether the sketch is in compact form
datasketch_hll_is_empty() Return a boolean indicating if the sketch is empty
datasketch_hll_lg_config_k() Return the value of log base 2 K for this sketch
datasketch_hll_lower_bound() Lower bound of the HLL distinct-count estimate at a given number of standard deviations
datasketch_hll_union() Merge multiple HLL sketches into one
datasketch_hll_upper_bound() Upper bound of the HLL distinct-count estimate at a given number of standard deviations
KLL

KLL quantile sketch — modern mergeable quantile estimator. Best balance of accuracy, speed, and size for general-purpose quantile work. Default choice unless you have a specific reason to pick TDigest or REQ.

datasketch_kll() Aggregate input values into a KLL quantile sketch
datasketch_kll_cdf() CDF over a list of split points — one call returns the cumulative rank at each
datasketch_kll_describe() Return a description of this sketch
datasketch_kll_is_empty() Return a boolean indicating if the sketch is empty
datasketch_kll_is_estimation_mode() Return a boolean indicating if the sketch is in estimation mode
datasketch_kll_k() Return the value of K for this sketch
datasketch_kll_max_item() Return the maxium item in the sketch
datasketch_kll_min_item() Return the minimum item in the sketch
datasketch_kll_n() Return the number of items contained in the sketch
datasketch_kll_normalized_rank_error() Return the normalized rank error of the sketch
datasketch_kll_num_retained() Return the number of retained items in the sketch
datasketch_kll_pmf() PMF (probability mass) over a list of split points — fraction of the distribution falling in each bucket
datasketch_kll_quantile() Approximate quantile at a given rank — given a sketch and r ∈ [0, 1], returns the value at that rank in the sorted distribution
datasketch_kll_rank() Inverse of datasketch_kll_quantile — given a value, return its approximate rank r ∈ [0, 1] in the sorted distribution
Quantiles

Classic mergeable quantile sketch from the original DataSketches paper. Solid general-purpose choice supporting all numeric types; KLL is now preferred for new projects.

datasketch_quantiles() Aggregate input values into the classic mergeable quantiles sketch from the original DataSketches paper
datasketch_quantiles_cdf() Return the Cumulative Distribution Function (CDF) of the sketch for a series of points
datasketch_quantiles_describe() Return a description of this sketch
datasketch_quantiles_is_empty() Return a boolean indicating if the sketch is empty
datasketch_quantiles_is_estimation_mode() Return a boolean indicating if the sketch is in estimation mode
datasketch_quantiles_k() Return the value of K for this sketch
datasketch_quantiles_max_item() Return the maxium item in the sketch
datasketch_quantiles_min_item() Return the minimum item in the sketch
datasketch_quantiles_n() Return the number of items contained in the sketch
datasketch_quantiles_normalized_rank_error() Return the normalized rank error of the sketch
datasketch_quantiles_num_retained() Return the number of retained items in the sketch
datasketch_quantiles_pmf() Return the Probability Mass Function (PMF) of the sketch for a series of points
datasketch_quantiles_quantile() Approximate quantile at a given rank from a classic Quantiles sketch
datasketch_quantiles_rank() Approximate rank of a value within the classic Quantiles sketch
REQ

Relative-error quantile sketch. Error scales with rank rather than being fixed across the distribution — predictable accuracy on highly skewed data.

datasketch_req() Aggregate input values into a Relative Error Quantile sketch
datasketch_req_cdf() Return the Cumulative Distribution Function (CDF) of the sketch for a series of points
datasketch_req_describe() Return a description of this sketch
datasketch_req_is_empty() Return a boolean indicating if the sketch is empty
datasketch_req_is_estimation_mode() Return a boolean indicating if the sketch is in estimation mode
datasketch_req_k() Return the value of K for this sketch
datasketch_req_max_item() Return the maxium item in the sketch
datasketch_req_min_item() Return the minimum item in the sketch
datasketch_req_n() Return the number of items contained in the sketch
datasketch_req_num_retained() Return the number of retained items in the sketch
datasketch_req_pmf() Return the Probability Mass Function (PMF) of the sketch for a series of points
datasketch_req_quantile() Approximate quantile at a given rank from a REQ sketch
datasketch_req_rank() Approximate rank of a value within the REQ sketch
TDigest

t-digest — quantile sketch optimized for the tails (p99, p999). Most accurate where SLOs live; FLOAT/DOUBLE input only. Use when tail latencies or outliers matter more than median accuracy.

datasketch_tdigest() Aggregate input values into a t-digest quantile sketch — most accurate at the tails (p99, p999), exactly where SLOs live
datasketch_tdigest_cdf() Return the Cumulative Distribution Function (CDF) of the sketch for a series of points
datasketch_tdigest_describe() Return a description of this sketch
datasketch_tdigest_is_empty() Return a boolean indicating if the sketch is empty
datasketch_tdigest_k() Return the value of K for this sketch
datasketch_tdigest_pmf() Return the Probability Mass Function (PMF) of the sketch for a series of points
datasketch_tdigest_quantile() Approximate quantile at a given rank from a t-digest sketch
datasketch_tdigest_rank() Approximate rank of a value within the t-digest sketch
datasketch_tdigest_total_weight() Return the total weight of this sketch
Theta

Theta sketch — the only distinct-count family that supports set operations beyond union. Combine cohorts with union, intersect, and a_not_b directly on sketch state, without rescanning the source data.

datasketch_theta() Aggregate input values into a Theta sketch — the distinct-count family that supports set operations (union, intersect, A-not-B) beyond simple merge
datasketch_theta_a_not_b() Approximate |A \ B| — distinct items present in A but not in B
datasketch_theta_describe() Returns a human-readable description of the Theta sketch
datasketch_theta_estimate() Read the estimated distinct count from a Theta sketch
datasketch_theta_get_seed() Returns the seed hash used by the sketch
datasketch_theta_get_theta() Returns the theta value of the sketch (sampling probability)
datasketch_theta_intersect() Approximate |A ∩ B| — distinct items present in both cohorts
datasketch_theta_is_empty() Returns true if the Theta sketch is empty
datasketch_theta_is_estimation_mode() Returns true if the sketch is in estimation mode (has exceeded exact counting capacity)
datasketch_theta_lower_bound() Returns the lower bound estimate at the given number of standard deviations (1, 2, or 3)
datasketch_theta_num_retained() Returns the number of hash values retained in the sketch
datasketch_theta_union() Approximate |A ∪ B| — distinct count across the union of two cohorts
datasketch_theta_upper_bound() Returns the upper bound estimate at the given number of standard deviations (1, 2, or 3)

API Reference

Function Documentation

Detailed documentation for each function including signatures, parameters, and examples.

datasketch_cpc

Aggregate Function CPC
Signature
datasketch_cpc(col0: INTEGER, col1: FLOAT | BLOB | DOUBLE | …) → sketch_cpc
Parameters (Positional)
Parameter Type Mode Description
col0 INTEGER Positional
col1 FLOAT | BLOB | DOUBLE | …
12 concrete types
BIGINTBLOBDOUBLEFLOATINTEGERSMALLINTTINYINTUBIGINTUINTEGERUSMALLINTUTINYINTVARCHAR
Positional
Returns
Description

Aggregate input values into a Compressed Probability Counting sketch — a distinct-count sketch that's roughly 40% smaller than HLL at the same accuracy, traded against slower serialization. Choose CPC when you store many sketches at rest and reads are infrequent. The leading lg_k controls the size/accuracy trade-off the same way it does for HLL.

Examples
1

Distinct-count with CPC

SELECT datasketch_cpc_estimate(datasketch_cpc(12, user_id)) AS distinct_users
FROM events;

datasketch_cpc_describe

Scalar Function CPC
Signature
datasketch_cpc_describe(col0: sketch_cpc) → VARCHAR
Parameters (Positional)
Parameter Type Mode Description
col0 sketch_cpc Positional
Returns
Description

Return a string representation of the sketch

Examples
1
SELECT datasketch_cpc_describe(sketch);

datasketch_cpc_estimate

Scalar Function CPC
Signature
datasketch_cpc_estimate(col0: sketch_cpc) → DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 sketch_cpc Positional
Returns
Description

Read the estimated distinct count from a CPC sketch.

Examples
1
SELECT datasketch_cpc_estimate(sketch);

datasketch_cpc_is_empty

Scalar Function CPC
Signature
datasketch_cpc_is_empty(col0: sketch_cpc) → BOOLEAN
Parameters (Positional)
Parameter Type Mode Description
col0 sketch_cpc Positional
Returns
Description

Return a boolean indicating if the sketch is empty

Examples
1
SELECT datasketch_cpc_is_empty(sketch);

datasketch_cpc_lower_bound

Scalar Function CPC
Signature
datasketch_cpc_lower_bound(col0: sketch_cpc, col1: UTINYINT) → DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 sketch_cpc Positional
col1 UTINYINT Positional
Returns
Description

Return the lower bound of the number of distinct items seen by the sketch

Examples
1
SELECT datasketch_cpc_lower_bound(sketch, std_dev);

datasketch_cpc_union

Aggregate Function CPC
Signature
datasketch_cpc_union(col0: INTEGER, col1: sketch_cpc) → sketch_cpc
Parameters (Positional)
Parameter Type Mode Description
col0 INTEGER Positional
col1 sketch_cpc Positional
Returns
Description

Merge a column of sketch_cpc BLOBs into one rollup sketch — the standard per-partition / per-day rollup pattern.

Examples
1
SELECT datasketch_cpc_union(k, data);

datasketch_cpc_upper_bound

Scalar Function CPC
Signature
datasketch_cpc_upper_bound(col0: sketch_cpc, col1: UTINYINT) → DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 sketch_cpc Positional
col1 UTINYINT Positional
Returns
Description

Return the upper bound of the number of distinct items seen by the sketch

Examples
1
SELECT datasketch_cpc_upper_bound(sketch, std_dev);

datasketch_frequent_items

Aggregate Function Frequent Items
Signature
datasketch_frequent_items(col0: FLOAT | UINTEGER | SMALLINT | …) → sketch_frequent_items
Parameters (Positional)
Parameter Type Mode Description
col0 FLOAT | UINTEGER | SMALLINT | …
12 concrete types
BIGINTDOUBLEFLOATINTEGERSMALLINTTINYINTUBIGINTUINTEGERUSMALLINTUTINYINTVARCHARsketch_frequent_items
Positional
Returns
Description

Aggregate input values into a Frequent Items (heavy-hitter) sketch. The first argument is lg_max_map_size — log₂ of the maximum number of tracked items, so 8 → up to 256 candidates, 12 → up to 4096. Bigger values track more candidates and tighten the per-item error bound. Typical values run 4–12.

Examples
1

Top countries by page views

SELECT datasketch_frequent_items_get_frequent(
         datasketch_frequent_items(8, country_code),
         'NO_FALSE_POSITIVES'
       ) AS heavy_hitters
FROM page_views;

datasketch_frequent_items_epsilon

Scalar Function Frequent Items
Signature
datasketch_frequent_items_epsilon(col0: sketch_frequent_items) → DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 sketch_frequent_items Positional
Returns
Description

Returns the epsilon value (relative error) of the sketch

Examples
1
SELECT datasketch_frequent_items_epsilon(sketch);

datasketch_frequent_items_estimate

Scalar Function Frequent Items
Signature
datasketch_frequent_items_estimate(col0: sketch_frequent_items, col1: UBIGINT | BIGINT | SMALLINT | …) → BIGINT
Parameters (Positional)
Parameter Type Mode Description
col0 sketch_frequent_items Positional
col1 UBIGINT | BIGINT | SMALLINT | …
11 concrete types
BIGINTDOUBLEFLOATINTEGERSMALLINTTINYINTUBIGINTUINTEGERUSMALLINTUTINYINTVARCHAR
Positional
Returns
Description

Estimated frequency for a specific item.

Examples
1
SELECT datasketch_frequent_items_estimate(sketch, 'item');

datasketch_frequent_items_get_frequent

Frequent Items

Return the heavy-hitter candidates with per-item estimate, lower bound, and upper bound.

datasketch_frequent_items_is_empty

Frequent Items

Returns true if the sketch is empty

datasketch_frequent_items_lower_bound

Frequent Items

Returns the lower bound frequency estimate for a specific item

datasketch_frequent_items_num_active

Frequent Items

Returns the number of active items currently tracked by the sketch

datasketch_frequent_items_total_weight

Frequent Items

Returns the total weight (sum of all item counts) processed by the sketch

datasketch_frequent_items_upper_bound

Frequent Items

Returns the upper bound frequency estimate for a specific item

datasketch_hll

HLL

Aggregate input values into a HyperLogLog sketch for distinct counting.

datasketch_hll_describe

HLL

Return a string representation of the sketch

datasketch_hll_estimate

HLL

Read the estimated distinct count from an HLL sketch.

datasketch_hll_is_compact

HLL

Return whether the sketch is in compact form

datasketch_hll_is_empty

HLL

Return a boolean indicating if the sketch is empty

datasketch_hll_lg_config_k

HLL

Return the value of log base 2 K for this sketch

datasketch_hll_lower_bound

HLL

Lower bound of the HLL distinct-count estimate at a given number of standard deviations.

datasketch_hll_union

HLL

Merge multiple HLL sketches into one.

datasketch_hll_upper_bound

HLL

Upper bound of the HLL distinct-count estimate at a given number of standard deviations.

datasketch_kll

KLL

Aggregate input values into a KLL quantile sketch.

datasketch_kll_cdf

KLL

CDF over a list of split points — one call returns the cumulative rank at each.

datasketch_kll_describe

KLL

Return a description of this sketch

datasketch_kll_is_empty

KLL

Return a boolean indicating if the sketch is empty

datasketch_kll_is_estimation_mode

KLL

Return a boolean indicating if the sketch is in estimation mode

datasketch_kll_k

KLL

Return the value of K for this sketch

datasketch_kll_max_item

KLL

Return the maxium item in the sketch

datasketch_kll_min_item

KLL

Return the minimum item in the sketch

datasketch_kll_n

KLL

Return the number of items contained in the sketch

datasketch_kll_normalized_rank_error

KLL

Return the normalized rank error of the sketch

datasketch_kll_num_retained

KLL

Return the number of retained items in the sketch

datasketch_kll_pmf

KLL

PMF (probability mass) over a list of split points — fraction of the distribution falling in each bucket.

datasketch_kll_quantile

KLL

Approximate quantile at a given rank — given a sketch and `r ∈ [0, 1]`, returns the value at that rank in the sorted distribution.

datasketch_kll_rank

KLL

Inverse of `datasketch_kll_quantile` — given a value, return its approximate rank `r ∈ [0, 1]` in the sorted distribution.

datasketch_quantiles

Quantiles

Aggregate input values into the classic mergeable quantiles sketch from the original DataSketches paper.

datasketch_quantiles_cdf

Quantiles

Return the Cumulative Distribution Function (CDF) of the sketch for a series of points

datasketch_quantiles_describe

Quantiles

Return a description of this sketch

datasketch_quantiles_is_empty

Quantiles

Return a boolean indicating if the sketch is empty

datasketch_quantiles_is_estimation_mode

Quantiles

Return a boolean indicating if the sketch is in estimation mode

datasketch_quantiles_k

Quantiles

Return the value of K for this sketch

datasketch_quantiles_max_item

Quantiles

Return the maxium item in the sketch

datasketch_quantiles_min_item

Quantiles

Return the minimum item in the sketch

datasketch_quantiles_n

Quantiles

Return the number of items contained in the sketch

datasketch_quantiles_normalized_rank_error

Quantiles

Return the normalized rank error of the sketch

datasketch_quantiles_num_retained

Quantiles

Return the number of retained items in the sketch

datasketch_quantiles_pmf

Quantiles

Return the Probability Mass Function (PMF) of the sketch for a series of points

datasketch_quantiles_quantile

Quantiles

Approximate quantile at a given rank from a classic Quantiles sketch.

datasketch_quantiles_rank

Quantiles

Approximate rank of a value within the classic Quantiles sketch.

datasketch_req

REQ

Aggregate input values into a Relative Error Quantile sketch.

datasketch_req_cdf

REQ

Return the Cumulative Distribution Function (CDF) of the sketch for a series of points

datasketch_req_describe

REQ

Return a description of this sketch

datasketch_req_is_empty

REQ

Return a boolean indicating if the sketch is empty

datasketch_req_is_estimation_mode

REQ

Return a boolean indicating if the sketch is in estimation mode

datasketch_req_k

REQ

Return the value of K for this sketch

datasketch_req_max_item

REQ

Return the maxium item in the sketch

datasketch_req_min_item

REQ

Return the minimum item in the sketch

datasketch_req_n

REQ

Return the number of items contained in the sketch

datasketch_req_num_retained

REQ

Return the number of retained items in the sketch

datasketch_req_pmf

REQ

Return the Probability Mass Function (PMF) of the sketch for a series of points

datasketch_req_quantile

REQ

Approximate quantile at a given rank from a REQ sketch.

datasketch_req_rank

REQ

Approximate rank of a value within the REQ sketch.

datasketch_tdigest

TDigest

Aggregate input values into a t-digest quantile sketch — most accurate at the tails (p99, p999), exactly where SLOs live.

datasketch_tdigest_cdf

TDigest

Return the Cumulative Distribution Function (CDF) of the sketch for a series of points

datasketch_tdigest_describe

TDigest

Return a description of this sketch

datasketch_tdigest_is_empty

TDigest

Return a boolean indicating if the sketch is empty

datasketch_tdigest_k

TDigest

Return the value of K for this sketch

datasketch_tdigest_pmf

TDigest

Return the Probability Mass Function (PMF) of the sketch for a series of points

datasketch_tdigest_quantile

TDigest

Approximate quantile at a given rank from a t-digest sketch.

datasketch_tdigest_rank

TDigest

Approximate rank of a value within the t-digest sketch.

datasketch_tdigest_total_weight

TDigest

Return the total weight of this sketch

datasketch_theta

Theta

Aggregate input values into a Theta sketch — the distinct-count family that supports set operations (union, intersect, A-not-B) beyond simple merge.

datasketch_theta_a_not_b

Theta

Approximate `|A \ B|` — distinct items present in A but not in B.

datasketch_theta_describe

Theta

Returns a human-readable description of the Theta sketch

datasketch_theta_estimate

Theta

Read the estimated distinct count from a Theta sketch.

datasketch_theta_get_seed

Theta

Returns the seed hash used by the sketch

datasketch_theta_get_theta

Theta

Returns the theta value of the sketch (sampling probability)

datasketch_theta_intersect

Theta

Approximate `|A ∩ B|` — distinct items present in both cohorts.

datasketch_theta_is_empty

Theta

Returns true if the Theta sketch is empty

datasketch_theta_is_estimation_mode

Theta

Returns true if the sketch is in estimation mode (has exceeded exact counting capacity)

datasketch_theta_lower_bound

Theta

Returns the lower bound estimate at the given number of standard deviations (1, 2, or 3)

datasketch_theta_num_retained

Theta

Returns the number of hash values retained in the sketch

datasketch_theta_union

Theta

Approximate `|A ∪ B|` — distinct count across the union of two cohorts.

datasketch_theta_upper_bound

Theta

Returns the upper bound estimate at the given number of standard deviations (1, 2, or 3)

Practical Examples

Cookbook

Real-world recipes and patterns for common use cases.

Recipes for distinct counts, quantiles, and heavy hitters with Apache DataSketches.

Distinct count with HLL

The bread-and-butter sketch — approximate COUNT(DISTINCT) in a few KB of memory.

SELECT datasketch_hll_estimate(datasketch_hll(12, user_id)) AS distinct_users
FROM events;

The 12 is lg_k — base-2 log of the bucket count. lg_k = 12 (4096 buckets, ~4 KB sketch, ~1.6% standard error) is a common default. Bracket the estimate with datasketch_hll_lower_bound / datasketch_hll_upper_bound when you want a confidence interval.

Per-day HLL sketches, rolled up on demand

Build the sketch state once; query any window without rescanning.

CREATE TABLE daily_uniques (day DATE, hll sketch_hll);

INSERT INTO daily_uniques
SELECT date_trunc('day', ts) AS day,
       datasketch_hll(12, user_id) AS hll
FROM events
GROUP BY 1;

-- Rolling 7-day uniques — reads sketch state, not source events
SELECT datasketch_hll_estimate(datasketch_hll_union(12, hll)) AS uniques_last_7d
FROM daily_uniques
WHERE day >= CURRENT_DATE - 7;

The same pattern works for datasketch_cpc (smaller on disk, slower to write) and datasketch_theta (supports set operations beyond union).

Set operations with Theta — funnel, retention, churn

Theta is the only distinct-count family with full set algebra.

WITH a AS (SELECT datasketch_theta(12, user_id) AS s FROM events_jan),
     b AS (SELECT datasketch_theta(12, user_id) AS s FROM events_feb)
SELECT
  datasketch_theta_estimate(datasketch_theta_union(a.s, b.s))     AS jan_or_feb,
  datasketch_theta_estimate(datasketch_theta_intersect(a.s, b.s)) AS retained,    -- in both
  datasketch_theta_estimate(datasketch_theta_a_not_b(a.s, b.s))   AS churned,     -- jan not feb
  datasketch_theta_estimate(datasketch_theta_a_not_b(b.s, a.s))   AS new_users    -- feb not jan
FROM a, b;

This is the headline use case for Theta — funnel and cohort analysis directly off sketch state.

p50 / p95 / p99 with KLL

KLL is the recommended general-purpose quantile sketch.

WITH agg AS (
  SELECT datasketch_kll(200, latency_ms) AS sketch FROM requests
)
SELECT
  datasketch_kll_quantile(sketch, 0.50, true) AS p50,
  datasketch_kll_quantile(sketch, 0.95, true) AS p95,
  datasketch_kll_quantile(sketch, 0.99, true) AS p99
FROM agg;

K = 200 is a sensible production starting point. Higher K → smaller error, larger sketch. The trailing true is the inclusive flag — true for inclusive search (P[X ≤ q]), false for exclusive (P[X < q]). Use datasketch_kll_normalized_rank_error to inspect the actual rank-error guarantee for the configured K.

Histograms in one query — KLL CDF / PMF

Get the cumulative distribution at any set of split points without a second pass.

SELECT datasketch_kll_cdf(
         datasketch_kll(200, latency_ms),
         [10, 50, 100, 250, 500, 1000],
         /* inclusive */ true
       ) AS cdf
FROM requests;

Returns one row of cumulative ranks at each split point — directly usable for histograms. Swap cdf for datasketch_kll_pmf to get per-bucket probability mass instead.

Tail-accurate quantiles with TDigest

When p99 / p999 matters more than the median (SLO reporting), reach for t-digest.

CREATE TABLE readings(temp DOUBLE);
INSERT INTO readings(temp) SELECT unnest(generate_series(1, 10))::DOUBLE;

-- Rank of value 5 within the aggregated sketch
SELECT datasketch_tdigest_rank(datasketch_tdigest(100, temp), 5) AS rank_of_5
FROM readings;

Output

rank_of_5
0.45

CDF over multiple split points in one call:

SELECT datasketch_tdigest_cdf(datasketch_tdigest(100, temp), [1, 5, 9]) AS cdf
FROM readings;

Output

cdf
[0.0357, 0.3214, 0.6071, 1.0]

TDigest accepts FLOAT/DOUBLE only. For integer or skewed-rank quantile work, see datasketch_kll or datasketch_req.

Top-K heavy hitters with Frequent Items

Find the most frequent items in a stream — with confidence bounds on each estimate.

SELECT datasketch_frequent_items_get_frequent(
         datasketch_frequent_items(8, country_code),
         'NO_FALSE_POSITIVES'
       ) AS heavy_hitters
FROM page_views;

The result is a list of structs — (item, estimate, lower_bound, upper_bound). Use 'NO_FALSE_POSITIVES' for items that are definitely heavy hitters, 'NO_FALSE_NEGATIVES' for the union of all candidates that might be. The leading 8 is lg_max_map_size — log₂ of the maximum number of tracked items (so 8 → up to 256 candidates). Bigger values track more candidates and tighten the per-item error bound; typical values run 4–12.

Skew-aware quantiles with REQ

When the distribution is heavily skewed and you want predictable accuracy across the whole rank space, REQ scales its error with rank rather than fixing it.

SELECT
  datasketch_req_quantile(datasketch_req(12, value), 0.001, true) AS p001,
  datasketch_req_quantile(datasketch_req(12, value), 0.5,   true) AS p50,
  datasketch_req_quantile(datasketch_req(12, value), 0.999, true) AS p999
FROM measurements;

Persist sketches to Parquet, merge in another DuckDB

Sketch BLOBs are portable across processes and across DuckDB instances:

COPY (SELECT day, hll FROM daily_uniques)
  TO 'daily_uniques.parquet' (FORMAT PARQUET);

-- Later, anywhere with this extension loaded:
SELECT datasketch_hll_estimate(datasketch_hll_union(12, hll))
FROM read_parquet('daily_uniques.parquet')
WHERE day BETWEEN DATE '2026-01-01' AND DATE '2026-01-31';

The serialization format is the same one used by Druid, Pinot, BigQuery, and the Apache DataSketches Java library — sketches built here can be merged in those systems and vice versa.

Platform Support

Compatibility

Extension availability may vary by platform and DuckDB version. Check below to ensure this extension supports your environment before installation.

Quick Facts

Software License Apache-2.0
Pricing Free
Written In C++
Source Available Yes
View on GitHub
Usage
65,581+
loads in last 30 days

Platforms

Linux
Linux (musl)
macOS
Windows
WASM
Supported platform architectures
Linux: aarch64, x86_64
Linux (musl): x86_64
macOS: Intel, Apple Silicon
Windows: x86_64
WASM: mvp, eh, threads
Compiled binary sizes
Platform Architecture Size
Linux aarch64 3.64 MB
Linux x86_64 4.15 MB
Linux (musl) x86_64 4.08 MB
macOS Apple Silicon 2.81 MB
macOS Intel 3.19 MB
Windows x86_64 8.24 MB
WASM eh 273.1 KB
WASM mvp 266.8 KB
WASM threads 265.7 KB

Gzipped download size from the DuckDB community-extensions registry.

DuckDB Versions

Release calendar
Supported
v1.4.4 v1.5.2

Billion-row aggregates with bounded memory

Install the DataSketches extension to compute distinct counts, quantiles, and heavy hitters as mergeable sketch state — persistable, shippable, and interoperable with the broader Apache DataSketches ecosystem.