🌳

Marisa

Static, space-efficient MARISA trie data structures for DuckDB. Build a compact searchable string set as a single BLOB, then run lookups, prefix searches, and predictive completions against it. The lookup primitive for autocomplete, spell-check, and routing tables.

Install

-- Install the extension
INSTALL marisa FROM community;

-- Load it into your session
LOAD marisa;

-- Build a trie from a column
CREATE TABLE employees(name TEXT);
INSERT INTO employees VALUES ('Alice'),('Bob'),('Megan'),('Melissa');
CREATE TABLE employees_trie AS
  SELECT marisa_trie(name) AS trie FROM employees;

-- Membership test
SELECT marisa_lookup(trie, 'Alice') FROM employees_trie;

-- Autocomplete: names starting with 'Me'
SELECT marisa_predictive(trie, 'Me', 10) FROM employees_trie;

Technical Overview

Why Use a MARISA Trie from DuckDB?

Pack a column of strings into a single BLOB and answer membership, autocomplete, and longest-prefix questions against it β€” in time proportional to the search key, not the dictionary size. Pairs well with bitfilters: both answer set-membership questions, but a MARISA trie keeps the strings and supports prefix queries.

🌳 What this extension is for

Big static string sets β€” wordlists, place names, route prefixes, autocomplete dictionaries β€” compiled into one self-contained BLOB you can store in any DuckDB column, persist to Parquet, and ship anywhere DuckDB runs.

  • β€’ Autocomplete and type-ahead: Microsecond "starts with" lookups against a single BLOB. Drop a search box on a static corpus without standing up Elasticsearch.
  • β€’ Spell-check and dictionary membership: Exact "is this string in the set?" tests, smaller than a hash set and prefix-aware in a way bitfilters can't be.
  • β€’ Longest-prefix-match routing: Find every dictionary string that is itself a prefix of the input β€” the primitive behind URL routers, CIDR label tables, and rule-matching lookups.
  • β€’ A compact join side: One trie per partition / day / tenant ships dramatically smaller than the equivalent string list, and serves membership and prefix queries from the same blob.

βš™οΈ How it works

MARISA is a succinct static Patricia trie maintained as s-yata/marisa-trie; this extension is a thin DuckDB wrapper.

  • β€’ One aggregate produces one BLOB: Aggregate a VARCHAR column once at write time and you get a single immutable BLOB. Store it anywhere DuckDB stores BLOBs β€” table column, Parquet file, S3 object.
  • β€’ Lookup cost scales with the key, not the dictionary: The first query is as fast as the millionth, whether the trie holds a thousand strings or ten million.
  • β€’ Prefix-rich data compresses sharply: URLs, paths, and dictionary words share heavy prefix structure β€” a 470K-entry English wordlist fits in roughly 800 KB, about a fifth the size of the same words as a flat sorted list.

πŸ›‘οΈ When to choose something else

MARISA is a sharp, narrow tool. Knowing where it isn't the right fit is half the value.

  • β€’ Static β€” no incremental insert or delete: Once built, the trie is immutable. Updates mean rebuilding from the source set.
  • β€’ String keys only: VARCHAR in, VARCHAR out. Numeric keys need to be encoded as text upstream or stored in a different index.
  • β€’ Unrelated random strings: pick a Bloom: Without shared prefixes, the compression edge disappears. For pure membership on uncorrelated keys, bitfilters is smaller and just as fast β€” but it can't return the strings or answer prefix queries.

🎯 Common use cases

Autocomplete backend in SQL

Build the trie once at ingest time, store the BLOB, and serve type-ahead requests directly from DuckDB. No separate search service for static dictionaries.

Spell-check shipped inside a Parquet file

A wordlist or term list compiled into a single BLOB column. Pair with fuzzycomplete or rapidfuzz for fuzzy fallbacks when exact lookup misses.

Routing and longest-prefix-match tables

URL paths, feature-flag rules, geo / CIDR labels β€” anywhere you ask "which rules match this input?"

Compact join dictionary

One trie per partition or per day. The BLOB ships smaller than the underlying string list and serves prefix and membership queries from the same blob.

Deep Dive

Technical Details

What you can do with one query

Pack a column of strings into a single, self-contained BLOB β€” then run membership, predictive, and longest-prefix-match lookups against it forever:

-- Build once at ingest time
CREATE TABLE words_trie AS
  SELECT marisa_trie(word) AS trie FROM dictionary;

-- Three lookups against the same BLOB
SELECT
  marisa_lookup(trie, 'duckdb')                 AS exact,         -- BOOLEAN
  marisa_predictive(trie, 'duck', 10)           AS completions,   -- VARCHAR[]
  marisa_common_prefix(trie, 'duckdbtools', 10) AS prefixes_of    -- VARCHAR[]
FROM words_trie;

marisa_trie is the only constructor β€” an aggregate that emits a single BLOB. The three query primitives (marisa_lookup, marisa_predictive, marisa_common_prefix) read the BLOB directly. No decode step, no warm-up; the first query is as fast as the millionth.

Static is the feature

A MARISA trie is built once and queried many times. There is no marisa_insert or marisa_delete β€” to add or remove keys, rebuild the trie from the updated source set. For a 470K-entry English wordlist that’s milliseconds; for billions of keys, plan an offline rebuild step.

The static design is what makes the structure succinct: keys are packed near the information-theoretic minimum, the BLOB is mmap-ready on load, and lookups don’t pay any synchronization cost. If your set churns continuously, a hash set or a bitfilters Quotient filter is a better fit.

What is MARISA?

MARISA β€” Matching Algorithm with Recursively Implemented StorAge β€” is a static, succinct Patricia / radix trie. The reference C++ implementation is s-yata/marisa-trie, maintained for over a decade. This extension is a thin DuckDB wrapper around that library.

A trie is the right data structure when:

  • The strings have shared prefixes β€” URLs, paths, names, dictionary words. Prefix-rich data compresses dramatically.
  • You need prefix queries β€” autocomplete, longest-prefix match, IP-CIDR routing β€” that hash sets and Bloom filters can’t answer at all.
  • You want bounded lookup time β€” O(|key|), independent of dictionary size.

For random unrelated strings without prefix structure, a hash set or a Bloom / XOR / Binary Fuse filter from bitfilters will be smaller and faster.

Storage

Each marisa_trie call produces one BLOB. The BLOB is self-contained and portable β€” write it to a column, persist in Parquet, ship to another DuckDB instance. Loading is essentially a pointer-walk; there is no parsing or decompression step at query time.

A typical English-word trie of ~470K entries weighs in around 800 KB β€” roughly a fifth the size of the same words as a flat sorted list, before any column compression.

Lookup operations

OperationFunctionReturns
Exact membershipmarisa_lookupBOOLEAN
Predictive (autocomplete)marisa_predictiveVARCHAR[]
Longest-prefix-matchmarisa_common_prefixVARCHAR[]

All three run against the BLOB directly. The first query is as fast as the millionth.

MARISA vs bitfilters

Both extensions answer set-membership questions; they sit in the same Indexing sub-bucket on the catalog page for that reason. The mechanics are different in ways that matter:

marisabitfilters
Stores the strings?Yes β€” keys are recoverableNo β€” only a probabilistic fingerprint
False positivesNone β€” answers are exactBounded (e.g. ~0.4% for xor8)
Prefix / predictive queriesYes (marisa_predictive, marisa_common_prefix)No
Best whenPrefix-rich strings, autocompleteRandom keys, smallest possible memory
MutationStatic β€” rebuild to updateStatic (XOR / Fuse / Bloom) or dynamic (Quotient)

Use MARISA when you need the strings back or you need prefix queries. Use bitfilters when you only need a yes/no on uncorrelated keys and memory is the binding constraint.

Pairings

  • Build at ingest, query at runtime. Materialize one trie per user / partition / day at write time, then query at read time. The static design is a feature here β€” the trie won’t change underneath you.
  • Pair with fuzzycomplete or rapidfuzz for fuzzy + prefix lookup pipelines: exact prefix via MARISA, fuzzy fallback via the others.
  • Use as the dictionary side of a join β€” much smaller wire size than transferring the underlying string list.

Limitations

  • Static β€” no incremental insert or delete. Rebuild the trie from the updated source set.
  • String-only. No numeric or composite keys.
  • Lexicographic order only. If you need a different ordering at query time, build separate tries.

Reference

Extension Contents

Quick reference to all available functions and settings organized by category.

Name Description
Build

The aggregate that turns a column of strings into a single MARISA trie BLOB. Build once at ingest time, store the BLOB in any column type, query repeatedly. Updates require a rebuild β€” there is no incremental insert.

marisa_trie() Aggregate that builds a MARISA trie from a column of strings
Query

Membership tests, autocomplete (predictive prefix), and longest-prefix-match against an existing trie BLOB. All three primitives run in O(|key|) β€” lookup work is independent of dictionary size.

marisa_common_prefix() Returns VARCHAR[] β€” every string in the trie that is a prefix of the search string, capped at max_results
marisa_lookup() Returns BOOLEAN β€” true iff the search string is in the trie
marisa_predictive() Returns VARCHAR[] β€” every string in the trie that starts with the given prefix, capped at max_results

API Reference

Function Documentation

Detailed documentation for each function including signatures, parameters, and examples.

marisa_common_prefix

Scalar Function Query
Signature
marisa_common_prefix(col0: BLOB, col1: VARCHAR, col2: INTEGER) β†’ VARCHAR[]
Parameters (Positional)
Parameter Type Mode Description
col0 BLOB Positional
col1 VARCHAR Positional
col2 INTEGER Positional
Returns
Description

Returns VARCHAR[] β€” every string in the trie that is a prefix of the search string, capped at max_results. The inverse of marisa_predictive: useful for routing tables, IP / CIDR label lookups, and longest-prefix-match style queries ("which rules match this input?").

Examples
1

Find every country-code prefix that matches an input

SELECT marisa_common_prefix(trie, 'USA', 10) AS matches
FROM country_codes_trie;
-- ['U', 'US', 'USA']

marisa_lookup

Scalar Function Query
Signature
marisa_lookup(col0: BLOB, col1: VARCHAR) β†’ BOOLEAN
Parameters (Positional)
Parameter Type Mode Description
col0 BLOB Positional
col1 VARCHAR Positional
Returns
Description

Returns BOOLEAN β€” true iff the search string is in the trie. The fastest membership test in this extension; an exact, allocation-free yes/no against the BLOB. Use it for spell-check, dictionary membership, or anywhere a hash set would otherwise live.

Examples
1

Spell-check against an English wordlist trie

SELECT marisa_lookup(trie, 'duckling') AS in_dictionary
FROM words_trie;

marisa_predictive

Scalar Function Query
Signature
marisa_predictive(col0: BLOB, col1: VARCHAR, col2: INTEGER) β†’ VARCHAR[]
Parameters (Positional)
Parameter Type Mode Description
col0 BLOB Positional
col1 VARCHAR Positional
col2 INTEGER Positional
Returns
Description

Returns VARCHAR[] β€” every string in the trie that starts with the given prefix, capped at max_results. The autocomplete primitive: feed it the user's partial input, get back the candidate completions. The cap is a hard cutoff so a short prefix on a million-entry dictionary doesn't return everything.

Examples
1

Type-ahead suggestions for a search box

SELECT marisa_predictive(trie, 'duck', 10) AS suggestions
FROM words_trie;

marisa_trie

Aggregate Function Build
Signature
marisa_trie(col0: VARCHAR) β†’ BLOB
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
Returns
Description

Aggregate that builds a MARISA trie from a column of strings. Returns a single self-contained BLOB you can store in any column type, persist in Parquet, or ship to another DuckDB instance. The trie is static β€” to add or remove keys, rebuild from the updated source set.

Examples
1

Build a trie from a column at ingest time

CREATE TABLE employees_trie AS
  SELECT marisa_trie(name) AS trie FROM employees;
2

One trie per logical partition (day, tenant, region)

CREATE TABLE name_tries AS
SELECT day,
       marisa_trie(name) AS trie
FROM signups
GROUP BY day;

Practical Examples

Cookbook

Real-world recipes and patterns for common use cases.

Build a trie

marisa_trie is an aggregate β€” feed it a column of strings, get one BLOB back. Build at ingest time, store the BLOB, query forever.

CREATE TABLE employees(name TEXT);
INSERT INTO employees VALUES
  ('Alice'),('Bob'),('Charlie'),('David'),('Eve'),('Frank'),
  ('Mallory'),('Megan'),('Oscar'),('Melissa');

CREATE TABLE employees_trie AS
  SELECT marisa_trie(name) AS trie FROM employees;

SELECT octet_length(trie) AS bytes FROM employees_trie;

Trie size scales with shared-prefix density rather than raw input bytes; the more prefix structure your strings share, the bigger the win versus a flat sorted list.

Membership lookup (spell-check, dictionary)

marisa_lookup returns BOOLEAN β€” does this exact string exist in the trie?

SELECT marisa_lookup(trie, 'Alice')   FROM employees_trie;  -- TRUE
SELECT marisa_lookup(trie, 'Unknown') FROM employees_trie;  -- FALSE

This is the spell-check / β€œis this term in our dictionary?” primitive. It also stands in for a hash set anywhere you’d otherwise pre-load one for IN (...)-style filtering.

Autocomplete (type-ahead UI)

marisa_predictive(trie, prefix, max_results) returns every entry in the trie whose prefix matches:

-- Names starting with 'Me'
SELECT marisa_predictive(trie, 'Me', 10) AS suggestions
FROM employees_trie;
-- ['Megan', 'Melissa']

Drop this into the backend of a type-ahead search box: one row, one BLOB, microsecond lookups. The max_results cap is a hard cutoff so a short prefix against a million-entry dictionary doesn’t return the whole table.

Longest-prefix match (routing tables)

marisa_common_prefix is the inverse of predictive β€” every string in the trie that is itself a prefix of the search input:

CREATE TABLE country_codes(code TEXT);
INSERT INTO country_codes VALUES ('U'), ('US'), ('USA');

CREATE TABLE country_codes_trie AS
  SELECT marisa_trie(code) AS trie FROM country_codes;

SELECT marisa_common_prefix(trie, 'USA', 10) AS matches
FROM country_codes_trie;
-- ['U', 'US', 'USA']

Useful for routing tables, IP-prefix labels, feature-flag rule lookups, and any β€œfind every rule that matches this input” pattern.

Big static dictionary: English words

The classic MARISA workload β€” load a wordlist into a trie once, ship the BLOB inside a Parquet file:

-- Load any newline-delimited wordlist (e.g. /usr/share/dict/words on macOS/Linux)
CREATE TABLE words AS
  SELECT trim(column0) AS word
  FROM read_csv('/usr/share/dict/words', columns := {'column0': 'VARCHAR'});

CREATE TABLE words_trie AS
  SELECT marisa_trie(word) AS trie FROM words;

-- Spell-check
SELECT marisa_lookup(trie, 'duckling') FROM words_trie;        -- TRUE
SELECT marisa_lookup(trie, 'qwxzy')    FROM words_trie;        -- FALSE

-- Autocomplete
SELECT marisa_predictive(trie, 'duck', 8) FROM words_trie;
-- ['duck','duckbill','duckboard','ducker',...]

For ~470K English words the resulting BLOB lands around 800 KB β€” about a fifth the size of the source list, queryable directly with no decode step.

Persist a trie to Parquet

The trie is just a BLOB β€” Parquet handles it natively. Build once, write, ship anywhere a DuckDB lives:

COPY (SELECT trie FROM words_trie)
  TO 'words_trie.parquet' (FORMAT 'parquet');

-- On the other side
CREATE TABLE words_trie AS
  SELECT * FROM 'words_trie.parquet';

SELECT marisa_predictive(trie, 'rou', 5) FROM words_trie;

Loading is a pointer-walk β€” no parsing, no decompression. Useful for shipping a fixed dictionary alongside an application without standing up a separate index service.

One trie per partition / tenant

Materialize one trie per logical partition (day, tenant, region) at write time. The set of tries is tiny compared to the source data and serves all three lookup operations at query time:

CREATE TABLE name_tries AS
SELECT day,
       marisa_trie(name) AS trie
FROM signups
GROUP BY day;

-- Did a given name show up on a specific day?
SELECT day
FROM name_tries
WHERE marisa_lookup(trie, 'Megan');

-- All names that started with 'Me' on each day
SELECT day, marisa_predictive(trie, 'Me', 20) AS matches
FROM name_tries
ORDER BY day;

Updating a trie

There is no marisa_insert β€” to update, rebuild from the new source set:

-- Source set changed; rebuild the trie
CREATE OR REPLACE TABLE words_trie AS
  SELECT marisa_trie(word) AS trie FROM words;

For a dictionary that changes daily this is a millisecond-scale CTAS. For a billion-key set, schedule the rebuild offline and treat the trie as a build artifact.

When a trie vs a HashSet vs bitfilters

  • Trie β€” wins when the data has shared prefixes (URLs, names, dictionary words) or you need prefix / predictive queries. Memory: roughly 30–60% of the equivalent flat string list. Lookup: O(|key|), independent of dictionary size.
  • HashSet (DuckDB’s built-in IN) β€” wins for small, random, unrelated string sets where simplicity dominates.
  • bitfilters β€” wins for very large random key sets where you only need a yes/no answer and memory is the binding constraint. Doesn’t store the strings; can’t do prefix queries.

If you’re building autocomplete or have prefix-rich data, MARISA is the right structure. For pure membership tests on uncorrelated random strings, a Bloom / XOR / Binary Fuse filter is smaller. For evolving sets that need insert and delete, neither MARISA nor the static bitfilters variants apply β€” use a Quotient filter or a real index.

Platform Support

Compatibility

Extension availability may vary by platform and DuckDB version. Check below to ensure this extension supports your environment before installation.

Quick Facts

Software License MIT
Pricing Free
Written In C++
Source Available Yes
View on GitHub
Usage
13,933+
loads in last 30 days

Platforms

Linux
Linux (musl)
macOS
Windows
WASM
Supported platform architectures
Linux: aarch64, x86_64
Linux (musl): x86_64
macOS: Apple Silicon, Intel
Windows: x86_64
WASM: threads, mvp, eh
Compiled binary sizes
Platform Architecture Size
Linux aarch64 2.72 MB
Linux x86_64 3.09 MB
Linux (musl) x86_64 2.36 MB
macOS Apple Silicon 1.85 MB
macOS Intel 2.16 MB
Windows x86_64 7.54 MB
WASM eh 17.9 KB
WASM mvp 16.3 KB
WASM threads 16.2 KB

Gzipped download size from the DuckDB community-extensions registry.

DuckDB Versions

Release calendar
Supported
v1.4.4 v1.5.2

Compact String Lookups in SQL

Install Marisa to build and query MARISA tries β€” fast, memory-tight string sets for autocomplete, spell-check, prefix routing, and dedupe.