🌳

Marisa

Static, space-efficient MARISA trie data structures for DuckDB. Build a compact searchable string set as a single BLOB, then run lookups, prefix searches, and predictive completions against it. The lookup primitive for autocomplete, spell-check, and routing tables.

Browse Contents

Install

-- Install the extension
INSTALL marisa FROM community;

-- Load it into your session
LOAD marisa;

-- Build a trie from a column
CREATE TABLE employees(name TEXT);
INSERT INTO employees VALUES ('Alice'),('Bob'),('Megan'),('Melissa');
CREATE TABLE employees_trie AS
  SELECT marisa_trie(name) AS trie FROM employees;

-- Membership test
SELECT marisa_lookup(trie, 'Alice') FROM employees_trie;

-- Autocomplete: names starting with 'Me'
SELECT marisa_predictive(trie, 'Me', 10) FROM employees_trie;

Technical Overview

Why Use a MARISA Trie from DuckDB?

Pack a column of strings into a single BLOB and answer membership, autocomplete, and longest-prefix questions against it — in time proportional to the search key, not the dictionary size. Pairs well with bitfilters: both answer set-membership questions, but a MARISA trie keeps the strings and supports prefix queries.

🌳 What this extension is for

Big static string sets — wordlists, place names, route prefixes, autocomplete dictionaries — compiled into one self-contained BLOB you can store in any DuckDB column, persist to Parquet, and ship anywhere DuckDB runs.

• Autocomplete and type-ahead: Microsecond "starts with" lookups against a single BLOB. Drop a search box on a static corpus without standing up Elasticsearch.
• Spell-check and dictionary membership: Exact "is this string in the set?" tests, smaller than a hash set and prefix-aware in a way bitfilters can't be.
• Longest-prefix-match routing: Find every dictionary string that is itself a prefix of the input — the primitive behind URL routers, CIDR label tables, and rule-matching lookups.
• A compact join side: One trie per partition / day / tenant ships dramatically smaller than the equivalent string list, and serves membership and prefix queries from the same blob.

⚙️ How it works

MARISA is a succinct static Patricia trie maintained as s-yata/marisa-trie; this extension is a thin DuckDB wrapper.

• One aggregate produces one BLOB: Aggregate a VARCHAR column once at write time and you get a single immutable BLOB. Store it anywhere DuckDB stores BLOBs — table column, Parquet file, S3 object.
• Lookup cost scales with the key, not the dictionary: The first query is as fast as the millionth, whether the trie holds a thousand strings or ten million.
• Prefix-rich data compresses sharply: URLs, paths, and dictionary words share heavy prefix structure — a 470K-entry English wordlist fits in roughly 800 KB, about a fifth the size of the same words as a flat sorted list.

🛡️ When to choose something else

MARISA is a sharp, narrow tool. Knowing where it isn't the right fit is half the value.

• Static — no incremental insert or delete: Once built, the trie is immutable. Updates mean rebuilding from the source set.
• String keys only: VARCHAR in, VARCHAR out. Numeric keys need to be encoded as text upstream or stored in a different index.
• Unrelated random strings: pick a Bloom: Without shared prefixes, the compression edge disappears. For pure membership on uncorrelated keys, bitfilters is smaller and just as fast — but it can't return the strings or answer prefix queries.

🎯 Common use cases

Autocomplete backend in SQL

Build the trie once at ingest time, store the BLOB, and serve type-ahead requests directly from DuckDB. No separate search service for static dictionaries.

Spell-check shipped inside a Parquet file

A wordlist or term list compiled into a single BLOB column. Pair with fuzzycomplete or rapidfuzz for fuzzy fallbacks when exact lookup misses.

Routing and longest-prefix-match tables

URL paths, feature-flag rules, geo / CIDR labels — anywhere you ask "which rules match this input?"

Compact join dictionary

One trie per partition or per day. The BLOB ships smaller than the underlying string list and serves prefix and membership queries from the same blob.

Deep Dive

Technical Details

What you can do with one query

Pack a column of strings into a single, self-contained BLOB — then run membership, predictive, and longest-prefix-match lookups against it forever:

-- Build once at ingest time
CREATE TABLE words_trie AS
  SELECT marisa_trie(word) AS trie FROM dictionary;

-- Three lookups against the same BLOB
SELECT
  marisa_lookup(trie, 'duckdb')                 AS exact,         -- BOOLEAN
  marisa_predictive(trie, 'duck', 10)           AS completions,   -- VARCHAR[]
  marisa_common_prefix(trie, 'duckdbtools', 10) AS prefixes_of    -- VARCHAR[]
FROM words_trie;

marisa_trie is the only constructor — an aggregate that emits a single BLOB. The three query primitives (marisa_lookup, marisa_predictive, marisa_common_prefix) read the BLOB directly. No decode step, no warm-up; the first query is as fast as the millionth.

Static is the feature

A MARISA trie is built once and queried many times. There is no marisa_insert or marisa_delete — to add or remove keys, rebuild the trie from the updated source set. For a 470K-entry English wordlist that’s milliseconds; for billions of keys, plan an offline rebuild step.

The static design is what makes the structure succinct: keys are packed near the information-theoretic minimum, the BLOB is mmap-ready on load, and lookups don’t pay any synchronization cost. If your set churns continuously, a hash set or a bitfilters Quotient filter is a better fit.

What is MARISA?

MARISA — Matching Algorithm with Recursively Implemented StorAge — is a static, succinct Patricia / radix trie. The reference C++ implementation is s-yata/marisa-trie, maintained for over a decade. This extension is a thin DuckDB wrapper around that library.

A trie is the right data structure when:

The strings have shared prefixes — URLs, paths, names, dictionary words. Prefix-rich data compresses dramatically.
You need prefix queries — autocomplete, longest-prefix match, IP-CIDR routing — that hash sets and Bloom filters can’t answer at all.
You want bounded lookup time — O(|key|), independent of dictionary size.

For random unrelated strings without prefix structure, a hash set or a Bloom / XOR / Binary Fuse filter from bitfilters will be smaller and faster.

Storage

Each marisa_trie call produces one BLOB. The BLOB is self-contained and portable — write it to a column, persist in Parquet, ship to another DuckDB instance. Loading is essentially a pointer-walk; there is no parsing or decompression step at query time.

A typical English-word trie of ~470K entries weighs in around 800 KB — roughly a fifth the size of the same words as a flat sorted list, before any column compression.

Lookup operations

Operation	Function	Returns
Exact membership	`marisa_lookup`	`BOOLEAN`
Predictive (autocomplete)	`marisa_predictive`	`VARCHAR[]`
Longest-prefix-match	`marisa_common_prefix`	`VARCHAR[]`

All three run against the BLOB directly. The first query is as fast as the millionth.

MARISA vs `bitfilters`

Both extensions answer set-membership questions; they sit in the same Indexing sub-bucket on the catalog page for that reason. The mechanics are different in ways that matter:

	`marisa`	`bitfilters`
Stores the strings?	Yes — keys are recoverable	No — only a probabilistic fingerprint
False positives	None — answers are exact	Bounded (e.g. ~0.4% for `xor8`)
Prefix / predictive queries	Yes (`marisa_predictive`, `marisa_common_prefix`)	No
Best when	Prefix-rich strings, autocomplete	Random keys, smallest possible memory
Mutation	Static — rebuild to update	Static (XOR / Fuse / Bloom) or dynamic (Quotient)

Use MARISA when you need the strings back or you need prefix queries. Use bitfilters when you only need a yes/no on uncorrelated keys and memory is the binding constraint.

Pairings

Build at ingest, query at runtime. Materialize one trie per user / partition / day at write time, then query at read time. The static design is a feature here — the trie won’t change underneath you.
Pair with fuzzycomplete or rapidfuzz for fuzzy + prefix lookup pipelines: exact prefix via MARISA, fuzzy fallback via the others.
Use as the dictionary side of a join — much smaller wire size than transferring the underlying string list.

Limitations

Static — no incremental insert or delete. Rebuild the trie from the updated source set.
String-only. No numeric or composite keys.
Lexicographic order only. If you need a different ordering at query time, build separate tries.

Reference

Extension Contents

Quick reference to all available functions and settings organized by category.

Name	Type	Description
Build The aggregate that turns a column of strings into a single MARISA trie BLOB. Build once at ingest time, store the BLOB in any column type, query repeatedly. Updates require a rebuild — there is no incremental insert.
marisa_trie()	Aggregate Function	Aggregate that builds a MARISA trie from a column of strings
Query Membership tests, autocomplete (predictive prefix), and longest-prefix-match against an existing trie BLOB. All three primitives run in O(\|key\|) — lookup work is independent of dictionary size.
marisa_common_prefix()	Scalar Function	Returns `VARCHAR[]` — every string in the trie that is a prefix of the search string, capped at `max_results`
marisa_lookup()	Scalar Function	Returns `BOOLEAN` — `true` iff the search string is in the trie
marisa_predictive()	Scalar Function	Returns `VARCHAR[]` — every string in the trie that starts with the given prefix, capped at `max_results`

API Reference

Function Documentation

Detailed documentation for each function including signatures, parameters, and examples.

marisa_common_prefix

Scalar Function Query

Signature

marisa_common_prefix(col0: BLOB, col1: VARCHAR, col2: INTEGER) → VARCHAR[]

Parameters (Positional)

Parameter	Type	Mode
`col0`	`BLOB`	Positional
`col1`	`VARCHAR`	Positional
`col2`	`INTEGER`	Positional

Returns

Description

Returns VARCHAR[] — every string in the trie that is a prefix of the search string, capped at max_results. The inverse of marisa_predictive: useful for routing tables, IP / CIDR label lookups, and longest-prefix-match style queries ("which rules match this input?").

Examples

Find every country-code prefix that matches an input

SELECT marisa_common_prefix(trie, 'USA', 10) AS matches
FROM country_codes_trie;
-- ['U', 'US', 'USA']

Related Functions

marisa_predictive() — Returns `VARCHAR[]` — every string in the trie that **starts with** the given prefix, capped at `max_results`
marisa_lookup() — Returns `BOOLEAN` — `true` iff the search string is in the trie
marisa_trie() — Aggregate that builds a MARISA trie from a column of strings

Back to Extension Contents

marisa_lookup

Scalar Function Query

Signature

marisa_lookup(col0: BLOB, col1: VARCHAR) → BOOLEAN

Parameters (Positional)

Parameter	Type	Mode	Description
`col0`	`BLOB`	Positional
`col1`	`VARCHAR`	Positional

Returns

Description

Returns BOOLEAN — true iff the search string is in the trie. The fastest membership test in this extension; an exact, allocation-free yes/no against the BLOB. Use it for spell-check, dictionary membership, or anywhere a hash set would otherwise live.

Examples

Spell-check against an English wordlist trie

SELECT marisa_lookup(trie, 'duckling') AS in_dictionary
FROM words_trie;

Related Functions

marisa_predictive() — Returns `VARCHAR[]` — every string in the trie that **starts with** the given prefix, capped at `max_results`
marisa_common_prefix() — Returns `VARCHAR[]` — every string in the trie that is **a prefix of** the search string, capped at `max_results`
marisa_trie() — Aggregate that builds a MARISA trie from a column of strings

Back to Extension Contents

marisa_predictive

Scalar Function Query

Signature

marisa_predictive(col0: BLOB, col1: VARCHAR, col2: INTEGER) → VARCHAR[]

Parameters (Positional)

Parameter	Type	Mode
`col0`	`BLOB`	Positional
`col1`	`VARCHAR`	Positional
`col2`	`INTEGER`	Positional

Returns

Description

Returns VARCHAR[] — every string in the trie that starts with the given prefix, capped at max_results. The autocomplete primitive: feed it the user's partial input, get back the candidate completions. The cap is a hard cutoff so a short prefix on a million-entry dictionary doesn't return everything.

Examples

Type-ahead suggestions for a search box

SELECT marisa_predictive(trie, 'duck', 10) AS suggestions
FROM words_trie;

Related Functions

marisa_lookup() — Returns `BOOLEAN` — `true` iff the search string is in the trie
marisa_common_prefix() — Returns `VARCHAR[]` — every string in the trie that is **a prefix of** the search string, capped at `max_results`
marisa_trie() — Aggregate that builds a MARISA trie from a column of strings

Back to Extension Contents

marisa_trie

Aggregate Function Build

Signature

marisa_trie(col0: VARCHAR) → BLOB

Parameters (Positional)

Parameter	Type	Mode	Description
`col0`	`VARCHAR`	Positional

Returns

Description

Aggregate that builds a MARISA trie from a column of strings. Returns a single self-contained BLOB you can store in any column type, persist in Parquet, or ship to another DuckDB instance. The trie is static — to add or remove keys, rebuild from the updated source set.

Examples

Build a trie from a column at ingest time

CREATE TABLE employees_trie AS
  SELECT marisa_trie(name) AS trie FROM employees;

One trie per logical partition (day, tenant, region)

CREATE TABLE name_tries AS
SELECT day,
       marisa_trie(name) AS trie
FROM signups
GROUP BY day;

Related Functions

marisa_lookup() — Returns `BOOLEAN` — `true` iff the search string is in the trie
marisa_predictive() — Returns `VARCHAR[]` — every string in the trie that **starts with** the given prefix, capped at `max_results`
marisa_common_prefix() — Returns `VARCHAR[]` — every string in the trie that is **a prefix of** the search string, capped at `max_results`

Back to Extension Contents

Practical Examples

Cookbook

Real-world recipes and patterns for common use cases.

Build a trie

marisa_trie is an aggregate — feed it a column of strings, get one BLOB back. Build at ingest time, store the BLOB, query forever.

CREATE TABLE employees(name TEXT);
INSERT INTO employees VALUES
  ('Alice'),('Bob'),('Charlie'),('David'),('Eve'),('Frank'),
  ('Mallory'),('Megan'),('Oscar'),('Melissa');

CREATE TABLE employees_trie AS
  SELECT marisa_trie(name) AS trie FROM employees;

SELECT octet_length(trie) AS bytes FROM employees_trie;

Trie size scales with shared-prefix density rather than raw input bytes; the more prefix structure your strings share, the bigger the win versus a flat sorted list.

Membership lookup (spell-check, dictionary)

marisa_lookup returns BOOLEAN — does this exact string exist in the trie?

SELECT marisa_lookup(trie, 'Alice')   FROM employees_trie;  -- TRUE
SELECT marisa_lookup(trie, 'Unknown') FROM employees_trie;  -- FALSE

This is the spell-check / “is this term in our dictionary?” primitive. It also stands in for a hash set anywhere you’d otherwise pre-load one for IN (...)-style filtering.

Autocomplete (type-ahead UI)

marisa_predictive(trie, prefix, max_results) returns every entry in the trie whose prefix matches:

-- Names starting with 'Me'
SELECT marisa_predictive(trie, 'Me', 10) AS suggestions
FROM employees_trie;
-- ['Megan', 'Melissa']

Drop this into the backend of a type-ahead search box: one row, one BLOB, microsecond lookups. The max_results cap is a hard cutoff so a short prefix against a million-entry dictionary doesn’t return the whole table.

Longest-prefix match (routing tables)

marisa_common_prefix is the inverse of predictive — every string in the trie that is itself a prefix of the search input:

CREATE TABLE country_codes(code TEXT);
INSERT INTO country_codes VALUES ('U'), ('US'), ('USA');

CREATE TABLE country_codes_trie AS
  SELECT marisa_trie(code) AS trie FROM country_codes;

SELECT marisa_common_prefix(trie, 'USA', 10) AS matches
FROM country_codes_trie;
-- ['U', 'US', 'USA']

Useful for routing tables, IP-prefix labels, feature-flag rule lookups, and any “find every rule that matches this input” pattern.

Big static dictionary: English words

The classic MARISA workload — load a wordlist into a trie once, ship the BLOB inside a Parquet file:

-- Load any newline-delimited wordlist (e.g. /usr/share/dict/words on macOS/Linux)
CREATE TABLE words AS
  SELECT trim(column0) AS word
  FROM read_csv('/usr/share/dict/words', columns := {'column0': 'VARCHAR'});

CREATE TABLE words_trie AS
  SELECT marisa_trie(word) AS trie FROM words;

-- Spell-check
SELECT marisa_lookup(trie, 'duckling') FROM words_trie;        -- TRUE
SELECT marisa_lookup(trie, 'qwxzy')    FROM words_trie;        -- FALSE

-- Autocomplete
SELECT marisa_predictive(trie, 'duck', 8) FROM words_trie;
-- ['duck','duckbill','duckboard','ducker',...]

For ~470K English words the resulting BLOB lands around 800 KB — about a fifth the size of the source list, queryable directly with no decode step.

Persist a trie to Parquet

The trie is just a BLOB — Parquet handles it natively. Build once, write, ship anywhere a DuckDB lives:

COPY (SELECT trie FROM words_trie)
  TO 'words_trie.parquet' (FORMAT 'parquet');

-- On the other side
CREATE TABLE words_trie AS
  SELECT * FROM 'words_trie.parquet';

SELECT marisa_predictive(trie, 'rou', 5) FROM words_trie;

Loading is a pointer-walk — no parsing, no decompression. Useful for shipping a fixed dictionary alongside an application without standing up a separate index service.

One trie per partition / tenant

Materialize one trie per logical partition (day, tenant, region) at write time. The set of tries is tiny compared to the source data and serves all three lookup operations at query time:

CREATE TABLE name_tries AS
SELECT day,
       marisa_trie(name) AS trie
FROM signups
GROUP BY day;

-- Did a given name show up on a specific day?
SELECT day
FROM name_tries
WHERE marisa_lookup(trie, 'Megan');

-- All names that started with 'Me' on each day
SELECT day, marisa_predictive(trie, 'Me', 20) AS matches
FROM name_tries
ORDER BY day;

Updating a trie

There is no marisa_insert — to update, rebuild from the new source set:

-- Source set changed; rebuild the trie
CREATE OR REPLACE TABLE words_trie AS
  SELECT marisa_trie(word) AS trie FROM words;

For a dictionary that changes daily this is a millisecond-scale CTAS. For a billion-key set, schedule the rebuild offline and treat the trie as a build artifact.

When a trie vs a HashSet vs `bitfilters`

Trie — wins when the data has shared prefixes (URLs, names, dictionary words) or you need prefix / predictive queries. Memory: roughly 30–60% of the equivalent flat string list. Lookup: O(|key|), independent of dictionary size.
HashSet (DuckDB’s built-in IN) — wins for small, random, unrelated string sets where simplicity dominates.
bitfilters — wins for very large random key sets where you only need a yes/no answer and memory is the binding constraint. Doesn’t store the strings; can’t do prefix queries.

If you’re building autocomplete or have prefix-rich data, MARISA is the right structure. For pure membership tests on uncorrelated random strings, a Bloom / XOR / Binary Fuse filter is smaller. For evolving sets that need insert and delete, neither MARISA nor the static bitfilters variants apply — use a Quotient filter or a real index.

Platform Support

Compatibility

Extension availability may vary by platform and DuckDB version. Check below to ensure this extension supports your environment before installation.

Quick Facts

Software License MIT

Pricing Free

Written In C++

Source Available Yes

View on GitHub

Usage

13,933+

loads in last 30 days

Platforms

Linux

Linux (musl)

macOS

Windows

WASM

Supported platform architectures

Linux: aarch64, x86_64

Linux (musl): x86_64

macOS: Apple Silicon, Intel

Windows: x86_64

WASM: threads, mvp, eh

Compiled binary sizes

Platform	Architecture	Size
Linux	aarch64	2.72 MB
Linux	x86_64	3.09 MB
Linux (musl)	x86_64	2.36 MB
macOS	Apple Silicon	1.85 MB
macOS	Intel	2.16 MB
Windows	x86_64	7.54 MB
WASM	eh	17.9 KB
WASM	mvp	16.3 KB
WASM	threads	16.2 KB

Gzipped download size from the DuckDB community-extensions registry.

DuckDB Versions

Release calendar

Supported

v1.4.4 v1.5.2

Compact String Lookups in SQL

Install Marisa to build and query MARISA tries — fast, memory-tight string sets for autocomplete, spell-check, prefix routing, and dedupe.

Install Now View Other Extensions

Marisa

Install

Why Use a MARISA Trie from DuckDB?

🌳 What this extension is for

⚙️ How it works

🛡️ When to choose something else

🎯 Common use cases

Autocomplete backend in SQL

Spell-check shipped inside a Parquet file

Routing and longest-prefix-match tables

Compact join dictionary

Technical Details

What you can do with one query

What is MARISA?

Storage

Lookup operations

MARISA vs bitfilters

Pairings

Limitations

Extension Contents

Function Documentation

marisa_common_prefix

Signature

Parameters (Positional)

Returns

Description

Examples

Related Functions

marisa_lookup

Signature

Parameters (Positional)

Returns

Description

Examples

Related Functions

marisa_predictive

Signature

Parameters (Positional)

Returns

Description

Examples

Related Functions

marisa_trie

Signature

Parameters (Positional)

Returns

Description

Examples

Related Functions

Cookbook

Build a trie

Membership lookup (spell-check, dictionary)

Autocomplete (type-ahead UI)

Longest-prefix match (routing tables)

Big static dictionary: English words

Persist a trie to Parquet

One trie per partition / tenant

Updating a trie

When a trie vs a HashSet vs bitfilters

Compatibility

Quick Facts

Platforms

DuckDB Versions

Related Extensions

Bitfilters

Fuzzycomplete

RapidFuzz

Compact String Lookups in SQL

MARISA vs `bitfilters`

When a trie vs a HashSet vs `bitfilters`