Marisa
Static, space-efficient MARISA trie data structures for DuckDB. Build a compact searchable string set as a single BLOB, then run lookups, prefix searches, and predictive completions against it. The lookup primitive for autocomplete, spell-check, and routing tables.
Install
-- Install the extension
INSTALL marisa FROM community;
-- Load it into your session
LOAD marisa;
-- Build a trie from a column
CREATE TABLE employees(name TEXT);
INSERT INTO employees VALUES ('Alice'),('Bob'),('Megan'),('Melissa');
CREATE TABLE employees_trie AS
SELECT marisa_trie(name) AS trie FROM employees;
-- Membership test
SELECT marisa_lookup(trie, 'Alice') FROM employees_trie;
-- Autocomplete: names starting with 'Me'
SELECT marisa_predictive(trie, 'Me', 10) FROM employees_trie; Technical Overview
Why Use a MARISA Trie from DuckDB?
Pack a column of strings into a single BLOB and answer membership, autocomplete, and longest-prefix questions against it β in time proportional to the search key, not the dictionary size. Pairs well with bitfilters: both answer set-membership questions, but a MARISA trie keeps the strings and supports prefix queries.
π³ What this extension is for
Big static string sets β wordlists, place names, route prefixes, autocomplete dictionaries β compiled into one self-contained BLOB you can store in any DuckDB column, persist to Parquet, and ship anywhere DuckDB runs.
- β’ Autocomplete and type-ahead: Microsecond "starts with" lookups against a single BLOB. Drop a search box on a static corpus without standing up Elasticsearch.
- β’ Spell-check and dictionary membership: Exact "is this string in the set?" tests, smaller than a hash set and prefix-aware in a way
bitfilterscan't be. - β’ Longest-prefix-match routing: Find every dictionary string that is itself a prefix of the input β the primitive behind URL routers, CIDR label tables, and rule-matching lookups.
- β’ A compact join side: One trie per partition / day / tenant ships dramatically smaller than the equivalent string list, and serves membership and prefix queries from the same blob.
βοΈ How it works
MARISA is a succinct static Patricia trie maintained as s-yata/marisa-trie; this extension is a thin DuckDB wrapper.
- β’ One aggregate produces one BLOB: Aggregate a
VARCHARcolumn once at write time and you get a single immutable BLOB. Store it anywhere DuckDB stores BLOBs β table column, Parquet file, S3 object. - β’ Lookup cost scales with the key, not the dictionary: The first query is as fast as the millionth, whether the trie holds a thousand strings or ten million.
- β’ Prefix-rich data compresses sharply: URLs, paths, and dictionary words share heavy prefix structure β a 470K-entry English wordlist fits in roughly 800 KB, about a fifth the size of the same words as a flat sorted list.
π‘οΈ When to choose something else
MARISA is a sharp, narrow tool. Knowing where it isn't the right fit is half the value.
- β’ Static β no incremental insert or delete: Once built, the trie is immutable. Updates mean rebuilding from the source set.
- β’ String keys only:
VARCHARin,VARCHARout. Numeric keys need to be encoded as text upstream or stored in a different index. - β’ Unrelated random strings: pick a Bloom: Without shared prefixes, the compression edge disappears. For pure membership on uncorrelated keys,
bitfiltersis smaller and just as fast β but it can't return the strings or answer prefix queries.
π― Common use cases
Autocomplete backend in SQL
Build the trie once at ingest time, store the BLOB, and serve type-ahead requests directly from DuckDB. No separate search service for static dictionaries.
Spell-check shipped inside a Parquet file
A wordlist or term list compiled into a single BLOB column. Pair with fuzzycomplete or rapidfuzz for fuzzy fallbacks when exact lookup misses.
Routing and longest-prefix-match tables
URL paths, feature-flag rules, geo / CIDR labels β anywhere you ask "which rules match this input?"
Compact join dictionary
One trie per partition or per day. The BLOB ships smaller than the underlying string list and serves prefix and membership queries from the same blob.
Deep Dive
Technical Details
What you can do with one query
Pack a column of strings into a single, self-contained BLOB β then run membership, predictive, and longest-prefix-match lookups against it forever:
-- Build once at ingest time
CREATE TABLE words_trie AS
SELECT marisa_trie(word) AS trie FROM dictionary;
-- Three lookups against the same BLOB
SELECT
marisa_lookup(trie, 'duckdb') AS exact, -- BOOLEAN
marisa_predictive(trie, 'duck', 10) AS completions, -- VARCHAR[]
marisa_common_prefix(trie, 'duckdbtools', 10) AS prefixes_of -- VARCHAR[]
FROM words_trie;
marisa_trie is the only constructor β an aggregate that emits a single BLOB. The three query primitives (marisa_lookup, marisa_predictive, marisa_common_prefix) read the BLOB directly. No decode step, no warm-up; the first query is as fast as the millionth.
A MARISA trie is built once and queried many times. There is no marisa_insert or marisa_delete β to add or remove keys, rebuild the trie from the updated source set. For a 470K-entry English wordlist thatβs milliseconds; for billions of keys, plan an offline rebuild step.
The static design is what makes the structure succinct: keys are packed near the information-theoretic minimum, the BLOB is mmap-ready on load, and lookups donβt pay any synchronization cost. If your set churns continuously, a hash set or a bitfilters Quotient filter is a better fit.
What is MARISA?
MARISA β Matching Algorithm with Recursively Implemented StorAge β is a static, succinct Patricia / radix trie. The reference C++ implementation is s-yata/marisa-trie, maintained for over a decade. This extension is a thin DuckDB wrapper around that library.
A trie is the right data structure when:
- The strings have shared prefixes β URLs, paths, names, dictionary words. Prefix-rich data compresses dramatically.
- You need prefix queries β autocomplete, longest-prefix match, IP-CIDR routing β that hash sets and Bloom filters canβt answer at all.
- You want bounded lookup time β O(|key|), independent of dictionary size.
For random unrelated strings without prefix structure, a hash set or a Bloom / XOR / Binary Fuse filter from bitfilters will be smaller and faster.
Storage
Each marisa_trie call produces one BLOB. The BLOB is self-contained and portable β write it to a column, persist in Parquet, ship to another DuckDB instance. Loading is essentially a pointer-walk; there is no parsing or decompression step at query time.
A typical English-word trie of ~470K entries weighs in around 800 KB β roughly a fifth the size of the same words as a flat sorted list, before any column compression.
Lookup operations
| Operation | Function | Returns |
|---|---|---|
| Exact membership | marisa_lookup | BOOLEAN |
| Predictive (autocomplete) | marisa_predictive | VARCHAR[] |
| Longest-prefix-match | marisa_common_prefix | VARCHAR[] |
All three run against the BLOB directly. The first query is as fast as the millionth.
MARISA vs bitfilters
Both extensions answer set-membership questions; they sit in the same Indexing sub-bucket on the catalog page for that reason. The mechanics are different in ways that matter:
marisa | bitfilters | |
|---|---|---|
| Stores the strings? | Yes β keys are recoverable | No β only a probabilistic fingerprint |
| False positives | None β answers are exact | Bounded (e.g. ~0.4% for xor8) |
| Prefix / predictive queries | Yes (marisa_predictive, marisa_common_prefix) | No |
| Best when | Prefix-rich strings, autocomplete | Random keys, smallest possible memory |
| Mutation | Static β rebuild to update | Static (XOR / Fuse / Bloom) or dynamic (Quotient) |
Use MARISA when you need the strings back or you need prefix queries. Use bitfilters when you only need a yes/no on uncorrelated keys and memory is the binding constraint.
Pairings
- Build at ingest, query at runtime. Materialize one trie per user / partition / day at write time, then query at read time. The static design is a feature here β the trie wonβt change underneath you.
- Pair with
fuzzycompleteorrapidfuzzfor fuzzy + prefix lookup pipelines: exact prefix via MARISA, fuzzy fallback via the others. - Use as the dictionary side of a join β much smaller wire size than transferring the underlying string list.
Limitations
- Static β no incremental insert or delete. Rebuild the trie from the updated source set.
- String-only. No numeric or composite keys.
- Lexicographic order only. If you need a different ordering at query time, build separate tries.
Reference
Extension Contents
Quick reference to all available functions and settings organized by category.
| Name | Description | |
|---|---|---|
| Build The aggregate that turns a column of strings into a single MARISA trie BLOB. Build once at ingest time, store the BLOB in any column type, query repeatedly. Updates require a rebuild β there is no incremental insert. | ||
| marisa_trie() | Aggregate that builds a MARISA trie from a column of strings | |
| Query Membership tests, autocomplete (predictive prefix), and longest-prefix-match against an existing trie BLOB. All three primitives run in O(|key|) β lookup work is independent of dictionary size. | ||
| marisa_common_prefix() | Returns VARCHAR[] β every string in the trie that is a prefix of the search string, capped at max_results | |
| marisa_lookup() | Returns BOOLEAN β true iff the search string is in the trie | |
| marisa_predictive() | Returns VARCHAR[] β every string in the trie that starts with the given prefix, capped at max_results | |
API Reference
Function Documentation
Detailed documentation for each function including signatures, parameters, and examples.
marisa_common_prefix
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | BLOB | Positional | |
col1 | VARCHAR | Positional | |
col2 | INTEGER | Positional |
Returns
Description
Returns VARCHAR[] β every string in the trie that is a prefix of the search string, capped at max_results. The inverse of marisa_predictive: useful for routing tables, IP / CIDR label lookups, and longest-prefix-match style queries ("which rules match this input?").
Examples
Find every country-code prefix that matches an input
SELECT marisa_common_prefix(trie, 'USA', 10) AS matches
FROM country_codes_trie;
-- ['U', 'US', 'USA'] Related Functions
- marisa_predictive() β Returns `VARCHAR[]` β every string in the trie that **starts with** the given prefix, capped at `max_results`
- marisa_lookup() β Returns `BOOLEAN` β `true` iff the search string is in the trie
- marisa_trie() β Aggregate that builds a MARISA trie from a column of strings
marisa_lookup
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | BLOB | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Returns BOOLEAN β true iff the search string is in the trie. The fastest membership test in this extension; an exact, allocation-free yes/no against the BLOB. Use it for spell-check, dictionary membership, or anywhere a hash set would otherwise live.
Examples
Spell-check against an English wordlist trie
SELECT marisa_lookup(trie, 'duckling') AS in_dictionary
FROM words_trie; Related Functions
- marisa_predictive() β Returns `VARCHAR[]` β every string in the trie that **starts with** the given prefix, capped at `max_results`
- marisa_common_prefix() β Returns `VARCHAR[]` β every string in the trie that is **a prefix of** the search string, capped at `max_results`
- marisa_trie() β Aggregate that builds a MARISA trie from a column of strings
marisa_predictive
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | BLOB | Positional | |
col1 | VARCHAR | Positional | |
col2 | INTEGER | Positional |
Returns
Description
Returns VARCHAR[] β every string in the trie that starts with the given prefix, capped at max_results. The autocomplete primitive: feed it the user's partial input, get back the candidate completions. The cap is a hard cutoff so a short prefix on a million-entry dictionary doesn't return everything.
Examples
Type-ahead suggestions for a search box
SELECT marisa_predictive(trie, 'duck', 10) AS suggestions
FROM words_trie; Related Functions
- marisa_lookup() β Returns `BOOLEAN` β `true` iff the search string is in the trie
- marisa_common_prefix() β Returns `VARCHAR[]` β every string in the trie that is **a prefix of** the search string, capped at `max_results`
- marisa_trie() β Aggregate that builds a MARISA trie from a column of strings
marisa_trie
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional |
Returns
Description
Aggregate that builds a MARISA trie from a column of strings. Returns a single self-contained BLOB you can store in any column type, persist in Parquet, or ship to another DuckDB instance. The trie is static β to add or remove keys, rebuild from the updated source set.
Examples
Build a trie from a column at ingest time
CREATE TABLE employees_trie AS
SELECT marisa_trie(name) AS trie FROM employees; One trie per logical partition (day, tenant, region)
CREATE TABLE name_tries AS
SELECT day,
marisa_trie(name) AS trie
FROM signups
GROUP BY day; Related Functions
- marisa_lookup() β Returns `BOOLEAN` β `true` iff the search string is in the trie
- marisa_predictive() β Returns `VARCHAR[]` β every string in the trie that **starts with** the given prefix, capped at `max_results`
- marisa_common_prefix() β Returns `VARCHAR[]` β every string in the trie that is **a prefix of** the search string, capped at `max_results`
Practical Examples
Cookbook
Real-world recipes and patterns for common use cases.
Build a trie
marisa_trie is an aggregate β feed it a column of strings, get one BLOB back. Build at ingest time, store the BLOB, query forever.
CREATE TABLE employees(name TEXT);
INSERT INTO employees VALUES
('Alice'),('Bob'),('Charlie'),('David'),('Eve'),('Frank'),
('Mallory'),('Megan'),('Oscar'),('Melissa');
CREATE TABLE employees_trie AS
SELECT marisa_trie(name) AS trie FROM employees;
SELECT octet_length(trie) AS bytes FROM employees_trie;
Trie size scales with shared-prefix density rather than raw input bytes; the more prefix structure your strings share, the bigger the win versus a flat sorted list.
Membership lookup (spell-check, dictionary)
marisa_lookup returns BOOLEAN β does this exact string exist in the trie?
SELECT marisa_lookup(trie, 'Alice') FROM employees_trie; -- TRUE
SELECT marisa_lookup(trie, 'Unknown') FROM employees_trie; -- FALSE
This is the spell-check / βis this term in our dictionary?β primitive. It also stands in for a hash set anywhere youβd otherwise pre-load one for IN (...)-style filtering.
Autocomplete (type-ahead UI)
marisa_predictive(trie, prefix, max_results) returns every entry in the trie whose prefix matches:
-- Names starting with 'Me'
SELECT marisa_predictive(trie, 'Me', 10) AS suggestions
FROM employees_trie;
-- ['Megan', 'Melissa']
Drop this into the backend of a type-ahead search box: one row, one BLOB, microsecond lookups. The max_results cap is a hard cutoff so a short prefix against a million-entry dictionary doesnβt return the whole table.
Longest-prefix match (routing tables)
marisa_common_prefix is the inverse of predictive β every string in the trie that is itself a prefix of the search input:
CREATE TABLE country_codes(code TEXT);
INSERT INTO country_codes VALUES ('U'), ('US'), ('USA');
CREATE TABLE country_codes_trie AS
SELECT marisa_trie(code) AS trie FROM country_codes;
SELECT marisa_common_prefix(trie, 'USA', 10) AS matches
FROM country_codes_trie;
-- ['U', 'US', 'USA']
Useful for routing tables, IP-prefix labels, feature-flag rule lookups, and any βfind every rule that matches this inputβ pattern.
Big static dictionary: English words
The classic MARISA workload β load a wordlist into a trie once, ship the BLOB inside a Parquet file:
-- Load any newline-delimited wordlist (e.g. /usr/share/dict/words on macOS/Linux)
CREATE TABLE words AS
SELECT trim(column0) AS word
FROM read_csv('/usr/share/dict/words', columns := {'column0': 'VARCHAR'});
CREATE TABLE words_trie AS
SELECT marisa_trie(word) AS trie FROM words;
-- Spell-check
SELECT marisa_lookup(trie, 'duckling') FROM words_trie; -- TRUE
SELECT marisa_lookup(trie, 'qwxzy') FROM words_trie; -- FALSE
-- Autocomplete
SELECT marisa_predictive(trie, 'duck', 8) FROM words_trie;
-- ['duck','duckbill','duckboard','ducker',...]
For ~470K English words the resulting BLOB lands around 800 KB β about a fifth the size of the source list, queryable directly with no decode step.
Persist a trie to Parquet
The trie is just a BLOB β Parquet handles it natively. Build once, write, ship anywhere a DuckDB lives:
COPY (SELECT trie FROM words_trie)
TO 'words_trie.parquet' (FORMAT 'parquet');
-- On the other side
CREATE TABLE words_trie AS
SELECT * FROM 'words_trie.parquet';
SELECT marisa_predictive(trie, 'rou', 5) FROM words_trie;
Loading is a pointer-walk β no parsing, no decompression. Useful for shipping a fixed dictionary alongside an application without standing up a separate index service.
One trie per partition / tenant
Materialize one trie per logical partition (day, tenant, region) at write time. The set of tries is tiny compared to the source data and serves all three lookup operations at query time:
CREATE TABLE name_tries AS
SELECT day,
marisa_trie(name) AS trie
FROM signups
GROUP BY day;
-- Did a given name show up on a specific day?
SELECT day
FROM name_tries
WHERE marisa_lookup(trie, 'Megan');
-- All names that started with 'Me' on each day
SELECT day, marisa_predictive(trie, 'Me', 20) AS matches
FROM name_tries
ORDER BY day;
Updating a trie
There is no marisa_insert β to update, rebuild from the new source set:
-- Source set changed; rebuild the trie
CREATE OR REPLACE TABLE words_trie AS
SELECT marisa_trie(word) AS trie FROM words;
For a dictionary that changes daily this is a millisecond-scale CTAS. For a billion-key set, schedule the rebuild offline and treat the trie as a build artifact.
When a trie vs a HashSet vs bitfilters
- Trie β wins when the data has shared prefixes (URLs, names, dictionary words) or you need prefix / predictive queries. Memory: roughly 30β60% of the equivalent flat string list. Lookup: O(|key|), independent of dictionary size.
- HashSet (DuckDBβs built-in
IN) β wins for small, random, unrelated string sets where simplicity dominates. bitfiltersβ wins for very large random key sets where you only need a yes/no answer and memory is the binding constraint. Doesnβt store the strings; canβt do prefix queries.
If youβre building autocomplete or have prefix-rich data, MARISA is the right structure. For pure membership tests on uncorrelated random strings, a Bloom / XOR / Binary Fuse filter is smaller. For evolving sets that need insert and delete, neither MARISA nor the static bitfilters variants apply β use a Quotient filter or a real index.
Platform Support
Compatibility
Extension availability may vary by platform and DuckDB version. Check below to ensure this extension supports your environment before installation.
Quick Facts
Platforms
Supported platform architectures
Compiled binary sizes
| Platform | Architecture | Size |
|---|---|---|
| Linux | aarch64 | 2.72 MB |
| Linux | x86_64 | 3.09 MB |
| Linux (musl) | x86_64 | 2.36 MB |
| macOS | Apple Silicon | 1.85 MB |
| macOS | Intel | 2.16 MB |
| Windows | x86_64 | 7.54 MB |
| WASM | eh | 17.9 KB |
| WASM | mvp | 16.3 KB |
| WASM | threads | 16.2 KB |
Gzipped download size from the DuckDB community-extensions registry.
Compact String Lookups in SQL
Install Marisa to build and query MARISA tries β fast, memory-tight string sets for autocomplete, spell-check, prefix routing, and dedupe.