RapidFuzz
Fuzzy string matching and edit-distance for DuckDB, powered by the RapidFuzz C++ library. Five ratio variants for record matching plus eight distance/similarity algorithms (Jaro, Jaro-Winkler, Hamming, Indel, Levenshtein/OSA, LCS, prefix/postfix) β every one in distance, similarity, and normalized forms.
Install
-- Install the extension
INSTALL rapidfuzz FROM community;
-- Load it into your session
LOAD rapidfuzz;
-- Headline ratio (0β100, like Python rapidfuzz.fuzz.ratio)
SELECT rapidfuzz_ratio('hello world', 'helo wrld'); -- 90.0
-- Match with reordered words
SELECT rapidfuzz_token_sort_ratio('world hello', 'hello world'); -- 100.0
-- Pick best match per row
SELECT a.id, b.id,
rapidfuzz_ratio(a.name, b.name) AS score
FROM left_records a, right_records b
WHERE rapidfuzz_ratio(a.name, b.name) > 85
ORDER BY a.id, score DESC; Technical Overview
Why Use RapidFuzz?
Dedup messy customer, address, or product data with one SQL query β or fuzzy-join two datasets that don't share clean keys. Wraps the RapidFuzz C++ library so Levenshtein, Jaro-Winkler, Hamming, OSA / Damerau-Levenshtein, LCS, and the full fuzz-style ratio family run inline as scalar functions over DuckDB columns.
π What this extension is for
Anywhere you have two strings and want a numeric answer to "how similar are they?" β record linkage across customer or vendor lists, dedup of free-text fields, spell-check against a dictionary, ranking of candidate matches.
- β’ Record linkage / fuzzy join: Two datasets that should match on
nameoraddressbut don't share clean keys: score every candidate pair withrapidfuzz_jaro_winkler_normalized_similarityorrapidfuzz_ratio, filter by threshold, keep the best per row withQUALIFY ROW_NUMBER(). - β’ Dedup messy free-text: Cluster near-duplicate rows in a customer or product table β pick a ratio variant that matches your data semantics (token-set for reordered or stuttering text, partial-ratio for fragments) and emit pairs above a threshold.
- β’ Spell-check / typo tolerance: Score an input against a known-good dictionary with
rapidfuzz_osa_distanceorrapidfuzz_jaro_winkler_normalized_similarityand pick the closest match. Handles single-character typos and adjacent-key swaps directly. - β’ Similarity ranking: Score a query string against every row in a small table and
ORDER BYthe score β the simplest "which of these is closest?" pattern, no separate search index.
βοΈ How it works
RapidFuzz is a C++ implementation of the same algorithms exposed by Python's rapidfuzz and the older fuzzywuzzy. This extension binds those algorithms as DuckDB scalar functions β every call is a vectorized in-process C++ kernel; no Python, no FFI, no per-row interpreter overhead.
- β’ Two function families: Ratio scores (
rapidfuzz_ratio,rapidfuzz_partial_ratio,rapidfuzz_token_sort_ratio,rapidfuzz_token_set_ratio,rapidfuzz_partial_token_set_ratio) return 0β100 similarity, matching Pythonrapidfuzz.fuzz. Distance algorithms (Jaro, Jaro-Winkler, Hamming, Indel, Levenshtein/OSA, LCS, prefix, postfix) each ship four variants β_distance,_similarity,_normalized_distance,_normalized_similarity. - β’ Pick distance vs. normalized: Raw
_distance/_similarityreturn integer edit counts β useful when the strings are the same length or you want a numeric threshold (e.g. "differ by β€ 2 characters"). The_normalized_*variants scale to[0, 1]and divide by the longer string's length, so they don't penalize length mismatches; reach for these for cross-pair comparisons. - β’ Vectorized over columns: Every function is scalar with
(VARCHAR, VARCHAR) -> DOUBLE. Use it inSELECT,WHERE,JOIN ... ON,ORDER BY,QUALIFYβ anywhere a scalar fits. DuckDB pipelines and multi-threads the call. - β’ Behavior parity with upstream: Outputs match the upstream Python library, so the RapidFuzz docs are an authoritative reference for algorithm semantics. The naming scheme is
rapidfuzz_<algo>[_normalized]_<distance|similarity>β predictable across all eight algorithm groups.
π‘οΈ Cost and scaling
Every comparison is cheap. Comparing every row to every other row is not β pairwise self-joins are O(nΒ²) and that's the limit you hit on real datasets.
- β’ Self-joins are O(nΒ²): 100k rows Γ 100k rows is 10 billion pairs β still tractable on a fast machine for the simpler ratios, but will dominate query time. 1M Γ 1M is not. Filter the pair space before scoring.
- β’ Use blocking to prune candidates: Blocking (a.k.a. indexing) is the standard record-linkage technique: only compare rows that share a cheap key. First letter, length bucket, normalized prefix, or a phonetic code (e.g. soundex via the
fuzzyextension) β join on the block, score on the strings. Reduces O(nΒ²) to O(nΒ·k) where k is the average block size. - β’ Pre-normalize once: Lowercase, strip punctuation, collapse whitespace before scoring β every per-row transformation you can hoist out of the join saves it from running on every pair. Pair with the
inflectorextension when you need consistent casing across both sides. - β’ Threshold inside the join: Push the similarity threshold into the
WHERE/ONclause so DuckDB can short-circuit. For top-1 per left row, layerQUALIFY ROW_NUMBER() OVER (PARTITION BY left_id ORDER BY score DESC) = 1.
π― Common Use Cases
Dedup a customer / vendor / product table
Self-join with a blocking key (length bucket, first letter, soundex), score with rapidfuzz_token_set_ratio or rapidfuzz_jaro_winkler_normalized_similarity, filter by threshold, walk the resulting pair graph to assign cluster IDs.
Fuzzy-join two datasets without a clean key
Score every candidate pair, take the best per row with QUALIFY. The classic record-linkage pattern, expressed as one SQL statement.
Match free-text input against a dictionary
Spell-correct a query against a known-good list β rapidfuzz_osa_distance β€ 2 captures most typing mistakes.
Rank candidates by similarity
ORDER BY rapidfuzz_ratio(query, candidate) DESC LIMIT N β the simplest "top-N closest" without standing up a search index.
Deep Dive
Technical Details
What you can do with one query
The magic moment β dedup a messy customer table without leaving SQL:
CREATE TABLE customer_dupes AS
SELECT a.id AS left_id,
b.id AS right_id,
rapidfuzz_token_set_ratio(a.name, b.name) AS score
FROM customers a
JOIN customers b
ON a.id < b.id
AND length(a.name) BETWEEN length(b.name) - 3 AND length(b.name) + 3
AND lower(left(a.name, 1)) = lower(left(b.name, 1))
WHERE rapidfuzz_token_set_ratio(a.name, b.name) > 88
ORDER BY score DESC;
Two predicates do the heavy lifting before any string comparison runs: a length-bucket filter and a same-first-letter filter. Only pairs that pass both get scored by rapidfuzz_token_set_ratio. Thatβs blocking β and itβs what makes fuzzy matching tractable at scale.
Every individual call is fast β the algorithms are vectorized C++ kernels from the upstream RapidFuzz library. Pairwise self-joins, on the other hand, are O(nΒ²): 100k rows comparing against themselves is 10 billion candidate pairs. Without a blocking strategy that prunes the pair space first, fuzzy matching does not scale past a few tens of thousands of rows.
This page treats blocking as a first-class concern, not an afterthought.
Architecture
This extension binds the RapidFuzz C++ library β the same engine behind Pythonβs rapidfuzz package and the spiritual successor to fuzzywuzzy β as a set of DuckDB scalar functions. Each function takes (VARCHAR, VARCHAR) and returns a DOUBLE. There is no extra state, no secrets, no ATTACH interface; calls are pure functions composable into any SELECT, WHERE, ON, or ORDER BY clause.
Two function families coexist:
- Ratio scores (
rapidfuzz_ratioand friends) return 0β100, mirroring the Pythonrapidfuzz.fuzzfamily. - Distance algorithms (Jaro, Jaro-Winkler, Hamming, Indel, Levenshtein / OSA, LCS, Prefix, Postfix) ship four variants each β
_distance,_similarity,_normalized_distance,_normalized_similarity. The naming is mechanical:rapidfuzz_<algo>[_normalized]_<distance|similarity>.
Behavior matches the upstream library closely enough that the RapidFuzz documentation is the authoritative reference for algorithm semantics.
Pick the right algorithm
| Situation | Function |
|---|---|
| One score, 0β100, sane default | rapidfuzz_ratio |
| One string is a fragment of the other | rapidfuzz_partial_ratio |
| Same words, different order | rapidfuzz_token_sort_ratio |
| Duplicated or stray words | rapidfuzz_token_set_ratio |
| Human names, typos at the end | rapidfuzz_jaro_winkler_normalized_similarity |
| Single-character typos / adjacent swaps | rapidfuzz_osa_distance |
| Equal-length identifiers (SKUs, hashes) | rapidfuzz_hamming_distance |
For cross-pair comparisons where strings differ in length, prefer the _normalized_* variants β raw distance penalizes longer strings just for being long. See the Wikipedia article on JaroβWinkler distance for the formal definitions of the name-matching variants.
Blocking strategies
The whole game on real datasets is reducing the candidate-pair space before scoring. A few patterns that compose cleanly:
Length bucket. Two strings that differ in length by more than k characters canβt have a Levenshtein distance β€ k. Filter on length(a) BETWEEN length(b) - k AND length(b) + k.
First-letter (or first-N) prefix. Cheap join key. Misses pairs where the first letter is itself the typo (rare for names, common for OCRβd text).
Length + first-letter combined. What the lead snippet uses; usually enough to make a 100k Γ 100k join finish in seconds.
Phonetic prefix (soundex / metaphone). When typos are phonetic (βSmythβ vs βSmithβ). DuckDBβs built-in fuzzy extension supplies the codes; group by code and only score within a group.
Token-based blocking. For multi-word strings, require at least one token in common (e.g. via a normalized array intersection). Excellent for company names where the legal suffix differs but the brand word is shared.
Compose blocking with the actual scoring step in a single SQL pass β the planner pushes the cheap predicates ahead of the expensive ratio call automatically.
Pre-normalize once, score many
Hoist deterministic transformations out of the join so they run O(n) instead of O(nΒ²):
WITH normalized AS (
SELECT id,
lower(regexp_replace(name, '[^a-z0-9 ]', '', 'g')) AS name_n,
length(name) AS len,
left(lower(name), 1) AS block_letter
FROM customers
)
SELECT a.id, b.id,
rapidfuzz_token_set_ratio(a.name_n, b.name_n) AS score
FROM normalized a
JOIN normalized b
ON a.id < b.id
AND a.block_letter = b.block_letter
AND a.len BETWEEN b.len - 3 AND b.len + 3
WHERE rapidfuzz_token_set_ratio(a.name_n, b.name_n) > 90;
When you also need consistent casing, pluralization, or struct-key rewrites, the sibling inflector extension covers the cleanup side of the same workflow β both sit in the matching-and-casing string-munging bucket.
Top-1 per left row
Record-linkage usually wants exactly one best match per left record. Use QUALIFY (DuckDBβs window-filter clause):
SELECT a.id AS left_id,
b.id AS right_id,
rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) AS score
FROM left_records a
JOIN right_records b
ON left(a.name, 1) = left(b.name, 1) -- blocking
WHERE rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) > 0.92
QUALIFY ROW_NUMBER() OVER (
PARTITION BY a.id
ORDER BY rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) DESC
) = 1;
When to reach for something else
- Trigram indexes / FTS. For a single query string against millions of rows, a trigram or full-text index returns top-N candidates in sub-linear time. Score the candidates with RapidFuzz, but donβt expect this extension to replace the index.
- Embeddings / vector search. If βsimilarityβ really means semantic similarity (paraphrase, translation), edit distance is the wrong tool. Reach for an embedding model and a vector index instead.
- Production fuzzy joins on >10M rows. Even with aggressive blocking, very large self-joins benefit from purpose-built record-linkage tooling (Splink, Dedupe). RapidFuzz is the right primitive for the inner-loop score; a record-linkage framework manages the candidate generation, blocking, and resolution around it.
Reference
Extension Contents
Quick reference to all available functions and settings organized by category.
| Name | Description | |
|---|---|---|
| Hamming Position-by-position character match count. Only meaningful when both strings have the same length. | ||
| rapidfuzz_hamming_distance() | Number of differing positions between two equal-length strings | |
| rapidfuzz_hamming_normalized_distance() | Number of differing positions between two equal-length strings | |
| rapidfuzz_hamming_normalized_similarity() | Number of differing positions between two equal-length strings | |
| rapidfuzz_hamming_similarity() | Number of differing positions between two equal-length strings | |
| Indel Insertion-and-deletion edit distance β no substitutions counted. | ||
| rapidfuzz_indel_distance() | Insertions+deletions to transform one string into another (no substitutions) | |
| rapidfuzz_indel_normalized_distance() | Insertions+deletions to transform one string into another (no substitutions) | |
| rapidfuzz_indel_normalized_similarity() | Insertions+deletions to transform one string into another (no substitutions) | |
| rapidfuzz_indel_similarity() | Insertions+deletions to transform one string into another (no substitutions) | |
| Jaro Jaro similarity β match-and-transposition based. Good for short strings and typos. | ||
| rapidfuzz_jaro_distance() | Jaro similarity β character matches and transpositions | |
| rapidfuzz_jaro_normalized_distance() | Jaro similarity β character matches and transpositions | |
| rapidfuzz_jaro_normalized_similarity() | Jaro similarity β character matches and transpositions | |
| rapidfuzz_jaro_similarity() | Jaro similarity β character matches and transpositions | |
| Jaro Winkler Jaro-Winkler β Jaro with extra weight on a shared prefix. Strong default for human names. | ||
| rapidfuzz_jaro_winkler_distance() | Jaro-Winkler β Jaro with extra weight on common prefixes | |
| rapidfuzz_jaro_winkler_normalized_distance() | Jaro-Winkler β Jaro with extra weight on common prefixes | |
| rapidfuzz_jaro_winkler_normalized_similarity() | Jaro-Winkler β Jaro with extra weight on common prefixes | |
| rapidfuzz_jaro_winkler_similarity() | Jaro-Winkler β Jaro with extra weight on common prefixes | |
| Lcs Seq Longest common subsequence β order-preserving but not adjacency-preserving. | ||
| rapidfuzz_lcs_seq_distance() | Longest common subsequence β preserves character order but not adjacency | |
| rapidfuzz_lcs_seq_normalized_distance() | Longest common subsequence β preserves character order but not adjacency | |
| rapidfuzz_lcs_seq_normalized_similarity() | Longest common subsequence β preserves character order but not adjacency | |
| rapidfuzz_lcs_seq_similarity() | Longest common subsequence β preserves character order but not adjacency | |
| Osa Optimal String Alignment (a.k.a. restricted Damerau-Levenshtein) β counts adjacent transpositions as a single edit. | ||
| rapidfuzz_osa_distance() | Optimal string alignment β Levenshtein with adjacent transpositions counted as one edit | |
| rapidfuzz_osa_normalized_distance() | Optimal string alignment β Levenshtein with adjacent transpositions counted as one edit | |
| rapidfuzz_osa_normalized_similarity() | Optimal string alignment β Levenshtein with adjacent transpositions counted as one edit | |
| rapidfuzz_osa_similarity() | Optimal string alignment β Levenshtein with adjacent transpositions counted as one edit | |
| Postfix Edit distance / similarity considering only the matching suffix portion. | ||
| rapidfuzz_postfix_distance() | Edit distance considering only matching suffixes | |
| rapidfuzz_postfix_normalized_distance() | Edit distance considering only matching suffixes | |
| rapidfuzz_postfix_normalized_similarity() | Edit distance considering only matching suffixes | |
| rapidfuzz_postfix_similarity() | Edit distance considering only matching suffixes | |
| Prefix Edit distance / similarity considering only the matching prefix portion of the strings. | ||
| rapidfuzz_prefix_distance() | Edit distance considering only matching prefixes | |
| rapidfuzz_prefix_normalized_distance() | Edit distance considering only matching prefixes | |
| rapidfuzz_prefix_normalized_similarity() | Edit distance considering only matching prefixes | |
| rapidfuzz_prefix_similarity() | Edit distance considering only matching prefixes | |
| Ratio Scores 0β100 similarity scores matching the Python | ||
| rapidfuzz_partial_ratio() | Best similarity score for any substring of the longer input against the shorter β useful when one string is a fragment of the other | |
| rapidfuzz_partial_token_set_ratio() | Token-set comparison with a partial-ratio fallback β best when one string is a subset/partial match of the other | |
| rapidfuzz_ratio() | Levenshtein-based similarity (0β100) | |
| rapidfuzz_token_set_ratio() | Compare token sets (deduped, unordered) | |
| rapidfuzz_token_sort_ratio() | Compare after sorting tokens (words) | |
API Reference
Function Documentation
Detailed documentation for each function including signatures, parameters, and examples.
rapidfuzz_hamming_distance
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Number of differing positions between two equal-length strings. Raw count of edits required.
Examples
Related Functions
- rapidfuzz_hamming_similarity() β Number of differing positions between two equal-length strings
- rapidfuzz_hamming_normalized_distance() β Number of differing positions between two equal-length strings
- rapidfuzz_hamming_normalized_similarity() β Number of differing positions between two equal-length strings
rapidfuzz_hamming_normalized_distance
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Number of differing positions between two equal-length strings. Edit distance scaled to [0, 1] β 0 means identical.
Examples
Related Functions
- rapidfuzz_hamming_distance() β Number of differing positions between two equal-length strings
- rapidfuzz_hamming_similarity() β Number of differing positions between two equal-length strings
- rapidfuzz_hamming_normalized_similarity() β Number of differing positions between two equal-length strings
rapidfuzz_hamming_normalized_similarity
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Number of differing positions between two equal-length strings. Similarity scaled to [0, 1] β 1 means identical.
Examples
Related Functions
- rapidfuzz_hamming_distance() β Number of differing positions between two equal-length strings
- rapidfuzz_hamming_similarity() β Number of differing positions between two equal-length strings
- rapidfuzz_hamming_normalized_distance() β Number of differing positions between two equal-length strings
rapidfuzz_hamming_similarity
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Number of differing positions between two equal-length strings. Raw count of matches.
Examples
Related Functions
- rapidfuzz_hamming_distance() β Number of differing positions between two equal-length strings
- rapidfuzz_hamming_normalized_distance() β Number of differing positions between two equal-length strings
- rapidfuzz_hamming_normalized_similarity() β Number of differing positions between two equal-length strings
rapidfuzz_indel_distance
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Insertions+deletions to transform one string into another (no substitutions). Raw count of edits required.
Examples
Related Functions
- rapidfuzz_indel_similarity() β Insertions+deletions to transform one string into another (no substitutions)
- rapidfuzz_indel_normalized_distance() β Insertions+deletions to transform one string into another (no substitutions)
- rapidfuzz_indel_normalized_similarity() β Insertions+deletions to transform one string into another (no substitutions)
rapidfuzz_indel_normalized_distance
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Insertions+deletions to transform one string into another (no substitutions). Edit distance scaled to [0, 1] β 0 means identical.
Examples
Related Functions
- rapidfuzz_indel_distance() β Insertions+deletions to transform one string into another (no substitutions)
- rapidfuzz_indel_similarity() β Insertions+deletions to transform one string into another (no substitutions)
- rapidfuzz_indel_normalized_similarity() β Insertions+deletions to transform one string into another (no substitutions)
rapidfuzz_indel_normalized_similarity
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Insertions+deletions to transform one string into another (no substitutions). Similarity scaled to [0, 1] β 1 means identical.
Examples
Related Functions
- rapidfuzz_indel_distance() β Insertions+deletions to transform one string into another (no substitutions)
- rapidfuzz_indel_similarity() β Insertions+deletions to transform one string into another (no substitutions)
- rapidfuzz_indel_normalized_distance() β Insertions+deletions to transform one string into another (no substitutions)
rapidfuzz_indel_similarity
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Insertions+deletions to transform one string into another (no substitutions). Raw count of matches.
Examples
Related Functions
- rapidfuzz_indel_distance() β Insertions+deletions to transform one string into another (no substitutions)
- rapidfuzz_indel_normalized_distance() β Insertions+deletions to transform one string into another (no substitutions)
- rapidfuzz_indel_normalized_similarity() β Insertions+deletions to transform one string into another (no substitutions)
rapidfuzz_jaro_distance
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Jaro similarity β character matches and transpositions. Good for short strings and typos. Raw count of edits required.
Examples
Related Functions
rapidfuzz_jaro_normalized_distance
Signature
Parameters (Positional)
| Parameter | Type | Mode | Description |
|---|---|---|---|
col0 | VARCHAR | Positional | |
col1 | VARCHAR | Positional |
Returns
Description
Jaro similarity β character matches and transpositions. Good for short strings and typos. Edit distance scaled to [0, 1] β 0 means identical.
Examples
Related Functions
rapidfuzz_jaro_normalized_similarity
Jaro similarity β character matches and transpositions.
rapidfuzz_jaro_similarity
Jaro similarity β character matches and transpositions.
rapidfuzz_jaro_winkler_distance
Jaro-Winkler β Jaro with extra weight on common prefixes.
rapidfuzz_jaro_winkler_normalized_distance
Jaro-Winkler β Jaro with extra weight on common prefixes.
rapidfuzz_jaro_winkler_normalized_similarity
Jaro-Winkler β Jaro with extra weight on common prefixes.
rapidfuzz_jaro_winkler_similarity
Jaro-Winkler β Jaro with extra weight on common prefixes.
rapidfuzz_lcs_seq_distance
Longest common subsequence β preserves character order but not adjacency.
rapidfuzz_lcs_seq_normalized_distance
Longest common subsequence β preserves character order but not adjacency.
rapidfuzz_lcs_seq_normalized_similarity
Longest common subsequence β preserves character order but not adjacency.
rapidfuzz_lcs_seq_similarity
Longest common subsequence β preserves character order but not adjacency.
rapidfuzz_osa_distance
Optimal string alignment β Levenshtein with adjacent transpositions counted as one edit.
rapidfuzz_osa_normalized_distance
Optimal string alignment β Levenshtein with adjacent transpositions counted as one edit.
rapidfuzz_osa_normalized_similarity
Optimal string alignment β Levenshtein with adjacent transpositions counted as one edit.
rapidfuzz_osa_similarity
Optimal string alignment β Levenshtein with adjacent transpositions counted as one edit.
rapidfuzz_partial_ratio
Best similarity score for any substring of the longer input against the shorter β useful when one string is a fragment of the other.
rapidfuzz_partial_token_set_ratio
Token-set comparison with a partial-ratio fallback β best when one string is a subset/partial match of the other.
rapidfuzz_postfix_distance
Edit distance considering only matching suffixes.
rapidfuzz_postfix_normalized_distance
Edit distance considering only matching suffixes.
rapidfuzz_postfix_normalized_similarity
Edit distance considering only matching suffixes.
rapidfuzz_postfix_similarity
Edit distance considering only matching suffixes.
rapidfuzz_prefix_distance
Edit distance considering only matching prefixes.
rapidfuzz_prefix_normalized_distance
Edit distance considering only matching prefixes.
rapidfuzz_prefix_normalized_similarity
Edit distance considering only matching prefixes.
rapidfuzz_prefix_similarity
Edit distance considering only matching prefixes.
rapidfuzz_ratio
Levenshtein-based similarity (0β100).
rapidfuzz_token_set_ratio
Compare token sets (deduped, unordered).
rapidfuzz_token_sort_ratio
Compare after sorting tokens (words).
Practical Examples
Cookbook
Real-world recipes and patterns for common use cases.
Quick reference
| Need | Pick |
|---|---|
| One score, 0β100, sane default | rapidfuzz_ratio |
| Match a fragment inside a longer string | rapidfuzz_partial_ratio |
| Words in different orders | rapidfuzz_token_sort_ratio |
| Tolerate duplicated words | rapidfuzz_token_set_ratio |
| Names / typos | rapidfuzz_jaro_winkler_normalized_similarity |
| Spelling distance with transpositions | rapidfuzz_osa_distance |
| Strings of identical length | rapidfuzz_hamming_distance |
Record linkage
Find the best match for each row in left against right, scored by Jaro-Winkler:
SELECT a.id AS left_id,
b.id AS right_id,
rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) AS sim
FROM left_records a,
right_records b
WHERE rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) > 0.92
ORDER BY a.id, sim DESC;
For the single best match per left row, wrap with QUALIFY:
SELECT *
FROM (
SELECT a.id AS left_id, b.id AS right_id,
rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) AS sim,
ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) DESC) AS rn
FROM left_records a, right_records b
)
WHERE rn = 1 AND sim > 0.92;
Dedupe
Cluster rows by similarity threshold:
WITH pairs AS (
SELECT a.id AS l, b.id AS r,
rapidfuzz_token_set_ratio(a.title, b.title) AS sim
FROM titles a, titles b
WHERE a.id < b.id
AND rapidfuzz_token_set_ratio(a.title, b.title) > 90
)
SELECT * FROM pairs ORDER BY sim DESC;
Use the pair list to assign cluster IDs (graph-walk in Python or with a recursive CTE).
Spell check / typo tolerance
For a known-good dictionary, score a candidate against every entry and pick the closest:
SELECT word
FROM dictionary
WHERE rapidfuzz_osa_distance(word, :input) <= 2
ORDER BY rapidfuzz_osa_distance(word, :input)
LIMIT 5;
OSA (Optimal String Alignment) handles single-character typos and adjacent transpositions naturally.
Ratio choice β visual guide
input1: 'New York City'
input2: 'new york new york city'
rapidfuzz_ratio β 56 (different overall)
rapidfuzz_partial_ratio β 100 (one is a substring)
rapidfuzz_token_sort_ratio β 78 (sorted tokens differ in count)
rapidfuzz_token_set_ratio β 100 (token sets identical)
rapidfuzz_partial_token_set_ratio β 100 (sets identical, partial fallback)
The differences matter β pick the one that matches your data semantics. For names with reordered words, token_sort_ratio is usually right. For names with stray duplicates, token_set_ratio. For substring matching, partial_ratio.
Algorithm function naming
Each of the 8 algorithms exposes 4 variants:
rapidfuzz_<algo>_distance(a, b)β raw edit count.rapidfuzz_<algo>_similarity(a, b)β raw match count.rapidfuzz_<algo>_normalized_distance(a, b)β[0, 1], 0 = identical.rapidfuzz_<algo>_normalized_similarity(a, b)β[0, 1], 1 = identical.
For comparisons across pairs of unequal length, the normalized_* variants are usually what you want β they donβt punish longer strings.
Platform Support
Compatibility
Extension availability may vary by platform and DuckDB version. Check below to ensure this extension supports your environment before installation.
Quick Facts
Platforms
Supported platform architectures
Compiled binary sizes
| Platform | Architecture | Size |
|---|---|---|
| Linux | aarch64 | 2.73 MB |
| Linux | x86_64 | 3.10 MB |
| Linux (musl) | x86_64 | 2.36 MB |
| macOS | Apple Silicon | 1.87 MB |
| macOS | Intel | 2.17 MB |
| Windows | x86_64 | 7.55 MB |
| WASM | eh | 35.2 KB |
| WASM | mvp | 36.3 KB |
| WASM | threads | 36.3 KB |
Gzipped download size from the DuckDB community-extensions registry.
High-Performance Fuzzy Matching
Install RapidFuzz for record linkage, dedupe, spell-check, and similarity ranking β Jaro/Jaro-Winkler, Levenshtein, LCS, and the full token_*_ratio family from the Python rapidfuzz library.