πŸ”

RapidFuzz

Fuzzy string matching and edit-distance for DuckDB, powered by the RapidFuzz C++ library. Five ratio variants for record matching plus eight distance/similarity algorithms (Jaro, Jaro-Winkler, Hamming, Indel, Levenshtein/OSA, LCS, prefix/postfix) β€” every one in distance, similarity, and normalized forms.

Install

-- Install the extension
INSTALL rapidfuzz FROM community;

-- Load it into your session
LOAD rapidfuzz;

-- Headline ratio (0–100, like Python rapidfuzz.fuzz.ratio)
SELECT rapidfuzz_ratio('hello world', 'helo wrld');  -- 90.0

-- Match with reordered words
SELECT rapidfuzz_token_sort_ratio('world hello', 'hello world');  -- 100.0

-- Pick best match per row
SELECT a.id, b.id,
       rapidfuzz_ratio(a.name, b.name) AS score
FROM left_records a, right_records b
WHERE rapidfuzz_ratio(a.name, b.name) > 85
ORDER BY a.id, score DESC;

Technical Overview

Why Use RapidFuzz?

Dedup messy customer, address, or product data with one SQL query β€” or fuzzy-join two datasets that don't share clean keys. Wraps the RapidFuzz C++ library so Levenshtein, Jaro-Winkler, Hamming, OSA / Damerau-Levenshtein, LCS, and the full fuzz-style ratio family run inline as scalar functions over DuckDB columns.

πŸ” What this extension is for

Anywhere you have two strings and want a numeric answer to "how similar are they?" β€” record linkage across customer or vendor lists, dedup of free-text fields, spell-check against a dictionary, ranking of candidate matches.

  • β€’ Record linkage / fuzzy join: Two datasets that should match on name or address but don't share clean keys: score every candidate pair with rapidfuzz_jaro_winkler_normalized_similarity or rapidfuzz_ratio, filter by threshold, keep the best per row with QUALIFY ROW_NUMBER().
  • β€’ Dedup messy free-text: Cluster near-duplicate rows in a customer or product table β€” pick a ratio variant that matches your data semantics (token-set for reordered or stuttering text, partial-ratio for fragments) and emit pairs above a threshold.
  • β€’ Spell-check / typo tolerance: Score an input against a known-good dictionary with rapidfuzz_osa_distance or rapidfuzz_jaro_winkler_normalized_similarity and pick the closest match. Handles single-character typos and adjacent-key swaps directly.
  • β€’ Similarity ranking: Score a query string against every row in a small table and ORDER BY the score β€” the simplest "which of these is closest?" pattern, no separate search index.

βš™οΈ How it works

RapidFuzz is a C++ implementation of the same algorithms exposed by Python's rapidfuzz and the older fuzzywuzzy. This extension binds those algorithms as DuckDB scalar functions β€” every call is a vectorized in-process C++ kernel; no Python, no FFI, no per-row interpreter overhead.

  • β€’ Two function families: Ratio scores (rapidfuzz_ratio, rapidfuzz_partial_ratio, rapidfuzz_token_sort_ratio, rapidfuzz_token_set_ratio, rapidfuzz_partial_token_set_ratio) return 0–100 similarity, matching Python rapidfuzz.fuzz. Distance algorithms (Jaro, Jaro-Winkler, Hamming, Indel, Levenshtein/OSA, LCS, prefix, postfix) each ship four variants β€” _distance, _similarity, _normalized_distance, _normalized_similarity.
  • β€’ Pick distance vs. normalized: Raw _distance / _similarity return integer edit counts β€” useful when the strings are the same length or you want a numeric threshold (e.g. "differ by ≀ 2 characters"). The _normalized_* variants scale to [0, 1] and divide by the longer string's length, so they don't penalize length mismatches; reach for these for cross-pair comparisons.
  • β€’ Vectorized over columns: Every function is scalar with (VARCHAR, VARCHAR) -> DOUBLE. Use it in SELECT, WHERE, JOIN ... ON, ORDER BY, QUALIFY β€” anywhere a scalar fits. DuckDB pipelines and multi-threads the call.
  • β€’ Behavior parity with upstream: Outputs match the upstream Python library, so the RapidFuzz docs are an authoritative reference for algorithm semantics. The naming scheme is rapidfuzz_<algo>[_normalized]_<distance|similarity> β€” predictable across all eight algorithm groups.

πŸ›‘οΈ Cost and scaling

Every comparison is cheap. Comparing every row to every other row is not β€” pairwise self-joins are O(nΒ²) and that's the limit you hit on real datasets.

  • β€’ Self-joins are O(nΒ²): 100k rows Γ— 100k rows is 10 billion pairs β€” still tractable on a fast machine for the simpler ratios, but will dominate query time. 1M Γ— 1M is not. Filter the pair space before scoring.
  • β€’ Use blocking to prune candidates: Blocking (a.k.a. indexing) is the standard record-linkage technique: only compare rows that share a cheap key. First letter, length bucket, normalized prefix, or a phonetic code (e.g. soundex via the fuzzy extension) β€” join on the block, score on the strings. Reduces O(nΒ²) to O(nΒ·k) where k is the average block size.
  • β€’ Pre-normalize once: Lowercase, strip punctuation, collapse whitespace before scoring β€” every per-row transformation you can hoist out of the join saves it from running on every pair. Pair with the inflector extension when you need consistent casing across both sides.
  • β€’ Threshold inside the join: Push the similarity threshold into the WHERE/ON clause so DuckDB can short-circuit. For top-1 per left row, layer QUALIFY ROW_NUMBER() OVER (PARTITION BY left_id ORDER BY score DESC) = 1.

🎯 Common Use Cases

Dedup a customer / vendor / product table

Self-join with a blocking key (length bucket, first letter, soundex), score with rapidfuzz_token_set_ratio or rapidfuzz_jaro_winkler_normalized_similarity, filter by threshold, walk the resulting pair graph to assign cluster IDs.

Fuzzy-join two datasets without a clean key

Score every candidate pair, take the best per row with QUALIFY. The classic record-linkage pattern, expressed as one SQL statement.

Match free-text input against a dictionary

Spell-correct a query against a known-good list β€” rapidfuzz_osa_distance ≀ 2 captures most typing mistakes.

Rank candidates by similarity

ORDER BY rapidfuzz_ratio(query, candidate) DESC LIMIT N β€” the simplest "top-N closest" without standing up a search index.

Deep Dive

Technical Details

What you can do with one query

The magic moment β€” dedup a messy customer table without leaving SQL:

CREATE TABLE customer_dupes AS
SELECT a.id AS left_id,
       b.id AS right_id,
       rapidfuzz_token_set_ratio(a.name, b.name) AS score
FROM customers a
JOIN customers b
  ON a.id < b.id
 AND length(a.name) BETWEEN length(b.name) - 3 AND length(b.name) + 3
 AND lower(left(a.name, 1)) = lower(left(b.name, 1))
WHERE rapidfuzz_token_set_ratio(a.name, b.name) > 88
ORDER BY score DESC;

Two predicates do the heavy lifting before any string comparison runs: a length-bucket filter and a same-first-letter filter. Only pairs that pass both get scored by rapidfuzz_token_set_ratio. That’s blocking β€” and it’s what makes fuzzy matching tractable at scale.

Be honest about cost

Every individual call is fast β€” the algorithms are vectorized C++ kernels from the upstream RapidFuzz library. Pairwise self-joins, on the other hand, are O(nΒ²): 100k rows comparing against themselves is 10 billion candidate pairs. Without a blocking strategy that prunes the pair space first, fuzzy matching does not scale past a few tens of thousands of rows.

This page treats blocking as a first-class concern, not an afterthought.

Architecture

This extension binds the RapidFuzz C++ library β€” the same engine behind Python’s rapidfuzz package and the spiritual successor to fuzzywuzzy β€” as a set of DuckDB scalar functions. Each function takes (VARCHAR, VARCHAR) and returns a DOUBLE. There is no extra state, no secrets, no ATTACH interface; calls are pure functions composable into any SELECT, WHERE, ON, or ORDER BY clause.

Two function families coexist:

  • Ratio scores (rapidfuzz_ratio and friends) return 0–100, mirroring the Python rapidfuzz.fuzz family.
  • Distance algorithms (Jaro, Jaro-Winkler, Hamming, Indel, Levenshtein / OSA, LCS, Prefix, Postfix) ship four variants each β€” _distance, _similarity, _normalized_distance, _normalized_similarity. The naming is mechanical: rapidfuzz_<algo>[_normalized]_<distance|similarity>.

Behavior matches the upstream library closely enough that the RapidFuzz documentation is the authoritative reference for algorithm semantics.

Pick the right algorithm

SituationFunction
One score, 0–100, sane defaultrapidfuzz_ratio
One string is a fragment of the otherrapidfuzz_partial_ratio
Same words, different orderrapidfuzz_token_sort_ratio
Duplicated or stray wordsrapidfuzz_token_set_ratio
Human names, typos at the endrapidfuzz_jaro_winkler_normalized_similarity
Single-character typos / adjacent swapsrapidfuzz_osa_distance
Equal-length identifiers (SKUs, hashes)rapidfuzz_hamming_distance

For cross-pair comparisons where strings differ in length, prefer the _normalized_* variants β€” raw distance penalizes longer strings just for being long. See the Wikipedia article on Jaro–Winkler distance for the formal definitions of the name-matching variants.

Blocking strategies

The whole game on real datasets is reducing the candidate-pair space before scoring. A few patterns that compose cleanly:

Length bucket. Two strings that differ in length by more than k characters can’t have a Levenshtein distance ≀ k. Filter on length(a) BETWEEN length(b) - k AND length(b) + k.

First-letter (or first-N) prefix. Cheap join key. Misses pairs where the first letter is itself the typo (rare for names, common for OCR’d text).

Length + first-letter combined. What the lead snippet uses; usually enough to make a 100k Γ— 100k join finish in seconds.

Phonetic prefix (soundex / metaphone). When typos are phonetic (β€œSmyth” vs β€œSmith”). DuckDB’s built-in fuzzy extension supplies the codes; group by code and only score within a group.

Token-based blocking. For multi-word strings, require at least one token in common (e.g. via a normalized array intersection). Excellent for company names where the legal suffix differs but the brand word is shared.

Compose blocking with the actual scoring step in a single SQL pass β€” the planner pushes the cheap predicates ahead of the expensive ratio call automatically.

Pre-normalize once, score many

Hoist deterministic transformations out of the join so they run O(n) instead of O(nΒ²):

WITH normalized AS (
  SELECT id,
         lower(regexp_replace(name, '[^a-z0-9 ]', '', 'g')) AS name_n,
         length(name) AS len,
         left(lower(name), 1)              AS block_letter
  FROM customers
)
SELECT a.id, b.id,
       rapidfuzz_token_set_ratio(a.name_n, b.name_n) AS score
FROM normalized a
JOIN normalized b
  ON a.id < b.id
 AND a.block_letter = b.block_letter
 AND a.len BETWEEN b.len - 3 AND b.len + 3
WHERE rapidfuzz_token_set_ratio(a.name_n, b.name_n) > 90;

When you also need consistent casing, pluralization, or struct-key rewrites, the sibling inflector extension covers the cleanup side of the same workflow β€” both sit in the matching-and-casing string-munging bucket.

Top-1 per left row

Record-linkage usually wants exactly one best match per left record. Use QUALIFY (DuckDB’s window-filter clause):

SELECT a.id AS left_id,
       b.id AS right_id,
       rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) AS score
FROM left_records  a
JOIN right_records b
  ON left(a.name, 1) = left(b.name, 1)            -- blocking
WHERE rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) > 0.92
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY a.id
  ORDER BY rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) DESC
) = 1;

When to reach for something else

  • Trigram indexes / FTS. For a single query string against millions of rows, a trigram or full-text index returns top-N candidates in sub-linear time. Score the candidates with RapidFuzz, but don’t expect this extension to replace the index.
  • Embeddings / vector search. If β€œsimilarity” really means semantic similarity (paraphrase, translation), edit distance is the wrong tool. Reach for an embedding model and a vector index instead.
  • Production fuzzy joins on >10M rows. Even with aggressive blocking, very large self-joins benefit from purpose-built record-linkage tooling (Splink, Dedupe). RapidFuzz is the right primitive for the inner-loop score; a record-linkage framework manages the candidate generation, blocking, and resolution around it.

Reference

Extension Contents

Quick reference to all available functions and settings organized by category.

Name Description
Hamming

Position-by-position character match count. Only meaningful when both strings have the same length.

rapidfuzz_hamming_distance() Number of differing positions between two equal-length strings
rapidfuzz_hamming_normalized_distance() Number of differing positions between two equal-length strings
rapidfuzz_hamming_normalized_similarity() Number of differing positions between two equal-length strings
rapidfuzz_hamming_similarity() Number of differing positions between two equal-length strings
Indel

Insertion-and-deletion edit distance β€” no substitutions counted.

rapidfuzz_indel_distance() Insertions+deletions to transform one string into another (no substitutions)
rapidfuzz_indel_normalized_distance() Insertions+deletions to transform one string into another (no substitutions)
rapidfuzz_indel_normalized_similarity() Insertions+deletions to transform one string into another (no substitutions)
rapidfuzz_indel_similarity() Insertions+deletions to transform one string into another (no substitutions)
Jaro

Jaro similarity β€” match-and-transposition based. Good for short strings and typos.

rapidfuzz_jaro_distance() Jaro similarity β€” character matches and transpositions
rapidfuzz_jaro_normalized_distance() Jaro similarity β€” character matches and transpositions
rapidfuzz_jaro_normalized_similarity() Jaro similarity β€” character matches and transpositions
rapidfuzz_jaro_similarity() Jaro similarity β€” character matches and transpositions
Jaro Winkler

Jaro-Winkler β€” Jaro with extra weight on a shared prefix. Strong default for human names.

rapidfuzz_jaro_winkler_distance() Jaro-Winkler β€” Jaro with extra weight on common prefixes
rapidfuzz_jaro_winkler_normalized_distance() Jaro-Winkler β€” Jaro with extra weight on common prefixes
rapidfuzz_jaro_winkler_normalized_similarity() Jaro-Winkler β€” Jaro with extra weight on common prefixes
rapidfuzz_jaro_winkler_similarity() Jaro-Winkler β€” Jaro with extra weight on common prefixes
Lcs Seq

Longest common subsequence β€” order-preserving but not adjacency-preserving.

rapidfuzz_lcs_seq_distance() Longest common subsequence β€” preserves character order but not adjacency
rapidfuzz_lcs_seq_normalized_distance() Longest common subsequence β€” preserves character order but not adjacency
rapidfuzz_lcs_seq_normalized_similarity() Longest common subsequence β€” preserves character order but not adjacency
rapidfuzz_lcs_seq_similarity() Longest common subsequence β€” preserves character order but not adjacency
Osa

Optimal String Alignment (a.k.a. restricted Damerau-Levenshtein) β€” counts adjacent transpositions as a single edit.

rapidfuzz_osa_distance() Optimal string alignment β€” Levenshtein with adjacent transpositions counted as one edit
rapidfuzz_osa_normalized_distance() Optimal string alignment β€” Levenshtein with adjacent transpositions counted as one edit
rapidfuzz_osa_normalized_similarity() Optimal string alignment β€” Levenshtein with adjacent transpositions counted as one edit
rapidfuzz_osa_similarity() Optimal string alignment β€” Levenshtein with adjacent transpositions counted as one edit
Postfix

Edit distance / similarity considering only the matching suffix portion.

rapidfuzz_postfix_distance() Edit distance considering only matching suffixes
rapidfuzz_postfix_normalized_distance() Edit distance considering only matching suffixes
rapidfuzz_postfix_normalized_similarity() Edit distance considering only matching suffixes
rapidfuzz_postfix_similarity() Edit distance considering only matching suffixes
Prefix

Edit distance / similarity considering only the matching prefix portion of the strings.

rapidfuzz_prefix_distance() Edit distance considering only matching prefixes
rapidfuzz_prefix_normalized_distance() Edit distance considering only matching prefixes
rapidfuzz_prefix_normalized_similarity() Edit distance considering only matching prefixes
rapidfuzz_prefix_similarity() Edit distance considering only matching prefixes
Ratio Scores

0–100 similarity scores matching the Python rapidfuzz.fuzz family. The headline rapidfuzz_ratio is the right starting point; the partial_, token_sort_, token_set_* variants handle substrings, reordered words, and deduplication.

rapidfuzz_partial_ratio() Best similarity score for any substring of the longer input against the shorter β€” useful when one string is a fragment of the other
rapidfuzz_partial_token_set_ratio() Token-set comparison with a partial-ratio fallback β€” best when one string is a subset/partial match of the other
rapidfuzz_ratio() Levenshtein-based similarity (0–100)
rapidfuzz_token_set_ratio() Compare token sets (deduped, unordered)
rapidfuzz_token_sort_ratio() Compare after sorting tokens (words)

API Reference

Function Documentation

Detailed documentation for each function including signatures, parameters, and examples.

rapidfuzz_hamming_distance

Scalar Function Hamming
Signature
rapidfuzz_hamming_distance(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Number of differing positions between two equal-length strings. Raw count of edits required.

Examples

rapidfuzz_hamming_normalized_distance

Scalar Function Hamming
Signature
rapidfuzz_hamming_normalized_distance(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Number of differing positions between two equal-length strings. Edit distance scaled to [0, 1] β€” 0 means identical.

Examples

rapidfuzz_hamming_normalized_similarity

Scalar Function Hamming
Signature
rapidfuzz_hamming_normalized_similarity(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Number of differing positions between two equal-length strings. Similarity scaled to [0, 1] β€” 1 means identical.

Examples

rapidfuzz_hamming_similarity

Scalar Function Hamming
Signature
rapidfuzz_hamming_similarity(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Number of differing positions between two equal-length strings. Raw count of matches.

Examples

rapidfuzz_indel_distance

Scalar Function Indel
Signature
rapidfuzz_indel_distance(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Insertions+deletions to transform one string into another (no substitutions). Raw count of edits required.

Examples

rapidfuzz_indel_normalized_distance

Scalar Function Indel
Signature
rapidfuzz_indel_normalized_distance(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Insertions+deletions to transform one string into another (no substitutions). Edit distance scaled to [0, 1] β€” 0 means identical.

Examples

rapidfuzz_indel_normalized_similarity

Scalar Function Indel
Signature
rapidfuzz_indel_normalized_similarity(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Insertions+deletions to transform one string into another (no substitutions). Similarity scaled to [0, 1] β€” 1 means identical.

Examples

rapidfuzz_indel_similarity

Scalar Function Indel
Signature
rapidfuzz_indel_similarity(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Insertions+deletions to transform one string into another (no substitutions). Raw count of matches.

Examples

rapidfuzz_jaro_distance

Scalar Function Jaro
Signature
rapidfuzz_jaro_distance(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Jaro similarity β€” character matches and transpositions. Good for short strings and typos. Raw count of edits required.

Examples

rapidfuzz_jaro_normalized_distance

Scalar Function Jaro
Signature
rapidfuzz_jaro_normalized_distance(col0: VARCHAR, col1: VARCHAR) β†’ DOUBLE
Parameters (Positional)
Parameter Type Mode Description
col0 VARCHAR Positional
col1 VARCHAR Positional
Returns
Description

Jaro similarity β€” character matches and transpositions. Good for short strings and typos. Edit distance scaled to [0, 1] β€” 0 means identical.

Examples

rapidfuzz_jaro_normalized_similarity

Jaro

Jaro similarity β€” character matches and transpositions.

rapidfuzz_jaro_similarity

Jaro

Jaro similarity β€” character matches and transpositions.

rapidfuzz_jaro_winkler_distance

Jaro Winkler

Jaro-Winkler β€” Jaro with extra weight on common prefixes.

rapidfuzz_jaro_winkler_normalized_distance

Jaro Winkler

Jaro-Winkler β€” Jaro with extra weight on common prefixes.

rapidfuzz_jaro_winkler_normalized_similarity

Jaro Winkler

Jaro-Winkler β€” Jaro with extra weight on common prefixes.

rapidfuzz_jaro_winkler_similarity

Jaro Winkler

Jaro-Winkler β€” Jaro with extra weight on common prefixes.

rapidfuzz_lcs_seq_distance

Lcs Seq

Longest common subsequence β€” preserves character order but not adjacency.

rapidfuzz_lcs_seq_normalized_distance

Lcs Seq

Longest common subsequence β€” preserves character order but not adjacency.

rapidfuzz_lcs_seq_normalized_similarity

Lcs Seq

Longest common subsequence β€” preserves character order but not adjacency.

rapidfuzz_lcs_seq_similarity

Lcs Seq

Longest common subsequence β€” preserves character order but not adjacency.

rapidfuzz_osa_distance

Osa

Optimal string alignment β€” Levenshtein with adjacent transpositions counted as one edit.

rapidfuzz_osa_normalized_distance

Osa

Optimal string alignment β€” Levenshtein with adjacent transpositions counted as one edit.

rapidfuzz_osa_normalized_similarity

Osa

Optimal string alignment β€” Levenshtein with adjacent transpositions counted as one edit.

rapidfuzz_osa_similarity

Osa

Optimal string alignment β€” Levenshtein with adjacent transpositions counted as one edit.

rapidfuzz_partial_ratio

Ratio Scores

Best similarity score for any substring of the longer input against the shorter β€” useful when one string is a fragment of the other.

rapidfuzz_partial_token_set_ratio

Ratio Scores

Token-set comparison with a partial-ratio fallback β€” best when one string is a subset/partial match of the other.

rapidfuzz_postfix_distance

Postfix

Edit distance considering only matching suffixes.

rapidfuzz_postfix_normalized_distance

Postfix

Edit distance considering only matching suffixes.

rapidfuzz_postfix_normalized_similarity

Postfix

Edit distance considering only matching suffixes.

rapidfuzz_postfix_similarity

Postfix

Edit distance considering only matching suffixes.

rapidfuzz_prefix_distance

Prefix

Edit distance considering only matching prefixes.

rapidfuzz_prefix_normalized_distance

Prefix

Edit distance considering only matching prefixes.

rapidfuzz_prefix_normalized_similarity

Prefix

Edit distance considering only matching prefixes.

rapidfuzz_prefix_similarity

Prefix

Edit distance considering only matching prefixes.

rapidfuzz_ratio

Ratio Scores

Levenshtein-based similarity (0–100).

rapidfuzz_token_set_ratio

Ratio Scores

Compare token sets (deduped, unordered).

rapidfuzz_token_sort_ratio

Ratio Scores

Compare after sorting tokens (words).

Practical Examples

Cookbook

Real-world recipes and patterns for common use cases.

Quick reference

NeedPick
One score, 0–100, sane defaultrapidfuzz_ratio
Match a fragment inside a longer stringrapidfuzz_partial_ratio
Words in different ordersrapidfuzz_token_sort_ratio
Tolerate duplicated wordsrapidfuzz_token_set_ratio
Names / typosrapidfuzz_jaro_winkler_normalized_similarity
Spelling distance with transpositionsrapidfuzz_osa_distance
Strings of identical lengthrapidfuzz_hamming_distance

Record linkage

Find the best match for each row in left against right, scored by Jaro-Winkler:

SELECT a.id   AS left_id,
       b.id   AS right_id,
       rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) AS sim
FROM left_records  a,
     right_records b
WHERE rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) > 0.92
ORDER BY a.id, sim DESC;

For the single best match per left row, wrap with QUALIFY:

SELECT *
FROM (
  SELECT a.id AS left_id, b.id AS right_id,
         rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) AS sim,
         ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY rapidfuzz_jaro_winkler_normalized_similarity(a.name, b.name) DESC) AS rn
  FROM left_records a, right_records b
)
WHERE rn = 1 AND sim > 0.92;

Dedupe

Cluster rows by similarity threshold:

WITH pairs AS (
  SELECT a.id AS l, b.id AS r,
         rapidfuzz_token_set_ratio(a.title, b.title) AS sim
  FROM titles a, titles b
  WHERE a.id < b.id
    AND rapidfuzz_token_set_ratio(a.title, b.title) > 90
)
SELECT * FROM pairs ORDER BY sim DESC;

Use the pair list to assign cluster IDs (graph-walk in Python or with a recursive CTE).

Spell check / typo tolerance

For a known-good dictionary, score a candidate against every entry and pick the closest:

SELECT word
FROM dictionary
WHERE rapidfuzz_osa_distance(word, :input) <= 2
ORDER BY rapidfuzz_osa_distance(word, :input)
LIMIT 5;

OSA (Optimal String Alignment) handles single-character typos and adjacent transpositions naturally.

Ratio choice β€” visual guide

input1: 'New York City'
input2: 'new york new york city'

rapidfuzz_ratio                    β†’ 56  (different overall)
rapidfuzz_partial_ratio            β†’ 100 (one is a substring)
rapidfuzz_token_sort_ratio         β†’ 78  (sorted tokens differ in count)
rapidfuzz_token_set_ratio          β†’ 100 (token sets identical)
rapidfuzz_partial_token_set_ratio  β†’ 100 (sets identical, partial fallback)

The differences matter β€” pick the one that matches your data semantics. For names with reordered words, token_sort_ratio is usually right. For names with stray duplicates, token_set_ratio. For substring matching, partial_ratio.

Algorithm function naming

Each of the 8 algorithms exposes 4 variants:

  • rapidfuzz_<algo>_distance(a, b) β€” raw edit count.
  • rapidfuzz_<algo>_similarity(a, b) β€” raw match count.
  • rapidfuzz_<algo>_normalized_distance(a, b) β€” [0, 1], 0 = identical.
  • rapidfuzz_<algo>_normalized_similarity(a, b) β€” [0, 1], 1 = identical.

For comparisons across pairs of unequal length, the normalized_* variants are usually what you want β€” they don’t punish longer strings.

Platform Support

Compatibility

Extension availability may vary by platform and DuckDB version. Check below to ensure this extension supports your environment before installation.

Quick Facts

Software License MIT
Pricing Free
Written In C++
Source Available Yes
View on GitHub
Usage
184,353+
loads in last 30 days

Platforms

Linux
Linux (musl)
macOS
Windows
WASM
Supported platform architectures
Linux: x86_64, aarch64
Linux (musl): x86_64
macOS: Intel, Apple Silicon
Windows: x86_64
WASM: threads, mvp, eh
Compiled binary sizes
Platform Architecture Size
Linux aarch64 2.73 MB
Linux x86_64 3.10 MB
Linux (musl) x86_64 2.36 MB
macOS Apple Silicon 1.87 MB
macOS Intel 2.17 MB
Windows x86_64 7.55 MB
WASM eh 35.2 KB
WASM mvp 36.3 KB
WASM threads 36.3 KB

Gzipped download size from the DuckDB community-extensions registry.

DuckDB Versions

Release calendar
Supported
v1.4.4 v1.5.2

High-Performance Fuzzy Matching

Install RapidFuzz for record linkage, dedupe, spell-check, and similarity ranking β€” Jaro/Jaro-Winkler, Levenshtein, LCS, and the full token_*_ratio family from the Python rapidfuzz library.