← Blog
Typo-aware email search in PostgreSQL
- Databases
- PostgreSQL
- Search
- TypeScript
Here’s the life of a customer support agent: a customer can’t find their order. You ask for their email, they say john.smith@gmail.com, you search for it, and… nothing. Turns out there’s a typo and the actual record is jhn.smith@gmial.con. If emails are not verified, that row just sits in your database forever.
Customer support needs to find that account anyway, and they ask for your help. So we need email search that is resilient to typos: gmial instead of gmail, .con instead of .com.
There’s also a second, sneakier requirement. When support searches for foo@gmail.com, they want the foo-ish accounts on Gmail - not every single one of the millions of @gmail.com users in the table. Fuzzy matching the whole email string handles the first requirement and completely fails the second. We’ll need to be a bit smarter than “throw it all at a similarity function”.
This is all doable in plain PostgreSQL.
LIKE is not enough
The obvious first attempt:
SELECT * FROM customers WHERE email ILIKE '%john.smith@gmail.com%';
This needs the substring to be exactly right. One wrong letter and you get zero rows. ILIKE is for “I know part of it”, not “I know roughly what it looks like”. For non-exact matches we need trigram similarity.
pg_trgm to the rescue
A trigram is a three-character slice of a string - "john" breaks down into {"  j", " jo", "joh", "ohn", "hn "}, where pg_trgm pads each word with two leading spaces and one trailing space - and two strings that share a lot of trigrams are probably similar. PostgreSQL’s pg_trgm extension gives you similarity(a, b) returning a number from 0 to 1, and the % operator which is true when that similarity is above pg_trgm.similarity_threshold (default 0.3, tunable per session with set_limit()).
CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT similarity('jhn.smith@gmial.con', 'john.smith@gmail.com'); -- ~0.4
Around 0.4 - comfortably above the default threshold of 0.3. The typo’d email and the real one are “similar” as far as Postgres is concerned. That’s the whole trick. Everything below is about making it fast and making it not return garbage.
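To see why those two strings count as similar, here is a small TypeScript model of pg_trgm’s trigram logic - a sketch of the documented behaviour (lowercase, split into alphanumeric words, pad, compare trigram sets), not the extension’s actual implementation:

```typescript
// A simplified model of pg_trgm's trigram extraction: lowercase,
// split into alphanumeric "words", pad each word with two leading
// spaces and one trailing space, then take every 3-char window.
function trigrams(s: string): Set<string> {
  const out = new Set<string>();
  const words = s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const w of words) {
    const padded = `  ${w} `;
    for (let i = 0; i + 3 <= padded.length; i++) {
      out.add(padded.slice(i, i + 3));
    }
  }
  return out;
}

// similarity(a, b) is the Jaccard index of the two trigram sets:
// shared trigrams divided by total distinct trigrams.
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}

console.log(similarity("jhn.smith@gmial.con", "john.smith@gmail.com"));
// ≈ 0.41 under this model: "smith" survives intact, and the typo'd
// words still share their leading trigrams
```

The intuition: a single typo only destroys the two or three trigrams that touch it, so most of the set survives.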
Step 1: split the email when you store it
This is the important bit, so I’ll spend words on it.
The problem with email % 'foo@gmail.com' is that gmail.com is a huge trigram bucket. Half your table matches it. The trigram index returns half your table, Postgres throws away almost all of it, and you’ve turned an index lookup into a slow scan. The gmail.com part “poisons” the query.
So: don’t search the whole email when the domain is something common. Split it. Store the local part and the domain in their own columns:
ALTER TABLE customers ADD COLUMN email_name text;
ALTER TABLE customers ADD COLUMN email_domain text;
Keep them in sync with a BEFORE INSERT OR UPDATE trigger:
CREATE OR REPLACE FUNCTION split_customer_email()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
NEW.email_name = split_part(lower(NEW.email), '@', 1);
NEW.email_domain = split_part(lower(NEW.email), '@', 2);
RETURN NEW;
END;
$$;
CREATE TRIGGER split_customer_email_trigger
BEFORE INSERT OR UPDATE ON customers
FOR EACH ROW EXECUTE PROCEDURE split_customer_email();
You could do this in application code instead - it’s the same idea - but a trigger means the columns stay up to date no matter which code path writes the row. The lower() call normalizes case so searches can ignore it.
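If you do keep the split in application code, the helper is small. A sketch - splitEmail is a hypothetical name, mirroring split_part’s behaviour of returning '' when the separator is missing:

```typescript
// Application-side equivalent of the trigger: mirrors
// split_part(lower(email), '@', 1) and split_part(lower(email), '@', 2).
// (Addresses with multiple '@' are invalid anyway; we split on the first.)
function splitEmail(email: string): { name: string; domain: string } {
  const lowered = email.toLowerCase();
  const at = lowered.indexOf("@");
  // split_part returns '' when the separator is absent; mirror that
  if (at === -1) return { name: lowered, domain: "" };
  return { name: lowered.slice(0, at), domain: lowered.slice(at + 1) };
}

console.log(splitEmail("John.Smith@Gmail.com"));
// { name: 'john.smith', domain: 'gmail.com' }
```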
Now, when someone searches foo@gmail.com, we can require email_domain = 'gmail.com' exactly (a cheap btree lookup that immediately discards 99% of the table) and only fuzzy-match email_name against foo. The poison is gone.
For uncommon domains - foo@my-weird-company.io - the domain is selective enough on its own, so fuzzy-matching the whole email is fine, and arguably better, because a typo could be in the domain.
Step 2: index for trigram lookups
Trigram search needs a GIN index with the gin_trgm_ops operator class:
CREATE INDEX customers_email_gin_idx ON customers USING GIN (email gin_trgm_ops);
CREATE INDEX customers_email_name_gin_idx ON customers USING GIN (email_name gin_trgm_ops);
CREATE INDEX customers_email_domain_idx ON customers (email_domain); -- plain btree, exact match
GIN over GiST here: GIN indexes are bigger and slower to build, but faster to query. A support search tool is read-heavy and the index gets built once. The plain btree on email_domain is for the email_domain = ? equality check.
Step 3: the common-domains list
We need to know which domains are “common” - i.e. so popular that the domain part is useless as a filter. There’s no clever way to compute this; it’s just a list. A few dozen entries covers the overwhelming majority of real-world addresses:
const COMMON_DOMAINS: Record<string, boolean> = [
"gmail.com",
"yahoo.com",
"hotmail.com",
"outlook.com",
"icloud.com",
"aol.com",
"live.com",
"msn.com",
"me.com",
"googlemail.com",
"comcast.net",
"verizon.net",
"protonmail.com",
"yandex.com",
"mail.ru",
// ...add the ones that show up in your data
].reduce<Record<string, boolean>>((acc, domain) => {
acc[domain] = true
return acc
}, {})
A Record<string, boolean> rather than an array so the lookup in the query path is O(1).
Step 4: the query
The plan:
- Run a cheap ILIKE '%input%' first. This catches exact matches and “I typed part of it” searches, and guarantees a perfect match always lands at the top.
- If we still want more rows, run the fuzzy query - and branch on whether the domain is common:
  - common domain → pin email_domain, fuzzy-match email_name only, rank by similarity(email_name, name).
  - anything else → fuzzy-match the whole email, rank by similarity(email, input).
- Raise the similarity cutoff to 0.5 - customer support search wants “close enough”, not “shares one trigram”.
// using Knex.js here for raw query example
import { Knex } from "knex"
import { COMMON_DOMAINS } from "./commonDomains"
export async function searchCustomersByEmail(
knex: Knex,
input: string,
limit = 20,
) {
// 1. exact / substring matches first
const exact = (
await knex.raw(
`SELECT * FROM customers
WHERE email ILIKE :pattern
ORDER BY created_at DESC
LIMIT :limit`,
{ pattern: `%${input}%`, limit },
)
).rows
const remaining = limit - exact.length
if (remaining <= 0) return exact
// ids we already returned, so the fuzzy pass doesn't repeat them.
// these are DB-generated ids, so interpolating them is safe - don't do
// this with user input.
const seenIds = exact.length ? exact.map((r) => r.id) : [-1]
const [name, domain] = input.toLowerCase().split("@")
let fuzzy
if (COMMON_DOMAINS[domain]) {
// "gmail.com" would poison the trigram index - pin the domain
// (cheap btree lookup), only fuzzy-match the local part
fuzzy = (
await knex.raw(
`SELECT * FROM customers
WHERE email_domain = :domain
AND email_name % :name
-- % filters at the session threshold; enforce the stricter 0.5
-- cutoff explicitly rather than via SET_LIMIT() in the SELECT list,
-- which isn't guaranteed to run before the WHERE clause
AND similarity(email_name, :name) > 0.5
AND id NOT IN (${seenIds.join(",")})
ORDER BY similarity(email_name, :name) DESC
LIMIT :limit`,
{ name, domain, limit: remaining },
)
).rows
} else {
// uncommon domain: the domain itself is distinct, and a typo
// might be in it - fuzzy-match the whole email
fuzzy = (
await knex.raw(
`SELECT * FROM customers
WHERE email % :input
-- same trick as above: explicit 0.5 cutoff on top of the indexed %
AND similarity(email, :input) > 0.5
AND id NOT IN (${seenIds.join(",")})
ORDER BY similarity(email, :input) DESC
LIMIT :limit`,
{ input, limit: remaining },
)
).rows
}
return exact.concat(fuzzy)
}
One more tricky scenario: a typo in the domain itself, e.g. gnail.com instead of gmail.com. The common-domain branch then pins the wrong domain and finds nothing. One way around it is to check at signup whether the typed domain is close to - but not exactly - a common domain, and ask the customer to confirm. Another is a DNS lookup for an MX record: if none exists, the domain is almost certainly a typo.
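The signup-time check could look something like this - suggestDomain and the edit-distance cutoff of 2 are illustrative assumptions, and a short inline list stands in for the COMMON_DOMAINS module:

```typescript
// A short stand-in for the common-domains list from earlier.
const KNOWN = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"];

// Classic Levenshtein edit distance, single-row dynamic programming.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0]; // dist(a[0..i-1], b[0..j-1]) from the previous row
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(
        dp[i] + 1, // insert into a
        dp[i - 1] + 1, // delete from a
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitute
      );
      prev = tmp;
    }
  }
  return dp[a.length];
}

// Close to a known domain, but not exactly it? Suggest the correction.
function suggestDomain(domain: string): string | null {
  for (const known of KNOWN) {
    const d = editDistance(domain.toLowerCase(), known);
    if (d > 0 && d <= 2) return known;
  }
  return null;
}

console.log(suggestDomain("gnail.com")); // gmail.com
console.log(suggestDomain("gmail.com")); // null (exact match, nothing to fix)
```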
Does it use the index?
EXPLAIN ANALYZE
SELECT * FROM customers
WHERE email_domain = 'gmail.com' AND email_name % 'jhnsmith';
-- Bitmap Heap Scan on customers
-- Recheck Cond: (email_name % 'jhnsmith'::text)
-- Filter: (email_domain = 'gmail.com'::text)
-- -> Bitmap Index Scan on customers_email_name_gin_idx
-- Index Cond: (email_name % 'jhnsmith'::text)
-- Planning Time: 0.2 ms
-- Execution Time: 1.4 ms
Index scan, milliseconds. Compare the naive version:
EXPLAIN ANALYZE
SELECT * FROM customers WHERE email % 'jhnsmith@gmail.com';
-- Bitmap Heap Scan on customers
-- Recheck Cond: (email % 'jhnsmith@gmail.com'::text)
-- Rows Removed by Index Recheck: 480000 <-- the gmail.com poison
-- -> Bitmap Index Scan on customers_email_gin_idx
-- Execution Time: 920 ms
Same index type, but the gmail.com trigrams match half the table. That Rows Removed by Index Recheck line is the whole reason we split the column.
Gotchas
- Experiment with the threshold. pg_trgm.similarity_threshold is set per session via set_limit(x). Too low and you drown in noise; too high and you miss the typo. Try 0.4–0.5 and go from there.
- Huge table? Limit the candidates before ranking: SELECT * FROM (SELECT * FROM customers WHERE email_name % :name LIMIT 1000) c ORDER BY similarity(...) DESC LIMIT :limit.
- GIN indexes make writes more expensive. On a write-heavy table, measure insert/update cost before committing to them.