← Blog

Typo-aware email search in PostgreSQL

Here’s the life of a customer support agent: a customer can’t find their order. You ask for their email, they say john.smith@gmail.com, you search for it, and… nothing. Turns out there’s a typo and the actual record is jhn.smith@gmial.con. If emails aren’t verified, that row just sits in your database forever.

Customer support needs to find that account anyway, and they ask for your help. So we need an email search that is resilient to typos: gmial instead of gmail, .con instead of .com.

There’s also a second, sneakier requirement. When support searches for foo@gmail.com, they want the foo-ish accounts on Gmail - not every single one of the millions of @gmail.com users in the table. Fuzzy matching the whole email string handles the first requirement and completely fails the second. We’ll need to be a bit smarter than “throw it all at a similarity function”.

This is all doable in plain PostgreSQL.

LIKE is not enough

The obvious first attempt:

SELECT * FROM customers WHERE email ILIKE '%john.smith@gmail.com%';

This needs the substring to be exactly right. One wrong letter and you get zero rows. ILIKE is for “I know part of it”, not “I know roughly what it looks like”. For non-exact matches we need trigram similarity.

pg_trgm to the rescue

A trigram is a three-character slice of a string - pg_trgm pads each word with two leading spaces and one trailing space, so "john" breaks down into {"  j", " jo", "joh", "ohn", "hn "} - and two strings that share a lot of trigrams are probably similar. PostgreSQL’s pg_trgm extension gives you similarity(a, b), returning a number from 0 to 1, and the % operator, which is true when that similarity is above pg_trgm.similarity_threshold (default 0.3, tunable per session with set_limit() or SET pg_trgm.similarity_threshold).

CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT similarity('jhn.smith@gmial.con', 'john.smith@gmail.com'); -- ~0.41

Roughly 0.41 - comfortably above the default threshold of 0.3. The typo’d email and the real one are “similar” as far as Postgres is concerned. That’s the whole trick. Everything below is about making it fast and making it not return garbage.
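To build intuition for what pg_trgm does under the hood, here is a simplified sketch of its trigram extraction and similarity in TypeScript. The function names are mine, and this glosses over details of the real extension - it is for illustration, not a drop-in replacement:

```typescript
// Simplified sketch of pg_trgm's trigram mechanics: lowercase, split into
// words on non-alphanumerics, pad each word with two leading spaces and one
// trailing space, slide a 3-char window, dedupe.
function trigrams(s: string): Set<string> {
	const out = new Set<string>();
	for (const word of s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)) {
		const padded = `  ${word} `;
		for (let i = 0; i + 3 <= padded.length; i++) {
			out.add(padded.slice(i, i + 3));
		}
	}
	return out;
}

// similarity = shared trigrams / distinct trigrams across both strings
function similarity(a: string, b: string): number {
	const ta = trigrams(a);
	const tb = trigrams(b);
	let shared = 0;
	for (const t of ta) if (tb.has(t)) shared++;
	return shared / (ta.size + tb.size - shared);
}
```

Run it on the example above and you land in the same ballpark as Postgres: the typo’d and real email share most of their trigrams, so the score clears the 0.3 default easily.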

Step 1: split the email when you store it

This is the important bit, so I’ll spend words on it.

The problem with email % 'foo@gmail.com' is that gmail.com is a huge trigram bucket. Half your table matches it. The trigram index returns half your table, Postgres throws away almost all of it, and you’ve turned an index lookup into a slow scan. The gmail.com part “poisons” the query.

So: don’t search the whole email when the domain is something common. Split it. Store the local part and the domain in their own columns:

ALTER TABLE customers ADD COLUMN email_name   text;
ALTER TABLE customers ADD COLUMN email_domain text;

Keep them in sync with a BEFORE INSERT OR UPDATE trigger:

CREATE OR REPLACE FUNCTION split_customer_email()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
	NEW.email_name   = split_part(lower(NEW.email), '@', 1);
	NEW.email_domain = split_part(lower(NEW.email), '@', 2);
	RETURN NEW;
END;
$$;

CREATE TRIGGER split_customer_email_trigger
BEFORE INSERT OR UPDATE ON customers
FOR EACH ROW EXECUTE FUNCTION split_customer_email();

You could do this in application code instead - it’s the same idea - but a trigger means the columns stay in sync no matter which code path writes the row. Note the lower() calls: they normalize the stored parts so the search can ignore letter case.

Now, when someone searches foo@gmail.com, we can require email_domain = 'gmail.com' exactly (a cheap btree lookup that immediately discards 99% of the table) and only fuzzy-match email_name against foo. The poison is gone.

For uncommon domains - foo@my-weird-company.io - the domain is selective enough on its own, so fuzzy-matching the whole email is fine, and arguably better, because a typo could be in the domain.
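If you do take the application-code route, the split is a one-liner. This is a sketch - splitEmail is an illustrative name, not part of the schema above - that mirrors split_part’s semantics (the text before the first "@", and the text between the first and second "@"):

```typescript
// Application-side equivalent of the trigger. Like split_part(email, '@', n),
// split("@") yields the first field as the local part and the second as the
// domain; a missing "@" leaves the domain empty, just as split_part would.
function splitEmail(email: string): { name: string; domain: string } {
	const [name, domain = ""] = email.toLowerCase().split("@");
	return { name, domain };
}
```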

Step 2: index for trigram lookups

Trigram search needs a GIN index with the gin_trgm_ops operator class:

CREATE INDEX customers_email_gin_idx      ON customers USING GIN (email gin_trgm_ops);
CREATE INDEX customers_email_name_gin_idx ON customers USING GIN (email_name gin_trgm_ops);
CREATE INDEX customers_email_domain_idx   ON customers (email_domain); -- plain btree, exact match

GIN over GIST here: GIN indexes are bigger and slower to build, but faster to query. A support search tool is read-heavy and the index gets built once. The plain btree on email_domain is for the email_domain = ? equality check.

Step 3: the common-domains list

We need to know which domains are “common” - i.e. so popular that the domain part is useless as a filter. There’s no clever way to compute this; it’s just a list. A few dozen entries covers the overwhelming majority of rows:

const COMMON_DOMAINS: Record<string, boolean> = [
	"gmail.com",
	"yahoo.com",
	"hotmail.com",
	"outlook.com",
	"icloud.com",
	"aol.com",
	"live.com",
	"msn.com",
	"me.com",
	"googlemail.com",
	"comcast.net",
	"verizon.net",
	"protonmail.com",
	"yandex.com",
	"mail.ru",
	// ...add the ones that show up in your data
].reduce<Record<string, boolean>>((acc, domain) => {
	acc[domain] = true
	return acc
}, {})

A Record<string, boolean> rather than an array so the lookup in the query path is O(1).
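That lookup gates which fuzzy query we run - the branching is spelled out in Step 4, but the decision itself is small enough to sketch in isolation. searchStrategy is a hypothetical helper, with COMMON_DOMAINS inlined to keep the sketch self-contained:

```typescript
// Abbreviated stand-in for the real COMMON_DOMAINS map
const COMMON_DOMAINS: Record<string, boolean> = {
	"gmail.com": true,
	"yahoo.com": true,
	"hotmail.com": true,
};

// Decide which fuzzy query to run for a search input: pin the domain and
// fuzzy-match only the local part, or fuzzy-match the whole email.
function searchStrategy(input: string): "pin-domain" | "full-email" {
	const domain = input.toLowerCase().split("@")[1];
	// no "@" at all, or an uncommon domain: fuzzy-match the whole string
	return domain !== undefined && COMMON_DOMAINS[domain]
		? "pin-domain"
		: "full-email";
}
```

Note the edge case: an input with no “@” at all falls through to the full-email branch, which is what we want.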

Step 4: the query

The plan:

  1. Run a cheap ILIKE '%input%' first. This catches exact matches and “I typed part of it” searches, and guarantees a perfect match always lands at the top.
  2. If we still want more rows, run the fuzzy query - and branch on whether the domain is common:
    • common domain → fixed email_domain, fuzzy-match email_name only, rank by similarity(email_name, name).
    • anything else → fuzzy-match the whole email, rank by similarity(email, input).
  3. SET_LIMIT(0.5) to nudge the threshold up a bit - customer support search wants “close enough”, not “shares one trigram”.
// using Knex.js here for raw query example
import { Knex } from "knex"
import { COMMON_DOMAINS } from "./commonDomains"

export async function searchCustomersByEmail(
	knex: Knex,
	input: string,
	limit = 20,
) {
	// 1. exact / substring matches first
	const exact = (
		await knex.raw(
			`SELECT * FROM customers
			 WHERE email ILIKE :pattern
			 ORDER BY created_at DESC
			 LIMIT :limit`,
			{ pattern: `%${input}%`, limit },
		)
	).rows

	const remaining = limit - exact.length
	if (remaining <= 0) return exact

	// ids we already returned, so the fuzzy pass doesn't repeat them.
	// these are DB-generated ids, so interpolating them is safe - don't do
	// this with user input.
	const seenIds = exact.length ? exact.map((r) => r.id) : [-1]
	const [name, domain] = input.toLowerCase().split("@")

	let fuzzy
	if (COMMON_DOMAINS[domain]) {
		// "gmail.com" would poison the trigram index - pin the domain
		// (cheap btree lookup), only fuzzy-match the local part
		fuzzy = (
			await knex.raw(
				`SELECT *, SET_LIMIT(0.5) FROM customers
				 WHERE email_domain = :domain
				   AND email_name % :name
				   AND id NOT IN (${seenIds.join(",")})
				 ORDER BY similarity(email_name, :name) DESC
				 LIMIT :limit`,
				{ name, domain, limit: remaining },
			)
		).rows
	} else {
		// uncommon domain: the domain itself is distinct, and a typo
		// might be in it - fuzzy-match the whole email
		fuzzy = (
			await knex.raw(
				`SELECT *, SET_LIMIT(0.5) FROM customers
				 WHERE email % :input
				   AND id NOT IN (${seenIds.join(",")})
				 ORDER BY similarity(email, :input) DESC
				 LIMIT :limit`,
				{ input, limit: remaining },
			)
		).rows
	}

	return exact.concat(fuzzy)
}

There’s one more tricky scenario: the typo is in a common domain itself, and it’s the stored value - e.g. the customer signed up as foo@gnail.com. When support then searches foo@gmail.com, the common-domain branch pins email_domain = 'gmail.com' exactly, so that row can never match. One way around it is a signup-time check: if the typed domain is close to a common provider but not quite, ask the customer to confirm. Another is a DNS lookup for an MX record - if the domain has none, there’s almost certainly a typo.
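The signup-time check could look something like this. It’s a sketch: editDistance and suggestDomain are illustrative names, and a plain Levenshtein distance stands in for whatever closeness measure you prefer (trigram similarity would work too):

```typescript
// Abbreviated stand-in for the real COMMON_DOMAINS list
const PROVIDERS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com", "icloud.com"];

// Classic dynamic-programming Levenshtein distance
function editDistance(a: string, b: string): number {
	const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array(b.length).fill(0)]);
	for (let j = 0; j <= b.length; j++) dp[0][j] = j;
	for (let i = 1; i <= a.length; i++)
		for (let j = 1; j <= b.length; j++)
			dp[i][j] = Math.min(
				dp[i - 1][j] + 1, // deletion
				dp[i][j - 1] + 1, // insertion
				dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
			);
	return dp[a.length][b.length];
}

// A close-but-not-exact provider, or null if the domain looks fine.
// Within 2 edits is a heuristic - the customer confirms, so false
// positives only cost one extra prompt.
function suggestDomain(domain: string): string | null {
	for (const p of PROVIDERS) {
		if (p === domain) return null; // exact match, nothing to fix
		if (editDistance(domain, p) <= 2) return p;
	}
	return null;
}
```

At signup, a non-null suggestion becomes a “did you mean foo@gmail.com?” prompt rather than a silent correction - the weird domain might be intentional.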

Does it use the index?

EXPLAIN ANALYZE
SELECT * FROM customers
WHERE email_domain = 'gmail.com' AND email_name % 'jhnsmith';

-- Bitmap Heap Scan on customers
--   Recheck Cond: (email_name % 'jhnsmith'::text)
--   Filter: (email_domain = 'gmail.com'::text)
--   ->  Bitmap Index Scan on customers_email_name_gin_idx
--         Index Cond: (email_name % 'jhnsmith'::text)
--   Planning Time: 0.2 ms
--   Execution Time: 1.4 ms

Index scan, milliseconds. Compare the naive version:

EXPLAIN ANALYZE
SELECT * FROM customers WHERE email % 'jhnsmith@gmail.com';

-- Bitmap Heap Scan on customers
--   Recheck Cond: (email % 'jhnsmith@gmail.com'::text)
--   Rows Removed by Index Recheck: 480000   <-- the gmail.com poison
--   ->  Bitmap Index Scan on customers_email_gin_idx
--   Execution Time: 920 ms

Same index type, but the gmail.com trigrams return half the table. That Rows Removed by Index Recheck line is the whole reason we split the column.

Gotchas

  • Experiment with the threshold. pg_trgm.similarity_threshold is set per session via SET_LIMIT(x). Too low and you get noise; too high and you miss the typo. Try 0.4 to 0.5 and go from there.
  • Huge table? Limit the candidates before ranking: SELECT * FROM (SELECT * FROM customers WHERE email_name % :name LIMIT 1000) c ORDER BY similarity(...) DESC LIMIT :limit.
  • GIN indexes are slow to update, so on a write-heavy table they add real insert/update overhead - measure, and consider GIST there.