Build an autocomplete / mini search engine
Turn algorithms into a product feature. You build a fast autocomplete over a real dataset, think a city list, a product catalog, or a wordbank of hundreds of thousands of entries, that ranks suggestions well and stays under a strict latency budget as the data grows.
What you'll build
An autocomplete engine over a real corpus using a trie for prefix matching plus a ranking layer (frequency + fuzzy), backed by an inverted index for full-text lookup, with a benchmark report proving p50/p99 query latency stays within budget as the dataset scales from 10k to 500k entries.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
import csv
from dataclasses import dataclass
@dataclass(frozen=True)
class Entry:
id: int
term: str # normalized
raw: str # original, for display
freq: int
def normalize(s: str) -> str:
return " ".join(s.strip().lower().split())
def load_corpus(path: str) -> list[Entry]:
seen: dict[str, Entry] = {}
with open(path, newline="", encoding="utf-8") as fh:
for row in csv.DictReader(fh):
term = normalize(row["term"])
if not term:
continue
freq = int(row.get("frequency") or 0)
if term in seen: # dedupe, keep the higher frequency
if freq > seen[term].freq:
seen[term] = Entry(seen[term].id, term, row["term"], freq)
else:
seen[term] = Entry(len(seen), term, row["term"], freq)
return list(seen.values())Reading this file
@dataclass(frozen=True)An immutable record for each entry, frozen means it can be safely shared and hashed without accidental edits.term: str # normalizedThe cleaned-up form used for indexing and matching, kept separate from the original display text.def normalize(s: str)One place that lowercases and trims text, so the index and the query always agree on form.if term in seen:Deduplicates entries, real datasets repeat, so you keep one record per normalized term.if freq > seen[term].freqOn a duplicate, keeps the higher frequency so ranking later reflects the most popular occurrence.
Normalize once at load time so the index and queries agree on casing/whitespace.
That's 1 of 8 explained code blocks in this single project.
The build, milestone by milestone
- 1
Index a real corpus
5 guided stepsReal data is messy, mixed case, duplicates, unicode, popularity columns. Indexing it correctly is half the battle, and the structure you choose here sets the ceiling on how fast queries can be.
- 2
Rank suggestions so the top ones are actually useful
5 guided stepsA correct-but-unranked autocomplete is useless: nobody scrolls past the first few suggestions. Ranking is where the feature becomes a product, and where you confront the latency cost of scoring candidates.
- 3
Benchmark, optimize, and report latency at scale
5 guided stepsThe whole value of choosing the right data structure is performance, and “it feels fast” is not an engineering claim. You make the claim measurable: numbers, at scale, before and after.
- 4
Instrument the engine and set a memory + latency budget
5 guided stepsA benchmark is a snapshot; an instrumented engine tells you, continuously, when it has drifted out of budget. Setting numbers up front (“p99 ≤ 10ms, ≤ 400MB at 500k entries”) is how a feature becomes operable instead of a demo.
- 5
Write the runbook: what to do when suggestions get slow
4 guided stepsMost “it’s slow” incidents are re-debugged from zero because nobody wrote down last time’s diagnosis. A one-page runbook turns a panicked 2am investigation into a checklist.
What's inside when you start
You'll walk away with
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
Start building