Back to path
MediumWorking system ~15h· 5 milestones

Build an autocomplete / mini search engine

Turn algorithms into a product feature. You build a fast autocomplete over a real dataset, think a city list, a product catalog, or a wordbank of hundreds of thousands of entries, that ranks suggestions well and stays under a strict latency budget as the data grows.

TriesInverted indexRankingBenchmarkingLatency percentilesAlgorithmic optimizationObservability & structured loggingPerformance budgetingRunbooks

What you'll build

An autocomplete engine over a real corpus using a trie for prefix matching plus a ranking layer (frequency + fuzzy), backed by an inverted index for full-text lookup, with a benchmark report proving p50/p99 query latency stays within budget as the dataset scales from 10k to 500k entries.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

src/load.pypy
import csv
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    id: int
    term: str          # normalized
    raw: str           # original, for display
    freq: int


def normalize(s: str) -> str:
    return " ".join(s.strip().lower().split())


def load_corpus(path: str) -> list[Entry]:
    seen: dict[str, Entry] = {}
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            term = normalize(row["term"])
            if not term:
                continue
            freq = int(row.get("frequency") or 0)
            if term in seen:                 # dedupe, keep the higher frequency
                if freq > seen[term].freq:
                    seen[term] = Entry(seen[term].id, term, row["term"], freq)
            else:
                seen[term] = Entry(len(seen), term, row["term"], freq)
    return list(seen.values())

Reading this file

  • @dataclass(frozen=True)An immutable record for each entry, frozen means it can be safely shared and hashed without accidental edits.
  • term: str # normalizedThe cleaned-up form used for indexing and matching, kept separate from the original display text.
  • def normalize(s: str)One place that lowercases and trims text, so the index and the query always agree on form.
  • if term in seen:Deduplicates entries, real datasets repeat, so you keep one record per normalized term.
  • if freq > seen[term].freqOn a duplicate, keeps the higher frequency so ranking later reflects the most popular occurrence.

Normalize once at load time so the index and queries agree on casing/whitespace.

That's 1 of 8 explained code blocks in this single project.

The build, milestone by milestone

  1. 1

    Index a real corpus

    5 guided steps

    Real data is messy, mixed case, duplicates, unicode, popularity columns. Indexing it correctly is half the battle, and the structure you choose here sets the ceiling on how fast queries can be.

  2. 2

    Rank suggestions so the top ones are actually useful

    5 guided steps

    A correct-but-unranked autocomplete is useless: nobody scrolls past the first few suggestions. Ranking is where the feature becomes a product, and where you confront the latency cost of scoring candidates.

  3. 3

    Benchmark, optimize, and report latency at scale

    5 guided steps

    The whole value of choosing the right data structure is performance, and “it feels fast” is not an engineering claim. You make the claim measurable: numbers, at scale, before and after.

  4. 4

    Instrument the engine and set a memory + latency budget

    5 guided steps

    A benchmark is a snapshot; an instrumented engine tells you, continuously, when it has drifted out of budget. Setting numbers up front (“p99 ≤ 10ms, ≤ 400MB at 500k entries”) is how a feature becomes operable instead of a demo.

  5. 5

    Write the runbook: what to do when suggestions get slow

    4 guided steps

    Most “it’s slow” incidents are re-debugged from zero because nobody wrote down last time’s diagnosis. A one-page runbook turns a panicked 2am investigation into a checklist.

What's inside when you start

3 starter files, ready to clone
5 guided milestones
5 full reference solutions
8 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A working autocomplete engine over a real dataset with a CLI or web demo
A benchmark report: p50/p99 latency vs dataset size (10k / 100k / 500k)
A before/after note on one profiling-driven optimization, with numbers
An instrumented engine with structured per-query logs, live p50/p99 counters, and a written latency + memory budget enforced by a CI/benchmark check
A one-page runbook mapping degradation modes to confirming metrics and first remediations

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.

Start building