AI Engineering · 13 min read · Jun 2026

Embeddings & Vector Search

How text becomes vectors that capture meaning, why cosine similarity matters, and how approximate nearest-neighbor indexes power RAG and semantic search.

AI · Embeddings · Vector Search · RAG

Sri Balaji

Founder · TheSimplifiedTech

Why keyword search keeps failing you

You build a search box over your docs. A user types "how do I reset my password" and gets nothing, because your help article is titled "Account recovery steps." Same meaning, zero matching words. Keyword search only finds documents that share *literal tokens*; it has no idea that "reset password" and "account recovery" are the same idea.

Embeddings fix this. They turn text into numbers that encode meaning, so "reset my password" lands right next to "account recovery" even though they share no words. Vector search is how you find those nearby meanings fast, and it's the retrieval engine underneath every RAG system.

Who this is for

Engineers who've seen embeddings mentioned in RAG tutorials but never understood what the vectors *are*, why cosine similarity shows up everywhere, or why "HNSW" matters. If you've called an embedding API and gotten back a giant array of floats and thought "now what?", start here. No linear algebra degree required.

What an embedding actually is

An embedding is a list of numbers (a vector) that represents a piece of text as a point in a high-dimensional space, positioned so that texts with similar meaning sit close together.

That's it. You hand text to an embedding model and it returns a fixed-length array of floats, often 384, 768, or 1,536 of them. Each number is a coordinate. "The cat sat on the mat" might become [0.021, -0.118, 0.337, ...]. The individual numbers mean nothing to you; what matters is the position they describe and how close it is to other positions.

The model learned these coordinates by reading enormous amounts of text. It discovered that words and sentences used in similar contexts should land in similar places. The result is a space where geometric distance is a proxy for semantic difference.

| A city's latitude & longitude | An embedding vector (coordinates in meaning-space) |
| --- | --- |
| Cities close on the map are geographically near | Vectors close in the space are semantically similar |
| "How far is Paris from Lyon?" | Distance between two vectors |
| "Find the 5 nearest towns" | Nearest-neighbor query |

Think of embeddings as a map of meaning.

On a real map, Paris and Lyon are close, and both are far from Tokyo. In embedding space, "reset password" and "account recovery" are close, and both are far from "export invoice as PDF." Search becomes geometry: to find relevant text, you find the nearest points.

The full pipeline, end to end

[Diagram: Raw text ("reset my password") → send text → Embedding model (text-embedding-3 / e5) → returns floats → Vector ([0.02, -0.11, ...], 1536-d) → store → Vector index (HNSW in a vector DB). Query vector (embedded user question) → ANN search → Top-k results (nearest neighbors), ranked by similarity.]

Text becomes a vector, lands in a vector space, and a query finds its nearest neighbors.

  1. Embed your documents. Run every chunk of your corpus through the embedding model once, offline. You get one vector per chunk.
  2. Store the vectors in an index. Load them into a vector database (Pinecone, Qdrant, pgvector, Weaviate). The DB builds an index, usually HNSW, to make search fast.
  3. Embed the user's query. At query time, send the user's question through the *same* model to get a query vector in the same space.
  4. Find nearest neighbors. The index returns the top-k vectors closest to the query vector, ranked by a similarity metric like cosine.
  5. Use the results. Return them as search hits, or stuff them into an LLM prompt as context. That second path is RAG.
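
The five steps can be sketched end to end without calling any external service. This is a minimal illustration under loud assumptions: `toy_embed` is a hypothetical stand-in for a real embedding model (it counts words from a corpus vocabulary, so it captures word overlap, not meaning), and a plain NumPy array plays the role of the vector index. With a real model, only `toy_embed` would need to change.

```python
import numpy as np

# Steps 1-2 input: the chunks of your corpus.
chunks = [
    "reset your password from the account settings page",
    "export an invoice as a PDF from billing",
    "invite a teammate to your workspace",
]

# Hypothetical toy "model": one dimension per word seen in the corpus.
vocab = {w: i for i, w in enumerate(sorted({w for c in chunks for w in c.lower().split()}))}

def toy_embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model: a unit-length bag-of-words vector."""
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Steps 1-2: embed every chunk once and store the vectors (the "index").
index = np.stack([toy_embed(c) for c in chunks])

def search(query: str, k: int = 2) -> list[tuple[str, float]]:
    # Step 3: embed the query with the SAME function used for the chunks.
    q = toy_embed(query)
    # Step 4: vectors are unit length, so the dot product is cosine similarity.
    scores = index @ q
    top = np.argsort(-scores)[:k]
    # Step 5: return ranked hits (or paste them into an LLM prompt for RAG).
    return [(chunks[i], float(scores[i])) for i in top]

hits = search("how do I reset my password")
print(hits[0][0])  # the password-reset chunk ranks first
```

Because the toy model only sees shared words, it would still miss "account recovery" for this query. Closing that gap is what a real embedding model is for.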

Same model, both sides

Documents and queries MUST be embedded with the same model. Two models produce two incompatible coordinate systems, so distances between them are meaningless noise. This is the single most common reason vector search "returns garbage."

Similarity metrics: how "close" is measured

"Nearest" needs a definition of distance. Three metrics dominate, and they answer subtly different questions. The wrong choice quietly degrades your results.

| Metric | Measures | Range | Use when |
| --- | --- | --- | --- |
| Cosine similarity | Angle between vectors (direction only) | -1 to 1 (higher = closer) | The default for text. Ignores magnitude, so document length doesn't skew results. |
| Dot product | Direction AND magnitude combined | Unbounded | Vectors are already normalized (then it equals cosine), or magnitude carries signal. Fastest to compute. |
| Euclidean (L2) | Straight-line distance between points | 0 to ∞ (lower = closer) | Image/audio embeddings, or when absolute position matters. Sensitive to magnitude. |

Pick the metric your model was trained for, usually cosine.

For text, cosine is almost always right. Two sentences about the same topic point in the same direction even if one is a paragraph and the other a phrase: cosine cares about *direction*, not *length*. A handy fact: if you normalize your vectors to unit length first, cosine similarity and dot product become identical, and dot product is cheaper. Many vector DBs do this for you, which is why their docs say "use dot product for cosine results."
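
To make the difference between the three metrics concrete, here is a small sketch on hand-picked 3-d vectors. The vectors and helper names (`phrase`, `paragraph`, `other`, `normalize`) are made up for illustration; real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

phrase = np.array([1.0, 2.0, 2.0])   # magnitude 3
paragraph = 10 * phrase              # same direction, 10x the magnitude
other = np.array([2.0, 0.0, 1.0])    # a different direction

# Same direction: cosine ignores the 10x magnitude gap, the others do not.
print(cosine(phrase, paragraph))             # 1.0
print(float(np.dot(phrase, paragraph)))      # 90.0
print(euclidean(phrase, paragraph))          # 27.0

# After normalizing to unit length, dot product and cosine agree.
dot_of_units = float(np.dot(normalize(phrase), normalize(other)))
print(np.isclose(dot_of_units, cosine(phrase, other)))  # True
```

The first three lines show why cosine is the safe default for text: `phrase` and `paragraph` are a perfect match by direction, yet Euclidean distance puts them 27 units apart.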

Cosine similarity in code

Here's the whole idea in one runnable script: embed two texts, then measure how aligned their vectors are. Swap the model for any embedding API or local model; the math is identical.

similarity.py
python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # dot product divided by the product of magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = embed("how do I reset my password")
v2 = embed("steps to recover my account")
v3 = embed("export an invoice as a PDF")

print(round(cosine_similarity(v1, v2), 3))  # ~0.82  -> very similar
print(round(cosine_similarity(v1, v3), 3))  # ~0.21  -> unrelated

Two phrasings of the same intent score high (~0.82) despite sharing almost no words; an unrelated task scores low (~0.21). Exact values vary by model, but that gap is the entire value of embeddings. Vector search is just doing this comparison against millions of stored vectors and returning the highest scorers.

Exact vs approximate search: the recall/speed trade-off

The naive way to find nearest neighbors is exact search: compute the similarity between the query and *every* stored vector, then sort. This is perfectly accurate, and perfectly unscalable. Ten thousand vectors? Fine. Ten million? Every single query now does ten million comparisons. Latency dies.
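
Exact search is short enough to write out in full. The sketch below assumes the vectors fit in memory as one NumPy array and uses random data in place of real embeddings; `exact_top_k` is an illustrative name, not a library function.

```python
import numpy as np

def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Brute-force cosine search: score every stored vector, keep the best k."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                              # one comparison per stored vector
    top = np.argpartition(-scores, k - 1)[:k]   # the k best, in no particular order
    return top[np.argsort(-scores[top])]        # sort only those k, best first

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64))           # stand-in for stored embeddings
query = vectors[42] + 0.01 * rng.normal(size=64)  # a query very close to vector 42

ids = exact_top_k(query, vectors, k=5)
print(int(ids[0]))  # 42
```

The `v @ q` line is the whole cost: it touches every row on every query, so the work grows linearly with the size of the corpus.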

Approximate Nearest Neighbor (ANN) search trades a sliver of accuracy for orders-of-magnitude speed. Instead of scanning everything, it builds a smart index that lets a query jump to the right neighborhood and only compare against a small candidate set. Two families dominate:

  • HNSW (Hierarchical Navigable Small World) builds a layered graph where each vector links to its neighbors. A query enters at the top layer, greedily hops toward closer nodes, and descends. Excellent recall and very fast queries; the trade-off is higher memory use. The default in most vector DBs.
  • IVF (Inverted File Index) clusters vectors into buckets (via k-means), then at query time only searches the few buckets nearest the query. Lower memory, great for huge datasets; recall depends on how many buckets you probe (the nprobe knob).
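
The IVF idea is simple enough to sketch in NumPy. This is a toy version under stated assumptions, not how any particular vector DB implements it: `build_ivf` and `ivf_search` are hypothetical names, the k-means is a few plain Lloyd iterations, and distances are squared L2 (which ranks unit-length vectors in the same order as cosine).

```python
import numpy as np

def build_ivf(vectors, n_lists, n_iters=10, seed=0):
    """Cluster vectors with a few k-means rounds; return centroids and buckets."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)].copy()

    def assign_to_nearest(cents):
        d = ((vectors[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    for _ in range(n_iters):
        assign = assign_to_nearest(centroids)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members) > 0:          # leave an empty cluster's centroid alone
                centroids[c] = members.mean(axis=0)

    assign = assign_to_nearest(centroids)  # final assignment against final centroids
    buckets = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, buckets

def ivf_search(query, vectors, centroids, buckets, k, nprobe):
    """Search only the nprobe buckets whose centroids are closest to the query."""
    centroid_d = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(centroid_d)[:nprobe]
    candidates = np.concatenate([buckets[c] for c in probe])
    d = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 16))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit length
query = vectors[7]

centroids, buckets = build_ivf(vectors, n_lists=20)
fast = ivf_search(query, vectors, centroids, buckets, k=5, nprobe=2)   # 2 of 20 buckets
full = ivf_search(query, vectors, centroids, buckets, k=5, nprobe=20)  # every bucket
print(fast, full)
```

With `nprobe` equal to the number of buckets, the search degenerates to exact search. Lower values scan fewer candidates and may drop true neighbors that sit in unprobed buckets.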

The key word is *approximate*: ANN might miss the true #3 result and return the true #4 instead. We measure this with recall: the fraction of the true top-k that the index actually returned. Tuning ANN is choosing a point on a curve: push parameters (HNSW's ef_search, IVF's nprobe) up for higher recall and slower queries, or down for blazing speed and a few missed neighbors. For most apps, 95-99% recall at a fraction of the latency is a trade you'll happily take.
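
Recall is easy to compute once you have an exact result to compare against. A minimal sketch, with made-up ID lists standing in for the output of an exact search and an ANN search over the same query:

```python
def recall_at_k(true_ids, approx_ids) -> float:
    """Fraction of the true top-k that the approximate search also returned."""
    true_set = set(true_ids)
    return len(true_set & set(approx_ids)) / len(true_set)

true_top5 = [3, 7, 1, 9, 4]    # hypothetical exact-search result
approx_top5 = [3, 7, 1, 4, 8]  # hypothetical ANN result: missed 9, returned 8

print(recall_at_k(true_top5, approx_top5))  # 0.8
```

In practice you would average this over a sample of real queries, then raise or lower the index parameter until the average meets your target.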

When exact is fine

Under ~10k vectors, just use exact (brute-force) search. pgvector and others do it well, and you skip all index tuning. Reach for ANN when your corpus and latency budget actually demand it, not before.

Common mistakes that cost hours

  1. Mismatched embedding models. Embedding documents with one model and queries with another puts them in different coordinate systems. Results look random. Pin one model for both sides and re-embed everything if you ever switch.
  2. Wrong distance metric. Using Euclidean on vectors built for cosine (or vice versa) silently ranks the wrong things first. Check your model card, which usually states the metric the model was trained for, and configure your index to match.
  3. No normalization. If you use dot product expecting cosine behavior but never normalize to unit length, longer documents get unfairly boosted by their larger magnitude. Normalize, or use a metric that ignores magnitude.
  4. Embedding raw, oversized chunks. Cramming a 4,000-word page into one vector blurs many ideas into one average point. Chunk to coherent passages (see RAG architecture) so each vector represents a single idea.
  5. Chasing 100% recall. Demanding perfect ANN recall throws away the entire speed advantage. Set a recall target (say 98%), tune to it, and move on.
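
For mistake 4, one simple way to chunk is to pack whole paragraphs into pieces under a word budget. The sketch below is only one possible strategy, with assumptions of its own: `chunk_paragraphs` is a hypothetical helper, paragraphs are assumed to be separated by blank lines, and the 200-word default is an arbitrary starting point rather than a recommendation.

```python
def chunk_paragraphs(text: str, max_words: int = 200) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

page = "Reset your password.\n\nExport an invoice as PDF.\n\nInvite a teammate."
print(len(chunk_paragraphs(page, max_words=6)))  # 3
print(len(chunk_paragraphs(page, max_words=8)))  # 2
```

Each returned chunk is then embedded on its own, so one vector covers one passage instead of a whole page.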

Takeaways

The whole article in seven lines

  • An embedding is a vector: a point in meaning-space where similar texts sit close.
  • Search becomes geometry: to find relevant text, find the nearest vectors.
  • Embed documents and queries with the *same* model, always.
  • Use **cosine** for text; normalize and it equals dot product (which is faster).
  • Exact search is accurate but O(n) per query, so it is fine only for small corpora.
  • ANN indexes (HNSW, IVF) trade a little recall for huge speed gains.
  • Tune to a recall target (~95-99%); don't chase a perfect 100%.

Where to go next

Embeddings and vector search are the *retrieval* half of modern AI apps. The natural next step is wiring them into a generation loop, which is exactly what RAG does. And if the geometry here felt magical, understanding how the models that produce these vectors are trained will demystify the rest.
