aidatabasessearchengineering

Vector Indexes: How AI Apps Search at Scale

Traditional indexes break when your data isn't a number or a string. Vector indexes solve a fundamentally different problem finding meaning instead of matching text.

June 28, 2026·6 min read

If you've used semantic search, a recommendation engine, or any RAG-based AI app, you've relied on vector indexes without knowing it. They're the reason those systems feel intelligent rather than just fast.

Here's how they work.

The Problem With Traditional Indexes

A B-tree index is brilliant for exact matches. "Give me all users where email = 'x@y.com'" — perfect. It's also great for range queries. But ask it "find me something semantically similar to this sentence" and it falls apart completely.

The core issue: traditional indexes operate on discrete, comparable values. The distance between "cat" and "dog" isn't something a B-tree knows or cares about.

Vector indexes operate on a different primitive: embeddings.

What's an Embedding?

An embedding is a list of floating-point numbers — a point in high-dimensional space — that represents the semantic meaning of something. Text, images, audio, code — all of it can be embedded.

Modern embedding models (like OpenAI's text-embedding-3-small or sentence-transformers) convert any input into a fixed-length vector:

"The quick brown fox" → [0.12, -0.84, 0.33, 0.07, ... ] (1536 numbers)

The key property: semantically similar things produce vectors that are geometrically close. "Dog" and "puppy" end up near each other. "Dog" and "mortgage" end up far apart.

A vector index's job is to answer: "given a query vector, find the K vectors in my dataset that are closest to it."

Why Not Just Brute-Force It?

For 1,000 vectors, a brute-force scan (compute distance to every vector, return the closest K) is fine. For 10 million vectors, you're doing 10M distance calculations per query. At 1536 dimensions each. That's ~60GB of computation per query. Not viable.

This is the Approximate Nearest Neighbor (ANN) problem, and it's what vector index algorithms are designed to solve.

The key word is approximate. Every ANN algorithm trades some recall accuracy for orders-of-magnitude faster search. In practice, 95–99% recall is indistinguishable from perfect for most applications.

The Main Approaches

1. HNSW (Hierarchical Navigable Small World)

The current gold standard. HNSW builds a layered graph structure where each node (vector) is connected to its nearest neighbors. The graph has multiple layers — upper layers are sparse (for coarse navigation), lower layers are dense (for fine-grained search).

At query time, you enter at the top layer, greedily navigate toward your target, then descend layer by layer until you've found the approximate nearest neighbors at the bottom.

Characteristics:

Excellent recall (typically 99%+ at reasonable ef values)
Fast query speed
High memory usage — the graph lives in RAM
Slow index build time for very large datasets

Used by: Weaviate, Qdrant, pgvector, Chroma

2. IVF (Inverted File Index)

IVF clusters your vectors using k-means first. Each cluster gets a centroid. At query time, you find the nearest N centroids, then search only within those clusters.

Think of it like postal codes — you narrow down to the right neighborhood before doing a house-by-house search.

Characteristics:

Lower memory footprint than HNSW
Good for very large datasets that don't fit in RAM
Recall depends heavily on how many clusters (nprobe) you search at query time
Requires training on your data before building

Often paired with Product Quantization (IVF-PQ) to compress vectors and reduce memory further.

Used by: Faiss (Meta), Pinecone (under the hood)

3. LSH (Locality-Sensitive Hashing)

LSH uses random hash functions with the property that similar vectors tend to hash to the same bucket. You hash your query vector, look up that bucket, search only within it.

Characteristics:

Very fast, deterministic
Lower recall than HNSW
Simple to implement and understand
Falling out of favor for high-recall use cases

4. ScaNN (Google)

Google's in-house algorithm. Uses anisotropic quantization — it compresses vectors in a direction-aware way that prioritizes recall on high-dot-product pairs over L2-close pairs. Particularly good for maximum inner product search (MIPS), which is relevant for recommendation systems.

Distance Metrics

The "closest" vector depends on what distance function you use:

Cosine similarity — measures the angle between vectors, ignoring magnitude. Best for text embeddings where direction carries meaning but scale doesn't.

Euclidean distance (L2) — measures straight-line distance. Better when magnitude matters.

Dot product — fast to compute, used in recommendation and MIPS scenarios.

Most embedding models are trained to work with cosine similarity. When in doubt, use cosine.

The Index Parameters That Actually Matter

ef_construction (HNSW): How thoroughly the graph is built. Higher = better recall, slower build time. Set it once and forget it.

M (HNSW): Max connections per node per layer. Higher = better recall, more memory. Typical values: 16–64.

ef (HNSW, query time): How thoroughly the graph is searched per query. This is the knob you tune at runtime to trade speed vs recall. Higher ef = slower but more accurate.

nprobe (IVF): Number of clusters to search. The IVF equivalent of ef. Increase it when recall is too low.

nlist (IVF): Number of clusters. Rule of thumb: sqrt(N) where N is your dataset size.

Quantization: Fitting More Into RAM

Full-precision vectors (float32) cost 4 bytes per dimension. A 1536-dimensional vector takes ~6KB. 10 million of them: ~60GB RAM. Most machines can't hold that.

Product Quantization (PQ) splits each vector into subvectors, trains a small codebook on each, and replaces each subvector with a codebook index. You go from 6KB per vector to maybe 64 bytes. 12x–100x compression with modest recall loss.

Scalar Quantization (SQ8) simply reduces float32 → int8. 4x compression, minimal recall loss. Often the right first step before considering PQ.

pgvector recently added quantization support. Qdrant has had it for a while.

Where This Shows Up in Real Systems

RAG (Retrieval-Augmented Generation): Embed your documents, store in a vector DB, embed the user's query, retrieve top-K chunks, feed to an LLM. The vector index is the retrieval layer.

Semantic search: Search by meaning instead of keywords. "Find me articles about machine learning" retrieves posts about neural networks, deep learning, AI — not just exact keyword matches.

Recommendations: Embed users and items into the same space. Find items closest to the user's embedding (or items similar to what they've interacted with).

Deduplication: Find near-duplicate documents, images, or code by searching for vectors above a similarity threshold.

Choosing a Vector Database

Database	Index	Hosted	Best For
pgvector	HNSW, IVF	Via Supabase/Neon	Existing Postgres users
Pinecone	Proprietary	Yes	Simplicity at scale
Qdrant	HNSW	Self-host or cloud	Control + performance
Weaviate	HNSW	Both	Multi-modal, large graphs
Chroma	HNSW	Self-host	Local dev / prototyping
Faiss	Many	No (library)	Research, custom pipelines

For most projects starting out: pgvector if you're already on Postgres, Qdrant if you want a dedicated vector DB with good ergonomics.

The Catch

ANN indexes are approximate. You will miss some true nearest neighbors. For most applications this doesn't matter — the difference between the 1st and 15th closest vector is semantically negligible. But for fraud detection, medical diagnostics, or other high-stakes recall scenarios, you need to think carefully about your ef and nprobe values and measure actual recall against ground truth.

Also: the quality of your index is bounded by the quality of your embeddings. A great index on weak embeddings still returns irrelevant results. The embedding model is the more important choice.

Vector indexes are one of those foundational pieces that most engineers never need to understand deeply — until they do. When your semantic search starts returning garbage at scale, or your RAG pipeline is too slow to be useful, this is where the answers live.