AI System Design Newsletter

Inside the Vector Retrieval Stack: Embeddings, HNSW, and Chunking

May 24, 2026

The model that scored 94% on the benchmark and failed on the first real question

A legal tech company deployed a retrieval-augmented generation system for contract review. Their RAG pipeline retrieved the five most relevant document chunks for each question and fed them to a 70B model. On their internal evaluation benchmark — 500 questions written by their engineering team against a set of sample contracts — the system scored 94% accuracy.

On the first day of real user traffic, a senior associate at a law firm asked a straightforward question: “What are the indemnification obligations of the service provider under the master services agreement signed in March 2022?”

The system retrieved five chunks. None of them contained the indemnification clause. The clause lived in section 14.3 of the contract, a section that preprocessing had split across a chunk boundary. Half the clause was in one chunk, the other half in the next. Neither chunk alone contained enough of the clause to rank among the five results. The model answered based on what it received and got the answer wrong.

The benchmark had been built on sample contracts whose clauses never straddled a chunk boundary. Real contracts don’t care about chunk boundaries.
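To make the failure concrete, here is a minimal sketch of how fixed-size chunking can cut a clause in two. The contract text, the 100-character chunk size, and the helper names (chunk_fixed, chunk_by_section, chunks_containing) are all invented for illustration; the company's actual preprocessing is not described in this post. The section-aware splitter is one possible alternative, assuming sections start with a number like "14.3".

```python
import re


def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_section(text: str) -> list[str]:
    """Split text before each section number such as '14.3 ' (assumed format)."""
    parts = re.split(r"(?=\b\d+\.\d+ )", text)
    return [p.strip() for p in parts if p.strip()]


def chunks_containing(chunks: list[str], phrase: str) -> list[int]:
    """Return indices of the chunks that contain `phrase` in full."""
    return [i for i, chunk in enumerate(chunks) if phrase in chunk]


# Hypothetical contract text; the wording is made up for this example.
clause = (
    "14.3 The Service Provider shall indemnify the Client "
    "against all third-party claims."
)
contract = (
    "14.2 Fees are payable within thirty days. "
    + clause
    + " 14.4 This clause survives termination."
)

fixed_chunks = chunk_fixed(contract, size=100)
section_chunks = chunk_by_section(contract)

# The clause straddles the 100-character boundary, so no fixed chunk holds it whole.
fixed_hits = chunks_containing(fixed_chunks, clause)      # []
# Splitting on section numbers keeps the clause in a single chunk.
section_hits = chunks_containing(section_chunks, clause)  # [1]
```

A benchmark built only on contracts where clauses happen to fit inside a chunk would never exercise the first case.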

This is the chunking problem. And the vector database sitting underneath that chunk retrieval is the reason the wrong chunks were returned with high confidence.


Turning words into numbers — and why the numbers have to preserve meaning

The fundamental challenge in retrieval is that computers can’t read. They can only compare numbers. So the first step in any retrieval system is turning text into numbers in a way that preserves semantic meaning — similar meanings produce similar numbers, different meanings produce different numbers.

These numbers are called embeddings: dense vectors of floating-point values, commonly 384 to 4,096 dimensions depending on the model. An embedding model, such as a BERT-style encoder or a more recent E5 or BGE model, takes a piece of text and outputs one vector. The key property: texts that mean similar things produce vectors that point in similar directions in this high-dimensional space. With a good model, “The party shall indemnify” and “the vendor assumes liability” produce vectors that are close together. “The cat sat on the mat” produces a vector far away from both.

Once you have embeddings, retrieval is a nearest-neighbour search problem. You embed the query, then find which document embeddings are closest to the query embedding in vector space. The standard similarity measure is cosine similarity: the cosine of the angle between two vectors, which ignores their lengths. A cosine similarity of 1 means the vectors point in exactly the same direction, the strongest possible match. A value of 0 means they’re perpendicular, so the embeddings show no relationship. A negative value means they point away from each other.
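As a rough illustration of the search step, the sketch below computes cosine similarity by hand and ranks a few chunks against a query by brute force. The three-dimensional vectors and chunk names are toy values invented for this example; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: dict[str, list[float]], k: int) -> list[str]:
    """Exact nearest-neighbour search: score every document, keep the best k."""
    ranked = sorted(
        docs,
        key=lambda name: cosine_similarity(query, docs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumed values, not produced by any real model).
docs = {
    "indemnify_clause": [0.9, 0.1, 0.0],
    "liability_clause": [0.8, 0.3, 0.1],
    "cat_on_mat": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

nearest = top_k(query, docs, k=2)  # ['indemnify_clause', 'liability_clause']
```

Scoring every document like this is exact but scales linearly with the number of chunks, which is why large collections use an approximate index instead.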

Think of the embedding space like a city. Each document chunk is a building at a specific address. Semantically similar documents are in the same neighbourhood. When a query arrives, you drop a pin at the query’s address and find the nearest buildings. The buildings closest to your pin are the most relevant chunks.

Here’s the key thing. Once your chunks are fixed, the quality of your retrieval comes down to two things: how well your embedding model preserves semantic meaning, and how accurately your vector index finds the true nearest neighbours at scale.
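The second factor can be measured directly. One common way, sketched here, is recall@k: take the true top-k neighbours from an exact brute-force search and check how many of them the index returned. The function name and the chunk IDs below are hypothetical, and the "index results" are hard-coded rather than coming from a real vector database.

```python
def recall_at_k(exact_ids: list[str], index_ids: list[str]) -> float:
    """Fraction of the true top-k neighbours that the index also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(index_ids)) / len(exact)


# Hypothetical results for one query with k = 5.
exact_top5 = ["c7", "c2", "c9", "c4", "c1"]  # from brute-force search
index_top5 = ["c7", "c9", "c4", "c3", "c8"]  # from an approximate index

score = recall_at_k(exact_top5, index_top5)  # 3 of the 5 true neighbours found
```

Averaging this score over a sample of real queries gives a rough picture of how often the index misses chunks that an exact search would have found.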
