Two bits per number, and the search still works

I spent another weekend on a small thing. This one is a movie search engine that understands mood instead of keywords. You type tense but not depressing, under 100 minutes, after 2010 or 80s action energy but modern, and it hands back movies that actually feel like that — not movies whose titles happen to contain the word “tense.”

The whole thing is up on GitHub. It’s the Kaggle TMDB dataset (930k movies), all-MiniLM-L6-v2 for embeddings, a little natural-language filter parser, and a Streamlit front end. Nothing exotic.

What I want to talk about here isn’t the search app — it’s the thing storing the vectors. Because the obvious way to do semantic search is to keep 930k embeddings as float32, eat the 31 GB of RAM, and call it a day. I didn’t do that. I used a library called Turbovec, which takes each 384-dimensional embedding and squashes every number down to two bits. And the search still works. That’s the part that kept me up reading.

The setup: what a vector index has to do

A semantic search index has one job. You hand it a query vector, it finds the stored vectors closest to it, and it returns them ranked. The brute-force version is a dot product against every vector in the database, which is exact but slow and memory-hungry. Everyone in this space is trading away a little exactness to buy back speed and memory.

The usual trade is product quantization — FAISS’s IndexPQ is the canonical version. You chop each vector into sub-vectors, run k-means over your data to learn a codebook of representative chunks, and store each sub-vector as the index of its nearest codebook entry. It works well. But notice the catch hiding in there: you have to train the codebook on your data. Every new dataset, every distribution shift, means another k-means pass. The compression is data-dependent.

Turbovec doesn’t do that. And that’s the whole point.

The trick: quantization that never looks at your data

Turbovec is a Rust library with Python bindings that implements TurboQuant, an algorithm out of Google Research. The thing that made me stop and reread the README is that the quantization is data-oblivious. There is no training step. The codebooks aren’t learned from your vectors — they’re derived from mathematics, ahead of time, and they’re the same for everybody.

How do you compress data well without looking at it? The pipeline is a small stack of clean ideas:

Normalize. Pull each vector’s magnitude off to the side so you’re only quantizing direction. Length gets stored separately.
Randomly rotate. Apply an orthogonal rotation to every vector. This is the clever bit — after a random rotation, the individual coordinates of a unit vector follow a predictable Beta distribution. You no longer need to know anything about your data, because the rotation makes every dataset’s coordinates look statistically the same.
Quantize against precomputed buckets. Now that you know the distribution coordinates will follow, you can compute the optimal Lloyd-Max bucket boundaries once, from the math, and reuse them forever. Four buckets for 2-bit, sixteen for 4-bit.
Bit-pack. Jam those tiny bucket indices into tightly packed bytes.
Renormalize the length. Store a per-vector correction factor that debiases the inner-product estimate — an idea borrowed from RaBitQ — so the compressed dot products don’t systematically drift.

The headline result from the paper is that this lands within 2.7× of the Shannon information-theoretic lower bound on distortion. In plain terms: it compresses almost as well as anything could, and it does it without a single pass over your data.

For my movie app this is invisible and wonderful. I run one script, it writes a 2-bit and a 4-bit index, and that’s it. No “fit” call, no training set, no codebook to version. The 31 GB of float32 a 10M-document corpus would need drops to about 4 GB at 4-bit — roughly an 8× cut — and you can go smaller still at 2-bit.

from turbovec import IdMapIndex

index = IdMapIndex(dim=384, bit_width=2)
index.add_with_ids(embeddings, movie_ids)
index.write("movies_2bit.tvim")

That’s the entire indexing step. Compare that to the k-means dance and it feels like cheating.

The other half: making compressed search actually fast

Compressing the vectors is only worth it if scoring the compressed ones is fast. A 2-bit vector you have to decompress before every comparison buys you nothing.

Turbovec’s answer is hand-written SIMD kernels — NEON for ARM, AVX-512 for x86 with an AVX2 fallback — that score the packed bytes directly using nibble-split lookup tables. It never inflates the vectors back to floats. On an M3 Max it beat FAISS by 12–20%, and on a Xeon it won most 4-bit configs too, while keeping recall ahead of FAISS’s IndexPQ by anywhere from 0.4 to 3.4 points at the top of the ranking. Smaller, and faster, and slightly more accurate. Pick three.

But the feature that actually made my app possible is filtering inside the kernel. My mood search isn’t pure vector search — almost every query has hard constraints stapled to it. “Under 100 minutes.” “After 2010.” “Not horror.” The way you’d normally handle that is ugly: either you over-fetch a big pile of vector results and filter them down afterward (and pray you fetched enough to still have k left), or you filter first and then do a slow brute-force scan over whatever survived.

Turbovec lets you pass an allowlist straight into the search call, and it honors it inside the SIMD kernel, at block granularity, short-circuiting anything that isn’t allowed before it wastes cycles scoring it.

# allowed_ids comes from my metadata filters: runtime, year, genre, language...
scores, ids = index.search(query_emb, k=20, allowlist=allowed_ids)

So my pipeline is: embed the query, parse the natural-language constraints into a metadata allowlist, and let Turbovec do filtered nearest-neighbor in one shot. Filtered search runs as fast as unfiltered search. No over-fetching, no recall cliff, no second code path. The constraint and the similarity get resolved together instead of fighting each other.

Why I bothered

I built the movie thing because I wanted a search box I could talk to like a person, and I built it on Turbovec because I wanted to feel the difference between “compression you have to train” and “compression that’s just true.” Reading that TurboQuant is data-oblivious is one thing. Watching one script turn 930k embeddings into a 2-bit index with no fit step, then typing short emotional indie movie I can finish tonight and getting the right films back in a few milliseconds, is another.

There’s a pattern here that rhymes with something I wrote about last week. That post was about how the next useful model is often not a new model — it’s a good adapter between two existing ones. This is the same shape pointed at a different problem. The clever move in TurboQuant isn’t a bigger network or more data. It’s a random rotation that makes everyone’s data look the same, so the hard work can be done once, in advance, by math instead of by training.

The best compression sometimes isn’t learned from your data at all. It’s a transformation that makes your data stop being special — so the answer can be precomputed for everyone.

If you want to play with it, clone the repo, run the pipeline, and ask it for something oddly specific. Try cozy but a little sad, nothing scary and watch it quietly skip the horror section. I promise that’s more fun than it sounds.