
Vector database

A database built to store and search high-dimensional embeddings efficiently, using approximate nearest-neighbor algorithms to find semantically similar content at scale.

The honest version

A vector database is infrastructure. It does not understand language. It stores lists of numbers and finds other lists of numbers that are close to a query list of numbers, as fast as possible.
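Stripped of indexing tricks, the core operation can be sketched as an exact, brute-force search. The example below is a minimal illustration, not how any particular product works: the segment IDs and three-dimensional vectors are invented for readability (real embeddings have hundreds or thousands of dimensions), and the function names `cosine_similarity` and `nearest` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, store, k=2):
    """Return the IDs of the k stored vectors most similar to the query.

    This compares the query against every stored vector, which is exact
    but slow at scale. Approximate indexes such as HNSW exist to avoid
    this full scan.
    """
    scored = [(cosine_similarity(query, vec), key) for key, vec in store.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Toy data: made-up IDs and vectors, for illustration only.
store = {
    "seg-1": [0.9, 0.1, 0.0],
    "seg-2": [0.8, 0.2, 0.1],
    "seg-3": [0.0, 0.1, 0.9],
}

result = nearest([1.0, 0.0, 0.0], store, k=2)
print(result)
```

Nothing in this code knows what the numbers mean. It only ranks them by closeness, which is the point of the paragraph above.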

What makes it useful is what those numbers represent: if you have embedded your entire translation memory into vectors, a vector database can find the most semantically similar approved segments for any new source string in milliseconds. That retrieval is the engine behind semantic TM lookup and most RAG pipelines for localization.

Common options in production systems: pgvector (a Postgres extension, good for teams already on Postgres), Qdrant, Weaviate, Pinecone, and Chroma. The choice is usually about operational familiarity and latency requirements, not about any magical difference in the algorithms. Most implementations use HNSW or similar approximate nearest-neighbor structures.

Why it matters for translation

Traditional TM databases are optimized for exact and fuzzy string match. They are fast and reliable for that purpose, but they cannot find semantic matches: segments that mean the same thing in different words.
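The limitation is easy to reproduce with a character-based matcher. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a TM fuzzy-match score; commercial TM tools use their own scoring, so treat the numbers as illustrative only. The example strings are invented.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


source = "Enable notifications"

# Nearly the same characters, opposite meaning.
opposite = fuzzy_score(source, "Disable notifications")

# Same meaning, almost no shared characters.
paraphrase = fuzzy_score(source, "Turn on alerts")

print(f"opposite meaning: {opposite:.2f}")
print(f"paraphrase:       {paraphrase:.2f}")
```

The string matcher ranks the segment with the opposite meaning above the paraphrase, because it only sees characters.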

A vector database changes the retrieval question from “how similar are these strings?” to “how similar is the meaning?” For high-volume pipelines where source text varies but domain is consistent (product UI updates, regulatory document versioning, SaaS changelog localization), semantic retrieval typically surfaces more useful matches than string fuzzy matching, with fewer irrelevant near-misses.

In a RAG architecture, the vector database is the component that determines what context the model sees. Better retrieval means more relevant terminology and approved segments in the prompt, which directly improves output quality before the model generates a single token.
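To make that dependency concrete, here is a minimal sketch of the step between retrieval and generation. Everything in it is an assumption for illustration: the function name `build_prompt`, the prompt wording, and the sample segment pairs are hypothetical, and the retrieved list stands in for whatever your vector search returns.

```python
def build_prompt(source, retrieved, target_lang="de"):
    """Assemble a translation prompt from retrieved TM segments.

    `retrieved` is a list of (source_text, approved_target) pairs,
    assumed to arrive ordered best match first from the vector search.
    """
    lines = [
        f"Translate into {target_lang}. Reuse the approved wording where it applies.",
        "",
        "Approved segments:",
    ]
    for src, tgt in retrieved:
        lines.append(f"- {src} => {tgt}")
    lines += ["", f"Source: {source}", "Translation:"]
    return "\n".join(lines)


# Placeholder retrieval results; in a real pipeline these come from the index.
retrieved = [
    ("Delete your account", "Konto löschen"),
    ("Account settings", "Kontoeinstellungen"),
]

prompt = build_prompt("Delete my account permanently", retrieved)
print(prompt)
```

The model never sees the rest of the TM. If retrieval returns the wrong segments, the prompt carries the wrong terminology, and the model has no way to know.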

Where it fails

Approximate nearest neighbor means you do not always get the best match. You get a fast approximation that is usually the best match, with occasional misses. For most localization use cases, this is an acceptable trade-off. For legal or medical content where a missed segment match has compliance consequences, it is worth measuring the recall of your retrieval system against exact search rather than assuming it.
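One common way to put a number on those misses is recall at k: for a sample of queries, compare what the approximate index returned with what an exact search returns, and count the overlap. The sketch below assumes you can run both searches on the same data; the segment IDs are placeholders, not measurements from any real system.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k results that the approximate search found."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)


def mean_recall(pairs):
    """Average recall over (approx_ids, exact_ids) pairs, one pair per query."""
    return sum(recall_at_k(a, e) for a, e in pairs) / len(pairs)


# Placeholder results for two queries with k = 3.
pairs = [
    (["s1", "s2", "s3"], ["s1", "s2", "s3"]),  # all three found
    (["s4", "s9", "s6"], ["s4", "s5", "s6"]),  # s5 was missed
]

score = mean_recall(pairs)
print(f"mean recall@3: {score:.2f}")
```

Most engines expose parameters that trade speed for recall, so a measurement like this tells you whether the defaults are good enough for your content.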

Index quality can degrade over time. If you continuously add new segments to a vector index without periodic reindexing, retrieval quality can drift. Managing index freshness is an operational concern that is easy to underestimate.

Vector databases also carry costs at scale. Embedding an entire enterprise TM (millions of segments) takes real compute time or API spend, and the job has to be repeated whenever you change embedding models, because vectors from different models are not comparable. Storing and querying at production latency requires memory-intensive infrastructure. For smaller organizations, the overhead of maintaining a separate vector store may not be worth the marginal improvement over string-based fuzzy matching.
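A back-of-envelope estimate shows where the memory goes. The figures below are assumptions chosen for illustration: 5 million segments, 768 dimensions, and 4-byte floats. Substitute your own TM size and your embedding model's dimension.

```python
def raw_vector_bytes(num_segments, dimensions, bytes_per_value=4):
    """Memory for the vectors alone, stored as uncompressed floats.

    This excludes index structures (for example HNSW graph links),
    metadata, and replication, all of which add to the total.
    """
    return num_segments * dimensions * bytes_per_value


# Assumed workload, not a measurement.
total = raw_vector_bytes(num_segments=5_000_000, dimensions=768)
print(f"{total / 1e9:.2f} GB for vectors alone")
```

Under these assumptions the vectors alone come to about 15 GB before any index overhead, which is why the choice of dimension and any compression your database offers matter for cost.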

The retrieval is only as good as the embeddings. A vector database does nothing to fix a poor embedding model. If your embeddings don’t encode your domain distinctions, no amount of index tuning will surface the right segments.