Embeddings

Numerical representations of text as vectors in a high-dimensional space, where semantically similar content is positioned close together, enabling meaning-based search and comparison.

The honest version

An embedding is what you get when you ask a neural model to compress a piece of text into a fixed-length list of numbers. That list encodes meaning, or rather, the model’s learned approximation of meaning based on everything it was trained on.

The key property is proximity: text that means similar things produces vectors that are close together in the embedding space, with closeness commonly measured by cosine similarity. “The file could not be saved” and “Saving failed” end up near each other. A sentence that uses “save” in the data storage sense and one that uses it in the rescue sense do not. This geometric relationship is what makes semantic search possible: instead of matching exact strings, you find near neighbors.
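A minimal sketch of what "close together" means in practice, assuming cosine similarity as the distance measure. The three-dimensional vectors are hand-written for illustration and do not come from any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "The file could not be saved": [0.9, 0.1, 0.0],
    "Saving failed": [0.8, 0.2, 0.1],
    "The lifeguard saved the swimmer": [0.1, 0.1, 0.9],
}

def nearest(query, vectors):
    """Rank every other text by similarity to `query`, closest first."""
    q = vectors[query]
    scored = [
        (cosine_similarity(q, v), text)
        for text, v in vectors.items()
        if text != query
    ]
    return sorted(scored, reverse=True)

ranked = nearest("The file could not be saved", toy_vectors)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

With these toy values, “Saving failed” ranks first and the rescue sentence ranks far behind it, which is the behavior a real embedding model is expected to approximate.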

The number of dimensions (commonly 768, 1536, or 3072) matters less than which model produced the embeddings and what it was trained on. An embedding model trained on legal corpora will cluster legal terminology differently than one trained on general web text.

Why it matters for translation

Traditional TM fuzzy matching is string-based. It finds segments that share words and characters with the source and scores them by edit distance. An 85% fuzzy match can be close in string terms and entirely different in meaning.
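To make the string-based scoring concrete, here is a rough sketch of an edit-distance fuzzy score. It assumes character-level Levenshtein distance normalized by the longer string; commercial TM tools use their own scoring formulas, so treat `fuzzy_score` as a hypothetical stand-in, not any tool's actual algorithm.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def fuzzy_score(a, b):
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

score = fuzzy_score("Enable the firewall", "Disable the firewall")
print(f"{score:.2f}")  # prints 0.85
```

The two segments are three edits apart over twenty characters, so they score 85% under this formula while instructing the user to do opposite things.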

Embedding-based retrieval finds segments that are semantically similar, regardless of surface form. “Click Save to continue” and “Select Save to proceed” will score low on string fuzzy match but high on semantic similarity, and rightly so, because the approved translation of one is almost certainly the correct translation of the other.

For RAG pipelines in localization, embedding quality largely determines TM hit rate and term retrieval quality. Better embeddings mean more relevant context surfaces in the prompt, which means better output, which means less post-editing.
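The retrieval step of such a pipeline can be sketched as follows. Everything here is an assumption made for illustration: the function names, the `top_k` and `min_score` parameters, the toy vectors, and the German target strings, which are placeholders and not approved translations. A real pipeline would get its vectors from an embedding model and usually store them in a vector index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_tm_context(query_vector, tm_index, top_k=2, min_score=0.75):
    """Return up to top_k (score, entry) pairs scoring at least min_score."""
    scored = [(cosine_similarity(query_vector, e["vector"]), e) for e in tm_index]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def build_prompt(source_text, matches):
    """Assemble a prompt that surfaces the retrieved TM entries as context."""
    lines = ["Translate into German. Reuse these approved translations where they apply:"]
    for score, entry in matches:
        lines.append(f"- {entry['source']} => {entry['target']} (similarity {score:.2f})")
    lines.append(f"Source: {source_text}")
    return "\n".join(lines)

# Hypothetical TM entries with hand-written toy vectors.
tm_index = [
    {"source": "Select Save to proceed",
     "target": "Wählen Sie Speichern, um fortzufahren",
     "vector": [0.9, 0.1, 0.1]},
    {"source": "Delete the file",
     "target": "Datei löschen",
     "vector": [0.1, 0.9, 0.1]},
]

query_text = "Click Save to continue"
query_vector = [0.85, 0.15, 0.1]  # stand-in for the model's output

matches = retrieve_tm_context(query_vector, tm_index)
prompt = build_prompt(query_text, matches)
print(prompt)
```

The `min_score` cutoff is the design choice that matters most: set it too low and irrelevant segments dilute the prompt, set it too high and useful matches never surface.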

Where it fails

Embeddings are not language-neutral unless you use a multilingual model, and even multilingual models encode some languages better than others. If your retrieval is multilingual but your embedding model was trained on data that is, say, 95% English, your French-to-German retrieval is likely to underperform.
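One way to check this rather than assume it is to measure retrieval on segments whose correct match is already known. The sketch below computes recall@1 for a language pair; the segment IDs and two-dimensional vectors are invented for illustration, and in practice the vectors would come from the embedding model under evaluation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recall_at_1(query_vectors, candidate_vectors):
    """Fraction of queries whose best-scoring candidate shares their ID."""
    hits = 0
    for query_id, query_vec in query_vectors.items():
        best_id = max(
            candidate_vectors,
            key=lambda cid: cosine_similarity(query_vec, candidate_vectors[cid]),
        )
        if best_id == query_id:
            hits += 1
    return hits / len(query_vectors)

# Hypothetical vectors for aligned French and German segments (same ID = same segment).
french = {"seg1": [0.9, 0.1], "seg2": [0.6, 0.5]}
german = {"seg1": [0.8, 0.2], "seg2": [0.2, 0.9]}

pair_recall = recall_at_1(french, german)
print(pair_recall)
```

Running the same measurement per language pair makes weak pairs visible before they show up as poor retrieval in production.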

Domain specificity is a real constraint. General-purpose embedding models conflate terms that are distinct in your domain. “Submit” in a UI context and “submit” in a procurement legal context produce similar embeddings in a general model, because in general text they are semantically close. In your product, they are not.

Embeddings also encode whatever biases exist in the training data. If the training corpus associates certain terminology with certain registers or regions, the embeddings will too. This is invisible: the model does not report when it is using a culturally specific prior.

Finally, stored embeddings are static. Each vector is computed once, when the segment is indexed, and it does not update when your content changes. Reindexing a large TM after a terminology refresh is an engineering task, not a free operation.
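One common way to keep reindexing cost down is to re-embed only what changed. The sketch below assumes you store a fingerprint next to each vector; the function names and the choice of SHA-256 over the model name plus segment text are illustrative assumptions, not a prescribed design.

```python
import hashlib

def fingerprint(text, model_name):
    """Hash of the embedding model name plus the segment text."""
    return hashlib.sha256(f"{model_name}\n{text}".encode("utf-8")).hexdigest()

def segments_to_reembed(segments, stored_fingerprints, model_name):
    """Return IDs of segments that are new or whose text or model changed."""
    stale = []
    for seg_id, text in segments.items():
        if stored_fingerprints.get(seg_id) != fingerprint(text, model_name):
            stale.append(seg_id)
    return stale

MODEL = "example-embedding-model"  # hypothetical model name

# State captured at the last indexing run.
indexed = {"s1": "Click Save to continue", "s2": "Submit the form"}
stored = {seg_id: fingerprint(text, MODEL) for seg_id, text in indexed.items()}

# After a terminology refresh, one segment has changed.
current = {"s1": "Click Save to continue", "s2": "Send the form"}
stale_ids = segments_to_reembed(current, stored, MODEL)
print(stale_ids)  # prints ['s2']
```

Including the model name in the fingerprint means that switching embedding models marks every segment as stale, which is the intended behavior: vectors from different models are not comparable.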
