Fine-tuning
Continuing the training of a pre-trained language model on a smaller, curated dataset to adapt its behavior to a specific domain, style, or task, permanently changing the model's weights.
The honest version
Fine-tuning changes the model. Retrieval-augmented generation (RAG) changes the context. Both matter, and confusing them leads to the wrong architecture for the job.
When you fine-tune, you run additional training passes on your own data: reviewed translation pairs, approved terminology in context, domain-specific parallel texts. The model’s weights update. After fine-tuning, the model does not need to see your style guide in the prompt; it has internalized it. For stable, high-volume domains where the terminology rarely changes, this can produce consistently better output with simpler, shorter prompts.
The training data is everything. A fine-tuned model is confident in whatever patterns it learned, including your mistakes. If your training set contains inconsistent terminology (“workspace” in some segments, “workbench” in others), the model will learn ambiguity, not precision.
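One way to catch this kind of inconsistency before training is to count how often each competing rendering of a term appears in the data. The following is a minimal sketch, not a production checker: the function names, the German source term, and the sample pairs are illustrative assumptions, and plain substring matching stands in for whatever term recognition a real pipeline would use.

```python
from collections import Counter


def variant_counts(pairs, source_term, target_variants):
    """Count target segments using each variant, for segments containing source_term.

    pairs: list of (source_text, target_text) tuples.
    Matching is a naive case-insensitive substring check (an assumption of this sketch).
    """
    counts = Counter()
    for source, target in pairs:
        if source_term.lower() not in source.lower():
            continue
        for variant in target_variants:
            if variant.lower() in target.lower():
                counts[variant] += 1
    return counts


def is_inconsistent(counts):
    """True when more than one variant is actually in use."""
    return sum(1 for n in counts.values() if n > 0) > 1


# Hypothetical training pairs, invented for illustration.
pairs = [
    ("Öffnen Sie den Arbeitsbereich.", "Open the workspace."),
    ("Der Arbeitsbereich ist leer.", "The workbench is empty."),
    ("Speichern Sie den Arbeitsbereich.", "Save the workspace."),
    ("Schließen Sie das Fenster.", "Close the window."),
]

counts = variant_counts(pairs, "Arbeitsbereich", ["workspace", "workbench"])
if is_inconsistent(counts):
    print("Inconsistent renderings:", dict(counts))
```

Segments flagged this way can be sent back for review or normalized to the approved term before they reach the training set.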
Why it matters for translation
Fine-tuning is most useful when two conditions hold: you have a large body of high-quality reviewed translations in a specific domain, and that domain is stable enough that the investment in retraining will not be obsolete in six months.
Practical cases where fine-tuning wins over RAG alone: highly specialized domains (medical device documentation, aerospace manuals) where the base model’s general training provides a poor prior; extremely high-volume pipelines where per-request retrieval latency is a real constraint; and situations where the target style is so distinctive that prompting alone cannot reliably produce it.
Fine-tuning and RAG are not mutually exclusive. A common production architecture uses fine-tuning for domain adaptation and RAG for up-to-date terminology retrieval.
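The combined setup can be sketched as a prompt builder: the fine-tuned model supplies the domain style, and only the terms that change over time are looked up and placed in the prompt. This is a rough illustration under stated assumptions. The in-memory glossary stands in for a real retrieval store, the prompt wording is invented, and the call to the fine-tuned model is deliberately left out because it depends on the serving stack.

```python
def retrieve_terms(segment, glossary):
    """Return glossary entries whose source term occurs in the segment.

    glossary: dict of source term -> current approved target term.
    A real system would query a term base or vector index; substring
    lookup is an assumption made to keep the sketch self-contained.
    """
    lowered = segment.lower()
    return {src: tgt for src, tgt in glossary.items() if src.lower() in lowered}


def build_prompt(segment, glossary):
    """Build a short prompt that carries only the retrieved terminology."""
    terms = retrieve_terms(segment, glossary)
    parts = []
    if terms:
        parts.append("Use these current term translations:")
        parts.extend(f"- {src} -> {tgt}" for src, tgt in sorted(terms.items()))
        parts.append("")  # blank line before the instruction
    parts.append(f"Translate: {segment}")
    return "\n".join(parts)


# Hypothetical glossary; updating an entry here needs no retraining.
glossary = {"Arbeitsbereich": "workspace", "Konto": "account"}
prompt = build_prompt("Öffnen Sie den Arbeitsbereich.", glossary)
print(prompt)
```

The prompt stays short because the style guide lives in the weights, while a product rename only requires editing the glossary entry.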
Where it fails
Fine-tuning does not teach a model to be right. It teaches the model to be confident in your patterns.
The update problem is serious. When terminology changes (a product rename, a regulatory revision, a new style guide), a fine-tuned model does not automatically update. You retrain, which means compute cost, curation time, and evaluation infrastructure. For organizations with rapidly evolving products, this cycle is often too slow to be practical. A RAG index can typically be updated in hours; a retraining cycle typically takes weeks.
Evaluation is harder than most teams realize. Measuring whether a fine-tuned model is actually better (not just different) requires systematic human evaluation against a held-out test set. “It looks better to me” is not a measurement. Without rigorous eval, you risk shipping a model that is confidently wrong in ways that are harder to detect because the output is fluent.
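To make "better, not just different" concrete, one option is a blind pairwise comparison on held-out segments, followed by a simple significance check. The sketch below assumes reviewers label each held-out segment as preferring the tuned model, the base model, or neither. The helper names and the judgment counts are invented for illustration, and the sign test is one reasonable choice among several, not a prescribed method.

```python
import math
import random


def split_holdout(segments, holdout_fraction=0.2, seed=0):
    """Shuffle and split into (train, holdout); holdout never enters training."""
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def sign_test_p_value(wins, losses):
    """Two-sided sign test: chance of a split at least this lopsided
    if reviewers had no real preference. Ties are excluded by the caller."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical reviewer judgments on 20 held-out segments.
judgments = ["tuned"] * 14 + ["base"] * 4 + ["tie"] * 2
wins = judgments.count("tuned")
losses = judgments.count("base")
p_value = sign_test_p_value(wins, losses)
print(f"tuned preferred {wins}/{wins + losses}, p = {p_value:.3f}")
```

A small p-value only says the preference is unlikely to be noise. It does not say the preferred output is correct, so error annotation by reviewers is still needed.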
Data contamination is real. Fine-tuning on a translation memory (TM) that includes unreviewed machine translation (MT) output teaches the model to reproduce MT artifacts. The fine-tuned model then generates cleaner-looking text that has the same systematic errors as the original MT, just with higher confidence.
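A basic guard is to filter the translation memory on provenance before it becomes training data. The sketch below assumes each entry records where the translation came from and whether a human reviewed it. The `origin` and `reviewed` fields are hypothetical, since real TM formats store this metadata in different ways, if they store it at all.

```python
from dataclasses import dataclass


@dataclass
class TMEntry:
    source: str
    target: str
    origin: str     # hypothetical field: "human" or "mt"
    reviewed: bool  # hypothetical field: True if a human reviewer approved it


def select_training_entries(entries):
    """Drop unreviewed machine-translated entries; keep everything else."""
    return [e for e in entries if not (e.origin == "mt" and not e.reviewed)]


# Invented sample entries for illustration.
entries = [
    TMEntry("Konto löschen", "Delete account", origin="human", reviewed=True),
    TMEntry("Konto sperren", "Lock account", origin="mt", reviewed=True),
    TMEntry("Konto öffnen", "Account open", origin="mt", reviewed=False),
]

clean = select_training_entries(entries)
print(f"kept {len(clean)} of {len(entries)} entries")
```

A stricter policy would keep only entries with `reviewed` set to True, regardless of origin. Which rule fits depends on how much the unreviewed human translations can be trusted.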