BLEU and COMET

Automated metrics for evaluating machine translation output: BLEU measures n-gram overlap against a reference translation; COMET uses a neural model trained on human quality judgments to produce scores that align more closely with professional evaluation.

The honest version

BLEU was introduced in 2002. It has been wrong, in documented and reproducible ways, ever since. It is still the most widely reported metric in MT research because it is fast, cheap, and produces numbers that look like scores.

BLEU works by counting how many word sequences (n-grams, usually one to four words long) in the hypothesis translation also appear in the reference translation, then penalizing outputs that are too short. The logic seems reasonable: a good translation should share vocabulary with a correct reference. The problem is that natural language allows many correct translations of the same source. A BLEU score computed against a single reference penalizes every valid paraphrase that the reference translator did not happen to write. High BLEU does not mean good translation. Low BLEU does not mean bad translation.
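To make the counting concrete, here is a minimal sketch of single-reference BLEU. It assumes whitespace tokenization and no smoothing, and the function names are illustrative rather than any library's API; production tools such as sacreBLEU handle tokenization, smoothing, and multiple references with far more care.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(hypothesis, reference, max_n=4):
    """Unsmoothed single-reference BLEU in [0, 1] on whitespace tokens (illustrative)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipping: a hypothesis n-gram is credited at most as many times
        # as it occurs in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if matched == 0:
            # The geometric mean collapses to zero if any n-gram order has no match.
            return 0.0
        log_precision_sum += math.log(matched / sum(hyp_counts.values()))
    # Brevity penalty: hypotheses shorter than the reference are scaled down.
    brevity_penalty = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat sat on the mat"
exact_copy = bleu_sketch("the cat sat on the mat", reference)
paraphrase = bleu_sketch("the cat is sitting on the mat", reference)
print(exact_copy, paraphrase)
```

In this sketch the exact copy scores 1.0 and the paraphrase scores 0.0, because the paraphrase shares no four-word sequence with the reference. Nothing about the paraphrase is wrong; it simply is not the sentence the reference translator wrote.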

COMET is better. It uses a neural model (trained on human direct assessment scores and error annotations) to predict translation quality from the source, the hypothesis, and, in its standard form, a reference. It correlates more closely with professional human evaluation than BLEU does, handles paraphrases more gracefully, and distinguishes quality differences that BLEU cannot see. It has become a standard metric for research benchmarks, displacing BLEU in most serious MT evaluation work.

Neither metric replaces human review.

Why it matters for translation

Automated metrics allow you to evaluate thousands of segments in seconds. For pipeline monitoring (“did the model output regress after we changed the prompt?”), BLEU and COMET provide a fast, reproducible signal. Running COMET before and after a system change is more reliable than asking someone if it looks better.
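One way to turn "did the output regress?" into a repeatable check is paired bootstrap resampling over per-segment scores. The sketch below is an illustration under assumptions, not a method prescribed here: it assumes you already have one score per segment (for example from COMET) for the same segments before and after the change, and the function name, resample count, and sample scores are hypothetical.

```python
import random


def paired_bootstrap_win_rate(before, after, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which `after` outscores `before`.

    `before` and `after` hold per-segment metric scores for the same
    segments, in the same order (hypothetical inputs).
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    n = len(before)
    wins = 0
    for _ in range(n_resamples):
        # Resample segment indices with replacement, keeping pairs aligned.
        sample = [rng.randrange(n) for _ in range(n)]
        delta = sum(after[i] - before[i] for i in sample)
        if delta > 0:
            wins += 1
    return wins / n_resamples


# Hypothetical segment-level scores before and after a prompt change.
before_scores = [0.81, 0.77, 0.90, 0.65, 0.72]
after_scores = [0.83, 0.75, 0.91, 0.70, 0.74]
print(paired_bootstrap_win_rate(before_scores, after_scores))
```

A win rate near 1.0 suggests the change helped consistently across segments; a value near 0.5 suggests the difference is within noise. Where to draw the line is a judgment call that this sketch does not settle.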

For comparative evaluation (choosing between MT systems, benchmarking LLM translation against a baseline, measuring the impact of fine-tuning), automated metrics provide the scale that human evaluation cannot. A human can rate 200 segments in a day; a metric can rate 200,000 in a minute.

COMET scores correlate well enough with human judgment that organizations use them as a first-pass quality gate: segments below a threshold score are routed to human post-editing; segments above it are passed through with lighter review. Because production segments have no reference translation, this setup relies on a reference-free COMET variant. This is a reasonable use of the metric: as a filter, not as a ground truth.
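The gate itself is simple once each segment has a score. The sketch below assumes the scores were computed elsewhere; the function name, the segment IDs, and the 0.80 threshold are hypothetical, and sending a segment that sits exactly on the threshold to the pass-through queue is an arbitrary choice made here.

```python
def route_segments(scored_segments, threshold):
    """Split (segment_id, score) pairs into post-edit and pass-through queues."""
    post_edit, pass_through = [], []
    for segment_id, score in scored_segments:
        # Below the threshold goes to a human post-editor; everything else passes.
        if score < threshold:
            post_edit.append(segment_id)
        else:
            pass_through.append(segment_id)
    return post_edit, pass_through


# Hypothetical scores; a real threshold should be calibrated against human ratings.
scored = [("seg-001", 0.91), ("seg-002", 0.62), ("seg-003", 0.80)]
needs_post_editing, passed = route_segments(scored, threshold=0.80)
print(needs_post_editing, passed)
```

The threshold is the part that deserves scrutiny. It only means something after it has been checked against human judgments on your own content and language pairs.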

Where it fails

BLEU is nearly useless for evaluating translation quality at the segment level. It was designed for corpus-level evaluation and degrades rapidly below that scale. Using BLEU to score individual sentences is a methodological error that is common enough to be worth naming.
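The gap between segment-level and corpus-level BLEU is easy to reproduce. The sketch below assumes whitespace tokenization and unsmoothed BLEU, with illustrative function names and made-up sentence pairs; it scores each segment on its own, then pools the n-gram statistics across all segments before computing a single score.

```python
import math
from collections import Counter


def match_stats(hyp, ref, max_n=4):
    """Return [(clipped matches, hypothesis n-gram total)] for n = 1..max_n."""
    stats = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        stats.append((matched, sum(hyp_counts.values())))
    return stats


def bleu_from_stats(stats, hyp_len, ref_len):
    """Unsmoothed BLEU from match statistics; 0.0 if any n-gram order has no match."""
    if hyp_len == 0 or any(matched == 0 for matched, _ in stats):
        return 0.0
    mean_log_precision = sum(math.log(m / t) for m, t in stats) / len(stats)
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(mean_log_precision)


def segment_scores(pairs):
    """Score every (hypothesis, reference) pair in isolation."""
    scores = []
    for hyp, ref in pairs:
        h, r = hyp.split(), ref.split()
        scores.append(bleu_from_stats(match_stats(h, r), len(h), len(r)))
    return scores


def corpus_score(pairs, max_n=4):
    """Pool matches, totals, and lengths over all pairs, then score once."""
    pooled = [[0, 0] for _ in range(max_n)]
    hyp_len = ref_len = 0
    for hyp, ref in pairs:
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for order, (matched, total) in enumerate(match_stats(h, r, max_n)):
            pooled[order][0] += matched
            pooled[order][1] += total
    return bleu_from_stats(pooled, hyp_len, ref_len)


# Made-up (hypothesis, reference) pairs for illustration.
pairs = [
    ("the cat sat on the mat", "the cat sat on the mat"),
    ("he reads a book every evening", "he reads books every evening"),
    ("she opened the window", "she opened the window slowly"),
]
print(segment_scores(pairs))
print(corpus_score(pairs))
```

The second segment scores exactly 0.0 on its own because it shares no three-word sequence with its reference, although it is a perfectly acceptable translation. Pooled with its neighbors, the same segment merely lowers a nonzero corpus score. Smoothing softens the zero but does not make a single-sentence score informative.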

COMET is better but not immune to gaming. Models can be fine-tuned specifically to score well on COMET while producing output that human evaluators dislike. This is a known problem in MT research: COMET is a learned metric, and a system optimized against it can overfit to it like any other training objective.

Both metrics fail on idiom, register, and pragmatics. A translation can be factually correct, stylistically appropriate, and score low because it does not match the reference wording. A translation can score high and violate your style guide in ways that no metric measures.

In regulated industries (pharmaceutical, medical device, legal), automated metrics are not accepted as quality assurance. A human sign-off from a qualified translator is required, and the BLEU score is irrelevant. Using automated metrics as your primary QA gate for content with regulatory consequences is not a cost optimization; it is a liability.

The most dangerous use of BLEU is as a proxy for human judgment in management presentations. A dashboard that reports “BLEU improved 2.3 points” communicates very little about whether your customers are reading better translations.
