BLEU

BLEU, short for Bilingual Evaluation Understudy, is the classic machine translation evaluation metric introduced by Papineni et al. at IBM in 2002. It scores candidate translations against one or more reference translations using modified n-gram precision combined with a brevity penalty that punishes overly short outputs. BLEU produces a score from 0 to 100 (or 0 to 1), where higher is better; state-of-the-art neural translation systems typically score roughly 30-50 on standard test sets, depending on the language pair and domain. The metric dominated translation evaluation for two decades and remains widely reported, though it has well-known limitations: poor correlation with human judgment on individual sentences, insensitivity to fluency improvements, and poor handling of languages with flexible word order. Modern alternatives include METEOR, ChrF, BERTScore, COMET, and, increasingly, direct LLM-as-judge evaluation. BLEU is still used in many academic papers for backward comparability and remains the default metric in some translation toolkits such as sacrebleu. AI governance teams encounter BLEU mainly in legacy NLP system evaluation and academic research; for production LLM evaluation, BLEU has been largely supplanted by LLM-judge metrics.
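
As a minimal sketch, corpus-level BLEU can be computed with the sacrebleu package mentioned above; the hypothesis and reference sentences here are hypothetical illustrations rather than real system output.

```python
# Minimal sketch: corpus-level BLEU with the sacrebleu package.
# The hypothesis and reference sentences are hypothetical examples.
import sacrebleu

hypotheses = [
    "the cat sat on the mat",
    "there is a book on the table",
]
# sacrebleu expects a list of reference streams; each stream is one full
# set of references aligned with the hypotheses (a single stream here).
references = [
    ["the cat is sitting on the mat", "there is a book on the table"],
]

# Corpus BLEU: geometric mean of modified 1- to 4-gram precisions,
# multiplied by a brevity penalty for hypotheses shorter than the references.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")          # score on the 0-100 scale
print(f"brevity penalty = {bleu.bp:.3f}")  # 1.0 when output is long enough
```

Because scores depend on tokenization and reference handling, sacrebleu is commonly preferred for reporting, since it standardizes these choices and makes published BLEU numbers comparable.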

Translation-evaluated models in Centralpoint: Centralpoint routes translation workloads to multilingual LLMs from any provider in a model-agnostic stack, with outputs validated against BLEU, COMET, and other translation metrics. Tokens are metered per skill, prompts stay local, and chatbots deploy through a single line of JavaScript on any portal.
