ROUGE
ROUGE, short for Recall-Oriented Understudy for Gisting Evaluation, is a family of metrics for evaluating automatic summarization, introduced by Chin-Yew Lin in 2004. The most commonly reported variants are ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), and ROUGE-L (longest common subsequence). ROUGE is recall-oriented: it measures the fraction of n-grams in the reference summary that also appear in the candidate summary, complementing the precision-oriented BLEU. The metric dominated summarization evaluation for two decades and remains widely reported on benchmarks such as CNN/DailyMail, XSum, and PubMed. ROUGE shares BLEU's well-known limitations: weak correlation with human judgment, insensitivity to fluency and factuality, and excessive reward for verbatim copying. Modern alternatives include BERTScore, BARTScore, and direct LLM-as-judge evaluation. ROUGE is still reported in academic papers for comparability with earlier results, but in production LLM evaluation it has been largely supplanted by LLM-judge metrics. AI governance teams encounter ROUGE mainly in legacy summarization benchmarks and academic baselines.
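To make the definitions concrete, the following is a minimal sketch of the core computations: ROUGE-N recall as clipped n-gram overlap divided by the reference n-gram count, and ROUGE-L recall as longest-common-subsequence length divided by reference length. Tokenization here is naive whitespace splitting, purely for illustration; standard implementations add stemming and typically report precision and F-measure alongside recall.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n):
    """ROUGE-N recall: clipped count of reference n-grams also found
    in the candidate, divided by the total reference n-gram count."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by reference length."""
    return lcs_length(reference, candidate) / len(reference) if reference else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(reference, candidate, 1))  # ROUGE-1 recall = 5/6 ~ 0.833
print(rouge_n_recall(reference, candidate, 2))  # ROUGE-2 recall = 3/5 = 0.6
print(rouge_l_recall(reference, candidate))     # ROUGE-L recall = 5/6 ~ 0.833
```

Note how ROUGE-2 penalizes the single-word substitution more than ROUGE-1 does, since the substitution breaks two bigrams but only one unigram; this sensitivity to local word order is why ROUGE-2 and ROUGE-L are reported alongside ROUGE-1.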
Summarization evaluation in Centralpoint: Centralpoint routes summarization workloads to LLMs from any provider (Claude, GPT, Gemini, Llama) in a model-agnostic stack, validated against ROUGE, BERTScore, and LLM-judge metrics. Tokens are metered per skill, prompts stay local, and chatbots deploy through one line of JavaScript on any portal.
Related Keywords:
ROUGE