MMLU
MMLU, short for Massive Multitask Language Understanding, is a benchmark introduced by Hendrycks et al. in 2020 that tests LLM knowledge and reasoning across 57 subjects ranging from elementary mathematics and US history to professional medicine and law. Each subject contains hundreds of multiple-choice questions, and the metric is overall accuracy across all 14,000+ questions. MMLU has become the de facto standard for general-purpose LLM capability comparison; virtually every frontier model announcement reports MMLU scores. Reference scores include random guess (25%), expert humans (89.8%), GPT-3 (43.9%), GPT-4 (86.4%), Claude 3 Opus (86.8%), Claude 3.5 Sonnet (88.7%), Gemini Ultra (90.0%), and Llama 3.1 405B (88.6%). MMLU's popularity has led to contamination concerns (many models have likely seen MMLU questions during training) and to MMLU-Pro (2024), a harder version designed to reduce contamination and saturation. AI governance teams use MMLU as one of many benchmarks when evaluating model upgrades, but increasingly supplement it with task-specific evaluations rather than relying on it alone.
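To make the scoring concrete, the sketch below evaluates MMLU-style four-option questions and reports overall accuracy, the same metric the benchmark uses. It is a minimal illustration only: the sample items are made-up stand-ins rather than actual MMLU questions, and ask_model is a hypothetical stub you would replace with a call to whatever model you are evaluating.

```python
# Minimal sketch of MMLU-style scoring: overall accuracy over
# four-option multiple-choice questions. The sample items and the
# ask_model stub are illustrative placeholders, not real MMLU data
# or a real provider API.

CHOICES = ["A", "B", "C", "D"]


def format_prompt(question: str, options: list[str]) -> str:
    """Render a question and its four lettered options as a single prompt."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def ask_model(prompt: str) -> str:
    """Placeholder model that always answers 'A'. Swap in a real LLM call here."""
    return "A"


def evaluate(items: list[dict]) -> float:
    """Overall accuracy: fraction of items whose predicted letter matches the answer key."""
    correct = 0
    for item in items:
        prediction = ask_model(format_prompt(item["question"], item["options"]))
        if prediction.strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(items)


if __name__ == "__main__":
    sample_items = [
        {"question": "What is 7 x 8?",
         "options": ["54", "56", "58", "64"], "answer": "B"},
        {"question": "Which organ produces insulin?",
         "options": ["Pancreas", "Liver", "Kidney", "Spleen"], "answer": "A"},
    ]
    # The always-'A' stub scores 50% on these two items; random guessing
    # averages about 25% on four-option questions.
    print(f"Accuracy: {evaluate(sample_items):.1%}")
```

A full harness would loop this same scoring over all 57 subjects of the published test split (commonly distributed as the cais/mmlu dataset on Hugging Face) and report the aggregate accuracy.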
MMLU-validated models with Centralpoint: Centralpoint helps you compare MMLU-validated models from any provider (OpenAI, Anthropic, Google, Meta, Mistral) in a model-agnostic stack. Tokens are metered per skill, prompts stay local, and chatbots deploy through one line of JavaScript on any portal.
Related Keywords:
MMLU