MT-Bench

MT-Bench is a benchmark for evaluating chat-tuned LLMs on multi-turn conversational tasks, introduced by LMSYS in 2023 alongside the Chatbot Arena leaderboard. The benchmark contains 80 two-turn questions across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities), with model responses scored by GPT-4 acting as a judge on a 1-10 scale.

MT-Bench was designed to test capabilities that automated benchmarks like MMLU miss: instruction following, response quality, multi-turn coherence, and helpfulness. The benchmark popularized the LLM-as-judge evaluation paradigm and demonstrated that GPT-4 judgments correlate well with human preferences. Reference scores include Llama 2 70B Chat (6.86), GPT-3.5 Turbo (8.39), GPT-4 (9.18), and Claude 3 Opus (8.94).

MT-Bench has been criticized for evaluator bias (GPT-4 tends to prefer GPT-4-style outputs) and for score saturation, which led to harder follow-ups such as MT-Bench-Plus and Arena-Hard. AI governance teams use MT-Bench as one input to chat-model evaluation, supplemented by domain-specific tests and human evaluation.
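
As a rough illustration of the judging protocol described above, the sketch below scores a single candidate reply with a GPT-4 judge through an OpenAI-compatible client. The judge prompt wording, the gpt-4 model name, and the rating-extraction regex are illustrative assumptions, not the official FastChat MT-Bench harness, which runs the full 80-question, two-turn protocol.

import re
from openai import OpenAI  # assumes the openai Python package (>=1.0) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative judge prompt in the spirit of MT-Bench single-answer grading:
# the judge explains briefly, then emits a 1-10 rating as "Rating: [[X]]".
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the response "
    "provided by an AI assistant to the user question below. Consider helpfulness, "
    "relevance, accuracy, depth, and level of detail. After a short explanation, "
    'output your verdict strictly in the format "Rating: [[X]]", where X is an '
    "integer from 1 to 10.\n\n"
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"
)

def judge_single_answer(question: str, answer: str, judge_model: str = "gpt-4") -> int:
    """Ask the judge model to rate one answer; returns the integer rating."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
    )
    text = response.choices[0].message.content
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", text)
    if match is None:
        raise ValueError(f"Judge output did not contain a rating: {text!r}")
    return int(match.group(1))

A full MT-Bench run applies this kind of grading to both turns of every question and averages the ratings, overall and per category, to produce the scores reported on the leaderboard.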

MT-Bench-validated chat models in Centralpoint: Centralpoint routes conversational workloads to MT-Bench-validated models in a model-agnostic stack. Tokens are metered per skill, prompts stay local, the stack supports both generative and embedded models, and chat experiences can be deployed on any portal through a single line of JavaScript.

