ARC
ARC, short for the AI2 Reasoning Challenge, is a benchmark introduced by the Allen Institute for AI in 2018 containing 7,787 grade-school-level multiple-choice science questions drawn from US standardized tests. The benchmark is split into an Easy set (5,197 questions) and a Challenge set (2,590 questions designed to require multi-step reasoning beyond simple retrieval). ARC Challenge became an important early benchmark for testing whether language models could perform genuine reasoning rather than pattern matching. Reference scores include GPT-3 (51.4% on Challenge), Llama 2 70B (67.3%), GPT-4 (96.3%), Claude 3 Opus (96.4%), and Claude 3.5 Sonnet (96.7%). ARC has been largely saturated by frontier models in 2024-2025 and is now included in evaluation suites mainly for backward compatibility. The similarly named ARC-AGI benchmark from François Chollet, built on his Abstraction and Reasoning Corpus and unrelated to AI2's ARC, has become more discriminating for current models. AI governance teams encounter ARC scores in model documentation and use them to validate baseline reasoning capability. The dataset is available from the Allen Institute and via Hugging Face.
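As a minimal sketch of how ARC scores like those above are typically computed, the snippet below loads the Challenge split from Hugging Face and measures multiple-choice accuracy. It assumes the `datasets` library and the `allenai/ai2_arc` dataset ID; `predict_answer` is a hypothetical placeholder for whatever model is under evaluation.

```python
# Sketch: score a model on ARC-Challenge multiple-choice accuracy.
# predict_answer is a hypothetical stand-in for the model being evaluated.
from datasets import load_dataset


def predict_answer(question: str, labels: list[str], texts: list[str]) -> str:
    """Hypothetical model call: return the label of the chosen answer."""
    return labels[0]  # placeholder: always picks the first option


def evaluate_arc_challenge(split: str = "test") -> float:
    # Config names "ARC-Challenge" and "ARC-Easy" select the two subsets.
    ds = load_dataset("allenai/ai2_arc", "ARC-Challenge", split=split)
    correct = 0
    for ex in ds:
        labels = ex["choices"]["label"]  # e.g. ["A", "B", "C", "D"]
        texts = ex["choices"]["text"]
        pred = predict_answer(ex["question"], labels, texts)
        correct += int(pred == ex["answerKey"])
    return correct / len(ds)


if __name__ == "__main__":
    print(f"ARC-Challenge accuracy: {evaluate_arc_challenge():.3f}")
```

Published scores vary with prompting setup (zero-shot vs. few-shot, answer-letter vs. full-text scoring), so numbers reported in model documentation are only comparable when the evaluation protocol matches.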
ARC-validated models in Centralpoint: Centralpoint routes requests to models validated on ARC and other reasoning benchmarks within a model-agnostic stack. Tokens are metered per skill, prompts stay local, the platform supports both generative and embedding models, and chatbots can be deployed on any portal through a single line of JavaScript.
Related Keywords:
ARC