HumanEval

HumanEval is a code-generation benchmark introduced by OpenAI alongside Codex in 2021, consisting of 164 hand-written Python programming problems, each with unit tests for automatic verification. A problem supplies a function signature and docstring, and the model must produce a function body that passes all of the unit tests. The metric is pass@k, the fraction of problems solved within k sampled attempts, with pass@1 the most commonly reported figure. HumanEval became the standard benchmark for comparing code-LLM capability: Codex (28.8% pass@1), GPT-4 (67% pass@1), and Claude 3.5 Sonnet (92% pass@1) all report it, as do dedicated code models such as DeepSeek-Coder, CodeLlama, and Qwen2.5-Coder.

The benchmark has been criticized for its narrow scope (Python only, single-function problems) and for training-data contamination, and it is now complemented by benchmarks such as MBPP and by more rigorous or realistic evaluations such as HumanEval+, LiveCodeBench, and SWE-Bench. AI governance teams use HumanEval for initial code-LLM screening but rely on task-specific evaluations and SWE-Bench-style real-world tasks for production deployment decisions. The benchmark is hosted on GitHub at github.com/openai/human-eval.
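To make the evaluation protocol concrete, the sketch below shows the unbiased pass@k estimator from the Codex paper (Chen et al., 2021), 1 - C(n-c, k) / C(n, k), alongside an illustrative HumanEval-style problem. The problem, its name, and the check() function are invented for illustration and are not a verbatim benchmark task; only the estimator formula follows the paper.

```python
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k) (Chen et al., 2021).

    n: completions sampled for a problem
    c: completions that passed all unit tests
    k: attempt budget
    """
    if n - c < k:
        return 1.0  # fewer than k failures, so any k draws include a success
    result = 1.0
    for i in range(n - c + 1, n + 1):  # numerically stable product form
        result *= 1.0 - k / i
    return 1.0 - result


# Illustrative HumanEval-style problem (hypothetical, not an actual task):
# the prompt is a signature plus docstring, and a candidate completion is
# judged by hidden unit tests such as check() below.
def add_bonus(numbers: List[int], bonus: int) -> List[int]:
    """Return a new list with `bonus` added to every element of `numbers`."""
    return [x + bonus for x in numbers]  # a candidate completion


def check(candidate) -> None:
    assert candidate([1, 2, 3], 10) == [11, 12, 13]
    assert candidate([], 5) == []


if __name__ == "__main__":
    check(add_bonus)                        # functional-correctness check
    print(pass_at_k(n=200, c=58, k=1))      # 58/200 correct -> pass@1 = 0.29
    print(pass_at_k(n=200, c=58, k=10))     # higher budget, higher pass rate
```

The reference harness in the GitHub repository implements essentially this estimator and, per its README, provides an evaluate_functional_correctness entry point that executes sampled completions against each problem's tests in sandboxed subprocesses.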

HumanEval-tested coding models in Centralpoint: Centralpoint routes coding workloads to HumanEval-validated models — GPT-4, Claude 3.5 Sonnet, DeepSeek-Coder, CodeLlama — in a model-agnostic stack. Tokens are metered per skill, prompts stay local, and code-aware chatbots deploy through one line of JavaScript on any portal.


Related Keywords:
HumanEval