Humanity's Last Exam

Humanity's Last Exam (HLE) is a language model benchmark consisting of over 2,500 expert-level questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI, and was designed to test reasoning abilities and human-like intelligence, as opposed to just pattern recognition.

History

Benchmark tests like Humanity's Last Exam have long been used to evaluate reasoning and learning capabilities in machines.^[1] Early benchmarks, such as the Turing test, measured whether machines could demonstrate human-like conversational abilities.^[2] Other early benchmarks evaluated computer vision, such as MNIST for handwritten digit recognition and ImageNet for image classification.^[3] The emergence of large language models (LLMs) in the 2020s drove the development of new benchmarks that emphasize interpretability, reproducibility, and clearer evaluation criteria. Recent foundation model benchmarks, such as MMLU, HellaSwag, and ARC Challenge, illustrate this shift.^[4]

Creation

Humanity's Last Exam was created to keep pace with the rapid progress of LLMs and to provide a meaningful assessment of these models. Leading models were scoring about 90% on earlier benchmarks, creating the need for a more difficult exam.^[5] Stanford HAI's AI Index 2025 Annual Report cites Humanity's Last Exam as one of the "more challenging benchmarks" developed in response to popular AI benchmarks having reached "saturation".^[6] The test has been described as the brainchild of Dan Hendrycks, a machine learning researcher and the director of the Center for AI Safety, who stated that he was inspired to create the test after a conversation with Elon Musk, who thought that existing language model benchmarks, such as MMLU, were too easy. Hendrycks worked with Scale AI to compile the questions.^[7]

The questions were crowdsourced from subject matter experts at institutions across the world.^[8]^[9] Each question was first filtered against leading AI models: only if the models failed to answer it, or did worse than random guessing on a multiple-choice question, was it passed on to human experts, who reviewed it for accuracy and wording in two rounds before approving it for inclusion in the dataset. The submitters of the top-rated questions were given prize money from a pool of 500,000 U.S. dollars: $5,000 for each of the top 50 questions and $500 for each of the next 500. After the initial release, a "community feedback bug bounty program" was opened to "identify and remove major errors in the dataset".^[9]

AI systems often excel at narrow, task-oriented tests, yet few perform well on broader assessments of general ability.^[10] HLE was designed to test reasoning abilities, which are considered a measure of "human" intelligence.^[11]
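The model-based filtering step described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark team's actual pipeline: the `Question` fields, the function name `passes_model_filter`, and the choice to compare the mean accuracy across models against the chance rate of one over the number of choices are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    """Hypothetical container for a submitted question."""
    text: str
    answer: str
    choices: Optional[List[str]] = None  # None means a short-answer question


def passes_model_filter(question: Question, model_correct: List[bool]) -> bool:
    """Return True if the question should go on to human expert review.

    model_correct holds one entry per tested model: True if that model
    answered the question correctly, False otherwise.
    """
    if not model_correct:
        raise ValueError("at least one model result is required")

    if question.choices is None:
        # Short answer: assumed rule is that every tested model must fail.
        return not any(model_correct)

    # Multiple choice: assumed rule is that the models, taken together,
    # do worse than uniform random guessing over the answer choices.
    chance_rate = 1.0 / len(question.choices)
    accuracy = sum(model_correct) / len(model_correct)
    return accuracy < chance_rate


if __name__ == "__main__":
    short = Question(text="placeholder", answer="placeholder")
    print(passes_model_filter(short, [False, False, False]))
    mc = Question(text="placeholder", answer="B", choices=["A", "B", "C", "D"])
    print(passes_model_filter(mc, [True, False, False, False, False]))
```

Questions that pass this check would then proceed to the two rounds of human review; that stage depends on expert judgment and is not modeled here.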

Composition

The benchmark consists of 2,500 questions in the publicly released set. The paper classifies the questions into the following broad subjects: mathematics (41%), physics (9%), biology/medicine (11%), humanities/social science (9%), computer science/artificial intelligence (10%), engineering (4%), chemistry (7%), and other (9%). Around 14% of the questions require the ability to understand both text and images, i.e., multi-modality. 24% of the questions are multiple-choice; the rest are short-answer, exact-match questions. A private set is also maintained to test for benchmark overfitting.^[9]
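As a quick check on the breakdown above, the quoted percentages can be converted into approximate question counts. The sketch below uses only figures from this section; because the percentages are rounded, the counts are estimates, and the names `approximate_counts`, `subject_counts`, and `format_counts` are illustrative. The 76% short-answer share is derived as the remainder after the 24% multiple-choice share.

```python
# Figures quoted in the Composition section. The percentages are rounded,
# so the derived counts are approximations rather than exact totals.
PUBLIC_SET_SIZE = 2500

SUBJECT_SHARE_PERCENT = {
    "mathematics": 41,
    "physics": 9,
    "biology/medicine": 11,
    "humanities/social science": 9,
    "computer science/artificial intelligence": 10,
    "engineering": 4,
    "chemistry": 7,
    "other": 9,
}

# Short-answer share is derived as 100 - 24 (the multiple-choice share).
FORMAT_SHARE_PERCENT = {
    "multi-modal": 14,
    "multiple-choice": 24,
    "short-answer": 76,
}


def approximate_counts(total, share_percent):
    """Convert percentage shares into rounded question counts."""
    return {name: round(total * pct / 100) for name, pct in share_percent.items()}


subject_counts = approximate_counts(PUBLIC_SET_SIZE, SUBJECT_SHARE_PERCENT)
format_counts = approximate_counts(PUBLIC_SET_SIZE, FORMAT_SHARE_PERCENT)

if __name__ == "__main__":
    # The subject shares should cover the whole public set.
    assert sum(SUBJECT_SHARE_PERCENT.values()) == 100
    for name, count in subject_counts.items():
        print(f"{name}: ~{count}")
    for name, count in format_counts.items():
        print(f"{name}: ~{count}")
```

Under these figures, mathematics alone accounts for roughly 1,025 of the 2,500 public questions, and about 600 questions are multiple-choice.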

An example question:^[7]

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

An independent investigation by FutureHouse, published in July 2025, suggested that around 30% of the HLE answers for text-only chemistry and biology questions could be incorrect; the benchmark's team partially replicated the findings, and said they hope to institute a continuous revisions process.^[12]

Results

Performance of various models on the benchmark
Organization	Model	Accuracy (%) ↑	Calibration Error (%) ↓
Google DeepMind	Gemini 3 Pro Preview	37.52	57
OpenAI	GPT-5 Pro	31.64	49
Anthropic	Claude Opus 4.5 (Thinking)	25.20	55
Z.ai	GLM 4.5	8.32	79
Meta AI	Llama 4 Maverick	5.68	83
Mistral AI	Mistral Medium 3	4.52	77
Amazon Web Services	Nova Pro	4.40	80

Performance of various non-multimodal models on the text-only subset of the benchmark
Organization	Model	Accuracy (%) ↑	Calibration Error (%) ↓
OpenAI	gpt-oss-120b	15.48	76
Alibaba Cloud	Qwen3-235B-A22B-Thinking-2507	15.43	78
DeepSeek	DeepSeek-R1-0528	14.04	78
Moonshot AI	Kimi-K2-Instruct	4.68	82
Amazon Web Services	Nova Micro	4.41	84
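
The tables report a calibration error alongside accuracy, but this article does not define how it is computed. One common formulation is a binned root-mean-square calibration error, which compares a model's stated confidence with its observed accuracy. The sketch below assumes that formulation with ten equal-width bins; the function name, the bin count, and the input format are assumptions for illustration and may differ from the procedure used to produce the figures above.

```python
import math
from typing import Sequence


def rms_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Binned RMS gap between stated confidence (0 to 1) and accuracy."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    weighted_squared_gap = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        # Weight each bin by the fraction of answers that fall into it.
        weighted_squared_gap += (len(members) / total) * (mean_conf - accuracy) ** 2

    return math.sqrt(weighted_squared_gap)


if __name__ == "__main__":
    # A model that claims 90% confidence but is right one time in four.
    print(rms_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

Under this definition, a model that is highly confident but rarely correct gets a large error, which is consistent with the pattern in the tables, where low accuracy tends to accompany high calibration error.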

References

[1] "Humanity's Last Exam: The AI Benchmark for LLM Reasoning". IntuitionLabs. Retrieved 2025-11-20.
[2] Pinar Saygin, Ayse; Cicekli, Ilyas; Akman, Varol (2000-11-01). "Turing Test: 50 Years Later". Minds and Machines. 10 (4): 463–518. doi:10.1023/A:1011288000451. ISSN 1572-8641.
[3] Faber, Kamil; Zurek, Dominik; Pietron, Marcin; Japkowicz, Nathalie; Vergari, Antonio; Corizzo, Roberto (2024-10-01). "From MNIST to ImageNet and back: benchmarking continual curriculum learning". Machine Learning. 113 (10): 8137–8164. doi:10.1007/s10994-024-06524-z. ISSN 1573-0565.
[4] Reuel, Anka (20 November 2024). "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices". arXiv.
[5] Phan, Long; et al. (2025). "Humanity's Last Exam". arXiv:2501.14249 [cs.LG].
[6] Maslej, Nestor; et al. (April 2025). The AI Index 2025 Annual Report (PDF) (Report). Institute for Human-Centered AI. pp. 141–142.
[7] Roose, Kevin (23 January 2025). "When A.I. Passes This Test, Look Out". New York Times. Archived from the original on 29 January 2025. Retrieved 24 January 2025.
[8] Dastin, Jeffrey; Paul, Katie (16 September 2024). "AI experts ready 'Humanity's Last Exam' to stump powerful tech". Reuters. Archived from the original on 8 April 2025. Retrieved 24 January 2025.
[9] Phan, Long; et al. (2025). "Humanity's Last Exam". arXiv:2501.14249 [cs.LG].
[10] Hernández-Orallo, José (2016). "Evaluation in artificial intelligence: From task-oriented to ability-oriented measurement". Artificial Intelligence Review. 1–51. doi:10.1007/s10462-016-9505-7. URL: https://riunet.upv.es/server/api/core/bitstreams/52884250-5f37-43f6-b966-014799bfac28/content
[11] "Humanity's Last Exam: AI vs Human Benchmark Results | Galileo". Galileo AI. Retrieved 2025-11-20.
[12] Skarlinski, Michael; Laurent, Jon; Bou, Albert; White, Andrew (16 September 2025). "About 30% of Humanity's Last Exam chemistry/biology answers are likely wrong". FutureHouse. Retrieved 15 October 2025.

External links

Humanity's Last Exam at the Center for AI Safety
Humanity's Last Exam at Scale AI
