Research Notes: The “Expert Benchmark” Fallacy in AI Evaluation
Date: 2026-03-11
Search queries used:
- “Expert Benchmark fallacy AI evaluation critique”
- “AI benchmark human expert performance misleading evaluation problems”
- “AI surpasses human experts benchmark critique misleading capability claims philosophy”
- “benchmark saturation AI Goodhart’s law evaluation gaming problems 2024 2025”
- “Emily Bender Arvind Narayanan AI benchmark validity problems human level performance critique”
- “Melanie Mitchell AI benchmark broken critique generalization reasoning”
Executive Summary
The “Expert Benchmark Fallacy” is not yet a formally named philosophical concept, but it describes a well-documented epistemic error at the heart of AI capability claims. It occurs when an AI system scores at or above “human expert level” on a narrow benchmark test, and that score is then treated as evidence of expert-level competence in the full professional domain. The fallacy conflates task-specific metric performance with genuine domain understanding. Three interlocking problems drive it: construct invalidity (benchmarks do not measure what they claim to measure), Goodhart’s Law (optimising for a measure destroys its validity as an indicator), and the extrapolation error Arvind Narayanan describes, in which a one-dimensional benchmark improvement curve is projected forward as if it represented the whole job. Across medicine, law, and mathematics, AI systems routinely “pass” professional-licensing-style tests while failing at the full practice those tests are meant to gate.
Key Sources
The Markup — “Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless”
- URL: https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless
- Type: Investigative journalism
- Key points:
- MMLU (2020) comprises ~15,000 multiple-choice questions collected by Mechanical Turk workers from amateur websites, yet is cited as evidence of “expert-level” performance
- Emily M. Bender: “The creators of the benchmark have not established that the benchmark actually measures understanding.”
- Arvind Narayanan: “Many benchmarks are of low quality. Despite this, once a benchmark becomes widely used, it tends to be hard to switch away from it.”
- Rowan Zellers (HellaSwag co-creator): “It’s sort of like we kind of just made these benchmarks up.”
- Google’s Gemini scored 90.0% on MMLU — announced as “first model to outperform human experts” — but this claim is contested on construct validity grounds
- Tenet alignment: Supports Tenet 1 (Human Intent First) by showing that benchmarks are designed around model capabilities, not human intent in real contexts. Supports Tenet 5 (Always scalable) by showing the measurement gap between proxy and reality.
- Quote: “The yardsticks are, like, pretty fundamentally broken.” — Maarten Sap, CMU
Princeton CITP — Narayanan & Mitchell: A Guide to Cutting Through AI Hype
- URL: https://blog.citp.princeton.edu/2025/04/02/a-guide-to-cutting-through-ai-hype-arvind-narayanan-and-melanie-mitchell-discuss-artificial-and-human-intelligence/
- Type: Edited lecture transcript (Princeton Public Lecture, March 27, 2025)
- Key points:
- Narayanan: “When AI researchers predict that AI will take over some job, the basis for that prediction is an incredibly narrow and shallow understanding of what the job actually involves. The researcher defines a one-dimensional benchmark that captures a tiny aspect of the job, sees that AI performance improves rapidly over time on that benchmark, projects it forward, and concludes that AI will surpass humans and take over the job in three years.”
- Mitchell: “Passing those exams doesn’t mean the systems can do the other things we’d expect from a human who passed them. So just looking at behavior on tests or benchmarks isn’t always informative. That’s something people in the field have referred to as a crisis of evaluation.”
- Narayanan: “Many lawyers have been sanctioned by courts for submitting briefs filled with fake citations generated by chatbots.” — evidence that bar exam passage does not imply legal competence.
- Mitchell: AI lacks metacognition (awareness of one’s own reasoning reliability) and episodic memory — both fundamental to actual professional practice.
- Narayanan: The state of AI evaluation is “like the auto industry before independent safety testing.”
- Tenet alignment: Strongly aligns with Tenet 3 (Symbiotic Intelligence) — critiques automation rhetoric that displaces human judgment. Aligns with Tenet 2 (Context as Infrastructure) — expertise is deeply contextual, benchmarks strip that context away.
- Quote: “We can’t read much into the fact that a model passed the bar exam or the medical licensing exam. A lawyer’s job isn’t just answering bar exam questions all day.”
Melanie Mitchell — “On Evaluating Cognitive Capabilities in Machines” (NeurIPS 2025 Keynote)
- URL: https://aiguide.substack.com/p/on-evaluating-cognitive-capabilities
- Type: Academic keynote write-up (Substack, January 2026)
- Key points:
- AI companies often use benchmarks to test narrow tasks but make sweeping capability claims
- “Understanding” is ill-defined and multidimensional — we cannot simply say an AI system does or does not understand
- The “fallacy of dumb superintelligence”: a system can score superhuman on narrow tests while lacking basic common sense in adjacent tasks
- Granular testing focused on abstract generalisation is needed to understand true capabilities
- Anthropomorphic language in evaluation (“reasoning models”, “chain of thought”) shapes how performance is perceived
- Tenet alignment: Aligns with Tenet 4 (Pluralism of Perspectives) — multiple frames and measurement approaches needed. Aligns with Tenet 1 (Human Intent First) — evaluation should be grounded in actual human use cases.
arXiv — “Can We Trust AI Benchmarks? An Interdisciplinary Review”
- URL: https://arxiv.org/html/2502.06559v1 / https://ojs.aaai.org/index.php/AIES/article/download/36595/38733/40670
- Type: Academic meta-review (~100 studies)
- Key points:
- Systemic flaws: misaligned incentives, construct validity issues, unknown unknowns, Goodhart’s Law in action
- “When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law applied to AI benchmarks
- Benchmark gaming is empirically documented (LiveCodeBench evidence cited in related sources)
- Data contamination: models trained on test data achieve inflated scores; a minimal overlap check is sketched after this source’s notes
- Tenet alignment: Aligns with Tenet 5 (Always scalable) — proxy measurement may scale effort without scaling quality.
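To make the data-contamination point concrete, here is a minimal sketch of the kind of n-gram overlap check used to flag leaked test items. The 8-gram window, the helper names, and the toy strings are illustrative assumptions, not the method of any paper cited above.

```python
# A toy n-gram overlap check for test-set contamination. The 8-gram
# window and the miniature corpora are illustrative assumptions; real
# contamination audits apply the same idea at corpus scale.

def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of `text`, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))}

def contamination_rate(train_docs: list, test_items: list, n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the training corpus."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items)

# One test question leaked verbatim into the training data; one did not.
train = ["exam prep: the patient presents with acute chest pain radiating "
         "to the left arm what is the most likely diagnosis answer below"]
test = [
    "the patient presents with acute chest pain radiating to the left arm "
    "what is the most likely diagnosis",
    "which doctrine governs diversity jurisdiction in federal courts",
]
print(f"contamination rate: {contamination_rate(train, test):.0%}")  # 50%
```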
NBC News — “AI’s capabilities may be exaggerated by flawed tests”
- URL: https://www.nbcnews.com/tech/tech-news/ai-chatgpt-test-smart-capabilities-may-exaggerated-flawed-study-rcna241969
- Type: News coverage of research
- Key points:
- Methods used to evaluate AI systems “routinely oversell AI performance”
- Lack of standardisation makes cross-model comparisons unreliable
- Tenet alignment: Neutral/supportive of site perspective.
Major Positions
Position 1: The Construct Validity Critique
- Proponents: Emily M. Bender (UW), Melanie Mitchell (Santa Fe Institute), Su Lin Blodgett (Microsoft Research)
- Core claim: Benchmarks lack construct validity — they do not measure the thing they claim to measure (“intelligence”, “reasoning”, “expert-level competence”). Passing a multiple-choice test about medicine is not the same as clinical reasoning.
- Key arguments:
- LLMs predict token sequences; they do not “understand” in the way humans do when they answer correctly
- “Multiple choice” format eliminates the open-ended, ambiguous nature of real expert work; the shortcut-heuristic sketch after this position’s notes shows how the format itself can be gamed
- Benchmarks designed before today’s models were built cannot meaningfully test them
- Relation to site tenets: Strongly aligned with Tenet 1 and Tenet 3 — emphasises that human professional capability is contextual, embodied, and intentional, not reducible to test scores.
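A deliberately crude baseline makes the construct-validity point concrete: a “model” that knows nothing about medicine and always picks the longest option can beat chance whenever correct answers are systematically longer. The questions below are fabricated for illustration; the general pattern of surface-cue exploitation is documented in the benchmark-artifacts literature.

```python
# A deliberately ignorant baseline: always pick the longest option.
# The questions are fabricated for illustration; the point is that
# surface regularities in answer options can produce above-chance
# scores with zero domain understanding.

questions = [
    {"options": ["yes", "no",
                 "increase the dose gradually while monitoring renal function",
                 "stop treatment"],
     "answer": 2},
    {"options": ["order a chest x-ray",
                 "administer intravenous fluids and reassess within the hour",
                 "discharge", "refer"],
     "answer": 1},
    {"options": ["refer the patient for an urgent cardiology consultation",
                 "aspirin", "rest", "no action"],
     "answer": 0},
]

def longest_option(options):
    """Index of the longest option string."""
    return max(range(len(options)), key=lambda i: len(options[i]))

correct = sum(1 for q in questions if longest_option(q["options"]) == q["answer"])
print(f"longest-option heuristic: {correct}/{len(questions)} (chance: 25%)")
```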
Position 2: Goodhart’s Law / Benchmark Gaming
- Proponents: Arvind Narayanan (Princeton), Sayash Kapoor, goodeyelabs.com research
- Core claim: Once a benchmark becomes a target, AI developers optimise their models to score well on it, which decouples the score from the underlying capability. This is Goodhart’s Law applied to AI evaluation; a toy simulation follows this position’s arguments.
- Key arguments:
- Data contamination: test data leaks into training data
- Hyperparameter tuning specific to benchmark tasks
- LiveCodeBench provides empirical evidence of benchmark gaming post-saturation
- Relation to site tenets: Aligns with Tenet 5 (Always scalable) — the effort invested in benchmark optimisation does not proportionally improve real-world usefulness.
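A toy hill-climbing simulation shows the decoupling mechanism: “skill” raises real capability, “gaming” (say, memorising leaked items or tuning to quirks of the test) raises only the proxy, and the optimiser sees only the proxy. Every functional form and constant here is an illustrative assumption, not a model of any real training run.

```python
# Goodhart's Law as a toy simulation: hill-climb on a proxy score that
# rewards "gaming" (e.g. memorising leaked test items) more cheaply
# than genuine "skill". All numbers are illustrative assumptions.

import random

def benchmark_score(skill, gaming):
    # The proxy conflates real skill with benchmark-specific tricks.
    return skill + 2.0 * gaming

def true_capability(skill, gaming):
    # Real-world competence depends on skill alone.
    return skill

random.seed(0)
skill = gaming = 0.0
for _ in range(1000):
    # Gaming is easier to move than skill (larger step size).
    d_skill, d_gaming = random.gauss(0, 0.01), random.gauss(0, 0.05)
    # Accept any tweak that raises the *proxy*, the only visible signal.
    if benchmark_score(skill + d_skill, gaming + d_gaming) > benchmark_score(skill, gaming):
        skill, gaming = skill + d_skill, gaming + d_gaming

print(f"benchmark score : {benchmark_score(skill, gaming):6.1f}")
print(f"true capability : {true_capability(skill, gaming):6.1f}")
# The proxy soars while the capability it was meant to track barely moves.
```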
Position 3: The Extrapolation Error
- Proponents: Arvind Narayanan, Melanie Mitchell
- Core claim: Researchers observe rapid benchmark improvement, project that curve forward, and conclude the entire job will be automated. This is a category error: the benchmark captures one narrow dimension, while the job has hundreds. A toy projection follows this position’s arguments.
- Key arguments:
- Geoffrey Hinton’s 2016 prediction that radiologists would be obsolete within 5 years (has not occurred)
- “The researcher defines a one-dimensional benchmark that captures a tiny aspect of the job” — Narayanan
- Lawyers sanctioned for submitting AI-hallucinated citations despite GPT-4 “passing the bar exam”
- Relation to site tenets: Directly aligned with Tenet 1 (Human Intent First) — intent in professional work is complex, relational, and contextual, not reducible to a test score. Aligns with Tenet 3 (Symbiotic Intelligence) — automation-replacement rhetoric ignores what humans uniquely contribute.
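The extrapolation error is easy to reproduce numerically: fit a straight line to early gains on a curve that actually saturates, then project it forward. The logistic “true” curve and its 90% ceiling below are illustrative assumptions about benchmark dynamics, not measured data.

```python
# The extrapolation error in a few lines of arithmetic: benchmark
# progress that actually saturates (logistic, 90% ceiling) looks
# linear early on, and a linear projection sails past 100%.

import math

def true_score(year):
    """Saturating benchmark progress with a ceiling at 90%."""
    return 90.0 / (1.0 + math.exp(-(year - 2.0)))

# Fit a line through the first and last observed years (0 and 3).
observed = {y: true_score(y) for y in range(4)}
slope = (observed[3] - observed[0]) / 3
intercept = observed[0]

for year in range(8):
    projection = intercept + slope * year
    print(f"year {year}: true {true_score(year):5.1f}%   linear projection {projection:6.1f}%")
# The projection crosses 100% around year 5; the true curve flattens below 90%.
```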
Position 4: The “Crisis of Evaluation” (reform position)
- Proponents: Mitchell, Narayanan, Stanford HAI, MLCommons
- Core claim: The problem is real but solvable — we need independent third-party evaluation, multi-dimensional benchmarks, and “midstream evaluation” frameworks that are context-specific without being deployment-specific.
- Key arguments:
- ChatBot Arena (human-in-the-loop pairwise preference) as an alternative model; a minimal rating sketch follows this list
- NIST role in developing standardised evaluation frameworks
- “Upstream” vs. “downstream” vs. “midstream” evaluation — context matters
- Relation to site tenets: Aligns with Tenet 4 (Pluralism) — multiple evaluation approaches invite multiple frames on capability.
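For contrast with static benchmarks, here is a minimal sketch of the pairwise-preference idea behind arena-style evaluation. ChatBot Arena fits a Bradley-Terry model over large numbers of human votes; the online Elo update below is a simpler cousin of that approach, and the K-factor, seed rating, and toy vote log are all illustrative assumptions.

```python
# Minimal Elo-style ratings from pairwise human preferences, the idea
# behind arena-style evaluation (ChatBot Arena fits a related
# Bradley-Terry model). K, the 1000 seed, and the vote log are all
# illustrative assumptions.

from collections import defaultdict

K = 32  # conventional Elo step size

def expected(r_a, r_b):
    """Modelled probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

ratings = defaultdict(lambda: 1000.0)

# (model_a, model_b, outcome): 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-z", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-z", "model-x", 0.0),
]

for a, b, outcome in votes:
    e = expected(ratings[a], ratings[b])
    ratings[a] += K * (outcome - e)
    ratings[b] += K * ((1.0 - outcome) - (1.0 - e))

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```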
Key Debates
Debate 1: Is “Human Expert Level” a meaningful threshold?
- Sides: AI companies use it as a marketing milestone; evaluation researchers say it is not scientifically meaningful
- Core disagreement: Whether a test score on a structured benchmark is epistemically equivalent to expert competence in a domain
- Current state: Ongoing; gaining traction as a mainstream critique after bar exam hallucination cases
Debate 2: Can benchmarks be fixed, or is the paradigm broken?
- Sides: Reformers (NIST, MLCommons, ChatBot Arena) vs. paradigm critics (Bender, Mitchell)
- Core disagreement: Whether better benchmarks can capture what matters, or whether benchmarking as a paradigm is fundamentally misaligned with what AI is actually being used for
- Current state: Active research area; NeurIPS 2025 had dedicated evaluation tracks
Debate 3: Anthropomorphic language and its epistemic effects
- Sides: Those who accept “reasoning” / “thinking” as useful shorthand; those who argue this language distorts evaluation and policy
- Core disagreement: Whether terms like “chain of thought”, “reasoning model”, “understanding” meaningfully describe AI processes or import unwarranted capability assumptions
- Current state: Unresolved; Mitchell argues these are pre-scientific concepts; Narayanan argues the language is practically unavoidable but must be used carefully
Historical Timeline
| Year | Event/Publication | Significance |
|---|---|---|
| 2019 | HellaSwag benchmark published | Early example of quickly saturated benchmarks; creator now acknowledges limits |
| 2020 | MMLU benchmark published | Becomes de facto standard despite scraping from amateur sources |
| 2021 | Bender et al., “On the Dangers of Stochastic Parrots” | Foundational critique of LLM capability framing |
| 2021 | Raji, Bender, et al., “AI and the Everything in the Whole Wide World Benchmark” | Argues general-knowledge benchmarks are “dangerous and deceptive” |
| 2023 | GPT-4 “passes” bar exam at 90th percentile | Triggers wave of lawyer-replacement predictions; soon followed by court sanctions for AI-hallucinated citations |
| 2023 | Google Gemini claims “first model to outperform human experts” on MMLU | Benchmark milestone used as marketing; immediately contested by evaluation researchers |
| 2024 | Narayanan & Kapoor, “AI Snake Oil” (Princeton University Press) | Mainstream book-length critique of AI capability inflation via benchmarks |
| 2024 | The Markup investigation into benchmark quality | Broad audience exposure to construct validity problems |
| 2025 | arXiv meta-review “Can We Trust AI Benchmarks?” (AAAI AIES) | ~100-study systematic review of benchmark failures |
| 2025 | Melanie Mitchell, NeurIPS 2025 keynote on evaluation | Names “crisis of evaluation”; calls for paradigm shift |
| 2026 | “Humanity’s Last Exam” (Nature) — expert-level benchmark | New harder benchmark; implicitly acknowledges saturation of all previous ones |
Potential Article Angles
“What Benchmark Scores Actually Measure” — Explains the construct validity problem from first principles; connects to Tenet 1 (intent-grounded evaluation) and Tenet 3 (symbiotic intelligence). Could argue for a new evaluation philosophy grounded in human intent hierarchies rather than abstract task accuracy.
“The Expert Benchmark Fallacy” — Names and defines the fallacy explicitly; walks through three case studies (medicine/bar exam/MMLU); situates it as a specific instance of Goodhart’s Law. Strong alignment with Tenet 3 (Symbiotic Intelligence over Automation) by showing how the fallacy supports harmful automation rhetoric.
“Why AI Evaluation Is a Crisis of Intent” — Frames the evaluation crisis as a failure to specify what we actually want AI to do, connecting to Tenet 1 (Human Intent First) and Tenet 2 (Context as Infrastructure). Benchmark failure is a symptom of not designing evaluation around human intent in context.
When writing the article, follow obsidian/project/writing-style.md for:
- Named-anchor summary technique for forward references
- Background vs. novelty decisions (what to include/omit)
- Tenet alignment requirements
- LLM optimisation (front-load important information)
Gaps in Research
- No formal philosophical paper explicitly named “Expert Benchmark Fallacy” found — the concept exists but has not been systematically named in the literature; an article naming it would have genuine novelty
- The arXiv validity-centred framework paper (arxiv.org/pdf/2505.10573) was not fully scraped; it may contain a directly relevant philosophical framework
- Limited coverage of non-Western perspectives on AI evaluation; Tenet 4 (Pluralism) suggests this gap
- No found sources on how the fallacy interacts with AI agent evaluation specifically (agents fail in more complex, dynamic ways than static benchmarks capture)
- Goodhart’s Law application to AI is documented empirically but not yet fully theorised philosophically
Citations
Keegan, Jon. “Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless.” The Markup, July 17, 2024. https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless
Narayanan, Arvind & Melanie Mitchell. “A Guide to Cutting Through AI Hype.” Princeton CITP Blog, April 2, 2025. https://blog.citp.princeton.edu/2025/04/02/a-guide-to-cutting-through-ai-hype-arvind-narayanan-and-melanie-mitchell-discuss-artificial-and-human-intelligence/
Mitchell, Melanie. “On Evaluating Cognitive Capabilities in Machines (and Other ‘Alien’ Intelligences).” AI: A Guide for Thinking Humans (Substack), January 14, 2026. https://aiguide.substack.com/p/on-evaluating-cognitive-capabilities
[Authors unknown from preview]. “Can We Trust AI Benchmarks? An Interdisciplinary Review of Quantitative AI Benchmarking Practices.” arXiv 2502.06559 / AAAI AIES 2025. https://arxiv.org/html/2502.06559v1
Narayanan, Arvind & Sayash Kapoor. AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference. Princeton University Press, 2024.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” FAccT 2021.
Raji, Inioluwa Deborah, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. “AI and the Everything in the Whole Wide World Benchmark.” arXiv 2111.15366, 2021.
Mitchell, Melanie. Artificial Intelligence: A Guide for Thinking Humans. Farrar, Straus and Giroux, 2019.