Research Notes: The “Expert Benchmark” Fallacy in AI Evaluation
Date: 2026-03-11
Search queries used:
- “Expert Benchmark fallacy AI evaluation critique”
- “AI benchmark human expert performance misleading evaluation problems”
- “AI surpasses human experts benchmark critique misleading capability claims philosophy”
- “benchmark saturation AI Goodhart’s law evaluation gaming problems 2024 2025”
- “Emily Bender Arvind Narayanan AI benchmark validity problems human level performance critique”
- “Melanie Mitchell AI benchmark broken critique generalization reasoning”
Executive Summary
The “Expert Benchmark Fallacy” is not yet a formally named philosophical concept, but it describes a well-documented epistemic error at the heart of AI capability claims. It occurs when an AI system scores at or above “human expert level” on a narrow benchmark test, and that score is then treated as evidence of expert-level competence in the full professional domain. The fallacy conflates task-specific metric performance with genuine domain understanding. Three interlocking problems drive it: construct invalidity (benchmarks do not measure what they claim to measure), Goodhart’s Law (optimising for a measure destroys its validity as an indicator), and the extrapolation error Arvind Narayanan describes, in which a one-dimensional benchmark improvement curve is projected forward as if it represented the whole job. Across medicine, law, and mathematics, AI systems routinely “pass” professional-licensing-style tests while failing at the full practice those tests are meant to gate.
Key Sources
The Markup — “Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless”
- URL: https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless
- Type: Investigative journalism
- Key points:
- MMLU (2020) comprises ~15,000 multiple-choice questions collected by Mechanical Turk workers from amateur websites, yet is cited as evidence of “expert-level” performance
- Emily M. Bender: “The creators of the benchmark have not established that the benchmark actually measures understanding.”
- Arvind Narayanan: “Many benchmarks are of low quality. Despite this, once a benchmark becomes widely used, it tends to be hard to switch away from it.”
- Rowan Zellers (HellaSwag co-creator): “It’s sort of like we kind of just made these benchmarks up.”
- Google’s Gemini scored 90.0% on MMLU — announced as “first model to outperform human experts” — but this claim is contested on construct validity grounds
- Tenet alignment: Supports Tenet 1 (Human Intent First) by showing that benchmarks are designed around model capabilities, not human intent in real contexts. Supports Tenet 5 (Always scalable) by showing the measurement gap between proxy and reality.
- Quote: “The yardsticks are, like, pretty fundamentally broken.” — Maarten Sap, CMU
Princeton CITP — Narayanan & Mitchell: A Guide to Cutting Through AI Hype
- URL: https://blog.citp.princeton.edu/2025/04/02/a-guide-to-cutting-through-ai-hype-arvind-narayanan-and-melanie-mitchell-discuss-artificial-and-human-intelligence/
- Type: Edited lecture transcript (Princeton Public Lecture, March 27, 2025)
- Key points:
- Narayanan: “When AI researchers predict that AI will take over some job, the basis for that prediction is an incredibly narrow and shallow understanding of what the job actually involves. The researcher defines a one-dimensional benchmark that captures a tiny aspect of the job, sees that AI performance improves rapidly over time on that benchmark, projects it forward, and concludes that AI will surpass humans and take over the job in three years.”
- Mitchell: “Passing those exams doesn’t mean the systems can do the other things we’d expect from a human who passed them. So just looking at behavior on tests or benchmarks isn’t always informative. That’s something people in the field have referred to as a crisis of evaluation.”
- Narayanan: “Many lawyers have been sanctioned by courts for submitting briefs filled with fake citations generated by chatbots.” — evidence that bar exam passage does not imply legal competence.
- Mitchell: AI lacks metacognition (awareness of one’s own reasoning reliability) and episodic memory — both fundamental to actual professional practice.
- Narayanan: The state of AI evaluation is “like the auto industry before independent safety testing.”
- Tenet alignment: Strongly aligns with Tenet 3 (Symbiotic Intelligence) — critiques automation rhetoric that displaces human judgment. Aligns with Tenet 2 (Context as Infrastructure) — expertise is deeply contextual, benchmarks strip that context away.
- Quote: “We can’t read much into the fact that a model passed the bar exam or the medical licensing exam. A lawyer’s job isn’t just answering bar exam questions all day.”
Melanie Mitchell — “On Evaluating Cognitive Capabilities in Machines” (NeurIPS 2025 Keynote)
- URL: https://aiguide.substack.com/p/on-evaluating-cognitive-capabilities
- Type: Academic keynote write-up (Substack, January 2026)
- Key points:
- AI companies often use benchmarks to test narrow tasks but make sweeping capability claims
- “Understanding” is ill-defined and multidimensional — we cannot simply say an AI system does or does not understand
- The “fallacy of dumb superintelligence”: a system can score superhuman on narrow tests while lacking basic common sense in adjacent tasks
- Granular testing focused on abstract generalisation is needed to understand true capabilities
- Anthropomorphic language in evaluation (“reasoning models”, “chain of thought”) shapes how performance is perceived
- Tenet alignment: Aligns with Tenet 4 (Pluralism of Perspectives) — multiple frames and measurement approaches needed. Aligns with Tenet 1 (Human Intent First) — evaluation should be grounded in actual human use cases.
arXiv — “Can We Trust AI Benchmarks? An Interdisciplinary Review”
- URL: https://arxiv.org/html/2502.06559v1 / https://ojs.aaai.org/index.php/AIES/article/download/36595/38733/40670
- Type: Academic meta-review (~100 studies)
- Key points:
- Systemic flaws: misaligned incentives, construct validity issues, unknown unknowns, Goodhart’s Law in action
- “When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law applied to AI benchmarks
- Benchmark gaming is empirically documented (LiveCodeBench evidence cited in related sources)
- Data contamination: models trained on test data achieve inflated scores; a minimal overlap check is sketched after this source’s notes
- Tenet alignment: Aligns with Tenet 5 (Always scalable) — proxy measurement may scale effort without scaling quality.
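To make the data-contamination point concrete, here is a minimal sketch of the kind of n-gram overlap check used to flag leaked test items. The 8-gram window, the helper names, and the toy strings are illustrative assumptions, not the method of any paper cited above.

```python
# A toy n-gram overlap check for test-set contamination. The 8-gram
# window and the miniature corpora are illustrative assumptions; real
# contamination audits apply the same idea at corpus scale.

def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of `text`, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))}

def contamination_rate(train_docs: list, test_items: list, n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the training corpus."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items)

# One test question leaked verbatim into the training data; one did not.
train = ["exam prep: the patient presents with acute chest pain radiating "
         "to the left arm what is the most likely diagnosis answer below"]
test = [
    "the patient presents with acute chest pain radiating to the left arm "
    "what is the most likely diagnosis",
    "which doctrine governs diversity jurisdiction in federal courts",
]
print(f"contamination rate: {contamination_rate(train, test):.0%}")  # 50%
```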
NBC News — “AI’s capabilities may be exaggerated by flawed tests”
- URL: https://www.nbcnews.com/tech/tech-news/ai-chatgpt-test-smart-capabilities-may-exaggerated-flawed-study-rcna241969
- Type: News coverage of research
- Key points:
- Methods used to evaluate AI systems “routinely oversell AI performance”
- Lack of standardisation makes cross-model comparisons unreliable
- Tenet alignment: Neutral/supportive of site perspective.
Major Positions
Position 1: The Construct Validity Critique
- Proponents: Emily M. Bender (UW), Melanie Mitchell (Santa Fe Institute), Su Lin Blodgett (Microsoft Research)
- Core claim: Benchmarks lack construct validity — they do not measure the thing they claim to measure (“intelligence”, “reasoning”, “expert-level competence”). Passing a multiple-choice test about medicine is not the same as clinical reasoning.
- Key arguments:
- LLMs predict token sequences; they do not “understand” in the way humans do when they answer correctly
- “Multiple choice” format eliminates the open-ended, ambiguous nature of real expert work; the shortcut-heuristic sketch after this position’s notes shows how the format itself can be gamed
- Benchmarks designed before today’s models were built cannot meaningfully test them
- Relation to site tenets: Strongly aligned with Tenet 1 and Tenet 3 — emphasises that human professional capability is contextual, embodied, and intentional, not reducible to test scores.
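A deliberately crude baseline makes the construct-validity point concrete: a “model” that knows nothing about medicine and always picks the longest option can beat chance whenever correct answers are systematically longer. The questions below are fabricated for illustration; the general pattern of surface-cue exploitation is documented in the benchmark-artifacts literature.

```python
# A deliberately ignorant baseline: always pick the longest option.
# The questions are fabricated for illustration; the point is that
# surface regularities in answer options can produce above-chance
# scores with zero domain understanding.

questions = [
    {"options": ["yes", "no",
                 "increase the dose gradually while monitoring renal function",
                 "stop treatment"],
     "answer": 2},
    {"options": ["order a chest x-ray",
                 "administer intravenous fluids and reassess within the hour",
                 "discharge", "refer"],
     "answer": 1},
    {"options": ["refer the patient for an urgent cardiology consultation",
                 "aspirin", "rest", "no action"],
     "answer": 0},
]

def longest_option(options):
    """Index of the longest option string."""
    return max(range(len(options)), key=lambda i: len(options[i]))

correct = sum(1 for q in questions if longest_option(q["options"]) == q["answer"])
print(f"longest-option heuristic: {correct}/{len(questions)} (chance: 25%)")
```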
Position 2: Goodhart’s Law / Benchmark Gaming
- Proponents: Arvind Narayanan (Princeton), Sayash Kapoor, goodeyelabs.com research
- Core claim: Once a benchmark becomes a target, AI developers optimise their models to score well on it, which decouples the score from the underlying capability. This is Goodhart’s Law applied to AI evaluation; a toy simulation follows this position’s arguments.
- Key arguments:
- Data contamination: test data leaks into training data
- Hyperparameter tuning specific to benchmark tasks
- LiveCodeBench provides empirical evidence of benchmark gaming post-saturation
- Relation to site tenets: Aligns with Tenet 5 (Always scalable) — the effort invested in benchmark optimisation does not proportionally improve real-world usefulness.
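A toy hill-climbing simulation shows the decoupling mechanism: “skill” raises real capability, “gaming” (say, memorising leaked items or tuning to quirks of the test) raises only the proxy, and the optimiser sees only the proxy. Every functional form and constant here is an illustrative assumption, not a model of any real training run.

```python
# Goodhart's Law as a toy simulation: hill-climb on a proxy score that
# rewards "gaming" (e.g. memorising leaked test items) more cheaply
# than genuine "skill". All numbers are illustrative assumptions.

import random

def benchmark_score(skill, gaming):
    # The proxy conflates real skill with benchmark-specific tricks.
    return skill + 2.0 * gaming

def true_capability(skill, gaming):
    # Real-world competence depends on skill alone.
    return skill

random.seed(0)
skill = gaming = 0.0
for _ in range(1000):
    # Gaming is easier to move than skill (larger step size).
    d_skill, d_gaming = random.gauss(0, 0.01), random.gauss(0, 0.05)
    # Accept any tweak that raises the *proxy*, the only visible signal.
    if benchmark_score(skill + d_skill, gaming + d_gaming) > benchmark_score(skill, gaming):
        skill, gaming = skill + d_skill, gaming + d_gaming

print(f"benchmark score : {benchmark_score(skill, gaming):6.1f}")
print(f"true capability : {true_capability(skill, gaming):6.1f}")
# The proxy soars while the capability it was meant to track barely moves.
```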
Position 3: The Extrapolation Error
- Proponents: Arvind Narayanan, Melanie Mitchell
- Core claim: Researchers observe rapid benchmark improvement, project that curve forward, and conclude the entire job will be automated. This is a category error: the benchmark captures one narrow dimension, while the job has hundreds. A toy projection follows this position’s arguments.
- Key arguments:
- Geoffrey Hinton’s 2016 prediction that radiologists would be obsolete within 5 years (has not occurred)
- “The researcher defines a one-dimensional benchmark that captures a tiny aspect of the job” — Narayanan
- Lawyers sanctioned for submitting AI-hallucinated citations despite GPT-4 “passing the bar exam”
- Relation to site tenets: Directly aligned with Tenet 1 (Human Intent First) — intent in professional work is complex, relational, and contextual, not reducible to a test score. Aligns with Tenet 3 (Symbiotic Intelligence) — automation-replacement rhetoric ignores what humans uniquely contribute.
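The extrapolation error is easy to reproduce numerically: fit a straight line to early gains on a curve that actually saturates, then project it forward. The logistic “true” curve and its 90% ceiling below are illustrative assumptions about benchmark dynamics, not measured data.

```python
# The extrapolation error in a few lines of arithmetic: benchmark
# progress that actually saturates (logistic, 90% ceiling) looks
# linear early on, and a linear projection sails past 100%.

import math

def true_score(year):
    """Saturating benchmark progress with a ceiling at 90%."""
    return 90.0 / (1.0 + math.exp(-(year - 2.0)))

# Fit a line through the first and last observed years (0 and 3).
observed = {y: true_score(y) for y in range(4)}
slope = (observed[3] - observed[0]) / 3
intercept = observed[0]

for year in range(8):
    projection = intercept + slope * year
    print(f"year {year}: true {true_score(year):5.1f}%   linear projection {projection:6.1f}%")
# The projection crosses 100% around year 5; the true curve flattens below 90%.
```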
Position 4: The “Crisis of Evaluation” (reform position)
- Proponents: Mitchell, Narayanan, Stanford HAI, MLCommons
- Core claim: The problem is real but solvable — we need independent third-party evaluation, multi-dimensional benchmarks, and “midstream evaluation” frameworks that are context-specific without being deployment-specific.
- Key arguments:
- ChatBot Arena (human-in-the-loop pairwise preference) as an alternative model; a minimal rating sketch follows this list
- NIST role in developing standardised evaluation frameworks
- “Upstream” vs. “downstream” vs. “midstream” evaluation — context matters
- Relation to site tenets: Aligns with Tenet 4 (Pluralism) — multiple evaluation approaches invite multiple frames on capability.
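For contrast with static benchmarks, here is a minimal sketch of the pairwise-preference idea behind arena-style evaluation. ChatBot Arena fits a Bradley-Terry model over large numbers of human votes; the online Elo update below is a simpler cousin of that approach, and the K-factor, seed rating, and toy vote log are all illustrative assumptions.

```python
# Minimal Elo-style ratings from pairwise human preferences, the idea
# behind arena-style evaluation (ChatBot Arena fits a related
# Bradley-Terry model). K, the 1000 seed, and the vote log are all
# illustrative assumptions.

from collections import defaultdict

K = 32  # conventional Elo step size

def expected(r_a, r_b):
    """Modelled probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

ratings = defaultdict(lambda: 1000.0)

# (model_a, model_b, outcome): 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-z", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-z", "model-x", 0.0),
]

for a, b, outcome in votes:
    e = expected(ratings[a], ratings[b])
    ratings[a] += K * (outcome - e)
    ratings[b] += K * ((1.0 - outcome) - (1.0 - e))

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```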
Key Debates
Debate 1: Is “Human Expert Level” a meaningful threshold?
- Sides: AI companies use it as a marketing milestone; evaluation researchers say it is not scientifically meaningful
- Core disagreement: Whether a test score on a structured benchmark is epistemically equivalent to expert competence in a domain
- Current state: Ongoing; gaining traction as a mainstream critique after bar exam hallucination cases
Debate 2: Can benchmarks be fixed, or is the paradigm broken?
- Sides: Reformers (NIST, MLCommons, ChatBot Arena) vs. paradigm critics (Bender, Mitchell)
- Core disagreement: Whether better benchmarks can capture what matters, or whether benchmarking as a paradigm is fundamentally misaligned with what AI is actually being used for
- Current state: Active research area; NeurIPS 2025 had dedicated evaluation tracks
Debate 3: Anthropomorphic language and its epistemic effects
- Sides: Those who accept “reasoning” / “thinking” as useful shorthand; those who argue this language distorts evaluation and policy
- Core disagreement: Whether terms like “chain of thought”, “reasoning model”, “understanding” meaningfully describe AI processes or import unwarranted capability assumptions
- Current state: Unresolved; Mitchell argues these are pre-scientific concepts; Narayanan argues the language is practically unavoidable but must be used carefully
Historical Timeline
| Year | Event/Publication | Significance |
|---|---|---|
| 2019 | HellaSwag benchmark published | Early example of quickly saturated benchmarks; creator now acknowledges limits |
| 2020 | MMLU benchmark published | Becomes de facto standard despite scraping from amateur sources |
| 2021 | Bender et al., “On the Dangers of Stochastic Parrots” | Foundational critique of LLM capability framing |
| 2021 | Raji, Bender, et al., “AI and the Everything in the Whole Wide World Benchmark” | Argues general-knowledge benchmarks are “dangerous and deceptive” |
| 2023 | GPT-4 “passes” bar exam at 90th percentile | Triggers wave of lawyer-replacement predictions; soon followed by court sanctions for AI-hallucinated citations |
| 2023 | Google Gemini claims “first model to outperform human experts” on MMLU | Benchmark milestone used as marketing; immediately contested by evaluation researchers |
| 2024 | Narayanan & Kapoor, “AI Snake Oil” (Princeton University Press) | Mainstream book-length critique of AI capability inflation via benchmarks |
| 2024 | The Markup investigation into benchmark quality | Broad audience exposure to construct validity problems |
| 2025 | arXiv meta-review “Can We Trust AI Benchmarks?” (AAAI AIES) | ~100-study systematic review of benchmark failures |
| 2025 | Melanie Mitchell, NeurIPS 2025 keynote on evaluation | Names “crisis of evaluation”; calls for paradigm shift |
| 2026 | “Humanity’s Last Exam” (Nature) — expert-level benchmark | New harder benchmark; implicitly acknowledges saturation of all previous ones |
Potential Article Angles
“What Benchmark Scores Actually Measure” — Explains the construct validity problem from first principles; connects to Tenet 1 (intent-grounded evaluation) and Tenet 3 (symbiotic intelligence). Could argue for a new evaluation philosophy grounded in human intent hierarchies rather than abstract task accuracy.
“The Expert Benchmark Fallacy” — Names and defines the fallacy explicitly; walks through three case studies (medicine/bar exam/MMLU); situates it as a specific instance of Goodhart’s Law. Strong alignment with Tenet 3 (Symbiotic Intelligence over Automation) by showing how the fallacy supports harmful automation rhetoric.
“Why AI Evaluation Is a Crisis of Intent” — Frames the evaluation crisis as a failure to specify what we actually want AI to do, connecting to Tenet 1 (Human Intent First) and Tenet 2 (Context as Infrastructure). Benchmark failure is a symptom of not designing evaluation around human intent in context.
When writing the article, follow obsidian/project/writing-style.md for:
- Named-anchor summary technique for forward references
- Background vs. novelty decisions (what to include/omit)
- Tenet alignment requirements
- LLM optimisation (front-load important information)
Gaps in Research
- No formal philosophical paper explicitly named “Expert Benchmark Fallacy” found — the concept exists but has not been systematically named in the literature; an article naming it would have genuine novelty
- The arXiv validity-centred framework paper (arxiv.org/pdf/2505.10573) was not fully scraped; it may contain a directly relevant philosophical framework
- Limited coverage of non-Western perspectives on AI evaluation; Tenet 4 (Pluralism) suggests this gap
- No found sources on how the fallacy interacts with AI agent evaluation specifically (agents fail in more complex, dynamic ways than static benchmarks capture)
- Goodhart’s Law application to AI is documented empirically but not yet fully theorised philosophically
Citations
Keegan, Jon. “Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless.” The Markup, July 17, 2024. https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless
Narayanan, Arvind & Melanie Mitchell. “A Guide to Cutting Through AI Hype.” Princeton CITP Blog, April 2, 2025. https://blog.citp.princeton.edu/2025/04/02/a-guide-to-cutting-through-ai-hype-arvind-narayanan-and-melanie-mitchell-discuss-artificial-and-human-intelligence/
Mitchell, Melanie. “On Evaluating Cognitive Capabilities in Machines (and Other ‘Alien’ Intelligences).” AI: A Guide for Thinking Humans (Substack), January 14, 2026. https://aiguide.substack.com/p/on-evaluating-cognitive-capabilities
[Authors unknown from preview]. “Can We Trust AI Benchmarks? An Interdisciplinary Review of Quantitative AI Benchmarking Practices.” arXiv 2502.06559 / AAAI AIES 2025. https://arxiv.org/html/2502.06559v1
Narayanan, Arvind & Sayash Kapoor. AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference. Princeton University Press, 2024.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” FAccT 2021.
Raji, Inioluwa Deborah, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. “AI and the Everything in the Whole Wide World Benchmark.” arXiv 2111.15366, 2021.
Mitchell, Melanie. Artificial Intelligence: A Guide for Thinking Humans. Farrar, Straus and Giroux, 2019.