Goodhart's law states that when a measure becomes a target, it ceases to be a good measure. AI benchmarks are a textbook case. MMLU, HumanEval, GSM8K, MATH - these tests were designed to evaluate model capabilities across knowledge, reasoning, and problem-solving. They have also become targets. Models are evaluated on them, ranked by them, and in some cases trained on data that is suspiciously close to them.
The result is that benchmark scores and real-world utility have drifted apart. A model that scores highly on HumanEval - a coding benchmark using carefully formatted problems with clean test cases - may actually be worse than a lower-scoring model for real codebases where problems are ambiguous, documentation is incomplete, and test coverage is partial. The benchmark tests a specific slice of coding skill in a specific format. That slice is not representative of most coding tasks.
This is not unique to AI. Any field that develops standardized evaluation ends up producing entities that are good at the evaluation. The difference in AI is that the gap between benchmark performance and practical utility is particularly large right now, because the benchmarks are old relative to how fast the models are improving. A benchmark designed in 2022 to differentiate models of that era may be nearly saturated by 2026 - most frontier models score above 90% - while genuine capability differences that matter for applications are not captured at all.
There is a known response to this: create harder benchmarks. And the field has done this repeatedly. MMLU begat MMLU-Pro begat harder successors. HumanEval begat SWE-bench. The arms race continues. The problem is that harder benchmarks take time to develop and validate, and models improve faster than benchmarks can be refreshed. There is a structural lag that is hard to eliminate.
What should you actually do when selecting a model? Build your own evaluation. Define what good looks like for your specific task, create a test set from your actual application data, and compare models on that. This is more work than consulting a leaderboard, but the information it produces is actually relevant to your decision. The leaderboard tells you who won a tournament; your evaluation tells you who performs on your court with your rules.
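A minimal sketch of what such an evaluation can look like, in Python. The file name `test_cases.jsonl`, the model names, the exact-match scorer, and the `call_model` stub are all placeholders for illustration - wire them to your own data, scoring criteria, and model provider.

```python
import json
from collections import defaultdict

# Placeholder for your model client; swap in whatever API you actually use.
def call_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your own model provider")

def passes(expected: str, actual: str) -> bool:
    # Task-specific scoring; exact match is only a stand-in.
    # Replace with whatever "good" means for your application.
    return expected.strip().lower() == actual.strip().lower()

def run_eval(models: list[str], test_cases: list[dict]) -> dict[str, float]:
    """Score each candidate model on the same task-specific test set."""
    scores = defaultdict(int)
    for case in test_cases:
        for model in models:
            output = call_model(model, case["prompt"])
            if passes(case["expected"], output):
                scores[model] += 1
    return {m: scores[m] / len(test_cases) for m in models}

if __name__ == "__main__":
    # test_cases.jsonl: one {"prompt": ..., "expected": ...} object per line,
    # drawn from real application traffic rather than a public benchmark.
    with open("test_cases.jsonl") as f:
        cases = [json.loads(line) for line in f]
    results = run_eval(["model-a", "model-b"], cases)
    for model, score in sorted(results.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {score:.1%}")
```

Even a few dozen cases drawn from your own data will tell you more about the models you are choosing between than a saturated public leaderboard will.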
The deeper problem is that benchmark-chasing is not just a technical issue - it is an incentive issue. Benchmark scores are legible to investors, journalists, and decision-makers who cannot easily evaluate real-world performance. A jump on a public leaderboard generates coverage; a 5% improvement in a specific enterprise task does not. Until the incentives change, the chase will continue. The least you can do is not let it make your model selection decisions for you.