LLMs and the Measurement Problem

There's a curious parallel between quantum measurement and how we evaluate language models. The moment you create a benchmark, you change what gets optimized.

Goodhart's Law, Scaled

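Every benchmark we create becomes a target. Once models are trained, tuned, and selected to score well on a metric, the metric stops tracking the capability it was designed to measure. This is Goodhart's Law ("when a measure becomes a target, it ceases to be a good measure") at unprecedented scale.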
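A toy simulation makes the mechanism concrete. This is only a sketch under invented assumptions: each hypothetical "model" has a real capability and some amount of benchmark-specific tuning, both drawn at random, and the benchmark score rewards both. The `pressure` parameter, the population size, and the function name `simulate` are all made up for illustration and do not describe any real benchmark.

```python
import random
import statistics


def simulate(pressure, n_models=5000, top_k=100, seed=0):
    """Pick the top_k models by benchmark score and report what we got.

    pressure: how much benchmark-specific tuning contributes to the score
              (0.0 means the score reflects real capability only).
    Returns (mean benchmark score, mean real capability) of the winners.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        capability = rng.gauss(0, 1)  # what we actually care about
        tuning = rng.gauss(0, 1)      # optimization aimed at the benchmark itself
        score = capability + pressure * tuning
        models.append((score, capability))

    # Select purely on the benchmark, as a leaderboard would.
    models.sort(reverse=True)
    winners = models[:top_k]

    mean_score = statistics.mean(s for s, _ in winners)
    mean_capability = statistics.mean(c for _, c in winners)
    return mean_score, mean_capability


if __name__ == "__main__":
    for pressure in (0.0, 1.0, 3.0):
        score, capability = simulate(pressure)
        print(f"pressure={pressure}: score={score:.2f}, capability={capability:.2f}")
```

In this toy setup, raising `pressure` pushes the winners' benchmark scores up while their average real capability falls, because selection increasingly rewards tuning instead of capability. Real training dynamics are far messier, but the direction of the effect is the point.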

The most valuable capabilities are often the hardest to measure. How do you benchmark taste? How do you quantify whether an AI collaborator asks the right questions before writing code?

The Observer Effect

What I find philosophically interesting is how closely this resembles the observer effect in physics, which is often confused with Heisenberg's uncertainty principle: the act of measuring disturbs the thing being measured. The more precisely you try to pin down a model's capability, the more you distort it. Narrow evaluations produce narrow optimization.

In my experience, the models that feel most useful are the ones optimized for something upstream of any particular benchmark. They have something like general competence, which is exactly what benchmarks struggle to capture.

What This Means for Builders

If you're building with LLMs, stop obsessing over benchmark tables. The interesting question isn't "which model scores highest on HumanEval" but "which model makes me most productive in my specific workflow." That's a measurement only you can make, and it doesn't reduce to a single number.
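One lightweight way to make that measurement without collapsing it into a leaderboard score is to log blind side-by-side preferences on your own tasks and keep them grouped by the kind of work. The sketch below assumes you record each judgment by hand; the category names, the labels `model_a` and `model_b`, and the sample entries are placeholders, not results.

```python
from collections import Counter, defaultdict


def tally(judgments):
    """Group pairwise preferences by task category.

    judgments: iterable of (category, preferred_model) pairs, one per
               blind side-by-side comparison on a task you actually did.
    Returns {category: {model: times_preferred}}.
    """
    by_category = defaultdict(Counter)
    for category, preferred in judgments:
        by_category[category][preferred] += 1
    return {category: dict(counts) for category, counts in by_category.items()}


# Placeholder entries for illustration only; replace with your own log.
example_log = [
    ("refactoring", "model_a"),
    ("refactoring", "model_a"),
    ("refactoring", "model_b"),
    ("clarifying questions", "model_b"),
    ("clarifying questions", "model_b"),
]

if __name__ == "__main__":
    for category, counts in tally(example_log).items():
        print(category, counts)
```

The output is a per-category record, not a ranking. That is deliberate: a model can be the better choice for one kind of task and the worse choice for another, and averaging the two away would hide exactly the information you collected.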