AI Benchmarks Are Lying to You (But Not in the Way You Think)

There’s a post doing the rounds this week about GPT-5.5 cracking something called ProgramBench for the first time. It’s a software engineering benchmark that’s been resistant to frontier models until now, and the result is genuinely interesting. But the discussion underneath it is, predictably, a mess.

Some of it is the usual stuff: people declaring their preferred model the winner, others pointing out that the charts are misleading, a few genuinely useful technical observations buried under the noise. Normal internet discourse. What caught my attention, though, wasn’t the headline result. It was a quieter observation someone made about the benchmark itself.

The short version: some of the unit tests in ProgramBench include assertions for undocumented features, things a model couldn’t reasonably discover by reading the code or the spec. The only realistic way to pass those assertions is to have seen them before, which means a significant portion of any “progress” on this benchmark might be coming from contamination: the benchmark’s answers leaking into training data. Someone linked to a LessWrong post calling it out. The response from a few commenters was essentially: yes, this is a known problem, and it’s not unique to ProgramBench.
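To make that concrete, here’s the shape of the problem as I understand it. Everything in this sketch is invented for illustration (the `todo` program, the `--priority` flag, the assertions are all mine, not the benchmark’s), but it shows how a test can encode knowledge that exists nowhere in the task’s inputs:

```python
# Hypothetical illustration, not actual ProgramBench code.
# Suppose the spec documents `todo add <task>` and says nothing
# about priorities. This test asserts the undocumented part anyway.
import subprocess

def test_add_task_with_priority():
    result = subprocess.run(
        ["./todo", "add", "buy milk", "--priority", "high"],
        capture_output=True,
        text=True,
    )
    # A model that implements exactly what the spec describes fails
    # here. Passing requires knowledge that isn't in the code or the
    # spec, which is the contamination signal people flagged.
    assert result.returncode == 0
    assert "high" in result.stdout
```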

That’s the part that sticks with me.

We’ve been through this before with SWE-bench. It was a reasonable eval when it launched, and then models started being specifically optimised for it, and now a good SWE-bench score tells you less than it used to. ProgramBench was partly created to address that, to give the field a fresh surface to test against. And now there are already questions about whether it’s measuring what it claims to measure. The half-life of a useful AI benchmark seems to be getting shorter.

Someone in the thread posted examples of the actual test code, and honestly it’s worth reading if you want to feel less certain about everything. One test checks only that an application ran without crashing and produced some output; any program that prints anything passes it. Another invokes a command-line flag that doesn’t exist in the application under test. These aren’t edge cases; they’re in the benchmark being used to evaluate frontier models.
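For a sense of what those look like, here’s a rough reconstruction of the two patterns (the function names and the `./app` binary are mine; the actual ProgramBench tests will differ in detail):

```python
# Hypothetical reconstructions of the two test patterns described above.
import subprocess

def test_app_runs():
    # Pattern one: "ran without crashing and produced some output".
    # A one-line program that prints "hello" passes this.
    result = subprocess.run(["./app"], capture_output=True, text=True)
    assert result.returncode == 0
    assert len(result.stdout) > 0

def test_verbose_flag():
    # Pattern two: exercises a flag the real application never had.
    # A faithful implementation of the spec has no way to satisfy
    # this without guessing, or having seen it before.
    result = subprocess.run(["./app", "--verbose"], capture_output=True, text=True)
    assert result.returncode == 0
```

Neither test distinguishes a correct implementation from a trivial one, which is the whole problem in miniature.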

I want to be careful here, because I don’t think the conclusion is “AI progress is fake.” That’s too easy and probably wrong. The models genuinely are getting better at writing code; I use them every day and the improvement over the last eighteen months is real and noticeable. GPT-5.5 in particular has been solid for the kind of work I do. But there’s a difference between “this model is better at my actual job” and “this model scored higher on a benchmark,” and we keep conflating them.

The other thread running through the discussion is about Google, specifically how absent they’ve been from this particular conversation. Someone pointed out that the AI coding race has settled into a two-horse dynamic between OpenAI and Anthropic, and Gemini barely gets a mention. There’s probably a Google I/O announcement coming that will temporarily change the narrative, but the observation stands. A company with Google’s resources and access to compute should be more present here. Whether they’re quietly cooking something or genuinely behind, I don’t know.

What I keep coming back to is a comment from someone who said, essentially: ignore the benchmarks, try the tools yourself, and decide based on what actually works for you. Which is good advice. It’s also slightly unsatisfying, because “vibes-based evaluation” doesn’t scale, and we need some way to compare models that isn’t just anecdote.

The honest answer is we don’t have a great solution to this yet. Benchmarks get contaminated or gamed. Human evaluation is slow and expensive and inconsistent. Real-world task performance is hard to measure systematically. And the field is moving fast enough that any evaluation framework risks being obsolete before it’s properly validated.

Someone suggested testing GPT-5.5 on writing a functional FPGA operating system. That made me laugh. It’s absurd, but it’s also pointing at something real: the tasks we actually care about are often too complex, too contextual, and too domain-specific to capture in a benchmark. The gap between “passes unit tests” and “writes software I’d trust in production” is enormous, and we don’t have good tools for measuring it.

I don’t think that means we should give up on evals. It means we should hold the numbers more loosely than the announcements encourage us to. A model solving ProgramBench for the first time is worth noting. It’s not worth the breathless framing it tends to get.