AI Benchmarks Are Lying to You (But Not in the Way You Think)

There’s a post doing the rounds this week about GPT-5.5 cracking something called ProgramBench for the first time. It’s a software engineering benchmark that’s been resistant to frontier models until now, and the result is genuinely interesting. But the discussion underneath it is, predictably, a mess.

Some of it is the usual stuff: people declaring their preferred model the winner, others pointing out the charts are misleading, a few useful technical observations buried under the noise. Normal internet discourse. What caught my attention wasn’t the headline result, though. It was a quieter observation someone made about the benchmark itself.