Are We Just Teaching AI to Cheat on Tests?
A question from AI discussion circles has been rattling around in my head lately. It goes something like this: “Should I walk or drive to the car wash?” The obvious catch is that your car needs to be at the car wash for it to be washed. Simple, right? Lateral thinking 101. And yet, for a while, many of the big frontier models kept confidently telling people to walk.
DeepSeek’s latest model apparently gets it right. But here’s the thing that caught my attention in the discussion: it refers to the question as a “classic riddle.” And that little detail opens up a much bigger can of worms.
Because it’s not a classic riddle. It’s a recent AI gotcha test that’s been doing the rounds on Reddit and tech forums. The model only “knows” it’s a riddle because it’s almost certainly seen the question — or close variants of it — in its training data. It’s not reasoning its way to the answer. It’s pattern-matching. It’s essentially peeked at the answer sheet.
Now, I want to be fair here. Some people in the discussion made a genuinely interesting philosophical point: humans learn the same way. We reason about novel situations by drawing on prior experience and pattern recognition. Nobody is born knowing that you need to drive your car to the car wash. But there’s still something that feels qualitatively different about what humans do versus what these models are doing. A reasonably switched-on person who had never encountered this specific scenario could probably work through it in a few seconds. The model, without the training data anchor, apparently struggles.
This gets to the heart of something that’s been nagging at me about the current AI hype cycle. We keep benchmarking these models on problems that eventually end up in their training data. It’s a bit like teaching to the test, except the test keeps getting absorbed into the curriculum. Someone in the thread put it bluntly: there’s now so much “gotcha bullshit” in training data that models are essentially memorising the gotchas rather than developing genuine flexibility. And once a question goes viral — the strawberry letter-counting one being the famous earlier example — you can basically write off that question as a meaningful test of reasoning.
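To make that concrete: contamination checks in the research literature often boil down to counting n-gram overlap between a benchmark question and the training corpus. Here’s a deliberately minimal sketch of the idea, with a made-up two-document “corpus” standing in for the trillions of tokens a real lab would actually scan:

```python
# Minimal sketch of an n-gram overlap contamination check. The idea:
# if long word sequences from a benchmark question already appear in
# the training corpus, the model may have memorised the question
# rather than reasoned about it.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-word sequences in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(question: str, corpus: list[str], n: int = 5) -> float:
    """Fraction of the question's n-grams found anywhere in the corpus."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus))
    return len(q_grams & corpus_grams) / len(q_grams)

# Illustrative only: a tiny "corpus" that happens to contain a viral
# variant of the riddle, plus one unrelated document.
corpus = [
    "reddit thread: should i walk or drive to the car wash? the catch"
    " is that the car needs to be at the car wash to get washed",
    "an unrelated document about batch brew ratios and grind size",
]
question = "Should I walk or drive to the car wash?"
print(f"overlap: {contamination_score(question, corpus):.0%}")
```

Real contamination studies work at vastly larger scale and with fuzzier matching, but the principle is the same: once the overlap is high, the question stops being a fair test of anything.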
What I find genuinely worrying isn’t that the models get these trick questions wrong. It’s that we might be mistaking “has seen this before” for “can actually think.” The leaderboards keep climbing. The benchmark scores keep improving. But how much of that is real generalised reasoning, and how much is increasingly sophisticated retrieval with a thin layer of inference on top?
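One way researchers try to separate the two is perturbation testing: take a question the model aces, rewrite the surface details while keeping the logic intact, and see whether the accuracy survives. The sketch below uses a toy stand-in for a real model API (the `query_model` function and the variant list are illustrative, not anything a lab actually ships) to show what the failure signature looks like:

```python
# Sketch of a perturbation test: if a model only passes the viral
# phrasing of a riddle but fails logically identical variants,
# that points to memorisation rather than reasoning.

# Surface variants of the same puzzle: the thing that needs the
# service has to be brought to the service. Expected answers are
# checked by crude substring match, which is enough for a sketch.
VARIANTS = [
    ("Should I walk or drive to the car wash?", "drive"),
    ("Should I walk or cycle to the bike repair shop?", "cycle"),
    ("Should I go on foot or take my car to the mechanic?", "car"),
]

# Toy stand-in for a real model API: it has "memorised" only the
# viral phrasing and falls back to a confident wrong answer.
MEMORISED = {"Should I walk or drive to the car wash?": "drive"}

def query_model(prompt: str) -> str:
    return MEMORISED.get(prompt, "walk")

def perturbation_accuracy(variants: list[tuple[str, str]]) -> float:
    """Fraction of variants the model answers correctly."""
    correct = sum(expected in query_model(prompt).lower()
                  for prompt, expected in variants)
    return correct / len(variants)

print(f"accuracy across variants: {perturbation_accuracy(VARIANTS):.0%}")
```

If accuracy is near perfect on the viral phrasing and falls off a cliff on paraphrases, you’re looking at retrieval dressed up as reasoning.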
My daughter asked me the other week whether AI was actually intelligent or just “a really good search engine that sounds confident.” She’s seventeen and she landed on a question that AI researchers have been wrestling with for years. I didn’t have a clean answer for her, which is probably the most honest response I could give.
The rate-limit issue that came up in the same discussion thread is almost a comedy in its own right: people burning through their free quotas on single questions and then rotating through multiple accounts. It’s a reminder that behind all this breathtaking technology sits a very normal commercial calculus. These companies are spending extraordinary amounts on compute, and they need to monetise. Nothing wrong with that, but it’s useful grounding whenever we’re tempted to think of these systems as some kind of neutral public utility.
The thing is, I’m not trying to be a doomer about any of this. The progress genuinely is remarkable, and I find myself using these tools every day in my work. But I think it matters that we’re honest about what kind of progress we’re seeing. Getting a viral riddle right because it’s in your training data is not the same as generalised reasoning. And if we keep confusing the two, we’re going to keep making poor decisions about where to trust these systems and where to remain sceptical.
The car wash question will eventually be useless as a test. So will the next one. The real question — the one worth sitting with over a decent batch brew — is whether we’re building something that can genuinely reason about the next novel problem it’s never encountered, or just something very good at recognising the shape of problems it’s been shown before. Right now, I think the honest answer is: we’re not entirely sure, and we should probably stop pretending otherwise.