Benchmarking Yourself Against the Machines

Someone on Reddit built a tool that lets you benchmark yourself against AI language models. Same tests, same scoring. You sit down, answer the questions, and find out what size model you approximate.

The post took off, mostly because the original poster was having an absolute blast in the comments, treating themselves like a product listing. Quantization options. Token pricing. VRAM requirements. The bit where someone asked if they’d fit on an 8GB graphics card, and they replied that they’d had Coca-Cola and cheesecake before testing “for an extra pump”, is genuinely funny. The whole thread had the energy of someone who understood exactly what they’d made and leaned into it without overselling it.

The tool itself is straightforward. A Streamlit app. You answer benchmark questions, it scores you, it tells you where you land relative to various models. The poster apparently scored around 15 billion parameters equivalent, which one commenter noted is roughly in the ballpark of the neuron count of the human neocortex. Whether that means anything useful, I genuinely don’t know. But it’s an interesting number to sit with.

Here’s the thing that caught me, though. There’s no “I don’t know” button. Someone pointed this out and suggested adding one that automatically marks you wrong. The poster confirmed the omission is intentional: LLMs don’t get that option, so neither do you. Which is either a fair comparison methodology or a quiet indictment of how we’ve built these systems, depending on how you look at it. Probably both.
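
For the curious, the commenter's idea of a skip that scores as a wrong answer is easy to picture in code. What follows is a minimal sketch of my own, not the app's actual scoring logic, which I haven't seen. The function name, the answer key, and the sample answers are all made up for illustration.

```python
from typing import Optional, Sequence


def accuracy(answers: Sequence[Optional[str]], key: Sequence[str]) -> float:
    """Fraction of questions answered correctly.

    Hypothetical helper: None stands for "I don't know" and is counted
    as a wrong answer, so skipping never beats guessing.
    """
    if len(answers) != len(key):
        raise ValueError("answers and key must be the same length")
    if not key:
        raise ValueError("need at least one question")
    correct = sum(
        1
        for given, expected in zip(answers, key)
        if given is not None
        and given.strip().lower() == expected.strip().lower()
    )
    return correct / len(key)


# Made-up four-question run: right, wrong, skipped, right (case-insensitive).
key = ["B", "D", "A", "C"]
answers = ["B", "A", None, "c"]
print(accuracy(answers, key))  # 0.5
```

Under this rule a skip and a wrong guess cost the same, so on multiple-choice questions you might as well guess. That is arguably the same incentive the models are working under.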

I’ve been in IT long enough to remember when benchmarks were for CPUs. You ran Cinebench, you got a number, you compared it to other CPUs. Clean. The number meant something specific. Benchmarking a person the same way carries a lot of assumptions about what intelligence is, what it’s for, and whether a test score captures any of it. I’m not saying the tool is wrong to do it. I’m saying the joke and the actual question are closer together than they look.

What I find genuinely interesting, not just as a gag, is that most people doing this will probably score lower than they expect on certain tasks and higher than they expect on others. That asymmetry is the useful thing. We’re bad at knowing which parts of our thinking are actually good. Models have the same problem in a different direction: they’re eerily competent at things that feel hard and weirdly terrible at things that feel easy. The overlap zone is where it gets complicated.

Someone in the thread made a Neuromancer reference, warning that we might get “constructs in a box at varying degrees of sanity depending on quantization.” Dramatic, but not entirely wrong as a direction of travel. The commenter who replied that they wake up feeling like Q2_K_S on some mornings is more relatable than I’d like to admit.

The poster eventually dropped the app link after a bit of chaos, having posted the thread before the tool was actually ready. Planning, they admitted, is not a skill they’re strong on. Honestly, same. There’s something reassuring about that. Fifteen billion parameters equivalent, terrible at project management. Very on-brand for the industry.

I ran through a few of the questions. I’m not going to tell you my score, partly because I’m not sure what it means, and partly because the honest answer is that I stopped partway through when I realised I was starting to feel vaguely competitive about a benchmark designed as a joke. That instinct alone probably tells you something about how these things get under your skin.

The tool is at benchmark-yourself.streamlit.app if you want to have a go. Set aside the part of your brain that wants to study for it first. That part is the problem.