The Benchmark Wars: Why I'm Cautiously Optimistic About Open Source AI
There’s been a lot of chatter online lately about Kimi-K2.5, an open-source AI model that’s supposedly beating Claude Opus 4.5 in various benchmarks, particularly in coding tasks. The reactions have been… well, let’s just say they’ve been interesting.
The conversation reminded me of watching my daughter study for her VCE exams. She’d ace practice tests but then stress about whether that actually meant she’d perform on the day. Turns out, AI models face a similar problem – performing well on benchmarks doesn’t always translate to real-world capability.
What struck me most wasn’t the claimed performance improvements, but the deep cynicism in the responses. The phrase “benchmaxxing” kept coming up – essentially the accusation that models are being trained specifically to excel at benchmark tests rather than actual practical tasks. Someone pointed out that when it’s a model people like, benchmarks are treated as gospel. When it’s not, suddenly everyone’s a skeptic. That double standard bothers me, even if I understand where it comes from.
The geopolitical undertones are hard to ignore too. There’s this assumption floating around that Chinese models are trained on benchmarks while Western models are trained for “real tasks.” Really? That seems like a convenient narrative that doesn’t hold up to scrutiny. Someone else suggested the opposite might actually be true, and honestly, without transparent data, who really knows?
From my DevOps background, I know that optimising for tests is a genuine problem. We see it all the time in software testing – teams game their metrics, and suddenly you’ve got 100% code coverage but the application still crashes in production. But I also know that dismissing all benchmarks is throwing the baby out with the bathwater.
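To make that concrete, here's a minimal sketch (my own toy example, nothing from the discussion) of how a test suite can hit 100% coverage while still letting an obvious crash through:

```python
# Toy illustration of "gaming the metric": the test executes every line of
# the function, so a coverage tool reports 100%, but it never checks the
# result or the edge case, so the crash ships anyway.

def average(values):
    # Bug: raises ZeroDivisionError on an empty list instead of handling it.
    return sum(values) / len(values)

def test_average():
    # Runs only the happy path and asserts nothing meaningful,
    # so the coverage number looks perfect while the failure mode goes untested.
    average([1, 2, 3])

if __name__ == "__main__":
    test_average()
    print("Coverage: 100%. Production: still crashes on average([]).")
```

The metric is satisfied, but the behaviour was never actually checked – and a benchmark score can fail in exactly the same way.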
What genuinely excites me here isn’t whether Kimi beats Claude or not. It’s that we’re having this conversation about an open-source model at all. Someone in the discussion made a crucial point: even if Kimi is only close to Opus 4.5, that’s remarkable because it’s a fraction of the price and, critically, it’s open.
The open-source aspect matters enormously. The ability to run models offline, to fine-tune them for specific tasks, to not worry about a company suddenly changing its pricing or nerfing its model (yes, looking at you, everyone who’s done this) – these are real advantages that don’t show up in any benchmark.
There’s a practical hurdle, though. One commenter joked about needing to borrow a 1.2TB VRAM GPU. It’s funny because it’s true – the hardware requirements for running these models locally are absurd. That points towards a concerning future where “open source” becomes theoretical rather than practical for most people. We end up back at renting compute from large providers, just with different branding.
The environmental impact keeps nagging at me too. We’re in a race to build bigger models requiring more hardware, more power, more everything. Meanwhile, we’re supposedly trying to address climate change. The cognitive dissonance is exhausting.
Several people mentioned they’d tried Kimi 2.0 and found it useless for real coding tasks. Others said they’d tested Kimi 2.5 and found it better than Opus about 35-40% of the time. That kind of variation in real-world experience is telling. It suggests the model might be genuinely capable but perhaps less consistent, or optimised for different use cases than what commercial models focus on.
The discussion around different benchmarks – SWE-bench, LiveBench, Artificial Analysis – reveals another problem: we don’t even have consensus on how to measure these things properly. It’s like trying to compare cars when everyone’s measuring different attributes and driving on different roads.
Look, I remain cautiously optimistic about open-source AI. The more options we have, the less power ends up concentrated in the hands of a few companies controlling this transformative technology. Competition is good. Transparency is good. The ability for researchers and developers to actually understand and modify these systems is essential.
But we need to be realistic. Benchmarks are useful data points, not crystal balls. Real-world performance varies based on use case, integration, and a hundred other factors. The best benchmark is always going to be: does it work for what you need it to do?
The AI landscape changes so rapidly that by the time I finish writing this, there’ll probably be another model claiming to beat everything else. That’s both exciting and exhausting. What matters more than any single model’s performance is the ecosystem we’re building – one that ideally balances innovation with accessibility, capability with sustainability, and commercial incentives with open collaboration.
We’ll see how Kimi performs in the wild. Until then, I’ll keep my subscription to Claude, maintain my healthy skepticism, and keep an eye out for when these open-source models become practical enough for someone without a small data centre in their garage. Progress is messy, but it’s still progress.