Kimi K2.7: Coding AI That's Not Trying to Fool You

There’s a thing that happens in the AI space, reliably, almost rhythmically: a new model drops, the benchmarks are suspiciously curated, the blog post reads like it was written by a marketing department that just discovered the word “unprecedented,” and within 48 hours someone on Reddit has found the caveats buried in appendix C. Rinse, repeat.

So when Moonshot AI put out Kimi K2.7 Code this week, I was half-expecting the usual. What I got was something a bit different, and I find myself cautiously impressed, not by the model itself, which I haven’t tested properly, but by the way it was presented.

The model is a coding-focused update to their K2.6 base. The headline numbers: better at long-horizon coding tasks, roughly 30% fewer “thinking tokens” used to get there. That last part matters more than it sounds. Token efficiency is not a glamorous metric, but it’s a real one. Less internal deliberation to reach the same or better output means cheaper runs and faster completions, which, if you’re actually using these things in a workflow rather than just benchmarking them, is something you notice in your invoice at the end of the month.

The conversation around the release has been interesting. Someone pointed out that the benchmark selection is unusual, including at least one benchmark the Kimi team appears to have built themselves. That’s a legitimate concern and worth naming plainly. Self-evaluation is a conflict of interest even when the people doing it are acting in good faith. The counter-argument, and it’s a fair one, is that they weren’t overselling. The numbers they published don’t claim to beat GPT-4o or Claude’s latest. They show a meaningful improvement on a specific class of tasks, at a price point that’s considerably more accessible. One comment put it plainly: well under $5 per million output tokens compared to $25 for Opus. For a lot of use cases, that gap is the whole conversation.
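To put the thread's price figures next to the 30% thinking-token claim, here is a back-of-the-envelope sketch. The $5 (treated as a ceiling, since the comment said "well under") and $25 per million output tokens come from the comment quoted above. The monthly volume and the share of output that is thinking tokens are numbers I made up for illustration, and the function names are mine rather than anything from a real billing API.

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of a batch of output tokens at a flat per-million price."""
    return output_tokens / 1_000_000 * price_per_million_usd


def trim_thinking_tokens(output_tokens: int, thinking_share: float, reduction: float) -> int:
    """Output volume after cutting only the thinking portion by `reduction`."""
    thinking_tokens = output_tokens * thinking_share
    return round(output_tokens - thinking_tokens * reduction)


# Figures quoted in the thread.
KIMI_PRICE_CEILING = 5.0    # "well under $5", so this is an upper bound
OPUS_PRICE = 25.0
THINKING_REDUCTION = 0.30   # "roughly 30% fewer thinking tokens"

# Hypothetical workload, purely for illustration.
MONTHLY_OUTPUT_TOKENS = 200_000_000
THINKING_SHARE = 0.6        # assumed fraction of output spent on thinking

opus_bill = output_cost_usd(MONTHLY_OUTPUT_TOKENS, OPUS_PRICE)
kimi_bill_ceiling = output_cost_usd(MONTHLY_OUTPUT_TOKENS, KIMI_PRICE_CEILING)

# Same workload, same price, with the thinking portion trimmed by 30%.
trimmed_tokens = trim_thinking_tokens(
    MONTHLY_OUTPUT_TOKENS, THINKING_SHARE, THINKING_REDUCTION
)
trimmed_bill_ceiling = output_cost_usd(trimmed_tokens, KIMI_PRICE_CEILING)

print(f"Opus:                 ${opus_bill:,.2f}")
print(f"Kimi (ceiling):       ${kimi_bill_ceiling:,.2f}")
print(f"Kimi, fewer thinking: ${trimmed_bill_ceiling:,.2f}")
```

Under these assumptions the price gap does most of the work and the token saving is a smaller second-order effect. Both depend heavily on how much of your output is thinking tokens in the first place, which is exactly the number you would need to measure on your own workload.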

The open-source angle is worth a mention too. A few people in the thread noted that recent high-profile Chinese model releases, Qwen 3.7 and Minimax M3 among them, have trended toward closed source. Kimi staying open is not nothing. It keeps the research community able to poke around under the bonnet, which is how you actually build trust over time rather than just asserting it.

There’s a broader thing here that I keep thinking about. The AI benchmarking ecosystem, and I use that word grudgingly, is in a genuinely weird state. One commenter noted that Anthropic’s Fable 5 refused all 200 tasks on ProgramBench. Every single one. That’s a model that by many measures sits at or near the top of capability rankings, and it just… declined to participate. Whether that’s a safety alignment overcorrection, a quirk of how the eval was constructed, or something else, I don’t know. But it does illustrate that benchmark performance and practical usefulness are not the same number, and they’re diverging in ways that should make everyone a bit sceptical about the leaderboards.

The honest version of the Kimi K2.7 story is probably this: it’s a solid incremental step from a team that’s moving quickly, priced competitively, and released openly. It’s not going to replace whatever you’re already happy with. But if you’re paying close attention to the cost side of running AI-assisted development at any scale, it’s worth a look.

I don’t have strong conclusions here. The space is moving too fast for confident takes, and anyone who tells you they know which model will matter in 12 months is guessing. I’m just relieved when a release comes with honesty about its limitations instead of a press release designed to make you feel like you’re about to miss out on something historic.

That’s not a high bar. It shouldn’t feel like one. But here we are.