96 Agents, 12 Hours, One OS: Impressive Demo or Impressive Marketing?

Google’s Antigravity 2.0 apparently used 96 agents running in parallel to write an operating system from scratch in 12 hours for under a thousand US dollars in token costs. And it runs Doom.

That’s the claim, anyway.

The Doom thing has become a genuine benchmark meme at this point. Someone got Doom playing on a pregnancy test a few years back, albeit with the internals swapped out. Doom runs on ESP32 microcontrollers. Doom runs on graphing calculators. If your new piece of technology can’t run Doom, that’s probably the more interesting story. So let’s hold that particular detail lightly.

The broader claim is more interesting. Ninety-six specialised agents, orchestrated in parallel, dividing up a complex software task and producing something that boots. The token cost breakdown looks plausible when you do the maths with Gemini 2.5 Flash pricing: a few hundred million output tokens, the rest either cached or uncached input, and you can land under a thousand dollars. Someone in the comments thread I was reading did the working, and it’s at least not obviously wrong. I’ll give it that.
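For anyone who wants to redo that working, here is a rough sketch of the arithmetic. The 2.6 billion total is the token figure quoted for the demo; everything else is my assumption for illustration: the split between output, uncached input and cached input is hypothetical, and the per-million-token rates are placeholders rather than quoted prices, so swap in the current published rates before trusting the result.

```python
# Back-of-envelope token cost check.
# Every split and rate below is an ASSUMPTION for illustration,
# not a reported breakdown and not a quoted price.

TOTAL_TOKENS = 2_600_000_000  # total token figure quoted for the demo

# Hypothetical split: "a few hundred million" output, the rest input.
OUTPUT_TOKENS = 300_000_000
UNCACHED_INPUT_TOKENS = 300_000_000
CACHED_INPUT_TOKENS = TOTAL_TOKENS - OUTPUT_TOKENS - UNCACHED_INPUT_TOKENS

# Placeholder rates in US dollars per million tokens (assumed values).
RATE_OUTPUT = 2.50
RATE_UNCACHED_INPUT = 0.30
RATE_CACHED_INPUT = 0.03


def cost_usd(tokens: int, rate_per_million: float) -> float:
    """Dollar cost of `tokens` at a given rate per million tokens."""
    return tokens / 1_000_000 * rate_per_million


def estimate_total() -> dict[str, float]:
    """Cost per token category plus the overall total, in US dollars."""
    parts = {
        "output": cost_usd(OUTPUT_TOKENS, RATE_OUTPUT),
        "uncached_input": cost_usd(UNCACHED_INPUT_TOKENS, RATE_UNCACHED_INPUT),
        "cached_input": cost_usd(CACHED_INPUT_TOKENS, RATE_CACHED_INPUT),
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for name, dollars in estimate_total().items():
        print(f"{name:>15}: ${dollars:,.2f}")
```

Under these assumed inputs the output tokens account for most of the bill, so the "under a thousand dollars" claim mostly hinges on how many output tokens were really generated and how much of the input was served from cache.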

What I’m less sure about is whether “wrote an OS from scratch” means what it sounds like. These models are trained on Linux, on Unix, on decades of open-source kernel code. The more honest framing might be “assembled a functional OS-adjacent thing from internalised patterns at speed.” That’s still genuinely useful. It’s just not the same as novel invention. One comment made the point well: how much software development work is truly novel versus derivative? Probably less than we’d like to admit. And if AI can handle the derivative parts faster and cheaper, that matters, even if it can’t design something no human has designed before.

The demo itself had that particular quality that tech stage presentations have now: carefully paced, slightly over-enunciated, scripted to land just slow enough that the audience can follow. Someone in the thread noted this and got roundly told that the alternative is failing live, which, fair. I’d rather watch someone fail live, personally, but I understand why product managers don’t share that preference. The format has converged on something that feels produced because it is produced. You can be annoyed by it without it meaning the underlying technology is fake.

The sharper line of scepticism was about whether the orchestration infrastructure is doing more heavy lifting than the demo suggests. One person made the point that there’s almost certainly a knowledge base, a set of specs, a structured plan already in place that the agents are executing against rather than reasoning up from nothing. That seems likely. It also seems fine? Scaffolding is part of engineering. The question is how much of that scaffolding a human had to build and how much the system produced for itself.

I’ve been using AI coding tools enough now to have a calibrated view. They are genuinely good at tasks with clear shapes: implement this interface, refactor this function, write a test suite for this class. They struggle when the problem requires holding a lot of ambiguous context and making judgement calls about what the actual requirement is. An OS has a well-defined shape. It has specs. It has decades of prior art. A new product feature for a domain-specific enterprise system has none of those, and that’s where the gap still shows up in my day job.

The environmental cost doesn’t get mentioned much in these demos. Burning 2.6 billion tokens on a proof-of-concept OS that runs Doom is not nothing. I’m genuinely uncertain how to weigh that against the potential productivity gains; I don’t think anyone has a credible answer yet. But I notice that the framing is always about the dollar cost of tokens, not the energy cost. Those are related but not the same number, and only one of them shows up on the invoice.

Still. Ninety-six agents orchestrating a complex software build in parallel. A year ago that was a thought experiment. Now it’s a conference demo with working output. The pace of this is genuinely hard to sit with.