2.3 Terabytes of RAM and a Dream: When the Tinkerers Go Feral
There’s a post doing the rounds that stopped me mid-scroll last week. Someone has assembled what they’re calling the infinity stones of local AI inference: 2.3 terabytes of RAM, 400-plus vCores, a Blackwell GPU for prefill, and a mesh of Mac Studios for decode. They want to connect the whole thing via RDMA over Thunderbolt and run disaggregated inference across heterogeneous hardware, essentially splitting the “thinking” work across fundamentally different architectures.
The responses ranged from genuine technical curiosity to “you’ll break even in 2039,” which, honestly, is fair.
I sat with this for a while. Not because I could build anything like it, but because the project sits right at the intersection of things I find genuinely interesting and things I find genuinely unsettling.
The technical ambition here is real. Disaggregated prefill and decode, where one piece of hardware handles prompt processing and another handles token generation, is a well-understood pattern in data centre AI infrastructure. Getting it to work across consumer and prosumer hardware, using a relatively new Apple distributed compute library and a third-party RDMA implementation, is not. One commenter laid it out plainly: you need to think hard about where the model weights live, where the KV cache lives, and how many bytes you’re shoving across a Thunderbolt 5 connection per second. These are not small problems.
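To make the bandwidth question concrete, here is a back-of-envelope sketch of what disaggregated prefill actually has to move: the KV cache built during prompt processing has to travel from the prefill GPU to the decode machines. The model shape, prompt length, and link speed below are my illustrative assumptions (a generic 70B-class model with grouped-query attention), not the builder’s actual configuration.

```python
# Rough sizing of the KV cache handoff in a disaggregated prefill/decode setup.
# All model numbers are assumptions for illustration, not this builder's config.

N_LAYERS = 80          # transformer layers (assumption)
N_KV_HEADS = 8         # KV heads under grouped-query attention (assumption)
HEAD_DIM = 128         # per-head dimension (assumption)
BYTES_PER_ELEM = 2     # fp16/bf16 cache entries

PROMPT_TOKENS = 8_000  # a longish prompt (assumption)

# K and V tensors, per layer, per token
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
kv_cache_bytes = kv_bytes_per_token * PROMPT_TOKENS

# Thunderbolt 5 advertises 80 Gbit/s symmetric; real RDMA-over-Thunderbolt
# throughput will be lower, so treat this as an optimistic ceiling.
LINK_GBPS = 80
link_bytes_per_s = LINK_GBPS * 1e9 / 8

print(f"KV cache per token : {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for prompt: {kv_cache_bytes / 1e9:.2f} GB")
print(f"Transfer time      : {kv_cache_bytes / link_bytes_per_s:.2f} s (best case)")
```

Under those assumptions you get roughly 320 KiB of cache per token, a couple of gigabytes for a long prompt, and a fraction of a second to ship it in the best case. Not impossible, but every extra user, longer prompt, or slower-than-advertised link eats straight into that margin, which is exactly the commenter’s point.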
The person building this seems to know that. They’re using Ghidra to reverse-engineer Apple’s distributed compute APIs. They’re in contact with the maintainers of the relevant open-source projects. They’re asking for help rather than pretending they’ve already solved it. That’s the mark of someone doing actual work, not just posting a flex.
And yet. Someone in the comments asked the obvious question: what are you actually doing with it? It got no meaningful reply.
That question matters more than it might seem. I’ve worked in tech long enough to have watched people spend enormous energy on infrastructure problems that turned out to be solutions looking for a problem. There’s a particular kind of builder, common in this space, who gets more satisfaction from the assembly than the application. The model runs. It runs locally. It runs on hardware they own and control. That’s the point. The use case is secondary.
I understand the appeal. I really do. There’s something philosophically satisfying about running capable AI on your own hardware, outside of any API, any rate limit, any terms-of-service change at 2am on a Tuesday. Once you’ve watched the AI cloud providers shift pricing and deprecate models with basically no notice, local inference starts looking less like a hobbyist affectation and more like a reasonable hedge.
But I keep coming back to the environmental side of this. The energy required to run a cluster like this continuously is not trivial. A setup with this much RAM and a Blackwell GPU will draw serious power, all day, every day, for inference workloads that might amount to a few queries an hour. The carbon footprint of hobbyist AI infrastructure is not something the community talks about much. We’re very good at discussing it when it’s Google’s problem or Microsoft’s problem, less so when it’s distributed across ten thousand enthusiasts in home offices.
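For a sense of scale, here is a rough sketch of what an always-on cluster like this might cost in energy terms. The wattage figures, duty cycle, and grid intensity are all guesses I’ve plugged in for illustration, not measurements of this particular build.

```python
# Rough annual energy estimate for an always-on home inference cluster.
# Every figure here is an assumption for illustration, not a measurement.

IDLE_WATTS = 600        # GPU server plus several Mac Studios idling (assumption)
LOAD_WATTS = 1_800      # under active inference (assumption)
LOAD_HOURS_PER_DAY = 2  # "a few queries an hour" adds up to ~2h of real load (assumption)

daily_kwh = (IDLE_WATTS * (24 - LOAD_HOURS_PER_DAY)
             + LOAD_WATTS * LOAD_HOURS_PER_DAY) / 1000
annual_kwh = daily_kwh * 365

GRID_KG_CO2_PER_KWH = 0.7   # a coal-heavy grid (assumption)
annual_tonnes_co2 = annual_kwh * GRID_KG_CO2_PER_KWH / 1000

print(f"Daily energy : {daily_kwh:.1f} kWh")
print(f"Annual energy: {annual_kwh:,.0f} kWh")
print(f"Annual CO2   : {annual_tonnes_co2:.1f} t (at {GRID_KG_CO2_PER_KWH} kg/kWh)")
```

On those guesses you land somewhere around six thousand kilowatt-hours and a few tonnes of CO2 a year, most of it spent idling between queries. The exact numbers will be wrong; the order of magnitude is the point.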
I don’t have a clean answer to that. I’m genuinely fascinated by what this person is building. The engineering is interesting, the open-source collaboration is good, and the knowledge produced will be useful to others. I also think we’re collectively underweighting the cost of the “local AI is inherently virtuous” framing.
The person who replied “slow but can load big models, there’s your benchmark” was being funny, but they also accidentally described the current state of almost all local inference: technically impressive, practically marginal for most tasks, and thermally enthusiastic. Maybe the Blackwell prefill changes that calculus. Maybe it doesn’t. We’ll see when the benchmarks arrive.
In the meantime, I’m watching from the outer southeast, running nothing more exotic than the occasional Ollama session on hardware that fits under my desk without requiring its own postcode.