The Beautiful Madness of Building When You Could Just Buy
I came across a fascinating discussion online about someone who built a fully self-hosted web scraping infrastructure using 50 Raspberry Pi nodes, and honestly, it’s been rattling around in my head for days now. Not because it’s the most efficient solution—quite the opposite, actually—but because it represents something I find increasingly rare in our field: building something just to see if you can.
The setup is admittedly bonkers. Fifty Raspberry Pis, each running Chrome via Selenium, each with its own VPN connection, all coordinated to scrape job postings. The whole thing is local—no cloud services, just hardware sitting in someone’s home, collecting 3.9 million records over two years. There’s even an IoT power strip that automatically power-cycles nodes when they stop responding. It’s automated chaos, and I kind of love it.
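That power-strip watchdog is the detail I keep coming back to. A minimal sketch of that recovery loop might look something like the following; the heartbeat timeout, node IDs, and the `power_cycle` hook are my own assumptions for illustration, not details from the original build.

```python
import time

# Assumed value: seconds without a heartbeat before a node counts as dead.
HEARTBEAT_TIMEOUT = 300.0

def find_stale_nodes(last_heartbeat: dict[str, float], now: float,
                     timeout: float = HEARTBEAT_TIMEOUT) -> list[str]:
    """Return node IDs whose last heartbeat is older than the timeout."""
    return sorted(node for node, ts in last_heartbeat.items()
                  if now - ts > timeout)

def power_cycle(node: str) -> None:
    # Hypothetical hook: in the real build this would call the IoT
    # power strip's API to cut and restore power to one socket.
    print(f"power-cycling {node}")

if __name__ == "__main__":
    now = time.time()
    beats = {"pi-01": now - 10, "pi-02": now - 600, "pi-03": now - 5}
    for node in find_stale_nodes(beats, now):
        power_cycle(node)  # only pi-02 is past the 300 s timeout here
```

The nice property of doing recovery at the power socket rather than in software is that it works even when the node is wedged too hard to run a reboot command.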
The comment section was predictably split. One camp was genuinely curious about the technical challenges and solutions. The other camp was basically screaming “WHY DIDN’T YOU JUST USE DOCKER CONTAINERS ON ONE DECENT SERVER?” And look, they’re not wrong. From a pure efficiency standpoint, this is like using 50 hammers when you could use one nail gun. Multiple people pointed out that a single modern server could handle this workload with a fraction of the power consumption, cost, and cable management nightmares.
But here’s the thing that got me thinking: the builder mentioned these Pi nodes already existed for other IoT projects. The scraping work just expanded organically from 5 nodes to 50. The marginal cost was essentially zero because the infrastructure was already there. And that completely changes the calculus, doesn’t it?
Working in DevOps, I’ve seen both sides of this coin. There’s the pragmatic approach—spin up some EC2 instances, use containerisation, keep it lean and scalable. Then there’s the tinkerer’s approach—cobble together what you have, learn as you go, build something that’s uniquely yours even if it’s objectively inefficient. Our industry tends to valorise the former while quietly mocking the latter, but I reckon we lose something important when we only optimise for efficiency.
The discussion got interesting when people started questioning the “why” of collecting job posting data. The builder gathered these records to analyse trends in the job market: which jobs appear, how long they stay active, what patterns emerge over time. Not for applying to jobs, just for understanding the landscape. A few commenters found this suspicious or pointless, which struck me as odd. Since when did we need a business justification for curiosity? Data hoarding is a perfectly valid hobby, especially when you’re learning about distributed systems, anti-bot detection, and infrastructure management in the process.
One commenter mentioned that the physical diversity of 50 separate nodes helps defeat anti-scraping measures. Each Pi has its own hardware fingerprint, making them appear as distinct residential users rather than an obvious bot farm. Sure, you could probably simulate this in software, but that adds complexity. Sometimes the “dumb” solution of just having 50 actual separate devices is simpler than the “smart” solution of virtualising everything and then fighting to make each virtual instance look sufficiently unique.
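To make that contrast concrete: simulating hardware diversity in software means deterministically giving each virtual instance its own browser profile, roughly like the sketch below. The profile fields and the pools are illustrative assumptions on my part; real anti-bot systems inspect far more signals than this, which is exactly why the software route gets complicated.

```python
import hashlib

# Illustrative pools only; real fingerprinting covers canvas, fonts,
# TLS signatures, timing, and much more.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux armv7l) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 Chrome/120.0",
]
VIEWPORTS = [(1280, 720), (1366, 768), (1920, 1080)]

def profile_for(node_id: str) -> dict:
    """Derive a stable, distinct-looking browser profile from a node ID."""
    h = int.from_bytes(hashlib.sha256(node_id.encode()).digest()[:4], "big")
    return {
        "user_agent": USER_AGENTS[h % len(USER_AGENTS)],
        "viewport": VIEWPORTS[(h >> 8) % len(VIEWPORTS)],
    }
```

The point is that with fifty physical Pis you get this variety for free; in software you have to manufacture it, field by field, and keep it consistent per instance.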
The environmental angle gives me pause though. Fifty devices running 24/7, even low-power Pis, adds up. Someone calculated it’s roughly 250 watts continuous draw—not terrible, but not nothing either. A single beefy server would likely use similar power while providing more compute capacity. In our rush to self-host everything and escape cloud costs, we sometimes forget that data centres achieve economies of scale for a reason. Our home labs aren’t necessarily greener just because we control them directly.
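That commenter's estimate is easy to sanity-check. Taking the quoted 250 watts continuous draw, which works out to about 5 W per Pi (plausible for a Pi under light load), the annual energy use falls straight out of the arithmetic:

```python
nodes = 50
watts_per_node = 5  # rough per-Pi draw implied by the quoted 250 W total
total_watts = nodes * watts_per_node

# Continuous draw for a year, converted from watt-hours to kilowatt-hours.
annual_kwh = total_watts * 24 * 365 / 1000

print(total_watts)  # 250
print(annual_kwh)   # 2190.0
```

Call it a couple of thousand kilowatt-hours a year, whatever that costs at your local rate. Not terrible, as the commenter said, but not nothing.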
Still, there’s something deeply satisfying about the “zero cloud” approach. Every one of those 3.9 million records lives on hardware the owner can physically touch. No terms of service changes, no surprise pricing updates, no service shutdowns. In an era where we’re all increasingly dependent on services we don’t control, there’s real value in that sovereignty, even if it comes wrapped in a tangle of power cables and network switches.
What I found most relatable was the admission about the “50 individual power bricks” situation. Anyone who’s built something over time knows that feeling: you start with what works, you expand bit by bit, and before you know it you’re looking at a rat’s nest of cables thinking “why didn’t I just use PoE from the start?” But that’s how actual learning happens. The perfectly planned project teaches you nothing; the messy one that evolves teaches you everything.
The broader question this raises for me is about how we define “good” engineering. If the goal was pure efficiency, this project fails. But if the goals were learning, experimentation, and building something interesting with available resources, it succeeds brilliantly. We need both kinds of projects in our field. We need the production-ready, cost-optimised, perfectly architected systems. But we also need the weird, inefficient, over-engineered passion projects that exist purely because someone wondered “what if?”
Maybe that’s the real value here—not the 3.9 million job postings, but the reminder that sometimes the journey matters more than the destination. In a field that increasingly feels like it’s all about shipping fast and moving on, there’s something refreshing about someone who built a 50-node infrastructure just to see what patterns emerge in job postings over two years.
Would I build this exact setup? Probably not. But I respect the hell out of someone who did.