Standing on the Shoulders of Giants: Why 'Attention Is All You Need' Matters (But Isn't Everything)
I’ve been following an interesting discussion online about what constitutes the most important AI paper of the decade, and it’s got me thinking about how we measure scientific breakthroughs and give credit where it’s due. The paper in question? “Attention Is All You Need” by Vaswani et al., published in 2017 - the one that introduced the transformer architecture that’s now powering everything from ChatGPT to the latest Google search improvements.
The numbers are pretty staggering. This paper is approaching 200,000 citations and is on track to become one of the most cited papers in scientific history. That’s no small feat, and it speaks to just how foundational this work has become for the current AI boom we’re all witnessing. Every time you use a modern language model, you’re essentially interacting with a descendant of the ideas presented in this paper.
But here’s where the discussion gets interesting - and where my slightly contrarian nature kicks in. While I absolutely acknowledge the massive impact of the transformer architecture, the conversation reminded me of something that’s been bothering me about how we talk about scientific progress more broadly. We have this tendency to lionise single papers or individual breakthroughs, when the reality is far messier and more collaborative.
Take the attention mechanism itself, for instance. The 2017 paper didn’t invent attention - the mechanism was introduced in a 2014 paper by Bahdanau, Cho, and Bengio on neural machine translation (“Neural Machine Translation by Jointly Learning to Align and Translate”). And if we want to go further back, we could talk about Word2Vec from 2013, or even AlexNet from 2012, which really kicked off the deep learning revolution by showing what GPUs could do for neural network training. Each of these built on decades of previous work in neural networks, statistics, and computer science.
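For anyone who hasn’t met it, the core idea is disarmingly simple: attention is a learned weighted average, where the weights say how relevant each input position is to whatever you’re currently producing. Here’s a minimal sketch of the Bahdanau-style (additive) variant in plain numpy - the weight names and shapes are my own illustrative choices, not anyone’s reference code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, W1, W2, v):
    """Bahdanau-style attention: score each encoder state against the
    current decoder state, then take a softmax-weighted average."""
    # decoder_state: (d,), encoder_states: (T, d); W1, W2: (d, h); v: (h,)
    scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ v  # (T,)
    weights = softmax(scores)           # relevance of each source position
    context = weights @ encoder_states  # weighted average of encoder states
    return context, weights

# Toy usage with random weights (purely illustrative).
rng = np.random.default_rng(42)
T, d, h = 6, 4, 8
ctx, w = additive_attention(rng.normal(size=d), rng.normal(size=(T, d)),
                            rng.normal(size=(d, h)), rng.normal(size=(d, h)),
                            rng.normal(size=h))
```

The 2017 paper’s move wasn’t to invent this weighted average - it was to strip away the recurrent machinery around it and build the entire model out of it.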
This reminds me of something I’ve observed in my own work in DevOps and software development. The tools and practices we use today - containerisation, continuous integration, microservices - didn’t spring fully formed from someone’s head. They evolved through countless iterations, failed experiments, and incremental improvements by thousands of developers worldwide. Yet we often remember only the final, polished versions that gained widespread adoption.
The same thing happens in science. Someone in the discussion made a tongue-in-cheek comment about how, technically, we should credit the entire line of research back to von Neumann computing every time we talk about attention mechanisms. While they were joking, there’s a kernel of truth there. Science really doesn’t happen in a vacuum, despite what Cave Johnson might have claimed at Aperture Science.
What particularly struck me about this discussion is how it reflects a broader issue I’ve been grappling with regarding AI development and recognition. We’re so focused on the latest breakthrough that we sometimes forget the scaffolding that made it possible. This isn’t just an academic concern - it has real implications for how we fund research, how we educate the next generation of scientists, and how we think about innovation policy.
From my perspective, living through the current AI revolution while working in tech, I can see both sides of this. The transformer architecture really has been transformative - no pun intended. It’s enabled capabilities that seemed like science fiction just a few years ago. But it’s also built on decades of work in linguistics, cognitive science, mathematics, and computer science that rarely gets the same recognition.
This is where my progressive leanings come into play. I worry that our tendency to create tech heroes and singular breakthrough moments obscures the collaborative nature of scientific progress. It makes it easier to justify massive inequalities in research funding and recognition, when the reality is that most advances depend on a vast ecosystem of contributors, many of whom never get their names in the headlines.
The environmental implications also worry me. When we focus solely on the flashy end results, we might miss opportunities to make the underlying research and development processes more sustainable and equitable. The massive compute requirements for training these models didn’t emerge overnight - they’re the cumulative result of architectural and scaling choices made throughout the development of these systems, and each of those choices carries an energy cost.
That said, I don’t want to diminish the genuine achievement that “Attention Is All You Need” represents. The elegance of the solution, the way it dispensed with recurrence entirely so that training could be parallelised across a whole sequence, and its broad applicability across different domains - these are real innovations that deserve recognition. The paper solved genuine technical problems in a way that unlocked new possibilities.
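That parallelisation point is worth making concrete. A recurrent model has to walk through a sequence one step at a time, but the paper’s scaled dot-product attention, softmax(QKᵀ/√d_k)V, is just a couple of matrix multiplications over every position at once. Here’s a bare-bones sketch - single head, no masking, no learned projections, so a deliberate simplification of the full architecture:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V - every position computed at once,
    with no sequential loop of the kind an RNN needs."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # mix of value vectors

# Self-attention over a toy sequence of 5 positions, dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(x, x, x)         # shape (5, 8)
```

Because those matrix products have no dependence between time steps, a GPU can chew through the whole sequence in one go - which is a large part of why training got so much faster than it was with recurrent models.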
Perhaps what we need is a more nuanced way of talking about scientific progress. Instead of asking “what’s the most important paper,” maybe we should be asking “what constellation of ideas and contributions led to our current capabilities?” It’s less catchy, but it’s more honest about how science actually works.
The discussion has left me optimistic about one thing, though. The fact that so many people are engaged with the technical details of AI research, debating the merits of different approaches and acknowledging the complexity of scientific progress, suggests that we’re becoming more sophisticated in how we think about these technologies. That’s going to be crucial as we navigate the challenges and opportunities that lie ahead in AI development.
At the end of the day, whether “Attention Is All You Need” is the most important AI paper of the decade matters less than what it represents: human ingenuity, collaborative scientific progress, and our ongoing quest to build systems that can understand and generate language. That’s something worth celebrating, even as we remember all the shoulders these giants are standing on.