The 2017 Paper Underneath Every Modern AI

On June 12, 2017, eight researchers at Google Brain, Google Research, and the University of Toronto put a paper on arXiv called "Attention Is All You Need." Ashish Vaswani is the first listed author, though the paper notes that all eight contributed equally and that the listing order is random. The paper introduced a new neural network architecture called the transformer.

Every major AI product you've heard of in 2026 (GPT, Claude, Gemini, Llama) is built on it.

That paper is the reason the current AI era exists.

To understand why, you have to know what it replaced.

What Came Before

Before transformers, the dominant way to process language with neural networks was the recurrent neural network, or RNN. The most common variant, the long short-term memory network (LSTM), was invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997. RNNs worked by reading text one word at a time and carrying a hidden state forward from each step to the next, like a person reading a sentence and trying to remember what they've already read.

This had two problems.

First, information had to travel through a long chain of steps. By the time the model got to word 200 in a passage, the signal from word 1 had been overwritten so many times that it was basically gone.

Second, the architecture was sequential by design. Each step depended on the one before it. You couldn't parallelize the work across the positions in a sequence, no matter how many GPUs you had, because step 47 needed step 46 to finish first. Training a large RNN on a large corpus was slow as a result.
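Both problems show up even in a toy version of the recurrence. The sketch below is an illustration under stated assumptions, not a real RNN: the hidden state is a single number, and the weights w_h and w_x are made-up constants where a real network would use learned weight matrices.

```python
import math

def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    """One recurrent step: mix the previous hidden state with the current input.

    w_h and w_x are hypothetical fixed scalars; a real RNN learns matrices.
    """
    return math.tanh(w_h * hidden + w_x * x)

def run_rnn(inputs, hidden=0.0):
    """Read the inputs strictly in order, carrying one hidden state forward."""
    states = []
    for x in inputs:
        # Step t needs the result of step t-1, so this loop cannot be
        # split across workers.
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states

# Toy "sentence": a strong signal at the first position, then nothing.
inputs = [1.0] + [0.0] * 9
states = run_rnn(inputs)
print(f"after step 1:  {states[0]:.4f}")
print(f"after step 10: {states[-1]:.4f}")
```

With these particular weights the trace of the first input roughly halves at every step, so by step 10 almost nothing of it is left. And because each call to rnn_step needs the previous result, the loop has to run in order.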

That second problem was the real blocker. GPUs had gotten massively faster. Datasets had gotten massively bigger. But the architecture didn't let you take advantage of either.

What the Transformer Did

The transformer's key idea is called self-attention. Instead of carrying information through a chain of hidden states, every position in a sequence can directly look at every other position in a single step.

For each pair of tokens in the input, the model computes a weight that captures how relevant one is to the other. The output at each position becomes a weighted sum over the entire sequence. The model decides, on its own, what to attend to.
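The weighted sum described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's full layer: it uses the raw token vectors as queries, keys, and values, whereas a real transformer first passes them through learned projection matrices and runs several attention heads side by side. The two-dimensional token vectors are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Toy scaled dot-product self-attention without learned projections."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    for query in vectors:
        # This iteration never reads the result of another iteration,
        # so all positions could be computed at the same time.
        scores = [dot(query, key) / scale for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical token vectors: the first two are similar, the third is not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is the set of weights one position assigns to every position in the sequence, and each row sums to 1. The first token puts more weight on the second token, which resembles it, than on the third.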

The clever part isn't just that this captures long-range relationships better than RNNs. It's that all of those computations are independent. You can run them in parallel across a whole GPU cluster.

In the original paper, the team trained the model on machine translation. They reported a BLEU score of 28.4 on English-to-German, a new state of the art, and 41.8 on English-to-French, a new single-model state of the art. Training the large model took 3.5 days on eight GPUs. The recurrent and convolutional baselines they beat had required more training compute and scored lower.

The result was a new architecture that did the job better and trained in a fraction of the time.

How to Actually See It Work

If you want to understand what attention looks like inside the model, two resources are worth your time.

Jay Alammar published a blog post in 2018 called "The Illustrated Transformer." It's a sequence of diagrams that walks through the attention computation step by step. It's been translated into at least ten languages and is used in graduate NLP courses at MIT and Cornell. It's at jalammar.github.io/illustrated-transformer/.

The other one is Andrej Karpathy's YouTube video from January 17, 2023, called "Let's build GPT: from scratch, in code, spelled out." It's one hour and 56 minutes. Karpathy builds a working character-level transformer language model step by step, then trains it on Shakespeare. The output is recognizably Shakespearean. The video has become a rite of passage for engineers moving into ML.

If you've never written ML code and want to actually see what's happening inside one of these models, that video is the way in.

The Scaling Story

The transformer alone wasn't the whole story. The second thing that happened was scale.

In February 2019, OpenAI announced GPT-2, a 1.5 billion parameter transformer trained on text scraped from outbound Reddit links. Only a smaller version was public at first; the full model followed in stages and was released that November. The paper was "Language Models are Unsupervised Multitask Learners," by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.

A year and three months later, OpenAI followed with GPT-3, a 175 billion parameter version. That's more than 100 times the size of GPT-2. The paper was called "Language Models are Few-Shot Learners."

GPT-3 showed something nobody had quite expected. At that scale, the same transformer architecture could perform a wide range of language tasks without being fine-tuned for any of them. You prompted it with a few examples and it figured out the pattern.
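To make "prompted it with a few examples" concrete, here is a minimal sketch of how a few-shot prompt can be assembled. The English/French layout and the helper name build_few_shot_prompt are assumptions chosen for illustration, not the exact format used in the GPT-3 paper, and no model is called.

```python
def build_few_shot_prompt(examples, query):
    """Format worked examples followed by one unanswered query.

    The "English:" / "French:" labels are a hypothetical layout.
    """
    blocks = [f"English: {en}\nFrench: {fr}\n" for en, fr in examples]
    # The last block is left open so the model's continuation is the answer.
    blocks.append(f"English: {query}\nFrench:")
    return "\n".join(blocks)

examples = [("cheese", "fromage"), ("house", "maison"), ("book", "livre")]
prompt = build_few_shot_prompt(examples, "cat")
print(prompt)
```

The model's weights are never updated here. The examples exist only in the input text, and the model is expected to continue the pattern after the final "French:".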

That result is what eventually triggered ChatGPT, which triggered the AI moment you're living through right now.

What That Built

The transformer wasn't a tweak. It was a new thing.

It combined a flexible attention mechanism, full parallelization across a sequence, and a scaling property that turned out to produce qualitatively new behaviors. Those three ingredients, together, set the trajectory of the field for more than eight years.

Every consumer-facing AI product in 2026, with rare exceptions, is built on this paper. A 2017 arXiv preprint with eight authors is the floor underneath everything you call AI.

Part of the AI Foundations series.
