The transformer is the architecture behind nearly every modern large language model, including GPT, BERT, T5, and LLaMA. Introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.), it replaced recurrent networks with a mechanism called self-attention that processes an entire sequence at once instead of token by token.
This article builds the architecture up piece by piece: first why the older recurrent approach hit a wall, then how attention solves it, and finally how the encoder and decoder stacks combine into the model that powers today's LLMs.
Prerequisites: A working knowledge of neural networks (layers, weights, training) is helpful. No code is required — this is a conceptual foundation for the hands-on fine-tuning tutorials that follow.
Before Transformers: RNNs and Seq2Seq
Before 2017, sequence tasks like translation used recurrent neural networks (RNNs) and their gated variant, LSTMs, in an encoder-decoder (seq2seq) setup. The encoder read the input one word at a time, compressing everything into a single fixed-size "context" vector, and the decoder generated the output from that vector.
This design has two fundamental problems:
- Memory bottleneck. Squeezing an entire sentence — or paragraph — into one fixed vector loses information. The longer the input, the more the model forgets the beginning by the time it reaches the end.
- Vanishing gradients. Because RNNs process tokens sequentially, the training signal has to flow back through every step. Over long sequences the gradients shrink toward zero, so the model fails to learn long-range dependencies between distant words.
There is also a practical cost: recurrence is inherently sequential. You cannot compute step 10 until step 9 is done, which makes RNNs slow to train and impossible to parallelize across time steps, even on modern GPUs.
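To make the sequential dependency and the shrinking gradient concrete, here is a minimal sketch of a one-unit tanh RNN in plain Python. The weights `w_x` and `w_h`, the constant inputs, and the function names are illustrative assumptions, not values from any real model.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.9):
    """Run a one-unit tanh RNN over a list of scalar inputs.

    w_x and w_h are hypothetical fixed weights chosen for illustration.
    Returns the hidden state after each step.
    """
    h = 0.0
    states = []
    for x in inputs:  # step t cannot start until step t-1 has finished
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def gradient_to_first_state(states, w_h=0.9):
    """Factor d h_T / d h_1 for the scalar RNN above.

    Each step contributes w_h * (1 - h_t^2): the recurrent weight times
    the derivative of tanh. The per-step factors multiply together.
    """
    grad = 1.0
    for h in states[1:]:
        grad *= w_h * (1.0 - h * h)
    return grad

short_run = rnn_forward([1.0] * 5)
long_run = rnn_forward([1.0] * 50)
grad_short = gradient_to_first_state(short_run)
grad_long = gradient_to_first_state(long_run)
print(f"5 steps:  {grad_short:.3e}")
print(f"50 steps: {grad_long:.3e}")
```

Every per-step factor here is below 1, so the product for the 50-step sequence is many orders of magnitude smaller than for the 5-step one. That is the vanishing-gradient effect in miniature.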
Note
The seq2seq attention idea from Bahdanau et al. (2014, arXiv:1409.0473) first let the decoder "look back" at all encoder states. The transformer took this further and removed recurrence entirely.
Why Transformers Win
The transformer (arXiv:1706.03762) keeps the attention idea and throws away recurrence. That single change brings four advantages:
| Advantage | What it means |
|---|---|
| Parallel processing | Every token is processed at the same time, not one after another — training is far faster on GPUs |
| Alleviating vanishing gradients | Direct connections between any two tokens keep the gradient path short |
| Long-range dependency capture | Self-attention links any word to any other word in the sequence, regardless of distance |
| Transfer learning | Pretrain once on a huge corpus, then fine-tune cheaply on many downstream tasks |

RNNs process tokens one at a time; transformers attend to all tokens in parallel.
From Text to Vectors
A transformer cannot read raw strings. Text passes through several preparation steps before the model sees it:
- Tokenized text. The input string is broken into atomic units, either words or subwords. For example, an uncommon word may split into several pieces (Talk, ##ie).
- Token encodings. Each token is mapped to a unique integer ID from the model's vocabulary.
- Token embedding. Each ID is looked up in an embedding table that converts it into a dense vector capturing meaning. Similar words land near each other in this vector space.
- Positional embedding. Because attention has no built-in sense of order, a positional encoding is added to each token embedding so the model knows where each token sits in the sequence.
The result is a sequence of vectors that carry both meaning and position — ready for the attention layers.
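The preparation steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the vocabulary, the whitespace tokenizer, and the random embedding table are hypothetical stand-ins for a trained subword tokenizer and learned weights. The sinusoidal positional encoding follows the formula in the original transformer paper; many later models learn their positional embeddings instead.

```python
import math
import random

# Hypothetical toy vocabulary; real models use tens of thousands of subwords.
VOCAB = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}
D_MODEL = 8  # embedding width, kept tiny for readability

def tokenize(text):
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return text.lower().split()

def encode(tokens):
    """Map each token to its integer ID, falling back to [UNK]."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]

def make_embedding_table(vocab_size, d_model, seed=0):
    """Random table standing in for learned embedding weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(d_model)]
            for _ in range(vocab_size)]

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def embed(text, table):
    """Return (token IDs, one position-aware vector per token)."""
    ids = encode(tokenize(text))
    vectors = []
    for pos, token_id in enumerate(ids):
        tok_vec = table[token_id]
        pos_vec = positional_encoding(pos, len(tok_vec))
        vectors.append([t + p for t, p in zip(tok_vec, pos_vec)])
    return ids, vectors

table = make_embedding_table(len(VOCAB), D_MODEL)
ids, vectors = embed("The cat sat", table)
print(ids)                            # [1, 2, 3]
print(len(vectors), len(vectors[0]))  # 3 8
```

The output is one `D_MODEL`-wide vector per token, which is the shape the attention layers expect.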
Self-Attention
Self-attention is the heart of the transformer. For each token, it asks: how much should I pay attention to every other token in this sequence?
Each token's embedding is projected by learned weight matrices into three vectors: a query (Q), a key (K), and a value (V). The model scores every query against every key with a dot product, divides the scores by the square root of the key dimension, and applies a softmax to turn them into attention weights. Each token's output is the weighted sum of the value vectors, so the tokens most relevant to it contribute most.
Because every token is compared with every other token directly, self-attention captures relationships across the whole sequence in a single step — no information has to travel through intermediate recurrent states.

Self-attention scores every token against every other token using queries, keys, and values.
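The computation described above is usually written as softmax(Q Kᵀ / √d_k) V. Below is a minimal sketch in plain Python. The Q, K, and V vectors are hand-picked for illustration; in a real model they come from learned projections of the token embeddings, which this sketch omits.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of per-token vectors.

    Returns (outputs, weights), where weights[i][j] is how much
    token i attends to token j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this query against every key, then scale by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values))
               for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Three toy tokens with hand-picked vectors (illustrative, not learned).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = self_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and every token's output is computed from all three value vectors at once, with no recurrent state in between.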
Multi-Head Self-Attention
One attention pattern is rarely enough. Multi-head self-attention runs several attention operations ("heads") in parallel, each learning a different kind of relationship — one head might track syntax, another might track which noun a pronoun refers to. Their outputs are concatenated and projected back to the model dimension, giving the layer a richer view of the sequence.
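As a rough sketch of the mechanics, the code below gives each head its own Q, K, and V projections into a smaller subspace, runs attention per head, concatenates the results, and projects back to the model dimension. The random matrices are stand-ins for learned parameters, and the helper names are assumptions made for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(k[0])
    out = []
    for qi in q:
        w = softmax([sum(x * y for x, y in zip(qi, kj)) / math.sqrt(d_k)
                     for kj in k])
        out.append([sum(w[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, num_heads, rng):
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head has its own projections into a d_head-wide subspace.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        head_outputs.append(
            attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the heads token by token, then project back to d_model.
    concat = [sum((head[i] for head in head_outputs), [])
              for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o)

rng = random.Random(0)
x = random_matrix(4, 8, rng)  # 4 tokens, d_model = 8
y = multi_head_attention(x, num_heads=2, rng=rng)
print(len(y), len(y[0]))      # 4 8
```

Because each head works in a `d_model / num_heads` subspace, the total cost stays close to that of a single full-width attention operation.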
The Encoder and Decoder
A full transformer has two stacks:
- Encoder stack. A stack of identical layers, each combining multi-head self-attention with a feed-forward network (plus layer normalization and residual connections). The encoder turns the input embeddings into a sequence of context-rich hidden states — a representation of the whole input.
- Decoder stack. Also a stack of layers, but each decoder layer has two attention mechanisms. Masked self-attention lets each output position attend only to earlier positions (so the model cannot "cheat" by looking at future words while generating). Cross-attention (encoder-decoder attention) lets the decoder attend to the encoder's hidden states, focusing on the relevant parts of the input as it produces each output token.
| Component | Role |
|---|---|
| Encoder | Reads the input and produces hidden-state representations |
| Self-attention | Relates each input token to every other input token |
| Multi-head attention | Learns several relationship patterns in parallel |
| Decoder | Generates the output sequence token by token |
| Masked self-attention | Prevents the decoder from seeing future tokens |
| Cross-attention | Lets the decoder focus on relevant encoder states |

The encoder builds a representation of the input; the decoder uses masked and cross-attention to generate output.
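Masked self-attention can be illustrated with a causal mask applied before the softmax. The score matrix below is made up for the example; the point is that positions to the right of the current token receive exactly zero weight.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; blocked positions get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because each position may attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for a 4-token output sequence.
scores = [[0.5, 1.0, 0.2, 0.1],
          [0.3, 0.8, 0.9, 0.4],
          [0.1, 0.2, 0.7, 0.6],
          [0.9, 0.3, 0.5, 0.2]]
mask = causal_mask(4)
weights = [masked_softmax(row, allowed)
           for row, allowed in zip(scores, mask)]
for row in weights:
    print([round(w, 2) for w in row])
```

The printed matrix is lower-triangular: the first token can attend only to itself, the second to the first two, and so on. Cross-attention uses the same attention operation without this mask, taking its queries from the decoder and its keys and values from the encoder's hidden states.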
How This Maps to Modern LLMs
Different model families use different parts of this architecture:
- Encoder-only models like BERT use only the encoder stack. They produce rich bidirectional representations and excel at understanding tasks such as classification, named entity recognition (NER), and extractive question answering.
- Decoder-only models like GPT, Phi, and LLaMA use only the decoder stack with masked self-attention. With no encoder to attend to, they drop the cross-attention sublayer. They predict the next token and excel at generation.
- Encoder-decoder models like T5 use both, which suits sequence-to-sequence tasks such as translation and summarization.
Every one of these is pretrained on enormous text corpora and then fine-tuned for specific tasks — the workflow the rest of this series puts into practice.
Summary
Transformers replaced recurrent networks by swapping sequential processing for self-attention, which connects every token to every other token directly. That change unlocked parallel training, long-range dependency capture, and the pretrain-then-fine-tune paradigm.
The pipeline is consistent: tokenize text, embed tokens, add positional information, apply multi-head self-attention through encoder and/or decoder stacks, and read out a representation or a generated sequence. With this foundation, you are ready to look inside specific models — starting with BERT's architecture and fine-tuning.