Transformer Architecture & LLM Foundations

The transformer is the architecture behind nearly every modern large language model. GPT, BERT, T5, and LLaMA are all built on it. It came from the 2017 paper "Attention Is All You Need". Before that, the dominant models read text one word at a time. The transformer replaced that with a mechanism called self-attention. Put simply, self-attention looks at the whole sequence at once.

In this post, we will build the architecture up piece by piece. First we see why the older recurrent approach hit a wall. Then we see how attention solves it. Finally, we see how the encoder and decoder stacks combine into the model that powers today's LLMs.

Prerequisites: A working knowledge of neural networks (layers, weights, training) is helpful. No coding is required to follow along. This is a conceptual foundation for the hands-on fine-tuning tutorials that follow.

Before Transformers: RNNs and Seq2Seq

Before 2017, sequence tasks like translation relied on recurrent neural networks (RNNs) and their gated variant, LSTMs. These models ran in an encoder-decoder setup, also called seq2seq. The encoder read the input one word at a time. It squeezed everything into a single fixed-size context vector. The decoder then generated the output from that one vector.

This design has two big problems:

Memory bottleneck. Squeezing an entire sentence or paragraph into one fixed vector loses information. The longer the input, the more the model forgets the start by the time it reaches the end.
Vanishing gradients. RNNs process tokens one after another. So the training signal has to flow back through every step. Over long sequences the gradients shrink toward zero. The model then fails to learn long-range dependencies between distant words.

There is also a practical cost. Recurrence is sequential by nature. We cannot compute step 10 until step 9 is done. This makes RNNs slow to train, because the work inside a sequence cannot be parallelized across time steps on modern GPUs.
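The problems above can be seen in a few lines of NumPy. This is a minimal sketch, not a real seq2seq model. The sizes, the random weights, and the variable names are all assumptions made for illustration. The weights are deliberately small, which is the case where gradients vanish. With large weights they can explode instead.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size, seq_len = 4, 3, 50  # hypothetical toy sizes

# Hypothetical small random weights; a trained model would learn these.
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.1, size=(hidden_size, embed_size))
inputs = rng.normal(size=(seq_len, embed_size))  # one vector per token

h = np.zeros(hidden_size)
grad_wrt_start = np.eye(hidden_size)  # d h_t / d h_0, built up step by step

for x_t in inputs:
    # Step t needs the hidden state from step t-1, so this loop
    # cannot be run in parallel across time steps.
    h = np.tanh(W_h @ h + W_x @ x_t)
    # Chain rule: Jacobian of this step times the product so far.
    step_jacobian = (1.0 - h**2)[:, None] * W_h
    grad_wrt_start = step_jacobian @ grad_wrt_start

context = h  # the single fixed-size vector handed to the decoder
print("context shape:", context.shape)
print("gradient norm after", seq_len, "steps:", np.linalg.norm(grad_wrt_start))
```

The context vector keeps the same size no matter how long the input is. The gradient norm collapses toward zero, so the first tokens barely influence learning.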

Note

The seq2seq attention idea from Bahdanau et al. (2014, arXiv:1409.0473) first let the decoder look back at all encoder states. The transformer took this further and removed recurrence completely.

Why Transformers Win

The transformer (arXiv:1706.03762) keeps the attention idea and throws away recurrence. That one change brings four advantages:

Advantage	What it means
Parallel processing	During training, every token is processed at the same time rather than one after another, so training is far faster on GPUs
Reduced vanishing gradients	Direct connections between any two tokens keep the gradient path short
Long-range dependency capture	Self-attention links any word to any other word in the sequence, regardless of distance
Transfer learning	Pretrain once on a huge corpus, then fine-tune cheaply on many downstream tasks

Diagram contrasting sequential RNN processing with the parallel, all-to-all attention of a transformer

RNNs process tokens one at a time; transformers attend to all tokens in parallel.

From Text to Vectors

A transformer cannot read raw text strings.

The text passes through a few preparation steps before the model sees it:

Tokenized text. The input string is broken into small units: words or subwords. For example, an uncommon word may split into pieces like Talk and ##ie, where the ## prefix marks a piece that continues the previous one.
Token encodings. Each token is mapped to a unique integer ID from the model's vocabulary.
Token embedding. Each ID is looked up in an embedding table. This turns it into a dense vector that captures meaning. Similar words land near each other in this vector space.
Positional embedding. Attention has no built-in sense of order. So a positional encoding is added to each token embedding. This tells the model where each token sits in the sequence.

The result is a sequence of vectors that carry both meaning and position. They are now ready for the attention layers.
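The four steps above can be sketched in NumPy. Everything here is a toy assumption: the six-entry vocabulary, the `tokenize` helper, and the random embedding table are hypothetical. Real tokenizers learn their vocabulary from data. The sinusoidal positional encoding follows the original transformer paper. Many models learn position embeddings instead.

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies have tens of thousands of entries.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "talk": 4, "##ie": 5}

def tokenize(text):
    """Split on whitespace, then try one subword split for unknown words."""
    tokens = []
    for word in text.lower().split():
        if word in vocab:
            tokens.append(word)
            continue
        # Look for a known prefix followed by a known "##" continuation piece.
        for cut in range(len(word) - 1, 0, -1):
            head, tail = word[:cut], "##" + word[cut:]
            if head in vocab and tail in vocab:
                tokens.extend([head, tail])
                break
        else:
            tokens.append("[UNK]")
    return tokens

def sinusoidal_positions(seq_len, d_model):
    """Fixed sine/cosine position vectors (d_model must be even)."""
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000 ** (dims / d_model))
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in a real model

tokens = tokenize("The talkie sat")           # 1. tokenized text
token_ids = [vocab[t] for t in tokens]        # 2. token encodings
token_vectors = embedding_table[token_ids]    # 3. token embeddings
model_input = token_vectors + sinusoidal_positions(len(tokens), d_model)  # 4. add position

print(tokens)             # ['the', 'talk', '##ie', 'sat']
print(token_ids)          # [1, 4, 5, 3]
print(model_input.shape)  # (4, 8)
```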

Self-Attention

Self-attention is the heart of the transformer. For each token, it asks one question. How much should I pay attention to every other token in this sequence?

Each token's embedding is projected into three vectors: a query (Q), a key (K), and a value (V). The model scores every query against every key with a dot product. It divides those scores by the square root of the key dimension and passes them through softmax to get attention weights, which sum to 1 for each token. It then uses those weights to take a weighted sum of the value vectors. Tokens that are relevant to each other get high weights.

Every token is compared with every other token directly. So self-attention captures relationships across the whole sequence in a single step. No information has to travel through in-between recurrent states.
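Here is a minimal NumPy sketch of a single self-attention step, assuming random projection matrices and toy sizes. The names `self_attention`, `W_q`, `W_k`, and `W_v` are chosen for illustration. In a real model the projections are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v       # project each token three ways
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every query against every key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4               # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))       # stand-in for embedded tokens
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, W_q, W_k, W_v)
print(output.shape)   # (5, 4)
print(weights.shape)  # (5, 5)
```

The whole comparison is two matrix multiplications. There is no loop over positions, which is why it runs well on GPUs.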

Diagram of self-attention: query, key, and value projections producing attention-weighted outputs

Self-attention scores every token against every other token using queries, keys, and values.

Multi-Head Self-Attention

One attention pattern is rarely enough. Multi-head self-attention runs several attention operations in parallel. We call each one a head. Each head can learn a different kind of relationship. One head might track grammar. Another might track which noun a pronoun refers to. Their outputs are concatenated and projected back to the model dimension. This gives the layer a fuller view of the sequence.
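A minimal NumPy sketch of the idea, assuming random weights and that the model dimension divides evenly by the number of heads. The function name and the `W_o` output projection are illustrative names, not a specific library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # one pattern per head
    heads = weights @ V                                  # (heads, seq, d_head)
    # Join the heads back together, then project to the model dimension.
    joined = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return joined @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2  # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(output.shape)   # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

Each head works on its own slice of the projected vectors, so the heads are free to learn different attention patterns.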

The Encoder and Decoder

A full transformer has two stacks:

Encoder stack. This is a stack of identical layers. Each layer combines multi-head self-attention with a feed-forward network. It also adds layer normalization and residual connections. The encoder turns the input embeddings into a sequence of context-rich hidden states. This is a representation of the whole input.
Decoder stack. This is also a stack of layers. But each decoder layer has two attention mechanisms. Masked self-attention lets each output position attend only to earlier positions. This stops the model from cheating by looking at future words while it generates. Cross-attention, also called encoder-decoder attention, lets the decoder attend to the encoder's hidden states. So the decoder focuses on the relevant parts of the input as it produces each output token.
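The two decoder attention mechanisms can be sketched in NumPy. This is a simplified illustration: the learned Q, K, and V projections are left out, and the states are random stand-ins. The `attention` helper and its boolean `mask` argument are assumptions made for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        # Blocked positions get a very negative score, so softmax gives them ~0.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 6, 4, 8  # hypothetical toy sizes
encoder_states = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))

# Masked self-attention: position i may look at positions 0..i only.
causal_mask = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
_, self_weights = attention(decoder_states, decoder_states, decoder_states, causal_mask)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
_, cross_weights = attention(decoder_states, encoder_states, encoder_states)

print(np.round(self_weights, 2))  # upper triangle is all zeros
print(cross_weights.shape)        # (4, 6)
```

The first output position can only attend to itself. Every output position can attend to all six input positions through cross-attention.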

Component	Role
Encoder	Reads the input and produces hidden-state representations
Self-attention	Relates each input token to every other input token
Multi-head attention	Learns several relationship patterns in parallel
Decoder	Generates the output sequence token by token
Masked self-attention	Prevents the decoder from seeing future tokens
Cross-attention	Lets the decoder focus on relevant encoder states

Diagram of the full transformer with encoder stack, decoder stack, masked attention, and cross-attention

The encoder builds a representation of the input; the decoder uses masked and cross-attention to generate output.
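One encoder layer can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: a single attention head, a ReLU feed-forward network, layer normalization without learned scale and shift, and normalization applied after each residual connection as in the original paper. The parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

def encoder_layer(x, p):
    attended = self_attention(x, p["W_q"], p["W_k"], p["W_v"]) @ p["W_o"]
    x = layer_norm(x + attended)                      # residual + layer norm
    hidden = np.maximum(0.0, x @ p["W_1"]) @ p["W_2"] # feed-forward network
    return layer_norm(x + hidden)                     # residual + layer norm

def init_params(rng, d_model, d_ff):
    shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
              "W_v": (d_model, d_model), "W_o": (d_model, d_model),
              "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_layers = 5, 8, 16, 2  # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))           # stand-in for embedded input

hidden_states = x
for layer_params in [init_params(rng, d_model, d_ff) for _ in range(num_layers)]:
    hidden_states = encoder_layer(hidden_states, layer_params)

print(hidden_states.shape)  # (5, 8)
```

The output has the same shape as the input, which is what lets identical layers be stacked.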

How This Maps to Modern LLMs

Different model families use different parts of this architecture:

Encoder-only models like BERT use only the encoder stack. They produce rich bidirectional representations, where each token sees context on both sides. They are great at understanding tasks like classification, NER, and question answering.
Decoder-only models like GPT, Phi, and LLaMA use only the decoder stack with masked self-attention. With no encoder to attend to, they have no cross-attention. They predict the next token. They are great at generation.
Encoder-decoder models like T5 use both stacks. This suits sequence-to-sequence tasks such as translation and summarization.

Every one of these is pretrained on huge text corpora and then fine-tuned for specific tasks. That is the workflow the rest of this series puts into practice.

Summary

This is how the transformer works. It replaced recurrent networks by swapping word-by-word processing for self-attention. Self-attention connects every token to every other token directly. That one change enabled parallel training. It captured long-range dependencies. And it made the pretrain-then-fine-tune approach practical.

The flow is always the same. We tokenize the text, embed the tokens, and add positional information. We then apply multi-head self-attention through the encoder or decoder stacks, or both. Finally, we read out a representation or a generated sequence. With this foundation, we are ready to look inside specific models. We start with BERT's architecture and fine-tuning.
