BERT Architecture: Theory and Fine-Tuning

BERT stands for Bidirectional Encoder Representations from Transformers. Put simply, it is the encoder half of the transformer, pretrained to understand language rather than generate it. Google released it in 2018, and it set a new bar for understanding tasks such as classification, named entity recognition, and question answering.

BERT's design has three parts worth learning. First is how it is built. Second is the pretraining that gives it two-way understanding. Third is how we adapt a pretrained BERT to our own task through fine-tuning. This sets up the hands-on tutorials that follow.

Prerequisites: Familiarity with the transformer architecture, especially self-attention and the encoder stack.

What Is BERT?

BERT uses only the encoder stack of the transformer. It takes a sequence of tokens and produces a context-aware vector for each one. Downstream models can use these vectors directly.

Its main feature is in the name. It is bidirectional. Earlier language models read text left to right, or right to left. BERT reads the whole sentence at once. So each word's representation uses the words both before and after it. Take the word bank. In river bank and bank account, it gets a different representation. That is because BERT sees the context on both sides.
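To make the difference concrete, here is a minimal sketch of which positions each token is allowed to look at under the two schemes. The function name visible_positions and the example sentence are made up for illustration. Real models enforce this with an attention mask inside self-attention, not with Python lists.

```python
def visible_positions(seq_len, bidirectional):
    """For each position i, return the positions it may attend to."""
    if bidirectional:
        # BERT-style: every token sees the whole sequence.
        return [list(range(seq_len)) for _ in range(seq_len)]
    # Left-to-right: token i sees only itself and earlier tokens.
    return [list(range(i + 1)) for i in range(seq_len)]


tokens = ["the", "river", "bank", "flooded"]
left_to_right = visible_positions(len(tokens), bidirectional=False)
both_sides = visible_positions(len(tokens), bidirectional=True)

bank = tokens.index("bank")
print([tokens[j] for j in left_to_right[bank]])  # ['the', 'river', 'bank']
print([tokens[j] for j in both_sides[bank]])     # ['the', 'river', 'bank', 'flooded']
```

In the left-to-right case, the representation of bank cannot use flooded. In the bidirectional case it can.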

Note

BERT was introduced in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805). The base model has 12 encoder layers, a hidden size of 768, 12 attention heads, and about 110M parameters.
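The 110M figure can be checked with simple arithmetic. The sketch below counts the weights of a BERT-base encoder from its dimensions. The layer count and hidden size come from the note above. The vocabulary size (30,522), feed-forward size (3,072), maximum position count (512), and two segment types are assumptions taken from the standard bert-base-uncased configuration. The function name is hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def bert_encoder_params(vocab_size=30522, hidden=768, layers=12,
                        intermediate=3072, max_positions=512, type_vocab=2):
    layer_norm = 2 * hidden  # one scale and one shift vector

    # Token, position, and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections, then LayerNorm.
    attention = 4 * linear_params(hidden, hidden) + layer_norm

    # Feed-forward block: expand, project back, then LayerNorm.
    feed_forward = (linear_params(hidden, intermediate)
                    + linear_params(intermediate, hidden)
                    + layer_norm)

    # Pooler: a dense layer applied to the [CLS] vector.
    pooler = linear_params(hidden, hidden)

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_encoder_params()
print(total)  # 109482240
```

Note that the number of attention heads does not appear in the count. The 12 heads split the 768 dimensions between them, so they change how the projections are used, not how many weights they hold.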

Figure: BERT stacks transformer encoder layers to produce a context-aware representation of every token.

Tokenization and Special Tokens

BERT uses WordPiece tokenization. It splits rare words into subword pieces. This keeps the vocabulary at a manageable size: 30,522 tokens for bert-base-uncased. A subword that continues a previous token is marked with ##, as in talk + ##ie (the uncased model lowercases text before splitting it).
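The splitting rule can be sketched as a greedy longest-match search over the vocabulary. This is a simplified illustration, not the library implementation. The tiny vocabulary TOY_VOCAB and the function name are made up, and real tokenizers also handle lowercasing, punctuation, and very long words.

```python
TOY_VOCAB = {"talk", "play", "the", "##ie", "##ing", "##ed"}  # invented for this sketch


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_tokenize("talkie", TOY_VOCAB))   # ['talk', '##ie']
print(wordpiece_tokenize("playing", TOY_VOCAB))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", TOY_VOCAB))      # ['[UNK]']
```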

Two special tokens frame every input:

[CLS] is added at the start of the sequence. Its final hidden state acts as a summary of the whole input. This is what classification heads read from.
[SEP] separates segments, for example two sentences in a question-answering pair. It also marks the end of the input.

So a typical encoded input looks like [CLS] the movie was great [SEP]. This is then converted to integer IDs. An attention mask marks which positions are real tokens and which are padding.
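Here is a minimal sketch of that encoding step for a single sentence. The vocabulary and its integer IDs are invented for the example and do not match any real checkpoint. The function name encode is also hypothetical. In practice a Hugging Face tokenizer does all of this for us.

```python
# Toy vocabulary: these IDs are made up and differ from real BERT IDs.
TOY_IDS = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
           "the": 4, "movie": 5, "was": 6, "great": 7}


def encode(tokens, vocab, max_len):
    """Frame tokens with [CLS]/[SEP], map to IDs, and pad to max_len."""
    framed = ["[CLS]"] + tokens + ["[SEP]"]
    if len(framed) > max_len:
        raise ValueError("sequence is longer than max_len")

    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in framed]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token

    padding = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding        # 0 marks padding
    return input_ids, attention_mask


ids, mask = encode(["the", "movie", "was", "great"], TOY_IDS, max_len=8)
print(ids)   # [2, 4, 5, 6, 7, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The mask lets the model ignore the two padding positions at the end.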

How BERT Is Pretrained

BERT learns language from huge unlabeled corpora like Wikipedia and BookCorpus. It uses two self-supervised tasks. No human labels are required.

Masked Language Modeling (MLM)

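About 15% of the input tokens are randomly selected for prediction. Most of them (80%) are replaced with a [MASK] token. The rest are either swapped for a random token (10%) or left unchanged (10%), so the model does not come to rely on seeing [MASK], which never appears during fine-tuning. BERT must then predict the original token at each selected position from the context on both sides. This forces the model to learn truly two-way representations.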
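The masking step can be sketched in a few lines. This is a simplified illustration of the recipe in the BERT paper: select about 15% of tokens, then turn 80% of the selected ones into [MASK], 10% into a random token, and leave 10% as they are. The function name mask_tokens is hypothetical. As a simplifying assumption, each token is selected independently, while real pretraining code picks a fixed number of positions per sequence.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, replacement_vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)

    for i, tok in enumerate(tokens):
        # Special tokens are never selected; others are selected with mask_prob.
        if tok in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(replacement_vocab)
        # Otherwise the token is left unchanged but is still predicted.
    return corrupted, labels


sentence = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
corrupted, labels = mask_tokens(sentence, ["the", "movie", "was", "great"],
                                rng=random.Random(0))
print(corrupted)
print(labels)
```

The loss is computed only at positions where the label is not None.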

Next-Sentence Prediction (NSP)

BERT is also given pairs of sentences. It must predict whether the second sentence really follows the first. This teaches relationships between sentences. That helps tasks like question answering and natural-language inference.
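Building the training pairs is simple to sketch. In the BERT paper, half of the pairs use the true next sentence (labeled IsNext) and half use a random sentence from the corpus (labeled NotNext). The function name make_nsp_examples and the toy documents are made up. As a simplifying assumption, the random sentence is always drawn from a different document.

```python
import random


def make_nsp_examples(documents, rng=None):
    """Build (sentence_a, sentence_b, label) triples from lists of sentences."""
    if len(documents) < 2:
        raise ValueError("need at least two documents to sample negatives")
    rng = rng or random.Random()
    examples = []

    for d, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive pair: the sentence that really follows.
                examples.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative pair: a random sentence from another document.
                others = [k for k in range(len(documents))
                          if k != d and documents[k]]
                other_doc = documents[rng.choice(others)]
                examples.append((doc[i], rng.choice(other_doc), "NotNext"))
    return examples


toy_documents = [
    ["the movie was great", "the acting was superb", "i would watch it again"],
    ["the bank raised rates", "savers were pleased", "borrowers were not"],
]
for example in make_nsp_examples(toy_documents, rng=random.Random(1)):
    print(example)
```

Each pair is then encoded as [CLS] sentence A [SEP] sentence B [SEP], and the label is predicted from the [CLS] position.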

Figure: In masked language modeling, BERT predicts hidden tokens using context from both directions.

Pretraining vs. Fine-Tuning

The strength of BERT comes from a two-phase workflow:

Phase	Data	Goal
Pretraining	Massive unlabeled text	Learn general language understanding (done once, by the model authors)
Fine-tuning	Your smaller labeled dataset	Adapt the pretrained model to a specific task

We almost never pretrain BERT ourselves. It is too expensive. Instead, we download a pretrained checkpoint and fine-tune it on our data. This is fast and needs much less data.

Fine-Tuning BERT for a Task

To fine-tune, we attach a small task-specific head on top of the pretrained encoder. Then we train the whole thing on our labeled data.

Hugging Face provides ready-made classes for this:

AutoModelForSequenceClassification adds a classification head on the [CLS] representation. We use it for sentiment, topic, or spam classification.
AutoModelForTokenClassification adds a per-token head. We use it for NER and part-of-speech tagging.
AutoModelForQuestionAnswering adds a span-prediction head. We use it for extractive QA.
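The three heads differ mainly in what they read and the shape of what they output. The sketch below mimics that with plain Python lists. It is a toy illustration under stated assumptions, not the Hugging Face implementation. The helper names, the hidden size of 8, and the label counts are made up. The random hidden states stand in for real encoder output. The real sequence-classification class for BERT also passes the [CLS] vector through a pooling layer and dropout first, which is left out here.

```python
import random


def new_linear(n_in, n_out, rng):
    """A freshly initialized linear layer: small random weights, zero bias."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def apply_linear(layer, vector):
    weight, bias = layer
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


rng = random.Random(0)
hidden_size, seq_len = 8, 5  # toy sizes; BERT-base uses a hidden size of 768

# Stand-in for the encoder output: one vector per token, [CLS] first.
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
                 for _ in range(seq_len)]

# Sequence classification: one set of logits, read from the [CLS] position.
sequence_head = new_linear(hidden_size, 3, rng)   # 3 classes, for example
sequence_logits = apply_linear(sequence_head, hidden_states[0])

# Token classification: one set of logits per token.
token_head = new_linear(hidden_size, 4, rng)      # 4 tags, for example
token_logits = [apply_linear(token_head, h) for h in hidden_states]

# Question answering: a start score and an end score per token.
qa_head = new_linear(hidden_size, 2, rng)
span_scores = [apply_linear(qa_head, h) for h in hidden_states]
start_logits = [start for start, _ in span_scores]
end_logits = [end for _, end in span_scores]

print(len(sequence_logits))                     # 3
print(len(token_logits), len(token_logits[0]))  # 5 4
print(len(start_logits), len(end_logits))       # 5 5
```

In each case the head is a single small linear layer. Almost all of the model's weights sit in the encoder below it.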

The base BERT weights start from their pretrained values. Only the new head starts from random values. During fine-tuning, gradients flow through both the head and the encoder. This gently adapts the language understanding to our task.

Important

When we load AutoModelForSequenceClassification from a base checkpoint, Hugging Face warns that the classifier weights are newly initialized. That is expected. Those head weights are exactly what fine-tuning trains.

Figure: Fine-tuning starts from the pretrained encoder, adds a small task head, and trains both on your labeled dataset.

Why BERT Variants Exist

Full BERT is accurate but heavy. So a family of smaller models grew around it: DistilBERT, TinyBERT, and MobileBERT. They keep most of BERT's accuracy while running faster and lighter. They do this through a technique called knowledge distillation. We will fine-tune these compact models later in this series.
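In knowledge distillation, a small student model is trained to match the output distribution of a large teacher model. A common form of the loss compares temperature-softened softmax outputs with a KL divergence. The sketch below shows that generic form. It is not the exact training recipe of DistilBERT, TinyBERT, or MobileBERT, which add further loss terms. The function names, logits, and temperature value are chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by temperature squared."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2


teacher_logits = [4.0, 1.0, -2.0]
print(distillation_loss([4.0, 1.0, -2.0], teacher_logits))  # 0.0
print(distillation_loss([0.5, 0.3, 0.2], teacher_logits) > 0)  # True
```

The loss is zero when the student reproduces the teacher exactly and grows as their distributions drift apart.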

Summary

This is how BERT works. It is a two-way transformer encoder. It is pretrained with masked language modeling and next-sentence prediction. That gives it a deep, context-aware understanding of language. We adapt it to our own task by adding a light head and fine-tuning on a labeled dataset. This is fast, data-efficient, and accurate.

With the theory in place, the next tutorial puts it to work. We will fine-tune BERT for multi-class sentiment classification on real Twitter data.
