BERT Architecture: Theory and Fine-Tuning

Learn how BERT's bidirectional encoder works, how masked language modeling pretrains it, and how a classification head adapts it to downstream NLP tasks.

Jun 18, 2026 · 5 min read

Topics You Will Master

  • What makes BERT a bidirectional encoder and why that matters
  • BERT's pretraining tasks: masked language modeling and next-sentence prediction
  • Special tokens ([CLS], [SEP]) and WordPiece subword tokenization
  • The difference between pretraining and fine-tuning

BERT — Bidirectional Encoder Representations from Transformers — is the encoder half of the transformer architecture, pretrained to deeply understand language. Released by Google in 2018, it set a new standard for tasks that require understanding rather than generating text: classification, named entity recognition, and question answering.

This article covers how BERT is structured, the clever pretraining tricks that give it bidirectional understanding, and how you adapt a pretrained BERT to your own task through fine-tuning. It sets up the hands-on tutorials that follow in this series.

Prerequisites: Familiarity with the transformer architecture — especially self-attention and the encoder stack.

What Is BERT?

BERT uses only the encoder stack of the transformer. It takes a sequence of tokens and produces a context-aware vector for each one — representations that downstream models can use directly.

Its defining feature is in the name: it is bidirectional. Earlier language models read text left-to-right (or right-to-left). BERT reads the entire sentence at once, so each word's representation is informed by the words both before and after it. The word "bank" in "river bank" and "bank account" gets a different representation because BERT sees the surrounding context on both sides.
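
To make the contrast concrete, here is a minimal sketch of which positions a token may attend to under a left-to-right model versus a bidirectional encoder. The function names are illustrative and not taken from any library; real models apply masks like these inside the attention computation.

```python
def causal_mask(n):
    """Left-to-right: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "river", "bank", "flooded"]
idx = tokens.index("bank")

# Tokens visible to "bank" under each scheme.
causal_view = [t for t, m in zip(tokens, causal_mask(len(tokens))[idx]) if m]
bidir_view = [t for t, m in zip(tokens, bidirectional_mask(len(tokens))[idx]) if m]

print(causal_view)  # ['the', 'river', 'bank']
print(bidir_view)   # ['the', 'river', 'bank', 'flooded']
```

Under the causal mask, "bank" never sees "flooded"; under the bidirectional mask it does, which is the extra context BERT relies on.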

Note

BERT was introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (arXiv:1810.04805). The base model has 12 encoder layers, a hidden size of 768, 12 attention heads, and about 110M parameters.
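
The "about 110M" figure can be checked with some arithmetic. The sketch below assumes the standard bert-base-uncased configuration: a 30,522-token vocabulary, a feed-forward size of 3072, 512 position embeddings, 2 segment types, and a pooler layer on top. Apart from the vocabulary size, those values are not stated in this article, so treat them as assumptions.

```python
# Assumed bert-base-uncased configuration. Only LAYERS, HIDDEN and VOCAB
# appear in the article; the other values are assumptions of this sketch.
VOCAB = 30522
HIDDEN = 768
LAYERS = 12
FFN = 3072      # feed-forward (intermediate) size
MAX_POS = 512   # learned position embeddings
SEGMENTS = 2    # segment (token type) embeddings


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB + MAX_POS + SEGMENTS) * HIDDEN + layer_norm
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm  # Q, K, V, output
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + layer_norm
pooler = linear(HIDDEN, HIDDEN)

total = embeddings + LAYERS * (attention + feed_forward) + pooler
print(f"{total:,}")  # 109,482,240
```

The number of attention heads does not appear in the calculation: the heads split the same 768-wide projections, so changing the head count alone would not change the parameter count.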

Figure: BERT stacks transformer encoder layers to produce a context-aware representation of every token.


Tokenization and Special Tokens

BERT uses WordPiece tokenization, which splits rare words into subword pieces so the vocabulary stays a manageable size (30,522 tokens for bert-base-uncased). A piece that continues a word, rather than starting it, is marked with ##, as in a split such as talk + ##ie. Because bert-base-uncased lowercases its input, the pieces are lowercase too.
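
The splitting rule can be sketched as a greedy longest-match-first search. This is a simplified illustration with a hypothetical toy vocabulary, not the tokenizer shipped with any BERT checkpoint; production tokenizers also handle normalization, punctuation, and maximum word length.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not BERT's real one.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece("xyz", toy_vocab))         # ['[UNK]']
```

Note that "play" and "##play" are separate vocabulary entries: the same characters get a different piece depending on whether they start a word or continue one.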

Two special tokens frame every input:

  • [CLS] is prepended to the sequence. Its final hidden state acts as an aggregate representation of the whole input — this is what classification heads read from.
  • [SEP] separates segments (for example, two sentences in a question-answering pair) and marks the end of the input.

A typical encoded input therefore looks like [CLS] the movie was great [SEP], converted to integer IDs, with an attention mask marking which positions are real tokens versus padding.
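
Here is a minimal sketch of that encoding step. The vocabulary and its integer IDs are hypothetical; with a real checkpoint the IDs come from the tokenizer's vocabulary file, and the tokenizer performs all of this for you.

```python
# Hypothetical toy vocabulary; real IDs come from the checkpoint's vocab file.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "movie": 5, "was": 6, "great": 7}


def encode(tokens, max_len):
    """Frame tokens with [CLS]/[SEP], map them to IDs, and pad to max_len."""
    framed = ["[CLS]"] + tokens + ["[SEP]"]
    if len(framed) > max_len:
        raise ValueError("sequence longer than max_len")
    ids = [vocab.get(t, vocab["[UNK]"]) for t in framed]
    mask = [1] * len(ids)  # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [vocab["[PAD]"]] * padding, mask + [0] * padding


input_ids, attention_mask = encode(["the", "movie", "was", "great"], max_len=8)
print(input_ids)       # [2, 4, 5, 6, 7, 3, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The two trailing zeros in the attention mask tell the model to ignore the padding positions, so sequences of different lengths can share one batch.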


How BERT Is Pretrained

BERT learns language from huge unlabeled corpora (Wikipedia and BookCorpus) using two self-supervised tasks — no human labels required.

Masked Language Modeling (MLM)

About 15% of the input tokens are randomly selected for prediction. Of those, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. BERT must predict the original token at every selected position. Because it has to fill in the blanks using both left and right context, the model is forced to learn genuinely bidirectional representations.
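
A minimal sketch of the corruption step follows. It uses the split described in the BERT paper: of the positions selected for prediction, 80% become [MASK], 10% become a random token, and 10% keep their original token. The function name and toy vocabulary are hypothetical, and real pipelines work on token IDs rather than strings.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); labels are None where nothing is predicted."""
    special = {"[CLS]", "[SEP]", "[PAD]"}
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in special or rng.random() >= select_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # random replacement
        # otherwise the original token stays in place
    return corrupted, labels


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["the", "movie", "was", "great", "film", "bad"]  # hypothetical toy vocabulary
tokens = ["[CLS]"] + ["the", "movie", "was", "great"] * 50 + ["[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, rng)

selected = sum(label is not None for label in labels)
print(f"selected {selected} of {len(tokens) - 2} tokens")
```

The loss is computed only at positions where the label is set. Leaving some selected tokens unmasked reduces the mismatch with fine-tuning, where [MASK] never appears in the input.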

Next-Sentence Prediction (NSP)

BERT is also given pairs of sentences and must predict whether the second sentence actually follows the first. This teaches relationships between sentences, which helps tasks like question answering and natural-language inference.
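
The sketch below builds such training pairs. In the original paper, half of the pairs use the true next sentence and half use a random sentence from the corpus. As a simplification, this version draws the random sentence from the same short document, and the function name is hypothetical.

```python
import random


def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples from one ordered document."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Pick any sentence other than sentence_a and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs


doc = ["i went to the store", "i bought milk",
       "then i walked home", "the milk was cold"]

for a, b, is_next in make_nsp_pairs(doc, random.Random(0)):
    label = "IsNext" if is_next else "NotNext"
    print(f"[CLS] {a} [SEP] {b} [SEP] -> {label}")
```

Each pair is packed into a single input with [SEP] between the two segments, and the prediction is read from the [CLS] position.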

Figure: In masked language modeling, BERT predicts hidden tokens using context from both directions.


Pretraining vs. Fine-Tuning

The power of BERT comes from a two-phase workflow:

Phase        | Data                         | Goal
Pretraining  | Massive unlabeled text       | Learn general language understanding (done once, by the model authors)
Fine-tuning  | Your smaller labeled dataset | Adapt the pretrained model to a specific task

You almost never pretrain BERT yourself — it is expensive. Instead, you download a pretrained checkpoint and fine-tune it on your data, which is fast and needs comparatively little data.


Fine-Tuning BERT for a Task

To fine-tune, you attach a small task-specific head on top of the pretrained encoder and train the whole thing on your labeled data.

Hugging Face provides ready-made classes for this:

  • AutoModelForSequenceClassification adds a classification head on the [CLS] representation — for sentiment, topic, or spam classification.
  • AutoModelForTokenClassification adds a per-token head — for NER and part-of-speech tagging.
  • AutoModelForQuestionAnswering adds a span-prediction head — for extractive QA.

The base BERT weights start from their pretrained values; only the new head is initialized randomly. During fine-tuning, gradients flow through both the head and the encoder, gently adapting the language understanding to your task.
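
To show how small the new head is, here is a sketch of a classification head as a single linear layer over the [CLS] vector, followed by softmax and a cross-entropy loss. It is a simplification: the random [CLS] vector stands in for real encoder output, the 0.02 initialization scale is an assumption, and the Hugging Face BERT classes also pass the [CLS] state through a pooler layer and dropout first.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ClassificationHead:
    """A single linear layer mapping the [CLS] vector to class logits."""

    def __init__(self, hidden_size, num_labels, rng):
        # Randomly initialized, unlike the pretrained encoder beneath it.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                       for _ in range(num_labels)]
        self.bias = [0.0] * num_labels

    def __call__(self, cls_vector):
        return [sum(w * x for w, x in zip(row, cls_vector)) + b
                for row, b in zip(self.weight, self.bias)]


rng = random.Random(0)
hidden_size, num_labels = 768, 3

# Stand-in for the encoder's final hidden state at the [CLS] position.
cls_vector = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]

head = ClassificationHead(hidden_size, num_labels, rng)
probs = softmax(head(cls_vector))
loss = -math.log(probs[1])  # cross-entropy if the true label is class 1

print([round(p, 3) for p in probs], round(loss, 3))
```

With 768 inputs and 3 labels the head has only 2,307 parameters, tiny next to the encoder. Before training, its predictions sit close to uniform, which is why the head has to be trained before the model is useful for your task.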

Important

When you load AutoModelForSequenceClassification from a base checkpoint, Hugging Face warns that the classifier weights are "newly initialized." That is expected — those head weights are exactly what fine-tuning trains.

Figure: Fine-tuning starts from the pretrained encoder and trains it, together with a small task head, on your labeled dataset.


Why BERT Variants Exist

Full BERT is accurate but heavy. That motivated a family of smaller models — DistilBERT, TinyBERT, and MobileBERT — that keep most of BERT's accuracy while running faster and lighter through a technique called knowledge distillation. You will fine-tune these compact models later in this series.
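
The core idea of knowledge distillation can be sketched as a loss that pushes a small student model toward a large teacher's output distribution. The version below is the generic soft-target formulation with a temperature, using hypothetical logits. It is not the exact recipe of DistilBERT, TinyBERT, or MobileBERT, each of which combines several loss terms.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher values flatten the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))


teacher_logits = [4.0, 1.0, 0.5]  # hypothetical teacher outputs
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]     # prefers a different class

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

The student that agrees with the teacher gets the smaller loss. Training on these soft targets passes along more information than hard labels alone, since the teacher's distribution also encodes which wrong answers are nearly right.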


Summary

BERT is a bidirectional transformer encoder pretrained with masked language modeling and next-sentence prediction, giving it a deep, context-sensitive understanding of language. You adapt it to your own task by adding a lightweight head and fine-tuning on a labeled dataset — fast, data-efficient, and accurate.

With the theory in place, the next tutorial puts it to work: fine-tuning BERT for multi-class sentiment classification on real Twitter data.
