BERT — Bidirectional Encoder Representations from Transformers — is the encoder half of the transformer architecture, pretrained to deeply understand language. Released by Google in 2018, it set a new standard for tasks that require understanding rather than generating text: classification, named entity recognition, and question answering.
This article covers how BERT is structured, the clever pretraining tricks that give it bidirectional understanding, and how you adapt a pretrained BERT to your own task through fine-tuning. It sets up the hands-on tutorials that follow in this series.
Prerequisites: Familiarity with the transformer architecture — especially self-attention and the encoder stack.
What Is BERT?
BERT uses only the encoder stack of the transformer. It takes a sequence of tokens and produces a context-aware vector for each one — representations that downstream models can use directly.
Its defining feature is in the name: it is bidirectional. Earlier language models read text left-to-right (or right-to-left). BERT reads the entire sentence at once, so each word's representation is informed by the words both before and after it. The word "bank" in "river bank" and "bank account" gets a different representation because BERT sees the surrounding context on both sides.
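The difference between the two reading styles can be pictured as a visibility matrix that says which positions each token may attend to. The sketch below is purely illustrative: the helper name visibility and the three-token example are assumptions, not part of BERT's code.

```python
def visibility(n_tokens, bidirectional):
    """Return an n x n matrix; entry [i][j] is 1 if position i may attend to position j."""
    return [
        [1 if (bidirectional or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

tokens = ["the", "bank", "account"]
causal = visibility(len(tokens), bidirectional=False)  # left-to-right model
full = visibility(len(tokens), bidirectional=True)     # BERT-style encoder

for name, matrix in [("left-to-right", causal), ("bidirectional", full)]:
    print(name)
    for token, row in zip(tokens, matrix):
        visible = [t for t, v in zip(tokens, row) if v]
        print(f"  {token:>7} sees {visible}")
```

In the left-to-right case, "bank" cannot see "account", so it has no way to tell a financial bank from a river bank. In the bidirectional case every token sees the whole sentence.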
Note
BERT was introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (arXiv:1810.04805). The base model has 12 encoder layers, a hidden size of 768, 12 attention heads, and about 110M parameters.

BERT stacks transformer encoder layers to produce a context-aware representation of every token.
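As a sanity check on the "about 110M" figure in the note above, the parameter count can be rebuilt from the layer shapes. This is a back-of-the-envelope sketch: the feed-forward size of 3072, the 512 position embeddings, and the 2 segment embeddings are assumptions taken from the standard base configuration rather than from this article, and the function name is hypothetical.

```python
def bert_encoder_params(vocab=30522, hidden=768, layers=12,
                        intermediate=3072, max_positions=512, segments=2):
    """Count weights and biases in a BERT-style encoder (embeddings, layers, pooler)."""
    layer_norm = 2 * hidden  # one scale and one bias per hidden unit
    # Token, position, and segment embedding tables, followed by a LayerNorm.
    embeddings = (vocab + max_positions + segments) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm.
    # The number of heads only splits these matrices; it does not change the count.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (expand, then project back), followed by a LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    # Dense layer applied to the first token's final hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

total = bert_encoder_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # 109,482,240 parameters (~109.5M)
```

Roughly a fifth of the total sits in the token embedding table, which is one reason vocabulary size matters for model size.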
Tokenization and Special Tokens
BERT uses WordPiece tokenization, which splits rare words into subword pieces so the vocabulary stays a manageable size (30,522 tokens for bert-base-uncased). A subword that continues a previous token is marked with ##, as in Talk + ##ie.
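The splitting step can be sketched as a greedy longest-match-first search over the vocabulary. This is a simplified illustration, not the production tokenizer: the function name wordpiece and the toy vocabulary are hypothetical, and the input is assumed to be already lowercased, as an uncased model would do.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"talk", "##ie", "##ing", "play", "##ed", "the"}
print(wordpiece("talkie", toy_vocab))  # ['talk', '##ie']
print(wordpiece("played", toy_vocab))  # ['play', '##ed']
print(wordpiece("xyz", toy_vocab))     # ['[UNK]']
```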
Two special tokens frame every input:
- [CLS] is prepended to the sequence. Its final hidden state acts as an aggregate representation of the whole input; this is what classification heads read from.
- [SEP] separates segments (for example, two sentences in a question-answering pair) and marks the end of the input.
A typical encoded input therefore looks like [CLS] the movie was great [SEP], converted to integer IDs, with an attention mask marking which positions are real tokens versus padding.
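The framing, ID lookup, and padding described above can be sketched with a toy vocabulary. The function name encode and the integer IDs below are made up for illustration; a real checkpoint ships its own vocabulary file with different IDs.

```python
def encode(tokens, vocab, max_length):
    """Frame tokens with [CLS]/[SEP], map them to IDs, pad, and build the attention mask."""
    framed = ["[CLS]"] + tokens + ["[SEP]"]
    if len(framed) > max_length:
        raise ValueError("sequence longer than max_length")
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in framed]
    attention_mask = [1] * len(input_ids)   # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding         # 0 marks padding to be ignored
    return input_ids, attention_mask

# Hypothetical IDs, chosen only to keep the output easy to read.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "movie": 5, "was": 6, "great": 7}

input_ids, attention_mask = encode(["the", "movie", "was", "great"], toy_vocab, max_length=8)
print(input_ids)       # [2, 4, 5, 6, 7, 3, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Padding lets sentences of different lengths share one batch, and the mask keeps the padded positions from influencing attention.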
How BERT Is Pretrained
BERT learns language from huge unlabeled corpora (Wikipedia and BookCorpus) using two self-supervised tasks — no human labels required.
Masked Language Modeling (MLM)
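About 15% of the input tokens are randomly selected for prediction. In the original recipe, 80% of the selected tokens are replaced with a [MASK] token, 10% are replaced with a random token, and 10% are left unchanged, so the model cannot rely on seeing [MASK] at every position it must predict. BERT must recover the original token from the surrounding context on both sides. Because it has to fill in blanks using left and right context, the model is forced to learn genuinely bidirectional representations.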
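The selection-and-corruption step can be sketched in a few lines. This toy version assumes the 80/10/10 split from the BERT paper (a selected token becomes [MASK] 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time). The function name mask_tokens, the tiny vocabulary, and the raised selection rate in the demo are hypothetical choices made so the effect is visible on a short sentence.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where no prediction is required."""
    special = {"[CLS]", "[SEP]", "[PAD]"}
    corrupted, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens are selected with select_prob.
        if tok in special or rng.random() >= select_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # random replacement token
        else:
            corrupted.append(tok)                # left unchanged on purpose
    return corrupted, labels

rng = random.Random(0)  # fixed seed so repeated runs give the same result
vocab = ["the", "movie", "was", "great", "bad", "film"]
tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]

# select_prob is raised to 0.5 here only so a six-token demo shows some masking.
corrupted, labels = mask_tokens(tokens, vocab, rng, select_prob=0.5)
print(corrupted)
print(labels)
```

The training loss is computed only at positions where the label is not None; every other position serves as context.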
Next-Sentence Prediction (NSP)
BERT is also given pairs of sentences and must predict whether the second sentence actually follows the first. This teaches relationships between sentences, which helps tasks like question answering and natural-language inference.
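Building the training pairs for this task might look like the sketch below. It assumes the 50/50 split between true and false pairs used in the BERT paper, and it simplifies the negatives by drawing them from the same short list of sentences, whereas the paper samples them from a different document. The function name and the example sentences are hypothetical.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered list of sentences."""
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            examples.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence other than the first one and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), False))
    return examples

sentences = [
    "The movie opened on Friday.",
    "Critics praised the lead actor.",
    "Ticket sales beat expectations.",
    "A sequel is already planned.",
]

for a, b, is_next in make_nsp_pairs(sentences, random.Random(0)):
    print(f"[CLS] {a} [SEP] {b} [SEP] -> is_next={is_next}")
```

This sketch needs at least three distinct sentences so that a negative candidate always exists.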

In masked language modeling, BERT predicts hidden tokens using context from both directions.
Pretraining vs. Fine-Tuning
The power of BERT comes from a two-phase workflow:
| Phase | Data | Goal |
|---|---|---|
| Pretraining | Massive unlabeled text | Learn general language understanding (done once, by the model authors) |
| Fine-tuning | Your smaller labeled dataset | Adapt the pretrained model to a specific task |
You almost never pretrain BERT yourself — it is expensive. Instead, you download a pretrained checkpoint and fine-tune it on your data, which is fast and needs comparatively little data.
Fine-Tuning BERT for a Task
To fine-tune, you attach a small task-specific head on top of the pretrained encoder and train the whole thing on your labeled data.
Hugging Face provides ready-made classes for this:
- AutoModelForSequenceClassification adds a classification head on the [CLS] representation, for sentiment, topic, or spam classification.
- AutoModelForTokenClassification adds a per-token head, for NER and part-of-speech tagging.
- AutoModelForQuestionAnswering adds a span-prediction head, for extractive QA.
The base BERT weights start from their pretrained values; only the new head is initialized randomly. During fine-tuning, gradients flow through both the head and the encoder, gently adapting the language understanding to your task.
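To make the three head types concrete, the sketch below applies a small linear layer to stand-in encoder outputs in plain Python. It is a conceptual illustration rather than the Hugging Face implementation: the encoder output is faked with random numbers, the helper names are hypothetical, and real heads add details such as pooling and dropout.

```python
import random

def linear(vector, weights, bias):
    """Compute weights @ vector + bias using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def random_head(out_features, in_features, rng):
    """A freshly initialized head: small random weights and a zero bias."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_features)] for _ in range(out_features)]
    return weights, [0.0] * out_features

rng = random.Random(0)
seq_len, hidden = 6, 8
# Stand-in for the encoder output: one vector per token, with [CLS] at position 0.
hidden_states = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(seq_len)]

# Sequence classification: a single prediction read from the [CLS] vector (3 classes here).
w, b = random_head(3, hidden, rng)
sequence_logits = linear(hidden_states[0], w, b)

# Token classification: one prediction per token (5 tags here).
w, b = random_head(5, hidden, rng)
token_logits = [linear(h, w, b) for h in hidden_states]

# Extractive QA: a start score and an end score for every token.
w, b = random_head(2, hidden, rng)
span_logits = [linear(h, w, b) for h in hidden_states]
start = max(range(seq_len), key=lambda i: span_logits[i][0])
end = max(range(seq_len), key=lambda i: span_logits[i][1])

print(len(sequence_logits), "class scores for the whole input")
print(len(token_logits), "tokens x", len(token_logits[0]), "tag scores")
print("predicted answer span (untrained, so arbitrary):", start, end)
```

In all three cases the head is a thin layer on top of the same encoder output, which is why switching tasks only means swapping the head.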
Important
When you load AutoModelForSequenceClassification from a base checkpoint, Hugging Face warns that the classifier weights are "newly initialized." That is expected — those head weights are exactly what fine-tuning trains.

Fine-tuning keeps the pretrained encoder and trains a small head on your labeled dataset.
Why BERT Variants Exist
Full BERT is accurate but heavy. That motivated a family of smaller models — DistilBERT, TinyBERT, and MobileBERT — that keep most of BERT's accuracy while running faster and lighter through a technique called knowledge distillation. You will fine-tune these compact models later in this series.
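The core idea of knowledge distillation is that a small student model is trained to match the softened output distribution of a large teacher. The sketch below shows only that soft-target term, assuming the common formulation with a temperature and a KL divergence. The compact models named above combine it with further loss terms, and the function names and example logits here are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
print(distillation_loss([4.0, 1.0, 0.5], teacher))  # student matches the teacher
print(distillation_loss([0.5, 1.0, 4.0], teacher))  # student disagrees with the teacher
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.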
Summary
BERT is a bidirectional transformer encoder pretrained with masked language modeling and next-sentence prediction, giving it a deep, context-sensitive understanding of language. You adapt it to your own task by adding a lightweight head and fine-tuning on a labeled dataset — fast, data-efficient, and accurate.
With the theory in place, the next tutorial puts it to work: fine-tuning BERT for multi-class sentiment classification on real Twitter data.