Module Overview
This module bridges theory and practice. It revisits the attention mechanism at the implementation level, as a conceptual walkthrough rather than production code, and then walks through the fine-tuning workflow for each of the three Transformer families (encoder-only, decoder-only, and encoder–decoder) on custom datasets, highlighting where each architecture excels.
Learning Objectives
- Describe how attention is assembled from its component operations.
- Outline the fine-tuning workflow for an encoder-only classifier.
- Outline the fine-tuning workflow for a decoder-only generative model.
- Outline the fine-tuning workflow for an encoder–decoder sequence-to-sequence model.
- Map task types (classification, generation, summarisation/translation) to the most suitable architecture.
Topics Covered
Attention Implementation
- Coding attention mechanisms — the operations behind self-attention and multi-head attention (conceptual walkthrough)
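As a rough preview of the walkthrough, the operations behind self-attention can be sketched in plain Python. This is a minimal illustration, not production code: it assumes the queries, keys, and values are the token vectors themselves (the learned projection matrices are omitted), and the helper names and toy `tokens` matrix are made up for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_attention(x, num_heads):
    """Split features into heads, attend within each head, then concatenate.

    Assumption: the learned Q/K/V and output projections are left out,
    so each head simply attends over its own slice of the features.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly into heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        chunk = [row[h * d_head:(h + 1) * d_head] for row in x]
        out, _ = attention(chunk, chunk, chunk)
        head_outputs.append(out)
    # Concatenate the per-head outputs back to d_model features per token.
    return [sum((head[i] for head in head_outputs), []) for i in range(len(x))]

# Toy input: 3 tokens, 4 features each (hypothetical values).
tokens = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0],
          [1.0, 1.0, 0.0, 0.0]]
output, weights = attention(tokens, tokens, tokens)
mha_output = multi_head_attention(tokens, num_heads=2)
```

Each row of `weights` sums to 1 and says how strongly one token attends to every other token. A real Transformer layer would add learned linear projections before and after these steps.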
Fine-Tuning Workflows
- Fine-tuning DistilBERT (encoder-only) with custom data — classification and token-level tasks
- Fine-tuning DistilGPT-2 (decoder-only) with custom data — text generation and completion
- Fine-tuning T5 (encoder–decoder) with custom data — summarisation, translation, and text-to-text tasks
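One practical difference between the three workflows is how a custom example is laid out before tokenization. The sketch below illustrates that difference under stated assumptions: the function names and dictionary keys (`text`, `label`, `input_text`, `target_text`) are hypothetical, while the `summarize: ` task prefix and the `<|endoftext|>` end-of-sequence marker follow the conventions commonly used with T5 and GPT-2 style models respectively.

```python
def format_for_classification(text, label, label_to_id):
    """Encoder-only (e.g. DistilBERT): one input text paired with an integer label."""
    return {"text": text, "label": label_to_id[label]}

def format_for_causal_lm(prompt, completion, eos_token="<|endoftext|>"):
    """Decoder-only (e.g. DistilGPT-2): one continuous string the model learns to continue."""
    return {"text": f"{prompt}{completion}{eos_token}"}

def format_for_seq2seq(source, target, task_prefix="summarize: "):
    """Encoder-decoder (e.g. T5): a prefixed source string in, a target string out."""
    return {"input_text": f"{task_prefix}{source}", "target_text": target}

# Hypothetical records, one per architecture family.
label_to_id = {"negative": 0, "positive": 1}
clf_example = format_for_classification("Great battery life.", "positive", label_to_id)
lm_example = format_for_causal_lm("Q: What is attention?\nA:", " A weighted average of values.")
s2s_example = format_for_seq2seq("The meeting ran long and covered budgets.", "Long budget meeting.")
```

The classifier sees an input and a separate label, the generative model sees a single sequence to continue, and the sequence-to-sequence model sees distinct input and target strings. That layout is largely what makes each family a natural fit for its task type.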
Key Concepts & Terminology
Encoder-only vs decoder-only vs encoder–decoder task fit, transfer learning, dataset formatting, train/validation split, evaluation against a held-out set, catastrophic forgetting.
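Two of these terms, the train/validation split and evaluation against a held-out set, can be illustrated with a short sketch. It assumes examples are simple `(text, label)` pairs. The function names, the 20% validation fraction, and the stand-in `predict` function are illustrative choices, not part of any library.

```python
import random

def train_validation_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def accuracy(predict, held_out):
    """Fraction of held-out (text, label) pairs that predict() labels correctly."""
    correct = sum(1 for text, label in held_out if predict(text) == label)
    return correct / len(held_out)

# Hypothetical toy dataset: the label is the parity of the example index.
data = [(f"example {i}", i % 2) for i in range(10)]
train_set, validation_set = train_validation_split(data)

# Stand-in for a fine-tuned model; a real workflow would call the model here.
def predict(text):
    return int(text.split()[-1]) % 2

validation_accuracy = accuracy(predict, validation_set)
```

The validation examples never appear in the training set, so the score reflects how the model handles unseen data rather than how well it memorised the training examples.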
Tools & Frameworks Referenced
Hugging Face Transformers (Trainer), Datasets, DistilBERT, DistilGPT-2, T5.
Prerequisites
Module 01 (Transformer Architecture & Tokenization Foundations).