Module 13: Speech-to-Text with Whisper

Module Overview

This module covers the speech modality. It introduces Speech AI and the foundations of speech-to-text, examines Whisper's encoder-decoder architecture and API, and walks through building STT pipelines and fine-tuning Whisper on domain-specific data to improve transcription accuracy.

Learning Objectives

Describe the Speech AI landscape and the role of speech-to-text.
Explain Whisper's architecture and how its API is used.
Outline a production STT pipeline end to end.
Prepare a custom speech dataset for fine-tuning.
Describe the Whisper fine-tuning workflow for domain adaptation.

Topics Covered

Speech AI & STT Foundations

Introduction to Speech AI
Speech-to-text foundations
Whisper architecture

Whisper API & STT Pipelines

Speech-to-text with the Whisper API
Building STT pipelines

Fine-Tuning Whisper

Dataset preparation for fine-tuning
Fine-tuning Whisper on custom data
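
To judge whether fine-tuning on custom data actually improves transcription accuracy, a metric is needed. The sketch below assumes word error rate (WER), a common choice for STT evaluation; the module itself does not name a specific metric. The function name and the lowercase-and-split normalization are illustrative assumptions, not part of any library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting. Real evaluations usually also normalize punctuation and numbers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, e.g. from a base and a fine-tuned model.
    reference = "the patient has hypertension"
    print(word_error_rate(reference, "the patient has hyper tension"))
    print(word_error_rate(reference, "the patient has hypertension"))
```

Comparing this score on a held-out set of domain recordings before and after fine-tuning gives a simple, if coarse, view of domain adaptation. WER can exceed 1.0 when the hypothesis contains many extra words.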

Key Concepts & Terminology

Log-mel spectrogram, encoder-decoder ASR, multilingual transcription, robustness to noise and accents, domain adaptation for audio.
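
As a rough illustration of the log-mel spectrogram front end, the sketch below works through the framing arithmetic and the mel-scale spacing in plain Python. The constants follow the open-source Whisper release (16 kHz audio, 25 ms window, 10 ms hop, 30-second chunks, 80 mel bins in most checkpoints). The mel formula is the HTK-style variant, chosen here for simplicity and not necessarily the one Whisper's filterbank uses. All function names are hypothetical.

```python
import math

SAMPLE_RATE = 16_000   # samples per second
N_FFT = 400            # 25 ms analysis window at 16 kHz
HOP_LENGTH = 160       # 10 ms step between successive frames
CHUNK_SECONDS = 30     # audio is padded or trimmed to this length
N_MELS = 80            # number of mel bins (assumed default)


def frames_per_chunk(seconds=CHUNK_SECONDS, sample_rate=SAMPLE_RATE, hop=HOP_LENGTH):
    """Number of spectrogram frames produced for one fixed-length chunk."""
    return (seconds * sample_rate) // hop


def hz_to_mel(hz):
    """HTK-style mel formula (an assumption; other conventions exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels=N_MELS, f_min=0.0, f_max=SAMPLE_RATE / 2):
    """Return n_mels + 2 edge frequencies, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


def log_compress(power, floor=1e-10):
    """Log10 compression with a floor so that silence does not give -inf."""
    return math.log10(max(power, floor))


if __name__ == "__main__":
    print(frames_per_chunk())      # frames in one 30-second chunk
    edges = mel_band_edges()
    print(edges[1] - edges[0])     # narrow band at low frequencies
    print(edges[-1] - edges[-2])   # much wider band near 8 kHz
```

The band edges are dense at low frequencies and sparse at high ones, which is the point of the mel scale: resolution is spent where speech carries the most perceptually relevant detail.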

Tools & Frameworks Referenced

Whisper (and optimized inference runtimes such as faster-whisper); Hugging Face Transformers and Datasets for audio fine-tuning.

Prerequisites

Modules 01-03 (Transformer foundations).
