Module Overview
This module covers the speech modality. It introduces Speech AI and the foundations of speech-to-text (STT), examines Whisper's encoder-decoder architecture and API, and walks through building STT pipelines and fine-tuning Whisper on domain-specific data to improve transcription accuracy.
Learning Objectives
- Describe the Speech AI landscape and the role of speech-to-text.
- Explain Whisper's architecture and how its API is used.
- Outline a production STT pipeline end to end.
- Prepare a custom speech dataset for fine-tuning.
- Describe the Whisper fine-tuning workflow for domain adaptation.
Topics Covered
Speech AI & STT Foundations
- Introduction to Speech AI
- Speech-to-text foundations
- Whisper architecture
Whisper API & STT Pipelines
- Speech-to-text with the Whisper API
- Building STT pipelines
Fine-Tuning Whisper
- Dataset preparation for fine-tuning
- Fine-tuning Whisper on custom data
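As a rough illustration of what dataset preparation for fine-tuning can involve, the sketch below cleans a list of clip records before training. It is a minimal sketch under stated assumptions, not the module's prescribed workflow: the `Clip` record, the file names, and the helper functions are hypothetical, and the only Whisper-specific fact used is that the model processes audio in 30-second windows, so longer clips are dropped here rather than split.

```python
import random
import re
from dataclasses import dataclass

MAX_SECONDS = 30.0  # Whisper processes audio in 30-second windows


@dataclass(frozen=True)
class Clip:
    """Hypothetical manifest record: one audio file and its reference text."""
    audio_path: str
    duration_s: float
    transcript: str


def normalise_transcript(text: str) -> str:
    """Collapse runs of whitespace and trim the ends; casing and punctuation are kept."""
    return re.sub(r"\s+", " ", text).strip()


def clean_manifest(clips: list[Clip], max_seconds: float = MAX_SECONDS) -> list[Clip]:
    """Drop clips with no usable transcript or a duration outside (0, max_seconds]."""
    cleaned = []
    for clip in clips:
        text = normalise_transcript(clip.transcript)
        if not text or not (0.0 < clip.duration_s <= max_seconds):
            continue
        cleaned.append(Clip(clip.audio_path, clip.duration_s, text))
    return cleaned


def train_eval_split(clips: list[Clip], eval_fraction: float = 0.1, seed: int = 0):
    """Shuffle deterministically and hold out a fraction of clips for evaluation."""
    shuffled = clips[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


if __name__ == "__main__":
    manifest = [
        Clip("a.wav", 4.2, "  hello   world "),
        Clip("b.wav", 45.0, "longer than one window"),
        Clip("c.wav", 3.0, "   "),
        Clip("d.wav", 2.0, "ok"),
    ]
    train, evaluation = train_eval_split(clean_manifest(manifest), eval_fraction=0.5)
    print(len(train), len(evaluation))
```

In practice the cleaned records would be loaded into an audio dataset and paired with the model's feature extractor and tokenizer; that step is left out here because it depends on the tooling chosen.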
Key Concepts & Terminology
Log-mel spectrogram, encoder-decoder automatic speech recognition (ASR), multilingual transcription, robustness to noise and accents, domain adaptation for audio.
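To make the log-mel spectrogram concrete, the sketch below works through the framing arithmetic of Whisper's audio front end using only the standard library. The constants (16 kHz audio, 25 ms windows, 10 ms hop, 30-second chunks, 80 mel bands) come from the open-source Whisper release. The mel formula is the common HTK-style one and is an assumption for illustration; real feature extractors may use a different mel variant, and the helper names are hypothetical.

```python
import math

# Front-end constants from the open-source Whisper release.
SAMPLE_RATE = 16_000   # samples per second
N_FFT = 400            # 25 ms analysis window
HOP_LENGTH = 160       # 10 ms stride between frames
CHUNK_SECONDS = 30     # fixed-length input window
N_MELS = 80            # mel bands (128 in large-v3)


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style mel scale (illustrative; other variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centres(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def spectrogram_shape(seconds: float = CHUNK_SECONDS) -> tuple[int, int]:
    """(mel bands, frames) for a clip padded or trimmed to `seconds`."""
    n_samples = int(seconds * SAMPLE_RATE)
    return N_MELS, n_samples // HOP_LENGTH


if __name__ == "__main__":
    print(spectrogram_shape())  # (80, 3000)
    centres = mel_band_centres(N_MELS, 0.0, SAMPLE_RATE / 2)
    # Bands are narrow at low frequencies and wide at high frequencies.
    print(round(centres[1] - centres[0], 1), round(centres[-1] - centres[-2], 1))
```

The even spacing on the mel scale is what gives low frequencies, where much of the information in speech sits, finer resolution than high frequencies.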
Tools & Frameworks Referenced
Whisper (including faster-whisper-style runtimes) and Hugging Face Transformers/Datasets for audio fine-tuning.
Prerequisites
Modules 01–03 (Transformer foundations).