Project 06: VoiceTrack — Whisper Fine-Tuning and Production STT Pipeline

Fine-tune Whisper on domain-specific audio and ship a production STT service with streaming transcription, diarisation hooks, and an evaluation gate on WER.

May 28, 2026 at 12:00 PM

Topics You Will Master

  • Preparing a domain-specific speech dataset for fine-tuning
  • Fine-tuning Whisper on custom audio with PEFT
  • Building a streaming STT pipeline behind an OpenAI-compatible API
  • Evaluating transcripts with WER and a domain-quality rubric

Best For

Engineers shipping a speech-aware product where off-the-shelf Whisper falls short on jargon, accents, or formatting.

Expected Outcome

A fine-tuned Whisper checkpoint serving a streaming STT endpoint with a clear WER improvement over the base model on domain audio.

Project Overview

VoiceTrack takes the speech-to-text track from concept to production. We prepare a domain-specific audio dataset, fine-tune Whisper, and serve a streaming STT API with an evaluation gate on WER.

Objective

Fine-tune Whisper on a domain-specific audio corpus and deploy a streaming STT service with an evaluation harness that gates new checkpoints on WER and domain quality.

Scope

  • Audio data collection, transcription cleanup, and alignment.
  • PEFT-style fine-tuning of Whisper on the domain set.
  • A streaming STT pipeline with chunked decoding.
  • A WER-based evaluation gate plus a domain-quality rubric.
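As a rough illustration of the chunked-decoding item above, the sketch below splits a mono audio buffer into fixed-length, overlapping windows that a streaming decoder could transcribe one at a time. The function name `chunk_audio`, the 10-second window, and the 1-second overlap are assumptions made for this example, not values prescribed by the project; the overlap is there so that a word cut at a chunk boundary appears whole in at least one window.

```python
from typing import Iterator, Sequence, Tuple

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio


def chunk_audio(
    samples: Sequence[float],
    chunk_s: float = 10.0,      # assumed window length, tune per latency budget
    overlap_s: float = 1.0,     # assumed overlap between consecutive windows
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[Tuple[float, Sequence[float]]]:
    """Yield (start_time_in_seconds, chunk) pairs with a fixed overlap."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    size = int(chunk_s * sample_rate)                 # samples per window
    step = int((chunk_s - overlap_s) * sample_rate)   # hop between windows
    start = 0
    while start < len(samples):
        yield start / sample_rate, samples[start:start + size]
        if start + size >= len(samples):
            break  # this window already reached the end of the audio
        start += step
```

Each yielded start time can be added to the timestamps the model returns for that chunk, which is one way to stitch per-chunk output back onto a single timeline. Merging the duplicated words in the overlap region is left out of this sketch.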

Datasets

  • A domain-specific audio set with cleaned transcriptions.
  • A held-out evaluation slice with golden transcripts.
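One way to carve out the held-out evaluation slice is a deterministic, hash-based split, so that an utterance never drifts between the training and evaluation sets when the dataset is rebuilt. The sketch below assumes each manifest row is a dictionary with an `id` key; that layout, the helper names, and the 10% fraction are hypothetical choices for illustration.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def is_held_out(utterance_id: str, eval_fraction: float = 0.1) -> bool:
    """Map an ID to a stable value in [0, 1) and compare it to the fraction."""
    digest = hashlib.sha256(utterance_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < eval_fraction


def split_manifest(
    rows: Iterable[Dict[str, str]],
    eval_fraction: float = 0.1,  # assumed size of the held-out slice
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Return (train_rows, held_out_rows) without any random state."""
    train: List[Dict[str, str]] = []
    held_out: List[Dict[str, str]] = []
    for row in rows:
        target = held_out if is_held_out(row["id"], eval_fraction) else train
        target.append(row)
    return train, held_out
```

If the corpus contains several clips from the same speaker or recording session, hashing a speaker or session identifier instead of the utterance ID keeps related audio on one side of the split and avoids an optimistic WER.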

Stack

  • Hugging Face transformers and datasets for fine-tuning.
  • PEFT for parameter-efficient Whisper adaptation.
  • An audio-processing pipeline for normalisation and chunking.
  • An OpenAI-compatible STT API for serving.
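To make the PEFT item in the stack more concrete, here is a minimal LoRA sketch for Whisper using the `peft` and `transformers` libraries. The rank, alpha, dropout, target modules, and the `openai/whisper-small` checkpoint are illustrative starting points chosen for this example, not settings taken from the project. The heavy imports are deferred into the builder function so the settings can be inspected on a machine without those packages installed.

```python
def lora_settings() -> dict:
    """Keyword arguments for peft.LoraConfig (illustrative values only)."""
    return {
        "r": 16,                # rank of the low-rank update matrices
        "lora_alpha": 32,       # scaling factor applied to the update
        "lora_dropout": 0.05,
        "bias": "none",         # leave bias terms frozen
        # Query and value projections in Whisper's attention layers.
        "target_modules": ["q_proj", "v_proj"],
    }


def build_peft_whisper(checkpoint: str = "openai/whisper-small"):
    """Wrap a Whisper checkpoint with LoRA adapters (assumed checkpoint name)."""
    # Imported lazily: both libraries are large and optional for this sketch.
    from peft import LoraConfig, get_peft_model
    from transformers import WhisperForConditionalGeneration

    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
    peft_model = get_peft_model(model, LoraConfig(**lora_settings()))
    peft_model.print_trainable_parameters()  # reports the trainable share
    return peft_model
```

Only the adapter weights are trained in this setup, which is what keeps fine-tuning affordable on a single GPU; the base Whisper weights stay frozen.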

Evaluation

  • Word Error Rate (WER) on the held-out set.
  • A rubric for domain-specific terminology and formatting.
  • Comparison against the base Whisper checkpoint.
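WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes a corpus-level WER with a word-level edit distance and applies a simple gate against the base checkpoint. The normalisation rules and the 5% relative-gain threshold are assumptions for illustration; a real gate would use whatever normalisation and threshold the team agrees on.

```python
import re
from typing import List, Sequence


def normalise(text: str) -> List[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_errors(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Word-level Levenshtein distance, keeping only one row in memory."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, 1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalise(ref)
        errors += word_errors(ref_words, normalise(hyp))
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


def passes_gate(base_wer: float, candidate_wer: float,
                min_relative_gain: float = 0.05) -> bool:
    """Accept a checkpoint only if it beats the base by the assumed margin."""
    return candidate_wer <= base_wer * (1 - min_relative_gain)
```

For example, a reference of "the patient has atrial fibrillation" against a hypothesis of "the patient has a trial fibrillation" counts one substitution and one insertion over five reference words, giving a WER of 0.4. Summing errors across the whole held-out set before dividing keeps long utterances from being underweighted relative to short ones.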

Deliverables

  • A cleaned, aligned domain audio dataset.
  • A fine-tuned Whisper checkpoint with a measurable WER improvement over the base checkpoint on the held-out set.
  • A streaming STT endpoint with an OpenAI-compatible API.
  • An evaluation report comparing the fine-tuned model to the base.

Prerequisites

Module 13 (Speech-to-Text with Whisper) and Modules 03–05 (fine-tuning fundamentals and datasets).
