Project Overview
VoiceTrack takes the speech-to-text track from concept to production. We prepare a domain-specific audio dataset, fine-tune Whisper on it, and serve a streaming STT API, with an evaluation gate on word error rate (WER).
Objective
Fine-tune Whisper on a domain-specific audio corpus and deploy a streaming STT service with an evaluation harness that gates new checkpoints on WER and domain quality.
Scope
- Audio data collection, transcription cleanup, and alignment.
- PEFT-style fine-tuning of Whisper on the domain set.
- A streaming STT pipeline with chunked decoding.
- A WER-based evaluation gate plus a domain-quality rubric.
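To make the chunked-decoding item concrete, here is a minimal sketch of one way to split an audio buffer into fixed-length, overlapping chunks before each chunk is passed to the model. The function name `chunk_audio`, the 5-second window, and the 1-second overlap are illustrative assumptions, not values set by this project; the 16 kHz default matches the sample rate Whisper expects.

```python
from typing import Iterator, List, Sequence


def chunk_audio(
    samples: Sequence[float],
    sample_rate: int = 16_000,   # Whisper models expect 16 kHz input
    chunk_s: float = 5.0,        # assumed window length, tune per use case
    overlap_s: float = 1.0,      # assumed overlap between neighbouring chunks
) -> Iterator[List[float]]:
    """Yield overlapping fixed-length chunks; the last chunk may be shorter."""
    chunk_len = int(chunk_s * sample_rate)
    hop = chunk_len - int(overlap_s * sample_rate)
    if chunk_len <= 0 or hop <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    for start in range(0, len(samples), hop):
        yield list(samples[start:start + chunk_len])
        # Stop once this chunk has reached the end of the buffer.
        if start + chunk_len >= len(samples):
            break


if __name__ == "__main__":
    # Toy buffer: 23 "samples" at 10 Hz, 1 s windows with 0.5 s overlap.
    toy = list(range(23))
    for chunk in chunk_audio(toy, sample_rate=10, chunk_s=1.0, overlap_s=0.5):
        print(chunk[0], len(chunk))
```

The overlap gives the decoder some shared context at chunk boundaries, where words are most likely to be cut in half. How the overlapping transcripts are merged afterwards is a separate design choice that this sketch does not cover.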
Datasets
- A domain-specific audio set with cleaned transcriptions.
- A held-out evaluation slice with golden transcripts.
Stack
- Hugging Face transformers and datasets for fine-tuning.
- PEFT for parameter-efficient Whisper adaptation.
- An audio-processing pipeline for normalisation and chunking.
- An OpenAI-compatible STT API for serving.
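As an illustration of the normalisation step in the audio-processing pipeline, the sketch below applies simple peak normalisation to a buffer of floating-point samples. This is only one possible reading of "normalisation" (loudness-based normalisation is a common alternative); the function name `peak_normalise` and the 0.95 target are assumptions made for the example.

```python
from typing import List, Sequence


def peak_normalise(samples: Sequence[float], target_peak: float = 0.95) -> List[float]:
    """Scale samples so the largest absolute value equals target_peak.

    Silent or empty input is returned unchanged to avoid dividing by zero.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


if __name__ == "__main__":
    quiet_clip = [0.1, -0.5, 0.25]
    print(peak_normalise(quiet_clip, target_peak=1.0))
```

Applying the same normalisation at training and serving time keeps the input levels seen by the fine-tuned model consistent across both paths.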
Evaluation
- Word Error Rate (WER) on the held-out set.
- A rubric for domain-specific terminology and formatting.
- Comparison against the base Whisper checkpoint.
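The WER metric and the comparison against the base checkpoint can be sketched in a few lines. WER is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The code below is a dependency-free illustration under stated assumptions: the helper names, the lowercase-and-split text normalisation, and the 5% relative-improvement threshold are all hypothetical, and a real harness would likely use a stricter text normaliser.

```python
from typing import Iterable, List, Tuple


def word_edits(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    # Standard Levenshtein dynamic programme, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        edits, words = word_edits(reference, hypothesis)
        total_edits += edits
        total_words += words
    if total_words == 0:
        raise ValueError("reference transcripts contain no words")
    return total_edits / total_words


def passes_gate(candidate_wer: float, base_wer: float,
                min_rel_improvement: float = 0.05) -> bool:
    """Accept a checkpoint only if it beats the base WER by the assumed margin."""
    return candidate_wer <= base_wer * (1.0 - min_rel_improvement)


if __name__ == "__main__":
    held_out = [
        ("the patient has atrial fibrillation",
         "the patient has a trial fibrillation"),
    ]
    print(corpus_wer(held_out))
```

Summing edits over the whole held-out set before dividing weights long utterances more heavily than averaging per-utterance WER; which variant the gate uses should be fixed once and applied to both the base and fine-tuned checkpoints.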
Deliverables
- A cleaned, aligned domain audio dataset.
- A fine-tuned Whisper checkpoint with a measurable WER improvement.
- A streaming STT endpoint with an OpenAI-compatible API.
- An evaluation report comparing the fine-tuned model to the base.
Prerequisites
Module 13 (Speech-to-Text with Whisper) and Modules 03–05 (fine-tuning fundamentals and datasets).