--- marp: true paginate: true author: Laurent Fainsin math: katex --- # Whisper Robust Speech Recognition via Large-Scale Weak Supervision ---

# Context

- The trend is toward unsupervised learning (Wav2Vec 2.0, 1M hours of training data) $\rightarrow$ good audio encoders but fine-tuning required. - Pre-training on multiple supervised datasets and domains improves speech recognition robustness and generalization. - Limited availability of labeled data in speech recognition, current datasets like [SpeechStew](https://arxiv.org/abs/2104.02133) only totals 5,140 hours of supervision. - Recent efforts to create larger datasets for speech recognition by relaxing the requirement of gold-standard human-validated transcripts. - Trade-off between quality and quantity, similar to computer vision where larger weakly supervised datasets significantly improve model robustness and generalization. ---

# Dataset & Data Processing

- Broad distribution of audio from many different environments, recording setups, speakers, and languages, salvaged from the internet. - Audio language detector and heuristics to detect and filter bad/duplicate transcriptions. - New dataset: 680,000 hours of weakly labeled audio data, including 117,000 hours of audio for 96 other languages and 125,000 hours of english translation data. ---

# Model

![bg 75%](figs/archi_alternative.png) ---

# [GELU](https://paperswithcode.com/method/gelu)

![bg 65%](figs/gelu.png) ---

# Sinusoidal position embeddings

![bg 95%](https://d33wubrfki0l68.cloudfront.net/ef81ee3018af6ab6f23769031f8961afcdd67c68/3358f/img/transformer_architecture_positional_encoding/positional_encoding.png) ---

# [Sparse
Transformer](https://openai.com/blog/sparse-transformer/)

![bg 70%](https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-30_at_3.09.30_PM.png) ---

# Multitask

![bg 95%](https://cdn.openai.com/whisper/draft-20220919a/asr-details-desktop.svg) ---

# Training

| Hyperparameter | Value | | :-------------------------------- | :----------: | | Updates | 1048576 | | Batch Size | 256 | | Warmup Updates | 2048 | | Max grad norm | 1.0 | | Optimizer | AdamW | | β1 | 0.9 | | β2 | 0.98 | | ε | 10−6 | | Weight Decay | 0.1 | | Weight Init Gaussian | Fan-In | | Learning Rate Schedule | Linear Decay | | Speechless audio subsample factor | 10× | | Condition on prior text rate | 50% | ---

# WER metric

$$\text{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C}$$ $S$ is the number of substitutions $D$ is the number of deletions $I$ is the number of insertions $C$ is the number of correct words $N$ is the number of words in the reference $(N=S+D+C)$ ---

# Evaluation

| Dataset | wav2vec 2.0
Large (no LM) | Whisper
Large V2 | RER
(%) | | :---------------- | :----------------------------: | :-------------------: | :-------------------------------------: | | LibriSpeech Clean | **2.7** | **2.7** | 0.0 | | Artie | 24.5 | **6.2** | 74.7 | | Common Voice | 29.9 | **9.0** | 69.9 | | Fleurs En | 14.6 | **4.4** | 69.9 | | Tedlium | 10.5 | **4.0** | 61.9 | | CHiME6 | 65.8 | **25.5** | 61.2 | | VoxPopuli En | 17.9 | **7.3** | 59.2 | | CORAAL | 35.6 | **16.2** | 54.5 | | AMI IHM | 37.0 | **16.9** | 54.3 | | Switchboard | 28.3 | **13.8** | 51.2 | | CallHome | 34.8 | **17.6** | 49.4 | | WSJ | 7.7 | **3.9** | 49.4 | | AMI SDM1 | 67.6 | **36.4** | 46.2 | | LibriSpeech Other | 6.2 | **5.2** | 16.1 | | Average | 29.3 | **12.8** | 55.2 | ![bg right:40% 100%](figs/zeroshot_WER_comparison.png) ---

# Robustness to noise

![bg 100%](figs/noise_perf.png) ![bg 100%](figs/noise_perf_legend.png) ---

# Human comparison & improvements

![bg 100%](figs/whisper_vs_human.png) ![bg 100%](figs/tokenizer_improvement.png) ---

# Model Scaling

| Model | Layers | Width | Heads | Parameters | Required VRAM | Relative speed | | :----- | :----: | :---: | :---: | :--------: | :-----------: | :------------: | | Tiny | 4 | 384 | 6 | 39M | ~1 GB | ~32x | | Base | 6 | 512 | 8 | 74M | ~1 GB | ~16x | | Small | 12 | 768 | 12 | 244M | ~2 GB | ~6x | | Medium | 24 | 1024 | 16 | 769M | ~5 GB | ~2x | | Large | 32 | 1280 | 20 | 1550M | ~10 GB | 1x | ---

# Model Scaling

![bg 95%](figs/model_scaling.png) ---

# Dataset Scaling

| Dataset
size (h) | English
WER (↓) | Multilingual
WER (↓) | X→En
BLEU (↑) | | :-------------------: | :------------------: | :-----------------------: | :----------------: | | 3405 | 30.5 | 92.4 | 0.2 | | 6811 | 19.6 | 72.7 | 1.7 | | 13621 | 14.4 | 56.6 | 7.9 | | 27243 | 12.3 | 45.0 | 13.9 | | 54486 | 10.9 | 36.4 | 19.2 | | 681070 | **9.9** | **29.2** | **24.8** | ---

# Dataset scaling

![bg 95%](figs/wer_dataset_size_correlation.png) ![bg 95%](figs/bleu_dataset_size_correlation.png)