---
marp: true
paginate: true
author: Laurent Fainsin
math: katex
---
# Whisper
Robust Speech Recognition via Large-Scale Weak Supervision
---
- The trend is toward unsupervised learning (Wav2Vec 2.0, 1M hours of training data) $\rightarrow$ good audio encoders but fine-tuning required.
- Pre-training on multiple supervised datasets and domains improves speech recognition robustness and generalization.
- Limited availability of labeled data in speech recognition, current datasets like [SpeechStew](https://arxiv.org/abs/2104.02133) only totals 5,140 hours of supervision.
- Recent efforts to create larger datasets for speech recognition by relaxing the requirement of gold-standard human-validated transcripts.
- Trade-off between quality and quantity, similar to computer vision where larger weakly supervised datasets significantly improve model robustness and generalization.
---
# Dataset & Data Processing
- Broad distribution of audio from many different environments, recording setups, speakers, and languages, salvaged from the internet.
- Audio language detector and heuristics to detect and filter bad/duplicate transcriptions.
- New dataset: 680,000 hours of weakly labeled audio data, including 117,000 hours of audio for 96 other languages and 125,000 hours of english translation data.
---
![bg 75%](figs/archi_alternative.png)
---
# [GELU](https://paperswithcode.com/method/gelu)
![bg 65%](figs/gelu.png)
---
# Sinusoidal position embeddings
![bg 95%](https://d33wubrfki0l68.cloudfront.net/ef81ee3018af6ab6f23769031f8961afcdd67c68/3358f/img/transformer_architecture_positional_encoding/positional_encoding.png)
---
# [Sparse
Transformer](https://openai.com/blog/sparse-transformer/)
![bg 70%](https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-30_at_3.09.30_PM.png)
---
![bg 95%](https://cdn.openai.com/whisper/draft-20220919a/asr-details-desktop.svg)
---
| Hyperparameter | Value |
| :-------------------------------- | :----------: |
| Updates | 1048576 |
| Batch Size | 256 |
| Warmup Updates | 2048 |
| Max grad norm | 1.0 |
| Optimizer | AdamW |
| β1 | 0.9 |
| β2 | 0.98 |
| ε | 10−6 |
| Weight Decay | 0.1 |
| Weight Init Gaussian | Fan-In |
| Learning Rate Schedule | Linear Decay |
| Speechless audio subsample factor | 10× |
| Condition on prior text rate | 50% |
---
$$\text{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C}$$
$S$ is the number of substitutions
$D$ is the number of deletions
$I$ is the number of insertions
$C$ is the number of correct words
$N$ is the number of words in the reference
$(N=S+D+C)$
---
| Dataset | wav2vec 2.0
Large (no LM) | Whisper
Large V2 | RER
(%) |
| :---------------- | :----------------------------: | :-------------------: | :-------------------------------------: |
| LibriSpeech Clean | **2.7** | **2.7** | 0.0 |
| Artie | 24.5 | **6.2** | 74.7 |
| Common Voice | 29.9 | **9.0** | 69.9 |
| Fleurs En | 14.6 | **4.4** | 69.9 |
| Tedlium | 10.5 | **4.0** | 61.9 |
| CHiME6 | 65.8 | **25.5** | 61.2 |
| VoxPopuli En | 17.9 | **7.3** | 59.2 |
| CORAAL | 35.6 | **16.2** | 54.5 |
| AMI IHM | 37.0 | **16.9** | 54.3 |
| Switchboard | 28.3 | **13.8** | 51.2 |
| CallHome | 34.8 | **17.6** | 49.4 |
| WSJ | 7.7 | **3.9** | 49.4 |
| AMI SDM1 | 67.6 | **36.4** | 46.2 |
| LibriSpeech Other | 6.2 | **5.2** | 16.1 |
| Average | 29.3 | **12.8** | 55.2 |
![bg right:40% 100%](figs/zeroshot_WER_comparison.png)
---
![bg 100%](figs/noise_perf.png)
![bg 100%](figs/noise_perf_legend.png)
---
# Human comparison & improvements
![bg 100%](figs/whisper_vs_human.png)
![bg 100%](figs/tokenizer_improvement.png)
---
| Model | Layers | Width | Heads | Parameters | Required VRAM | Relative speed |
| :----- | :----: | :---: | :---: | :--------: | :-----------: | :------------: |
| Tiny | 4 | 384 | 6 | 39M | ~1 GB | ~32x |
| Base | 6 | 512 | 8 | 74M | ~1 GB | ~16x |
| Small | 12 | 768 | 12 | 244M | ~2 GB | ~6x |
| Medium | 24 | 1024 | 16 | 769M | ~5 GB | ~2x |
| Large | 32 | 1280 | 20 | 1550M | ~10 GB | 1x |
---
![bg 95%](figs/model_scaling.png)
---
| Dataset
size (h) | English
WER (↓) | Multilingual
WER (↓) | X→En
BLEU (↑) |
| :-------------------: | :------------------: | :-----------------------: | :----------------: |
| 3405 | 30.5 | 92.4 | 0.2 |
| 6811 | 19.6 | 72.7 | 1.7 |
| 13621 | 14.4 | 56.6 | 7.9 |
| 27243 | 12.3 | 45.0 | 13.9 |
| 54486 | 10.9 | 36.4 | 19.2 |
| 681070 | **9.9** | **29.2** | **24.8** |
---
![bg 95%](figs/wer_dataset_size_correlation.png)
![bg 95%](figs/bleu_dataset_size_correlation.png)