projet-audionumerique/slides.md


marp: true
paginate: true
author: Laurent Fainsin
math: katex

Whisper

Robust Speech Recognition via Large-Scale Weak Supervision

Web-scale Supervised Pretraining for Speech Recognition. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. OpenAI, San Francisco. September 2022. arXiv:2212.04356.


Context

  • The trend is toward unsupervised pre-training (e.g. Wav2Vec 2.0, scaled to 1M hours of training data) $\rightarrow$ good audio encoders, but fine-tuning is still required.
  • Pre-training on multiple supervised datasets and domains improves speech recognition robustness and generalization.
  • Labeled data for speech recognition is limited: current datasets such as SpeechStew total only 5,140 hours of supervision.
  • Recent efforts to create larger datasets for speech recognition by relaxing the requirement of gold-standard human-validated transcripts.
  • Trade-off between quality and quantity, similar to computer vision where larger weakly supervised datasets significantly improve model robustness and generalization.

Dataset & Data Processing

  • Broad distribution of audio from many different environments, recording setups, speakers, and languages, collected from the internet.
  • Audio language detector and heuristics to detect and filter bad/duplicate transcriptions.
  • New dataset: 680,000 hours of weakly labeled audio data, including 117,000 hours of audio covering 96 non-English languages and 125,000 hours of X→English translation data.

Model



GELU


$\displaystyle \text{GELU}(x) = x P(X \leq x) = x\Phi(x) = \frac{x}{2} \left[ 1 + \text{erf} \left( \frac{x}{\sqrt{2}} \right) \right]$, where $X \sim \mathcal{N}(0, 1)$
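To make the formula concrete, here is a minimal sketch of the exact (erf-based) GELU in plain Python. The function name `gelu` is chosen for illustration only. Deep learning frameworks ship their own implementations, some of which use a tanh approximation instead.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# GELU is close to 0 for very negative inputs and close to x for large positive ones.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"GELU({x:+.1f}) = {gelu(x):+.4f}")
```

Unlike ReLU, the function is smooth and slightly negative for small negative inputs.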


Sinusoidal position embeddings


$\displaystyle f(t)^{(i)} = \left\{ \begin{array}{ll} \sin(w_k t), & \text{if } i=2k \\ \cos(w_k t), & \text{if } i=2k+1 \end{array}\right. \quad w_k = \frac{1}{10000^{2k/d}}$
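The formula can be sketched directly in Python. This follows the interleaved sine/cosine layout written on the slide. The function name `sinusoidal_embedding` and the `base` parameter are illustrative, and actual implementations may lay out the sine and cosine components differently (for example, concatenated rather than interleaved).

```python
import math


def sinusoidal_embedding(t: int, d: int, base: float = 10000.0) -> list[float]:
    """Embedding of position t in d dimensions, following the slide's formula."""
    emb = []
    for i in range(d):
        k = i // 2                        # frequency index shared by a sin/cos pair
        w_k = 1.0 / base ** (2 * k / d)   # w_k = 1 / 10000^(2k/d)
        emb.append(math.sin(w_k * t) if i % 2 == 0 else math.cos(w_k * t))
    return emb


# Position 0 maps to sin(0) = 0 on even indices and cos(0) = 1 on odd ones.
print(sinusoidal_embedding(0, 4))
print(sinusoidal_embedding(1, 4))
```

Low indices oscillate quickly with position and high indices slowly, so each position gets a distinct pattern across the d dimensions.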


Sparse Transformer



Multitask



Training

| Hyperparameter | Value |
|---|---|
| Updates | 1048576 |
| Batch Size | 256 |
| Warmup Updates | 2048 |
| Max grad norm | 1.0 |
| Optimizer | AdamW |
| β1 | 0.9 |
| β2 | 0.98 |
| ε | $10^{-6}$ |
| Weight Decay | 0.1 |
| Weight Init | Gaussian Fan-In |
| Learning Rate Schedule | Linear Decay |
| Speechless audio subsample factor | 10× |
| Condition on prior text rate | 50% |
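As a rough illustration of how the warmup and linear-decay entries fit together, here is a sketch of such a schedule. The table gives no peak learning rate, so `peak_lr` is a hypothetical parameter. The linear warmup shape and the decay to exactly zero at the final update are also assumptions of this sketch.

```python
def learning_rate(step: int, peak_lr: float,
                  warmup_updates: int = 2048,
                  total_updates: int = 1_048_576) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_updates:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_updates
    # Decay phase: ramp linearly from peak_lr down to 0 at total_updates.
    remaining = max(total_updates - step, 0)
    return peak_lr * remaining / (total_updates - warmup_updates)


# peak_lr=1e-3 is an arbitrary value chosen for the demo only.
for step in (0, 1024, 2048, 524_288, 1_048_576):
    print(step, learning_rate(step, peak_lr=1e-3))
```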

WER metric

$\displaystyle \text{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C}$

  • S is the number of substitutions
  • D is the number of deletions
  • I is the number of insertions
  • C is the number of correct words
  • N is the number of words in the reference (N = S + D + C)
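A minimal sketch of the metric, assuming whitespace tokenization and no text normalization: the minimum of S + D + I is the word-level Levenshtein distance between reference and hypothesis, computed here by dynamic programming. The function name `wer` is illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match or substitution
    return dp[-1][-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 errors / 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors but do not add to N, WER can exceed 100% when the hypothesis is much longer than the reference.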


Evaluation

| Dataset | wav2vec 2.0 Large (no LM) | Whisper Large V2 | RER (%) |
|---|---|---|---|
| LibriSpeech Clean | 2.7 | 2.7 | 0.0 |
| Artie | 24.5 | 6.2 | 74.7 |
| Common Voice | 29.9 | 9.0 | 69.9 |
| Fleurs En | 14.6 | 4.4 | 69.9 |
| Tedlium | 10.5 | 4.0 | 61.9 |
| CHiME6 | 65.8 | 25.5 | 61.2 |
| VoxPopuli En | 17.9 | 7.3 | 59.2 |
| CORAAL | 35.6 | 16.2 | 54.5 |
| AMI IHM | 37.0 | 16.9 | 54.3 |
| Switchboard | 28.3 | 13.8 | 51.2 |
| CallHome | 34.8 | 17.6 | 49.4 |
| WSJ | 7.7 | 3.9 | 49.4 |
| AMI SDM1 | 67.6 | 36.4 | 46.2 |
| LibriSpeech Other | 6.2 | 5.2 | 16.1 |
| Average | 29.3 | 12.8 | 55.2 |
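The RER column can be reproduced from the two WER columns, assuming RER denotes the relative error reduction of Whisper with respect to wav2vec 2.0. The helper name `relative_error_reduction` is illustrative, and the rows are copied from the table above.

```python
def relative_error_reduction(baseline_wer: float, model_wer: float) -> float:
    """Percentage of the baseline's errors removed by the model."""
    return 100.0 * (baseline_wer - model_wer) / baseline_wer


# (dataset, wav2vec 2.0 WER, Whisper WER) taken from the evaluation table.
rows = [
    ("LibriSpeech Clean", 2.7, 2.7),
    ("Artie", 24.5, 6.2),
    ("Tedlium", 10.5, 4.0),
    ("LibriSpeech Other", 6.2, 5.2),
]
for name, baseline, whisper in rows:
    print(f"{name}: {relative_error_reduction(baseline, whisper):.1f}%")
```

Note that applying the formula to the Average row's WERs does not give 55.2, which suggests that entry is the mean of the per-dataset RERs rather than the RER of the mean WERs.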



Robustness to noise



Human comparison & improvements



Model Scaling

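| Model | Layers | Width | Heads | Parameters | Required VRAM | Relative speed |
|---|---|---|---|---|---|---|
| Tiny | 4 | 384 | 6 | 39M | ~1 GB | ~32x |
| Base | 6 | 512 | 8 | 74M | ~1 GB | ~16x |
| Small | 12 | 768 | 12 | 244M | ~2 GB | ~6x |
| Medium | 24 | 1024 | 16 | 769M | ~5 GB | ~2x |
| Large | 32 | 1280 | 20 | 1550M | ~10 GB | 1x |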
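As a back-of-the-envelope check on the Parameters column, the sketch below estimates parameter counts from Layers and Width alone. It rests on several assumptions not stated in the slides: the Layers value applies to both the encoder and the decoder, the MLP hidden size is 4× the width, the token vocabulary has about 51,865 entries, and biases, layer norms, positional embeddings and the convolutional stem are ignored.

```python
VOCAB_SIZE = 51_865  # assumption: approximate vocabulary size, not given in the slides


def estimate_params(layers: int, width: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Rough parameter count for an encoder-decoder Transformer (weights only)."""
    attention = 4 * width ** 2           # Q, K, V and output projections
    mlp = 8 * width ** 2                 # two linear layers with a 4x hidden size
    encoder_block = attention + mlp      # self-attention + MLP
    decoder_block = 2 * attention + mlp  # self-attention + cross-attention + MLP
    embedding = vocab_size * width       # token embedding matrix
    return layers * (encoder_block + decoder_block) + embedding


# (name, layers, width, reported parameters) from the table above.
models = [("Tiny", 4, 384, 39e6), ("Base", 6, 512, 74e6), ("Small", 12, 768, 244e6),
          ("Medium", 24, 1024, 769e6), ("Large", 32, 1280, 1550e6)]
for name, layers, width, reported in models:
    est = estimate_params(layers, width)
    print(f"{name}: estimated {est / 1e6:.0f}M, reported {reported / 1e6:.0f}M")
```

Under these assumptions the estimate lands within a few percent of the reported figures, and the table also shows that Width / Heads is 64 for every model size.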



Dataset Scaling

| Dataset size (h) | English WER (↓) | Multilingual WER (↓) | X→En BLEU (↑) |
|---|---|---|---|
| 3405 | 30.5 | 92.4 | 0.2 |
| 6811 | 19.6 | 72.7 | 1.7 |
| 13621 | 14.4 | 56.6 | 7.9 |
| 27243 | 12.3 | 45.0 | 13.9 |
| 54486 | 10.9 | 36.4 | 19.2 |
| 681070 | 9.9 | 29.2 | 24.8 |
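A short sketch to help read the table: it expresses each subset as a fraction of the full dataset and computes the relative English WER reduction from one row to the next. The numbers are copied from the table, and the helper names are illustrative.

```python
FULL_DATASET_HOURS = 681_070

# (dataset size in hours, English WER) from the dataset scaling table.
rows = [(3405, 30.5), (6811, 19.6), (13621, 14.4),
        (27243, 12.3), (54486, 10.9), (681070, 9.9)]


def percent_of_full(hours: int) -> float:
    """Subset size as a percentage of the full dataset."""
    return 100.0 * hours / FULL_DATASET_HOURS


def relative_wer_reduction(previous_wer: float, current_wer: float) -> float:
    """Percentage drop in WER when moving to the next, larger subset."""
    return 100.0 * (previous_wer - current_wer) / previous_wer


for (_, prev_wer), (hours, wer_value) in zip(rows, rows[1:]):
    print(f"{hours:>6} h ({percent_of_full(hours):5.1f}% of full): "
          f"English WER drops by {relative_wer_reduction(prev_wer, wer_value):.1f}%")
```

The first five rows roughly double in size each time (0.5% to 8% of the data), and the relative gain shrinks at each step, which points to diminishing returns for English.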
