Introducing Resonant-1 and Resonant-1-flash
Two frontier speech models that excel in speed and accuracy
April 8, 2026 · 3 min read
Today we're launching Resonant-1 and Resonant-1-flash, two frontier speech models that set a new standard for both speed and accuracy.
Both models sit at the top of the Open ASR Leaderboard. They were trained on a four-phase schedule that leans heavily on reinforcement learning to push accuracy beyond the previous state of the art.
On the inference side, we've made deep optimizations to our model architecture and to Violin, our internal inference engine built specifically for speech. Fused kernels, CUDA graphs, and smarter scheduling during decoding allow Resonant-1-flash to process one hour of audio in under three seconds.
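To make the speed claim concrete: RTFx (realtime factor) is the ratio of audio duration to wall-clock processing time, so "one hour in under three seconds" corresponds to an RTFx above 1200. A minimal sketch of the arithmetic:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Realtime factor: seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / processing_seconds

# One hour of audio processed in 3 seconds of wall-clock time:
print(rtfx(3600, 3))  # 1200.0

# Conversely, the reported RTFx of 1438 implies one hour in roughly 2.5 seconds:
print(3600 / 1438)
```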
SOTA Performance in Short-Form English
| # | Model | RTFx | Average | AMI | Earnings22 | Gigaspeech | SPGISpeech | Tedlium |
|---|-------|------|---------|-----|------------|------------|------------|---------|
| 1 | Resonant-1 | 1187 | 6.51 | 9.39 | 9.02 | 8.76 | 3.05 | 2.32 |
| 2 | Resonant-1 Flash | 1438 | 6.71 | 9.57 | 9.72 | 8.86 | 3.09 | 2.31 |
| 3 | Cohere Transcribe | 525 | 6.78 | 8.13 | 10.86 | 9.34 | 3.08 | 2.49 |
| 4 | Zoom Scribe v1 | — | 6.80 | 10.03 | 9.53 | 9.61 | 1.59 | 3.22 |
| 5 | ibm-granite/granite-4.0-1b-speech | 280 | 6.81 | 8.44 | 8.48 | 10.14 | 3.89 | 3.10 |
| 6 | Qwen/Qwen3-ASR-1.7B | 148 | 6.93 | 10.56 | 10.25 | 8.74 | 2.84 | 2.28 |
| 7 | nvidia/canary-qwen-2.5b | 418 | 6.94 | 10.19 | 10.45 | 9.43 | 1.90 | 2.71 |
| 8 | ElevenLabs Scribe v2 | — | 7.09 | 11.86 | 9.43 | 9.11 | 2.68 | 2.37 |
| 9 | ibm-granite/granite-speech-3.3-8b | 145 | 7.18 | 8.98 | 9.42 | 10.19 | 3.91 | 3.40 |
| 10 | microsoft/Phi-4-multimodal-instruct | 151 | 7.32 | 11.09 | 10.16 | 9.33 | 3.06 | 2.94 |
Word Error Rate (%) on Open ASR Leaderboard — lower is better. RTFx = realtime factor (higher = faster). Averages computed over AMI, Earnings22, Gigaspeech, SPGISpeech, and Tedlium.
Resonant-1 achieves the lowest average WER across all short-form English benchmarks, while Resonant-1-flash delivers the fastest inference at 1438× realtime.
We omit LibriSpeech and VoxPopuli from the evaluations: both datasets appear in our training data, so we cannot rule out contamination. Moreover, while training on them produced significant WER improvements on their test sets, we observed that it hurt overall generalisation.
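For reference, the WER reported throughout is the word-level edit distance between the hypothesis and reference transcripts, divided by the reference length. A minimal sketch (this omits the text normalization the leaderboard applies before scoring, such as lowercasing and punctuation stripping):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six: WER of 1/6, i.e. about 16.7%.
print(round(100 * wer("the cat sat on the mat", "the cat sat on mat"), 2))
```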
SOTA Performance in European Languages
Resonant-1 leads across French, Dutch, Spanish, and Polish, and achieves the lowest average WER when Swedish is excluded.
| Model | Avg (excl. Swe) | English | German | French | Dutch | Spanish | Polish | Swedish |
|-------|-----------------|---------|--------|--------|-------|---------|--------|---------|
| Resonant-1 | 4.22 | 4.69 | 3.83 | 4.71 | 4.88 | 2.67 | 4.56 | 7.38 |
| Whisper v3 | 4.71 | 4.78 | 4.58 | 5.72 | 5.63 | 2.95 | 4.61 | 7.23 |
| Cohere Transcribe | 4.98 | 5.68 | 4.06 | 5.17 | 5.71 | 3.68 | 5.61 | — |
| Resonant-1 Flash | 5.13 | 5.10 | 4.58 | 5.58 | 6.15 | 3.38 | 6.02 | 10.14 |
| Qwen3-ASR | 5.94 | 4.26 | 3.86 | 4.72 | 7.23 | 3.24 | 4.61 | 19.31 |
Word Error Rate (%) across FLEURS test sets — lower is better.
Performance in Long-Form English
Processing one hour of speech in under 3 seconds, with little compromise on accuracy
| # | Model | Average | Earnings21 | Earnings22 | Tedlium | CORAAL |
|---|-------|---------|------------|------------|---------|--------|
| 1 | ElevenLabs Scribe v2 | 7.32 | 6.48 | 9.99 | 2.12 | 10.67 |
| 2 | AssemblyAI Universal 3 Pro | 8.34 | 7.62 | 10.59 | 2.32 | 12.83 |
| 3 | Resonant-1 | 8.58 | 7.28 | 11.29 | 2.24 | 13.51 |
| 4 | Speechmatics Enhanced | 8.80 | 7.90 | 10.75 | 2.26 | 14.29 |
| 5 | Rev AI Fusion | 9.54 | 7.56 | 15.47 | 2.52 | 12.60 |
| 6 | Rev AI Machine | 9.64 | 7.63 | 15.72 | 2.92 | 12.28 |
| 7 | Cohere Transcribe | 9.73 | 8.70 | 12.66 | 2.23 | 15.34 |
Word Error Rate (%) on long-form benchmarks — lower is better.
Pareto Optimal
[Chart: Pareto frontier of accuracy vs. speed, comparing Resonant-1, Resonant-1 Flash, Cohere Transcribe, NVIDIA Canary Qwen 2.5B, IBM Granite 4.0 1B, Qwen3-ASR-1.7B, Kyutai STT 2.6B, OpenAI Whisper Large v3, and Moonshine Streaming Med]