Two frontier speech models that excel in speed and accuracy
Today we're launching resonant-1 and resonant-1-flash — two frontier speech models that set a new standard for both speed and accuracy.
Both models sit at the top of the Open ASR Leaderboard. They were trained on a four-phase schedule that relies heavily on reinforcement learning to push accuracy beyond the previous state of the art.
On the inference side, we've made deep optimizations to our model architecture and to Violin, our internal inference engine built specifically for speech. Fused kernels, CUDA graphs, and smarter scheduling during decoding allow resonant-1-flash to process one hour of audio in under three seconds.
| # | Model | RTFx | Average | AMI | Earnings22 | Gigaspeech | SPGISpeech | Tedlium |
|---|---|---|---|---|---|---|---|---|
| 1 | Resonant-1 | 1187 | 6.51 | 9.39 | 9.02 | 8.76 | 3.05 | 2.32 |
| 2 | Resonant-1 Flash | 1438 | 6.71 | 9.57 | 9.72 | 8.86 | 3.09 | 2.31 |
| 3 | Cohere Transcribe | 525 | 6.78 | 8.13 | 10.86 | 9.34 | 3.08 | 2.49 |
| 4 | Zoom Scribe v1 | — | 6.80 | 10.03 | 9.53 | 9.61 | 1.59 | 3.22 |
| 5 | ibm-granite/granite-4.0-1b-speech | 280 | 6.81 | 8.44 | 8.48 | 10.14 | 3.89 | 3.10 |
| 6 | Qwen/Qwen3-ASR-1.7B | 148 | 6.93 | 10.56 | 10.25 | 8.74 | 2.84 | 2.28 |
| 7 | nvidia/canary-qwen-2.5b | 418 | 6.94 | 10.19 | 10.45 | 9.43 | 1.90 | 2.71 |
| 8 | ElevenLabs Scribe v2 | — | 7.09 | 11.86 | 9.43 | 9.11 | 2.68 | 2.37 |
| 9 | ibm-granite/granite-speech-3.3-8b | 145 | 7.18 | 8.98 | 9.42 | 10.19 | 3.91 | 3.40 |
| 10 | microsoft/Phi-4-multimodal-instruct | 151 | 7.32 | 11.09 | 10.16 | 9.33 | 3.06 | 2.94 |
Word Error Rate (%) on the Open ASR Leaderboard; lower is better. RTFx is the inverse real-time factor: seconds of audio transcribed per second of compute (higher is faster). Averages are computed over AMI, Earnings22, Gigaspeech, SPGISpeech, and Tedlium.
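As a quick check of how the Average column is formed, the sketch below recomputes it for the top three rows as an unweighted mean of the five per-benchmark WERs. The numbers are copied from the table; the `macro_average` helper name is ours, and the leaderboard's own tooling may compute and round its averages differently.

```python
# WER (%) per benchmark, copied from the table above.
# Column order: AMI, Earnings22, Gigaspeech, SPGISpeech, Tedlium.
scores = {
    "Resonant-1": [9.39, 9.02, 8.76, 3.05, 2.32],
    "Resonant-1 Flash": [9.57, 9.72, 8.86, 3.09, 2.31],
    "Cohere Transcribe": [8.13, 10.86, 9.34, 3.08, 2.49],
}


def macro_average(wers):
    """Unweighted mean of per-benchmark WERs (each benchmark counts equally)."""
    return sum(wers) / len(wers)


averages = {model: round(macro_average(w), 2) for model, w in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
# Resonant-1: 6.51
# Resonant-1 Flash: 6.71
# Cohere Transcribe: 6.78
```

Because the mean is unweighted, a small benchmark such as Tedlium counts as much as a large one, so the ranking reflects breadth across domains rather than total words transcribed.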
Resonant-1 achieves the lowest average WER over the five short-form English benchmarks, while resonant-1-flash delivers the fastest inference, at 1438 times real time (RTFx 1438).
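To connect the RTFx figures to wall-clock time, here is a minimal sketch that converts an RTFx value into the time needed for one hour of audio. It assumes throughput stays constant at the leaderboard-measured rate, which depends on the hardware and batch configuration used for the benchmark; the function name is illustrative.

```python
def seconds_to_transcribe(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed to transcribe `audio_seconds` of audio.

    RTFx is audio duration divided by processing time, so
    processing time = audio duration / RTFx.
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


ONE_HOUR = 3600.0  # seconds of audio

flash_seconds = seconds_to_transcribe(ONE_HOUR, 1438)  # Resonant-1 Flash
base_seconds = seconds_to_transcribe(ONE_HOUR, 1187)   # Resonant-1

print(f"Resonant-1 Flash: {flash_seconds:.2f} s per hour of audio")  # 2.50 s
print(f"Resonant-1:       {base_seconds:.2f} s per hour of audio")   # 3.03 s
```

At RTFx 1438 an hour of audio takes roughly 2.5 seconds, which is where the "under three seconds" figure comes from.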
Resonant-1 has the lowest WER on French, Dutch, Spanish, and Polish, and the lowest average WER when Swedish is excluded.
| Model | English | German | French | Dutch | Spanish | Polish | Swedish | Avg (excl. Swe) |
|---|---|---|---|---|---|---|---|---|
| Resonant-1 | 3.89 | 4.67 | 3.12 | 4.01 | 3.54 | 5.08 | 6.21 | 4.22 |
| Whisper v3 | 3.74 | 4.45 | 3.98 | 4.87 | 4.12 | 6.34 | 5.76 | 4.71 |
| Cohere Transcribe | 3.96 | 4.72 | 4.21 | 5.13 | 4.34 | 6.52 | 6.03 | 4.98 |
| Resonant-1 Flash | 4.21 | 4.98 | 3.89 | 4.76 | 4.12 | 5.82 | 6.54 | 5.13 |
| Qwen3-ASR-1.7B | 4.43 | 5.31 | 4.67 | 5.92 | 4.89 | 7.14 | 7.28 | 5.94 |
Word Error Rate (%) across FLEURS test sets — lower is better.
Processing one hour of speech in under 3 seconds, with little compromise on accuracy
| # | Model | Average | Earnings21 | Earnings22 | Tedlium | CORAAL |
|---|---|---|---|---|---|---|
| 1 | ElevenLabs Scribe V2 | 7.32 | 6.48 | 9.99 | 2.12 | 10.67 |
| 2 | AssemblyAI Universal Pro 3 | 8.34 | 7.62 | 10.59 | 2.23 | 12.83 |
| 3 | Resonant-1 | 8.65 | 7.41 | 11.37 | 2.21 | 13.64 |
| 4 | Speechmatics Enhanced | 8.80 | 7.90 | 10.75 | 2.26 | 14.29 |
| 5 | Resonant-1 Flash | 8.83 | 7.51 | 10.60 | 2.22 | 14.98 |
| 6 | Cohere Transcribe | 9.73 | 8.70 | 12.66 | 2.23 | 15.34 |
Word Error Rate (%) on long-form benchmarks; lower is better. Resonant-1 Flash processes 1 hour of audio in under 3 seconds.
Pareto Optimal
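One way to make this claim concrete is to compute the Pareto front over speed and accuracy. The sketch below uses the RTFx and average WER values from the Open ASR Leaderboard table, leaving out the two models with no reported RTFx. A model is on the front if no other model is at least as fast and at least as accurate while being strictly better on one axis. The choice of these two axes is our reading of the heading, and the helper names are illustrative.

```python
# (model, RTFx, average WER %) from the Open ASR Leaderboard table.
# Zoom Scribe v1 and ElevenLabs Scribe v2 are omitted: no RTFx is reported.
models = [
    ("Resonant-1", 1187, 6.51),
    ("Resonant-1 Flash", 1438, 6.71),
    ("Cohere Transcribe", 525, 6.78),
    ("ibm-granite/granite-4.0-1b-speech", 280, 6.81),
    ("Qwen/Qwen3-ASR-1.7B", 148, 6.93),
    ("nvidia/canary-qwen-2.5b", 418, 6.94),
    ("ibm-granite/granite-speech-3.3-8b", 145, 7.18),
    ("microsoft/Phi-4-multimodal-instruct", 151, 7.32),
]


def dominates(a, b):
    """True if `a` is no slower and no less accurate than `b`, and strictly better on one axis."""
    _, a_rtfx, a_wer = a
    _, b_rtfx, b_wer = b
    no_worse = a_rtfx >= b_rtfx and a_wer <= b_wer
    strictly_better = a_rtfx > b_rtfx or a_wer < b_wer
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


front = [name for name, _, _ in pareto_front(models)]
print(front)  # ['Resonant-1', 'Resonant-1 Flash']
```

On these numbers, the two Resonant models are the only ones on the front: Resonant-1 is the most accurate, Resonant-1 Flash is the fastest, and every other listed model is both slower and less accurate than Resonant-1.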