
Introducing Resonant-1 and Resonant-1-flash

Two frontier speech models that excel in speed and accuracy

April 8, 2026


Today we're launching Resonant-1 and Resonant-1-flash: two frontier speech models that set a new standard for both speed and accuracy.

Both models sit at the top of the Open ASR Leaderboard. They were trained on a four-phase schedule that relies heavily on reinforcement learning to push accuracy beyond the previous state of the art.

On the inference side, we've made deep optimizations to our model architecture and to Violin, our internal inference engine built specifically for speech. Fused kernels, CUDA graphs, and smarter scheduling during decoding allow Resonant-1-flash to process one hour of audio in under three seconds.
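To make the throughput figures concrete: RTFx is simply audio duration divided by wall-clock processing time, so "one hour in under three seconds" corresponds to an RTFx above 1200. A minimal sketch of the arithmetic (the `rtfx` helper is illustrative, not part of any published API, and assumes steady-state throughput with startup cost ignored):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Realtime factor: seconds of audio transcribed per wall-clock second."""
    return audio_seconds / processing_seconds

# One hour of audio in three seconds implies an RTFx of 1200
print(rtfx(3600, 3))        # 1200.0

# Conversely, at an RTFx of 1438, an hour of audio takes about 2.5 seconds
print(rtfx(3600, 1438) and 3600 / 1438)  # ≈ 2.5 seconds
```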

SOTA Performance in Short-Form English

| # | Model | RTFx | Average | AMI | Earnings22 | Gigaspeech | SPGISpeech | Tedlium |
|---|-------|------|---------|-----|------------|------------|------------|---------|
| 1 | Resonant-1 | 1187 | 6.51 | 9.39 | 9.02 | 8.76 | 3.05 | 2.32 |
| 2 | Resonant-1 Flash | 1438 | 6.71 | 9.57 | 9.72 | 8.86 | 3.09 | 2.31 |
| 3 | Cohere Transcribe | 525 | 6.78 | 8.13 | 10.86 | 9.34 | 3.08 | 2.49 |
| 4 | Zoom Scribe v1 | — | 6.80 | 10.03 | 9.53 | 9.61 | 1.59 | 3.22 |
| 5 | ibm-granite/granite-4.0-1b-speech | 280 | 6.81 | 8.44 | 8.48 | 10.14 | 3.89 | 3.10 |
| 6 | Qwen/Qwen3-ASR-1.7B | 148 | 6.93 | 10.56 | 10.25 | 8.74 | 2.84 | 2.28 |
| 7 | nvidia/canary-qwen-2.5b | 418 | 6.94 | 10.19 | 10.45 | 9.43 | 1.90 | 2.71 |
| 8 | ElevenLabs Scribe v2 | — | 7.09 | 11.86 | 9.43 | 9.11 | 2.68 | 2.37 |
| 9 | ibm-granite/granite-speech-3.3-8b | 145 | 7.18 | 8.98 | 9.42 | 10.19 | 3.91 | 3.40 |
| 10 | microsoft/Phi-4-multimodal-instruct | 151 | 7.32 | 11.09 | 10.16 | 9.33 | 3.06 | 2.94 |

Word Error Rate (%) on Open ASR Leaderboard — lower is better. RTFx = realtime factor (higher = faster). Averages computed over AMI, Earnings22, Gigaspeech, SPGISpeech, and Tedlium.
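For readers unfamiliar with the metric, WER is the word-level edit distance between the hypothesis transcript and the reference, normalised by the reference length. A minimal sketch of the computation (this is the standard Levenshtein formulation, not the leaderboard's exact text-normalisation pipeline):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference gives 25% WER
print(wer("the cat sat down", "the cat sat town"))  # 0.25
```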

Resonant-1 achieves the lowest average WER across all short-form English benchmarks, while Resonant-1-flash delivers the fastest inference at an RTFx of 1438.

We omit LibriSpeech and VoxPopuli from the evaluations because we use these datasets during training and cannot rule out contamination. Additionally, we observed that while training on these datasets yielded significant WER improvements on their test sets, overall generalisation suffered.

SOTA Performance in European Languages

Resonant-1 leads across French, Dutch, Spanish, and Polish, and achieves the lowest average WER when Swedish is excluded.

| Model | Avg (excl. Swe) | English | German | French | Dutch | Spanish | Polish | Swedish |
|-------|-----------------|---------|--------|--------|-------|---------|--------|---------|
| Resonant-1 | 4.22 | 4.69 | 3.83 | 4.71 | 4.88 | 2.67 | 4.56 | 7.38 |
| Whisper v3 | 4.71 | 4.78 | 4.58 | 5.72 | 5.63 | 2.95 | 4.61 | 7.23 |
| Cohere Transcribe | 4.98 | 5.68 | 4.06 | 5.17 | 5.71 | 3.68 | 5.61 | — |
| Resonant-1 Flash | 5.13 | 5.10 | 4.58 | 5.58 | 6.15 | 3.38 | 6.02 | 10.14 |
| Qwen3-ASR | 5.94 | 4.26 | 3.86 | 4.72 | 7.23 | 3.24 | 4.61 | 19.31 |

Word Error Rate (%) across FLEURS test sets — lower is better.

Performance in Long-Form English

Processing one hour of speech in under 3 seconds, with little compromise on accuracy

| # | Model | Average | Earnings21 | Earnings22 | Tedlium | CORAAL |
|---|-------|---------|------------|------------|---------|--------|
| 1 | ElevenLabs Scribe v2 | 7.32 | 6.48 | 9.99 | 2.12 | 10.67 |
| 2 | AssemblyAI Universal 3 Pro | 8.34 | 7.62 | 10.59 | 2.32 | 12.83 |
| 3 | Resonant-1 | 8.58 | 7.28 | 11.29 | 2.24 | 13.51 |
| 4 | Speechmatics Enhanced | 8.80 | 7.90 | 10.75 | 2.26 | 14.29 |
| 5 | Rev AI Fusion | 9.54 | 7.56 | 15.47 | 2.52 | 12.60 |
| 6 | Rev AI Machine | 9.64 | 7.63 | 15.72 | 2.92 | 12.28 |
| 7 | Cohere Transcribe | 9.73 | 8.70 | 12.66 | 2.23 | 15.34 |

Word Error Rate (%) on long-form benchmarks — lower is better.

Pareto Optimal

Figure: throughput (RTFx) vs. accuracy (WER, lower is better) for Resonant-1, Resonant-1 Flash, Cohere Transcribe, NVIDIA Canary Qwen 2.5B, IBM Granite 4.0 1B, Qwen3-ASR-1.7B, Kyutai STT 2.6B, OpenAI Whisper Large v3, and Moonshine Streaming Med.