Introducing Resonant-1 and Resonant-1-flash

Two frontier speech models that excel in speed and accuracy

April 8, 2026

Today we're launching Resonant-1 and Resonant-1-flash, two frontier speech models that set a new standard for both speed and accuracy.

Both models sit at the top of the Open ASR Leaderboard. They were trained on a four-phase schedule that relies heavily on reinforcement learning to push accuracy beyond the previous state of the art.

On the inference side, we've made deep optimizations to our model architecture and to Violin, our internal inference engine built specifically for speech. Fused kernels, CUDA graphs, and smarter scheduling during decoding allow Resonant-1-flash to process one hour of audio in under three seconds.

SOTA Performance in Short-Form English

| # | Model | RTFx | Average | AMI | Earnings22 | Gigaspeech | SPGISpeech | Tedlium |
|---|---|---|---|---|---|---|---|---|
| 1 | Resonant-1 | 1187 | 6.51 | 9.39 | 9.02 | 8.76 | 3.05 | 2.32 |
| 2 | Resonant-1 Flash | 1438 | 6.71 | 9.57 | 9.72 | 8.86 | 3.09 | 2.31 |
| 3 | Cohere Transcribe | 525 | 6.78 | 8.13 | 10.86 | 9.34 | 3.08 | 2.49 |
| 4 | Zoom Scribe v1 | - | 6.80 | 10.03 | 9.53 | 9.61 | 1.59 | 3.22 |
| 5 | ibm-granite/granite-4.0-1b-speech | 280 | 6.81 | 8.44 | 8.48 | 10.14 | 3.89 | 3.10 |
| 6 | Qwen/Qwen3-ASR-1.7B | 148 | 6.93 | 10.56 | 10.25 | 8.74 | 2.84 | 2.28 |
| 7 | nvidia/canary-qwen-2.5b | 418 | 6.94 | 10.19 | 10.45 | 9.43 | 1.90 | 2.71 |
| 8 | ElevenLabs Scribe v2 | - | 7.09 | 11.86 | 9.43 | 9.11 | 2.68 | 2.37 |
| 9 | ibm-granite/granite-speech-3.3-8b | 145 | 7.18 | 8.98 | 9.42 | 10.19 | 3.91 | 3.40 |
| 10 | microsoft/Phi-4-multimodal-instruct | 151 | 7.32 | 11.09 | 10.16 | 9.33 | 3.06 | 2.94 |

Word Error Rate (%) on the Open ASR Leaderboard; lower is better. RTFx is the inverse real-time factor, i.e. seconds of audio processed per second of compute (higher is faster). Averages are computed over AMI, Earnings22, Gigaspeech, SPGISpeech, and Tedlium.
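For readers less familiar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the leaderboard's scoring code: it assumes whitespace tokenization and skips the text normalization that real evaluations typically apply, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level edit distance.

    Assumption: words are split on whitespace and compared exactly;
    no casing or punctuation normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.2f}")  # 33.33
```

Because the denominator is the reference length, a WER above 100% is possible when the output contains many inserted words.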

Resonant-1 achieves the lowest average WER across the five short-form English benchmarks, while Resonant-1-flash delivers the highest reported throughput, with an RTFx of 1438.
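To relate RTFx to the "one hour in under three seconds" figure, the arithmetic can be sketched as follows. This assumes the leaderboard RTFx carries over unchanged to a single one-hour recording, which ignores batching, I/O, and hardware differences; the helper name `seconds_to_transcribe` is illustrative.

```python
def seconds_to_transcribe(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed at a given RTFx.

    RTFx is treated as seconds of audio processed per second of compute.
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


ONE_HOUR = 3600  # seconds of audio

# RTFx values taken from the short-form table.
for model, rtfx in {"Resonant-1": 1187, "Resonant-1 Flash": 1438}.items():
    print(f"{model}: {seconds_to_transcribe(ONE_HOUR, rtfx):.2f} s per hour of audio")
# Resonant-1: 3.03 s per hour of audio
# Resonant-1 Flash: 2.50 s per hour of audio
```

Under that assumption, the Flash model lands at roughly 2.5 seconds per hour of audio, and Resonant-1 at just over 3 seconds.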

SOTA Performance in European Languages

Resonant-1 leads across French, Dutch, Spanish, and Polish, achieving the lowest average WER excluding Swedish.

| Model | English | German | French | Dutch | Spanish | Polish | Swedish | Avg (excl. Swe) |
|---|---|---|---|---|---|---|---|---|
| Resonant | 3.89 | 4.67 | 3.12 | 4.01 | 3.54 | 5.08 | 6.21 | 4.22 |
| Whisper v3 | 3.74 | 4.45 | 3.98 | 4.87 | 4.12 | 6.34 | 5.76 | 4.71 |
| Cohere Transcribe | 3.96 | 4.72 | 4.21 | 5.13 | 4.34 | 6.52 | 6.03 | 4.98 |
| Resonant Flash | 4.21 | 4.98 | 3.89 | 4.76 | 4.12 | 5.82 | 6.54 | 5.13 |
| Qwen3-ASR-1.7B | 4.43 | 5.31 | 4.67 | 5.92 | 4.89 | 7.14 | 7.28 | 5.94 |

Word Error Rate (%) across FLEURS test sets — lower is better.

Performance in Long-Form English

Processing one hour of speech in under 3 seconds, with little compromise on accuracy

| # | Model | Average | Earnings21 | Earnings22 | Tedlium | CORAAL |
|---|---|---|---|---|---|---|
| 1 | ElevenLabs Scribe V2 | 7.32 | 6.48 | 9.99 | 2.12 | 10.67 |
| 2 | AssemblyAI Universal Pro 3 | 8.34 | 7.62 | 10.59 | 2.23 | 12.83 |
| 3 | Resonant | 8.65 | 7.41 | 11.37 | 2.21 | 13.64 |
| 4 | Speechmatics Enhanced | 8.80 | 7.90 | 10.75 | 2.26 | 14.29 |
| 5 | Resonant Flash | 8.875 | 7.51 | 10.60 | 2.22 | 14.98 |
| 6 | Cohere Transcribe | 9.73 | 8.70 | 12.66 | 2.23 | 15.34 |

Word Error Rate (%) on long-form benchmarks — lower is better. Resonant Flash processes 1 hour of audio in under 3 seconds.

Pareto Optimal

[Figure: throughput (RTFx) plotted against accuracy (WER, lower is better) for Resonant-1 (RTFx 1187), Resonant-1 Flash (RTFx 1438), Cohere Transcribe, NVIDIA Canary Qwen 2.5B, IBM Granite 4.0 1B, Qwen3-ASR-1.7B, Kyutai STT 2.6B, OpenAI Whisper Large v3, and Moonshine Streaming Med.]
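As a rough check of the Pareto claim, the sketch below applies the usual dominance test to the rows of the short-form table that report an RTFx: a model is on the front if no other model is at least as fast and at least as accurate while being strictly better on one of the two. It only covers models listed in that table, not every model in the figure, and the names `dominates` and `pareto_front` are illustrative.

```python
# (model, RTFx, average WER %) for short-form table rows that report RTFx.
MODELS = [
    ("Resonant-1", 1187, 6.51),
    ("Resonant-1 Flash", 1438, 6.71),
    ("Cohere Transcribe", 525, 6.78),
    ("ibm-granite/granite-4.0-1b-speech", 280, 6.81),
    ("Qwen/Qwen3-ASR-1.7B", 148, 6.93),
    ("nvidia/canary-qwen-2.5b", 418, 6.94),
    ("ibm-granite/granite-speech-3.3-8b", 145, 7.18),
    ("microsoft/Phi-4-multimodal-instruct", 151, 7.32),
]


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    _, a_rtfx, a_wer = a
    _, b_rtfx, b_wer = b
    no_worse = a_rtfx >= b_rtfx and a_wer <= b_wer
    strictly_better = a_rtfx > b_rtfx or a_wer < b_wer
    return no_worse and strictly_better


def pareto_front(models):
    """Keep only the models that no other model dominates."""
    return [m for m in models if not any(dominates(other, m) for other in models)]


for name, rtfx, wer in pareto_front(MODELS):
    print(f"{name}: RTFx {rtfx}, WER {wer}")
# Resonant-1: RTFx 1187, WER 6.51
# Resonant-1 Flash: RTFx 1438, WER 6.71
```

On this subset, Resonant-1 and Resonant-1 Flash are the only two non-dominated models: the first has the lowest average WER and the second the highest RTFx.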