Tontaube V0
A compute-efficient text-to-speech model, built to match commercial voice-agent quality and run on-device.
Tontaube V0 matches NeuTTS Air on the standard Seed-TTS intelligibility benchmark (2.33% WER vs 2.21%), and is preferred by Gemini 3.1 Pro in 95–97% of prosody comparisons against on-device competitors and in up to 73% against leading commercial voice-agent providers.
Overview
Tontaube V0 is an autoregressive text-to-speech model with zero-shot voice cloning. A short reference clip and a transcript produce studio-quality speech in the reference voice. Inference runs behind a production vLLM server that streams audio tokens and decodes them on the fly.
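Conceptually, the streaming path works as in the sketch below: audio tokens are consumed as they arrive and decoded in fixed-size chunks, so playback can begin before generation finishes. Here `token_stream`, `decode_chunk`, and `chunk_size` are hypothetical stand-ins, not the actual server API.

```python
def stream_speech(token_stream, decode_chunk, chunk_size=64):
    """Yield decoded audio as soon as each block of audio tokens arrives,
    instead of waiting for the full token sequence."""
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) == chunk_size:
            # Decode this chunk while the model keeps generating.
            yield decode_chunk(buffer)
            buffer = []
    if buffer:  # flush the final partial chunk
        yield decode_chunk(buffer)
```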
The model is the first public release from Tontaube. It is evaluated below against six comparison systems grouped into two tiers:
- Commercial voice-agent APIs: ElevenLabs Flash v2.5, Cartesia Sonic-3, Gradium.
- On-device TTS: Kani TTS 2, NeuTTS Air, NeuTTS Nano.
On-device and Performance
Tontaube V0 is designed to be deployable wherever the user needs it, from high-throughput server fleets down to a single laptop or phone.
- GPU: 20–30× real-time for a single sequence on a single consumer GPU. Batching further increases throughput, making the model practical for high-volume server-side workloads.
- CPU: Roughly 1.5× real-time on a consumer-grade Ryzen 9 5900X without any GPU, already suitable for offline and edge deployments today.
- Mobile: On-device deployment to phones is on the near-term roadmap. We expect NPU acceleration on modern handsets to bring inference comfortably below real-time.
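The real-time figures above are ratios of generated audio duration to wall-clock synthesis time. A minimal sketch of that measurement, with `synthesize` as a hypothetical stand-in for the model call:

```python
import time

def realtime_factor(synthesize, text) -> float:
    """synthesize(text) -> duration of generated audio in seconds
    (hypothetical API). Returns audio seconds per wall-clock second,
    e.g. 1.5 means 1.5x faster than real time."""
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    wall_seconds = time.perf_counter() - start
    return audio_seconds / wall_seconds
```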
Audio Samples
Four generations spanning the registers covered by our benchmark.
Literary narrative
Audiobook-style narration in a single take.
The house at the end of Elm Lane had stood empty for nearly forty years. Its shutters, once a bright blue, had weathered to the color of driftwood, and ivy climbed the walls in thick, unruly patches. Locals still walked past it slowly, as though it might remember them.
Nested clauses
Complex sentence with parentheticals, em-dashes, and semicolons.
The proposal, after being reviewed by the committee — which, as noted, had been assembled on short notice — was tabled indefinitely, or rather until the next quarterly meeting, whichever came later; a compromise that, predictably, satisfied absolutely no one.
Uncommon vocabulary
Archaic and technical vocabulary.
The manuscript, an eighteenth-century palimpsest recovered from a Byzantine monastery, revealed traces of an older Aramaic text beneath the Greek homilies. Its conservator, Anastasios Kyriakou, presented his findings at the ICPM symposium in Lisbon last November, drawing cautious applause from the assembled paleographers.
Conversational
One side of a casual phone conversation.
Oh my god, you won't believe the day I've had. So I'm at the coffee shop this morning, right, and the guy in front of me orders like eleven drinks. Eleven! And of course the machine breaks down halfway through. I was standing there for forty minutes just waiting for a coffee.
Benchmark Results
We evaluate Tontaube V0 on two benchmarks: Seed-TTS test-en for intelligibility, and GMOS, a Gemini-judged pairwise preference benchmark on 400 literary passages drawn from the PG-19 test split.
Seed-TTS WER (1088 rows)
Word error rate measures intelligibility: a reference ASR model transcribes each synthesized sample, and the transcript is compared to the input text. For context on what “good” looks like, the original Seed-TTS paper reports the human ground-truth recordings on this set at 2.14% WER, with Seed-TTS itself at 2.25% (Anastassiou et al., 2024, Table 1). Any system in the low-2% range is therefore essentially at the ceiling of what this benchmark can distinguish. Residual error at that scale reflects ASR noise on natural accents and prosody more than synthesis mistakes.
Mean of the per-row word error rate, with each row clipped at 100%. Lower is better.
GMOS pairwise preference (400 rows)
Raw vote share across all 2N Gemini calls, with each pair judged in both orderings. Blue is Tontaube V0, orange is the opponent, faded grey is tie. Higher blue share is better for us.
Methodology (in brief)
GMOS: Gemini as Judge Pairwise Preference
We propose GMOS (Gemini MOS), a pairwise preference protocol using Gemini 3.1 Pro as the judge, as a cost-effective and reproducible replacement for human-rater MOS on the prosody and correctness axes. Conventional Mean Opinion Score studies are expensive and suffer from inter-rater calibration drift. GMOS follows the LLM-as-judge paradigm (Zheng et al., 2023) and its extension to audio via multimodal frontier models: the judge is given two synthesized audio clips for the same reference text and chooses the preferred one on two independent axes, prosody (rhythm, intonation, emphasis, pacing, naturalness) and correctness (faithful word-by-word rendering).
GMOS builds on prior work such as EmergentTTS-Eval (Manku et al., 2025), which similarly uses Gemini as an audio judge for TTS, but addresses several limitations of that protocol:
- Focused specifically on prosody. EmergentTTS-Eval evaluates six separate categories (emotions, paralinguistics, syntactic complexity, foreign words, questions, pronunciation), each with its own judging criterion. GMOS isolates prosody as a single, consistent axis that every pair is scored on. This matches our primary quality concern for narrative audiobook-style TTS and makes results easier to interpret across systems.
- Explicit bias controls in the prompt. GMOS instructs the judge, as a first-order directive, to ignore voice timbre, codec artifacts, audio quality, and recording noise. These are dimensions an LLM-as-judge can otherwise use as shortcuts.
- Symmetric ordering. EmergentTTS-Eval mitigates positional bias by randomly swapping the presentation order per row (seeded by the sample's unique id), then judging once. Across a 1,645-row benchmark that approximately averages out, but per-row and within small category subsets, positional-bias errors remain undetectable. GMOS judges each pair in both (A, B) and (B, A) orders and reports the raw vote share across all 2N calls, so positional bias is measurable per row and isolated into its own tie bucket.
- Loudness normalization. Every audio is normalized to −20 dBFS before scoring so amplitude asymmetries cannot influence the preference.
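The symmetric-ordering and vote-share rules above can be sketched as follows; `judge_pair` is a hypothetical stand-in for the Gemini call, which returns which position it preferred.

```python
from collections import Counter

def gmos_vote_share(pairs, judge_pair):
    """pairs: list of (ours, opponent) audio handles for the same text.
    judge_pair(first, second) -> "first" | "second" | "tie".
    Each pair is judged in both (A, B) and (B, A) orders, so the raw
    vote share is computed over all 2N calls."""
    votes = Counter()
    for ours, opponent in pairs:
        for first, second in ((ours, opponent), (opponent, ours)):
            verdict = judge_pair(first, second)
            if verdict == "tie":
                votes["tie"] += 1
            elif (verdict == "first") == (first == ours):
                votes["ours"] += 1
            else:
                votes["opponent"] += 1
    n_calls = 2 * len(pairs)
    return {k: votes[k] / n_calls for k in ("ours", "opponent", "tie")}
```

A purely position-biased judge (one that always prefers whichever clip is played first) lands at an exact 50/50 split under this scheme, so positional bias shows up in the numbers instead of silently favoring one system.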
Benchmark Corpus
400 paragraphs of 250 to 500 characters, sampled with a fixed seed from roughly 20,000 qualifying paragraphs in the PG-19 test split (Rae et al., 2020). PG-19 contains 100 English public-domain books published before 1919. The corpus emphasizes literary prose with nested clauses, multiple parentheticals, archaic vocabulary, and dense 19th-century punctuation, which stresses prosodic handling more than the short utterances of Common Voice-style benchmarks.
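A minimal sketch of the sampling procedure described above, assuming a list of PG-19 test-split paragraphs is already in hand; the seed value here is illustrative, not the one actually used.

```python
import random

def sample_corpus(paragraphs, n=400, min_chars=250, max_chars=500, seed=0):
    """Filter paragraphs by character length, then draw n of them with a
    fixed seed so the corpus is reproducible."""
    qualifying = [p for p in paragraphs if min_chars <= len(p) <= max_chars]
    rng = random.Random(seed)  # fixed seed -> identical draw every run
    return rng.sample(qualifying, n)
```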
Seed-TTS WER
The standard Seed-TTS test-en benchmark (Anastassiou et al., 2024; Chen et al., 2024): 1088 zero-shot voice-prompted targets, transcribed with whisper-large-v3, normalized through Whisper's EnglishTextNormalizer, and scored with jiwer. We report the per-row mean WER, clipped at 100% per row, a standard robustification that is a no-op for every model in this evaluation except Kani TTS 2, whose pooled score is inflated by a handful of diverged generations on which Whisper in turn emits repetition loops.
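The clipped-mean aggregation can be sketched with a plain word-level Levenshtein WER. The real pipeline uses whisper-large-v3 transcripts, Whisper's EnglishTextNormalizer, and jiwer; this stdlib version only illustrates the per-row clipping step.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

def clipped_mean_wer(rows):
    """rows: iterable of (reference, hypothesis) strings. Clipping each
    row at 1.0 keeps a single diverged generation (e.g. an ASR repetition
    loop) from dominating the pooled score."""
    rates = [min(word_error_rate(ref, hyp), 1.0) for ref, hyp in rows]
    return sum(rates) / len(rates)
```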
References
- Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS Datasets & Benchmarks, 2023. arXiv:2306.05685
- Rae et al., Compressive Transformers for Long-Range Sequence Modelling, ICLR 2020. arXiv:1911.05507
- Manku et al., EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge, 2025. arXiv:2505.23009
- Anastassiou et al., Seed-TTS: A Family of High-Quality Versatile Speech Generation Models, 2024. arXiv:2406.02430
- Chen et al., F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching, 2024. arXiv:2410.06885