Model Card · April 2026

Tontaube V0

A compute-efficient text-to-speech model, built to match commercial voice-agent quality and run on-device.

Tontaube V0 matches NeuTTS Air on the standard Seed-TTS intelligibility benchmark (2.33% WER vs. 2.21%), and is preferred by Gemini 3.1 Pro in 94–97% of prosody comparisons against on-device competitors and in up to 73% against leading commercial voice-agent providers.

  • Seed-TTS WER: 2.33% (per-row mean, clipped at 100%)
  • GMOS prosody wins: 94–97% vs. the on-device tier (NeuTTS Air, NeuTTS Nano, Kani TTS 2)
  • Server latency: 150–200 ms to first audio chunk on a single GPU
  • Training compute: ~$300 for full end-to-end training

Overview

Tontaube V0 is an autoregressive text-to-speech model with zero-shot voice cloning: a short reference clip and a transcript are enough to produce studio-quality speech in the reference voice. Inference runs behind a production vLLM server that streams audio tokens and decodes them on the fly.
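The streaming path is what keeps latency low: playback can begin after the first decoded chunk rather than after the full generation. A minimal sketch of the idea, where token_stream, codec, and chunk_tokens are illustrative placeholders rather than the actual Tontaube API:

    # Illustrative only: incremental decode of a stream of audio tokens.
    # `token_stream` and `codec` are assumed placeholders, not Tontaube's API.
    def stream_pcm(token_stream, codec, chunk_tokens=50):
        buffer = []
        for token in token_stream:           # tokens arrive as the server emits them
            buffer.append(token)
            if len(buffer) == chunk_tokens:
                yield codec.decode(buffer)   # decode one chunk to PCM and ship it
                buffer.clear()
        if buffer:                           # flush the final partial chunk
            yield codec.decode(buffer)

A production decoder would carry state across chunk boundaries to avoid audible seams; the sketch omits that for brevity.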

The model is the first public release from Tontaube. It is evaluated below against six comparison systems grouped into two tiers:

  • Commercial voice-agent APIs: ElevenLabs Flash v2.5, Cartesia Sonic-3, Gradium.
  • On-device TTS: Kani TTS 2, NeuTTS Air, NeuTTS Nano.

On-device and Performance

Tontaube V0 is designed to be deployable wherever the user needs it, from high-throughput server fleets down to a single laptop or phone.

  • GPU: 20–30× real-time for a single sequence on a single consumer GPU. Batching further increases throughput, making the model practical for high-volume server-side workloads.
  • CPU: Roughly 1.5× real-time on a consumer-grade Ryzen 9 5900X without any GPU, already suitable for offline and edge deployments (real-time factors are measured as sketched below).
  • Mobile: On-device deployment to phones is on the near-term roadmap. We expect NPU acceleration on modern handsets to bring inference comfortably above real-time.
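For reference, the ×real-time figures above are real-time factors: seconds of audio produced per second of wall-clock time, so higher is faster. A minimal way to measure it, assuming a synthesize function that returns the generated clip's duration in seconds:

    import time

    def real_time_factor(synthesize, text):
        # Real-time factor = audio duration / generation time.
        # 1.5x means 1.5 s of audio per second of compute.
        start = time.perf_counter()
        audio_seconds = synthesize(text)  # assumed to return clip length in seconds
        elapsed = time.perf_counter() - start
        return audio_seconds / elapsed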

Audio Samples

Four generations spanning the registers covered by our benchmark.

Literary narrative

Audiobook-style narration in a single take.

The house at the end of Elm Lane had stood empty for nearly forty years. Its shutters, once a bright blue, had weathered to the color of driftwood, and ivy climbed the walls in thick, unruly patches. Locals still walked past it slowly, as though it might remember them.

Nested clauses

Complex sentence with parentheticals, em-dashes, and semicolons.

The proposal, after being reviewed by the committee — which, as noted, had been assembled on short notice — was tabled indefinitely, or rather until the next quarterly meeting, whichever came later; a compromise that, predictably, satisfied absolutely no one.

Uncommon vocabulary

Archaic and technical vocabulary.

The manuscript, an eighteenth-century palimpsest recovered from a Byzantine monastery, revealed traces of an older Aramaic text beneath the Greek homilies. Its conservator, Anastasios Kyriakou, presented his findings at the ICPM symposium in Lisbon last November, drawing cautious applause from the assembled paleographers.

Conversational

One side of a casual phone conversation.

Oh my god, you won't believe the day I've had. So I'm at the coffee shop this morning, right, and the guy in front of me orders like eleven drinks. Eleven! And of course the machine breaks down halfway through. I was standing there for forty minutes just waiting for a coffee.

Benchmark Results

We evaluate Tontaube V0 on two benchmarks: Seed-TTS test-en for intelligibility, and GMOS, a Gemini-judged pairwise preference benchmark on 400 literary passages drawn from the PG-19 test split.

Seed-TTS WER (1088 rows)

Word error rate measures intelligibility: a reference ASR model transcribes each synthesized sample, and the transcript is compared to the input text. For context on what “good” looks like, the original Seed-TTS paper reports the human ground-truth recordings on this set at 2.14% WER, with Seed-TTS itself at 2.25% (Anastassiou et al., 2024, Table 1). Any system in the low-2% range is therefore essentially at the ceiling of what this benchmark can distinguish. Residual error at that scale reflects ASR noise on natural accents and prosody more than synthesis mistakes.

Mean of per-row word error rate, clipped at 100% per row. Lower is better.

  • NeuTTS Nano: 1.69%
  • NeuTTS Air: 2.21%
  • Tontaube V0 (ours): 2.33%
  • Kani TTS 2: 5.31%

GMOS pairwise preference (400 rows)

Raw vote share across all 2N Gemini calls, with each pair judged in both orderings. Columns give Tontaube V0's share, the opponent's share, and the tie share; a higher Tontaube V0 share is better for us.

  Opponent                Axis          Tontaube V0   Opponent    Tie
  ElevenLabs Flash v2.5   Prosody             26.5%      10.8%  62.7%
  ElevenLabs Flash v2.5   Correctness          9.1%      76.2%  14.6%
  Cartesia Sonic-3        Prosody             72.5%       4.2%  23.2%
  Cartesia Sonic-3        Correctness         30.1%      60.9%   9.0%
  Gradium API             Prosody             73.0%       5.8%  21.2%
  Gradium API             Correctness         19.4%      69.2%  11.4%
  Kani TTS 2              Prosody             94.4%       0.2%   5.4%
  Kani TTS 2              Correctness         49.1%      44.9%   6.0%
  NeuTTS Air              Prosody             95.2%       0.5%   4.2%
  NeuTTS Air              Correctness         69.6%      25.8%   4.6%
  NeuTTS Nano             Prosody             97.1%       0.1%   2.8%
  NeuTTS Nano             Correctness         36.0%      56.4%   7.6%

Methodology (in brief)

GMOS: Gemini-as-Judge Pairwise Preference

We propose GMOS (Gemini MOS), a pairwise preference protocol using Gemini 3.1 Pro as the judge, as a cost-effective and reproducible replacement for human-rater MOS on the prosody and correctness axes. Conventional Mean Opinion Score studies are expensive and suffer from inter-rater calibration drift. GMOS follows the LLM-as-judge paradigm (Zheng et al., 2023) and its extension to audio via multimodal frontier models: the judge is given two synthesized audio clips for the same reference text and chooses the preferred one on two independent axes, prosody (rhythm, intonation, emphasis, pacing, naturalness) and correctness (faithful word-by-word rendering).

GMOS builds on prior work such as EmergentTTS-Eval (Manku et al., 2025), which similarly uses Gemini as an audio judge for TTS, but addresses several limitations of that protocol:

  • Focused specifically on prosody. EmergentTTS-Eval evaluates six separate categories (emotions, paralinguistics, syntactic complexity, foreign words, questions, pronunciation), each with its own judging criterion. GMOS isolates prosody as a single, consistent axis that every pair is scored on. This matches our primary quality concern for narrative audiobook-style TTS and makes results easier to interpret across systems.
  • Explicit bias controls in the prompt. GMOS instructs the judge, as a first-order directive, to ignore voice timbre, codec artifacts, audio quality, and recording noise. These are dimensions an LLM-as-judge can otherwise use as shortcuts.
  • Symmetric ordering. EmergentTTS-Eval mitigates positional bias by randomly swapping the presentation order per row (seeded by the sample's unique id), then judging once. Across a 1,645-row benchmark that approximately averages out, but per row and within small category subsets, positional-bias errors remain undetectable. GMOS judges each pair in both (A, B) and (B, A) orders and reports the raw vote share across all 2N calls, so positional bias is measurable per row and isolated into its own tie bucket (see the sketch after this list).
  • Loudness normalization. Every audio is normalized to −20 dBFS before scoring so amplitude asymmetries cannot influence the preference.
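A minimal sketch of the two mechanical pieces above, the loudness normalization and the symmetric-ordering vote count. The judge callable and its return values are assumptions about the protocol's plumbing, not the exact implementation; in particular, we read "isolated into its own tie bucket" as routing positional disagreements to the tie count:

    import numpy as np
    from collections import Counter

    def normalize_dbfs(samples, target_dbfs=-20.0):
        # RMS loudness normalization; `samples` assumed float32 PCM in [-1, 1].
        rms = np.sqrt(np.mean(samples ** 2))
        return samples * (10 ** (target_dbfs / 20) / max(rms, 1e-12))

    def gmos_vote_share(pairs, judge):
        # pairs: (text, ours, opponent) triples.
        # judge(text, a, b) -> "first" | "second" | "tie" (assumed interface).
        votes = Counter()
        for text, ours, opponent in pairs:
            forward = judge(text, ours, opponent)    # ours presented first
            backward = judge(text, opponent, ours)   # ours presented second
            if forward == "first" and backward == "second":
                votes["ours"] += 2
            elif forward == "second" and backward == "first":
                votes["opponent"] += 2
            else:
                # Explicit ties and positional disagreements both land here.
                votes["tie"] += 2
        total = sum(votes.values())                  # 2N calls in total
        return {k: v / total for k, v in votes.items()}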

Benchmark Corpus

400 paragraphs of 250 to 500 characters, sampled with a fixed seed from roughly 20,000 qualifying paragraphs in the PG-19 test split (Rae et al., 2020). The PG-19 test split comprises 100 English public-domain books published before 1919. The corpus emphasizes literary prose with nested clauses, multiple parentheticals, archaic vocabulary, and dense 19th-century punctuation, which stresses prosodic handling more than the short utterances of Common Voice-style benchmarks.
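The draw itself is a single seeded sample; a sketch under assumed names (the length filter is as stated above, the function name and seed value are illustrative stand-ins):

    import random

    def sample_corpus(paragraphs, n=400, seed=0):
        # Keep paragraphs in the 250-500 character band, then draw a fixed,
        # reproducible sample of n of them.
        qualifying = [p for p in paragraphs if 250 <= len(p) <= 500]
        return random.Random(seed).sample(qualifying, n)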

Seed-TTS WER

The standard Seed-TTS test-en benchmark (Anastassiou et al., 2024; Chen et al., 2024): 1088 rows of zero-shot, voice-prompted targets, transcribed with whisper-large-v3, normalized through Whisper's EnglishTextNormalizer, and scored with jiwer. We report the mean of per-row WER, clipped at 100% per row, a standard robustification that is a no-op for every model in this evaluation except Kani TTS 2, whose pooled score is inflated by a handful of diverged generations on which Whisper in turn emits repetition loops.
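A sketch of the scoring step, assuming whisper-large-v3 transcripts are already available (transcription and row pairing omitted); the normalizer and jiwer are used as named above:

    import jiwer
    from whisper.normalizers import EnglishTextNormalizer

    normalizer = EnglishTextNormalizer()

    def row_wer(reference_text, asr_transcript):
        # Normalize both sides, then clip each row at 100% so a single
        # diverged generation cannot dominate the benchmark mean.
        ref = normalizer(reference_text)
        hyp = normalizer(asr_transcript)
        return min(jiwer.wer(ref, hyp), 1.0)

    def benchmark_wer(rows):
        # rows: (reference_text, asr_transcript) pairs; mean of per-row WER.
        return sum(row_wer(r, h) for r, h in rows) / len(rows)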

References

  1. Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS Datasets & Benchmarks, 2023. arXiv:2306.05685
  2. Rae et al., Compressive Transformers for Long-Range Sequence Modelling, ICLR 2020. arXiv:1911.05507
  3. Manku et al., EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge, 2025. arXiv:2505.23009
  4. Anastassiou et al., Seed-TTS: A Family of High-Quality Versatile Speech Generation Models, 2024. arXiv:2406.02430
  5. Chen et al., F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching, 2024. arXiv:2410.06885