Model Card · April 2026

Tontaube V0

A compute-efficient text-to-speech model, built to match commercial voice-agent quality and run on-device.

Tontaube V0 is preferred by Gemini 3.1 Pro on 72–73% of prosody comparisons against Cartesia Sonic-3 and Gradium, and on 94–97% against on-device competitors (NeuTTS Air, NeuTTS Nano, Kani TTS 2). Intelligibility sits at the Seed-TTS human-recording ceiling (2.33% WER, vs. 2.14% for ground-truth audio). The model was trained for ~$300, runs faster than real-time on a consumer CPU, and approaches real-time on mobile.

  • Seed-TTS WER: 2.33% (per-row mean, clipped at 100%)
  • GMOS prosody wins: 94–97% vs. the on-device tier (NeuTTS Air, NeuTTS Nano, Kani TTS 2)
  • Server latency: 150–200 ms to first audio chunk on a single GPU
  • On-device: near real-time on Pixel 8 with CPU inference, no GPU required

Overview

Tontaube V0 is an autoregressive text-to-speech model with zero-shot voice cloning. Inference runs behind a production vLLM server that streams audio tokens and decodes them on the fly.

The model is the first public release from Tontaube. It was trained from scratch on a budget of $1,000. It is evaluated below against six comparison systems grouped into two tiers:

  • Commercial voice-agent APIs: ElevenLabs Flash v2.5, Cartesia Sonic-3, Gradium.
  • On-device TTS: Kani TTS 2, NeuTTS Air, NeuTTS Nano.

On-device and Performance

Tontaube V0 is designed to be deployable wherever the user needs it, from high-throughput server fleets down to a single laptop or phone.

  • GPU: 20–30× real-time for a single sequence on a single consumer GPU. Batching further increases throughput, making the model practical for high-volume server-side workloads.
  • CPU: Roughly 1.5× real-time on a consumer-grade Ryzen 9 5900X without any GPU, already suitable for offline and edge deployments today.
  • Smartphone: Near real-time on a consumer-grade smartphone (Pixel 8).
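As a rough illustration of how the "× real-time" figures above can be read, the multiple is the number of seconds of audio produced per second of wall-clock compute. The function name and the example timing below are hypothetical and are not measurements from this evaluation.

```python
def realtime_multiple(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Hypothetical timing: 12 s of speech synthesized in 8 s of compute.
multiple = realtime_multiple(audio_seconds=12.0, wall_seconds=8.0)
print(f"{multiple:.2f}x real-time")  # 1.50x real-time
```

A value above 1 means synthesis finishes before playback would; a value just below 1 corresponds to "near real-time".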

Audio Samples

Four generations spanning the registers covered by our benchmark.

Literary narrative

Audiobook-style narration in a single take.

The house at the end of Elm Lane had stood empty for nearly forty years. Its shutters, once a bright blue, had weathered to the color of driftwood, and ivy climbed the walls in thick, unruly patches. Locals still walked past it slowly, as though it might remember them.

Nested clauses

Complex sentence with parentheticals, em-dashes, and semicolons.

The proposal, after being reviewed by the committee — which, as noted, had been assembled on short notice — was tabled indefinitely, or rather until the next quarterly meeting, whichever came later; a compromise that, predictably, satisfied absolutely no one.

Uncommon vocabulary

Archaic and technical vocabulary.

The manuscript, an eighteenth-century palimpsest recovered from a Byzantine monastery, revealed traces of an older Aramaic text beneath the Greek homilies. Its conservator, Anastasios Kyriakou, presented his findings at the ICPM symposium in Lisbon last November, drawing cautious applause from the assembled paleographers.

Conversational

One side of a casual phone conversation.

Oh my god, you won't believe the day I've had. So I'm at the coffee shop this morning, right, and the guy in front of me orders like eleven drinks. Eleven! And of course the machine breaks down halfway through. I was standing there for forty minutes just waiting for a coffee.

Benchmark Results

We evaluate Tontaube V0 on two benchmarks: Seed-TTS test-en for intelligibility, and GMOS, a Gemini-judged pairwise preference benchmark on 400 literary passages drawn from the PG-19 test split.

Seed-TTS WER (1088 rows)

Word error rate measures intelligibility: a reference ASR model transcribes each synthesized sample, and the transcript is compared to the input text. For context on what “good” looks like, the original Seed-TTS paper reports the human ground-truth recordings on this set at 2.14% WER, with Seed-TTS itself at 2.25% (Anastassiou et al., 2024, Table 1). Any system in the low-2% range is therefore essentially at the ceiling of what this benchmark can distinguish. Residual error at that scale reflects ASR noise on natural accents and prosody more than synthesis mistakes.

Mean of per-row word error rate, clipped at 100% per row. Lower is better.

System               WER
NeuTTS Nano          1.69%
NeuTTS Air           2.21%
Tontaube V0 (ours)   2.33%
Kani TTS 2           5.31%

GMOS pairwise preference (400 rows)

Raw vote share across all 2N Gemini calls, with each pair judged in both orderings. A higher Tontaube V0 share is better for us.

Versus                  Axis          Tontaube V0   Tie     Opponent
ElevenLabs Flash v2.5   Prosody       26.5%         10.8%   62.7%
ElevenLabs Flash v2.5   Correctness   9.1%          76.2%   14.6%
Cartesia Sonic-3        Prosody       72.5%         4.2%    23.2%
Cartesia Sonic-3        Correctness   30.1%         60.9%   9.0%
Gradium API             Prosody       73.0%         5.8%    21.2%
Gradium API             Correctness   19.4%         69.2%   11.4%
Kani TTS 2              Prosody       94.4%         0.2%    5.4%
Kani TTS 2              Correctness   49.1%         44.9%   6.0%
NeuTTS Air              Prosody       95.2%         0.5%    4.2%
NeuTTS Air              Correctness   69.6%         25.8%   4.6%
NeuTTS Nano             Prosody       97.1%         0.1%    2.8%
NeuTTS Nano             Correctness   36.0%         56.4%   7.6%
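A minimal sketch of how a raw vote share over all 2N calls could be tallied, assuming each judge call returns one of three verdicts ("first", "second", or "tie") for the presentation order it was shown. The record layout, the verdict labels, and the system names are illustrative assumptions, not the project's actual data format. The sketch only counts each call's verdict and makes no assumption about how order-dependent rows are bucketed.

```python
from collections import Counter


def vote_share(judgments):
    """Tally raw vote share over every judge call.

    Each judgment is (first_system, second_system, verdict), where verdict
    refers to the slot the judge preferred in that presentation order.
    """
    counts = Counter()
    for first, second, verdict in judgments:
        if verdict == "first":
            counts[first] += 1
        elif verdict == "second":
            counts[second] += 1
        elif verdict == "tie":
            counts["tie"] += 1
        else:
            raise ValueError(f"unknown verdict: {verdict!r}")
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Three hypothetical rows, each judged in both (A, B) and (B, A) order.
judgments = [
    ("ours", "opp", "first"), ("opp", "ours", "second"),  # ours wins both orders
    ("ours", "opp", "first"), ("opp", "ours", "first"),   # verdict follows the slot
    ("ours", "opp", "tie"), ("opp", "ours", "tie"),       # tie in both orders
]
shares = vote_share(judgments)
print(shares)
```

In the second row the judge prefers whichever clip is played first, so the two orderings split one vote each way; that kind of disagreement is what judging both orders makes visible per row.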

Methodology (in brief)

GMOS: Gemini-as-Judge Pairwise Preference

We propose GMOS (Gemini MOS), a pairwise preference protocol using Gemini 3.1 Pro as the judge, as a cost-effective and reproducible replacement for human-rater MOS on the prosody and correctness axes. Conventional Mean Opinion Score studies are expensive and suffer from inter-rater calibration drift. GMOS follows the LLM-as-judge paradigm (Zheng et al., 2023) and its extension to audio via multimodal frontier models: the judge is given two synthesized audio clips for the same reference text and chooses the preferred one on two independent axes, prosody (rhythm, intonation, emphasis, pacing, naturalness) and correctness (faithful word-for-word rendering).

GMOS builds on prior work such as EmergentTTS-Eval (Manku et al., 2025), which similarly uses Gemini as an audio judge for TTS, but addresses several limitations of that protocol:

  • Focused specifically on prosody. EmergentTTS-Eval evaluates six separate categories (emotions, paralinguistics, syntactic complexity, foreign words, questions, pronunciation), each with its own judging criterion. GMOS isolates prosody as a single, consistent axis that every pair is scored on. This matches our primary quality concern for narrative audiobook-style TTS and makes results easier to interpret across systems.
  • Explicit bias controls in the prompt. GMOS instructs the judge, as a first-order directive, to ignore voice timbre, codec artifacts, audio quality, and recording noise. These are dimensions an LLM-as-judge can otherwise use as shortcuts.
  • Symmetric ordering. EmergentTTS-Eval mitigates positional bias by randomly swapping the presentation order per row (seeded by the sample's unique id), then judging once. Across a 1,645-row benchmark the bias approximately averages out, but for individual rows and within small category subsets, positional-bias errors remain undetectable. GMOS judges each pair in both (A, B) and (B, A) orders and reports the raw vote share across all 2N calls, so positional bias is measurable per row and isolated into its own tie bucket.
  • Loudness normalization. Every audio clip is normalized to −20 dBFS before scoring so that amplitude asymmetries cannot influence the preference.
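The loudness-normalization step can be sketched as follows. The text does not say whether the −20 dBFS target is an RMS, peak, or LUFS level, so this sketch assumes RMS level on float samples where full scale is 1.0. Both function names are hypothetical.

```python
import math


def rms_dbfs(samples):
    """RMS level in dBFS for float samples where full scale is 1.0."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")


def normalize_rms_dbfs(samples, target_dbfs=-20.0):
    """Scale samples so their RMS level equals target_dbfs (assumed RMS-based)."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # pure silence: there is nothing to scale
    gain = 10.0 ** (target_dbfs / 20.0) / rms
    return [s * gain for s in samples]


# One second of a 440 Hz tone at 16 kHz, peak amplitude 0.5.
tone = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
normalized = normalize_rms_dbfs(tone)
print(round(rms_dbfs(tone), 2), round(rms_dbfs(normalized), 2))
```

A single gain per clip leaves the waveform shape untouched, so only overall level changes between the two candidates. Very peaky audio could exceed full scale after the gain is applied; a real pipeline would need to check for that.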

Benchmark Corpus

400 paragraphs of 250 to 500 characters, sampled with a fixed seed from roughly 20,000 qualifying paragraphs in the PG-19 test split (Rae et al., 2020). The test split contains 100 English public-domain books published before 1919. The corpus emphasizes literary prose with nested clauses, multiple parentheticals, archaic vocabulary, and dense 19th-century punctuation, which stresses prosodic handling more than the short utterances of Common Voice-style benchmarks.
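A small sketch of the length filter and fixed-seed draw described above. The seed value, the inclusive length bounds, and the use of Python's `random` module are assumptions made for illustration; how paragraphs are split out of the books is not shown.

```python
import random


def sample_paragraphs(paragraphs, k=400, min_chars=250, max_chars=500, seed=0):
    """Keep paragraphs within the character bounds, then draw k reproducibly."""
    qualifying = [p for p in paragraphs if min_chars <= len(p) <= max_chars]
    if len(qualifying) < k:
        raise ValueError(f"only {len(qualifying)} qualifying paragraphs, need {k}")
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return rng.sample(qualifying, k)


# Synthetic stand-in corpus: one "paragraph" per length from 100 to 699 chars.
corpus = ["x" * n for n in range(100, 700)]
subset = sample_paragraphs(corpus, k=50)
print(len(subset), min(map(len, subset)), max(map(len, subset)))
```

Because the generator is seeded, rerunning the draw on the same filtered list returns the same paragraphs in the same order, which is what keeps the benchmark rows stable across evaluations.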

Seed-TTS WER

The standard Seed-TTS test-en benchmark (Anastassiou et al., 2024; Chen et al., 2024): 1088 zero-shot, voice-prompted targets, transcribed with whisper-large-v3, normalized through Whisper's EnglishTextNormalizer, and scored with jiwer. We report the per-row mean WER, clipped at 100% per row. The clipping is a standard robustification and a no-op for every model in this evaluation except Kani TTS 2, whose pooled score is inflated by a handful of diverged generations on which Whisper in turn emits repetition loops.
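To make the difference between the clipped per-row mean and a pooled score concrete, here is a minimal sketch. It uses a plain word-level edit distance as a stand-in for jiwer and assumes both strings have already been through text normalization. The function names and the two example rows are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1], len(ref)


def clipped_mean_wer(pairs, clip=1.0):
    """Mean of per-row WER, with each row capped at `clip` (1.0 = 100%)."""
    rates = []
    for ref, hyp in pairs:
        errors, n_ref = word_errors(ref, hyp)
        if n_ref == 0:
            raise ValueError("empty reference")
        rates.append(min(errors / n_ref, clip))
    return sum(rates) / len(rates)


def pooled_wer(pairs):
    """Total errors over total reference words, with no per-row cap."""
    totals = [word_errors(ref, hyp) for ref, hyp in pairs]
    return sum(e for e, _ in totals) / sum(n for _, n in totals)


pairs = [
    ("the cat sat", "the cat sat"),                         # perfect row
    ("hello world", "hello hello hello hello hello hello"), # repetition loop
]
print(clipped_mean_wer(pairs), pooled_wer(pairs))  # 0.5 1.0
```

The looping row alone has a WER of 250% (5 errors on 2 reference words). Capping it at 100% keeps one diverged transcript from dominating the average, while rows with WER below 100% are unaffected.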

Comparative Audio Samples

Three head-to-head test cases from our architecture validation, comparing Tontaube V0 against leading commercial voice-agent APIs on the same reference text and voice.

Test Case 1: Semantic Understanding
Complex, nested sentence structures.
The young tree — which, assuming the heavy rain (cold, yet very necessary) continues to fall, is growing — can, if the bright sun finally comes out from behind the dark clouds, survive the winter.
  • ElevenLabs V3 (global leader)
  • Gradium (Paris startup)
  • Cartesia Sonic-3 (agents leader)
  • Tontaube V0 (our model)
Test Case 2: Narrations
Naturalness and human-likeness.
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice "without pictures or conversations?" So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
  • ElevenLabs V3 (global leader)
  • Cartesia Sonic-3 (agents leader)
  • Tontaube V0 (our model)
Test Case 3: Agentic Dialogue
Naturalness in customer-facing delivery.
My bad about the delay earlier. I tracked your order and it reached the local hub this morning, so delivery is now expected between 2 and 5 PM today.
  • ElevenLabs Flash v2.5 (global leader)
  • Cartesia Sonic-3 (agents leader)
  • Tontaube V0 (our model)

References

  1. Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS Datasets & Benchmarks, 2023. arXiv:2306.05685
  2. Rae et al., Compressive Transformers for Long-Range Sequence Modelling, ICLR 2020. arXiv:1911.05507
  3. Manku et al., EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge, 2025. arXiv:2505.23009
  4. Anastassiou et al., Seed-TTS: A Family of High-Quality Versatile Speech Generation Models, 2024. arXiv:2406.02430
  5. Chen et al., F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching, 2024. arXiv:2410.06885