PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

Abstract

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through a minimalist architecture and rigorous data engineering. Trained on only 200K hours of data processed entirely with open-source tools, PilotTTS contributes: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis with 11 categories, paralinguistic synthesis with 4 categories, and Chinese dialect synthesis with 14 dialects. On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets.

Figure 1: Overview of PilotTTS.

Emotional Speech Synthesis

PilotTTS supports zero-shot emotional speech synthesis across 11 categories. Given a neutral prompt audio and an emotion tag, the model generates speech with the corresponding emotional expression.

Emotion	Prompt Audio	Text	Generated Speech

Paralinguistic Features

The model supports paralinguistic generation, including LAUGH, COUGH, CRY, BREATH, and COMBO. Special tags embedded in the input text (e.g., <|LAUGH|>) control where paralinguistic events are inserted.
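
To make the tag mechanism concrete, the following is a minimal sketch of how tagged input text could be split into ordinary text spans and paralinguistic events before synthesis. It is not PilotTTS code: the function name split_tagged_text and the segment format are hypothetical, and the only detail taken from this page is the <|NAME|> tag syntax shown in the <|LAUGH|> example.

```python
import re
from typing import List, Tuple

# Assumption: every tag follows the <|NAME|> form of the <|LAUGH|> example,
# with NAME written in upper-case ASCII letters.
TAG_PATTERN = re.compile(r"<\|([A-Z]+)\|>")


def split_tagged_text(text: str) -> List[Tuple[str, str]]:
    """Split text into ordered ("text", chunk) and ("event", NAME) segments.

    Hypothetical helper for illustration; PilotTTS may tokenize tags differently.
    """
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        # Plain text between the previous tag (or start) and this tag.
        chunk = text[cursor:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("event", match.group(1)))
        cursor = match.end()
    # Any plain text after the last tag.
    tail = text[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


if __name__ == "__main__":
    example = "That was hilarious <|LAUGH|> sorry, give me a second <|BREATH|> okay."
    for kind, value in split_tagged_text(example):
        print(kind, value)
```

Because each event keeps its position relative to the surrounding text, a segment list of this kind preserves exactly the information the tags are meant to convey: which event occurs and where in the utterance it is inserted.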

Type	Prompt Audio	Text with Tags	Generated Speech

Dialect Speech Synthesis

PilotTTS supports 13 Chinese dialects with 91.8% same-dialect accuracy. The following dialects are supported:

Same-Dialect Synthesis (Dialect → Dialect)

Using a dialect prompt audio as reference, the model generates speech in the same dialect while preserving speaker characteristics.

Dialect	Prompt Audio	Text	Generated Speech

Mandarin-to-Dialect Synthesis

Using a Mandarin prompt audio as reference, the model extracts speaker timbre and generates speech in the target dialect.

Dialect	Prompt Audio	Text	Generated Speech

Cross-Dialect Synthesis

Using a prompt audio in one dialect, the model generates speech in a different target dialect while preserving speaker identity.

Source → Target	Prompt Audio	Text	Generated Speech