Abstract
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. Trained on only 200K hours of data processed entirely by open-source tools, PilotTTS contributes: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis with 11 categories, paralinguistic synthesis with 4 categories, and Chinese dialect synthesis with 14 dialects. On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets.
Figure 1: Overview of PilotTTS.
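The WER and CER figures quoted in the abstract are edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn the recognized transcript into the reference, divided by the reference length in words (WER) or characters (CER). The sketch below shows only that arithmetic. The function names are illustrative, and it assumes transcripts are already normalized; the ASR model and text normalization used by the benchmark are not described on this page and are not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance, whitespace ignored."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.2%}")
    # One substituted character out of six reference characters.
    print(f"{cer('今天天气很好', '今天天汽很好'):.2%}")
```

CER is the usual choice for Chinese because written Chinese has no whitespace word boundaries, which is consistent with the abstract reporting WER on test-en and CER on test-zh.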
Emotional Speech Synthesis
PilotTTS supports zero-shot emotional speech synthesis across 11 categories. Given a neutral prompt audio and an emotion tag, the model generates speech with the corresponding emotional expression.
| Emotion | Prompt Audio | Text | Generated Speech |
Paralinguistic Features
The model supports four paralinguistic event types (LAUGH, COUGH, CRY, and BREATH), as well as COMBO samples that combine several events. Special tags in the text (e.g., <|LAUGH|>) control where paralinguistic events are inserted.
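As a rough illustration of how inline tags can mark insertion points, the sketch below splits a tagged sentence into ordered text and event segments. It is not PilotTTS's actual front end. Only the `<|LAUGH|>` form appears on this page, so treating COUGH, CRY, and BREATH as tags of the same shape is an assumption, and COMBO is left out because its tag syntax is not shown. The function name `split_tagged_text` is hypothetical.

```python
import re

# Assumed tag inventory: only <|LAUGH|> is shown on the page; the other three
# names are taken from the list of event types and assumed to share its syntax.
EVENT_TAGS = ("LAUGH", "COUGH", "CRY", "BREATH")
TAG_PATTERN = re.compile(r"<\|(" + "|".join(EVENT_TAGS) + r")\|>")


def split_tagged_text(text):
    """Return ordered ('text', str) and ('event', name) segments."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("event", match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


if __name__ == "__main__":
    for kind, value in split_tagged_text(
        "That was hilarious <|LAUGH|> sorry, let me continue."
    ):
        print(kind, value)
```

Because the segments keep their original order, the position of each event relative to the surrounding words is preserved, which is what lets a tag determine where the event occurs in the generated speech.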
| Type | Prompt Audio | Text with Tags | Generated Speech |
Dialect Speech Synthesis
PilotTTS supports 13 Chinese dialects with 91.8% same-dialect accuracy. The following dialects are supported:
Same-Dialect Synthesis (Dialect → Dialect)
Using a dialect prompt audio as reference, the model generates speech in the same dialect while preserving speaker characteristics.
| Dialect | Prompt Audio | Text | Generated Speech |
Mandarin-to-Dialect Synthesis
Using a Mandarin prompt audio as reference, the model extracts speaker timbre and generates speech in the target dialect.
| Dialect | Prompt Audio | Text | Generated Speech |
Cross-Dialect Synthesis
Using a prompt audio in one dialect, the model generates speech in a different target dialect while preserving speaker identity.
| Source → Target | Prompt Audio | Text | Generated Speech |