PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

Amap Voice, Alibaba Group

Abstract

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through a minimalist architecture and rigorous data engineering. Trained on only 200K hours of data processed entirely by open-source tools, PilotTTS makes two contributions: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis with 11 categories, paralinguistic synthesis with 4 categories, and Chinese dialect synthesis with 14 dialects. On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets.
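
For readers unfamiliar with the two intelligibility metrics quoted above, the following is a minimal sketch of how a word error rate (WER) and a character error rate (CER) can be computed from a reference text and an ASR transcript of the synthesized audio. It assumes the transcript is already available and uses a plain Levenshtein distance. The function names are illustrative, and the benchmark's own ASR models and text normalization are not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word-level error rate (assumed here for English text)."""
    return error_rate(reference.lower().split(), hypothesis.lower().split())


def cer(reference, hypothesis):
    """Character-level error rate (assumed here for Chinese text)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    return error_rate(ref_chars, hyp_chars)
```

Character-level scoring is the usual choice for Chinese because written Chinese has no whitespace word boundaries, which is why the two test sets are reported with different metrics.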

PilotTTS Overview

Figure 1: Overview of PilotTTS.

Zero-Shot Voice Cloning Demo

Zero-shot cloned voices, emotions, paralinguistic expressions, and multiple dialects, all generated by PilotTTS

Emotional Speech Synthesis

PilotTTS supports zero-shot emotional speech synthesis across 11 categories. Given a neutral prompt audio and an emotion tag, the model generates speech with the corresponding emotional expression.
Emotion | Prompt Audio | Text | Generated Speech

Paralinguistic Features

The model supports four paralinguistic event types (LAUGH, COUGH, CRY, and BREATH) as well as combinations of them (COMBO). Special tags in the text (e.g., <|LAUGH|>) control where paralinguistic events are inserted.
Type | Prompt Audio | Text with Tags | Generated Speech
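
To make the tag convention concrete, here is a minimal sketch of how input text containing tags such as <|LAUGH|> could be split into spoken segments and paralinguistic events. The tag names come from the list above. The parsing scheme, the function name `split_tagged_text`, and the treatment of COMBO as several tags in one utterance (rather than a tag of its own) are assumptions made for illustration, not a description of the actual PilotTTS front end.

```python
import re

# Tag names taken from the text above; everything else here is hypothetical.
PARALINGUISTIC_TAGS = {"LAUGH", "COUGH", "CRY", "BREATH"}
TAG_PATTERN = re.compile(r"<\|([A-Z]+)\|>")


def split_tagged_text(text):
    """Split text into ("text", str) and ("event", tag) segments, in order."""
    segments = []
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        tag = match.group(1)
        if tag not in PARALINGUISTIC_TAGS:
            continue  # unknown tags stay in the text as literal characters
        before = text[cursor:match.start()].strip()
        if before:
            segments.append(("text", before))
        segments.append(("event", tag))
        cursor = match.end()
    tail = text[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments
```

For example, `split_tagged_text("That is hilarious <|LAUGH|> sorry, go on.")` yields a text segment, a LAUGH event, and a second text segment, which preserves the position of the event relative to the surrounding words.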

Dialect Speech Synthesis

PilotTTS supports 13 Chinese dialects with 91.8% same-dialect accuracy. The following dialects are supported:

Same-Dialect Synthesis (Dialect → Dialect)

Using a dialect prompt audio as reference, the model generates speech in the same dialect while preserving speaker characteristics.
Dialect | Prompt Audio | Text | Generated Speech

Mandarin-to-Dialect Synthesis

Using a Mandarin prompt audio as reference, the model extracts speaker timbre and generates speech in the target dialect.
Dialect | Prompt Audio | Text | Generated Speech

Cross-Dialect Synthesis

Using a prompt audio in one dialect, the model generates speech in a different target dialect while preserving speaker identity.
Source → Target | Prompt Audio | Text | Generated Speech
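
The three dialect scenarios above differ only in how the language variety of the prompt audio relates to the target variety. The sketch below makes that relationship explicit. The function name, the label strings, and the variety names used in the usage note are illustrative placeholders, not identifiers from PilotTTS or a confirmed list of its supported dialects.

```python
def synthesis_mode(prompt_variety, target_variety, mandarin="mandarin"):
    """Name the demo scenario for a (prompt, target) pair of language varieties."""
    prompt = prompt_variety.strip().lower()
    target = target_variety.strip().lower()
    if target == mandarin:
        # A Mandarin target is not one of the three dialect scenarios above.
        return "mandarin-target"
    if prompt == target:
        return "same-dialect"
    if prompt == mandarin:
        return "mandarin-to-dialect"
    return "cross-dialect"
```

With placeholder names, `synthesis_mode("cantonese", "cantonese")` gives "same-dialect", `synthesis_mode("mandarin", "cantonese")` gives "mandarin-to-dialect", and `synthesis_mode("sichuanese", "cantonese")` gives "cross-dialect". In all three cases the prompt supplies the speaker characteristics, while the target variety determines the dialect of the generated speech.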