ARCHI-TTS: An Architecturally Refined Text-to-Speech Model with Self-supervised Semantic Aligner and Accelerated Inference


Code and paper will be released soon.

Abstract

Text-to-speech (TTS) synthesis converts written text into natural-sounding speech. We propose ARCHI-TTS, a flow-matching-based TTS system comprising (i) a self-supervised semantic aligner that uses a single learnable mask embedding to learn text-aligned semantic representations, (ii) modernized training dynamics for TTS synthesis that yield fast convergence, and (iii) a separated encoder-decoder architecture in the Diffusion Transformer (DiT) that, as a by-product, accelerates inference without distillation. Combining these refinements to the model and training design, ARCHI-TTS achieves a word error rate (WER) of 1.98% on LibriSpeech-PC test-clean, and 1.47% and 1.42% on the SeedTTS test-en and test-zh sets, respectively. Our model partly outperforms state-of-the-art (SOTA) TTS models that were trained with much larger computational resources, whereas ARCHI-TTS is trained on a 100k multilingual dataset using 8 RTX 5090 GPUs for 4 days.

This page is for research demonstration purposes only.

Overview

ARCHI-TTS Architecture Diagram

Figure 1: An overview of ARCHI-TTS. Based on a text-guided speech-infilling task, we introduce a novel semantic aligner to project text embeddings into semantic representations of speech.

All samples on this demo page are generated by ARCHI-TTS with NFE=32, CFG=3.0, and timeshift=3.0 in a single attempt. Sampled latents are decoded with our pretrained VAE decoder.
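To make the three sampling settings concrete, here is a minimal sketch of how they might interact in a generic flow-matching Euler sampler. It is not the ARCHI-TTS implementation. The function names, the warp formula t' = t / (t + shift * (1 - t)), the classifier-free guidance form, and the reading of NFE as the number of Euler steps are all assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical model signature: (latent, time, use_text_condition) -> velocity.
Velocity = Callable[[List[float], float, bool], List[float]]


def shifted_timesteps(nfe: int, shift: float) -> List[float]:
    """Return nfe + 1 time points from t=0 (noise) to t=1 (data)."""
    ts = []
    for i in range(nfe + 1):
        t = i / nfe
        # Assumed warp: identity at shift=1; shift>1 places more steps near t=0.
        ts.append(t / (t + shift * (1.0 - t)))
    return ts


def guided_velocity(model: Velocity, x: List[float], t: float, cfg: float) -> List[float]:
    """Assumed guidance form: v_uncond + cfg * (v_cond - v_uncond)."""
    v_cond = model(x, t, True)
    v_uncond = model(x, t, False)
    return [u + cfg * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(model: Velocity, x0: List[float], nfe: int = 32,
                 cfg: float = 3.0, shift: float = 3.0) -> List[float]:
    """Integrate the guided velocity field with nfe Euler steps."""
    ts = shifted_timesteps(nfe, shift)
    x = list(x0)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = guided_velocity(model, x, t, cfg)
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return x
```

In this sketch each Euler step queries the model twice (with and without the text condition), so the number of forward passes is twice the step count. Whether the reported NFE counts steps or forward passes is not stated on this page.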

Zero-shot Generation

Prompts and texts are from the Seed-TTS demo page.

Language Prompt Same Language Generation Cross-lingual Generation
EN

I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.

顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:“无论如何,都要冷静下来,重新开始。”

Perhaps they are driven by the delicious blend of flavors, or it could be the appealing visual presentation. At the end of the day, our choices in food reflect our personal preferences and sometimes, even our lifestyle or belief system.

我抬起头,坚定地说:“身高不能决定一切,这世界在看我,我更看得到世界。无论是北上广,或是别的什么,我都将以我自己的方式去攀爬,去追逐。我可能小,但我绝不会被忽视。”

Your safety and the pack's reputation are at stake. Your bravery is admirable, but sometimes bravery is knowing when to retreat. Please, consider returning with me. We can work out a plan, but only if you're willing to listen.

你的安全以及族群的声誉都危在旦夕。你的勇敢令人钦佩,但有时候勇敢在于懂得何时撤退。拜托,考虑一下和我一起回去吧。我们可以制定一个计划,但前提是你愿意倾听。

ZH

突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"

Suddenly, there was a burst of laughter beside me. I looked at them, stood up straight with high spirit, shook the slightly fleshy arms, and smiled lightly, saying, "The flesh on my body is to hide my bursting charm. Otherwise, wouldn't it scare you?"

顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:“无论如何,都要冷静下来,重新开始。”

Suddenly, the atmosphere became gloomy. At first glance, all the troubles seemed to surround me. I frowned, feeling that pressure, but I know I can't give up, can't admit defeat. So, I took a deep breath, and the voice in my heart told me, "Anyway, must calm down and start again."

皇上的面色未变,宛如雕塑般静止,他的眼中闪过一丝动人的温度。他深深地看了那位忠心耿耿的臣子一眼,终于开口:“诺,我会再考虑考虑的。”他的声音低沉且坚定,留下空气中隐隐的无奈与柔情。

The emperor's complexion did not change, remaining as still as a sculpture, and a touch of touching warmth flashed in his eyes. He deeply glanced at the loyal minister, and finally spoke: "Well, I will consider it again." His voice was low and firm, leaving a faint hint of helplessness and tenderness in the air.

Code-switched samples are from the FireRedTTS demo page.

Prompt Text Code-Switched Generation

他今天的mood看起来不太好,可能需要一些space。

这次旅行的schedule有点tight,我们需要plan得更efficient一些。

我觉得我们需要一个更clear的strategy来实现我们的goals。

你昨天的performance真是outstanding,完全展示了你的skills。

Speed Control Generation

The first two prompt-text pairs are from the E2TTS demo page; the last two are from the Seed-TTS demo page.

Similar to F5-TTS, ARCHI-TTS predicts the duration of the generated speech from the number of characters in the target text, scaled by the average duration per character in the prompt.
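The duration rule described above can be sketched as follows. This is an illustration rather than the released code: the function name, the use of latent frames as the length unit, the truncation with `int`, and dividing by the speed factor to obtain the 0.7x / 1.0x / 1.3x variants are assumptions.

```python
def estimate_total_frames(prompt_text: str, prompt_frames: int,
                          target_text: str, speed: float = 1.0) -> int:
    """Estimate prompt + generated length (in frames) from character counts."""
    if not prompt_text:
        raise ValueError("prompt_text must be non-empty")
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Average number of frames spent on each character of the prompt.
    frames_per_char = prompt_frames / len(prompt_text)
    # Assumed speed control: a higher speed yields proportionally fewer frames.
    target_frames = int(frames_per_char * len(target_text) / speed)
    return prompt_frames + target_frames


# A 20-character prompt spanning 200 frames gives 10 frames per character.
prompt, target = "a" * 20, "b" * 30
for speed in (0.7, 1.0, 1.3):
    print(speed, estimate_total_frames(prompt, 200, target, speed))
```

Because the estimate depends only on character counts, it treats every character as equally long, which is a coarse approximation for mixed Chinese and English text.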

Prompt Text 0.7x Speed 1.0x Speed 1.3x Speed

He gave way to the others very readily and retreated unperceived by the Squire and Mistress Fitzooth to the rear of the tent.

“How cheerfully he seems to grin, How neatly spread his claws, And welcome little fishes in With gently smiling jaws”!

好呀,哈哈哈哈哈,喜欢笑的人运气都不会差哦,希望你每天笑口常开~

顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:“无论如何,都要冷静下来,重新开始。”

Hard Sentences Generation

Hard sentences are from the Seed-TTS and ELLA-V demo pages.

Prompt Text Hard sentences generation

针蓝线蓝领子蓝,蓝针蓝线蓝领蓝。蓝针蓝线连蓝领,针蓝线蓝领子蓝。

墙上画凤凰,凤凰画在粉红墙。红凤凰、粉凤凰,红粉凤凰、花凤凰。红凤凰,黄凤凰,红粉凤凰,粉红凤凰,花粉花凤凰。

随后,民警还在店里发现一把锤子锤子锤子锤子锤子锤子。

Active artists always appreciate artistic achievements and applaud awesome artworks.

Brave bakers boldly baked big batches of brownies in beautiful bakeries.

Daring dancers dazzled during dynamic dance displays, drawing delighted crowds.