TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech

Donghyun Seong1, Hoyoung Lee2, and Joon-Hyuk Chang1

1 Department of Electronic Engineering, Hanyang University, Seoul, Republic of Korea
2 Department of Artificial Intelligence Application, Hanyang University, Seoul, Republic of Korea

Abstract

Expressive text-to-speech (TTS) aims to synthesize more human-like speech by incorporating diverse speech styles or emotions. Most expressive TTS models rely on reference speech to condition the style of the generated speech, but they often fail to generate speech of consistent quality. To ensure consistent speech quality, we propose an expressive TTS model conditioned on a style representation extracted from the text itself. To implement this text-based style predictor, we design a style module incorporating residual vector quantization. Furthermore, the style representation is enhanced through style-to-text alignment and a mel decoder with style hierarchical layer normalization (SHLN). Our experimental findings demonstrate that the proposed model accurately estimates the style representation, enabling the generation of high-quality speech without the need for reference speech.
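
For readers unfamiliar with residual vector quantization, the following is a minimal, generic sketch of the technique and not the style module described in the paper. The codebooks, their sizes, the two-dimensional vectors, and the function names (`nearest`, `rvq_encode`, `rvq_decode`) are all illustrative assumptions. Each stage picks the codebook entry nearest to the residual left by the previous stage, and decoding sums the selected entries.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sqdist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage quantizes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next stage sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the quantized vector by summing the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a finer stage.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.25, 0.25], [0.25, -0.25], [-0.25, 0.25], [-0.25, -0.25]],
]

style_vector = [0.8, 0.3]  # stand-in for a style embedding
indices = rvq_encode(style_vector, codebooks)
quantized = rvq_decode(indices, codebooks)
print(indices, quantized)
```

In a trained model the codebooks would be learned rather than fixed by hand, and a text-based predictor could plausibly be trained to output the per-stage indices or the quantized vector; that mapping is an assumption here and is not shown.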

Contents

Model Architecture


Fig. 1. The overall architecture of TSP-TTS.



Evaluation on different expressive TTS models


4.1 Naturalness, 4.2 Ablation study

Angry

Text (Kr): 고모는 조카딸의 품행 나쁜 것을 밉게 보았던 모양이지.
Translation: The aunt must have hated her niece's bad behavior.

GT M1 M2 M3 M4 M5 M6 M7 M8

Anxiety

Text (Kr): 물건을 팔려면 노력을 해야 해요.
Translation: You have to make an effort to sell things.

GT M1 M2 M3 M4 M5 M6 M7 M8

Embarrassment

Text (Kr): 이 많은 선물은 다 뭐예요?
Translation: What are all these presents?

GT M1 M2 M3 M4 M5 M6 M7 M8

Hurt

Text (Kr): 나는 온몸이 아파 자는 것은 생각도 할 수 없었다.
Translation: I couldn't even think of sleeping because my whole body was hurting.

GT M1 M2 M3 M4 M5 M6 M7 M8

Joy

Text (Kr): 어깨를 나란히 하고 세계를 형성하려는 정의에 감격했다.
Translation: I was moved by the justice that stood shoulder to shoulder and shaped the world.

GT M1 M2 M3 M4 M5 M6 M7 M8

Neutrality

Text (Kr): 그는 용감히 바닷가로 뛰어간다, 이따금 뒤를 돌아보면서.
Translation: He bravely runs to the beach, looking back from time to time.

GT M1 M2 M3 M4 M5 M6 M7 M8

Sadness

Text (Kr): 돌아가진 아빠는 자식을 예뻐하는 게 당연하다고 말하는 사람이었습니다.
Translation: My late father was a person who said that it was natural to cherish children.

GT M1 M2 M3 M4 M5 M6 M7 M8



4.3 Unseen data

Angry

Text (Kr): 나보다도 모르는 게 말이 돼?
Translation: Does it make sense that you know even less than I do?

M1 M2 M3 M5

Anxiety

Text (Kr): 손을 뿌리치고 소리치며 달려가 험상궂게 생긴 사람들을 데리고 왔습니다.
Translation: They shook off the hand, ran off shouting, and brought back some fierce-looking people.

M1 M2 M3 M5

Embarrassment

Text (Kr): 제 동생이 아파 동생 눈을 살펴보셨어요.
Translation: My younger brother was sick, so he looked into his eyes.

M1 M2 M3 M5

Hurt

Text (Kr): 기계방아에 대한 원한이 영감의 가슴속에 불을 부어주는 것이었다.
Translation: Resentment against the mechanical mill poured fire into the old man's heart.

M1 M2 M3 M5

Joy

Text (Kr): 동생 뼈는 아주 튼튼하구나!
Translation: Your brother's bones are very strong!

M1 M2 M3 M5

Neutrality

Text (Kr): 속으로는 더 이상 쓸 이야깃거리가 없다고 생각하고 있었습니다.
Translation: Inside, I thought there was nothing more to write about.

M1 M2 M3 M5

Sadness

Text (Kr): 언니와 나는 둘이서 침실에 있었습니다.
Translation: My sister and I were in the bedroom together.

M1 M2 M3 M5