TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech
Abstract
Expressive text-to-speech (TTS) aims to synthesize more human-like speech by incorporating diverse speech styles or emotions. Most expressive TTS models rely on reference speech to condition the style of the generated speech, but they often fail to generate speech of consistent quality. To ensure consistent speech quality, we propose an expressive TTS model conditioned on a style representation extracted from the text itself. To implement this text-based style predictor, we design a style module incorporating residual vector quantization. Furthermore, the style representation is enhanced through style-to-text alignment and a mel decoder with style hierarchical layer normalization (SHLN). Our experiments show that the proposed model accurately estimates the style representation, enabling the generation of high-quality speech without reference speech.
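For readers unfamiliar with residual vector quantization, the following is a minimal, generic sketch of the technique and not the style module used in the proposed model. The function names (`nearest`, `rvq_encode`, `rvq_decode`), the vector size, the number of stages, and the randomly initialized codebooks are all illustrative assumptions; in a trained system the codebooks would be learned.

```python
import random


def nearest(codebook, vec):
    """Return (index, codeword) of the entry closest to vec in squared L2 distance."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best, codebook[best]


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage codes the residual left by earlier stages."""
    residual = list(vec)
    quantized = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx, code = nearest(codebook, residual)
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, quantized


def rvq_decode(indices, codebooks):
    """Rebuild the quantized vector by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical sizes and random codebooks, for illustration only.
rng = random.Random(0)
dim, num_stages, codebook_size = 4, 3, 8
codebooks = [
    [[rng.gauss(0, 1.0 / (stage + 1)) for _ in range(dim)] for _ in range(codebook_size)]
    for stage in range(num_stages)
]
style = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for a style vector

indices, quantized = rvq_encode(style, codebooks)
reconstructed = rvq_decode(indices, codebooks)
```

Here `indices` holds one code per stage, and summing the selected codewords in `rvq_decode` reproduces the same quantized vector that the encoder accumulated.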
Contents
Model Architecture
Evaluation on different expressive TTS models
- GT: Ground-truth
- M1: Tacotron2-GST
- M2: StyleSpeech
- M3: Meta-StyleSpeech
- M4: FastSpeech2-GST
- M5: Proposed model
- M6: Proposed model (w/o style-to-text alignment)
- M7: Proposed model (w/o SHLN)
- M8: Proposed model (w/ SALN, w/o SHLN)
4.1 Naturalness, 4.2 Ablation study
Angry
Text (Kr): 고모는 조카딸의 품행 나쁜 것을 밉게 보았던 모양이지.
Translation: The aunt must have hated her niece's bad behavior.
GT | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
---|---|---|---|---|---|---|---|---|
Anxiety
Text (Kr): 물건을 팔려면 노력을 해야 해요.
Translation: You have to make an effort to sell things.
GT | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
---|---|---|---|---|---|---|---|---|
Embarrassment
Text (Kr): 이 많은 선물은 다 뭐예요?
Translation: What are all these presents?
GT | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
---|---|---|---|---|---|---|---|---|
Hurt
Text (Kr): 나는 온몸이 아파 자는 것은 생각도 할 수 없었다.
Translation: I couldn't even think of sleeping because my whole body was hurting.
GT | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
---|---|---|---|---|---|---|---|---|
Joy
Text (Kr): 어깨를 나란히 하고 세계를 형성하려는 정의에 감격했다.
Translation: I was moved by the justice that stood shoulder to shoulder and shaped the world.
GT | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
---|---|---|---|---|---|---|---|---|
Neutrality
Text (Kr): 그는 용감히 바닷가로 뛰어간다, 이따금 뒤를 돌아보면서.
Translation: He bravely runs to the beach, looking back from time to time.
GT | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
---|---|---|---|---|---|---|---|---|
Sadness
Text (Kr): 돌아가진 아빠는 자식을 예뻐하는 게 당연하다고 말하는 사람이었습니다.
Translation: My late father was a person who said that it was natural to cherish children.
GT | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
---|---|---|---|---|---|---|---|---|
4.3 Unseen data
Angry
Text (Kr): 나보다도 모르는 게 말이 돼?
Translation: Does it make sense that you know even less than I do?
M1 | M2 | M3 | M5 |
---|---|---|---|
Anxiety
Text (Kr): 손을 뿌리치고 소리치며 달려가 험상궂게 생긴 사람들을 데리고 왔습니다.
Translation: They shook off the hand, ran off shouting, and came back with some fierce-looking people.
M1 | M2 | M3 | M5 |
---|---|---|---|
Embarrassment
Text (Kr): 제 동생이 아파 동생 눈을 살펴보셨어요.
Translation: My younger brother was sick, so he looked into his eyes.
M1 | M2 | M3 | M5 |
---|---|---|---|
Hurt
Text (Kr): 기계방아에 대한 원한이 영감의 가슴속에 불을 부어주는 것이었다.
Translation: Resentment against the mechanical mill poured fire into the old man's heart.
M1 | M2 | M3 | M5 |
---|---|---|---|
Joy
Text (Kr): 동생 뼈는 아주 튼튼하구나!
Translation: Your brother's bones are very strong!
M1 | M2 | M3 | M5 |
---|---|---|---|
Neutrality
Text (Kr): 속으로는 더 이상 쓸 이야깃거리가 없다고 생각하고 있었습니다.
Translation: Inside, I thought there was nothing more to write about.
M1 | M2 | M3 | M5 |
---|---|---|---|
Sadness
Text (Kr): 언니와 나는 둘이서 침실에 있었습니다.
Translation: My sister and I were in the bedroom together.
M1 | M2 | M3 | M5 |
---|---|---|---|