Designing a large recording script for open-domain English speech synthesis

  • Kim, Sunhee (Department of French Language Education, Seoul National University) ;
  • Kim, Hojeong (Department of Foreign Language Education, Seoul National University) ;
  • Lee, Yooseop (Department of French Language Education, Seoul National University) ;
  • Kim, Boryoung (Department of French Language Education, Seoul National University) ;
  • Won, Yongkook (Center for Educational Research, Seoul National University) ;
  • Kim, Bongwan (Kakao Enterprise Corp.)
  • Received : 2021.07.31
  • Accepted : 2021.09.09
  • Published : 2021.09.30

Abstract

This paper proposes a method for designing a large recording script for open-domain English speech synthesis. For read-aloud style text, 12 domains and 294 sub-domains were designed using text from five different news media publications. For conversational style text, 4 domains and 36 sub-domains were designed using movie subtitles. The final script consists of 43,013 sentences (27,085 read-aloud style and 15,928 conversational style), comprising 549,683 tokens and 38,356 types. The completed script is analyzed using four criteria: word coverage (type coverage and token coverage), high-frequency vocabulary coverage, phonetic coverage (diphone coverage and triphone coverage), and readability. The type coverage of the script reaches 36.86% despite its low token coverage of 2.97%. The high-frequency vocabulary coverage of the script is 73.82%, and the diphone and triphone coverage of the whole script are 86.70% and 38.92%, respectively. The average readability score across all sentences is 9.03. The results of the analysis show that the proposed method is effective in producing a large recording script for English speech synthesis, demonstrating good coverage of unique words, high-frequency vocabulary, and phonetic units, as well as appropriate readability.
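
As a reading aid, the sketch below illustrates how the abstract's coverage criteria might be computed in Python. It is a minimal illustration under stated assumptions rather than the authors' published procedure: tokenization is naive whitespace splitting, grapheme-to-phoneme conversion uses the open-source g2p_en package, and the denominator for phonetic coverage is every n-gram constructible from the phone inventory. The readability measure behind the 9.03 average is likewise not specified here; a grade-level formula such as Flesch-Kincaid (e.g., textstat.flesch_kincaid_grade from the textstat package) would be one conventional choice.

    # Minimal sketch of the word-coverage and phonetic-coverage criteria
    # described in the abstract. Assumptions (not from the paper): whitespace
    # tokenization, the open-source g2p_en package for grapheme-to-phoneme
    # conversion, stress digits stripped from phones, and "all n-grams over
    # the phone inventory" as the coverage denominator.
    from collections import Counter
    from g2p_en import G2p  # pip install g2p_en

    g2p = G2p()

    def word_coverage(script_tokens, reference_tokens):
        """Type coverage: fraction of reference types found in the script.
        Token coverage: fraction of reference tokens whose type is in the script."""
        script_types = set(script_tokens)
        ref_counts = Counter(reference_tokens)
        type_cov = len(script_types & ref_counts.keys()) / len(ref_counts)
        token_cov = (sum(c for t, c in ref_counts.items() if t in script_types)
                     / sum(ref_counts.values()))
        return type_cov, token_cov

    def phones(sentence):
        """ARPAbet phones for one sentence; spaces and punctuation dropped,
        stress digits stripped (an assumption; the paper may keep stress)."""
        return [p.rstrip("012") for p in g2p(sentence)
                if p.strip() and p[0].isalpha()]

    def phonetic_coverage(sentences, inventory, n=2):
        """Diphone (n=2) or triphone (n=3) coverage: attested phone n-grams
        divided by all n-grams constructible from the inventory."""
        attested = set()
        for s in sentences:
            seq = phones(s)
            attested.update(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
        return len(attested) / len(inventory) ** n

    # Toy example; a real run would use the script's 43,013 sentences, a large
    # reference corpus, and the full 39-phone ARPAbet inventory.
    script = ["The cat sat on the mat.", "Where are you going tonight?"]
    reference = "the cat chased the dog around the mat".split()
    tokens = [t.lower().strip(".,?!") for s in script for t in s.split()]
    print(word_coverage(tokens, reference))
    inventory = sorted({p for s in script for p in phones(s)})
    print(phonetic_coverage(script, inventory, n=2))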

Keywords

Acknowledgement

This work was supported by the Kakao Enterprise Corporation.
