Comparison of Korean Real-time Text-to-Speech Technology Based on Deep Learning

  • Kwon, Chul Hong (Dept. of Information, Communication, Electronics Engineering, Daejeon Univ.)
  • Received : 2021.01.06
  • Accepted : 2021.01.19
  • Published : 2021.02.28

Abstract

A deep learning based end-to-end TTS system consists of two modules: a Text2Mel module, which generates a mel spectrogram from text, and a vocoder module, which synthesizes a speech signal from the spectrogram. By applying deep learning technology to TTS systems, the intelligibility and naturalness of synthesized speech have recently improved to a level comparable to human vocalization. However, such systems have the disadvantage that their inference speed for synthesizing speech is very slow compared to conventional methods. Inference speed can be improved by applying a non-autoregressive method, which generates speech samples in parallel, independent of previously generated samples. In this paper, we introduce FastSpeech, FastSpeech 2, and FastPitch as Text2Mel technologies, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as vocoder technologies that apply the non-autoregressive method, and we implement them to verify whether they can run in real time. The real-time factors (RTF) obtained in our experiments show that all of the presented methods are fully capable of real-time processing. Moreover, except for WaveGlow, the trained models are only tens to hundreds of megabytes in size, so they can be deployed in embedded environments where memory is limited.
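To make the two-stage pipeline and the real-time criterion concrete, the sketch below mirrors the structure described above in Python. The text2mel and vocoder functions are hypothetical stubs standing in for trained models such as FastSpeech and Parallel WaveGAN, and the sample rate and hop length are assumed values; the RTF is computed in the usual way, as synthesis time divided by the duration of the synthesized audio, so RTF < 1 indicates faster-than-real-time synthesis.

```python
import time
import numpy as np

SAMPLE_RATE = 22050   # assumed sampling rate of the synthesized speech
HOP_LENGTH = 256      # assumed hop size: waveform samples per mel frame

def text2mel(text: str) -> np.ndarray:
    """Hypothetical stand-in for a Text2Mel model
    (FastSpeech, FastSpeech 2, or FastPitch)."""
    n_frames = 20 * max(len(text), 1)        # placeholder length estimate
    return np.zeros((80, n_frames), dtype=np.float32)

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a vocoder
    (Parallel WaveGAN, Multi-band MelGAN, or WaveGlow)."""
    return np.zeros(mel.shape[1] * HOP_LENGTH, dtype=np.float32)

def synthesize_with_rtf(text: str):
    """Run the two-stage pipeline and measure the real-time factor."""
    start = time.perf_counter()
    mel = text2mel(text)                     # stage 1: text -> mel spectrogram
    wav = vocoder(mel)                       # stage 2: mel spectrogram -> waveform
    elapsed = time.perf_counter() - start
    rtf = elapsed / (len(wav) / SAMPLE_RATE) # RTF < 1: faster than real time
    return wav, rtf

wav, rtf = synthesize_with_rtf("Real-time Korean TTS")
print(f"RTF = {rtf:.4f}")
```

An actual system would load the trained checkpoints in place of these stubs; the timing logic around them would be unchanged.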

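The speed advantage of the non-autoregressive approach comes from removing the sample-by-sample dependency of autoregressive models such as WaveNet. The toy functions below are illustrative stand-ins, not the actual networks: the autoregressive version must execute one step per output sample, while the non-autoregressive version maps a noise vector to the whole waveform in a single pass that a GPU can parallelize.

```python
import numpy as np

def autoregressive_decode(n_samples: int, step_fn) -> np.ndarray:
    """WaveNet-style inference: sample t depends on sample t-1,
    so the loop is strictly sequential and cannot be parallelized."""
    wav = np.zeros(n_samples, dtype=np.float32)
    for t in range(1, n_samples):
        wav[t] = step_fn(wav[t - 1])   # one network call per output sample
    return wav

def non_autoregressive_decode(n_samples: int, transform_fn) -> np.ndarray:
    """Parallel WaveGAN / WaveGlow-style inference: the whole waveform is
    produced from a noise vector in one batched, parallelizable pass."""
    z = np.random.randn(n_samples).astype(np.float32)
    return transform_fn(z)             # a single call over all samples

# Toy stand-ins for the learned networks, only to make the sketch runnable.
step_fn = lambda prev: np.float32(0.9 * prev + 0.01)
transform_fn = lambda z: np.tanh(z)

print(autoregressive_decode(22050, step_fn).shape)           # 22050 sequential steps
print(non_autoregressive_decode(22050, transform_fn).shape)  # one parallel pass
```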

References

  1. A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database", Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 373-376, 1996.
  2. T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis", Proceedings of Eurospeech, pp. 2347-2350, 1999.
  3. Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis", arXiv preprint, https://arxiv.org/pdf/1703.10135.pdf, 2017 Apr.
  4. J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions", arXiv preprint, https://arxiv.org/pdf/1712.05884.pdf, 2018 Feb.
  5. N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, M. Zhou, "Neural speech synthesis with transformer network", arXiv preprint, https://arxiv.org/pdf/1809.08895.pdf, 2019 Jan.
  6. Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, T. Y. Liu, "FastSpeech: Fast, robust and controllable text to speech", arXiv preprint, https://arxiv.org/pdf/1905.09263.pdf, 2019 Nov.
  7. A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, "WaveNet: A generative model for raw audio", arXiv preprint, https://arxiv.org/pdf/1609.03499.pdf, 2016 Sep.
  8. C. H. Kwon, "Performance comparison of state-of-the-art vocoder technology based on deep learning in a Korean TTS system", The Journal of the Convergence on Culture Technology (JCCT), Vol. 6, No. 2, pp. 509-514, 2020, https://doi.org/10.17703/JCCT.2020.6.2.509
  9. N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, K. Kavukcuoglu, "Efficient neural audio synthesis", arXiv preprint, https://arxiv.org/pdf/1802.08435.pdf, 2018 Feb.
  10. Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, T. Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech", arXiv preprint, https://arxiv.org/pdf/2006.04558.pdf, 2020 Oct.
  11. A. Lancucki, "FastPitch: Parallel text-to-speech with pitch prediction", arXiv preprint, https://arxiv.org/pdf/2006.06873.pdf, 2020 Jun.
  12. R. Yamamoto, E. Song, J. M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram", arXiv preprint, https://arxiv.org/pdf/1910.11480.pdf, 2020 Feb.
  13. K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, A. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis", arXiv preprint, https://arxiv.org/pdf/1910.06711.pdf, 2019 Oct.
  14. G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, L. Xie, "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech", arXiv preprint, https://arxiv.org/pdf/2005.05106.pdf, 2020 Nov.
  15. R. Prenger, R. Valle, B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis", arXiv preprint, https://arxiv.org/pdf/1811.00002.pdf, 2018 Oct.