A Multi-speaker Speech Synthesis System Using X-vector

x-vector를 이용한 다화자 음성합성 시스템

  • Min-Soo Jo (Soundit Co., Ltd.)
  • Chul-Hong Kwon (Department of Information Communication and Electronics Engineering, Daejeon University)
  • Received : 2021.09.06
  • Accepted : 2021.09.28
  • Published : 2021.11.30

Abstract

With the recent growth of the AI speaker market, the demand for speech synthesis technology that enables natural conversation with users is increasing. This calls for a multi-speaker speech synthesis system that can generate voices of various tones. Synthesizing natural speech requires training on a large-capacity, high-quality speech DB. However, collecting a high-quality, large-capacity speech database uttered by many speakers is very difficult in terms of recording time and cost. It is therefore necessary to train the speech synthesis system on a speech DB covering a very large number of speakers, with only a small amount of training data per speaker, and a technique is needed for naturally expressing the tone and prosody of multiple speakers. In this paper, we propose a method that constructs a speaker encoder by applying the deep learning-based x-vector technique used in speaker recognition, and synthesizes a new speaker's tone from a small amount of data through this speaker encoder. In the multi-speaker speech synthesis system, the module that synthesizes a mel-spectrogram from input text is Tacotron2, and the vocoder that generates the synthesized speech is a WaveNet with a mixture of logistic distributions as its output. The x-vector extracted from the trained speaker embedding neural network is added as an input to Tacotron2 to express the desired speaker's tone.
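
To make the pipeline described above concrete, the following is a minimal sketch in PyTorch. It is an illustration only: the class names (XVectorEncoder, SpeakerConditionedEncoder), layer sizes, and the 512-dimensional embedding are assumptions rather than the paper's exact configuration, and the actual system trains the x-vector network with Kaldi and feeds the predicted mel-spectrogram to the mixture-of-logistics WaveNet vocoder.

```python
# Sketch of (1) a TDNN-style x-vector speaker encoder with statistics pooling and
# (2) speaker conditioning of a Tacotron2-style text encoder by concatenating the
# x-vector to every encoder timestep. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class XVectorEncoder(nn.Module):
    """Frame-level TDNN layers + statistics pooling + segment-level embedding."""

    def __init__(self, n_mels: int = 40, emb_dim: int = 512):
        super().__init__()
        # Frame-level TDNN layers realized as dilated 1-D convolutions over time.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level affine layer; its output is used as the x-vector.
        self.segment = nn.Linear(2 * 1500, emb_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_mels, frames) acoustic features of one utterance
        h = self.frame_layers(feats)
        # Statistics pooling: mean and standard deviation over time give a
        # fixed-length representation regardless of utterance length.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.segment(stats)  # (batch, emb_dim) x-vector


class SpeakerConditionedEncoder(nn.Module):
    """Appends the x-vector to every output frame of a Tacotron2-style text encoder."""

    def __init__(self, text_encoder: nn.Module):
        super().__init__()
        # text_encoder is assumed to map character IDs to (batch, T_text, enc_dim).
        self.text_encoder = text_encoder

    def forward(self, text_ids: torch.Tensor, xvector: torch.Tensor) -> torch.Tensor:
        enc_out = self.text_encoder(text_ids)                  # (batch, T_text, enc_dim)
        spk = xvector.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        # The decoder then attends over speaker-aware encoder states, so the
        # generated mel-spectrogram follows the tone of the embedded speaker.
        return torch.cat([enc_out, spk], dim=-1)               # (batch, T_text, enc_dim + emb_dim)
```

At synthesis time, an x-vector computed from a short enrollment utterance of a new speaker would be passed alongside the input text, which is how the system expresses an unseen speaker's tone from a small amount of data.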

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) Regional University Outstanding Scientist Support Program (NRF-2020R1I1A3052136).

References

  1. C. H. Kwon, "Performance comparison of state-of-the-art vocoder technology based on deep learning in a Korean TTS system", The Journal of the Convergence on Culture Technology (JCCT), Vol. 6, No. 2, pp. 509-514, 2020, DOI:10.17703/JCCT.2020.6.2.509
  2. C. H. Kwon, "Comparison of Korean real-time text-to-speech technology based on deep learning", The Journal of the Convergence on Culture Technology (JCCT), Vol. 7, No. 1, pp. 640-645, 2021, DOI:10.17703/JCCT.2021.7.1.640
  3. M. S. Jo, "A study on a multi-speaker TTS system using speaker embedding", Master's Thesis, Graduate School of Daejeon University, 2021
  4. D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification", Proceedings of the Interspeech 2017, pp. 999-1003, 2017, DOI:10.21437/Interspeech.2017-620
  5. D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018, pp. 5329-5333, 2018, DOI:10.1109/ICASSP.2018.8461375
  6. A. Nagrani, J. S. Chung, A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset", Proceedings of the Interspeech 2017, pp. 2616-2620, 2017
  7. Zeroth-Korean: Korean open source speech corpus, https://www.openslr.org/40/
  8. J. W. Ha, K. H. Nam, J. Kang, et al., "ClovaCall: Korean goal-oriented dialog speech corpus for automatic speech recognition of contact centers", Proceedings of the Interspeech 2020, pp. 409-413, 2020, DOI:10.21437/Interspeech.2020-1136
  9. H. Zen, V. Dang, R. Clark, et al., "LibriTTS: A corpus derived from LibriSpeech for text-to-speech", Proceedings of the Interspeech 2019, pp. 1526-1530, 2019, DOI:10.21437/Interspeech.2019-2441
  10. M. McLaren, L. Ferrer, D. Castan, A. Lawson, "The speakers in the wild (SITW) speaker recognition database", Proceedings of the Interspeech 2016, pp. 818-822, 2016, DOI:10.21437/Interspeech.2016-1129
  11. D. Povey, A. Ghoshal, G. Boulianne, et al., "The Kaldi speech recognition toolkit", Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2011, 2011
  12. T. Hayashi, R. Yamamoto, K. Inoue, et al., "ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020, pp. 7654-7658, 2020, DOI:10.1109/ICASSP40776.2020.9053512
  13. S. Watanabe, T. Hori, S. Karita, et al., "ESPnet: End-to-end speech processing toolkit", Proceedings of the Interspeech 2018, pp. 2207-2211, 2018, DOI:10.21437/Interspeech.2018-1456
  14. T. Ko, V. Peddinti, D. Povey, M. Seltzer, S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2017, pp. 5220-5224, 2017, DOI:10.1109/ICASSP.2017.7953152
  15. D. Snyder, G. Chen, D. Povey, "MUSAN: A music, speech, and noise corpus", arXiv preprint arXiv:1510.08484, Oct. 2015, https://arxiv.org/pdf/1510.08484.pdf
  16. J. Shen, R. Pang, R. J. Weiss, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018, pp. 4779-4783, 2018, DOI: 10.1109/ICASSP.2018.8461368
  17. A. van den Oord, S. Dieleman, H. Zen, et al., "WaveNet: A generative model for raw audio", Proceedings of the 9th ISCA Speech Synthesis Workshop, p. 125, 2016