
Extending StarGAN-VC to Unseen Speakers Using RawNet3 Speaker Representation

  • Received : 2023.04.28
  • Accepted : 2023.06.19
  • Published : 2023.07.31

Abstract

Voice conversion is a technology that regenerates an individual's speech data with the acoustic properties (tone, cadence, gender) of another speaker, and it has countless applications in education, communication, and entertainment. This paper proposes an approach based on the StarGAN-VC model that generates realistic-sounding speech without requiring parallel utterances. To overcome the constraint of the existing StarGAN-VC model, which relies on one-hot vectors encoding fixed source and target speaker identities, the proposed method extracts target-speaker feature vectors with a pre-trained RawNet3. Conversion is thus performed in a latent space without direct speaker-to-speaker mappings, extending the many-to-many setting to an any-to-any structure. In addition to the loss terms of the original StarGAN-VC model, the Wasserstein-1 distance is used as a loss term to ensure that generated voice segments match the acoustic properties of the target voice, and the Two Time-Scale Update Rule (TTUR) is adopted for stable training. Experimental results with the evaluation metrics presented in this paper show quantitatively that, compared with the original StarGAN-VC, which supports only a limited set of conversions, the proposed method delivers improved voice conversion across a variety of speakers.
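The sketch below illustrates, in simplified form, the three ingredients described in the abstract: conditioning the generator on a continuous speaker embedding (such as a pre-trained RawNet3 would provide) instead of a one-hot speaker code, training a critic with a gradient-penalty approximation of the Wasserstein-1 distance, and applying TTUR by giving the critic a larger learning rate than the generator. This is a minimal illustration, not the authors' implementation; the network sizes, feature shapes, learning rates, and variable names are assumptions, and the full model would keep the remaining StarGAN-VC loss terms.

import torch
import torch.nn as nn

EMB_DIM = 256                 # size of a RawNet3-style speaker embedding (assumed)
N_MELS, FRAMES = 36, 128      # acoustic feature shape (assumed)

class Generator(nn.Module):
    """Maps source features to the target voice, conditioned on a continuous
    speaker embedding rather than a fixed one-hot speaker code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_MELS * FRAMES + EMB_DIM, 512), nn.ReLU(),
            nn.Linear(512, N_MELS * FRAMES),
        )

    def forward(self, feats, spk_emb):
        h = torch.cat([feats.flatten(1), spk_emb], dim=1)
        return self.net(h).view(-1, N_MELS, FRAMES)

class Critic(nn.Module):
    """Scores (features, embedding) pairs; its loss approximates Wasserstein-1."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_MELS * FRAMES + EMB_DIM, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, feats, spk_emb):
        return self.net(torch.cat([feats.flatten(1), spk_emb], dim=1))

def gradient_penalty(critic, real, fake, spk_emb):
    """WGAN-GP term that softly enforces the critic's 1-Lipschitz constraint."""
    alpha = torch.rand(real.size(0), 1, 1)
    mixed = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(critic(mixed, spk_emb).sum(), mixed, create_graph=True)
    return ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

G, D = Generator(), Critic()
# TTUR: the critic is updated with a larger learning rate than the generator.
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.5, 0.9))

# One illustrative step on random tensors standing in for source features,
# target features, and a target-speaker embedding from a pre-trained encoder.
src, tgt = torch.randn(8, N_MELS, FRAMES), torch.randn(8, N_MELS, FRAMES)
emb = torch.randn(8, EMB_DIM)

fake = G(src, emb).detach()
d_loss = D(fake, emb).mean() - D(tgt, emb).mean() + 10.0 * gradient_penalty(D, tgt, fake, emb)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

g_loss = -D(G(src, emb), emb).mean()                  # adversarial term only; the full
opt_g.zero_grad(); g_loss.backward(); opt_g.step()    # model adds the other losses

In the actual system the embedding would be computed by the pre-trained RawNet3 encoder from an utterance of the target speaker, which is what allows conversion to speakers never seen during training.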

Keywords

References

  1. Y. Wang, P. Getreuer, T. Hughes, R. F. Lyon, and R. A. Saurous, "Trainable frontend for robust and far-field keyword spotting," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5670-5674, 2017.
  2. W. Fan, X. Xu, B. Cai, and X. Xing, "ISNet: Individual standardization network for speech emotion recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.30, pp.1803-1814, 2022. https://doi.org/10.1109/TASLP.2022.3171965
  3. O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3156-3164, 2015.
  4. P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, "Hierarchical recurrent neural encoder for video representation with application to captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1029-1038, 2016.
  5. D. Amodei et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, pp.173-182, 2016.
  6. A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "Wav2Vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, Vol.33, pp.12449-12460, 2020.
  7. J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4779-4783, 2018.
  8. Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, "FastSpeech: Fast, robust and controllable text to speech," Advances in Neural Information Processing Systems, Vol.32, 2019.
  9. J. Yeung and G. Bae, Forever young, beautiful and scandal-free: The rise of South Korea's virtual influencers [Internet], https://edition.cnn.com/style/article/south-korea-virtual-influencers-beauty-social-media-intl-hnk-dst/index.html
  10. J. Zong, C. Lee, A. Lundgard, J. W. Jang, D. Hajas, and A. Satyanarayan, "Rich screen reader experiences for accessible data visualization," Computer Graphics Forum, Vol.41, No.3, 2023.
  11. K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," in International Conference on Machine Learning, pp.5210-5219, 2019.
  12. S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, "Voice conversion using artificial neural networks," in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.3893-3896, 2009.
  13. B. Sisman, J. Yamagishi, S. King, and H. Li, "An overview of voice conversion and its challenges: From statistical modeling to deep learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.29, pp.132-157, 2020. https://doi.org/10.1109/TASLP.2020.3038524
  14. D. A. Reynolds, "Gaussian mixture models," Encyclopedia of Biometrics, Vol.741, pp.659-663, 2009. https://doi.org/10.1007/978-0-387-73003-5_196
  15. S. Mobin and J. Bruna, "Voice conversion using convolutional neural networks," arXiv preprint, 2016. [Internet], https://arxiv.org/abs/1610.08927
  16. J. Lai, B. Chen, T. Tan, S. Tong, and K. Yu, "Phone-aware LSTM-RNN for voice conversion," in 2016 IEEE 13th International Conference on Signal Processing, pp.177-182, 2016.
  17. A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, "Generative adversarial networks: An overview," IEEE Signal Processing Magazine, Vol.35, No.1, pp.53-65, 2018. https://doi.org/10.1109/MSP.2017.2765202
  18. M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint, 2014. [Internet], https://arxiv.org/abs/1411.1784
  19. J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, pp.2223-2232, 2017.
  20. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of Wasserstein GANs," Advances in Neural Information Processing Systems, Vol.30, 2017.
  21. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," Advances in Neural Information Processing Systems, Vol.30, 2017.
  22. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in 2018 IEEE Spoken Language Technology Workshop (SLT), pp.266-273, 2018.
  23. Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.8789-8797, 2018.
  24. M. S. Al-Radhi, T. G. Csapo, and G. Nemeth, "Parallel voice conversion based on a continuous sinusoidal model," in 2019 International Conference on Speech Technology and Human-Computer Dialogue, pp.1-6, 2019.
  25. Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, "Nonparallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5274-5278, 2018.
  26. W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, "Pretraining techniques for sequence-to-sequence voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.29, pp.745-755, 2021. https://doi.org/10.1109/TASLP.2021.3049336
  27. S. Lee, B. Ko, K. Lee, I. C. Yoo, and D. Yook, "Many-to-many voice conversion using conditional cycle-consistent adversarial networks," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.6279-6283, 2020.
  28. J. W. Jung, Y. J. Kim, H. S. Heo, B. J. Lee, Y. Kwon, and J. S. Chung, "Pushing the limits of raw waveform speaker recognition," in Proceedings of Interspeech, pp.2228-2232, 2022.
  29. M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, Vol.E99-D, No.7, pp.1877-1884, 2016. https://doi.org/10.1587/transinf.2015EDP7457
  30. E. O. Brigham and R. E. Morrow, "The fast Fourier transform," IEEE Spectrum, Vol.4, No.12, pp.63-70, 1967. https://doi.org/10.1109/MSPEC.1967.5217220
  31. M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein Generative Adversarial Networks," in Proceedings of the 34th International Conference on Machine Learning, Vol.70, pp.214-223, 2017.
  32. S. H. Gao, M. M. Cheng, K. Zhao, and X. W. Hu, "Res2Net: A new multi-scale backbone architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.43, No.2, pp.652-662, 2019. https://doi.org/10.1109/TPAMI.2019.2938758
  33. J. W. Park, S. B. Kim, H. J. Shim, J. H. Kim, and H. J. Yu, "Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms," in Proceedings of Interspeech, pp.1496-1500, 2020.
  34. B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proceedings of Interspeech, pp.3830-3834, 2020.
  35. J. Hu, L. Shen, and G. Sun, "Squeeze-and-Excitation Networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.7132-7141, 2018. 
  36. J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2018), pp.195-202, 2018.
  37. K. Zhou, B. Sisman, R. Liu, and H. Li, "Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.920-924, 2021.