Many-to-many voice conversion experiments using a Korean speech corpus

Multi-speaker Korean voice conversion experiments

  • Dongsuk Yook (Artificial Intelligence Laboratory, Department of Computer Science and Engineering, Korea University) ;
  • Hyungjin Seo (Artificial Intelligence Laboratory, Department of Computer Science and Engineering, Korea University) ;
  • Bonggu Ko (Artificial Intelligence Laboratory, Department of Computer Science and Engineering, Korea University) ;
  • In-Chul Yoo (Artificial Intelligence Laboratory, Department of Computer Science and Engineering, Korea University)
  • Received : 2022.03.16
  • Accepted : 2022.05.13
  • Published : 2022.05.31

Abstract

Recently, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs) have been applied to voice conversion, making it possible to train with non-parallel data. In particular, the Conditional Cycle-Consistent Generative Adversarial Network (CC-GAN) and the Cycle-Consistent Variational AutoEncoder (CycleVAE) show promising results in many-to-many voice conversion among multiple speakers. However, previous voice conversion studies using CC-GANs and CycleVAEs have involved relatively small numbers of speakers. In this paper, we extend the number of speakers to 100 and experimentally analyze the performance of these many-to-many voice conversion methods. The experiments show that CC-GAN achieves 4.5 % lower Mel-Cepstral Distortion (MCD) for a small number of speakers, whereas CycleVAE achieves 12.7 % lower MCD within a limited training time for a large number of speakers.
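
Both methods rely on a cycle-consistency constraint to learn conversions from non-parallel data: converting source features to a target speaker and then back to the source speaker should reconstruct the original features. The minimal sketch below illustrates this shared idea; the generator interface and speaker conditioning are hypothetical placeholders, not the exact models evaluated in the paper.

```python
import torch.nn.functional as F

def cycle_consistency_loss(generator, x_src, src_id, tgt_id):
    # `generator(features, speaker_id)` is a placeholder interface for a
    # speaker-conditioned converter, e.g., CC-GAN's generator or
    # CycleVAE's encoder-decoder pair.
    x_fake = generator(x_src, tgt_id)    # source speaker -> target speaker
    x_cycle = generator(x_fake, src_id)  # target speaker -> back to source
    # The round trip should reproduce the original features (L1 penalty here).
    return F.l1_loss(x_cycle, x_src)
```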

Deep generative models such as the Generative Adversarial Network (GAN) and the Variational AutoEncoder (VAE) offer a new methodology for voice conversion with non-parallel training data. In particular, the Conditional Cycle-Consistent Generative Adversarial Network (CC-GAN) and the Cycle-Consistent Variational AutoEncoder (CycleVAE) show excellent performance in voice conversion among multiple speakers. However, CC-GAN and CycleVAE have so far been studied with relatively small numbers of speakers. In this paper, we experimentally analyze the voice conversion performance and scalability of CC-GAN and CycleVAE using data from 100 Korean speakers. The experimental results show that CC-GAN performs 4.5 % better in terms of Mel-Cepstral Distortion (MCD) for a small number of speakers, whereas CycleVAE performs 12.7 % better within a limited training time for a large number of speakers.
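
The Mel-Cepstral Distortion used as the objective measure above is commonly computed from time-aligned mel-cepstral coefficient sequences, excluding the 0th (energy-related) coefficient. The sketch below shows the standard formula; the array shapes and the alignment step are assumptions, not the paper's exact evaluation code.

```python
import numpy as np

def mel_cepstral_distortion(mc_converted, mc_target):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences,
    each of shape (num_frames, cepstral_order); frame alignment (e.g., by
    dynamic time warping) is assumed to have been done beforehand."""
    diff = mc_converted[:, 1:] - mc_target[:, 1:]  # drop the 0th coefficient
    # Per-frame MCD: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```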

Acknowledgement

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korean government (Ministry of Science and ICT) (No. NRF-2017R1E1A1A01078157).

References

  1. B. Ko, K. Lee, I.-C. Yoo, and D. Yook, "Korean voice conversion experiments using CC-GAN and VAW-GAN" (in Korean), Proc. Speech Communication and Signal Processing, 36, 39 (2019).
  2. B. Jang, H. Seo, I.-C. Yoo, and D. Yook, "CycleVAE based many-to-many voice conversion experiments using Korean speech corpus" (in Korean), J. Acoust. Soc. Suppl. 2(s) 40, 79 (2021).
  3. I.-C. Yoo, K. Lee, S.-G. Leem, H. Oh, B. Ko, and D. Yook, "Speaker anonymization for personal information protection using voice conversion techniques," IEEE Access, 8, 198637-198645 (2020). https://doi.org/10.1109/access.2020.3035416
  4. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Proc. NIPS, 2672-2680 (2014).
  5. D. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv:1312.6114 (2013).
  6. J. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," Proc. IEEE Int. Conf. Computer Vision, 2242-2251 (2017).
  7. T. Kaneko and H. Kameoka, "CycleGAN-VC: Nonparallel voice conversion using cycle-consistent adversarial networks," Proc. EUSIPCO, 2114-2118 (2018).
  8. T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC2: Improved CycleGAN-based nonparallel voice conversion," Proc. IEEE ICASSP, 6820-6824 (2019).
  9. T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC3: Examining and improving CycleGAN-VCs for Mel-spectrogram conversion," Proc. Interspeech, 2017-2021 (2020).
  10. D. Yook, I.-C. Yoo, and S. Yoo, "Voice conversion using conditional CycleGAN," Proc. Int. Conf. CSCI, 1460-1461 (2018).
  11. S. Lee, B. Ko, K. Lee, I.-C. Yoo, and D. Yook, "Many-to-many voice conversion using conditional cycle-consistent adversarial networks," Proc. IEEE ICASSP, 6279-6283 (2020).
  12. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," Proc. IEEE Workshop on SLT, 266-273 (2018).
  13. T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," Proc. Interspeech, 679-683 (2019).
  14. C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, "Voice conversion from non-parallel corpora using variational autoencoder," Proc. APSIPA, 1-6 (2016).
  15. A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," Proc. NIPS, 6309-6318 (2017).
  16. C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," Proc. Interspeech, 3364-3368 (2017).
  17. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 27, 1432-1443 (2019).
  18. P. Tobing, Y. Wu, T. Hayashi, K. Kobayashi, and T. Toda, "Non-parallel voice conversion with cyclic variational autoencoder," Proc. Interspeech, 674-678 (2019).
  19. D. Yook, S.-G. Leem, K. Lee, and I.-C. Yoo, "Many-to-many voice conversion using cycle-consistent variational autoencoder with multiple decoders," Proc. Odyssey: The Speaker and Language Recognition Workshop, 215-221 (2020).
  20. B. Ko, Many-to-many voice conversion using cycle-consistency for Korean speech (in Korean), (Master's thesis, Korea University, 2020).
  21. M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. on Information and Systems, E99-D, 1877-1884 (2016).
  22. D. Kingma and J. Ba, "Adam: A method for stochastic optimization," Proc. ICLR, 1-13 (2015).
  23. T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. on Audio, Speech, and Lang. Process. 15, 2222-2235 (2007). https://doi.org/10.1109/TASL.2007.907344
  24. S. Takamichi, T. Toda, A. Black, G. Neubig, S. Sakti, and S. Nakamura, "Postfilters to modify the modulation spectrum for statistical parametric speech synthesis," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 24, 755-767 (2016). https://doi.org/10.1109/TASLP.2016.2522655