Zero-shot voice conversion with HuBERT

  • Hyelee Chung (Department of English Language and Literature, Korea University)
  • Hosung Nam (Department of English Language and Literature, Korea University)
  • Received : 2023.07.05
  • Accepted : 2023.07.23
  • Published : 2023.09.30

Abstract

This study introduces a zero-shot voice conversion model built on HuBERT. Zero-shot voice conversion models transform the speech of a source speaker to mimic that of a target speaker, even when the target speaker's voice was never seen during training. The model comprises five main components (HuBERT, a feature encoder, a flow, a speaker encoder, and a vocoder) and performs strongly across a range of scenarios, most notably the challenging unseen-to-unseen conversion setting. Effectiveness was assessed with mean opinion scores and similarity scores, which indicated high voice quality and close similarity to the target speakers. The model therefore shows considerable promise for real-world applications that demand high-quality voice conversion. This study sets a precedent in exploring HuBERT-based models for voice conversion and points to new directions for future research in this domain. Despite its complexity, the model's robust performance underscores the viability of HuBERT for advancing voice conversion technology and makes it a significant contribution to the field.
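The abstract names the model's five components but not how they connect, so the sketch below shows one plausible way such a pipeline could be wired together. It is a minimal illustration under stated assumptions: the module names, dimensions, and the placeholder flow and vocoder are ours, not the authors' implementation. The intended roles follow the abstract: HuBERT extracts largely speaker-independent content from the source utterance, a feature encoder maps it to a latent space, a speaker encoder embeds the target voice from a reference recording, a flow conditions the content on that embedding, and a vocoder renders the converted speech.

```python
# A minimal sketch of the five-component pipeline described in the abstract
# (HuBERT, feature encoder, flow, speaker encoder, vocoder). All module names,
# dimensions, and the forward pass are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn


class ZeroShotVCSketch(nn.Module):
    def __init__(self, hubert, content_dim=768, hidden_dim=192, spk_dim=256, n_mels=80):
        super().__init__()
        self.hubert = hubert                                   # pretrained HuBERT, used as a frozen content extractor
        self.feature_encoder = nn.Sequential(                  # projects HuBERT features into a latent content space
            nn.Linear(content_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.speaker_encoder = nn.GRU(n_mels, spk_dim, batch_first=True)  # utterance-level embedding of the target voice
        self.flow = nn.Linear(hidden_dim + spk_dim, hidden_dim)           # placeholder for a normalizing flow
        self.vocoder = nn.Linear(hidden_dim, n_mels)                      # placeholder for a neural vocoder (e.g., HiFi-GAN)

    def forward(self, source_wav, target_mel):
        # 1. Extract (approximately) speaker-independent content from the source utterance.
        with torch.no_grad():
            content = self.hubert(source_wav)                  # assumed shape: (batch, frames, content_dim)
        z = self.feature_encoder(content)
        # 2. Embed the target speaker from a short reference mel-spectrogram.
        _, h = self.speaker_encoder(target_mel)                # h: (1, batch, spk_dim)
        spk = h[-1].unsqueeze(1).expand(-1, z.size(1), -1)
        # 3. Condition the content on the target speaker and decode to acoustic frames.
        z = self.flow(torch.cat([z, spk], dim=-1))
        return self.vocoder(z)                                 # converted mel frames; a real vocoder would render audio
```

Because the content extractor never sees speaker labels, swapping in a different target embedding at inference time is what enables conversion to speakers unseen during training; the quality of that conversion is then what the mean opinion and similarity scores measure.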

Keywords
