Exploring the feasibility of fine-tuning large-scale speech recognition models for domain-specific applications: A case study on Whisper model and KsponSpeech dataset

  • Jungwon Chang (Department of English Language and Literature, Korea University)
  • Hosung Nam (Department of English Language and Literature, Korea University)
  • Received : 2023.08.15
  • Accepted : 2023.09.15
  • Published : 2023.09.30

Abstract

This study investigates the fine-tuning of large-scale Automatic Speech Recognition (ASR) models, specifically OpenAI's Whisper model, for domain-specific applications using the KsponSpeech dataset. The primary research questions address the effectiveness of emphasizing targeted lexical items during fine-tuning, its impact on domain-specific performance, and whether the fine-tuned model can maintain its generalization capabilities across different languages and environments. Experiments were conducted using two fine-tuning datasets: Set A, a small subset emphasizing specific lexical items, and Set B, the entire KsponSpeech dataset. Results showed that emphasizing targeted lexical items during fine-tuning increased recognition accuracy and improved domain-specific performance, and that the model retained its generalization capabilities when fine-tuned on the smaller dataset. In noisier environments, a trade-off between specificity and generalization was observed. This study highlights the potential of fine-tuning with minimal domain-specific data to achieve satisfactory results, and it underscores the importance of balancing specialization and generalization in ASR models. Future research could explore different fine-tuning strategies and novel techniques such as prompting to further enhance the domain-specific performance of large-scale ASR models.
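For context, the sketch below shows how a Whisper checkpoint is typically fine-tuned on a speech corpus with Hugging Face Transformers (Wolf et al., 2019), a common open-source route for this kind of experiment. It is a minimal illustration under stated assumptions, not the authors' exact pipeline: the checkpoint name, hyperparameters, and the hypothetical `train_dataset` with "audio" and "sentence" columns are placeholders.

```python
# Minimal sketch: fine-tuning Whisper with Hugging Face Transformers.
# Checkpoint, hyperparameters, and dataset columns are illustrative
# placeholders, not the setup reported in the paper.
import torch
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-base", language="korean", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

def preprocess(batch):
    # Turn raw audio into log-Mel input features and the transcript
    # into label token ids.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

# `train_dataset` is assumed to be a datasets.Dataset with "audio" and
# "sentence" columns (e.g., a KsponSpeech subset such as the paper's Set A):
# train_dataset = train_dataset.map(preprocess)

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-ksponspeech",
    per_device_train_batch_size=16,
    learning_rate=1e-5,   # a small learning rate limits catastrophic forgetting
    warmup_steps=500,
    max_steps=4000,
    fp16=torch.cuda.is_available(),
    predict_with_generate=True,
)

# In practice a data collator that pads input_features and labels separately
# is also required; it is omitted here for brevity.
# trainer = Seq2SeqTrainer(model=model, args=training_args,
#                          train_dataset=train_dataset,
#                          tokenizer=processor.feature_extractor)
# trainer.train()
```

The learning rate and the number of update steps are the usual levers for the trade-off the abstract describes: longer or more aggressive fine-tuning sharpens domain-specific accuracy but erodes the pretrained model's generalization to other languages and acoustic conditions.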


References

  1. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020, December). wav2vec 2.0: A framework for self-supervised learning of speech representations. Proceedings of the Advances in Neural Information Processing Systems (pp. 12449-12460). Online Conference.
  2. Bang, J. U., Yun, S., Kim, S. H., Choi, M. Y., Lee, M. K., Kim, Y. J., Kim, D. H., ... Kim, S. H. (2020). KsponSpeech: Korean spontaneous speech corpus for automatic speech recognition. Applied Sciences, 10(19), 6936.
  3. Chang, K. W., Tseng, W. C., Li, S. W., & Lee, H. Y. (2022). SpeechPrompt: An exploration of prompt tuning on generative spoken language model for speech processing tasks. Retrieved from https://arxiv.org/abs/2203.16773
  4. Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., ... Wei, F. (2022). WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1505-1518. https://doi.org/10.1109/JSTSP.2022.3188113
  5. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. Retrieved from https://arxiv.org/abs/1810.04805
  6. Gulati, A., Qin, J., Chiu, C. C., Parmar, N., Zhang, Y., Yu, J., Han, W., ... Pang, R. (2020). Conformer: Convolution-augmented transformer for speech recognition. Retrieved from https://arxiv.org/abs/2005.08100
  7. Guo, P., Boyer, F., Chang, X., Hayashi, T., Higuchi, Y., Inaguma, H., Kamo, N., ... Zhang, Y. (2021, June). Recent developments on ESPnet toolkit boosted by Conformer. Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5874-5878). Toronto, ON.
  8. Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451-3460. https://doi.org/10.1109/TASLP.2021.3122291
  9. Kim, K., Wu, F., Peng, Y., Pan, J., Sridhar, P., Han, K. J., & Watanabe, S. (2023, January). E-Branchformer: Branchformer with enhanced merging for speech recognition. Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT) (pp. 84-91). Doha, Qatar.
  10. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. Retrieved from https://arxiv.org/abs/1412.6980
  11. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Retrieved from https://arxiv.org/abs/2107.13586
  12. Mohamed, A., Lee, H. Y., Borgholt, L., Havtorn, J. D., Edin, J., Igel, C., Kirchhoff, K., ... Watanabe, S. (2022). Self-supervised speech representation learning: A review. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1179-1210. https://doi.org/10.1109/JSTSP.2022.3207050
  13. Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015, April). LibriSpeech: An ASR corpus based on public domain audio books. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5206-5210). South Brisbane, Australia.
  14. Peng, P., Yan, B., Watanabe, S., & Harwath, D. (2023a). Prompting the hidden talent of web-scale speech models for zero-shot task generalization. Retrieved from https://arxiv.org/abs/2305.11095
  15. Peng, Y., Kim, K., Wu, F., Yan, B., Arora, S., Chen, W., Tang, J., ... Watanabe, S. (2023b). A comparative study on E-Branchformer vs. Conformer in speech recognition, translation, and understanding tasks. Retrieved from https://arxiv.org/abs/2305.11073
  16. Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., ... Auli, M. (2023). Scaling speech technology to 1,000+ languages. Retrieved from https://arxiv.org/abs/2305.13516
  17. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. Proceedings of the 40th International Conference on Machine Learning (pp. 28492-28518). Honolulu, HI.
  18. Rouditchenko, A., Khurana, S., Thomas, S., Feris, R., Karlinsky, L., Kuehne, H., Harwath, D., ... Glass, J. (2023). Comparison of multilingual self-supervised and weakly-supervised speech pre-training for adaptation to unseen languages. Retrieved from https://arxiv.org/abs/2305.12606
  19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017, December). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems. Long Beach, CA.
  20. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., ... Rush, A. M. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. Retrieved from https://arxiv.org/abs/1910.03771
  21. Zhang, Y., Han, W., Qin, J., Wang, Y., Bapna, A., Chen, Z., Chen, N., ... Wu, Y. (2023). Google USM: Scaling automatic speech recognition beyond 100 languages. Retrieved from https://arxiv.org/abs/2303.01037