AI-based stuttering automatic classification method: Using a convolutional neural network

  • Jin Park (Department of Speech and Language Rehabilitation, Catholic Kwandong University)
  • Chang Gyun Lee (Department of Business Administration, Catholic Kwandong University)
  • Received: 2023.11.17
  • Accepted: 2023.12.12
  • Published: 2023.12.31

Abstract

This study aimed to develop an automated method for identifying and classifying stuttered disfluencies using artificial intelligence. In particular, it sought to build a deep learning identification model based on a convolutional neural network (CNN) for Korean speakers who stutter. To this end, speech data were collected from 9 adults who stutter and 9 normally fluent speakers. The recordings were automatically segmented at the phrasal level using Google Cloud speech-to-text (STT), and each segment was labeled 'fluent', 'blockage', 'prolongation', or 'repetition'. Mel-frequency cepstral coefficients (MFCCs) were then extracted from the segments and fed to a CNN-based classifier to detect and classify each type of stuttered disfluency. Because only five instances of prolongation were collected, that type was excluded from the classifier model. The accuracy of the CNN classifier was 0.96, and the F1-scores for classification performance were 1.00 for 'fluent', 0.67 for 'blockage', and 0.74 for 'repetition'. Although these results confirm that a CNN-based classifier can detect stuttered disfluencies, performance was inadequate for the blockage and repetition types. Establishing a large speech database, with data collected by type of stuttered disfluency, was therefore identified as a necessary foundation for improving classification performance and, ultimately, for developing more reliable automatic stuttering identification technology and more advanced assessment and intervention services.
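
To make the segmentation step concrete, the sketch below shows how Google Cloud STT can be asked for word-level timestamps, from which phrase-level segment boundaries can be cut out of the original recording. This is a minimal sketch, not the authors' preprocessing code: the file name, sample rate, and the idea of cutting at word boundaries are assumptions; only the google-cloud-speech calls themselves are standard.

    # Minimal sketch (assumed setup): word-level timestamps from Google Cloud STT.
    from google.cloud import speech

    client = speech.SpeechClient()

    with open("sample.wav", "rb") as f:     # hypothetical file name
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,            # assumed sampling rate
        language_code="ko-KR",              # Korean speech, per the study
        enable_word_time_offsets=True,      # return start/end time per word
    )

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        for word in result.alternatives[0].words:
            # Word boundaries (in seconds) from which phrase-level
            # segments can be extracted from the original audio.
            print(word.word,
                  word.start_time.total_seconds(),
                  word.end_time.total_seconds())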
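
For the classification step, a minimal sketch of an MFCC-plus-CNN classifier is given below. The number of coefficients, the fixed input length, and the layer configuration are illustrative assumptions rather than the architecture reported in the paper; only the three retained classes follow the abstract (prolongation excluded).

    # Minimal sketch (assumed hyperparameters): MFCC features into a small CNN.
    import numpy as np
    import librosa
    import tensorflow as tf

    N_MFCC = 13        # assumed number of cepstral coefficients
    MAX_FRAMES = 128   # assumed fixed segment length (pad/truncate)
    CLASSES = ["fluent", "blockage", "repetition"]  # prolongation excluded (n=5)

    def extract_mfcc(path):
        """Return a fixed-size MFCC matrix for one phrase-level segment."""
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
        mfcc = mfcc[:, :MAX_FRAMES]                  # truncate long segments
        pad = MAX_FRAMES - mfcc.shape[1]
        if pad > 0:
            mfcc = np.pad(mfcc, ((0, 0), (0, pad)))  # zero-pad short segments
        return mfcc[..., np.newaxis]                 # channel axis for Conv2D

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N_MFCC, MAX_FRAMES, 1)),
        tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

Per-class F1-scores such as those reported above would then be computed on held-out segments, for example with sklearn.metrics.classification_report.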

Acknowledgement

We sincerely thank all of the participants who took part in this study and provided their speech data.
