Automatic detection of speech sound disorder in children using automatic speech recognition and audio classification

  • Selina S. Sung (Department of Computer Science and Engineering, Sogang University)
  • Jungmin So (Department of Computer Science and Engineering, Sogang University)
  • Tae-Jin Yoon (Department of English and Literature, Sungshin Women's University)
  • Seunghee Ha (Department of Speech Pathology and Audiology, Hallym University)
  • Received: 2024.07.31
  • Accepted: 2024.09.11
  • Published: 2024.09.30

Abstract

Children with speech sound disorders (SSDs) face various challenges in producing speech sounds, which often lead to significant social and educational barriers. Detecting and treating SSDs in children is complex because disorder severity varies and diagnostic boundaries are not clear-cut. This study aims to develop an automated SSD detection system using deep learning models, which can transcribe audio and capture sound patterns efficiently at scale, in order to address the limitations of traditional assessment by speech-language pathologists. We collected audio recordings from 573 children aged two to nine using standardized prompts from the Assessment of Phonology and Articulation for Children. Speech-language pathologists analyzed the recordings and identified 92 children with SSDs. To build the automatic SSD detection system, we used this dataset to train neural network models for automatic speech recognition and audio classification. We studied five methods, the best of which achieved an unweighted average recall of 73.9%. While the results show the potential of deep learning models for the automatic detection of SSDs in children, further research is needed to make the models reliable enough for wide use in practice.
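
The reported metric, unweighted average recall (UAR), is the mean of the per-class recalls, so each class counts equally regardless of its size. The following minimal sketch illustrates why this matters for a dataset with 92 SSD cases among 573 children. The baseline classifier and the function name `unweighted_average_recall` are assumptions introduced here for illustration only; they are not taken from the paper and do not reproduce any of its five methods.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls, weighting every class equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Class sizes from the abstract: 92 children with SSD (label 1),
# and 573 - 92 = 481 children without (label 0).
y_true = [1] * 92 + [0] * 481

# Hypothetical baseline (an assumption, not a method from the paper):
# label every child as not having an SSD.
y_pred = [0] * 573

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)

print(f"accuracy = {accuracy:.3f}")
print(f"UAR      = {uar:.3f}")
```

Under these assumptions the trivial baseline reaches a high accuracy (481 of 573 correct) while its UAR is only 0.5, because it recalls none of the SSD cases. This is the usual reason for preferring UAR on imbalanced data.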

Acknowledgement

This research was supported by the National Research Foundation of Korea under grant no. NRF-2021S1A5A2A03064795.
