A Study on Speech Recognition Technology Using Artificial Intelligence Technology


  • Young Jo Lee (Department of Software Engineering, Hyupsung University) ;
  • Ki Seung Lee (School of Electrical and Electronic Engineering, Konkuk University) ;
  • Sung Jin Kang (School of Electrical, Electronics & Communication Engineering, Korea University of Technology and Education)
  • Received : 2024.09.10
  • Accepted : 2024.09.14
  • Published : 2024.09.30

Abstract

This paper explores recent advancements in speech recognition technology, focusing on the integration of artificial intelligence to improve recognition accuracy in challenging environments, such as noisy or low-quality audio conditions. Traditional speech recognition methods often suffer from performance degradation in noisy settings. However, the application of deep neural networks (DNNs) has led to significant improvements, enabling more robust and reliable recognition in various industries, including banking, automotive, healthcare, and manufacturing. A key area of advancement is the use of Silent Speech Interfaces (SSIs), which allow communication through non-acoustic signals, such as visual cues or other auxiliary signals like ultrasound and electromyography, making them particularly useful for individuals with speech impairments. The paper further discusses the development of multi-modal speech recognition, which combines audio and visual inputs to enhance recognition accuracy in noisy environments. Recent research into lip-reading technology and the use of deep learning architectures, such as CNNs and RNNs, has significantly improved speech recognition by extracting meaningful features from video signals, even under difficult lighting conditions. Additionally, the paper covers the use of self-supervised learning techniques, such as AV-HuBERT, which leverage large-scale, unlabeled audiovisual datasets to improve performance. The future of speech recognition technology is likely to see further integration of AI-driven methods, making it more applicable across diverse industries and for individuals with communication challenges. The conclusion emphasizes the need for further research, especially in languages with complex morphological structures, such as Korean.
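To make the multi-modal idea summarized above concrete, the following Python sketch shows one common way to fuse per-frame audio and lip-video features for recognition. It is only an illustrative example, assuming a small 3D-CNN visual front-end and a bidirectional GRU back-end (typical choices in the lip-reading literature listed below); the layer sizes, input shapes, and class count are placeholders, not the configuration of any specific system discussed in this paper.

    # Illustrative sketch (not from the paper): late fusion of audio and visual
    # streams for audio-visual speech recognition. Assumed: 3D-CNN visual
    # front-end, linear audio front-end, BiGRU back-end; all sizes are placeholders.
    import torch
    import torch.nn as nn

    class AudioVisualASR(nn.Module):
        def __init__(self, n_audio_feats=80, n_classes=40):
            super().__init__()
            # Visual front-end: 3D convolution over grayscale mouth-region clips
            # shaped (batch, 1, time, height, width).
            self.visual_frontend = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space away
            )
            # Audio front-end: project log-mel filterbank frames to the same width.
            self.audio_frontend = nn.Linear(n_audio_feats, 32)
            # Fusion back-end: BiGRU over concatenated per-frame features.
            self.backend = nn.GRU(64, 128, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(256, n_classes)

        def forward(self, audio, video):
            # audio: (batch, time, n_audio_feats); video: (batch, 1, time, H, W)
            v = self.visual_frontend(video)                 # (batch, 32, time, 1, 1)
            v = v.squeeze(-1).squeeze(-1).transpose(1, 2)   # (batch, time, 32)
            a = self.audio_frontend(audio)                  # (batch, time, 32)
            fused, _ = self.backend(torch.cat([a, v], dim=-1))
            return self.classifier(fused)                   # per-frame class scores

    # Toy usage with matching time lengths for the two modalities.
    model = AudioVisualASR()
    audio = torch.randn(2, 75, 80)          # 75 frames of 80-dim log-mel features
    video = torch.randn(2, 1, 75, 96, 96)   # 75 grayscale 96x96 mouth crops
    print(model(audio, video).shape)        # torch.Size([2, 75, 40])

In this late-fusion design each modality is encoded independently and concatenated frame by frame, so the visual stream can still drive recognition when the audio stream is degraded by noise, which is the motivation for multi-modal speech recognition described in the abstract.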

Keywords

References

  1. G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
  2. A. Fernandez-Lopez and F. M. Sukno, "Survey on automatic lip-reading in the era of deep learning," Image and Vision Computing, vol. 78, pp. 53-72, 2018.
  3. J. A. Gonzalez-Lopez, A. Gomez-Alanis, J. M. Martin Donas, J. L. Perez-Cordoba and A. M. Gomez, "Silent Speech Interfaces for Speech Restoration: A Review," in IEEE Access, vol. 8, pp. 177995-178021, 2020.
  4. M. Hao, M. Mamut, N. Yadikar, A. Aysa and K. Ubul, "A Survey of Research on Lipreading Technology," in IEEE Access, vol. 8, pp. 204518-204544, 2020.
  5. S. Fenghour, D. Chen, K. Guo, B. Li and P. Xiao, "Deep Learning-Based Automated Lip-Reading: A Survey," in IEEE Access, vol. 9, pp. 121184-121205, 2021.
  6. K. Paliwal, K. Wojcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Communication, vol. 53, no. 4, pp. 465-494, 2011.
  7. M. Wollmer, B. Schuller, F. Eyben and G. Rigoll, "Combining Long Short-Term Memory and Dynamic Bayesian Networks for Incremental Emotion-Sensitive Artificial Listening," in IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, pp. 867-881, 2010.
  8. J. T. Geiger, F. Weninger, J. F. Gemmeke, M. Wollmer, B. Schuller and G. Rigoll, "Memory-Enhanced Neural Networks and NMF for Robust ASR," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 6, pp. 1037-1046, 2014.
  9. Y. Qian, M. Bi, T. Tan, and K. Yu, "Very deep convolutional neural networks for noise robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2263-2276, 2016.
  10. G. E. Dahl, D. Yu, L. Deng and A. Acero, "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, Jan. 2012.
  11. G. Trigeorgis et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, pp. 5200-5204, 2016.
  12. H. Joo and K. Lee, "Estimating speech parameters for ultrasonic Doppler signal using LSTM recurrent neural networks," The Journal of the Acoustical Society of Korea, vol. 38, no. 4, pp. 433-441, 2019.
  13. Z. Zhang, N. Cummins and B. Schuller, "Advanced Data Exploitation in Speech Analysis: An overview," in IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 107-129, July 2017.
  14. K. Lee, "An acoustic Doppler-based silent speech interface technology using generative adversarial networks'" The Journal of the Acoustical Society of Korea. vol.40, no.2, pp. 161-168, 2021.
  15. A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, "Generative adversarial networks: An overview," IEEE Signal Processing Magazine, vol. 35, pp. 53-65, 2018.
  16. J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, "Speech Enhancement and Dereverberation with Diffusion-Based Generative Models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351-2364, 2023.
  17. C.-S. Lin, S.-F. Chang, C.-C. Chang, and C.-C. Lin, "Microwave Human Vocal Vibration Signal Detection Based on Doppler Radar Technology," in IEEE Transactions on Microwave Theory and Techniques, vol. 58, no. 8, pp. 2299-2306, Aug. 2010.
  18. B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S. Brumberg, "Silent speech interfaces," Speech Communication, vol. 52, no. 4, pp. 270-287, 2010.
  19. T. Le Cornu and B. Milner, "Generating Intelligible Audio Speech from Visual Speech," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 9, pp. 1751-1761, 2017.
  20. K. Lee, "Ultrasonic Doppler Based Silent Speech Interface Using Perceptual Distance," Applied Sciences. 12(2), 827, 2022.
  21. K. Lee, "Speech enhancement using ultrasonic doppler sonar", Speech Communication, Vol. 110, pp. 21-32, July 2019.
  22. K. Lee, "Silent Speech Interface Using Ultrasonic Doppler Sonar," EICE Transactions on Information and Systems, vol. E103.D, no. 8, pp. 1875-1887, 2020.
  23. T. Toda, M. Nakagiri and K. Shikano, "Statistical Voice Conversion Techniques for Body-Conducted Unvoiced Speech Enhancement," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, pp. 2505-2517, Nov. 2012.
  24. M. Janke and L. Diener, "EMG-to-Speech: Direct Generation of Speech From Facial Electromyographic Signals," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2375-2385, Dec. 2017.
  25. G. Shin, J. Kim, "A Study on the Intelligent Recognition of a Various Electronic Components and Alignment Method with Vision," Journal of the Semiconductor & Display Technology, vol. 23, no. 2, pp. 1-5, 2024.
  26. X. Tan and B. Triggs, "Enhanced Local Texture Feature Sets for Face Recognition Under Difficult Lighting Conditions," in IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1635-1650, 2010.
  27. A. Chavarin, E. Cuevas, O. Avalos, J. Galvez and M. Perez-Cisneros, "Contrast Enhancement in Images by Homomorphic Filtering and Cluster-Chaotic Optimization," in IEEE Access, vol. 11, pp. 73803-73822, 2023.
  28. P.-H. Lee, S.-W. Wu, and Y.-P. Hung, "Illumination Compensation Using Oriented Local Histogram Equalization and its Application to Face Recognition," in IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 4280-4289, Sept. 2012.
  29. M. Zheng, G. Qi, Z. Zhu, Y. Li, H. Wei, and Y. Liu, "Image Dehazing by an Artificial Image Fusion Method Based on Adaptive Structure Decomposition," in IEEE Sensors Journal, vol. 20, no. 14, pp. 8062-8072, July 2020.
  30. D. Sugimura, T. Mikami, H. Yamashita and T. Hamamoto, "Enhancing Color Images of Extremely Low Light Scenes Based on RGB/NIR Images Acquisition With Different Exposure Times," in IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3586-3597, Nov. 2015.
  31. Y. Kumar, R. Jain, K. M. Salik, R. R. Shah, Y. Yin, R. Zimmermann, "Lipper: Synthesizing thy speech using multi-view lipreading," in Proc. AAAI Conf. Artif. Intell., vol. 33, pp. 2588-2595, 2019.
  32. K. Vougioukas, P. Ma, S. Petridis, and M. Pantic, "Video-driven speech reconstruction using generative adversarial networks," in Proc. Interspeech, Sep. 2019, pp. 4125-4129.
  33. M. Wand, J. Koutnik, and J. Schmidhuber, "Lipreading with Long Short-Term Memory," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 6115-6119.
  34. A. Ephrat and S. Peleg, "Vid2Speech: Speech reconstruction from silent video," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 5095-5099.
  35. H. Akbari, H. Arora, L. Cao, and N. Mesgarani, "Lip2Audspec: Speech reconstruction from silent lip movements video," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 2516-2520.
  36. T. Stafylakis and G. Tzimiropoulos, "Combining residual networks with LSTMs for lipreading," in Proc. Interspeech, Aug. 2017, pp. 3652-3656.
  37. B. Martinez, P. Ma, S. Petridis, and M. Pantic, "Lipreading using temporal convolutional networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 6319-6323.
  38. K. Lee, "Improving the Performance of Automatic Lip-Reading Using Image Conversion Techniques," Electronics, 13(6), 1032, March 2024.
  39. M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, "Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoders," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1788-1800. 2020.
  40. X. Qian, Z. Wang, J. Wang, G. Guan, and H. Li, "Audio-visual cross-attention networks for robotic speaker tracking," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 550-562. 2023.
  41. Lei. Liu, Li Liu, and H. Li, "Computation and Parameter Efficient Multi-Modal fusion Transformer for Cued Speech Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1559-1572. 2024.
  42. B. Shi, W. Hsu, A. Mohamed, "Robust Self-Supervised Audio-Visual Speech Recognition," arXiv:2201.01763, 2022.
  43. L. Qu, C. Weber, and S. Wermter, "LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading," IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 2, pp. 2772-2782, 2024.
  44. T. Afouras, J. Chung, A. Senior, O. Vinyals, A. Zisserman, "Deep Audio-Visual Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8717-8727, 2022.
  45. C. Xie and T. Toda, "Noisy-to-Noisy Voice Conversion Under Variations of Noisy Condition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3871-3882, 2023.
  46. J. Devlin, M. Chang, K. Lee, K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 4171-4186, 2019.
  47. Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," ICLR 2020 Conference, 2019, arXiv:1909.11942.
  48. B. Chen et al., "Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021, pp. 7992-8001.