Lip Reading Method Using CNN for Utterance Period Detection


  • Kim, Yong-Ki (Dept. of Computer Engineering, Chungbuk National University) ;
  • Lim, Jong Gwan (Dept. of Mechanical Engineering, KAIST) ;
  • Kim, Mi-Hye (Dept. of Computer Engineering, Chungbuk National University)
  • Received : 2016.06.20
  • Accepted : 2016.08.20
  • Published : 2016.08.28

Abstract

Due to the problems of speech recognition in noisy environments, Audio-Visual Speech Recognition (AVSR) systems, which combine acoustic and visual information, have been proposed since the mid-1990s, and lip reading has played a significant role in such systems. This study aims to improve the recognition rate of uttered words using lip shape alone, toward an efficient AVSR system. After preprocessing to detect the lip region, a Convolutional Neural Network (CNN) is applied for utterance period detection and lip-shape feature vector extraction, and Hidden Markov Models (HMMs) are then used for recognition. The utterance period detector achieves a 91% success rate, outperforming conventional threshold-based methods. In the lip-reading experiments, the speaker-dependent setting records an 88.5% recognition rate and the speaker-independent setting 80.2%, both improvements over previous studies.
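The abstract's utterance period detection step classifies each video frame as speaking or non-speaking and then groups those per-frame decisions into contiguous segments. The paper does not publish its post-processing rules, so the following is only a minimal stdlib sketch of that grouping step; the threshold, minimum segment length, and gap-bridging values are illustrative assumptions, and the per-frame scores would in practice come from the CNN classifier described in the abstract.

```python
def detect_utterance_periods(frame_scores, threshold=0.5, min_len=5, max_gap=2):
    """Group per-frame speaking scores (e.g. CNN softmax outputs in [0, 1])
    into (start, end) frame-index segments.

    threshold: a frame counts as "speaking" when its score exceeds this.
    max_gap:   non-speaking runs of at most this length are bridged.
    min_len:   segments shorter than this many frames are discarded.
    """
    speaking = [s > threshold for s in frame_scores]
    segments, start, gap = [], None, 0
    for i, flag in enumerate(speaking):
        if flag:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:              # gap too long: close the segment
                end = i - gap
                if end - start + 1 >= min_len:
                    segments.append((start, end))
                start, gap = None, 0
    if start is not None:                  # close a segment running to the end
        end = len(speaking) - 1 - gap
        if end - start + 1 >= min_len:
            segments.append((start, end))
    return segments


# Toy example: one utterance spanning frames 2-8, with a 1-frame dip bridged.
print(detect_utterance_periods(
    [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.9, 0.8, 0.9, 0.1, 0.1, 0.1]))
```

The gap-bridging and minimum-length filters stand in for the temporal smoothing that a learned detector provides; a simple intensity threshold without them is what the abstract's 91% result is compared against.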

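In the HMM recognition stage the abstract describes, each candidate word has its own model, and the system picks the word whose HMM assigns the observation sequence the highest likelihood. A toy discrete-observation illustration of that argmax-over-models step follows; the paper actually uses continuous CNN feature vectors, and the two word models and two-symbol alphabet here are invented purely for demonstration.

```python
import math


def log_forward(obs, pi, A, B):
    """Log-likelihood of a symbol sequence under a discrete HMM,
    computed with the forward algorithm.

    pi: initial state probabilities, A: state transition matrix,
    B: emission matrix (states x symbols), obs: list of symbol indices.
    """
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [B[s][o] * sum(alpha[t] * A[t][s] for t in range(n))
                 for s in range(n)]
    return math.log(sum(alpha))


def recognize(obs, word_models):
    """Return the word whose HMM scores the observation sequence highest."""
    return max(word_models, key=lambda w: log_forward(obs, *word_models[w]))


# Invented two-state models for two words, each a (pi, A, B) triple.
word_models = {
    "yes": ([0.9, 0.1], [[0.7, 0.3], [0.2, 0.8]], [[0.8, 0.2], [0.1, 0.9]]),
    "no":  ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.2, 0.8], [0.9, 0.1]]),
}

print(recognize([0, 0, 1, 1], word_models))  # prefers the "yes" model
```

A real system would use continuous-emission (e.g. Gaussian) HMMs over the CNN feature vectors and compute the forward pass with scaling or entirely in log space, since the unscaled probabilities above underflow on long sequences.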

