Lip Reading Method Using CNN for Utterance Period Detection

(Korean title, translated: Lip Shape Recognition Method Based on a CNN Trained for Utterance Period Detection)

  • Kim, Yong-Ki (Dept. of Computer Engineering, Chungbuk National University) ;
  • Lim, Jong Gwan (Dept. of Mechanical Engineering, KAIST) ;
  • Kim, Mi-Hye (Dept. of Computer Engineering, Chungbuk National University)
  • Received : 2016.06.20
  • Accepted : 2016.08.20
  • Published : 2016.08.28

Abstract

Because speech recognition degrades in noisy environments, Audio Visual Speech Recognition (AVSR) systems, which combine speech and visual information, have been proposed since the mid-1990s, and lip reading has played a significant role in them. This study aims to improve the recognition rate of uttered words using only lip shape detection, as a step toward an efficient AVSR system. After preprocessing to detect the lip region, Convolutional Neural Network (CNN) techniques are applied to detect the utterance period and to extract lip shape feature vectors, and Hidden Markov Models (HMMs) are then used for recognition. Utterance period detection achieves a 91% success rate, which is higher than that of general threshold methods. In lip reading recognition, the user-dependent experiment records a recognition rate of 88.5% and the user-independent experiment 80.2%, both of which improve on previous studies.
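The abstract compares the CNN-based detector with "general threshold methods" but does not specify them. As a rough illustration of what such a baseline might look like, the sketch below marks an utterance period wherever the mean inter-frame intensity change in the lip region stays above a fixed threshold. The function names, the motion-energy measure, and the `threshold` and `min_length` parameters are assumptions made for illustration; none of them are taken from the paper.

```python
import numpy as np


def motion_energy(frames):
    """Mean absolute intensity change between consecutive lip-region frames.

    frames: array-like of shape (num_frames, height, width).
    Returns an array of length num_frames - 1.
    """
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(np.diff(frames, axis=0))
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)


def detect_utterance_periods(frames, threshold, min_length=3):
    """Return (start, end) index pairs where motion energy exceeds threshold.

    Index i refers to the transition between frame i and frame i + 1.
    `end` is exclusive. Runs shorter than `min_length` are discarded.
    Both parameters are hypothetical tuning knobs, not values from the paper.
    """
    active = motion_energy(frames) > threshold
    periods = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i  # a candidate period begins
        elif not is_active and start is not None:
            if i - start >= min_length:
                periods.append((start, i))
            start = None
    # Close a period that runs to the end of the sequence.
    if start is not None and len(active) - start >= min_length:
        periods.append((start, len(active)))
    return periods
```

A fixed threshold of this kind has to be tuned per speaker and per lighting condition, which is one plausible reason a learned detector could outperform it.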

Keywords

Image Processing; AVSR; Lip Reading; Motion Segmentation; DNN
