Constructing a Noise-Robust Speech Recognition System using Acoustic and Visual Information

Lee, Jong-Seok; Park, Cheol-Hoon

doi:10.5302/J.ICROS.2007.13.8.719

Journal of Institute of Control, Robotics and Systems (제어로봇시스템학회논문지)

Volume 13, Issue 8 / Pages 719-725 / 2007 / 1976-5622 (pISSN) / 2233-4335 (eISSN)

Institute of Control, Robotics and Systems (제어로봇시스템학회)



Implementation of a Robust Speech Recognition System Using Auditory and Visual Information

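Jong-Seok Lee (Division of Electrical Engineering, School of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology);
Cheol-Hoon Park (Division of Electrical Engineering, School of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology)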

Published: 2007.08.01

https://doi.org/10.5302/J.ICROS.2007.13.8.719


Abstract

In this paper, we present an audio-visual speech recognition system for noise-robust human-computer interaction. Unlike conventional speech recognition systems, our system uses the visual signal of the speaker's lip movements together with the acoustic signal, so that recognition remains robust against environmental noise. The procedures for acoustic speech processing, visual speech processing, and audio-visual integration are described in detail. Experimental results demonstrate that the constructed system, by exploiting the complementary nature of the two signals, significantly improves recognition performance in noisy conditions compared with acoustic-only recognition.
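The abstract does not spell out how the two modalities are combined, so the following is only a minimal sketch of one common approach, weighted late integration, and should not be read as the authors' method. It assumes each recognizer outputs one log-likelihood per candidate word; the function names, the weight parameter `audio_weight`, and the example scores are all hypothetical.

```python
def fuse_scores(audio_ll, visual_ll, audio_weight):
    """Weighted sum of per-class log-likelihoods from two recognizers.

    audio_weight is an assumed reliability weight in [0, 1]; the visual
    stream receives the remaining weight (1 - audio_weight).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_ll) != len(visual_ll):
        raise ValueError("both recognizers must score the same classes")
    return [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_ll, visual_ll)
    ]


def recognize(audio_ll, visual_ll, audio_weight):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(audio_ll, visual_ll, audio_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three candidate words. The noisy audio stream
# slightly prefers word 1, while the visual stream clearly prefers word 0.
audio_scores = [-120.0, -118.0, -125.0]
visual_scores = [-60.0, -75.0, -70.0]

trusting_audio = recognize(audio_scores, visual_scores, audio_weight=0.9)
trusting_video = recognize(audio_scores, visual_scores, audio_weight=0.3)
```

In this sketch, lowering `audio_weight` as the acoustic signal degrades lets the visual stream decide the result, which is one way the complementary nature of the two signals can be used. How the weight is chosen for a given noise level is left open here.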

