Noise-Robust Audio-Visual Speech Recognition

Lee, Jong-Seok; Park, Cheol-Hun

ICROS (Journal of the Institute of Control, Robotics and Systems)

Volume 13, Issue 3 / Pages 28-34 / 2007 / pISSN 1976-4529

Institute of Control, Robotics and Systems

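Jong-Seok Lee (Department of Electrical Engineering and Computer Science, KAIST);
Cheol-Hun Park (Department of Electrical Engineering and Computer Science, KAIST)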

Published: 2007.09.25

Abstract

Keywords

References

H. McGurk and J. MacDonald, 'Hearing lips and seeing voices,' Nature, vol. 264, pp. 746-748, Dec. 1976 https://doi.org/10.1038/264746a0
A. Q. Summerfield, 'Some preliminaries to a comprehensive account of audio-visual speech perception,' in B. Dodd and R. Campbell, eds., Hearing by Eye: The Psychology of Lip-reading, pp. 3-51, Lawrence Erlbaum, London, 1987
L. A. Ross, D. Saint-Amour, V. M. Leavitt, D. C. Javitt, and J. J. Foxe, 'Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments,' Cerebral Cortex, vol. 17, no. 5, pp. 1147-1153, 2007 https://doi.org/10.1093/cercor/bhl024
R. M. Stern, B. Raj, and P. J. Moreno, 'Compensation for environmental degradation in automatic speech recognition,' in Proc. ESCA-NATO Tutorial and Research Workshop on Robust Speech Recognition using Unknown Communication Channels, Pont-à-Mousson, France, pp. 33-42, Apr. 1997
H. Hermansky and N. Morgan, 'RASTA processing of speech,' IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994 https://doi.org/10.1109/89.326616
L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey, 1993
H. P. Graf, E. Cosatto, and G. Potamianos, 'Robust recognition of faces and facial features with a multi-modal system,' in Proc. Int. Conf. Systems, Man and Cybernetics, pp. 2034-2039, 1997
M. N. Kaynak, Q. Zhi, A. D. Cheok, K. Sengupta, Z. Jian, and K. C. Chung, 'Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis,' Speech Communication, vol. 43, no. 1-2, pp. 1-16, Jan. 2004 https://doi.org/10.1016/j.specom.2004.01.003
T. Coianiz, L. Torresani, and B. Caprile, '2D deformable models for visual speech analysis,' in D. G. Stork and M. E. Hennecke, eds., Speechreading by Humans and Machines: Models, Systems and Applications, pp. 391-398, Springer-Verlag, Berlin, Germany, 1996
S. Dupont and J. Luettin, 'Audio-visual speech modeling for continuous speech recognition,' IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141-151, Sept. 2000 https://doi.org/10.1109/6046.865479
G. Potamianos, H. P. Graf, and E. Cosatto, 'An image transform approach for HMM based automatic lipreading,' in Proc. Int. Conf. Image Processing, vol. 3, Chicago, pp. 173-177, 1998
M. S. Gray, J. R. Movellan, and T. J. Sejnowski, 'Dynamic features for visual speechreading: a systematic comparison,' Advances in Neural Information Processing Systems, vol. 9, pp. 751-757, 1997
J.-S. Lee, S.-H. Shim, S.-Y. Kim, and C.-H. Park, 'Bimodal speech recognition using robust feature extraction of lip movements under uncontrolled illumination conditions,' Telecommunications Review, vol. 14, no. 1, pp. 123-134, Feb. 2004 (in Korean)
T. J. Hazen, 'Visual model structures and synchrony constraints for audio-visual speech recognition,' IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 3, pp. 1082-1089, May 2006 https://doi.org/10.1109/TSA.2005.857572
C. Benoit, 'The intrinsic bimodality of speech communication and the synthesis of talking faces,' in M. M. Taylor, F. Néel, and D. Bouwhuis, eds., The Structure of Multimodal Dialogue II, John Benjamins, Amsterdam, The Netherlands, pp. 485-502, 2000
B. Conrey and D. B. Pisoni, 'Auditory-visual speech perception and synchrony detection for speech and nonspeech signals,' Journal of the Acoustical Society of America, vol. 119, no. 6, pp. 4065-4073, June 2006 https://doi.org/10.1121/1.2195091
S. Tamura, K. Iwano, and S. Furui, 'A stream-weight optimization method for multi-stream HMMs based on likelihood value normalization,' in Proc. Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 469-472, 2005
J.-S. Lee and C.-H. Park, 'Information integration for audio-visual speech recognition: comparison of reliability measures and an integration method using neural networks,' Telecommunications Review, vol. 17, no. 3, pp. 538-550, June 2007 (in Korean)
S. Nakamura, 'Statistical multimodal integration for audio-visual speech processing,' IEEE Trans. Neural Networks, vol. 13, no. 4, pp. 854-866, Jul. 2002 https://doi.org/10.1109/TNN.2002.1021886
S. M. Chu and T. S. Huang, 'Audio-visual speech modeling using coupled hidden Markov models,' in Proc. Int. Conf. Acoustics, Speech and Signal Processing, vol. 2, Orlando, FL, pp. 2009-2012, May 2002
S. Bengio, 'Multimodal speech processing using asynchronous hidden Markov models,' Information Fusion, vol. 5, pp. 81-89, 2004 https://doi.org/10.1016/j.inffus.2003.04.001
A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, 'Dynamic Bayesian networks for audio-visual speech recognition,' EURASIP J. Applied Signal Processing, vol. 11, pp. 1-15, 2002
G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, 'Recent advances in the automatic recognition of audiovisual speech,' Proc. IEEE, vol. 91, no. 9, pp. 1306-1326, Sept. 2003 https://doi.org/10.1109/JPROC.2003.817150
S. Pigeon and L. Vandendorpe, 'The M2VTS multimodal face database,' in Proc. Int. Conf. Audio- and Video-based Biometric Person Authentication, pp. 403-409, 1997 https://doi.org/10.1007/BFb0016021
K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, 'XM2VTS: the extended M2VTS database,' in Proc. Int. Conf. Audio- and Video-based Biometric Person Authentication, pp. 72-76, 1999
http://voice.etri.re.kr/DBSearch/Voice.asp