H. McGurk and J. MacDonald, 'Hearing lips and seeing voices,' Nature, vol. 264, pp. 746-748, Dec. 1976. https://doi.org/10.1038/264746a0
A. Q. Summerfield, 'Some preliminaries to a comprehensive account of audio-visual speech perception,' in B. Dodd and R. Campbell, eds., Hearing by Eye: The Psychology of Lip-reading, pp. 3-51, Lawrence Erlbaum, London, 1987
L. A. Ross, D. Saint-Amour, V. M. Leavitt, D. C. Javitt, and J. J. Foxe, 'Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments,' Cerebral Cortex, vol. 17, no. 5, pp. 1147-1153, 2007. https://doi.org/10.1093/cercor/bhl024
R. M. Stern, B. Raj, and P. J. Moreno, 'Compensation for environmental degradation in automatic speech recognition,' in Proc. ESCA-NATO Tutorial and Research Workshop on Robust Speech Recognition using Unknown Communication Channels, Pont-à-Mousson, France, pp. 33-42, Apr. 1997
H. Hermansky and N. Morgan, 'RASTA processing of speech,' IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994. https://doi.org/10.1109/89.326616
L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey, 1993
H. P. Graf, E. Cosatto, and G. Potamianos, 'Robust recognition of faces and facial features with a multi-modal system,' in Proc. Int. Conf. Systems, Man and Cybernetics, pp. 2034-2039, 1997
M. N. Kaynak, Q. Zhi, A. D. Cheok, K. Sengupta, Z. Jian, and K. C. Chung, 'Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis,' Speech Communication, vol. 43, no. 1-2, pp. 1-16, Jan. 2004. https://doi.org/10.1016/j.specom.2004.01.003
T. Coianiz, L. Torresani, and B. Caprile, '2D deformable models for visual speech analysis,' in D. G. Stork and M. E. Hennecke, eds., Speechreading by Humans and Machines: Models, Systems and Applications, pp. 391-398, Springer-Verlag, Berlin, Germany, 1996
S. Dupont and J. Luettin, 'Audio-visual speech modeling for continuous speech recognition,' IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141-151, Sept. 2000. https://doi.org/10.1109/6046.865479
G. Potamianos, H. P. Graf, and E. Cosatto, 'An image transform approach for HMM based automatic lipreading,' in Proc. Int. Conf. Image Processing, vol. 3, Chicago, pp. 173-177, 1998
M. S. Gray, J. R. Movellan, and T. J. Sejnowski, 'Dynamic features for visual speechreading: a systematic comparison,' Advances in Neural Information Processing Systems, vol. 9, pp. 751-757, 1997
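J.-S. Lee, S. H. Shim, S. Y. Kim, and C. H. Park, 'Bimodal speech recognition using robust feature extraction of lip movement under uncontrolled illumination conditions,' Telecommunications Review, vol. 14, no. 1, pp. 123-134, Feb. 2004 (in Korean)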
T. J. Hazen, 'Visual model structures and synchrony constraints for audio-visual speech recognition,' IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 3, pp. 1082-1089, May 2006. https://doi.org/10.1109/TSA.2005.857572
C. Benoit, 'The intrinsic bimodality of speech communication and the synthesis of talking faces,' in M. M. Taylor, F. Néel, and D. Bouwhuis, eds., The Structure of Multimodal Dialogue II, John Benjamins, Amsterdam, The Netherlands, pp. 485-502, 2000
B. Conrey and D. B. Pisoni, 'Auditory-visual speech perception and synchrony detection for speech and nonspeech signals,' Journal of the Acoustical Society of America, vol. 119, no. 6, pp. 4065-4073, June 2006. https://doi.org/10.1121/1.2195091
S. Tamura, K. Iwano, and S. Furui, 'A stream-weight optimization method for multi-stream HMMs based on likelihood value normalization,' in Proc. Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 469-472, 2005
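J.-S. Lee and C. H. Park, 'Information integration for audio-visual speech recognition: comparison of reliability measures and an integration technique using neural networks,' Telecommunications Review, vol. 17, no. 3, pp. 538-550, June 2007 (in Korean)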
S. Nakamura, 'Statistical multimodal integration for audio-visual speech processing,' IEEE Trans. Neural Networks, vol. 13, no. 4, pp. 854-866, Jul. 2002. https://doi.org/10.1109/TNN.2002.1021886
S. M. Chu and T. S. Huang, 'Audio-visual speech modeling using coupled hidden Markov models,' in Proc. Int. Conf. Acoustics, Speech and Signal Processing, vol. 2, Orlando, FL, pp. 2009-2012, May 2002
A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, 'Dynamic Bayesian networks for audio-visual speech recognition,' EURASIP J. Applied Signal Processing, vol. 11, pp. 1-15, 2002
G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, 'Recent advances in the automatic recognition of audiovisual speech,' Proc. IEEE, vol. 91, no. 9, pp. 1306-1326, Sept. 2003. https://doi.org/10.1109/JPROC.2003.817150
S. Pigeon and L. Vandendorpe, 'The M2VTS multimodal face database,' in Proc. Int. Conf. Audio- and Video-based Biometric Person Authentication, pp. 403-409, 1997. https://doi.org/10.1007/BFb0016021
K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, 'XM2VTS: the extended M2VTS database,' in Proc. Int. Conf. Audio- and Video-based Biometric Person Authentication, pp. 72-76, 1999