Robust Endpoint Detection for Bimodal System in Noisy Environments

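  • Hyun-Hwa Oh (School of Electronic and Electrical Engineering, Kyungpook National University) ;
  • Hong-Seok Kwon (School of Electronic and Electrical Engineering, Kyungpook National University) ;
  • Jong-Mok Son (School of Electronic and Electrical Engineering, Kyungpook National University) ;
  • Sung-Il Chien (School of Electronic and Electrical Engineering, Kyungpook National University) ;
  • Keun-Sung Bae (School of Electronic and Electrical Engineering, Kyungpook National University)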
  • Published : 2003.09.01

Abstract

The performance of a bimodal system depends on the accuracy of endpoint detection from the input signal as well as on the performance of the speech recognition and lipreading systems. In this paper, we propose an endpoint detection method that detects endpoints from the audio and video signals separately and uses the signal-to-noise ratio (SNR) estimated from the input audio signal to select the endpoints that are more reliable under acoustic noise. In other words, the endpoints are taken from the audio signal when the SNR is high and from the video signal when the SNR is low. Experimental results show that the bimodal system using the proposed endpoint detector achieves satisfactory recognition rates, especially when the acoustic environment is quite noisy.

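A bimodal system that combines a speech recognition system with a lipreading system is implemented to obtain stable performance under acoustic noise. The performance of a bimodal system is strongly affected not only by the performance of the two recognition systems but also by the endpoint detection performance on the input signal. This paper proposes a method that automatically detects endpoints from the audio signal and from the video signal separately, and then selects between the two detection results using the signal-to-noise ratio (SNR) estimated from the input audio signal. That is, the endpoints detected from the video signal are selected at low SNR and the endpoints detected from the audio signal are selected at high SNR, so that endpoints are detected robustly under acoustic noise. Experimental results confirm that the bimodal system with the proposed endpoint detection method shows satisfactory recognition performance under strong acoustic noise.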
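
The selection rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the SNR estimator (which assumes the first `noise_len` samples contain noise only), the 10 dB threshold, and the function names `estimate_snr_db` and `select_endpoints` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from typing import List, Tuple

Endpoints = Tuple[int, int]  # (start_frame, end_frame)


def estimate_snr_db(samples: List[float], noise_len: int) -> float:
    """Rough SNR estimate in dB.

    Assumption (not from the paper): the first `noise_len` samples
    contain background noise only, and the rest contain speech plus noise.
    """
    if not 0 < noise_len < len(samples):
        raise ValueError("noise_len must be between 1 and len(samples) - 1")
    noise_power = sum(x * x for x in samples[:noise_len]) / noise_len
    rest = samples[noise_len:]
    total_power = sum(x * x for x in rest) / len(rest)
    # Subtract the noise power to approximate the speech power; clamp both
    # terms so the logarithm is always defined.
    signal_power = max(total_power - noise_power, 1e-12)
    return 10.0 * math.log10(signal_power / max(noise_power, 1e-12))


def select_endpoints(
    audio_endpoints: Endpoints,
    video_endpoints: Endpoints,
    snr_db: float,
    threshold_db: float = 10.0,  # hypothetical threshold, not from the paper
) -> Endpoints:
    """Use the audio endpoints at high SNR and the video endpoints at low SNR."""
    return audio_endpoints if snr_db >= threshold_db else video_endpoints


if __name__ == "__main__":
    # Synthetic example: a quiet leading segment followed by a louder one.
    samples = [0.1, -0.1] * 50 + [1.0, -1.0] * 50
    snr = estimate_snr_db(samples, noise_len=100)
    print(select_endpoints((12, 48), (10, 50), snr))
```

In practice the threshold would have to be tuned on data, and the audio and video endpoint detectors that produce the two candidate pairs are outside the scope of this sketch.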
