Speech Segmentation using Weighted Cross-correlation in CASA System

  • Kim, JungHo (Department of Electronics and Communication Engineering, Kwangwoon University)
  • Kang, ChulHo (Department of Electronics and Communication Engineering, Kwangwoon University)
  • Received : 2014.02.19
  • Accepted : 2014.04.30
  • Published : 2014.05.25

Abstract

The feature-extraction stage of a CASA (Computational Auditory Scene Analysis) system uses time continuity and cross-channel frequency similarity to compose a correlogram of auditory elements. In segmentation, a binary mask is composed using the cross-correlation function; a mask value of 1 (speech) indicates channels that share the same periodicity and synchrony. However, when the autocorrelation signals of two channels have the same periodicity but are delayed relative to each other, the pair is still incorrectly judged as speech, which is a drawback. In this paper, we propose an algorithm that improves the discrimination of channel similarity by using a weighted cross-correlation in segmentation. We evaluated the speech-segregation performance of the CASA system in background-noise environments (siren, machine, white, car, crowd) at SNRs of 5 dB and 0 dB, comparing the proposed algorithm against the conventional one. The proposed algorithm improved performance by 2.75 dB at an SNR of 5 dB and by 4.84 dB at an SNR of 0 dB.
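The idea above can be sketched in code. The following is a minimal illustration, not the paper's implementation: `xcorr_similarity` scores two channels by the best cross-correlation of their autocorrelation responses over all lags (the conventional criterion, which accepts a delayed match), and a hypothetical lag-decaying weight `exp(-alpha * |lag|)` penalizes matches found only at a delay, the failure case described in the abstract. The exact weighting function, filterbank, and thresholds used in the paper are not given here, so the names and the `alpha` parameter are assumptions.

```python
import numpy as np

def xcorr_similarity(a, b, alpha=0.0):
    """Similarity of two channels' autocorrelation responses a, b.

    alpha = 0 -> conventional criterion: best cross-correlation over
    all lags, so a delayed but same-period pair still scores high.
    alpha > 0 -> hypothetical weighted variant: each lag's correlation
    is multiplied by exp(-alpha * |lag|), penalizing delayed matches.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # zero-mean, unit-variance normalization so scores are comparable
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    n = len(a)
    best = -np.inf
    for lag in range(-(n - 1), n):
        if lag >= 0:
            v = np.dot(a[lag:], b[:n - lag]) / n
        else:
            v = np.dot(a[:n + lag], b[-lag:]) / n
        best = max(best, np.exp(-alpha * abs(lag)) * v)
    return best

# Two autocorrelation-like responses with the same period (16 samples);
# one pair is aligned, the other delayed by a quarter period.
l = np.arange(64)
acf = np.cos(2 * np.pi * l / 16)
acf_delayed = np.cos(2 * np.pi * (l - 4) / 16)

conventional = xcorr_similarity(acf, acf_delayed)         # high despite the delay
weighted = xcorr_similarity(acf, acf_delayed, alpha=0.5)  # delay is penalized
```

In a full segmenter, the channel would be assigned mask value 1 (speech) only when this similarity exceeds a threshold; with the weighted score, the delayed pair falls below a threshold that the aligned pair still clears.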
