A Parametric Voice Activity Detection Based on the SPD-TE for Nonstationary Noises

  • Koo, Boneung (구본응), Department of Electronic Engineering, Kyonggi University
  • Received : 2015.01.29
  • Accepted : 2015.04.07
  • Published : 2015.07.31

Abstract

This paper proposes a single-channel voice activity detection (VAD) algorithm for nonstationary noise environments. The threshold values of the feature parameter used for the VAD decision are updated adaptively from estimates of the means and standard deviations computed over past non-speech frames. The feature parameter, SPD-TE (Spectral Power Difference-Teager Energy), is obtained by applying the Teager energy to WPD (Wavelet Packet Decomposition) coefficients, and it was previously reported to be robust to noise as a VAD feature. Experiments with the TIMIT speech and NOISEX-92 noise databases show that, for SNR values from 10 dB down to -10 dB, the decision accuracy of the proposed algorithm is comparable to that of several typical VAD algorithms, including standardized ones.
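
The two ideas named in the abstract, the Teager energy and a threshold adapted from the statistics of past non-speech frames, can be illustrated with a minimal sketch. It is not the paper's algorithm: the exact SPD-TE definition, the WPD stage, and the threshold update rule are not given here. The rule `mean + k * std`, the class name `AdaptiveThresholdVAD`, and the parameters `k`, `history`, and `init_frames` are assumptions chosen for illustration only.

```python
from collections import deque
from statistics import mean, pstdev


def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


class AdaptiveThresholdVAD:
    """Toy frame classifier (hypothetical, not the paper's method).

    A frame is declared speech when its feature value exceeds
    mean + k * std of the feature over recent non-speech frames.
    """

    def __init__(self, k=3.0, history=50, init_frames=5):
        self.k = k                          # assumed threshold margin
        self.init_frames = init_frames      # frames assumed to be noise only
        self.noise = deque(maxlen=history)  # recent non-speech feature values

    def step(self, feature):
        # Bootstrap: treat the first few frames as non-speech.
        if len(self.noise) < self.init_frames:
            self.noise.append(feature)
            return False
        threshold = mean(self.noise) + self.k * pstdev(self.noise)
        is_speech = feature > threshold
        # Only non-speech frames update the noise statistics.
        if not is_speech:
            self.noise.append(feature)
        return is_speech


# One feature value per frame; here arbitrary numbers stand in for SPD-TE.
vad = AdaptiveThresholdVAD(k=3.0, history=50, init_frames=5)
features = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 5.0, 6.0, 1.0]
decisions = [vad.step(f) for f in features]
```

Because only frames classified as non-speech enter the history, the threshold follows slow changes in the noise level while speech frames leave it untouched. In this sketch that is what allows it to track nonstationary noise.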

References

  1. P. C. Loizou, Speech Enhancement (CRC Press, Boca Raton, 2007), pp. 309-400.
  2. J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Process. Lett. 6, 1-3 (1999).
  3. ITU, A silence compression scheme for G.729 optimized for terminals conforming to recommendation V.70, ITU-T Recommendation G.729-Annex B (1996).
  4. ETSI EN 301 708 V7.1.1 (1999-12), Digital cellular telecommunications system (Phase 2+); VAD for AMR speech traffic channels; General Description (GSM 06.94 version 7.1.1 Release 1998), 13-14 (1999).
  5. ETSI ES 202 050, Ver. 1.1.5 (2007-01), Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced front-end feature extraction algorithm; Compression algorithms, Annex A.3 Stage 2-VAD Logic, 42-43 (2007).
  6. J. Ramirez, J. C. Segura, C. Benitez, A. Torre, and A. Rubio, "Efficient voice activity detection algorithms using long-term speech information," Speech Commun. 42, 271-287 (2004). https://doi.org/10.1016/j.specom.2003.10.002
  7. A. Davis, S. Nordholm, and R. Togneri, "Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold," IEEE Trans. Audio, Speech and Lang. Processing 14, 412-424 (2006). https://doi.org/10.1109/TSA.2005.855842
  8. G. Evangelopoulos and P. Maragos, "Multiband modulation energy tracking for noisy speech detection," IEEE Trans. Audio, Speech and Lang. Processing 14, 2024-2038 (2006). https://doi.org/10.1109/TASL.2006.872625
  9. T. V. Pham and T. T. Chien, "Reliable voice activity detection algorithm under adverse environments," in Proc. IEEE Int. Conf. Commun. Electronics, 218-223 (2008).
  10. P. K. Ghosh and S. Narayanan, "Robust voice activity detection using long-term signal variability," IEEE Trans. Audio, Speech and Lang. Processing 19, 600-613 (2011). https://doi.org/10.1109/TASL.2010.2052803
  11. E. Chuangsuwanich and J. Glass, "Robust voice activity detector for real world application using harmonicity and modulation frequency," in Proc. Interspeech, 2645-2648 (2011).
  12. B. Koo, "A single channel voice activity detection for noisy environments using wavelet packet decomposition and Teager energy" (in Korean), J. Acoust. Soc. Kr. 33, 139-145 (2014). https://doi.org/10.7776/ASK.2014.33.2.139
  13. J. Garofolo, "TIMIT acoustic-phonetic continuous speech corpus," LDC93S1, Linguistic Data Consortium, Philadelphia, 1993.
  14. A. Varga and H. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun. 12, 247-251 (1993). https://doi.org/10.1016/0167-6393(93)90095-3