Noise Robust Automatic Speech Recognition Scheme with Histogram of Oriented Gradient Features

  • Park, Taejin (Audio Research Laboratory, Electronics and Telecommunications Research Institute)
  • Beack, SeungKwan (Audio Research Laboratory, Electronics and Telecommunications Research Institute)
  • Lee, Taejin (Audio Research Laboratory, Electronics and Telecommunications Research Institute)
  • Received: 2014.01.15
  • Accepted: 2014.07.28
  • Published: 2014.10.31

Abstract

In this paper, we propose a novel technique for noise-robust automatic speech recognition (ASR). Advances in ASR techniques have made it possible to recognize isolated words with near-perfect word recognition accuracy (WRA). In a highly noisy environment, however, the mismatch between the training speech and the test data significantly degrades the WRA. Unlike conventional ASR systems that employ Mel-frequency cepstral coefficients (MFCCs) and a hidden Markov model (HMM), this study applies histogram of oriented gradient (HOG) features and a support vector machine (SVM) to ASR tasks to overcome this problem. The proposed ASR system is less vulnerable to external interference noise and achieves a higher WRA than a conventional ASR system based on MFCCs and an HMM. The performance of the proposed system was evaluated using a phonetically balanced word (PBW) set mixed with artificially added noise.
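As a rough illustration of the HOG idea mentioned in the abstract, the following minimal sketch computes histogram of oriented gradient features over a 2-D array that stands in for a time-frequency representation of speech. The abstract does not say how the features are extracted, so the function name `hog_features`, the cell size, the number of orientation bins, and the per-cell L2 normalization are all illustrative assumptions rather than the authors' settings. The SVM classification stage is omitted.

```python
import math


def hog_features(image, cell=4, bins=8):
    """Hypothetical HOG sketch: per-cell histograms of unsigned gradient orientation.

    `image` is a 2-D list of floats (assumed here to be a spectrogram-like array).
    `cell` and `bins` are illustrative defaults, not values from the paper.
    """
    rows, cols = len(image), len(image[0])
    features = []
    # Walk over non-overlapping cell x cell blocks.
    for r0 in range(0, rows - cell + 1, cell):
        for c0 in range(0, cols - cell + 1, cell):
            hist = [0.0] * bins
            for r in range(r0, r0 + cell):
                for c in range(c0, c0 + cell):
                    # Central differences, clamped at the image borders.
                    gx = image[r][min(c + 1, cols - 1)] - image[r][max(c - 1, 0)]
                    gy = image[min(r + 1, rows - 1)][c] - image[max(r - 1, 0)][c]
                    magnitude = math.hypot(gx, gy)
                    # Unsigned orientation folded into [0, pi).
                    angle = math.atan2(gy, gx) % math.pi
                    index = min(int(angle / math.pi * bins), bins - 1)
                    # Each pixel votes for its orientation bin, weighted by magnitude.
                    hist[index] += magnitude
            # L2-normalize each cell histogram (small epsilon avoids division by zero).
            norm = math.sqrt(sum(h * h for h in hist)) + 1e-12
            features.extend(h / norm for h in hist)
    return features
```

For an 8 x 8 input with the defaults, the function returns 4 cells x 8 bins = 32 values. In a full system, a vector like this would be the input to a classifier such as an SVM.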
