Utilization of Phase Information for Speech Recognition

  • Received : 2015.08.01
  • Accepted : 2015.09.23
  • Published : 2015.09.30

Abstract

Mel-frequency cepstral coefficients (MFCC) are among the most valuable feature vectors in speech signal processing. An evident drawback of MFCC is that phase information is lost when the magnitude of the Fourier transform is taken. In this paper, we consider a method that uses phase information by treating the magnitudes of the real and imaginary components of the FFT separately. Applied to speech recognition with fuzzy vector quantization (FVQ) and hidden Markov models (HMM), the method yields a lower recognition error rate than conventional MFCC. We also show by numerical analysis that the optimal number of MFCC components is 12, six from the real part and six from the imaginary part of the FFT.
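The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the filterbank design, frame length, sampling rate, or which cepstral indices are kept, so the 20 triangular mel filters, the 512-sample frame, the 16 kHz rate, the choice of the first six DCT coefficients, and all function names below are assumptions made for illustration.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular mel filters over the n_fft // 2 + 1 non-negative FFT bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):      # rising edge
            bank[i, k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge
            bank[i, k] = (right - k) / (right - centre)
    return bank

def dct_ii(x, n_coeffs):
    """First n_coeffs coefficients of an unnormalised DCT-II."""
    n = len(x)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def cepstrum(spectrum, bank, n_coeffs):
    """Mel filterbank -> log -> DCT, applied to a non-negative spectrum."""
    energies = np.maximum(bank @ spectrum, 1e-10)  # floor avoids log(0)
    return dct_ii(np.log(energies), n_coeffs)

def magnitude_features(frame, bank, n_coeffs=12):
    """Conventional path: cepstrum of |FFT|, which discards phase."""
    spec = np.fft.rfft(frame, n=2 * (bank.shape[1] - 1))
    return cepstrum(np.abs(spec), bank, n_coeffs)

def split_phase_features(frame, bank, n_each=6):
    """Assumed reading of the abstract: |Re| and |Im| are processed separately."""
    spec = np.fft.rfft(frame, n=2 * (bank.shape[1] - 1))
    real_part = cepstrum(np.abs(spec.real), bank, n_each)
    imag_part = cepstrum(np.abs(spec.imag), bank, n_each)
    return np.concatenate([real_part, imag_part])

rng = np.random.default_rng(0)
bank = mel_filterbank(n_filters=20, n_fft=512, sample_rate=16000)
frame = rng.standard_normal(512)   # stand-in for one speech frame
shifted = np.roll(frame, 37)       # circular shift: same |FFT|, different phase
print(split_phase_features(frame, bank).shape)  # (12,)
```

A circular shift of the frame leaves the FFT magnitude unchanged, so `magnitude_features` cannot tell `frame` from `shifted`, whereas `split_phase_features` returns different vectors for the two. That is the sense in which keeping the real and imaginary parts apart retains some phase information.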
