Speech Recognition Using MSVQ/TDRNN

  • 김성석 (Department of Computer Science, Yongin University)
  • Received : 2014.04.16
  • Accepted : 2014.05.19
  • Published : 2014.07.31

Abstract

This paper presents a speech recognition method that combines multi-section vector quantization (MSVQ) with a time-delay recurrent neural network (TDRNN). The MSVQ generates a codebook from the voice signal divided into a normalized number of uniform sections, and the TDRNN performs recognition using the MSVQ codebook. The TDRNN is a classifier with two representations of dynamic context: the time-delayed input nodes represent local dynamic context, while the recursive nodes can represent the long-term dynamic context of the voice signal. Perceptual linear prediction (PLP) cepstral coefficients were used as speech features. In speaker-independent recognition experiments, the MSVQ/TDRNN speech recognizer achieved a word recognition rate of 97.9%.

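This paper proposes a hybrid speech recognition method that uses multi-section vector quantization (MSVQ) and a time-delay recurrent neural network (TDRNN). The MSVQ generates a codebook in which the length of each utterance is normalized to a fixed number of sections, and the TDRNN recognizes speech using this codebook. The TDRNN is structured so that it can learn the time-series context of speech well. Perceptual linear prediction (PLP) coefficients were used as speech features. In speech recognition experiments, the MSVQ/TDRNN speech recognizer achieved a speaker-independent recognition rate of 97.9%.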
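The two stages described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not give the codebook training procedure, layer sizes, or feedback wiring, so the sketch makes several assumptions. Each section is summarized by the mean of its frames (a real MSVQ codebook would typically be trained with a vector quantizer design algorithm), the recurrent context is fed back from the hidden layer, and the weights are random and untrained. The names `section_centroids` and `TinyTDRNN` are hypothetical.

```python
import math
import random
from typing import List

Frame = List[float]


def section_centroids(frames: List[Frame], num_sections: int) -> List[Frame]:
    """Normalize a variable-length utterance to num_sections mean vectors.

    Assumption: one mean vector per section stands in for a trained codebook.
    """
    n = len(frames)
    if num_sections <= 0 or n < num_sections:
        raise ValueError("need at least one frame per section")
    # Integer boundaries give contiguous, non-empty, near-equal sections.
    bounds = [(i * n) // num_sections for i in range(num_sections + 1)]
    dim = len(frames[0])
    centroids = []
    for start, end in zip(bounds, bounds[1:]):
        section = frames[start:end]
        centroids.append([sum(f[d] for f in section) / len(section) for d in range(dim)])
    return centroids


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class TinyTDRNN:
    """Untrained toy network: delayed-input window plus recurrent hidden state."""

    def __init__(self, input_dim: int, delay: int, hidden_dim: int,
                 num_classes: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def matrix(rows: int, cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.input_dim = input_dim
        self.delay = delay
        self.hidden_dim = hidden_dim
        self.w_in = matrix(hidden_dim, input_dim * (delay + 1))  # time-delay window
        self.w_rec = matrix(hidden_dim, hidden_dim)              # recurrent feedback
        self.w_out = matrix(num_classes, hidden_dim)

    def forward(self, sequence: List[Frame]) -> List[float]:
        # Zero-pad the start so every step sees delay + 1 frames (local context).
        padded = [[0.0] * self.input_dim for _ in range(self.delay)] + list(sequence)
        hidden = [0.0] * self.hidden_dim
        for t in range(len(sequence)):
            window = [x for frame in padded[t:t + self.delay + 1] for x in frame]
            # The previous hidden state carries longer-term context forward.
            hidden = [
                math.tanh(dot(self.w_in[j], window) + dot(self.w_rec[j], hidden))
                for j in range(self.hidden_dim)
            ]
        scores = [dot(row, hidden) for row in self.w_out]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]  # softmax over word classes


# Usage: 10 frames of 2-D toy features, normalized to 5 sections, then scored.
frames = [[float(t), 2.0 * t] for t in range(10)]
sections = section_centroids(frames, 5)
probs = TinyTDRNN(input_dim=2, delay=2, hidden_dim=4, num_classes=3).forward(sections)
```

Because every utterance is reduced to the same number of sections, the network always processes a fixed-length sequence regardless of how long the word was spoken. The toy features above stand in for PLP coefficients, which the sketch does not compute.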
