Automatic Clustering of Speech Data Using Modified MAP Adaptation Technique

수정된 MAP 적응 기법을 이용한 음성 데이터 자동 군집화

  • Received : 2014.01.07
  • Accepted : 2014.03.15
  • Published : 2014.03.31

Abstract

This paper proposes a speaker and environment clustering method to counter the degradation in speech recognition performance caused by varied noise conditions and speaker characteristics. Instead of the distance between Gaussian mixture model (GMM) weight vectors used in Google's approach, the proposed method uses the distance between mean vectors adapted by a modified maximum a posteriori (MAP) adaptation as the distance measure for vector quantization (VQ) clustering. In experiments on simulated data generated by adding noise to clean speech, the proposed clustering method reduces the error rate by 10.6% compared with the baseline speaker-independent (SI) model, slightly outperforming Google's approach.
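
The general idea of clustering utterances by the distance between adapted GMM mean vectors can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the details of the modified MAP adaptation, so the sketch substitutes the standard relevance-factor MAP mean update, and plain k-means stands in for the VQ clustering step. The function names (map_adapt_means, utterance_supervector, vq_cluster) and the relevance value are hypothetical and do not come from the paper.

```python
import numpy as np


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Standard relevance-MAP adaptation of diagonal-covariance GMM means.

    Assumption: this is the textbook update, NOT the paper's modified version.
    frames: (T, D), weights: (M,), means: (M, D), variances: (M, D).
    """
    # Log-likelihood of every frame under every component, shape (T, M).
    diff = frames[:, None, :] - means[None, :, :]
    log_prob = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2
    )
    # Component posteriors, computed stably in the log domain.
    log_post = np.log(weights) + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    occupancy = post.sum(axis=0)                       # soft counts, (M,)
    data_mean = (post.T @ frames) / np.maximum(occupancy, 1e-10)[:, None]
    alpha = (occupancy / (occupancy + relevance))[:, None]
    # Interpolate between the data mean and the prior mean per component.
    return alpha * data_mean + (1.0 - alpha) * means


def utterance_supervector(frames, weights, means, variances, relevance=16.0):
    """Stack the adapted means of one utterance into a single vector."""
    adapted = map_adapt_means(frames, weights, means, variances, relevance)
    return adapted.reshape(-1)


def vq_cluster(vectors, num_clusters, iterations=20, seed=0):
    """Plain k-means with Euclidean distance (stand-in for VQ clustering)."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), size=num_clusters, replace=False)
    centroids = vectors[start].copy()
    for _ in range(iterations):
        dists = np.linalg.norm(
            vectors[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = vectors[labels == k]
            if len(members) > 0:          # keep old centroid if cluster is empty
                centroids[k] = members.mean(axis=0)
    # Final assignment against the last centroids.
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids
```

In such a setup, each utterance would be mapped to one supervector with utterance_supervector, and the supervectors of all utterances would then be passed to vq_cluster. How closely this matches the paper's modified adaptation and its distance measure cannot be determined from the abstract alone.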
