Performance Comparison of Deep Feature Based Speaker Verification Systems

  • Received : 2015.08.29
  • Accepted : 2015.12.09
  • Published : 2015.12.31

Abstract

This paper compares the performance of speaker verification (SV) systems that use deep neural network (DNN) based features. First, three types of DNN input features, mel-frequency cepstral coefficients (MFCC), linear-frequency cepstral coefficients (LFCC), and perceptual linear prediction (PLP), are compared in terms of SV performance. Next, the effects of the DNN training method and the hidden-layer structure on SV performance are investigated for each feature type. Each SV system is then evaluated using either an I-vector or a probabilistic linear discriminant analysis (PLDA) scoring method. The SV experiments show that a tandem feature combining the DNN bottleneck feature with the MFCC feature gives the best performance when the DNN uses a rectangular hidden-layer structure and is trained with a supervised training method.
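As an illustration of the tandem feature mentioned in the abstract, the sketch below joins a DNN bottleneck vector and an MFCC vector frame by frame. It assumes that "tandem" means simple per-frame concatenation. The function name `tandem_features`, the ordering (MFCC first), and the toy dimensions are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Frame = List[float]


def tandem_features(
    mfcc: Sequence[Sequence[float]],
    bottleneck: Sequence[Sequence[float]],
) -> List[Frame]:
    """Concatenate per-frame MFCC and bottleneck vectors (assumed tandem scheme)."""
    # Both streams must be aligned frame by frame.
    if len(mfcc) != len(bottleneck):
        raise ValueError("both streams must have the same number of frames")
    # Each output frame is the MFCC vector followed by the bottleneck vector.
    return [list(m) + list(b) for m, b in zip(mfcc, bottleneck)]


# Toy data: 2 frames, 3-dimensional MFCC and 2-dimensional bottleneck vectors.
# Real systems use far larger dimensions; these values are made up.
mfcc_frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
bn_frames = [[1.0, 2.0], [3.0, 4.0]]
tandem = tandem_features(mfcc_frames, bn_frames)
```

Each tandem frame has a dimension equal to the sum of the two input dimensions (here 3 + 2 = 5). In a pipeline of the kind the abstract describes, frames like these would presumably feed the I-vector extractor in place of a single feature type.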
