Multi-resolution DenseNet based acoustic models for reverberant speech recognition

  • Received : 2018.02.08
  • Accepted : 2018.03.12
  • Published : 2018.03.31

Abstract

Although deep neural network-based acoustic models have greatly improved automatic speech recognition (ASR), reverberation still degrades distant speech recognition in indoor environments. In this paper, we adopt DenseNet, which has shown strong results in image classification, to improve reverberant speech recognition. DenseNet allows a deep convolutional neural network (CNN) to be trained effectively by feeding each convolutional layer the concatenated feature maps of the preceding layers. In addition, we extend the concept of the multi-resolution CNN to a multi-resolution DenseNet for robust speech recognition in reverberant environments. We evaluate reverberant speech recognition performance on the single-channel ASR task of the REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge 2014. In the experiments, the DenseNet-based acoustic models outperform conventional CNN-based ones, and the multi-resolution DenseNet provides a further improvement.
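The feature-map concatenation that the abstract attributes to DenseNet can be illustrated with a minimal NumPy sketch. This is not the acoustic model evaluated in the paper. The helper names (`init_weights`, `dense_block`), the use of 1x1 convolutions with ReLU, the growth rate of 4, and the input shape (3 feature maps, 40 frequency bins, 11 frames) are all assumptions chosen only to show how the number of channels grows as each layer's output is appended to its input.

```python
import numpy as np


def init_weights(in_channels, growth_rate, num_layers, seed=0):
    """Create one (growth_rate, in_channels_so_far) matrix per layer.

    Hypothetical helper: each layer is a 1x1 convolution, so its weights
    are just a matrix that mixes channels.
    """
    rng = np.random.default_rng(seed)
    weights = []
    channels = in_channels
    for _ in range(num_layers):
        weights.append(rng.standard_normal((growth_rate, channels)) * 0.1)
        channels += growth_rate  # the next layer sees all earlier maps
    return weights


def dense_block(x, weights):
    """Apply a dense block to x with shape (channels, freq, time)."""
    features = x
    for w in weights:
        # 1x1 convolution over the channel axis, followed by ReLU.
        new_maps = np.maximum(np.einsum("oc,cft->oft", w, features), 0.0)
        # Dense connectivity: append the new maps to everything seen so far.
        features = np.concatenate([features, new_maps], axis=0)
    return features


# Assumed toy input: 3 feature maps, 40 frequency bins, 11 frames.
x = np.ones((3, 40, 11))
weights = init_weights(in_channels=3, growth_rate=4, num_layers=3)
out = dense_block(x, weights)
print(out.shape)  # (15, 40, 11): 3 input maps + 3 layers * 4 new maps
```

Because every layer receives the original input and all intermediate maps directly, the first three channels of `out` are the unchanged input. A practical acoustic model would use larger convolution kernels, normalization, and transition layers between blocks, none of which are shown here.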
