Performance Enhancement of Phoneme and Emotion Recognition by Multi-task Training of Common Neural Network

  • Kim, Jaewon (Dept. of Electronics Engineering, Kwangwoon University)
  • Park, Hochong (Dept. of Electronics Engineering, Kwangwoon University)
  • Received : 2020.05.07
  • Accepted : 2020.09.09
  • Published : 2020.09.30

Abstract

This paper proposes a method for recognizing both phonemes and emotions with a single common neural network, together with a multi-task training method for that network. The common neural network performs the same operation for both recognition tasks, which corresponds to the way humans recognize multiple kinds of information with a single auditory system. The multi-task training models features that apply to both kinds of information and thus provides generalized training, which improves performance by reducing the overfitting that occurs in the conventional individual training for each type of information. A method for further increasing phoneme recognition performance is also proposed, in which phoneme recognition is given a larger weight during the multi-task training. When the same feature vector and neural network are used, the proposed common neural network with multi-task training is confirmed to provide higher performance than individual networks trained separately for each task.
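As a rough illustration of the setup described above, the following PyTorch sketch pairs a shared trunk (the common network) with separate phoneme and emotion output layers and trains them jointly with a weighted sum of the two cross-entropy losses. It is a minimal sketch, not the authors' implementation: the layer sizes, dropout rate, feature dimension, class counts, and the phoneme weight value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CommonNet(nn.Module):
    """Common (shared) network with two task-specific output heads."""
    def __init__(self, feat_dim=39, hidden_dim=256, num_phonemes=39, num_emotions=4):
        super().__init__()
        # Shared layers: perform the same operation for both recognition tasks.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        # Task-specific output layers for phoneme and emotion classes.
        self.phoneme_head = nn.Linear(hidden_dim, num_phonemes)
        self.emotion_head = nn.Linear(hidden_dim, num_emotions)

    def forward(self, x):
        h = self.shared(x)
        return self.phoneme_head(h), self.emotion_head(h)

def multitask_loss(ph_logits, em_logits, ph_labels, em_labels, phoneme_weight=1.5):
    """Weighted sum of the two cross-entropy losses; phoneme_weight > 1
    emphasizes phoneme recognition during the joint training (assumed value)."""
    ce = nn.CrossEntropyLoss()
    return phoneme_weight * ce(ph_logits, ph_labels) + ce(em_logits, em_labels)

# One illustrative training step on a dummy batch of speech feature vectors.
model = CommonNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 39)               # 32 feature vectors (dimension is illustrative)
y_ph = torch.randint(0, 39, (32,))    # phoneme labels
y_em = torch.randint(0, 4, (32,))     # emotion labels

ph_logits, em_logits = model(x)
loss = multitask_loss(ph_logits, em_logits, y_ph, y_em)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Setting phoneme_weight above 1 corresponds to the proposed emphasis on phoneme recognition; with the weight at 1, the two tasks contribute equally to the common training.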
