References
- A. Graves, A. R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," Proc. on IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 6645-6649, May 2013, doi:10.1109/ICASSP.2013.6638947.
- T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, no. 4, pp. 603-623, Nov. 2003, doi:10.1016/S0167-6393(03)00099-2.
- J. P. Campbell, "Speaker recognition: A tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, Sept. 1997, doi:10.1109/5.628714.
- W. J. Jang, H. W. Yun, S. H. Shin, H. J. Cho, W. Jang, and H. Park, "Music genre classification using spikegram and deep neural network," J. of Broadcast Engineering, vol. 22, no. 6, pp. 693-701, Nov. 2017, doi:10.5909/JBE.2017.22.6.693.
- S. H. Shin, H. W. Yun, W. J. Jang, and H. Park, "Extraction of acoustic features based on auditory spike code and its application to music genre classification," IET Signal Processing, vol. 13, no. 2, pp. 230-234, Apr. 2019, doi:10.1049/iet-spr.2018.5158.
- S. Han, J. Kim, S. An, S. Shin, and H. Park, "Speech feature extraction based on spikegram for phoneme recognition," J. of Broadcast Engineering, vol. 24, no. 5, pp. 735-742, Sept. 2019, doi:10.5909/JBE.2019.24.5.735.
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge and London, 2016.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, Jan. 2014, doi:10.5555/2627435.2670313.
- R. Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41-75, 1997, doi:10.1023/A:1007379606734.
- E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, "Polyphonic sound event detection using multi label deep neural networks," Proc. on Int. Joint Conf. on Neural Networks, pp. 1-7, July 2015, doi:10.1109/IJCNN.2015.7280624.
- S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010, doi:10.1109/TKDE.2009.191.
- B. Logan, "Mel frequency cepstral coefficients for music modeling," ISMIR, vol. 270, pp. 1-11, Oct. 2000.
- ETSI, Speech processing, transmission and quality aspects (STQ); Distributed speech recognition; Extended front-end feature extraction algorithm; Compression algorithm; Back-end speech reconstruction algorithm, ETSI ES 202 211, v1.1.1, Nov. 2003.
- X. Huang, A. Acero, and H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, pp. 423-424, 2001.
- C. Busso, M. Bulut, C. C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, Dec. 2008, doi:10.1007/s10579-008-9076-6.
- V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, vol. 9, no. 4, pp. 351-356, Aug. 1990, doi:10.1016/0167-6393(90)90010-7.
- K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," Proc. on IEEE Int. Conf. on Computer Vision, pp. 1026-1034, 2015, doi:10.1109/ICCV.2015.123.
- D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, Dec. 2014.