Figure 2.1. Sequence to sequence model.
Figure 2.2. Attention model (Bahdanau et al., 2014).
Figure 4.1. Mel-frequency cepstral coefficients.
Figure 4.2. The structure of the encoder.
Figure 4.3. A finite automaton that searches for correct Korean strings.
Table 5.1. Performance comparison between end-to-end deep learning models.
Table 5.2. Performance comparison when adding a finite automaton language model.
Table 5.3. Performance comparison with a commercial API.
References
- Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
- Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, 5, 157-166. https://doi.org/10.1109/72.279181
- Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
- Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2015). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 4960-4964. IEEE.
- Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
- Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078.
- Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- Gal, Y. and Ghahramani, Z. (2016a). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050-1059.
- Gal, Y. and Ghahramani, Z. (2016b). A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, 1019-1027.
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning (Vol. 1), MIT Press, Cambridge.
- Graves, A., Fernandez, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, 369-376. ACM.
- Gales, M. and Young, S. (2008). The application of hidden Markov models in speech recognition, Foundations and Trends in Signal Processing, 1, 195-304. https://doi.org/10.1561/2000000004
- Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory, Neural Computation, 9, 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Huang, X., Acero, A., and Hon, H. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey.
- Jelinek, F. (1997). Statistical Methods for Speech Recognition, MIT Press, Cambridge.
- Kim, S., Hori, T., and Watanabe, S. (2017). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 4835-4839. IEEE.
- Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kwon, O. W. and Park, J. (2003). Korean large vocabulary continuous speech recognition with morpheme-based recognition units, Speech Communication, 39, 287-300. https://doi.org/10.1016/S0167-6393(02)00031-6
- Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv preprint arXiv:1508.04025.