References
- D.W. Massaro, "Symbiotic value of an embodied agent in language learning," 37th Hawaii International Conference on System Sciences, Big Island, HI, USA, 2004, doi: 10.1109/HICSS.2004.1265333.
- B. Fan, L. Wang, F.K. Soong, and L. Xie, "Photo-real talking head with deep bidirectional LSTM," International Conference on Acoustics, Speech, and Signal Processing, Brisbane, QLD, Australia, 2015, doi: 10.1109/ICASSP.2015.7178899.
- L. Wang and F.K. Soong, "HMM trajectory-guided sample selection for photo-realistic talking head," Multimedia Tools and Applications, vol. 74, no. 22, pp. 9849-9869, Nov. 2014. https://doi.org/10.1007/s11042-014-2118-8
- A. Karpathy, J. Johnson, and L. Fei-Fei, "Visualizing and understanding recurrent networks," arXiv:1506.02078, 2015.
- E. Cosatto and H.P. Graf, "Sample-based synthesis of photo-realistic talking heads," Computer Animation 98, Philadelphia, PA, USA, pp. 103-110, 1998.
- V. Wan, R. Blokland, N. Braunschweiler, L. Chen, B. Kolluru, J. Latorre, R. Maia, B. Stenger, K. Yanagisawa, Y. Stylianou, M. Akamine, M.J.F. Gales, and R. Cipolla, "Photo-Realistic Expressive Text to Talking Head Synthesis," 14th Annual Conference of the International Speech Communication Association, Lyon, France, pp. 2667-2669, 2013.
- M. Schuster and K.K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, Nov. 1997.
- S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.
- A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, May 2009.
- S. Haq and P.J.B. Jackson, "Multimodal Emotion Recognition," in W. Wang, Ed., Machine Audition: Principles, Algorithms and Systems, Hershey, PA: IGI Global, 2011, pp. 398-423, doi: 10.4018/978-1-61520-919-4.ch017.
- R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28, no. 5, pp. 807-813, May 2010. https://doi.org/10.1016/j.imavis.2009.08.002
- O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, "Emotion Recognition by speech signals," 8th European Conference on Speech Communication and Technology, Geneva, Switzerland, pp. 125-128, 2003.
- Y. Pan, P. Shen, and L. Shen, "Speech emotion recognition using support vector machine," International Journal of Smart Home, vol. 6, no. 2, pp. 101-107, Apr. 2012.
- S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 1, pp. 52-59, Feb. 1986. https://doi.org/10.1109/TASSP.1986.1164788
- S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X.A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Microsoft Corporation, 1995.
- S. Imai, T. Kobayashi, K. Tokuda, T. Masuko, K. Koishida, S. Sako, and H. Zen, Speech signal processing toolkit (SPTK), [Online], http://sp-tk.sourceforge.net/, Accessed: Feb. 14, 2018.
- I. Matthews and S. Baker, "Active appearance models revisited," International Journal of Computer Vision, vol. 60, no. 2, pp. 135-164, Nov. 2004. https://doi.org/10.1023/B:VISI.0000029666.37597.d3
- J. Alabort-i-Medina, E. Antonakos, J. Booth, P. Snape, and S. Zafeiriou, "Menpo: a comprehensive platform for parametric image alignment and visual deformable models," 22nd ACM International Conference on Multimedia, Orlando, FL, USA, pp. 679-682, 2014.
- B.D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," 1981 DARPA Image Understanding Workshop, pp. 121-130, Apr. 1981.
- T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, pp. 26-30, 2012.
- W. Han, L. Wang, F. Soong, and B. Yuan, "Improved minimum converted trajectory error training for real-time speech-to-lips conversion," 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 2012, doi: 10.1109/ICASSP.2012.6288921.
- D.W. Massaro, J. Beskow, M.M. Cohen, C.L. Fry, and T. Rodriguez, "Picture My Voice: Audio to Visual Speech Synthesis using Artificial Neural Networks," Auditory-Visual Speech Processing, Santa Cruz, CA, USA, pp. 133-138, 1999.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929-1958, Jun. 2014.