Synthesis of Expressive Talking Heads from Speech with Recurrent Neural Network

Synthesis of Expressive Talking Heads from Speech Using an RNN

  • Received : 2018.01.23
  • Accepted : 2018.02.13
  • Published : 2018.02.28

Abstract

A talking head (TH) is a speaking-face animation generated from text and voice input. In this paper, we propose a method for generating a TH with facial expression and intonation from speech input alone. Generating a TH from speech can be regarded as a regression problem from an acoustic feature sequence to a facial code sequence, a low-dimensional vector representation that can efficiently encode and decode a face image. This regression was modeled with a bidirectional RNN and trained on the SAVEE database, a database of frontal speaking-face videos. The proposed method generates a TH with facial expression and intonation from acoustic features such as MFCCs, the dynamic (delta) components of MFCCs, energy, and F0. In our experiments, the configuration with BLSTM units in the first and second layers of the bidirectional RNN predicted the face codes best. For evaluation, a questionnaire survey was conducted with 62 participants who watched TH animations generated by the proposed method and by a previous method. As a result, 77% of the respondents answered that the TH generated by the proposed method matched the speech well.
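The sequence-to-sequence regression described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation; it assumes PyTorch and uses illustrative dimensions (e.g., 28-dimensional acoustic frames and 30-dimensional face codes), showing a two-layer bidirectional LSTM that maps each acoustic frame to a face-code vector and is trained with a mean-squared-error regression loss.

```python
# Minimal sketch (not the authors' code) of the speech-to-face-code regression:
# a two-layer bidirectional LSTM maps per-frame acoustic features
# (MFCC, delta-MFCC, energy, F0) to low-dimensional face codes (e.g., AAM-style
# parameters). Feature and code dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechToFaceCode(nn.Module):
    def __init__(self, acoustic_dim=28, face_code_dim=30, hidden_dim=128):
        super().__init__()
        # Two stacked BLSTM layers, mirroring the configuration reported
        # as best in the abstract (BLSTM as the first and second layers).
        self.blstm = nn.LSTM(acoustic_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Linear readout from the concatenated forward/backward states
        # to a face-code vector for every frame.
        self.readout = nn.Linear(2 * hidden_dim, face_code_dim)

    def forward(self, acoustic_seq):
        # acoustic_seq: (batch, frames, acoustic_dim)
        hidden, _ = self.blstm(acoustic_seq)
        return self.readout(hidden)  # (batch, frames, face_code_dim)

model = SpeechToFaceCode()
features = torch.randn(4, 200, 28)           # 4 utterances, 200 frames each
predicted_codes = model(features)             # predicted face-code sequences
target_codes = torch.randn(4, 200, 30)        # placeholder regression targets
loss = nn.MSELoss()(predicted_codes, target_codes)
loss.backward()
```

In practice, the predicted face-code sequence would be decoded frame by frame back into face images to render the animation; the random tensors above stand in for real acoustic features and ground-truth face codes.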

References

  1. D.W. Massaro, "Symbiotic value of an embodied agent in language learning," 37th Hawaii International Conference on System Sciences, Big Island, HI, USA, 2004, doi: 10.1109/HICSS.2004.1265333.
  2. B. Fan, L. Wang, F.K. Soong, and L. Xie, "Photo-real talking head with deep bidirectional LSTM," International Conference on Acoustics, Speech, and Signal Processing, Brisbane, QLD, Australia, 2015, doi: 10.1109/ICASSP.2015.7178899.
  3. L. Wang, and F.K. Soong, "HMM trajectory-guided sample selection for photo-realistic talking head," Multimedia Tools and Applications, vol. 74, no. 22, pp. 9849-9869, Nov., 2014. https://doi.org/10.1007/s11042-014-2118-8
  4. A. Karpathy, J. Johnson, and L. Fei-Fei, "Visualizing and understanding recurrent networks", arXiv:1506.02078, 2015.
  5. E. Cosatto, and H.P. Graf, "Sample-based synthesis of photo realistic talking heads," Computer Animation 98, Philadelphia, PA, USA, pp. 103-110, 1998.
  6. V. Wan, R. Blokland, N. Braunschweiler, L. Chen, B. Kolluru, J. Latorre, R. Maia, B. Stenger, K. Yanagisawa, Y. Stylianou, M. Akamine, M.J.F. Gales, and R. Cipolla, "Photo-Realistic Expressive Text to Talking Head Synthesis," 14th Annual Conference of the International Speech Communication Association, Lyon, France, pp. 2667-2669, 2013.
  7. M. Schuster, and K.K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, Nov., 1997.
  8. S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," A field guide to dynamical recurrent neural networks, IEEE Press, 2001.
  9. A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, May 2009.
  10. S. Haq and P.J.B. Jackson, "Multimodal Emotion Recognition," W. Wang, ed., Machine Audition: Principles, Algorithms and Systems, Hershey, PA: IGI Global, 2011, pp. 398-423, doi: 10.4018/978-1-61520-919-4.ch017.
  11. R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28, no. 5, pp. 807-813, May, 2010. https://doi.org/10.1016/j.imavis.2009.08.002
  12. O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, "Emotion Recognition by speech signals," 8th European Conference on Speech Communication and Technology, Geneva, Switzerland, pp. 125-128, 2003.
  13. Y. Pan, P. Shen, and L. Shen, "Speech emotion recognition using support vector machine," International Journal of Smart Home, vol. 6, no. 2, pp. 101-107, April, 2012.
  14. S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 1, pp. 52-59, Feb., 1986. https://doi.org/10.1109/TASSP.1986.1164788
  15. S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X.A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK book, Microsoft Corporation, 1995.
  16. S. Imai, T. Kobayashi, K. Tokuda, T. Masuko, K. Koishida, S. Sako, and H. Zen, Speech signal processing toolkit (SPTK), [Online], http://sp-tk.sourceforge.net/, Accessed: Feb. 14, 2018.
  17. I. Matthews and S. Baker, "Active appearance models revisited," International Journal of Computer Vision, vol. 60, no. 2, pp. 135-164, Nov., 2004. https://doi.org/10.1023/B:VISI.0000029666.37597.d3
  18. J. Alabort-i-Medina, E. Antonakos, J. Booth, P. Snape, and S. Zafeiriou, "Menpo: a comprehensive platform for parametric image alignment and visual deformable models," 22nd ACM international conference on Multimedia, Orlando, Florida, USA, pp. 679-682, 2014.
  19. B.D. Lucas, and T. Kanade, "An iterative image registration technique with an application to stereo vision," 1981 DARPA Image Understanding Workshop, pp. 121-130, April 1981.
  20. T. Tieleman, and G. Hinton, "Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude", COURSERA: Neural Networks for Machine Learning, vol. 4, pp. 26-30, 2012.
  21. W. Han, L. Wang, F. Soong, and B. Yuan, "Improved minimum converted trajectory error training for real-time speech-to-lips conversion," 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 2012, doi: 10.1109/ICASSP.2012.6288921.
  22. D.W. Massaro, J. Beskow, M.M. Cohen, C.L. Fry, and T. Rodriguez, "Picture My Voice: Audio to Visual Speech Synthesis using Artificial Neural Networks," Auditory-Visual Speech Processing, Santa Cruz, CA, USA, pp. 133-138, 1999.
  23. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, Vol. 15, pp. 1929-1958, Jun., 2014.