Speech Animation Synthesis based on a Korean Co-articulation Model

Jang, Minjung; Jung, Sunjin; Noh, Junyong

doi:10.15701/kcgs.2020.26.3.49

Journal of the Korea Computer Graphics Society (한국컴퓨터그래픽스학회논문지)

Volume 26, Issue 3 / Pages 49-59 / 2020 / pISSN 1975-7883 / eISSN 2383-529X

Korea Computer Graphics Society (한국컴퓨터그래픽스학회)



한국어 동시조음 모델에 기반한 스피치 애니메이션 생성

Jang, Minjung (KAIST, Visual Media Lab.) ;
Jung, Sunjin (KAIST, Visual Media Lab.) ;
Noh, Junyong (KAIST, Visual Media Lab.)

장민정 (카이스트 비주얼미디어연구실) ;
정선진 (카이스트 비주얼미디어연구실) ;
노준용 (카이스트 비주얼미디어연구실)

Received : 2020.06.19
Accepted : 2020.06.25
Published : 2020.07.01


Abstract

In this paper, we propose a speech animation synthesis method specialized for Korean, built on a rule-based coarticulation model. Speech animation is widely used across the cultural industry, in films, animation, and games that require natural and realistic motion. However, because audio-driven speech animation techniques have been developed mainly for English, the results for domestic (Korean) content are often visually unnatural. For example, a voice actor's dubbing is played with no mouth motion at all or, at best, with an unsynchronized loop of simple mouth shapes. Language-independent speech animation models exist, but because they are not specialized for Korean, they do not yet deliver the quality required for domestic content production. We therefore propose a method that takes audio and text as input and synthesizes natural speech animation reflecting the linguistic characteristics of Korean. Because vowels largely determine the mouth shape in Korean, we define a coarticulation model that treats the lips and the tongue separately, which addresses the lip distortion and the occasional loss of phoneme characteristics seen in previous approaches. The model also accounts for differences in prosodic features to produce more dynamic speech animation. User studies confirm that the proposed model synthesizes natural speech animation.

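This paper proposes a model that generates speech animation specialized for Korean through a rule-based coarticulation model. Techniques for generating mouth-shape animation that matches speech have been studied extensively, mainly for English, and are widely used across the cultural industry in films, animation, and games that require natural and realistic motion. In much domestic content, however, speech animation is either omitted or played as a simple loop unrelated to the audio and then dubbed by a voice actor, which looks very unnatural. Moreover, language-independent approaches that are not specialized for Korean do not yet guarantee the quality needed for domestic content production. This paper therefore proposes a technique that takes audio and text as input and generates natural speech animation reflecting the linguistic characteristics of Korean. Reflecting the fact that mouth shape in Korean is determined mostly by vowels, we define a coarticulation model that separates the lips and the tongue, resolving earlier problems in which lip shapes were distorted or the characteristics of some phonemes were lost; the model further reflects differences due to prosodic elements, enabling more dynamic speech animation. A user study verified that the proposed model generates natural speech animation, and it is expected to contribute greatly to the development of the domestic cultural industry.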
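
The abstract gives no implementation details, so the following is only a rough sketch of the general idea that vowels drive the lips while consonants sit on a separate tongue channel. It is not the authors' model. The syllable decomposition uses the standard Unicode arithmetic for precomposed Hangul; the `LIP_SHAPE` table, its three shape classes, and the function names are hypothetical and chosen purely for illustration.

```python
# Illustrative sketch only: the rule table and function names are assumptions,
# not taken from the paper.

HANGUL_BASE = 0xAC00  # first precomposed Hangul syllable, U+AC00
INITIALS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")            # 19 initial consonants
MEDIALS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")          # 21 vowels
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28, "" = no final

# Hypothetical, deliberately partial vowel-to-lip-shape table.
LIP_SHAPE = {"ㅏ": "open", "ㅗ": "rounded", "ㅜ": "rounded",
             "ㅣ": "spread", "ㅡ": "spread"}


def decompose(syllable):
    """Split one precomposed Hangul syllable into (initial, medial, final).

    Returns None for any character outside U+AC00..U+D7A3.
    """
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < 19 * 21 * 28:
        return None
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


def lip_and_tongue_tracks(text):
    """Build two parallel per-syllable tracks from Korean text.

    lips:   one shape label per syllable, chosen from the vowel alone.
    tongue: (initial, final) consonants per syllable; a syllable-initial
            'ㅇ' is silent and is stored as an empty string.
    """
    lips, tongue = [], []
    for ch in text:
        parts = decompose(ch)
        if parts is None:
            continue  # skip spaces, punctuation, non-Hangul characters
        initial, medial, final = parts
        lips.append(LIP_SHAPE.get(medial, "neutral"))
        tongue.append(("" if initial == "ㅇ" else initial, final))
    return lips, tongue


if __name__ == "__main__":
    print(lip_and_tongue_tracks("한국어"))
```

Keeping the two tracks separate means a consonant never overwrites the lip shape set by its vowel, which is the kind of lip distortion the abstract says the proposed model avoids. A real system would also need timing from an audio alignment and blending between neighboring syllables, neither of which is shown here.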

