Acknowledgement
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2023-RS-2023-00256629) grant funded by the Korea government (MSIT), the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2024-00437718) supervised by IITP, and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00219107).
References
- P. Pataranutaporn et al., "AI-generated characters for supporting personalized learning and well-being," Nat Mach Intell, vol. 3, no. 12, pp. 1013-1022, Dec. 2021, doi: 10.1038/s42256-021-00417-9.
- J. Wang, X. Qian, M. Zhang, R. T. Tan, and H. Li, "Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada: IEEE, Jun. 2023, pp. 14653-14662. doi: 10.1109/CVPR52729.2023.01408.
- B. Zhang, X. Zhang, N. Cheng, J. Yu, J. Xiao, and J. Wang, "EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model," Jan. 15, 2024, arXiv: arXiv:2401.08049. Accessed: Sep. 09, 2024. [Online]. Available: http://arxiv.org/abs/2401.08049
- K. Cheng et al., "VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild," Nov. 27, 2022, arXiv: arXiv:2211.14758. Accessed: Sep. 09, 2024. [Online]. Available: http://arxiv.org/abs/2211.14758
- X. Ji et al., "Audio-Driven Emotional Video Portraits," May 19, 2021, arXiv: arXiv:2104.07452. Accessed: Jun. 07, 2024. [Online]. Available: http://arxiv.org/abs/2104.07452
- L. Chen, R. K. Maddox, Z. Duan, and C. Xu, "Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss," May 09, 2019, arXiv: arXiv:1905.03820. Accessed: Jun. 21, 2024. [Online]. Available: http://arxiv.org/abs/1905.03820
- S. Tan, B. Ji, and Y. Pan, "EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation," in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France: IEEE, Oct. 2023, pp. 22089-22099. doi: 10.1109/ICCV51070.2023.02024.
- J. Wang, Y. Zhao, L. Liu, T. Xu, Q. Li, and S. Li, "Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks," Jun. 06, 2023, arXiv: arXiv:2306.03594. Accessed: Apr. 04, 2024. [Online]. Available: http://arxiv.org/abs/2306.03594
- M. Stypulkowski, K. Vougioukas, S. He, M. Zieba, S. Petridis, and M. Pantic, "Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation," in 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA: IEEE, Jan. 2024, pp. 5089-5098. doi: 10.1109/WACV57701.2024.00502.
- S. Shen et al., "DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 1982-1991. Accessed: Jun. 17, 2024. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2023/html/Shen_DiffTalk_Crafting_Diffusion_Models_for_Generalized_Audio-Driven_Portraits_Animation_CVPR_2023_paper.html
- H.-S. Vo-Thanh, Q.-V. Nguyen, and S.-H. Kim, "KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation," Sep. 09, 2024, arXiv: arXiv:2409.05330. Accessed: Sep. 10, 2024. [Online]. Available: http://arxiv.org/abs/2409.05330
- A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," 2020, arXiv: arXiv:2006.11477. Accessed: Sep. 10, 2024. [Online]. Available: https://arxiv.org/abs/2006.11477v3
- Z. Liu et al., "KAN: Kolmogorov-Arnold Networks," 2024, arXiv: arXiv:2404.19756. Accessed: Sep. 10, 2024. [Online]. Available: https://arxiv.org/abs/2404.19756v4
- K. Wang et al., "MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation," in Computer Vision - ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., Lecture Notes in Computer Science, vol. 12366, Cham: Springer International Publishing, 2020, pp. 700-717. doi: 10.1007/978-3-030-58589-1_42.
- Y. Zhou, X. Han, E. Shechtman, J. Echevarria, E. Kalogerakis, and D. Li, "MakeItTalk: speaker-aware talking-head animation," ACM Trans. Graph., vol. 39, no. 6, pp. 1-15, Dec. 2020, doi: 10.1145/3414685.3417774.
- "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild | Proceedings of the 28th ACM International Conference on Multimedia." Accessed: Sep. 10, 2024. [Online]. Available: https://dl.acm.org/doi/10.1145/3394171.3413532