Lip and Voice Synchronization Using Visual Attention

  • Dongryun Yoon (Energy IT Convergence Research Center, Korea Electronics Technology Institute) ;
  • Hyeonjoong Cho (Department of Computer Convergence Software, Korea University)
  • Received : 2023.11.22
  • Accepted : 2024.03.12
  • Published : 2024.04.30

Abstract


This study addresses lip-sync detection, i.e., determining whether the lip movements and the voice in a video are synchronized. Conventional approaches detect the face, crop its bounding box, and feed the lower half of the crop to a visual encoder to extract the visual features used for lip-sync detection. To make the model focus more strongly on the lips, the articulatory region that actually produces the speech, we propose using a pre-trained visual-attention-based encoder. Specifically, we adopt the Visual Transformer Pooling (VTP) module, originally designed for lip reading (predicting the spoken text from visual information alone, without audio), as the visual encoder. Despite having fewer trainable parameters, the proposed method outperforms the recent VocaLiST model on the LRS2 dataset, reaching a lip-sync detection accuracy of 94.5% with five context frames. It also surpasses VocaLiST by roughly 8% on Acappella, a dataset not used during training.
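To make the pipeline described in the abstract more concrete, the sketch below shows, in PyTorch, one way an attention-pooled visual encoder and a frame-aligned audio encoder could be combined to score lip-voice synchronization over a five-frame context window. This is an illustrative toy under assumed shapes and module names (VTPStyleVisualEncoder, AudioEncoder, and LipSyncScorer are all hypothetical), not the authors' architecture or the actual VTP implementation.

# Minimal, illustrative sketch (not the paper's implementation) of scoring
# lip-voice synchronization with an attention-pooled visual encoder.
import torch
import torch.nn as nn


class VTPStyleVisualEncoder(nn.Module):
    """Toy stand-in for an attention-pooling visual encoder: a small CNN yields
    a grid of spatial features per frame, and a learned query attends over the
    grid so lip-region features can be weighted more heavily than the rest of
    the face crop."""

    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frames):                      # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1))     # (B*T, dim, h, w)
        x = x.flatten(2).transpose(1, 2)            # (B*T, h*w, dim) spatial tokens
        q = self.query.expand(x.size(0), -1, -1)
        pooled, _ = self.attn(q, x, x)              # attention pooling over space
        return pooled.reshape(b, t, -1)             # (B, T, dim) per-frame embeddings


class AudioEncoder(nn.Module):
    """Toy mel-spectrogram encoder producing one embedding per video frame."""

    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, mels):                        # (B, T, n_mels), time-aligned to frames
        return self.proj(mels)                      # (B, T, dim)


class LipSyncScorer(nn.Module):
    """Scores whether a window of frames and audio are in sync (binary logit)."""

    def __init__(self, dim=128):
        super().__init__()
        self.visual = VTPStyleVisualEncoder(dim)
        self.audio = AudioEncoder(dim=dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, frames, mels):
        v = self.visual(frames)                     # (B, T, dim)
        a = self.audio(mels)                        # (B, T, dim)
        fused = (v * a).mean(dim=1)                 # average over the context window
        return self.head(fused).squeeze(-1)         # higher logit => "in sync"


if __name__ == "__main__":
    model = LipSyncScorer()
    frames = torch.randn(2, 5, 3, 96, 96)           # five-frame context window
    mels = torch.randn(2, 5, 80)                     # matching audio features
    print(model(frames, mels).shape)                 # torch.Size([2])

In practice, a scorer of this kind is typically trained with a binary or contrastive objective on in-sync versus artificially time-shifted audio-video pairs; the specifics here (loss, window length, feature dimensions) are assumptions for illustration only.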

Keywords

Acknowledgments

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education in 2023 (No. NRF-2021R1F1A1049202).

References

  1. Y. Shalev and L. Wolf, "End to end lip synchronization with a temporal autoencoder," Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020.
  2. K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, "A lip sync expert is all you need for speech to lip generation in the wild," Proceedings of the 28th ACM International Conference on Multimedia, pp.484-492, 2020.
  3. P. Ma, S. Petridis, and M. Pantic, "End-to-end audio-visual speech recognition with conformers," In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp.7613-7617, 2021.
  4. T. Makino et al., "Recurrent neural network transducer for audio-visual speech recognition," In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, pp.905-912, 2019.
  5. A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, "Transformers are RNNs: Fast autoregressive transformers with linear attention," In: International Conference on Machine Learning, PMLR, pp.5156-5165, 2020.
  6. V. S. Kadandale, J. F. Montesinos, and G. Haro, "VocaLiST: An audio-visual synchronisation model for lips and voices," In: Interspeech, pp.3128-3132, 2022.
  7. J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip Reading Sentences in the Wild," In: IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  8. T. Afouras, J. S. Chung, and A. Zisserman, "LRS3-TED: a large-scale dataset for visual speech recognition," arXiv preprint arXiv:1809.00496, 2018.
  9. S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), IEEE. Vol.1, pp.539-546, 2005.
  10. B. V. Mahavidyalaya, "Phoneme and viseme based approach for lip synchronization," International Journal of Signal Processing, Image Processing and Pattern Recognition, Vol.7, No.3, pp.385-394, 2014. https://doi.org/10.14257/ijsip.2014.7.3.31
  11. J. S. Chung and A. Zisserman, "Out of time: automated lip sync in the wild," In: Workshop on Multi-view Lipreading, ACCV. 2016.
  12. Y. J. Kim, H. S. Heo, S. W. Chung, and B. J. Lee, "End-to-end lip synchronisation based on pattern classification," In: 2021 IEEE Spoken Language Technology Workshop (SLT), IEEE, pp.598-605, 2021.
  13. A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
  14. A. Bulat and G. Tzimiropoulos, "How far are we from solving the 2D & 3D Face Alignment problem? (and a dataset of 230,000 3D facial landmarks)," In: International Conference on Computer Vision, 2017.
  15. Y. Zhang, S. Yang, J. Xiao, S. Shan, and X. Chen, "Can we read speech beyond the lips? rethinking roi selection for deep visual speech recognition," In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), IEEE, pp.356-363, 2020.
  16. K. R. Prajwal, T. Afouras, and A. Zisserman, "Sub-word level lip reading with visual attention," In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.5162-5172, 2022.
  17. A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, "Transformers are RNNs: Fast autoregressive transformers with linear attention," In: International Conference on Machine Learning, PMLR, pp.5156-5165, 2020.
  18. Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," In: Proceedings of the Conference, Association for Computational Linguistics, Meeting, NIH Public Access, p.6558, 2019.
  19. J. F. Montesinos, V. S. Kadandale, and G. Haro, "Acappella: audio-visual singing voice separation," In: 32nd British Machine Vision Conference, BMVC 2021, 2021.
  20. S.-W. Chung, J. S. Chung, and H.-G. Kang, "Perfect match: Improved cross-modal embeddings for audio-visual synchronisation," In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp.3965-3969, 2019.
  21. H. Gupta, "Perceptual synchronization scoring of dubbed content using phoneme-viseme agreement," Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024.