CoNSIST: Consist of New Methodologies on AASIST for Audio Deepfake Detection

  • 하재훈 (Yonsei University, Interdisciplinary Graduate Program of Digital Analytics) ;
  • 문주원 (Yonsei University, Interdisciplinary Graduate Program of Digital Analytics) ;
  • 이상엽 (Yonsei University, Department of Communication)
  • Received : 2024.07.09
  • Accepted : 2024.09.12
  • Published : 2024.10.31

Abstract

Advancements in artificial intelligence (AI) have significantly improved deep learning-based audio deepfake technology, which has been exploited for criminal activities. To detect audio deepfakes and prevent such harm, we propose CoNSIST, an advanced audio deepfake detection model. CoNSIST builds on AASIST, a graph-based end-to-end model, by integrating three key components: Squeeze and Excitation, Positional Encoding, and Reformulated HS-GAL. These additions aim to enhance feature extraction, eliminate unnecessary operations, and incorporate diverse information. Our experimental results demonstrate that CoNSIST significantly outperforms existing models in detecting audio deepfakes, offering a more robust solution to combat the misuse of this technology.
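Two of the three added components, Squeeze-and-Excitation [16] and Positional Encoding [18], are standard building blocks. The following is an illustrative pure-Python sketch of their mechanics only — the function names, shapes, and omission of biases are our assumptions, not the CoNSIST implementation:

```python
import math

def squeeze_and_excitation(feature_map, w_reduce, w_expand):
    """Sketch of an SE block over a (channels x time) feature map.

    w_reduce / w_expand are plain weight matrices (lists of rows) for the
    bottleneck MLP; biases are omitted for brevity.
    """
    # Squeeze: global average pooling collapses each channel to one scalar.
    squeezed = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation: bottleneck (ReLU) then expansion (sigmoid) gives per-channel gates.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: recalibrate each channel by its gate.
    return [[x * g for x in ch] for ch, g in zip(feature_map, gates)]

def sinusoidal_positional_encoding(num_positions, dim):
    """Fixed sine/cosine positional encodings as introduced in [18]."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Even indices get sine, odd indices cosine, at geometrically
            # spaced frequencies.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe
```

The sketch only shows the mechanics: SE gates each channel by a learned summary of its global content, which is how it can sharpen feature extraction, while the fixed sine/cosine table injects position information without extra learned parameters.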

Acknowledgement

This work was supported by a 2024 grant from the AI Creative Autonomous Research Program of the Yonsei University Graduate School of Artificial Intelligence.

References

  1. J. Yi, C. Wang, J. Tao, X. Zhang, C. Y. Zhang, and Y. Zhao, "Audio Deepfake Detection: A Survey." ArXiv (Cornell University), 28 Aug. 2023, https://doi.org/10.48550/arxiv.2308.14970
  2. K. H. Jung and C. H. Kim, "Beware of Voice Cloning: Deep Voice Crime Steals 400 Billion Won," Moneytoday, 11 Feb. 2023, news.mt.co.kr/mtview.php?no=2023020913433930492.
  3. J. W. Jung et al., "AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks," ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 23 May 2022, https://doi.org/10.1109/icassp43922.2022.9747766.
  4. A. Hamza et al., "Deepfake Audio Detection via MFCC Features Using Machine Learning," IEEE Access, Vol.10, pp.134018-134028, 2022, https://doi.org/10.1109
  5. M. Lataifeh and A. Elnagar, "Ar-DAD: Arabic Diversified Audio Dataset," Data in Brief, pp.106503, Nov. 2020, https://doi.org/10.1016/j.dib.2020.106503.
  6. C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, "Synthetic Speech Detection through Short-Term and Long-Term Prediction Traces," EURASIP Journal on Information Security, No.1, 6 Apr. 2021, https://doi.org/10.1186/s13635-021-00116-3.
  7. A. K. Singh and P. Singh, "Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics," 2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR), Sept. 2021, https://doi.org/10.1109/mipr51284.2021.00076.
  8. A. Chintha et al., "Recurrent Convolutional Structures for Audio Spoof and Video Deepfake Detection," IEEE Journal of Selected Topics in Signal Processing, Vol.14, No.5, pp.1024-1037, 2020, https://doi.org/10.1109/jstsp.2020.2999185.
  9. X. Liu, M. Liu, L. Wang, K. A. Lee, H. Zhang, and J. Dang, "Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection," 4 June 2023, https://doi.org/10.1109/icassp49357.2023.10096278.
  10. H. Tak, J. W. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, "End-To-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection," ArXiv (Cornell University), 1 Jan. 2021, https://doi.org/10.48550/arxiv.2107.12710.
  11. H. Tak, J. W. Jung, J. Patino, M. Todisco, and N. Evans, "Graph Attention Networks for Anti-Spoofing." ArXiv (Cornell University), 30 Aug. 2021, https://doi.org/10.21437/interspeech.2021-993.
  12. J. W. Jung, S. B. Kim, H. J. Shim, J. H. Kim, and H. J. Yu, "Improved RawNet with Feature Map Scaling for Text-Independent Speaker Verification Using Raw Waveforms," ArXiv (Cornell University), 25 Oct. 2020, https://doi.org/10.21437/interspeech.2020-1011.
  13. H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, "End-To-End Anti-Spoofing with RawNet2." HAL (Le Centre Pour La Communication Scientifique Directe), 6 June 2021, https://doi.org/10.1109/icassp39728.2021.9414234.
  14. X. Wang et al., "Heterogeneous Graph Attention Network," The World Wide Web Conference, 13 May 2019, https://doi.org/10.1145/3308558.3313562.
  15. P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, "Graph Attention Networks," arXiv (Cornell University), Feb. 2018, https://doi.org/10.48550/arXiv.1710.10903.
  16. J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, "Squeeze-and-Excitation Networks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, https://doi.org/10.48550/arXiv.1709.01507.
  17. P. Dufter, M. Schmitt, and H. Schütze, "Position Information in Transformers: An Overview," Computational Linguistics, Vol.48, No.3, pp.733-763, 2022, https://doi.org/10.1162/coli_a_00445.
  18. A. Vaswani et al., "Attention Is All You Need," Advances in Neural Information Processing Systems, Vol.30, pp.5998-6008, 2017, https://doi.org/10.48550/arXiv.1706.03762.
  19. X. Wang et al., "ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech," ArXiv (Cornell University), 4 Nov. 2019, https://doi.org/10.48550/arxiv.1911.01601.
  20. T. Kinnunen et al., "t-DCF: A Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification," Odyssey 2018 The Speaker and Language Recognition Workshop, 26 June 2018, www.isca-speech.org/archive/Odyssey_2018/pdfs/68.pdf, https://doi.org/10.21437/odyssey.2018-44.
  21. X. Wang and J. Yamagishi, "A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection," ArXiv (Cornell University), 30 Aug. 2021, https://doi.org/10.21437/interspeech.2021-702.