CoNSIST: Consist of New Methodologies on AASIST for Audio Deepfake Detection

CoNSIST: A Study of New Graph-Attention-Based Modeling Methodologies for Audio Deepfake Detection

  • 하재훈 (Yonsei University, Interdisciplinary Graduate Program in Digital Analytics) ;
  • 문주원 (Yonsei University, Interdisciplinary Graduate Program in Digital Analytics) ;
  • 이상엽 (Yonsei University, Department of Communication)
  • Received : 2024.07.09
  • Accepted : 2024.09.12
  • Published : 2024.10.31

Abstract

Advancements in artificial intelligence (AI) have significantly improved deep learning-based audio deepfake technology, which has been exploited for criminal activities. To detect audio deepfakes, we propose CoNSIST, an advanced audio deepfake detection model. CoNSIST builds on AASIST, a graph-based end-to-end model, by integrating three key components: Squeeze-and-Excitation, Positional Encoding, and a Reformulated HS-GAL. These additions aim to enhance feature extraction, eliminate unnecessary operations, and incorporate diverse information. Our experimental results demonstrate that CoNSIST significantly outperforms existing models in detecting audio deepfakes, offering a more robust defense against the misuse of this technology.
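One of the components named above, Squeeze-and-Excitation (Hu et al., 2018), recalibrates a convolutional feature map by learning a per-channel importance weight. The following is a minimal NumPy sketch of that general mechanism, not the CoNSIST implementation; the weight matrices `w1`/`w2` and the reduction ratio `r` are illustrative placeholders.

```python
import numpy as np

def squeeze_excite(feature_map, w1, w2):
    """Illustrative Squeeze-and-Excitation on a (channels, height, width)
    feature map; weights are assumed, not taken from the paper."""
    # Squeeze: global average pooling collapses each channel to one scalar.
    z = feature_map.mean(axis=(1, 2))                           # shape (c,)
    # Excitation: bottleneck MLP (ReLU), then sigmoid gating weights in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ w1, 0.0) @ w2)))   # shape (c,)
    # Scale: reweight each channel by its learned importance.
    return feature_map * s[:, None, None]

rng = np.random.default_rng(0)
c, r = 8, 2                                # channels, reduction ratio (assumed)
x = rng.standard_normal((c, 16, 16))
w1 = rng.standard_normal((c, c // r)) * 0.1
w2 = rng.standard_normal((c // r, c)) * 0.1
y = squeeze_excite(x, w1, w2)
```

Because the gate is a sigmoid, each channel is scaled by a factor in (0, 1), letting the network suppress uninformative channels while preserving the feature map's shape.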


Acknowledgement

This research was supported by the 2024 AI Creative Autonomous Research Program of the Yonsei University Graduate School of Artificial Intelligence.

References

  1. J. Yi, C. Wang, J. Tao, X. Zhang, C. Y. Zhang, and Y. Zhao, "Audio Deepfake Detection: A Survey." ArXiv (Cornell University), 28 Aug. 2023, https://doi.org/10.48550/arxiv.2308.14970
  2. K. H. Jung and C. H. Kim, "Beware of Voice Cloning: Deep Voice Crime Steals 400 Billion Won." Moneytoday, 11 Feb. 2023, news.mt.co.kr/mtview.php?no=2023020913433930492.
  3. J. W. Jung et al., "AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks," ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 23 May 2022, https://doi.org/10.1109/icassp43922.2022.9747766.
  4. A. Hamza et al., "Deepfake Audio Detection via MFCC Features Using Machine Learning," IEEE Access, Vol.10, pp.134018-134028, 2022.
  5. M. Lataifeh and A. Elnagar, "Ar-DAD: Arabic Diversified Audio Dataset," Data in Brief, Art. no.106503, Nov. 2020, https://doi.org/10.1016/j.dib.2020.106503.
  6. C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, "Synthetic Speech Detection through Short-Term and Long-Term Prediction Traces," EURASIP Journal on Information Security, No.1, 6 Apr. 2021, https://doi.org/10.1186/s13635-021-00116-3.
  7. A. K. Singh and P. Singh, "Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics," 2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR), Sept. 2021, https://doi.org/10.1109/mipr51284.2021.00076.
  8. A. Chintha et al., "Recurrent Convolutional Structures for Audio Spoof and Video Deepfake Detection," IEEE Journal of Selected Topics in Signal Processing, Vol.14, No.5, pp.1024-1037, 2020, https://doi.org/10.1109/jstsp.2020.2999185.
  9. X. Liu, M. Liu, L. Wang, K. A. Lee, H. Zhang, and J. Dang, "Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection," 4 June 2023, https://doi.org/10.1109/icassp49357.2023.10096278.
  10. H. Tak, J. W. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, "End-To-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection," ArXiv (Cornell University), 1 Jan. 2021, https://doi.org/10.48550/arxiv.2107.12710.
  11. H. Tak, J. W. Jung, J. Patino, M. Todisco, and N. Evans, "Graph Attention Networks for Anti-Spoofing." ArXiv (Cornell University), 30 Aug. 2021, https://doi.org/10.21437/interspeech.2021-993.
  12. J. W. Jung, S. B. Kim, H. J. Shim, J. H. Kim, and H. J. Yu, "Improved RawNet with Feature Map Scaling for Text-Independent Speaker Verification Using Raw Waveforms," ArXiv (Cornell University), 25 Oct. 2020, https://doi.org/10.21437/interspeech.2020-1011.
  13. H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, "End-To-End Anti-Spoofing with RawNet2." HAL (Le Centre Pour La Communication Scientifique Directe), 6 June 2021, https://doi.org/10.1109/icassp39728.2021.9414234.
  14. X. Wang et al., "Heterogeneous Graph Attention Network," The World Wide Web Conference, 13 May 2019, https://doi.org/10.1145/3308558.3313562.
  15. P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, "Graph Attention Networks," arXiv (Cornell University), Feb. 2018, https://doi.org/10.48550/arXiv.1710.10903.
  16. J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, "Squeeze-and-Excitation Networks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, https://doi.org/10.48550/arXiv.1709.01507.
  17. P. Dufter, M. Schmitt, and H. Schütze, "Position Information in Transformers: An Overview," Computational Linguistics, Vol.48, No.3, pp.733-763, 2022, https://doi.org/10.1162/coli_a_00445.
  18. A. Vaswani et al., "Attention Is All You Need," Advances in Neural Information Processing Systems, Vol.30, pp.5998-6008, 2017, https://doi.org/10.48550/arXiv.1706.03762.
  19. X. Wang et al., "ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech," ArXiv (Cornell University), 4 Nov. 2019, https://doi.org/10.48550/arxiv.1911.01601.
  20. T. Kinnunen et al., "t-DCF: A Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification," Odyssey 2018 The Speaker and Language Recognition Workshop, 26 June 2018, https://doi.org/10.21437/odyssey.2018-44.
  21. X. Wang and J. Yamagishi, "A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection," ArXiv (Cornell University), 30 Aug. 2021, https://doi.org/10.21437/interspeech.2021-702.