A Data Sampling Technique for Secure Dataset Using Weight VAE Oversampling (W-VAE)

  • Kang, Hanbada (Department of Convergence Security, Chung-Ang University)
  • Lee, Jaewoo (Department of Industrial Security, Chung-Ang University)
  • Received : 2022.10.20
  • Accepted : 2022.11.09
  • Published : 2022.12.31

Abstract

Advances in artificial intelligence have spurred active research on using it to detect hacking attacks. However, security data is a typical example of imbalanced data, which is recognized as a major obstacle to building the training data that model development depends on. This paper therefore proposes W-VAE oversampling, a technique that uses a VAE, a deep learning generative model, to generate the data for oversampling and uses K-NN-based weight calculation to set the number of samples to generate for each class. A total of five oversampling techniques, including ROS, SMOTE, and ADASYN, were applied to NSL-KDD, a public network security dataset. Measured by F1-score, the proposed technique was the most effective of the oversampling methods compared.
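
The abstract does not give the weight formula, so the following is only a rough sketch of how K-NN-based weights might set per-class oversampling counts. It assumes a class's weight is the average share of its samples' nearest neighbours that belong to other classes, and that a fixed budget of synthetic samples is split across minority classes in proportion to those weights. The names `knn_class_weights`, `allocate_oversampling`, and `budget`, the allocation rule, and the toy data are all hypothetical and are not taken from the paper. The VAE generation step itself is omitted.

```python
import math
from collections import Counter


def knn_class_weights(X, y, k=3):
    """Per-class weight: mean share of each sample's k nearest neighbours
    that belong to a different class (higher = harder to separate).
    This definition is an assumption, not the paper's formula."""
    per_class = {}
    for i, xi in enumerate(X):
        # Rank all other samples by Euclidean distance to sample i.
        ranked = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in ranked[:k]]
        foreign = sum(1 for j in neighbours if y[j] != y[i])
        per_class.setdefault(y[i], []).append(foreign / len(neighbours))
    return {c: sum(r) / len(r) for c, r in per_class.items()}


def allocate_oversampling(y, weights, budget):
    """Split a synthetic-sample budget across minority classes in
    proportion to their weights (the largest class receives none)."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    minority = [c for c in counts if c != majority]
    if not minority:
        return {}
    total = sum(weights[c] for c in minority)
    if total == 0:
        # No minority class overlaps its neighbours: use an even split.
        return {c: budget // len(minority) for c in minority}
    return {c: round(budget * weights[c] / total) for c in minority}


# Toy data (hypothetical, not NSL-KDD): "u2r" sits inside the "normal"
# cluster, while "dos" forms its own distant cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0),
     (0.5, 0.5), (1.5, 0.5),
     (10, 10), (10, 11), (11, 10)]
y = ["normal"] * 6 + ["u2r"] * 2 + ["dos"] * 3

weights = knn_class_weights(X, y, k=3)
plan = allocate_oversampling(y, weights, budget=8)
print(weights)
print(plan)
```

In this toy setup, the class that overlaps the majority cluster gets a larger share of the budget than the well-separated one. In a pipeline like the one the abstract describes, the resulting per-class counts would then tell a trained VAE how many synthetic samples to generate for each class.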
