Handling Method of Imbalance Data for Machine Learning : Focused on Sampling

머신러닝을 위한 불균형 데이터 처리 방법 : 샘플링을 위주로

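  • 이규남 (Department of Big Data, Chungbuk National University) ;
  • 임종태 (Department of Information and Communication Engineering, Chungbuk National University) ;
  • 복경수 (Department of SW Convergence, Wonkwang University) ;
  • 유재수 (Department of Information and Communication Engineering, Chungbuk National University)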
  • Received : 2019.08.23
  • Accepted : 2019.10.11
  • Published : 2019.11.28

Abstract

Attempts to solve problems encountered in academia and industry with machine learning have been increasing. Accordingly, a growing body of research applies machine learning to atypical situations such as churn, fraud detection, and fault detection. In most such atypical settings the data are distributed unevenly across classes, and this imbalance causes errors during machine learning, so a technique for handling imbalanced data is needed. In this paper, we propose a method for handling imbalanced data for machine learning. Centered on sampling, the proposed method addresses the imbalance problem by verifying that the population distribution of the majority class (major class) is extracted effectively. Performance evaluations show that the proposed method outperforms existing methods.
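The abstract does not spell out how the majority-class distribution is verified. As a rough illustration only, and not the authors' procedure, the sketch below randomly undersamples a majority class and uses a two-sample Kolmogorov-Smirnov statistic to check that the kept sample still follows the full majority-class distribution. The function names, the single one-dimensional feature, the `max_gap` threshold, and the redraw loop are all assumptions made for this example.

```python
import random
from bisect import bisect_right


def undersample_majority(majority, n_keep, seed=0):
    """Randomly keep n_keep majority-class values (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


def ks_statistic(population, sample):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    pop, smp = sorted(population), sorted(sample)
    gap = 0.0
    for x in pop + smp:  # the empirical CDFs only change at observed values
        f_pop = bisect_right(pop, x) / len(pop)
        f_smp = bisect_right(smp, x) / len(smp)
        gap = max(gap, abs(f_pop - f_smp))
    return gap


def representative_undersample(majority, n_keep, max_gap=0.1, tries=20):
    """Redraw the undersample until its CDF is within max_gap of the full
    majority class; return the best (sample, gap) pair seen.

    max_gap and tries are illustrative values, not taken from the paper.
    """
    best = None
    for seed in range(tries):
        candidate = undersample_majority(majority, n_keep, seed)
        gap = ks_statistic(majority, candidate)
        if best is None or gap < best[1]:
            best = (candidate, gap)
        if gap <= max_gap:
            break
    return best


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical one-dimensional feature with a 50:1 class imbalance.
    majority = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    minority = [rng.gauss(2.0, 1.0) for _ in range(100)]
    kept, gap = representative_undersample(majority, n_keep=len(minority))
    print(len(kept), len(minority), round(gap, 3))
```

A smaller KS gap means the undersample tracks the majority-class population more closely. With multi-dimensional data the check would have to be applied per feature or replaced by a multivariate test, which this sketch does not attempt.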
