On sampling algorithms for imbalanced binary data: performance comparison and some caveats

Kim, HanYong; Lee, Woojoo

doi:10.5351/KJAS.2017.30.5.681

The Korean Journal of Applied Statistics (응용통계연구)

Volume 30, Issue 5 / Pages 681-690 / 2017 / 1225-066X (pISSN) / 2383-5818 (eISSN)

The Korean Statistical Society (한국통계학회)



불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점

Kim, HanYong (Department of Statistics, Inha University) ;
Lee, Woojoo (Department of Statistics, Inha University)

김한용 (인하대학교 통계학과) ;
이우주 (인하대학교 통계학과)

Received : 2017.07.17
Accepted : 2017.09.12
Published : 2017.10.31


Abstract

Imbalanced binary classification problems arise in many settings, such as fraud detection in banking operations, spam mail detection, and prediction of defective products. It is well known that binary classifiers predict poorly when the class proportions of the response are highly imbalanced. To overcome this problem, several sampling methods such as over-sampling, under-sampling, and SMOTE have been developed. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machine in combination with these sampling methods for imbalanced binary data. Four real data sets are analyzed to see whether there is a substantial improvement in prediction performance. We also emphasize some precautions to take when the sampling methods are implemented.

파산감지, 스팸메일 감지, 불량품 감지 등 일상생활에서 불균형적인 이항 분류 문제를 다양하게 접할 수 있다. 반응변수의 클래스의 비율이 상당히 불균형한 경우 이항 분류 모형의 예측 성능이 좋지 않다는 점은 이미 잘 알려진 사실이다. 이러한 문제점을 해결하기 위해 그 동안 오버 샘플링, 언더 샘플링, SMOTE와 같은 여러 샘플링 기법이 개발되어 왔다. 본 연구에서는 분류 모형으로 많이 사용되는 기계학습모형으로 로지스틱 회귀모형, Lasso, 랜덤포레스트, 부스팅, 서포트 벡터 머신을 위의 샘플링 기법들과 결합하여 사용했을 때의 예측 성능을 살펴보았다. 실질적인 예측 성능의 개선 여부를 확인하기 위해 네 개의 실제 자료를 분석하였다. 이와 더불어, 샘플링 방법이 사용될 때 주의해야 할 점에 대해서 강조하였다.
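As a rough illustration of the two simplest sampling methods named in the abstract, the sketch below shows random over-sampling and random under-sampling in plain Python. It is not the authors' code: the function names, the toy labels, and the 95:5 class ratio are hypothetical, and SMOTE is not shown. The sketch also assumes that resampling is applied to the training labels only, which is a common precaution and not necessarily the one the paper emphasizes.

```python
import random
from collections import Counter


def split_by_class(labels):
    """Return (minority_indices, majority_indices) for binary labels."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (major, _), (minor, _) = counts.most_common(2)
    minority = [i for i, y in enumerate(labels) if y == minor]
    majority = [i for i, y in enumerate(labels) if y == major]
    return minority, majority


def random_over_sample(labels, rng):
    """Row indices after duplicating minority rows up to the majority count."""
    minority, majority = split_by_class(labels)
    # Draw minority rows with replacement until both classes are the same size.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def random_under_sample(labels, rng):
    """Row indices after discarding majority rows down to the minority count."""
    minority, majority = split_by_class(labels)
    # Keep a random subset of majority rows, drawn without replacement.
    kept = rng.sample(majority, len(minority))
    return kept + minority


# Hypothetical training labels: 95 majority (0) and 5 minority (1) cases.
rng = random.Random(0)
train_labels = [0] * 95 + [1] * 5

over_idx = random_over_sample(train_labels, rng)
under_idx = random_under_sample(train_labels, rng)

over_counts = Counter(train_labels[i] for i in over_idx)
under_counts = Counter(train_labels[i] for i in under_idx)
print(dict(over_counts), dict(under_counts))
```

The functions return row indices and leave the data untouched, so the same indices can select both features and labels before fitting any of the classifiers listed in the abstract.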


