Ensemble Learning for Solving Data Imbalance in Bankruptcy Prediction

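Ensemble Learning for Resolving the Data Imbalance Problem in Corporate Bankruptcy Prediction Data (translated from the Korean title)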

  • Received : 2009.05.21
  • Accepted : 2009.07.01
  • Published : 2009.09.30

Abstract

In a classification problem, data imbalance occurs when the instances of one class greatly outnumber those of the other class. Such data sets often skew the decision boundary, so the learned classifier degenerates into a default classifier that favors the majority class, and its classification accuracy falls. This paper proposes Geometric Mean-based Boosting (GM-Boost) to resolve the problem of data imbalance. Because GM-Boost is built on the geometric mean, it can learn while taking both the majority and minority classes into account, and it can reinforce learning on misclassified data. An empirical study of bankruptcy prediction for Korean companies shows that GM-Boost achieves higher classification accuracy than previous methods used for imbalanced data, including Under-sampling, Over-sampling, and AdaBoost, and that its learning performance is robust regardless of the degree of data imbalance.

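The data imbalance problem arises in classification and prediction problems when the number of samples belonging to one class is markedly smaller than the number belonging to the other classes. As the imbalance grows more severe, the classification boundary between classes becomes distorted and, as a result, the learning performance of the classifier degrades. To resolve the data imbalance problem, this study proposes the Geometric Mean-based Boosting (GM-Boost) algorithm. Because GM-Boost is based on the concept of the geometric mean, it can learn while considering the majority and minority classes simultaneously and can strengthen learning by concentrating on misclassified samples. When the performance of GM-Boost was validated on a corporate bankruptcy prediction problem, it showed better classification accuracy than the existing Under-Sampling, Over-Sampling, and AdaBoost algorithms and exhibited robust learning performance regardless of the degree of data imbalance.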
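
The abstract does not spell out the GM-Boost update rule, so the following is only a minimal sketch of the underlying idea: why a geometric mean of the per-class accuracies accounts for both the majority and minority classes, whereas plain accuracy does not. The function names (`accuracy`, `g_mean`), the 95:5 class ratio, and the toy predictions are assumptions made for illustration and are not taken from the paper.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical data: 95 healthy firms (label 0) and 5 bankrupt firms (label 1).
y_true = [0] * 95 + [1] * 5

# A "default classifier" that always predicts the majority class.
default_pred = [0] * 100

# A classifier that misses 10 healthy firms but catches 4 of 5 bankruptcies.
balanced_pred = [0] * 85 + [1] * 10 + [1] * 4 + [0] * 1

acc_default = accuracy(y_true, default_pred)
gm_default = g_mean(y_true, default_pred)
acc_balanced = accuracy(y_true, balanced_pred)
gm_balanced = g_mean(y_true, balanced_pred)

print(f"default : accuracy={acc_default:.3f}, g-mean={gm_default:.3f}")
print(f"balanced: accuracy={acc_balanced:.3f}, g-mean={gm_balanced:.3f}")
```

In this toy setting the default classifier scores higher on plain accuracy even though it never detects a bankruptcy, while its geometric mean is zero. A boosting procedure guided by the geometric mean would therefore have no incentive to ignore the minority class, which is the intuition the abstract describes.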
