A Study for Improving the Performance of Data Mining Using Ensemble Techniques

  • Received : 2010.03
  • Accepted : 2010.06
  • Published : 2010.07.31

Abstract

We compared the performance of eight data mining algorithms (CART, QUEST, CRUISE, logistic regression, linear discriminant analysis, quadratic discriminant analysis, neural networks, and support vector machines) and the 16 methods obtained by applying two ensemble techniques, bagging and boosting, to each single algorithm, for a total of 24 methods. The comparison used 13 data sets with binary responses; sensitivity, specificity, and misclassification rate served as the criteria for evaluating performance.
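For reference, the three comparison criteria named in the abstract are standard functions of the binary confusion matrix, where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives:

$$\text{sensitivity} = \frac{TP}{TP+FN}, \qquad \text{specificity} = \frac{TN}{TN+FP}, \qquad \text{misclassification rate} = \frac{FP+FN}{TP+TN+FP+FN}.$$

The sketch below illustrates, in outline only, the kind of comparison the abstract describes: a single base classifier against its bagged and boosted versions, scored on the three criteria above. It is not the paper's code; the data set, tree settings, and ensemble sizes are illustrative assumptions (scikit-learn's bundled breast-cancer data, a UCI set, stands in for one of the 13 binary-response data sets).

```python
# A minimal sketch, assuming scikit-learn: compare a single decision tree
# with its bagged and boosted versions on one binary-response data set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # binary response (benign/malignant)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 random_state=0),
    "boosting": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=50, random_state=0),
}

for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    # Recover the confusion-matrix cells and compute the three criteria.
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
    print(f"{name:12s} sensitivity={tp / (tp + fn):.3f} "
          f"specificity={tn / (tn + fp):.3f} "
          f"error={(fp + fn) / len(y_te):.3f}")
```

Whether, and by how much, the two ensembles improve on each single algorithm is exactly what the paper measures across its 13 data sets.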

Keywords

References

  1. 김규곤 (2003). A study on classification methods in data mining, Journal of the Korean Data Analysis Society, 5, 101-112.
  2. 김기영, 전명식 (1994). Multivariate Statistical Data Analysis, 자유아카데미, Seoul.
  3. 이영섭, 오현정, 김미경 (2005). A comparative analysis of bagging, boosting and SVM classification algorithms in data mining, The Korean Journal of Applied Statistics, 18, 343-354. https://doi.org/10.5351/KJAS.2005.18.2.343
  4. 허면회, 서혜선 (2001). , 자유아카데미, Seoul.
  5. Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, Boosting and variants, Machine Learning, 36, 105-139. https://doi.org/10.1023/A:1007515423169
  6. Breiman, L. (1996). Bagging predictors, Machine Learning, 24, 123-140.
  7. Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees, Chapman & Hall, New York.
  8. Clemen, R. (1989). Combining forecasts: A review and annotated bibliography, International Journal of Forecasting, 5, 559-583. https://doi.org/10.1016/0169-2070(89)90012-5
  9. Drucker, H. and Cortes, C. (1996). Boosting decision trees, Advances in Neural Information Processing Systems, 8, 470-485.
  10. Drucker, H., Schapire, R. and Simard, P. (1993). Boosting performance in neural networks, International Journal of Pattern Recognition and Artificial Intelligence, 7, 705-719. https://doi.org/10.1142/S0218001493000352
  11. Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap, Chapman & Hall, New York.
  12. Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  13. Freund, Y. (1995). Boosting a weak learning algorithm by majority, Information and Computation, 121, 256-285. https://doi.org/10.1006/inco.1995.1136
  14. Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm, Proceedings of the Thirteenth International Conference on Machine Learning, 148-156.
  15. Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data, Journal of the Royal Statistical Society: Series C (Applied Statistics), 29, 119-127.
  16. Kearns, M. and Valiant, L. G. (1994). Cryptographic limitations on learning Boolean formulae and finite automata, Journal of the Association for Computing Machinery, 41, 67-95. https://doi.org/10.1145/174644.174647
  17. Kim, H. J. and Loh, W. Y. (2001). Classification trees with unbiased multiway splits, Journal of the American Statistical Association, 96, 589-604.
  18. Loh, W. Y. and Shih, Y. S. (1997). Split selection method for classification trees, Statistica Sinica, 7, 815-840.
  19. Opitz, D. and Maclin, R. (1999). Popular ensemble methods: An empirical study, Journal of Artificial Intelligence Research, 11, 169-198.
  20. Perrone, M. (1993). Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization, Doctoral dissertation, Department of Physics, Brown University.
  21. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo.
  22. Quinlan, J. R. (1996). Bagging, boosting, and C4.5, Proceedings of the Thirteenth National Conference on Artificial Intelligence, 725-730.
  23. Schapire, R. E. (1990). The strength of weak learnability, Machine Learning, 5, 197-227.
  24. Schapire, R. E. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions, Machine Learning, 37, 297-336. https://doi.org/10.1023/A:1007614523901
  25. Valiant, L. G. (1984). A theory of the learnable, Communications of the ACM, 27, 1134-1142. https://doi.org/10.1145/1968.1972
  26. Vapnik, V. (1979). Estimation of Dependences Based on Empirical Data, Nauka, Moscow.
  27. Wolpert, D. (1992). Stacked generalization, Neural Networks, 5, 241-259. https://doi.org/10.1016/S0893-6080(05)80023-1
