A Study for Improving the Performance of Data Mining Using Ensemble Techniques

  • Received : 2010.03
  • Accepted : 2010.06
  • Published : 2010.07.31

Abstract

We compared a total of 24 data mining methods: 8 single algorithms (the decision trees CART, QUEST, and CRUISE, together with logistic regression, linear discriminant analysis, quadratic discriminant analysis, neural networks, and support vector machines) and the 16 combinations obtained by applying 2 ensemble techniques, bagging and boosting, to each single algorithm. For the comparison we used 13 data sets with binary responses, and evaluated the improvement in performance using sensitivity, specificity, and misclassification rate as criteria.
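The comparison protocol described above (fit each single classifier, then its bagged and boosted versions, and score the three criteria on held-out data) can be sketched in a few lines. The sketch below is a minimal illustration rather than the authors' code: it assumes scikit-learn's DecisionTreeClassifier, BaggingClassifier, and AdaBoostClassifier as stand-ins for the CART, bagging, and boosting implementations, and a built-in binary-response data set as a stand-in for the paper's 13 data sets.

    # Minimal sketch (not the authors' code): compare a single CART-style tree
    # with its bagged and boosted versions on one binary-response data set,
    # scoring sensitivity, specificity, and misclassification rate.
    from sklearn.datasets import load_breast_cancer      # placeholder data set
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
    from sklearn.metrics import confusion_matrix

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    models = {
        "single tree": DecisionTreeClassifier(random_state=0),
        "bagging": BaggingClassifier(DecisionTreeClassifier(),
                                     n_estimators=50, random_state=0),
        "boosting": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                       n_estimators=50, random_state=0),
    }

    for name, model in models.items():
        model.fit(X_tr, y_tr)
        # Confusion matrix of held-out predictions gives all three criteria.
        tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
        print(f"{name:12s} sensitivity={tp / (tp + fn):.3f} "
              f"specificity={tn / (tn + fp):.3f} "
              f"error={(fp + fn) / (tn + fp + fn + tp):.3f}")

Swapping the base learner for any of the other seven algorithms, and repeating over all 13 data sets, would reproduce the structure of the full 24-method comparison.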


