Comparison of Data Mining Classification Algorithms for Categorical Feature Variables

Korean title (translated): Performance Comparison of Data Mining Classification Techniques for Categorical Data

  • 손소영 (Department of Industrial Systems Engineering, Yonsei University)
  • 신형원 (Department of Industrial Systems Engineering, Yonsei University)
  • Published : 1999.12.31

Abstract

In this paper, we compare the performance of three data mining classification algorithms (neural network, decision tree, and logistic regression) under various characteristics of categorical input and output data. A $2^{4-1} \times 3$ fractional factorial design is used to simulate the comparison, with the following factors: (1) the category ratio of the input variables, (2) the complexity of the functional relationship between the output and input variables, (3) the amount of randomness in that relationship, (4) the category ratio of the output variable, and (5) the classification algorithm. The experimental results indicate that the decision tree performs better than the other two algorithms when the relationship between the output and input variables is simple, whereas logistic regression performs better when the relationship is complex, and that the neural network appears to be the better choice when the randomness in the relationship is relatively large. We also use a Taguchi design to improve the practicality of our results by treating the relationship between the output and input variables as a noise factor. Under this design, the classification accuracy of the neural network and the decision tree turns out to be higher than that of logistic regression when the categories of the output variable are evenly proportioned.
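
As an illustration of the experimental layout, the sketch below enumerates the runs of a $2^{4-1} \times 3$ design: a half fraction of the four two-level factors crossed with the three algorithms. The abstract does not state the defining relation, so the generator D = ABC used here is an assumption, and the factor labels are hypothetical names chosen only for readability.

```python
from itertools import product

# Hypothetical labels for the four two-level factors named in the abstract.
FACTORS = ["input_ratio", "complexity", "randomness", "output_ratio"]
ALGORITHMS = ["neural_network", "decision_tree", "logistic_regression"]


def half_fraction():
    """Return the 8 runs of a 2^(4-1) design, levels coded as -1 / +1."""
    runs = []
    for a, b, c in product((-1, 1), repeat=3):
        d = a * b * c  # assumed generator D = ABC (not given in the abstract)
        runs.append((a, b, c, d))
    return runs


def full_design():
    """Cross the half fraction with the three-level algorithm factor."""
    return [
        dict(zip(FACTORS, run), algorithm=alg)
        for run in half_fraction()
        for alg in ALGORITHMS
    ]


design = full_design()
print(len(design))  # 8 factor settings x 3 algorithms
```

With this generator the design has 24 treatment combinations instead of the 48 a full factorial would need, at the cost of aliasing each main effect with a three-factor interaction.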

Acknowledgement

Supported by: Korea Science and Engineering Foundation (한국과학재단)