Categorical Variable Selection in Naïve Bayes Classification


  • Kim, Min-Sun (Department of Statistics, University of Seoul) ;
  • Choi, Hosik (Department of Applied and Informational Statistics, Kyonggi University) ;
  • Park, Changyi (Department of Statistics, University of Seoul)
  • Received : 2015.01.20
  • Accepted : 2015.03.10
  • Published : 2015.06.30

Abstract

Naïve Bayes classification is based on the assumption that the input variables are conditionally independent given the output variable. The naïve Bayes assumption is unrealistic, but it simplifies the problem of high-dimensional joint probability estimation into a series of univariate probability estimations. Thus the naïve Bayes classifier is often adopted in the analysis of massive data sets, such as in spam e-mail filtering and recommendation systems. In this paper, we propose a variable selection method based on the χ² statistic between the input and output variables. The proposed method retains the simplicity of the naïve Bayes classifier in terms of data processing and computation, yet it can select relevant variables. We expect our method to be useful in classification problems for ultra-high dimensional or big data, such as the classification of diseases based on single nucleotide polymorphisms (SNPs).
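The screen-then-classify idea described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the Laplace smoothing constant `alpha`, the selection size `k`, and the toy data are all assumptions made for the example. Each variable is scored by its Pearson χ² statistic with the output, the top-scoring variables are kept, and a categorical naive Bayes classifier is fit using only univariate conditional probability tables.

```python
import numpy as np

def chi2_stat(x, y):
    """Pearson chi-square statistic between two categorical vectors."""
    xs, ys = np.unique(x), np.unique(y)
    obs = np.array([[np.sum((x == a) & (y == b)) for b in ys] for a in xs], dtype=float)
    # Expected counts under independence: outer product of the margins / total.
    exp = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    return float(((obs - exp) ** 2 / exp).sum())

def select_variables(X, y, k):
    """Rank input variables by their chi-square statistic with y; keep the top k."""
    scores = np.array([chi2_stat(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

def fit_nb(X, y, alpha=1.0):
    """Categorical naive Bayes: one smoothed univariate table per (variable, class)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    cond = {}
    for j in range(X.shape[1]):
        levels = np.unique(X[:, j])
        for c in classes:
            xc = X[y == c, j]
            cond[(j, c)] = {v: (np.sum(xc == v) + alpha) / (len(xc) + alpha * len(levels))
                            for v in levels}
    return classes, priors, cond

def predict_nb(model, X):
    """Predict by maximizing log prior + sum of univariate conditional log probabilities."""
    classes, priors, cond = model
    preds = []
    for row in X:
        logp = {c: np.log(priors[c]) +
                   sum(np.log(cond[(j, c)].get(v, 1e-9)) for j, v in enumerate(row))
                for c in classes}
        preds.append(max(logp, key=logp.get))
    return np.array(preds)

# Toy data (assumed for illustration): variable 0 agrees with y about 90% of
# the time, variable 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
x_rel = (y + (rng.random(200) < 0.1)) % 2
x_noise = rng.integers(0, 2, 200)
X = np.column_stack([x_rel, x_noise])

selected = select_variables(X, y, k=1)   # screening keeps the informative variable
model = fit_nb(X[:, selected], y)
accuracy = np.mean(predict_nb(model, X[:, selected]) == y)
```

Because both the χ² scores and the conditional tables are built one variable at a time, the screening step adds only a linear pass over the variables and preserves the univariate character of naive Bayes estimation that the abstract emphasizes.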


Keywords

References

  1. Chen, J. and Gupta, A. K. (2000). Parametric Statistical Change Point Analysis, Birkhäuser.
  2. Choi, B.-J., Kim, K.-R., Cho, K.-D., Park, C. and Koo, J.-Y. (2014). Variable selection for Naive Bayes Semisupervised learning, Communications in Statistics - Simulation and Computation, 43, 2702-2713. https://doi.org/10.1080/03610918.2012.762391
  3. Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space, Journal of the Royal Statistical Society: Series B, 70, 849-911. https://doi.org/10.1111/j.1467-9868.2008.00674.x
  4. Ha, J. H. and Park, C. (2009). Variable selection in linear discriminant analysis, Journal of the Korean Data Analysis Society, 11, 381-389.
  5. Hand, D. and Yu, K. (2001). Idiot's Bayes - not so stupid after all?, International Statistical Review, 69, 385-399.
  6. Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, (2nd Edition), Springer, New York.
  7. Jin, S. K., Kim, K.-R. and Park, C. (2012). Cutpoint Selection via penalization in credit scoring, The Korean Journal of Applied Statistics, 25, 261-267. https://doi.org/10.5351/KJAS.2012.25.2.261
  8. Killick, R., Fearnhead, P. and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost, Journal of the American Statistical Association, 107, 1590-1598. https://doi.org/10.1080/01621459.2012.737745
  9. Killick, R. and Eckley, I. A. (2014). changepoint: An R package for changepoint analysis, Journal of Statistical Software, 58(3), 1-19.
  10. Vidaurre, D., Bielza, C. and Larrañaga, P. (2012). Forward stagewise naive Bayes, Progress in Artificial Intelligence, 1, 57-69. https://doi.org/10.1007/s13748-011-0001-7
  11. Vidaurre, D., Bielza, C. and Larrañaga, P. (2013). An $L_1$-regularized naive Bayes-inspired classifier for discarding redundant and irrelevant predictors, International Journal on Artificial Intelligence Tools, 22, 1350019. https://doi.org/10.1142/S021821301350019X