Categorical Variable Selection in Naïve Bayes Classification


  • Kim, Min-Sun (Department of Statistics, University of Seoul) ;
  • Choi, Hosik (Department of Applied and Informational Statistics, Kyonggi University) ;
  • Park, Changyi (Department of Statistics, University of Seoul)
  • Received : 2015.01.20
  • Accepted : 2015.03.10
  • Published : 2015.06.30

Abstract

Naïve Bayes classification is based on the assumption that the input variables are conditionally independent given the output variable. The naïve Bayes assumption is unrealistic, but it simplifies the problem of high-dimensional joint probability estimation into a series of univariate probability estimations. Thus the naïve Bayes classifier is often adopted in the analysis of massive data sets, such as in spam e-mail filtering and recommendation systems. In this paper, we propose a variable selection method based on the χ² statistic between the input and output variables. The proposed method retains the simplicity of the naïve Bayes classifier in terms of data processing and computation, yet it can select relevant variables. We expect our method to be useful in classification problems for ultra-high dimensional or big data, such as the classification of diseases based on single nucleotide polymorphisms (SNPs).
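The screen-then-classify idea described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the Laplace smoothing constant `alpha`, the selection size `k`, and the toy data are all assumptions made for the example. Each variable is scored by its Pearson χ² statistic with the output, the top-scoring variables are kept, and a categorical naive Bayes classifier is fit using only univariate conditional probability tables.

```python
import numpy as np

def chi2_stat(x, y):
    """Pearson chi-square statistic between two categorical vectors."""
    xs, ys = np.unique(x), np.unique(y)
    obs = np.array([[np.sum((x == a) & (y == b)) for b in ys] for a in xs], dtype=float)
    # Expected counts under independence: outer product of the margins / total.
    exp = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    return float(((obs - exp) ** 2 / exp).sum())

def select_variables(X, y, k):
    """Rank input variables by their chi-square statistic with y; keep the top k."""
    scores = np.array([chi2_stat(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

def fit_nb(X, y, alpha=1.0):
    """Categorical naive Bayes: one smoothed univariate table per (variable, class)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    cond = {}
    for j in range(X.shape[1]):
        levels = np.unique(X[:, j])
        for c in classes:
            xc = X[y == c, j]
            cond[(j, c)] = {v: (np.sum(xc == v) + alpha) / (len(xc) + alpha * len(levels))
                            for v in levels}
    return classes, priors, cond

def predict_nb(model, X):
    """Predict by maximizing log prior + sum of univariate conditional log probabilities."""
    classes, priors, cond = model
    preds = []
    for row in X:
        logp = {c: np.log(priors[c]) +
                   sum(np.log(cond[(j, c)].get(v, 1e-9)) for j, v in enumerate(row))
                for c in classes}
        preds.append(max(logp, key=logp.get))
    return np.array(preds)

# Toy data (assumed for illustration): variable 0 agrees with y about 90% of
# the time, variable 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
x_rel = (y + (rng.random(200) < 0.1)) % 2
x_noise = rng.integers(0, 2, 200)
X = np.column_stack([x_rel, x_noise])

selected = select_variables(X, y, k=1)   # screening keeps the informative variable
model = fit_nb(X[:, selected], y)
accuracy = np.mean(predict_nb(model, X[:, selected]) == y)
```

Because both the χ² scores and the conditional tables are built one variable at a time, the screening step adds only a linear pass over the variables and preserves the univariate character of naive Bayes estimation that the abstract emphasizes.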


Keywords

References

  1. Chen, J. and Gupta, A. K. (2000). Parametric Statistical Change Point Analysis, Birkhäuser.
  2. Choi, B.-J., Kim, K.-R., Cho, K.-D., Park, C. and Koo, J.-Y. (2014). Variable selection for Naive Bayes Semisupervised learning, Communications in Statistics - Simulation and Computation, 43, 2702-2713. https://doi.org/10.1080/03610918.2012.762391
  3. Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space, Journal of the Royal Statistical Society: Series B, 70, 849-911. https://doi.org/10.1111/j.1467-9868.2008.00674.x
  4. Ha, J. H. and Park, C. (2009). Variable selection in linear discriminant analysis, Journal of the Korean Data Analysis Society, 11, 381-389.
  5. Hand, D. and Yu, K. (2001). Idiot's Bayes - not so stupid after all?, International Statistical Review, 69, 385-399.
  6. Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, (2nd Edition), Springer, New York.
  7. Jin, S. K., Kim, K.-R. and Park, C. (2012). Cutpoint Selection via penalization in credit scoring, The Korean Journal of Applied Statistics, 25, 261-267. https://doi.org/10.5351/KJAS.2012.25.2.261
  8. Killick, R., Fearnhead, P. and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost, Journal of the American Statistical Association, 107, 1590-1598. https://doi.org/10.1080/01621459.2012.737745
  9. Killick, R. and Eckley, I. A. (2014). changepoint: An R package for changepoint analysis, Journal of Statistical Software, 58(3), 1-19.
  10. Vidaurre, D., Bielza, C. and Larrañaga, P. (2012). Forward stagewise naive Bayes, Progress in Artificial Intelligence, 1, 57-69. https://doi.org/10.1007/s13748-011-0001-7
  11. Vidaurre, D., Bielza, C. and Larrañaga, P. (2013). An $L_1$-regularized naive Bayes-inspired classifier for discarding redundant and irrelevant predictors, International Journal on Artificial Intelligence Tools, 22, 1350019. https://doi.org/10.1142/S021821301350019X