A comparative study of filter methods based on information entropy

  • Kim, Jung-Tae (Department of Data Information, Korea Maritime and Ocean University);
  • Kum, Ho-Yeun (Department of Data Information, Korea Maritime and Ocean University);
  • Kim, Jae-Hwan (Department of Data Information, Korea Maritime and Ocean University)
  • Received : 2016.05.19
  • Accepted : 2016.06.14
  • Published : 2016.06.30

Abstract

Feature selection has become an essential technique for reducing the dimensionality of data sets, in which many features are frequently irrelevant or redundant for the classification task. The purpose of feature selection is to retain the relevant features and to remove the irrelevant and redundant ones. Applications of feature selection range from text processing, face recognition, bioinformatics, and speaker verification to medical diagnosis and financial domains. In this study, we focus on filter methods based on information entropy: IG (Information Gain), FCBF (Fast Correlation-Based Filter), and mRMR (minimum Redundancy Maximum Relevance). FCBF has the advantage of a reduced computational burden, since it eliminates redundant features that satisfy the condition of an approximate Markov blanket. However, FCBF considers only the relevance between each feature and the class when selecting the best features, and thus fails to take the interaction between features into account. In this paper, we propose an improved FCBF that overcomes this shortcoming, and we perform a comparative study to evaluate the performance of the proposed method.
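To make the entropy-based criteria concrete, the following is a minimal Python sketch of information gain, symmetrical uncertainty (SU), and standard FCBF-style redundancy elimination (Yu and Liu, reference 23 below) on discrete features. It illustrates only the baseline FCBF procedure, not the improved method proposed in the paper, and all function and variable names are illustrative assumptions.

```python
# Minimal sketch of entropy-based filter scoring and FCBF-style
# redundancy elimination for discrete features. Illustrative only:
# names and structure are not taken from the paper.
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) of a sequence of discrete values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(x, y):
    """IG(Y; X) = H(Y) - H(Y | X) for two discrete sequences."""
    n = len(x)
    h_y_given_x = sum(
        (cnt / n) * entropy([yi for xi, yi in zip(x, y) if xi == v])
        for v, cnt in Counter(x).items())
    return entropy(y) - h_y_given_x

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(Y; X) / (H(X) + H(Y)), normalized to [0, 1]."""
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(x, y) / denom if denom > 0 else 0.0

def fcbf(features, labels, delta=0.0):
    """Rank features by SU with the class, then drop any feature F_j for
    which an already-kept F_i satisfies SU(F_i, F_j) >= SU(F_j, class),
    i.e. F_i forms an approximate Markov blanket for F_j."""
    relevance = {f: symmetrical_uncertainty(col, labels)
                 for f, col in features.items()}
    ranked = [f for f in sorted(relevance, key=relevance.get, reverse=True)
              if relevance[f] > delta]
    selected = []
    for f in ranked:
        if all(symmetrical_uncertainty(features[g], features[f]) < relevance[f]
               for g in selected):
            selected.append(f)
    return selected

# Toy example: f1 predicts the class perfectly, f1_copy duplicates it,
# and f2 is irrelevant, so FCBF keeps only f1.
features = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1], "f1_copy": [0, 0, 1, 1]}
labels = [0, 0, 1, 1]
print(fcbf(features, labels))  # -> ['f1']
```

On this toy data, f1 and f1_copy both have SU = 1 with the class, but SU(f1, f1_copy) >= SU(f1_copy, class), so f1 forms an approximate Markov blanket for f1_copy and the copy is eliminated; this is the redundancy-removal step described in the abstract.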

References

  1. M. Hall, "Correlation-based feature selection for machine learning," PhD thesis, The University of Waikato, 1999.
  2. Z. Zhao and H. Liu, "Searching for interacting features," Proceedings of the International Joint Conference on Artificial Intelligence, vol. 7, pp. 1156-1161, 2007.
  3. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389-422, 2002. https://doi.org/10.1023/A:1012487302797
  4. S. Maldonado, R. Weber, and J. Basak, "Simultaneous feature selection and classification using kernel-penalized support vector machines," Information Sciences, vol. 181, no. 1, pp. 115-128, 2011. https://doi.org/10.1016/j.ins.2010.08.047
  5. J. G. Bae, J. T. Kim, and J. H. Kim, "Subset selection in multiple linear regression: an improved tabu search," Journal of Korean Society of Marine Engineering, vol. 40, no. 2, pp. 138-145, 2016. https://doi.org/10.5916/jkosme.2016.40.2.138
  6. I. Inza, B. Sierra, R. Blanco, and P. Larranaga, "Gene selection by sequential search wrapper approaches in microarray cancer class prediction," Journal of Intelligent and Fuzzy Systems, vol. 12, no. 1, pp. 25-33, 2002.
  7. R. Ruiz, J. Riquelme, and J. Aguilar-Ruiz, "Incremental wrapper-based gene selection from microarray data for cancer classification," Pattern Recognition, vol. 39, no. 12, pp. 2383-2392, 2006. https://doi.org/10.1016/j.patcog.2005.11.001
  8. S. Shreem, S. Abdullah, M. Nazri, and M. Alzaqebah, "Hybridizing ReliefF, mRMR filters and GA wrapper approaches for gene selection," Journal of Theoretical and Applied Information Technology, vol. 46, no. 2, pp. 1034-1039, 2012.
  9. L. Chuang, C. Yang, K. Wu, and C. Yang, "A hybrid feature selection method for DNA microarray data," Computers in Biology and Medicine, vol. 41, no. 4, pp. 228-237, 2011. https://doi.org/10.1016/j.compbiomed.2011.02.004
  10. W. Aiguo, A. Ning, C. Guilin, and L. Lian, "Hybridizing mRMR and harmony search for gene selection and classification of microarray data," Journal of Computational Information Systems, vol. 11, no. 5, pp. 1563-1570, 2015.
  11. C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
  12. J. Demsar, B. Zupan, M. W. Kattan, J. R. Beck, and I. Bratko, "Naive Bayesian-based nomogram for prediction of prostate cancer recurrence," Studies in Health Technology and Informatics, vol. 68, pp. 436-441, 1999.
  13. H. Sun, "A naive Bayes classifier for prediction of multidrug resistance reversal activity on the basis of atom typing," Journal of Medicinal Chemistry, vol. 48, no. 12, pp. 4031-4039, 2005. https://doi.org/10.1021/jm050180t
  14. T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967. https://doi.org/10.1109/TIT.1967.1053964
  15. J. N. Morgan and J. A. Sonquist, "Problems in the analysis of survey data, and a proposal," Journal of the American Statistical Association, vol. 58, no. 302, pp. 415-434, 1963. https://doi.org/10.1080/01621459.1963.10500855
  16. J. A. Hartigan, Clustering Algorithms, Wiley, New York, 1975.
  17. L. E. Raileanu and K. Stoffel, "Theoretical comparison between the Gini Index and information gain criteria," Annals of Mathematics and Artificial Intelligence, vol. 41, no. 1, pp. 77-93, 2004. https://doi.org/10.1023/B:AMAI.0000018580.96245.c6
  18. M. Hall and L. Smith, "Practical feature subset selection for machine learning," Proceedings of the 21st Australasian Computer Science Conference, pp. 181-191, 1998.
  19. J. Yang, Y. Liu, Z. Liu, X. Zhu, and X. Zhang, "A new feature selection algorithm based on binomial hypothesis testing for spam filtering," Knowledge-Based Systems, vol. 24, no. 6, pp. 904-914, 2011. https://doi.org/10.1016/j.knosys.2011.04.006
  20. Q. Gu, Z. Li, and J. Han, "Generalized fisher score for feature selection," Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2011.
  21. X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," Advances in Neural Information Processing Systems, pp. 507-514, 2005.
  22. K. Kira and L. Rendell, "The feature selection problem: traditional methods and a new algorithm," Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, San Jose, CA, vol. 2, pp. 129-134, 1992.
  23. L. Yu and H. Liu, "Feature selection for high-dimensional data: a fast correlation-based filter solution," Proceedings of the Twentieth International Conference on Machine Learning, vol. 3, pp. 856-863, 2003.
  24. H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005. https://doi.org/10.1109/TPAMI.2005.159
  25. J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986. https://doi.org/10.1007/BF00116251
  26. C. Ambroise and G. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proceedings of the National Academy of Sciences, vol. 99, no. 10, pp. 6562-6566, 2002. https://doi.org/10.1073/pnas.102102699
  27. A. A. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, vol. 403, no. 6769, pp. 503-511, 2000. https://doi.org/10.1038/35000501
  28. U. Scherf et al., "A cDNA microarray gene expression database for the molecular pharmacology of cancer," Nature Genetics, vol. 24, no. 3, pp. 236-244, 2000. https://doi.org/10.1038/73439
  29. L. J. van't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530-536, 2002. https://doi.org/10.1038/415530a

Cited by

  1. Performance evaluation of principal component analysis for clustering problems, vol. 40, no. 8, 2016, https://doi.org/10.5916/jkosme.2016.40.8.726