A comparative study of filter methods based on information entropy

  • Kim, Jung-Tae (Department of Data Information, Korea Maritime and Ocean University);
  • Kum, Ho-Yeun (Department of Data Information, Korea Maritime and Ocean University);
  • Kim, Jae-Hwan (Department of Data Information, Korea Maritime and Ocean University)
  • Received : 2016.05.19
  • Accepted : 2016.06.14
  • Published : 2016.06.30

Abstract

Feature selection has become an essential technique for reducing the dimensionality of data sets, in which many features are frequently irrelevant or redundant for the classification task. The purpose of feature selection is to retain the relevant features and to remove the irrelevant and redundant ones. Applications of feature selection range from text processing, face recognition, bioinformatics, and speaker verification to medical diagnosis and financial domains. In this study, we focus on filter methods based on information entropy: IG (Information Gain), FCBF (Fast Correlation-Based Filter), and mRMR (minimum Redundancy Maximum Relevance). FCBF has the advantage of a reduced computational burden, since it eliminates redundant features that satisfy the condition of an approximate Markov blanket. However, FCBF considers only the relevance between each feature and the class when selecting the best features, and thus fails to take the interaction between features into account. In this paper, we propose an improved FCBF that overcomes this shortcoming, and we perform a comparative study to evaluate the performance of the proposed method.
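To make the entropy-based criteria concrete, the following is a minimal Python sketch of information gain, symmetrical uncertainty (SU), and standard FCBF-style redundancy elimination (Yu and Liu, reference 23 below) on discrete features. It illustrates only the baseline FCBF procedure, not the improved method proposed in the paper, and all function and variable names are illustrative assumptions.

```python
# Minimal sketch of entropy-based filter scoring and FCBF-style
# redundancy elimination for discrete features. Illustrative only:
# names and structure are not taken from the paper.
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) of a sequence of discrete values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(x, y):
    """IG(Y; X) = H(Y) - H(Y | X) for two discrete sequences."""
    n = len(x)
    h_y_given_x = sum(
        (cnt / n) * entropy([yi for xi, yi in zip(x, y) if xi == v])
        for v, cnt in Counter(x).items())
    return entropy(y) - h_y_given_x

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(Y; X) / (H(X) + H(Y)), normalized to [0, 1]."""
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(x, y) / denom if denom > 0 else 0.0

def fcbf(features, labels, delta=0.0):
    """Rank features by SU with the class, then drop any feature F_j for
    which an already-kept F_i satisfies SU(F_i, F_j) >= SU(F_j, class),
    i.e. F_i forms an approximate Markov blanket for F_j."""
    relevance = {f: symmetrical_uncertainty(col, labels)
                 for f, col in features.items()}
    ranked = [f for f in sorted(relevance, key=relevance.get, reverse=True)
              if relevance[f] > delta]
    selected = []
    for f in ranked:
        if all(symmetrical_uncertainty(features[g], features[f]) < relevance[f]
               for g in selected):
            selected.append(f)
    return selected

# Toy example: f1 predicts the class perfectly, f1_copy duplicates it,
# and f2 is irrelevant, so FCBF keeps only f1.
features = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1], "f1_copy": [0, 0, 1, 1]}
labels = [0, 0, 1, 1]
print(fcbf(features, labels))  # -> ['f1']
```

On this toy data, f1 and f1_copy both have SU = 1 with the class, but SU(f1, f1_copy) >= SU(f1_copy, class), so f1 forms an approximate Markov blanket for f1_copy and the copy is eliminated; this is the redundancy-removal step described in the abstract.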

References

  1. M. Hall, "Correlation-based feature selection for machine learning," PhD thesis, The University of Waikato, 1999.
  2. Z. Zhao and H. Liu, "Searching for interacting features," Proceedings of the International Joint Conference on Artificial Intelligence, vol. 7, pp. 1156-1161, 2007.
  3. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389-422, 2002. https://doi.org/10.1023/A:1012487302797
  4. S. Maldonado, R. Weber, and J. Basak, "Simultaneous feature selection and classification using kernel-penalized support vector machines," Information Sciences, vol. 181, no. 1, pp. 115-128, 2011. https://doi.org/10.1016/j.ins.2010.08.047
  5. J. G. Bae, J. T. Kim, and J. H. Kim, "Subset selection in multiple linear regression: an improved tabu search," Journal of Korean Society of Marine Engineering, vol. 40, no. 2, pp. 138-145, 2016. https://doi.org/10.5916/jkosme.2016.40.2.138
  6. I. Inza, B. Sierra, R. Blanco, and P. Larranaga, "Gene selection by sequential search wrapper approaches in microarray cancer class prediction," Journal of Intelligent and Fuzzy Systems, vol. 12, no. 1, pp. 25-33, 2002.
  7. R. Ruiz, J. Riquelme, and J. Aguilar-Ruiz, "Incremental wrapper-based gene selection from microarray data for cancer classification," Pattern Recognition, vol. 39, no. 12, pp. 2383-2392, 2006. https://doi.org/10.1016/j.patcog.2005.11.001
  8. S. Shreem, S. Abdullah, M. Nazri, and M. Alzaqebah, "Hybridizing ReliefF, mRMR filters and GA wrapper approaches for gene selection," Journal of Theoretical and Applied Information Technology, vol. 46, no. 2, pp. 1034-1039, 2012.
  9. L. Chuang, C. Yang, K. Wu, and C. Yang, "A hybrid feature selection method for DNA microarray data," Computers in Biology and Medicine, vol. 41, no. 4, pp. 228-237, 2011. https://doi.org/10.1016/j.compbiomed.2011.02.004
  10. W. Aiguo, A. Ning, C. Guilin, and L. Lian, "Hybridizing mRMR and harmony search for gene selection and classification of microarray data," Journal of Computational Information Systems, vol. 11, no. 5, pp. 1563-1570, 2015.
  11. C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
  12. J. Demsar, B. Zupan, M. W. Kattan, J. R. Beck, and I. Bratko, "Naive Bayesian-based nomogram for prediction of prostate cancer recurrence," Studies in Health Technology and Informatics, vol. 68, pp. 436-441, 1999.
  13. H. Sun, "A naive Bayes classifier for prediction of multidrug resistance reversal activity on the basis of atom typing," Journal of Medicinal Chemistry, vol. 48, no. 12, pp. 4031-4039, 2005. https://doi.org/10.1021/jm050180t
  14. T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967. https://doi.org/10.1109/TIT.1967.1053964
  15. J. N. Morgan and J. A. Sonquist, "Problems in the analysis of survey data, and a proposal," Journal of the American Statistical Association, vol. 58, no. 302, pp. 415-434, 1963. https://doi.org/10.1080/01621459.1963.10500855
  16. J. A. Hartigan, Clustering Algorithms, Wiley, New York, 1975.
  17. L. E. Raileanu and K. Stoffel, "Theoretical comparison between the Gini Index and information gain criteria," Annals of Mathematics and Artificial Intelligence, vol. 41, no. 1, pp. 77-93, 2004. https://doi.org/10.1023/B:AMAI.0000018580.96245.c6
  18. M. Hall and L. Smith, "Practical feature subset selection for machine learning," Proceedings of the 21st Australasian Computer Science Conference, pp. 181-191, 1998.
  19. J. Yang, Y. Liu, Z. Liu, X. Zhu, and X. Zhang, "A new feature selection algorithm based on binomial hypothesis testing for spam filtering," Knowledge-Based Systems, vol. 24, no. 6, pp. 904-914, 2011. https://doi.org/10.1016/j.knosys.2011.04.006
  20. Q. Gu, Z. Li, and J. Han, "Generalized fisher score for feature selection," Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2011.
  21. X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," Advances in Neural Information Processing Systems, pp. 507-514, 2005.
  22. K. Kira and L. Rendell, "The feature selection problem: traditional methods and a new algorithm," Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, San Jose, CA, vol. 2, pp. 129-134, 1992.
  23. L. Yu and H. Liu, "Feature selection for high-dimensional data: a fast correlation-based filter solution," Proceedings of the Twentieth International Conference on Machine Learning, vol. 3, pp. 856-863, 2003.
  24. H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005. https://doi.org/10.1109/TPAMI.2005.159
  25. J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986. https://doi.org/10.1007/BF00116251
  26. C. Ambroise and G. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proceedings of the National Academy of Sciences, vol. 99, no. 10, pp. 6562-6566, 2002. https://doi.org/10.1073/pnas.102102699
  27. A. A. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, vol. 403, no. 6769, pp. 503-511, 2000. https://doi.org/10.1038/35000501
  28. U. Scherf et al., "A cDNA microarray gene expression database for the molecular pharmacology of cancer," Nature Genetics, vol. 24, no. 3, pp. 236-244, 2000. https://doi.org/10.1038/73439
  29. L. J. van't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530-536, 2002. https://doi.org/10.1038/415530a

Cited by

  1. Performance evaluation of principal component analysis for clustering problems, vol. 40, no. 8, 2016, https://doi.org/10.5916/jkosme.2016.40.8.726