New Feature Selection Method for Text Categorization

  • Wang, Xingfeng (Information Engineering College, Eastern Liaoning University)
  • Kim, Hee-Cheol (Department of Computer Engineering/Institute of Digital Anti-Aging Healthcare (IDA), Inje University)
  • Received: 2017.03.09
  • Accepted: 2017.03.17
  • Published: 2017.03.31

Abstract

Filter-based methods are the preferred feature selection methods for text classification. In a common filter-based scheme, each feature is assigned a score, the features are sorted by score, and the top-N features are added to the feature set. In this paper, we propose an improved global feature selection scheme in which this last step is modified to obtain a more representative feature set. The proposed method aims to improve the classification performance of global feature selection methods by creating a feature set that represents all classes almost equally. To this end, a local feature selection method labels features according to their discriminative power on each class, and these labels are used when producing the feature set. Experimental results on the well-known 20 Newsgroups and Reuters-21578 datasets, obtained with the k-nearest neighbor algorithm and a support vector machine, indicate that the proposed method improves classification performance in terms of the widely used $F_1$ measure.
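
The difference between the standard top-N step and a class-balanced selection can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify in detail. It assumes that each feature already has a global score and a per-class local score, that a feature is labeled with the class on which its local score is highest, and that each class may contribute at most an equal quota of features. The function names, the quota rule, and the toy scores are all hypothetical.

```python
from collections import defaultdict


def top_n(global_scores, n):
    """Standard filter scheme: sort features by global score, keep the top n."""
    ranked = sorted(global_scores, key=global_scores.get, reverse=True)
    return ranked[:n]


def label_features(local_scores):
    """Assumed labeling rule: a feature gets the class with its highest local score."""
    return {f: max(per_class, key=per_class.get)
            for f, per_class in local_scores.items()}


def balanced_top_n(global_scores, local_scores, n):
    """Walk the global ranking, capping how many features each class label contributes."""
    labels = label_features(local_scores)
    classes = sorted(set(labels.values()))
    quota = -(-n // len(classes))  # ceil(n / number of classes), an assumed rule
    ranked = sorted(global_scores, key=global_scores.get, reverse=True)
    taken = defaultdict(int)
    selected = []
    for f in ranked:
        if len(selected) == n:
            break
        if taken[labels[f]] < quota:
            selected.append(f)
            taken[labels[f]] += 1
    # If some class could not fill its quota, top up from the global ranking.
    for f in ranked:
        if len(selected) == n:
            break
        if f not in selected:
            selected.append(f)
    return selected


# Toy scores (made up for illustration): three sport terms dominate the ranking.
global_scores = {"goal": 0.9, "match": 0.8, "team": 0.7,
                 "vote": 0.6, "stock": 0.5, "bank": 0.4}
local_scores = {
    "goal":  {"sport": 0.9, "politics": 0.1, "finance": 0.0},
    "match": {"sport": 0.8, "politics": 0.2, "finance": 0.1},
    "team":  {"sport": 0.7, "politics": 0.3, "finance": 0.1},
    "vote":  {"sport": 0.1, "politics": 0.9, "finance": 0.2},
    "stock": {"sport": 0.0, "politics": 0.2, "finance": 0.9},
    "bank":  {"sport": 0.1, "politics": 0.3, "finance": 0.8},
}

plain = top_n(global_scores, 3)
balanced = balanced_top_n(global_scores, local_scores, 3)
print(plain)
print(balanced)
```

With these toy scores, the plain top-3 set contains only terms labeled "sport", whereas the balanced variant takes one term per class. This is the kind of imbalance the abstract's "representing all classes almost equally" is meant to address.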
