A Novel Statistical Feature Selection Approach for Text Categorization

  • Fattah, Mohamed Abdel (Dept. of Computer Sciences, College of Computer Science and Engineering, Taibah University)
  • Received : 2016.11.11
  • Accepted : 2017.05.29
  • Published : 2017.10.31


In text categorization, selecting distinctive text features is important because the feature space is high-dimensional. Reducing the dimension of the feature space is needed to decrease processing time and increase accuracy. In this study, we introduce a novel statistical feature selection approach for text categorization. The approach measures the term distribution across all documents in the collection, the term distribution within a given category, and the term distribution in a given class relative to other classes. The results show that the proposed method outperforms traditional feature selection methods.
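
The abstract does not give the scoring formula, so the following is only a minimal sketch of the three kinds of term statistics it names, not the author's method. The function name `term_distributions`, the use of document frequencies, and the "relative" score (in-class rate minus out-of-class rate) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def term_distributions(docs, labels):
    """Compute three illustrative term statistics per (term, class) pair.

    docs   : list of token lists, one per document
    labels : list of class names, parallel to docs
    NOTE: these definitions are assumptions, not the paper's formulas.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)        # number of documents per class
    df_all = Counter()                   # document frequency over the collection
    df_class = defaultdict(Counter)      # document frequency within each class

    for tokens, label in zip(docs, labels):
        for term in set(tokens):         # count each term once per document
            df_all[term] += 1
            df_class[label][term] += 1

    stats = {}
    for term, df in df_all.items():
        for label, size in class_sizes.items():
            in_class = df_class[label][term]
            out_docs = n_docs - size     # documents belonging to other classes
            out_class = df - in_class    # of those, how many contain the term
            in_rate = in_class / size
            out_rate = out_class / out_docs if out_docs else 0.0
            stats[(term, label)] = {
                "collection": df / n_docs,       # spread over all documents
                "in_class": in_rate,             # spread inside this class
                "relative": in_rate - out_rate,  # this class vs. other classes
            }
    return stats


# Tiny hypothetical corpus with two classes.
docs = [
    ["goal", "match"],
    ["goal", "team"],
    ["vote", "election"],
    ["vote", "team"],
]
labels = ["sport", "sport", "politics", "politics"]
stats = term_distributions(docs, labels)
```

In this toy corpus, "goal" occurs in every sport document and in no politics document, so its "relative" value for the sport class is 1.0, whereas "team" occurs equally often in both classes and gets 0.0. A feature selector built on such statistics would keep the terms with the highest scores per class.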


