A Novel Statistical Feature Selection Approach for Text Categorization

  • Fattah, Mohamed Abdel (Dept. of Computer Sciences, College of Computer Science and Engineering, Taibah University)
  • Received : 2016.11.11
  • Accepted : 2017.05.29
  • Published : 2017.10.31

Abstract

In text categorization, selecting distinctive text features is important because the feature space is high-dimensional. Reducing the feature space dimension is important for decreasing processing time and increasing accuracy. In this study, we introduce a novel statistical feature selection approach for text categorization. The approach measures the distribution of a term across all documents in the collection, its distribution within a certain category, and its distribution in a certain class relative to other classes. The results show that the proposed method outperforms traditional feature selection methods.
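The three term distributions named in the abstract can be illustrated with a small sketch. This is not the paper's scoring formula, which the abstract does not state. The function name `term_distributions`, the toy corpus, and the choice of a simple rate difference for the "relative to other classes" quantity are all assumptions made for illustration.

```python
def term_distributions(docs, labels, term, category):
    """Return three illustrative statistics for `term` with respect to `category`.

    docs   : list of token lists (one list per document)
    labels : list of category names, parallel to docs
    NOTE: hypothetical helper; these are plain document-frequency rates,
    not the scoring function proposed in the paper.
    """
    doc_sets = [set(d) for d in docs]
    in_cat = [s for s, y in zip(doc_sets, labels) if y == category]
    out_cat = [s for s, y in zip(doc_sets, labels) if y != category]

    # 1) Distribution over the whole collection: share of all documents
    #    that contain the term.
    collection_rate = sum(term in s for s in doc_sets) / len(doc_sets)

    # 2) Distribution inside the category: share of the category's
    #    documents that contain the term.
    category_rate = sum(term in s for s in in_cat) / len(in_cat) if in_cat else 0.0

    # 3) Distribution relative to the other classes: assumed here to be the
    #    difference between the in-category rate and the rate elsewhere.
    other_rate = sum(term in s for s in out_cat) / len(out_cat) if out_cat else 0.0
    relative = category_rate - other_rate

    return collection_rate, category_rate, relative


# Toy corpus (invented for this example).
docs = [
    ["goal", "match", "team"],
    ["team", "coach"],
    ["stock", "market"],
    ["market", "team"],
]
labels = ["sport", "sport", "finance", "finance"]

print(term_distributions(docs, labels, "team", "sport"))
print(term_distributions(docs, labels, "market", "sport"))
```

A term such as "team" appears in most of the collection but is only moderately specific to "sport", while "market" never appears in "sport" and gets a negative relative value. A feature selector built on such statistics would rank terms by some combination of the three and keep the top-scoring ones.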
