Robust Algorithms for Combining Multiple Term Weighting Vectors for Document Classification

  • Kim, Minyoung (Department of Electronics & IT Media Engineering, Seoul National University of Science & Technology)
  • Received : 2016.04.03
  • Accepted : 2016.06.20
  • Published : 2016.06.30

Abstract

Term weighting is a popular technique that weights term features to improve accuracy in document classification. While several successful term weighting algorithms have been proposed, none of them performs consistently well across different data domains. In this paper we propose principled methods for combining different term weight vectors to yield a robust document classifier that performs consistently well on diverse datasets. Specifically, we suggest two approaches: (i) learning a single weight vector that lies in the convex hull of the base vectors while minimizing the class prediction loss, and (ii) a minimax classifier that achieves robustness over the individual weight vectors by minimizing the loss of the worst-performing strategy among the base vectors. We provide efficient solution methods for these optimization problems. The effectiveness and robustness of the proposed approaches are demonstrated on several benchmark document datasets, where they significantly outperform existing term weighting methods.
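The convex-hull combination in approach (i) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `learn_convex_combination` is hypothetical, a logistic loss stands in for the paper's classification loss, and the simplex constraint on the mixing weights is handled with an assumed softmax parameterization rather than the paper's solver.

```python
import numpy as np

def learn_convex_combination(X, y, V, lr=1.0, iters=500):
    """Learn mixing weights alpha on the probability simplex so that the
    combined term-weight vector w = alpha @ V minimizes the logistic loss
    of the linear scores X @ w.
    X: (n, d) document-term features, y: (n,) labels in {-1, +1},
    V: (K, d) base term-weight vectors (e.g. tf-idf and supervised variants).
    """
    S = X @ V.T                      # (n, K): score under each base vector
    theta = np.zeros(V.shape[0])     # unconstrained softmax parameters
    for _ in range(iters):
        a = np.exp(theta - theta.max())
        a /= a.sum()                 # point in the simplex -> convex hull of V
        s = S @ a                    # combined classification scores
        p = 1.0 / (1.0 + np.exp(y * s))      # = sigmoid(-y * s)
        g_a = -(y * p) @ S / len(y)          # d(mean logistic loss)/d(alpha)
        theta -= lr * a * (g_a - a @ g_a)    # chain rule through the softmax
    a = np.exp(theta - theta.max())
    a /= a.sum()
    return a, a @ V                  # mixing weights, combined weight vector
```

On synthetic data where one base vector separates the classes and another does not, the learned mixture concentrates on the informative vector, which is the intended robustness behavior: no single fixed weighting has to be trusted in advance.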


Cited by

  1. Simultaneous Learning of Sentence Clustering and Class Prediction for Improved Document Classification vol.17, pp.1, 2017, https://doi.org/10.5391/IJFIS.2017.17.1.35