Classification Accuracy by Deviation-based Classification Method with the Number of Training Documents

학습문서의 개수에 따른 편차기반 분류방법의 분류 정확도

  • Lee, Yong-Bae (Dept. of Computer Education, Jeonju National University of Education)
  • 이용배 (전주교육대학교 컴퓨터교육과)
  • Received : 2014.04.13
  • Accepted : 2014.06.20
  • Published : 2014.06.28

Abstract

It is generally accepted that classification accuracy depends on the number of training documents, but few studies have shown how this number affects automatic text classification. This study evaluates the deviation-based classification model, which was recently developed for genre-based classification, and compares it with other classification algorithms as the number of training documents changes. Experimental results show that the deviation-based classification model achieves an accuracy of 0.8 when categorizing 7 genres with only 21 training documents, exceeding the accuracy of the Bayesian and SVM classifiers. The deviation-based classification model retains strong feature selection capability even with a small number of training documents because it learns subject information within each genre, whereas the other methods use a different learning process.
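The evaluation protocol summarized above, measuring accuracy as the training set grows, can be illustrated with a minimal sketch. The abstract does not specify the deviation-based model, so the sketch substitutes a plain multinomial Naive Bayes classifier as a stand-in baseline. The synthetic corpus, the helper names (`make_toy_corpus`, `learning_curve`) and the balanced split (21 documents taken as 3 per genre across 7 genres) are all assumptions made for illustration, not details from the paper.

```python
import math
import random
from collections import Counter, defaultdict


def train_nb(docs):
    """Fit a multinomial Naive Bayes model on (tokens, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_docs):
    """Fraction of test documents whose predicted label is correct."""
    hits = sum(predict_nb(model, toks) == label for toks, label in test_docs)
    return hits / len(test_docs)


def make_toy_corpus(n_per_genre, genres, rng):
    """Hypothetical synthetic data: each genre favors its own marker words."""
    docs = []
    for g in genres:
        for _ in range(n_per_genre):
            own = [f"{g}_w{rng.randrange(5)}" for _ in range(6)]
            noise = [f"common_w{rng.randrange(20)}" for _ in range(6)]
            docs.append((own + noise, g))
    return docs


def learning_curve(train_pool, test_docs, genres, sizes_per_genre):
    """Accuracy as a function of the total number of training documents."""
    curve = []
    for k in sizes_per_genre:
        subset = []
        for g in genres:
            # Take the first k pooled documents of each genre (balanced split).
            subset += [d for d in train_pool if d[1] == g][:k]
        curve.append((len(subset), accuracy(train_nb(subset), test_docs)))
    return curve


rng = random.Random(0)  # fixed seed so the toy run is repeatable
genres = [f"genre{i}" for i in range(7)]
train_pool = make_toy_corpus(10, genres, rng)
test_docs = make_toy_corpus(5, genres, rng)
curve = learning_curve(train_pool, test_docs, genres, [1, 3, 5, 10])
for n_train, acc in curve:
    print(f"{n_train:3d} training documents -> accuracy {acc:.2f}")
```

The second point on the curve corresponds to 21 training documents under the balanced-split assumption. The accuracies printed for this toy corpus say nothing about the results reported in the paper; the sketch only shows the shape of the experiment.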

Keywords

automatic classification; accuracy of classification; the number of training documents
