Comparison Between Optimal Features of Korean and Chinese for Text Classification

  • Ren, Mei-Ying / 임미영 (Dept. of Computer & Information Engineering, Daegu University)
  • Kang, Sinjae / 강신재 (School of Computer & Information Technology, Daegu University)
  • Received : 2015.03.03
  • Accepted : 2015.06.11
  • Published : 2015.08.25

Abstract

This paper proposes optimal features for text classification based on the linguistic characteristics of Korean and Chinese. The experiments compared three kinds of features: n-grams, which are language-independent; morphemes, which are language-dependent; and combined feature sets made up of both n-grams and morphemes. Internet news articles in each language were classified with an SVM classifier. For Korean text classification, bi-grams were the best feature, with the highest F1-measure of 87.07%. For Chinese text classification, the combined feature set 'uni-gram+noun+verb+adjective+idiom' performed best, with the highest F1-measure of 82.79%.

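The feature types compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that n-grams are taken over characters (syllables in Korean) with whitespace removed, and that morphemes come from an external part-of-speech tagger, here passed in as a plain list. The function names `char_ngrams`, `combined_features`, and `f1_measure` and the `ng:`/`mo:` prefixes are hypothetical, chosen only for this example.

```python
from __future__ import annotations

from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return overlapping character n-grams, skipping whitespace (an assumption)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def combined_features(text: str, n: int, morphemes: list[str]) -> Counter:
    """Count n-gram features together with externally supplied morphemes.

    `morphemes` stands in for the output of a POS tagger, which is not
    implemented here. The prefixes keep the two feature spaces apart.
    """
    feats = Counter(f"ng:{g}" for g in char_ngrams(text, n))
    feats.update(f"mo:{m}" for m in morphemes)
    return feats


def f1_measure(tp: int, fp: int, fn: int) -> float:
    """F1 for one category: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Bi-grams over a short Korean phrase.
    print(char_ngrams("문서 분류", 2))
    # Uni-grams plus hand-supplied morphemes for a short Chinese phrase.
    print(combined_features("文本分类", 1, ["文本", "分类"]))
    # F1 from hypothetical true-positive, false-positive, false-negative counts.
    print(f1_measure(tp=8, fp=2, fn=2))
```

Feature counts of this kind would typically be turned into weighted vectors before being passed to an SVM. The weighting scheme and SVM settings are not stated in the abstract, so they are left out of the sketch.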
