A Document Classification System Using Modified ECCD and Category Weight for each Document

  • Received : 2012.04.10
  • Accepted : 2012.05.22
  • Published : 2012.08.31

Abstract

Web document information services need a document classification system so that administrators can manage documents efficiently and users can search them conveniently. Existing document classification systems lose accuracy in two cases: when only a few of the selected feature words appear in the document to be classified, and when one category holds such a large share of the documents that most feature words are selected from that category when the model is built. To solve these problems, we propose a document classification system that uses the 'Modified ECCD' feature selection method together with a 'Category Weight for each Document' feature value. Experimental results show that the 'Modified ECCD' feature selection method achieves higher classification accuracy than the ${\chi}^2$ and ECCD methods. Moreover, adding the 'Category Weight for each Document' feature value to the feature words selected by 'Modified ECCD' during training yields still higher classification accuracy.
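
For orientation, the ${\chi}^2$ baseline that the proposed method is compared against can be sketched as follows. This is a minimal illustration of standard ${\chi}^2$ feature selection over a 2x2 term/category contingency table. It is not the 'Modified ECCD' method, whose definition is not given in this abstract. The function names `chi_square` and `select_features`, the choice to rank each term by its maximum score over categories, and the four-document toy corpus are all assumptions made for the example.

```python
from collections import Counter


def chi_square(n_tc, n_t, n_c, n):
    """Chi-square statistic for a term t and a category c.

    n_tc: documents in category c that contain t
    n_t:  documents that contain t
    n_c:  documents in category c
    n:    total number of documents
    """
    a = n_tc                    # in c, contains t
    b = n_t - n_tc              # not in c, contains t
    c = n_c - n_tc              # in c, lacks t
    d = n - n_t - n_c + n_tc    # not in c, lacks t
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0              # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    Each term is scored by its maximum chi-square value over all
    categories (an assumed, commonly used aggregation). Ties are
    broken alphabetically so the result is deterministic.
    """
    n = len(docs)
    doc_terms = [set(doc.split()) for doc in docs]
    n_c = Counter(labels)
    n_t = Counter(t for terms in doc_terms for t in terms)
    n_tc = Counter((t, y) for terms, y in zip(doc_terms, labels) for t in terms)
    scores = {
        t: max(chi_square(n_tc[(t, y)], n_t[t], n_c[y], n) for y in n_c)
        for t in n_t
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, for illustration only.
docs = [
    "goal match team",
    "team goal score",
    "stock market price",
    "market price fall",
]
labels = ["sports", "sports", "economy", "economy"]

print(select_features(docs, labels, 4))
# ['goal', 'market', 'price', 'team']
```

In this toy corpus the four selected terms each occur in every document of one category and in none of the other, so they receive the highest score. With a skewed corpus, most of the top-ranked terms could come from the dominant category, which is the kind of imbalance the abstract identifies as a source of lower accuracy.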
