Automatic Document Classification Based on k-NN Classifier and Object-Based Thesaurus

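  • 방선이 (Department of Computer Statistics and Information, Jeonbuk National University)
  • 양재동 (Division of Electronics and Information Engineering, Jeonbuk National University)
  • 양형정 (Department of Computer Science, Carnegie Mellon University)
  • Published: 2004.09.01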

Abstract

Numerous statistical and machine learning techniques have been studied for automatic text classification. Because these techniques train classifiers using only the feature vectors of documents, it is frequently unclear which of two candidate categories a document belongs to, and this ambiguity limits the precision they can reach. To remedy this drawback, we propose a method that incorporates relationship information among categories into existing classifiers. The method first classifies a document with the k-NN classifier, which is known for relatively good performance despite its simplicity. When the resulting category assignment is ambiguous, the method consults the generalization, aggregation, association, and instance relationships among categories provided by an object-based thesaurus to decide which category the document should be assigned to. Experimental results show that the method improves precision by up to 13.86% over k-NN classification alone while preserving recall.
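
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how ambiguity is detected or how thesaurus relationships are weighted, so everything specific here is an assumption made for illustration: the `margin` threshold, the `weight` factor, the similarity-sum scoring, the flat `related` mapping (which does not distinguish generalization, aggregation, association, and instance relationships), and the toy data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_scores(doc, training, k):
    """Sum the similarities of the k nearest training documents per category."""
    sims = [(cosine(doc, vec), cat) for vec, cat in training]
    nearest = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
    scores = {}
    for sim, cat in nearest:
        scores[cat] = scores.get(cat, 0.0) + sim
    return scores

def classify(doc, training, related, k=3, margin=0.2, weight=1.0):
    """k-NN first; consult the thesaurus only when the top two categories are close.

    Assumes `training` is non-empty. `related` maps a category to the set of
    categories linked to it in the thesaurus (hypothetical representation).
    """
    scores = knn_scores(doc, training, k)
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Unambiguous case: a single candidate, or a clear gap between the top two.
    if len(ranked) < 2 or scores[ranked[0]] - scores[ranked[1]] > margin:
        return ranked[0]

    # Ambiguous case: add the scores of thesaurus-related categories
    # that also appear among the k nearest neighbors.
    def support(cat):
        linked = related.get(cat, ())
        return scores[cat] + weight * sum(scores.get(r, 0.0) for r in linked)

    return max(ranked[:2], key=support)

# Toy data: (term-weight vector, category) pairs.
training = [
    ({"index": 2, "packet": 1}, "network"),
    ({"index": 1, "query": 1}, "database"),
    ({"join": 1, "query": 1}, "sql"),
    ({"packet": 1, "router": 1}, "network"),
]
thesaurus = {"database": {"sql"}}  # e.g. "sql" is linked to "database"
doc = {"index": 2, "join": 1}

print(classify(doc, training, thesaurus))  # database
print(classify(doc, training, {}))         # network
```

In this toy case plain k-NN slightly prefers "network", but the gap to "database" falls within the margin, and the related "sql" neighbor tips the decision to "database". A fuller version would presumably treat each relationship type differently, which the abstract mentions but does not detail.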
