Developing a Text Categorization System Based on Unsupervised Learning Using an Information Retrieval Technique

정보검색 기술을 이용한 비지도 학습 기반 문서 분류 시스템 개발

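  • 노대욱 (School of Information and Communication Engineering, Yonsei University)
  • 이수용 (School of Information and Communication Engineering, Yonsei University)
  • 나동열 (School of Information and Communication Engineering, Yonsei University)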
  • Published : 2007.02.15

Abstract

Developing a text classifier with supervised learning requires a large manually labeled corpus, and building such a corpus takes considerable time and human effort. Recently, a research paradigm was proposed that instead uses a raw corpus and a small amount of seed information for each category. In this paper we introduce an unsupervised learning method under this paradigm that achieves better performance than related work. The characteristic of our approach is that, starting from seed words, average mutual information is used to learn other representative words and their weights, and the weights are then updated using a technique inspired by work in information retrieval. By iterating this learning process, we show that a high-performance system can be developed.
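
The abstract names average mutual information as the score used to find representative words but does not give the formula. The following is a minimal sketch of one common definition, assuming binary events (the word is present in a document, the document belongs to the category), documents represented as sets of tokens, and category labels that have already been assigned, for example by matching seed words. The function name, the toy documents, and the labels are hypothetical and are not taken from the paper. The information-retrieval-inspired weight update is not specified in the abstract and is not sketched here.

```python
import math
from collections import Counter


def average_mutual_information(docs, labels, word, category):
    """Average mutual information (in bits) between two binary events:
    `word` occurs in a document, and the document is labeled `category`.

    Assumption: docs is a list of token sets and labels is a parallel list
    of (possibly seed-derived) category names.
    """
    n = len(docs)
    counts = Counter()
    for tokens, label in zip(docs, labels):
        counts[(word in tokens, label == category)] += 1

    ami = 0.0
    # Only cells with a nonzero joint count contribute to the sum.
    for (w, c), joint in list(counts.items()):
        p_wc = joint / n
        p_w = (counts[(w, True)] + counts[(w, False)]) / n
        p_c = (counts[(True, c)] + counts[(False, c)]) / n
        ami += p_wc * math.log2(p_wc / (p_w * p_c))
    return ami


# Hypothetical toy corpus: two "sports" and two "politics" documents.
docs = [
    {"goal", "team"},
    {"goal", "match"},
    {"vote", "team"},
    {"vote", "law"},
]
labels = ["sports", "sports", "politics", "politics"]

goal_score = average_mutual_information(docs, labels, "goal", "sports")
team_score = average_mutual_information(docs, labels, "team", "sports")
print(goal_score)  # 1.0: "goal" perfectly predicts the category
print(team_score)  # 0.0: "team" is independent of the category
```

Under this definition the score is symmetric: a word that occurs only outside the category (here "vote" for "sports") scores as high as one that occurs only inside it, so a selection step would also need to check the direction of the association before treating a word as representative.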
