An Automatic Text Categorization Theories and Techniques for Text Management

문서관리를 위한 자동문서범주화에 대한 이론 및 기법

  • Published : 2002.06.30

Abstract

With the growth of digital libraries and the spread of the Internet, the amount of text information available online has increased rapidly, and with it the need for efficient information management and retrieval. Automatic text categorization assigns documents to predefined categories based on their content; it supports efficient management and retrieval while reducing the manual labor that categorization otherwise requires. To classify documents, the features that best represent them are first selected, each document is then indexed in terms of those features, and a classifier assigns the indexed document to a category. This paper introduces each step of automatic text categorization and several techniques used at each step.
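
The three steps named in the abstract (feature selection, indexing, classification) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the paper: the document-frequency threshold, the TF-IDF weighting, the 1-nearest-neighbor cosine classifier, the toy corpus, and every function name below are choices made here for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (text, category) pairs invented for this sketch.
TRAINING_DOCS = [
    ("football match goal team", "sports"),
    ("team scored goal match", "sports"),
    ("football team won match", "sports"),
    ("bank raised interest rates", "finance"),
    ("stock market rates fell", "finance"),
    ("bank stock market interest", "finance"),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def select_features(texts, min_df=2):
    """Step 1, feature selection: keep terms found in at least min_df documents."""
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))  # count each term once per document
    features = sorted(term for term, n in df.items() if n >= min_df)
    return features, df

def index_document(text, features, df, n_docs):
    """Step 2, indexing: represent a document as a TF-IDF vector over the features."""
    tf = Counter(tokenize(text))
    return [tf[term] * math.log(n_docs / df[term]) for term in features]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train(labeled_docs, min_df=2):
    """Select features, then index every training document."""
    texts = [text for text, _ in labeled_docs]
    features, df = select_features(texts, min_df)
    vectors = [
        (index_document(text, features, df, len(texts)), label)
        for text, label in labeled_docs
    ]
    return features, df, len(texts), vectors

def classify(text, model):
    """Step 3, classification: return the label of the most similar training document."""
    features, df, n_docs, vectors = model
    query = index_document(text, features, df, n_docs)
    _, label = max(vectors, key=lambda pair: cosine(query, pair[0]))
    return label

model = train(TRAINING_DOCS)
print(classify("team scored a late goal in the match", model))
print(classify("interest rates fell at the bank", model))
```

With this corpus the threshold keeps nine terms and discards the four that occur in only one document. A query that shares no selected term with any training document has zero similarity to all of them, so the sketch falls back to the label of the first training document; a real system would need an explicit rule for that case.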
