Automatic Text Categorization Using Passage-based Weight Function and Passage Type

Document Categorization Using a Passage-based Weight Function and Passage Types

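  • 주원균 (Korea Institute of Science and Technology Information) ;
  • 김진숙 (Korea Institute of Science and Technology Information) ;
  • 최기석 (Korea Institute of Science and Technology Information, National R&D System Development Office)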
  • Published : 2005.10.01

Abstract

Research in text categorization has been confined to whole-document-level classification, probably because of a lack of full-text test collections. However, the large quantities of full-length documents available today have renewed interest in text classification. A document is usually written in an organized structure to present its main topic(s). This structure can be expressed as a sequence of sub-topic text blocks, or passages. To reflect the sub-topic structure of a document, we propose a new passage-level, or passage-based, text categorization model, which segments a test document into several passages, assigns categories to each passage, and merges the passage categories into document categories. Compared with traditional document-level categorization, this model requires two additional steps: passage splitting and category merging. Using four subsets of the Reuters text categorization test collection and a full-text test collection whose documents range from tens to hundreds of kilobytes, we evaluated the proposed model, focusing on the effectiveness of various passage types and the importance of passage location in category merging. Our results show that simple window passages performed best on all test collections used in these experiments. We also found that passages contribute to the main topic(s) to different degrees, depending on their location in the test document.
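The passage-splitting step can be illustrated with a minimal sketch of one passage type, the fixed-size word window. The function name `split_into_windows` and the parameters `window_size` and `overlap` are hypothetical, chosen for illustration; the abstract does not state the window sizes or tokenization used in the experiments.

```python
from typing import List


def split_into_windows(text: str, window_size: int = 50, overlap: int = 0) -> List[List[str]]:
    """Split text into fixed-size word windows (one simple 'window' passage type).

    window_size and overlap are illustrative assumptions, not values from the paper.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    words = text.split()  # naive whitespace tokenization (assumption)
    step = window_size - overlap
    passages: List[List[str]] = []
    for start in range(0, len(words), step):
        passages.append(words[start:start + window_size])
        # Stop once a window has reached the end of the document.
        if start + window_size >= len(words):
            break
    return passages


if __name__ == "__main__":
    doc = " ".join(f"w{i}" for i in range(10))
    for passage in split_into_windows(doc, window_size=4):
        print(passage)
```

Windows of this kind ignore paragraph and section boundaries, so they need no structural markup in the document; the last window may be shorter than the others.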

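Research on text categorization has been confined to the whole-document level, but because most full-text documents today are written in an organized structure to express their main topics, text categorization needs to be reconsidered. This structure is expressed as a sequence of sub-topic text blocks, or passages, and to reflect the sub-topic structure of such documents we propose a passage-based text categorization model. The proposed model splits a document into passages, assigns categories to each passage, and merges the passage categories into categories for the whole document. Compared with conventional document categorization, two additional steps are required: passage splitting and passage merging. We ran experiments on four subsets of the Reuters collection and on a full-text test collection (KISTI-Theses) whose documents range from tens to hundreds of kilobytes, focusing on the effect of various passage types and the importance of passage location in the category merging step. In these experiments, window passages performed best on all test collections. Passages were also found to contribute differently to the main topic depending on their location in the document.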
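The category-merging step can be sketched as a position-weighted average of per-passage category scores. The abstract reports that passage location matters but does not give the weight function, so everything here is an assumption for illustration: the names `position_weights` and `merge_passage_scores`, the `edge_boost` that favors the first and last passages, and the decision `threshold`.

```python
from collections import defaultdict
from typing import Dict, List


def position_weights(n: int, edge_boost: float = 2.0) -> List[float]:
    """Hypothetical weight function: the first and last passages count more.

    This is a placeholder; the paper's actual weight function is not
    given in the abstract.
    """
    weights = [1.0] * n
    if n > 0:
        weights[0] = edge_boost
        weights[-1] = edge_boost
    return weights


def merge_passage_scores(
    passage_scores: List[Dict[str, float]], threshold: float = 0.5
) -> List[str]:
    """Merge per-passage category scores into document-level categories.

    Each dict maps a category name to that passage's classifier score.
    A category is assigned to the document when its position-weighted
    average score reaches the threshold (an illustrative rule).
    """
    weights = position_weights(len(passage_scores))
    norm = sum(weights)
    if norm == 0:
        return []

    totals: Dict[str, float] = defaultdict(float)
    for weight, scores in zip(weights, passage_scores):
        for category, score in scores.items():
            totals[category] += weight * score

    return sorted(c for c, total in totals.items() if total / norm >= threshold)


if __name__ == "__main__":
    scores = [
        {"grain": 0.9, "trade": 0.2},  # first passage
        {"trade": 0.8},                # middle passage
        {"grain": 0.7},                # last passage
    ]
    print(merge_passage_scores(scores))
```

Swapping `position_weights` for a uniform weighting gives a plain average, which makes it easy to compare location-aware and location-blind merging on the same passage scores.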
