Text Document Categorization using FP-Tree

A Document Classification Method Using FP-Tree

  • 박용기 (Department of Computer Science, Kyungpook National University)
  • 김황수 (Department of Computer Science, Kyungpook National University)
  • Published : 2007.11.15

Abstract

As the number of electronic documents grows explosively, automatic text categorization methods are needed to identify the documents of interest. Most existing methods apply machine learning techniques to a set of words. This paper introduces a new method called FPTC (FP-Tree based Text Classifier). The FP-Tree is a data structure used in data mining. We present a method that stores text sentence patterns in the FP-Tree structure and classifies text using those patterns. In our experiments, we combine the algorithm with a mutual information and entropy approach to improve performance. We also present an analysis of the algorithm via an ordinary differential categorization method.
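
The abstract does not give the details of FPTC, so the following Python sketch only illustrates the general idea of storing sentence word patterns in an FP-Tree and using them to classify text. It builds one tree per category, using the standard two-pass construction (count item frequencies, then insert each sentence as a frequency-ordered path). The `match_score` rule, the `classify` helper, and the toy training sentences are assumptions made for this illustration, not the authors' method, and the mutual information and entropy step is omitted.

```python
from collections import Counter


class FPNode:
    """One node of an FP-tree: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


class FPTree:
    """Prefix tree over transactions whose items are sorted by global frequency."""

    def __init__(self, transactions, min_support=1):
        # First pass: count in how many transactions each item occurs.
        self.freq = Counter(item for t in transactions for item in set(t))
        self.min_support = min_support
        self.root = FPNode()
        # Second pass: insert each transaction as a frequency-ordered path.
        for t in transactions:
            self._insert(self._order(t))

    def _order(self, transaction):
        # Keep frequent items only, most frequent first; ties broken alphabetically.
        items = {i for i in transaction if self.freq[i] >= self.min_support}
        return sorted(items, key=lambda i: (-self.freq[i], i))

    def _insert(self, items):
        node = self.root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1

    def match_score(self, transaction):
        # Hypothetical scoring rule (an assumption, not taken from the paper):
        # follow the ordered items down the tree and sum the counts of the
        # nodes reached before the path breaks.
        node, score = self.root, 0
        for item in self._order(transaction):
            node = node.children.get(item)
            if node is None:
                break
            score += node.count
        return score


def classify(trees, sentence):
    """Return the category whose tree gives the highest match score."""
    words = sentence.lower().split()
    return max(trees, key=lambda c: trees[c].match_score(words))


# Toy training sentences, invented for illustration only.
training = {
    "grain": ["wheat exports rise", "wheat harvest falls", "corn and wheat prices"],
    "money": ["bank raises interest rates", "central bank cuts rates", "bank lending grows"],
}
trees = {c: FPTree([s.lower().split() for s in docs]) for c, docs in training.items()}

print(classify(trees, "wheat prices rise"))   # prints "grain"
print(classify(trees, "bank rates steady"))   # prints "money"
```

Because sentences that share frequent words share a prefix path, the tree stays compact, and the node counts along a path indicate how often a pattern occurred in a category. A real classifier would likely need a more careful scoring rule and feature selection than this sketch uses.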
