Improving the Performance of Statistical Automatic Text Categorization by using Phrasal Patterns and Keyword Sets

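  • Han Jeong-gi (한정기) (Department of Computer Engineering, Graduate School, Dongguk University) ;
  • Park Min-gyu (박민규) (Development Office, Web Pattern Technology) ;
  • Jo Gwang-je (조광제) (DTP Business Division, Seoul System) ;
  • Kim Jun-tae (김준태) (Department of Computer Engineering, Dongguk University)
  • Published : 2000.04.01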

Abstract

This paper presents an automatic text categorization model that improves accuracy by combining statistical and knowledge-based categorization methods. In our model, the knowledge-based method is applied first, and the statistical method is then applied to the texts that the knowledge-based method could not categorize. This combination improves categorization accuracy while still categorizing every text without failure. For statistical categorization, the vector model with Inverted Category Frequency (ICF) weighting is used. For knowledge-based categorization, Phrasal Patterns and Keyword Sets are introduced to represent sentence patterns, and pattern matching is then performed. Experimental results on news articles show that categorization accuracy can be improved by combining the two different categorization methods.
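The two-stage scheme described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: a phrasal pattern is modelled as an ordered list of keyword sets, ICF is taken to be log(number of categories / number of categories containing the term), and category vectors are compared with cosine similarity. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def matches(tokens, pattern):
    """True if tokens contain one member of each keyword set, in order (assumed pattern form)."""
    pos = 0
    for keyword_set in pattern:
        while pos < len(tokens) and tokens[pos] not in keyword_set:
            pos += 1
        if pos == len(tokens):
            return False
        pos += 1
    return True

def knowledge_classify(tokens, rules):
    """Return the category of the first matching pattern, or None if nothing matches."""
    for category, pattern in rules:
        if matches(tokens, pattern):
            return category
    return None

def train_icf(category_docs):
    """Build TF x ICF category vectors; ICF = log(N_categories / category frequency) is an assumption."""
    tf = {c: Counter(t for doc in docs for t in doc) for c, docs in category_docs.items()}
    cf = Counter(t for counts in tf.values() for t in counts)  # categories containing each term
    icf = {t: math.log(len(tf) / n) for t, n in cf.items()}
    vectors = {c: {t: n * icf[t] for t, n in counts.items()} for c, counts in tf.items()}
    return vectors, icf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def statistical_classify(tokens, vectors, icf):
    """Always returns some category, so no text is left uncategorized."""
    doc = {t: n * icf.get(t, 0.0) for t, n in Counter(tokens).items()}
    return max(vectors, key=lambda c: cosine(doc, vectors[c]))

def classify(tokens, rules, vectors, icf):
    """Knowledge-based matching first; statistical fallback for unmatched texts."""
    category = knowledge_classify(tokens, rules)
    if category is not None:
        return category, "knowledge-based"
    return statistical_classify(tokens, vectors, icf), "statistical"

# Toy data, invented for illustration only.
category_docs = {
    "sports": [["team", "wins", "game"], ["player", "scores", "goal"]],
    "economy": [["stock", "market", "falls"], ["bank", "raises", "rates"]],
}
rules = [("sports", [{"team", "player"}, {"wins", "scores"}])]
vectors, icf = train_icf(category_docs)

print(classify(["the", "team", "wins"], rules, vectors, icf))
print(classify(["stock", "market", "rises"], rules, vectors, icf))
```

In this sketch the first text is caught by the pattern rule, while the second matches no pattern and falls through to the ICF-weighted vector comparison, which is the division of labour the abstract describes.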
