A Research on Enhancement of Text Categorization Performance by using Okapi BM25 Word Weight Method

Okapi BM25 단어 가중치법 적용을 통한 문서 범주화의 성능 향상

  • 이용훈 (Department of Computer Science, Dankook University)
  • 이상범 (Department of Computer Science, Dankook University)
  • Received : 2010.10.21
  • Accepted : 2010.12.17
  • Published : 2010.12.31

Abstract

Text categorization, which classifies documents according to given criteria, is an important feature of information retrieval systems. The general approach classifies target documents by extracting important index terms and assigning weights to them. The effectiveness of the weighting algorithm therefore matters, because the performance and accuracy of text categorization depend on it. This paper introduces an enhanced text categorization method based on an improved word weighting technique. Okapi BM25 has proved effective in several information retrieval engines. We applied Okapi BM25 to categorization and found that it performs well. We compared it with other word weighting methods: TF-IDF, TF-ICF, and TF-ISF. The experiments used the Reuters-21578 collection with SVM and KNN classifiers. The modified Okapi BM25 achieved the best performance among the methods compared.
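The abstract contrasts Okapi BM25 term weights with TF-IDF but does not give the formulas. The following is a minimal sketch of both weightings over tokenized documents, not the authors' modified BM25, whose details are not stated here. The function names, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: weight} dict per tokenized document (docs must be non-empty).

    Standard Okapi BM25 term weight; k1 and b are assumed defaults,
    not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        weighted.append({
            # Assumed IDF variant: log(1 + ...) keeps every weight positive.
            t: math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            * f * (k1 + 1) / (f + norm)
            for t, f in tf.items()
        })
    return weighted


def tfidf_weights(docs):
    """Plain TF-IDF baseline: raw term frequency times log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))
    return [
        {t: f * math.log(n_docs / df[t]) for t, f in Counter(d).items()}
        for d in docs
    ]


docs = [
    ["oil", "oil", "oil", "oil", "price"],
    ["wheat", "price"],
    ["grain", "export"],
]
bm25 = bm25_weights(docs)
tfidf = tfidf_weights(docs)
```

In this sketch the TF-IDF weight grows linearly with term frequency, while the BM25 weight saturates below `(k1 + 1)` times the IDF. The resulting per-document weight dictionaries could then be turned into feature vectors for an SVM or KNN classifier.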

Keywords

Text Categorization; Document Classification; TF-IDF; TF-ICF; TF-ISF; Okapi BM25; SVM; Reuters-21578

Acknowledgement

Supported by: Dankook University
