A Study on the Automatic Descriptor Assignment for Scientific Journal Articles Using Rocchio Algorithm

로치오 알고리즘을 이용한 학술지 논문의 디스크립터 자동부여에 관한 연구

  • 김판준 (Department of Library and Information Science, Yonsei University)
  • Published : 2006.09.29


This study examined several performance factors that affect automatic indexing with a controlled vocabulary and text categorization based on the Rocchio algorithm, and tried simple methods for improving their performance. The results of the Rocchio-based methods were also compared with those of other learning-based methods under the same conditions. While retaining their strengths of easy implementation and computational efficiency, the Rocchio-based methods performed as well as or better than the other learning-based methods (SVM, VPT, NB). In particular, for semi-automatic indexing (computer-aided indexing), Rocchio-based methods operating at a high recall level could be used preferentially.
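
To make the general idea concrete, the following is a minimal sketch of a Rocchio-style descriptor ranker, not the configuration used in the study. It builds one prototype vector per descriptor from the mean of its positive training documents minus a weighted mean of the remaining documents, then ranks descriptors for a new article by cosine similarity. The function names, the plain term-frequency weighting, the clipping of negative weights, and the values beta=16 and gamma=4 are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tf_vector(tokens):
    """Return an L2-normalized term-frequency vector as a dict (assumed weighting)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def train_rocchio(docs, beta=16.0, gamma=4.0):
    """Build one prototype per descriptor.

    docs: list of (tokens, set_of_descriptors) pairs.
    beta, gamma: illustrative weights for positive and negative examples.
    """
    vectors = [(tf_vector(tokens), labels) for tokens, labels in docs]
    descriptors = set().union(*(labels for _, labels in vectors))
    prototypes = {}
    for d in descriptors:
        pos = [v for v, labels in vectors if d in labels]
        neg = [v for v, labels in vectors if d not in labels]
        proto = defaultdict(float)
        for v in pos:
            for t, w in v.items():
                proto[t] += beta * w / len(pos)
        for v in neg:
            for t, w in v.items():
                proto[t] -= gamma * w / len(neg)
        # Keep only positive weights (a common simplification, assumed here).
        prototypes[d] = {t: w for t, w in proto.items() if w > 0}
    return prototypes


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_descriptors(prototypes, tokens, top_k=3):
    """Return the top_k descriptors for a document, best match first."""
    doc = tf_vector(tokens)
    scored = sorted(((cosine(doc, p), d) for d, p in prototypes.items()), reverse=True)
    return [d for _, d in scored[:top_k]]
```

In a semi-automatic (computer-aided) indexing setting, a larger `top_k` would trade precision for recall, leaving the final choice among the suggested descriptors to a human indexer.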

