로치오 알고리즘을 이용한 학술지 논문의 디스크립터 자동부여에 관한 연구

A Study on the Automatic Descriptor Assignment for Scientific Journal Articles Using Rocchio Algorithm

  • Pan Jun Kim (김판준; Department of Library and Information Science, Yonsei University)
  • Published: 2006.09.29

Abstract

This study re-examined several performance factors that have been applied in controlled-vocabulary automatic indexing and text categorization based on the Rocchio algorithm, and looked for basic ways to improve performance. It also compared the Rocchio-based method for controlled-vocabulary automatic indexing with other learning-based methods under the same conditions. According to the results, the Rocchio-based profile method retained its established advantages, ease of implementation and low computational cost, while performing about as well as or better than the other learning-based methods (SVM, VPT, NB). In particular, for semi-automatic (computer-aided) indexing that supports the work of professional indexers, the Rocchio-based method can be considered first, because it maintains a relatively high level of recall while its precision improves substantially as the training data grows.
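As a rough illustration of what a Rocchio-based profile method for descriptor assignment can look like, the following Python sketch builds one term-weight profile per descriptor and ranks descriptors for a new document by cosine similarity. It is not the configuration used in the study: the function names, the plain term-frequency weighting, the values beta=16 and gamma=4, the dropping of negative weights, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tf_vector(tokens):
    """Length-normalized term-frequency vector, stored as a dict."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}


def build_profiles(training_docs, beta=16.0, gamma=4.0):
    """Build one Rocchio profile per descriptor.

    training_docs: list of (tokens, set_of_descriptors) pairs.
    beta and gamma are assumed values, not taken from the paper.
    """
    vectors = [(tf_vector(tokens), labels) for tokens, labels in training_docs]
    all_labels = set().union(*(labels for _, labels in vectors))
    profiles = {}
    for label in all_labels:
        positives = [v for v, labels in vectors if label in labels]
        negatives = [v for v, labels in vectors if label not in labels]
        profile = defaultdict(float)
        # Add the weighted centroid of documents indexed with the descriptor.
        for vec in positives:
            for term, weight in vec.items():
                profile[term] += beta * weight / len(positives)
        # Subtract the weighted centroid of all other documents.
        for vec in negatives:
            for term, weight in vec.items():
                profile[term] -= gamma * weight / len(negatives)
        # Keep only positive weights (a common convention, assumed here).
        profiles[label] = {t: w for t, w in profile.items() if w > 0}
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_descriptors(tokens, profiles, top_k=3):
    """Rank descriptors by similarity to the document; return the top_k."""
    doc = tf_vector(tokens)
    ranked = sorted(profiles, key=lambda d: cosine(doc, profiles[d]), reverse=True)
    return ranked[:top_k]


# Toy training data (hypothetical).
training = [
    (["svm", "kernel", "margin"], {"machine learning"}),
    (["index", "thesaurus", "descriptor"], {"indexing"}),
    (["thesaurus", "vocabulary", "index"], {"indexing"}),
]
profiles = build_profiles(training)
suggestions = suggest_descriptors(["thesaurus", "index"], profiles, top_k=1)
```

Returning a ranked list rather than a fixed yes/no decision fits the semi-automatic setting described in the abstract, where a professional indexer reviews candidate descriptors; a larger `top_k` favors recall at the cost of precision.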
