A Study on Automatic Assignment of Descriptors Using Machine Learning

  • 김판준 (Department of Library and Information Science, Yonsei University)
  • Published: 2006.03.01

Abstract

This study applies several machine learning approaches to the automatic assignment of descriptors to journal articles. Core journals in the field of information science were selected, and a test collection was built from the articles they published over the past 11 years; the effects of feature selection and of training set size on performance were then examined. For feature selection, Support Vector Machines (SVM) trained after reducing the feature set with the $\chi^2$ statistic (CHI) and with criteria that prefer high-frequency features (COS, GSS, JAC) performed best. Training set size significantly influenced the performance of SVM and Voted Perceptron (VTP), but had little effect on Naive Bayes (NB).
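As an illustration of the CHI criterion mentioned above, the following is a minimal sketch of chi-square feature scoring for a single descriptor. It assumes the standard two-by-two contingency form commonly used in text categorization, $\chi^2(t, c) = \frac{N(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$; the abstract does not state the exact formula used in the study. The function names, the toy documents, and the descriptor labels are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square score for one term and one descriptor (2x2 table).

    a: documents with the descriptor that contain the term
    b: documents without the descriptor that contain the term
    c: documents with the descriptor that lack the term
    d: documents without the descriptor that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table: the term or descriptor never (or always) occurs.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def score_terms(docs, labels, descriptor):
    """Score every term against one descriptor.

    docs: list of term sets, one per document (hypothetical input shape)
    labels: list of descriptor sets, aligned with docs
    """
    n_pos = sum(1 for y in labels if descriptor in y)
    n_neg = len(docs) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for terms, y in zip(docs, labels):
        # Count document frequency separately for positive and negative docs.
        (pos_df if descriptor in y else neg_df).update(set(terms))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    return scores


def top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy collection, invented purely for illustration.
docs = [
    {"indexing", "thesaurus"},
    {"indexing", "retrieval"},
    {"citation", "retrieval"},
    {"citation", "journal"},
]
labels = [{"Indexing"}, {"Indexing"}, {"Bibliometrics"}, {"Bibliometrics"}]

scores = score_terms(docs, labels, "Indexing")
selected = top_k(scores, 2)
```

Note that the chi-square score measures the strength of dependence between a term and a descriptor, not its direction, so a term that only appears in documents lacking the descriptor can score as highly as one that only appears in documents carrying it. The reduced term set would then be passed to a classifier such as an SVM.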
