A Study on the Reclassification of Author Keywords for Automatic Assignment of Descriptors

디스크립터 자동 할당을 위한 저자키워드의 재분류에 관한 실험적 연구

  • Received : 2012.06.02
  • Accepted : 2012.06.21
  • Published : 2012.06.30

Abstract

This study investigates whether descriptors (controlled keywords) can be assigned automatically by reclassifying the author keywords (uncontrolled keywords) offered in the search services of major domestic scholarly databases. In the first stage, we ran experiments comparing the characteristics of major machine learning classifiers and selected the optimal classifier and parameters for the reclassification. In the next stage, we trained on the author keywords assigned to articles in domestic journals in the field of reading, then reclassified those same articles so that additional keywords were assigned to them. We then examined whether the documents newly added under each keyword showed a vocabulary control effect, that is, whether the reclassified author keywords collocate articles on the same topic just as descriptors do. The results confirmed that reclassifying author keywords can achieve the effect of automatic descriptor assignment.
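The reclassification step described above can be illustrated with a minimal sketch. The abstract does not say which classifier or parameters were selected, so everything here is an assumption made for illustration: a centroid (Rocchio-style) classifier with cosine similarity, raw term-frequency vectors, a hypothetical `threshold` of 0.8, and four invented toy documents. Each author keyword is treated as a category, a centroid is built from the articles that already carry it, and the keyword is then added to any other article that is similar enough to that centroid.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Raw term-frequency vector over lowercased, whitespace-split tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(docs):
    """Sum the term vectors of all documents that carry each author keyword."""
    centroids = defaultdict(Counter)
    for text, keywords in docs:
        vec = vectorize(text)
        for kw in keywords:
            centroids[kw].update(vec)
    return centroids


def reclassify(docs, centroids, threshold=0.8):
    """For each document, list the keywords it gains from reclassification.

    `threshold` is a hypothetical cutoff, not a value from the study.
    """
    added = []
    for text, keywords in docs:
        vec = vectorize(text)
        new = sorted(
            kw for kw, centroid in centroids.items()
            if kw not in keywords and cosine(vec, centroid) >= threshold
        )
        added.append(new)
    return added


# Invented toy data: (article text, author keywords already assigned).
docs = [
    ("reading education program for elementary students", {"reading education"}),
    ("effects of reading education on students", {"reading education"}),
    ("reading education program effects on elementary students", {"reading program"}),
    ("citation analysis of journal articles", {"bibliometrics"}),
]

centroids = train_centroids(docs)
added = reclassify(docs, centroids)
for (text, _), new in zip(docs, added):
    print(text, "->", new)
```

In this toy run only the third article gains a keyword ("reading education"), so it is now collocated with the first two articles on the same topic. That is the kind of vocabulary control effect the study examines, although the actual experiments may have used a different classifier, different term weighting, and tuned parameters.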

Cited by

  1. An Experimental Study on the Performance Improvement of Automatic Classification for the Articles of Korean Journals Based on Controlled Keywords in International Database vol.48, pp.3, 2014, https://doi.org/10.4275/KSLIS.2014.48.3.491
  2. A Study on the Application to Network Analysis on the Importance of Author Keyword based on the Position of Keyword vol.31, pp.2, 2014, https://doi.org/10.3743/KOSIM.2014.31.2.121
  3. A Study on the Correlation between the Appearance Frequency of Author Keyword and the Number of Citation in the Humanities and Social Science Journal Articles of the Korea Citation Index (KCI) vol.30, pp.2, 2013, https://doi.org/10.3743/KOSIM.2013.30.2.227
  4. A Study on the Factors Influencing Semantic Relation in Building a Structured Glossary vol.48, pp.2, 2014, https://doi.org/10.4275/KSLIS.2014.48.2.353