References
- Chung, Eunkyung (2009). A semantic-based feature expansion approach for improving the effectiveness of text categorization by using WordNet. Journal of the Korean Society for Information Management, 26(3), 261-278. https://doi.org/10.3743/KOSIM.2009.26.3.261
- KCI(Korea Citation Index) (2022). Data Statistics. National Research Foundation of Korea. Available: https://www.kci.go.kr/kciportal/po/statistics/poStatisticsMain.kci?tab_code=Tab3
- Kim, Pan Jun & Lee, Jae Yun (2012). A study on the reclassification of author keywords for automatic assignment of descriptors. Journal of the Korean Society for Information Management, 29(2), 225-246. https://doi.org/10.3743/KOSIM.2012.29.2.225
- Kim, Pan Jun & Lee, Jae Yun (2014). An experimental study on the performance improvement of automatic classification for the articles of Korean journals based on controlled keywords in international database. Journal of the Korean Society for Library and Information Science, 48(3), 491-510. https://doi.org/10.4275/KSLIS.2014.48.3.491
- Kim, Pan Jun (2006). A study on automatic assignment of descriptors using machine learning. Journal of the Korean Society for Information Management, 23(1), 279-299. https://doi.org/10.3743/KOSIM.2006.23.1.279
- Kim, Pan Jun (2016). An analytical study on performance factors of automatic classification based on machine learning. Journal of the Korean Society for Information Management, 33(2), 33-59. https://doi.org/10.3743/KOSIM.2016.33.2.033
- Kim, Pan Jun (2018). An analytical study on automatic classification of domestic journal articles based on machine learning. Journal of the Korean Society for Information Management, 35(2), 37-62. https://doi.org/10.3743/KOSIM.2018.35.2.037
- Kim, Pan Jun (2019). An analytical study on automatic classification of domestic journal articles using random forest. Journal of the Korean Society for Information Management, 36(2), 57-77. https://doi.org/10.3743/KOSIM.2019.36.2.057
- Kim, Pan Jun (2021a). A study on the characteristics by keyword types in the intellectual structure analysis based on co-word analysis: focusing on overseas open access field. Journal of the Korean Society for Library and Information Science, 55(3), 103-129. https://doi.org/10.4275/KSLIS.2021.55.3.103
- Kim, Pan Jun (2021b). A study on the intellectual structure analysis by keyword type based on profiling: focusing on overseas open access field. Journal of the Korean Society for Library and Information Science, 55(4), 115-140. https://doi.org/10.4275/KSLIS.2021.55.4.115
- Kim, Seon-Wu, Ko, Gun-Woo, Choi, Won-Jun, Jeong, Hee-Seok, Yoon, Hwa-Mook, & Choi, Sung-Pil (2018). Semi-automatic construction of learning set and integration of automatic classification for academic literature in technical sciences. Journal of the Korean Society for Information Management, 35(4), 141-164. https://doi.org/10.3743/KOSIM.2018.35.4.141
- Lee, Jae Yun (2005). An empirical study on improving the performance of text categorization considering the relationships between feature selection criteria and weighting methods. Journal of the Korean Society for Library and Information Science, 39(2), 123-146. https://doi.org/10.4275/KSLIS.2005.39.2.123
- National Research Foundation of Korea (2016). Academic Research Classification Scheme. Available: https://www.nrf.re.kr/biz/doc/class/view?menu_no=323
- Yuk, Jee Hee & Song, Min (2018). A study of research on methods of automated biomedical document classification using topic modeling and deep learning. Journal of the Korean Society for Information Management, 35(2), 63-88. https://doi.org/10.3743/KOSIM.2018.35.2.063
- Abiodun, E. O., Alabdulatif, A., Abiodun, O. I., Alawida, M., Alabdulatif, A., & Alkhawaldeh, R. S. (2021). A systematic review of emerging feature selection optimization methods for optimal text classification: the present state and prospective opportunities. Neural Computing and Applications, 33(4), 1-28. https://doi.org/10.1007/s00521-021-06406-8
- Cai, J., Luo, J., Wang, S., & Yang, S. (2018). Feature selection in machine learning: a new perspective. Neurocomputing, 300, 70-79. https://doi.org/10.1016/j.neucom.2017.11.077
- Chandrashekar, G. & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28. https://doi.org/10.1016/j.compeleceng.2013.11.024
- Chang, F., Guo, J., Xu, W., & Yao, K. (2015). A feature selection method to handle imbalanced data in text classification. Journal of Digital Information Management, 13(3), 169-175. Available: https://www.dline.info/fpaper/jdim/v13i3/v13i3_6.pdf
- Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: a review. Multimedia Tools and Applications, 78, 3797-3816. https://doi.org/10.1007/s11042-018-6083-5
- Drotar, P., Gazda, J., & Smekal, Z. (2015). An experimental comparison of feature selection methods on two-class biomedical datasets. Computers in Biology and Medicine, 66, 1-10. https://doi.org/10.1016/j.compbiomed.2015.08.010
- Drotar, P., Gazda, M., & Vokorokos, L. (2019). Ensemble feature selection using election methods and ranker clustering. Information Sciences, 480, 365-380. https://doi.org/10.1016/j.ins.2018.12.033
- Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289-1305. Available: https://www.jmlr.org/papers/volume3/forman03a/forman03a_full.pdf
- Fragoudis, D., Meretakis, D., & Likothanassis, S. (2005). Best terms: an efficient feature-selection algorithm for text categorization. Knowledge and Information Systems, 8(1), 16-33. https://doi.org/10.1007/s10115-004-0177-2
- Gunal, S. (2012). Hybrid feature selection for text classification. Turkish Journal of Electrical Engineering and Computer Science, 20(Sup.2), 1296-1311. Available: https://dergipark.org.tr/en/pub/tbtkelektrik/issue/12058/144170
- Gutkin, M., Shamir, R., & Dror, G. (2009). SlimPLS: a method for feature selection in gene expression-based disease classification. PloS One, 4(7), e6416. https://doi.org/10.1371/journal.pone.0006416
- Guyon, I. & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157-1182. Available: https://dl.acm.org/doi/pdf/10.5555/944919.944968
- Harish, B. & Revanasiddappa, M. (2017). A comprehensive survey on various feature selection methods to categorize text documents. International Journal of Computer Applications, 164, 1-7. https://doi.org/10.5120/ijca2017913711
- Iqbal, M., Abid, M. M., Khalid, M. N., & Manzoor, A. (2020). Review of feature selection methods for text classification. International Journal of Advanced Computer Research, 10(49), 138-152. https://doi.org/10.19101/IJACR.2020.1048037
- Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), 143-151. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.6977&rep=rep1&type=pdf
- Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. USA: Kluwer Academic Publishers.
- Kashef, S., Nezamabadi-pour, H., & Nikpour, B. (2018). Multi-label feature selection: a comprehensive review and guiding experiments. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2), e1240. https://doi.org/10.1002/widm.1240
- Kragelj, M. & Kljajic Borstnar, M. (2021). Automatic classification of older electronic texts into the Universal Decimal Classification-UDC. Journal of Documentation, 77(3), 755-776. https://doi.org/10.1108/JD-06-2020-0092
- Kumar, V. & Minz, S. (2014). Feature selection: a literature review. Smart Computing Review, 4(3), 211-229. Available: https://faculty.cc.gatech.edu/~hic/CS7616/Papers/Kumar-Minz-2014.pdf
- Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. NY, USA: Cambridge University Press.
- Mengle, S. S. R. & Goharian, N. (2009). Ambiguity measure feature-selection algorithm. Journal of the American Society for Information Science & Technology, 60(5), 1037-1050. https://doi.org/10.1002/asi.21023
- Mironczuk, M. & Protasiewicz, J. (2018). A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications, 106, 36-54. https://doi.org/10.1016/j.eswa.2018.03.058
- Pereira, R. B., Plastino, A., Zadrozny, B., & Merschmann, L. H. (2018). Correlation analysis of performance measures for multi-label classification. Information Processing & Management, 54(3), 359-369. https://doi.org/10.1016/j.ipm.2018.01.002
- Pinheiro, R. H. W., Cavalcanti, G. D. C., & Ren, T. I. (2015). Data-driven global-ranking local feature selection methods for text categorization. Expert Systems with Applications, 42(4), 1941-1949. https://doi.org/10.1016/j.eswa.2014.10.011
- Pintas, J. T., Fernandes, L. A. F., & Garcia, A. C. B. (2021). Feature selection methods for text classification: a systematic literature review. Artificial Intelligence Review, 54, 6149-6200. https://doi.org/10.1007/s10462-021-09970-6
- Rehman, A., Javed, K., Babri, H. A., & Asim, N. (2018). Selection of the most relevant terms based on a max-min ratio metric for text classification. Expert Systems with Applications, 114, 78-96. https://doi.org/10.1016/j.eswa.2018.07.028
- Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523. https://doi.org/10.1016/0306-4573(88)90021-0
- Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1-47. https://doi.org/10.1145/505282.505283
- Talavera, L. (2005). An evaluation of filter and wrapper methods for feature selection in categorical clustering. In International Symposium on Intelligent Data Analysis. Springer, Berlin, Heidelberg, 440-451. https://doi.org/10.1007/11552253_40
- Uysal, A. K. (2016). An improved global feature selection scheme for text classification. Expert Systems with Applications, 43(1), 82-92. https://doi.org/10.1016/j.eswa.2015.08.050
- Venkatesh, B. & Anuradha, J. (2019). A review of feature selection and its methods. Cybernetics and Information Technologies, 19(1), 3-26. https://doi.org/10.2478/cait-2019-0001
- Wang, D., Zhang, H., Liu, R., Liu, X., & Wang, J. (2016). Unsupervised feature selection through Gram-Schmidt orthogonalization: a word co-occurrence perspective. Neurocomputing, 173(P3), 845-854. https://doi.org/10.1016/j.neucom.2015.08.038
- Wang, D., Zhang, H., Liu, R., Lv, W., & Wang, D. (2014). t-test feature selection approach based on term frequency for text categorization. Pattern Recognition Letters, 45, 1-10. https://doi.org/10.1016/j.patrec.2014.02.013
- Wu, Y. & Zhang, A. (2004). Feature selection for classifying high-dimensional numerical data. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), 2, 251-258. https://doi.org/10.1109/CVPR.2004.1315171
- Yang, Y. & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), 412-420. Available: http://nyc.lti.cs.cmu.edu/yiming/Publications/yang-icml97.pdf