An Analytical Study on Automatic Classification of Domestic Journal Articles Using Random Forest

  • Received : 2019.05.15
  • Accepted : 2019.06.21
  • Published : 2019.06.30

Abstract

Random Forest (RF), a representative ensemble technique, was applied to the automatic classification of journal articles in the field of library and information science. In particular, various experiments were performed on the main factors that affect classification performance when subject categories are automatically assigned to domestic journal articles: the number of trees, feature selection, and the size of the learning set. Through these experiments, ways were explored to optimize RF performance on the imbalanced datasets found in real-world environments. The results indicate that, for the automatic classification of domestic journal articles, RF can be expected to achieve its best classification performance when using a tree count in the interval 100~1000 (C), a small feature set (10%) selected by the chi-square statistic (CHI), and most of the learning set (9~10 years).

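To make the experimental setup concrete, the following is a minimal sketch, assuming scikit-learn, of the kind of pipeline the abstract describes: TF-IDF term features, chi-square (CHI) selection of the top 10% of features, and Random Forest classifiers with tree counts spanning the 100~1000 interval, scored with macro and micro F1 (mac_F1, mic_F1). The 20 Newsgroups corpus stands in for the domestic journal-article collection, which is not publicly available, and the preprocessing choices are illustrative assumptions rather than the author's exact method.

# Hedged sketch of the experimental design: CHI feature selection (top 10%)
# plus Random Forest with a varying number of trees. Corpus and preprocessing
# are stand-ins, not the paper's actual data or settings.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vectorizer = TfidfVectorizer(max_df=0.5, min_df=2)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# CHI feature selection: keep the top 10% of terms by chi-square statistic,
# fitted on the training split only (TF-IDF values are non-negative, as chi2 requires).
selector = SelectPercentile(chi2, percentile=10).fit(X_train, train.target)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Vary the number of trees across the interval the paper reports as optimal (100~1000).
for n_trees in (100, 300, 500, 1000):
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1, random_state=0)
    rf.fit(X_train_sel, train.target)
    pred = rf.predict(X_test_sel)
    mac = f1_score(test.target, pred, average="macro")  # mac_F1
    mic = f1_score(test.target, pred, average="micro")  # mic_F1
    print(f"trees={n_trees}: mac_F1={mac:.3f}  mic_F1={mic:.3f}")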
<Figure 1> Variables and evaluation methods at each experimental stage

<Figure 2> Random Forest (RF) performance by tree-count interval: processing time (unit: ms)

<Figure 3> Random Forest (RF) classification performance by learning set size: single-category, mac_F1

<Figure 4> Random Forest (RF) classification performance by learning set size: single-category, mic_F1

<Figure 5> Random Forest (RF) classification performance by learning set size: multi-category, mac_F1

<Figure 6> Random Forest (RF) classification performance by learning set size: multi-category, mic_F1

<Table 1> Random Forest (RF) performance by tree-count interval: mac_F1, mic_F1

<Table 2> Random Forest (RF) classification performance with feature selection: single-category, mac_F1

<Table 3> Random Forest (RF) classification performance with feature selection: single-category, mic_F1

<Table 4> Random Forest (RF) classification performance with feature selection: multi-category, mac_F1

<Table 5> Random Forest (RF) classification performance with feature selection: multi-category, mic_F1
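For reference, the mac_F1 and mic_F1 scores reported in the figures and tables above are the standard macro- and micro-averaged F1 measures. A brief restatement, with TP_i, FP_i, and FN_i denoting the true positives, false positives, and false negatives for category i out of |C| categories:

\[ P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}, \qquad \mathrm{mac\_F1} = \frac{1}{|C|}\sum_{i=1}^{|C|} \frac{2\,P_i R_i}{P_i + R_i} \]

\[ P_\mu = \frac{\sum_{i} TP_i}{\sum_{i}\,(TP_i + FP_i)}, \qquad R_\mu = \frac{\sum_{i} TP_i}{\sum_{i}\,(TP_i + FN_i)}, \qquad \mathrm{mic\_F1} = \frac{2\,P_\mu R_\mu}{P_\mu + R_\mu} \]

Macro-averaging weights every category equally, so it is sensitive to performance on rare categories; micro-averaging pools all classification decisions and is dominated by frequent categories. This is why the two measures can diverge on the imbalanced datasets discussed in the abstract.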
