A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning

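  • 육지희 (Department of Library and Information Science, Graduate School, Yonsei University)
  • 송민 (Department of Library and Information Science, Yonsei University)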
  • Received : 2018.05.20
  • Accepted : 2018.06.19
  • Published : 2018.06.30

Abstract

This study selected features using an LDA topic model and Doc2Vec, a word-embedding method based on deep learning, and evaluated how classification performance differs with the size and type of the feature corpus and with the classification algorithm. To identify the feature corpus that yields high classification performance, the experiments varied the size of the feature corpus and composed it differently according to the location of the text within the document. In the deep learning experiments, classification performance was also compared according to the number of training iterations and the presence or absence of context inference information. For the experiments, the study constructed Disease-35083, a biomedical document dataset consisting of biomedical scholarly documents provided by PMC and categorized by disease category. The study identifies the type and size of feature corpus that produce the highest performance and suggests feature corpora that, being efficient in training time, could be extended as features. It also compares deep learning with existing methods and suggests an appropriate method for each classification environment.
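The idea of composing a feature corpus by document location and by size can be illustrated with a minimal sketch. It is not the authors' pipeline: raw term frequency stands in for the LDA-based feature selection, the two toy documents are invented rather than taken from Disease-35083, and the names `build_feature_corpus`, `vectorize`, and `STOPWORDS` are hypothetical.

```python
from collections import Counter

# Hypothetical toy documents split by location; not taken from Disease-35083.
DOCS = [
    {"title": "insulin therapy in diabetes",
     "abstract": "insulin improves glucose control in diabetes",
     "label": "endocrine"},
    {"title": "tumor growth in lung cancer",
     "abstract": "tumor cells spread in lung tissue",
     "label": "neoplasm"},
]

# Assumption: a tiny stop-word list, just enough for the toy data.
STOPWORDS = {"in"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def build_feature_corpus(docs, locations, size):
    """Return the `size` most frequent terms found in the given locations.

    Assumption: frequency ranking is a stand-in for LDA-based selection.
    """
    counts = Counter()
    for doc in docs:
        for loc in locations:
            counts.update(tokenize(doc[loc]))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:size]]


def vectorize(doc, locations, features):
    """Represent one document as term counts over the chosen features."""
    counts = Counter()
    for loc in locations:
        counts.update(tokenize(doc[loc]))
    return [counts[term] for term in features]


# Vary both the locations used and the feature corpus size.
for locations, size in [(("title",), 3), (("title", "abstract"), 4)]:
    features = build_feature_corpus(DOCS, locations, size)
    vectors = [vectorize(doc, locations, features) for doc in DOCS]
    print(locations, size, features, vectors)
```

Each configuration produces a different document representation; in an experiment of the kind the abstract describes, these vectors would then be passed to a classifier and the resulting performance compared across configurations.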
