The Stream of Uncertainty in Scientific Knowledge using Topic Modeling

  • Received : 2019.02.18
  • Accepted : 2019.03.27
  • Published : 2019.03.30

Abstract

Scientific knowledge is obtained through research. Researchers deal with scientific uncertainty and, in doing so, establish the certainty of scientific knowledge. In other words, uncertainty is an essential stage that must be passed through in order to obtain scientific knowledge. Existing work on the characteristics of uncertainty originated in linguistic studies of hedging, and in computational linguistics, corpora of uncertainty words have been constructed manually. These studies have gone no further than characterizing uncertainty in a particular research field on the basis of the simple frequency of uncertainty words. This study therefore examines how patterns of scientific knowledge, based on uncertainty words, change over time in the biomedical literature, where biomedical claims in sentences play an important role. To this end, biomedical propositions are analyzed using the semantic predications provided by UMLS, a biomedical ontology, and DMR topic modeling, which is well suited to identifying patterns in a discipline, is applied to trace trends in entity-based topics associated with uncertainty. The results confirm that, as time passes and research develops, the expression of scientific knowledge follows a pattern of decreasing uncertainty.
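
The abstract contrasts simple frequency counts with a view of uncertainty over time. As a rough illustration only, the sketch below computes, for each year, the share of sentences that contain at least one uncertainty cue word. The cue list, the `(year, sentence)` record layout, and the function names are all hypothetical; the study's actual lexicon and its UMLS-based predication extraction are not reproduced here.

```python
from collections import defaultdict

# Hypothetical cue words for illustration; not the lexicon used in the study.
CUE_WORDS = {"may", "might", "possibly", "suggest", "suggests", "likely"}


def has_uncertainty(sentence, cues=CUE_WORDS):
    """Return True if any whitespace-separated token matches a cue word."""
    tokens = {tok.strip(".,;:()").lower() for tok in sentence.split()}
    return not tokens.isdisjoint(cues)


def uncertainty_share_by_year(records):
    """records: iterable of (year, sentence) pairs.

    Returns {year: fraction of that year's sentences containing a cue word}.
    """
    totals = defaultdict(int)
    hedged = defaultdict(int)
    for year, sentence in records:
        totals[year] += 1
        if has_uncertainty(sentence):
            hedged[year] += 1
    return {year: hedged[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    sample = [
        (2000, "Aspirin may reduce risk."),
        (2000, "Aspirin reduces risk."),
        (2001, "Drug X inhibits Y."),
    ]
    print(uncertainty_share_by_year(sample))
```

A per-year share like this is only a descriptive baseline. The study itself conditions topics on metadata through DMR topic modeling, which this sketch does not attempt.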

<Figure 1> Research overview

<Figure 2> LDA model

<Figure 3> DMR model

<Figure 4> Proportion of the uncertainty-word dataset by year

<Figure 5> Distribution of the 20 topics

<Figure 6> Four patterns of topic distribution: (a) rising, (b) falling, (c) convex, (d) flat

<Table 1> Final dataset

<Table 2> Example of input to DMR topic modeling

<Table 3> Top 10 entities

<Table 4> Top 10 relation types

<Table 5> Top 10 entity semantic types

<Table 6> Top 10 semantic predications

<Table 7> Top 10 entity pairs

<Table 8> Dataset by year

<Table 9> Topic modeling results: top 5 entities for each of the 20 topics

<Table 10> Entity frequency in the topic modeling results

<Table 11> Standard deviation and rank of topics
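
The last table reports a standard deviation and a rank for each topic. How these were computed is not stated on this page. One plausible reading, sketched here purely as an assumption, is that each topic has a series of yearly proportions, and topics are ranked by how much that series varies. The function name, the input layout, and the use of the population standard deviation are all illustrative choices.

```python
from statistics import pstdev


def rank_topics_by_std(yearly_proportions):
    """yearly_proportions: {topic_name: [proportion per year, ...]}.

    Returns a list of (topic_name, std, rank) tuples, with rank 1 assigned
    to the topic whose yearly proportions vary the most.
    Assumption: population standard deviation over the yearly values.
    """
    stds = {topic: pstdev(values) for topic, values in yearly_proportions.items()}
    ordered = sorted(stds.items(), key=lambda item: item[1], reverse=True)
    return [(topic, std, rank) for rank, (topic, std) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    toy = {
        "T1": [0.10, 0.10, 0.10],  # constant share
        "T2": [0.00, 0.10, 0.20],  # steadily increasing share
        "T3": [0.10, 0.12, 0.10],  # small bump in the middle
    }
    for topic, std, rank in rank_topics_by_std(toy):
        print(rank, topic, round(std, 4))
```

Under this reading, a low standard deviation corresponds to a topic whose share barely changes across years, while a high value points to a topic whose share changes markedly in one direction or another.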
