The MeSH-Term Query Expansion Models using LDA Topic Models in Health Information Retrieval

Query Expansion Using MeSH-Based LDA Topic Models

  • You, Sukjin (Information studies, University of Wisconsin-Milwaukee)
  • Received : 2021.02.18
  • Accepted : 2021.03.20
  • Published : 2021.03.30

Abstract

Information retrieval in the health field poses several challenges. Health terminology is difficult for consumers (laypeople) to understand, and formulating a query with professional terms is hard for them because such terms are more familiar to health professionals. Automatically adding health terms related to a query would help consumers find relevant information. The proposed query expansion (QE) models expand a query using MeSH (Medical Subject Headings) terms. Documents were represented by the MeSH terms found in their full text (i.e., Bag-of-MeSH), and these representations were used to generate LDA (Latent Dirichlet Allocation) topic models. A query and its top k retrieved documents were used to find MeSH terms, as topic words, related to the query. LDA topic words were filtered by threshold values of topic probability (TP) and word probability (WP). In an LDA model with a specific number of topics, these thresholds were effective in increasing IR performance in terms of infAP (inferred Average Precision) and infNDCG (inferred Normalized Discounted Cumulative Gain), which are common IR metrics for large data collections with incomplete judgments. The top k words were chosen by a word score based on TP × WP and on retrieved document ranking, in an LDA model with specific thresholds. The QE model with specific thresholds for TP and WP showed improved mean infAP and infNDCG scores in an LDA model, compared with the baseline result.

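One of the difficulties of information retrieval in the health field is that general users find professional terminology hard to understand. Because general users have difficulty using health-related professional terms as queries, retrieval would be more effective if such terms could be added to the query automatically. The proposed query expansion model uses MeSH (Medical Subject Headings), which contains professional terms, as the pool of candidate words for query expansion. Documents are represented by MeSH terms, LDA (Latent Dirichlet Allocation) topics are generated over the collection of documents so represented, and the topic words associated with the query plus the top k documents retrieved by the initial query are then used to expand the original query. Topic words consisting of MeSH terms were added to the initial query when they exceeded an arbitrarily set topic probability threshold and a threshold on the probability of the word within the topic. In an LDA model with a specific number of topics, topic words selected with appropriately set thresholds were used for query expansion and were effective in raising infAP (inferred Average Precision) and infNDCG (inferred Normalized Discounted Cumulative Gain) at retrieval time. Retrieval performance could also be improved when the top k words with the highest topic-word scores, computed by multiplying the topic probability by the topic-word probability, were used to expand the query.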
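
The term-selection step described in the abstract (filter topic words by TP and WP thresholds, then keep the top k words by a TP × WP score) can be illustrated with a minimal Python sketch. This is not the author's implementation. The function name `expand_query`, the input layout, the toy probabilities, and in particular the reciprocal-rank weighting used to combine scores across retrieved documents are all assumptions made for illustration; the abstract states only that the score is based on TP × WP and retrieved document ranking.

```python
from collections import defaultdict


def expand_query(doc_topics, topic_words, tp_min, wp_min, top_k):
    """Select MeSH expansion terms from (hypothetical) LDA output.

    doc_topics  -- list of {topic_id: TP} dicts, one per retrieved document,
                   ordered by retrieval rank (index 0 is rank 1)
    topic_words -- {topic_id: {mesh_term: WP}}
    tp_min      -- topic probability threshold
    wp_min      -- word probability threshold
    top_k       -- number of expansion terms to return
    """
    scores = defaultdict(float)
    for rank, topics in enumerate(doc_topics, start=1):
        for topic_id, tp in topics.items():
            if tp <= tp_min:  # drop topics at or below the TP threshold
                continue
            for term, wp in topic_words[topic_id].items():
                if wp <= wp_min:  # drop words at or below the WP threshold
                    continue
                # Assumption: TP * WP discounted by reciprocal document rank.
                scores[term] += tp * wp / rank
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Toy data (made up): two topics over MeSH terms, two retrieved documents.
topic_words = {
    0: {"Hypertension": 0.5, "Blood Pressure": 0.3, "Antihypertensive Agents": 0.05},
    1: {"Diabetes Mellitus": 0.6, "Insulin": 0.25},
}
doc_topics = [
    {0: 0.8, 1: 0.2},  # rank 1
    {0: 0.6, 1: 0.4},  # rank 2
]

expansion = expand_query(doc_topics, topic_words, tp_min=0.3, wp_min=0.1, top_k=3)
print(expansion)  # ['Hypertension', 'Blood Pressure', 'Diabetes Mellitus']
```

In this toy run, "Antihypertensive Agents" is excluded by the WP threshold and topic 1 contributes only through the second document, where its TP exceeds the threshold. Raising either threshold shrinks the candidate pool before the top k cut is applied.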
