Named entity normalization for traditional herbal formula mentions

  • Ho Jang (KM Data Division, Korea Institute of Oriental Medicine; Department of Korean Convergence Medical Science, University of Science and Technology)
  • Received : 2024.09.03
  • Accepted : 2024.10.04
  • Published : 2024.10.31

Abstract

This paper proposes methods for named entity normalization of traditional herbal formulas mentioned in medical texts. Specifically, we developed methods to determine whether two mentions, such as the full name of a herbal formula and its abbreviation, refer to the same formula concept. We tried two approaches. First, we built a supervised classification model that uses BERT-based contextual vectors and character similarity features of herbal formula mentions in medical texts to determine whether two mentions are identical. Second, we applied prompt-based querying with GPT-4o mini and GPT-4o to the same task. Both methods achieved Precision, Recall, and F1-score above 0.9, and the GPT-4o-based approach achieved the highest Precision and F1-score. These results indicate that machine learning-based approaches are effective for named entity normalization in traditional medicine texts, and that they could serve as a foundation for developing intelligent information extraction systems in the traditional medicine domain.
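As a rough illustration of the character similarity side of the first approach, the sketch below computes two character-level features for a pair of mentions. It is a minimal sketch under stated assumptions, not the paper's implementation: `difflib.SequenceMatcher` stands in for whatever similarity measure the authors used, the BERT-based contextual vectors are left out entirely, the helper names (`normalize`, `char_similarity`, `is_initialism`, `pair_features`) are hypothetical, and the example mentions (including the abbreviation "SCRT") are illustrative.

```python
import re
from difflib import SequenceMatcher


def normalize(mention: str) -> str:
    """Lowercase a mention and keep only letters and digits."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


def char_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized mentions.

    Assumption: difflib's ratio is a stand-in, not the paper's measure.
    """
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_initialism(short: str, full: str) -> bool:
    """True if `short` equals the first letters of the hyphen- or
    space-separated parts of `full`."""
    parts = [p for p in re.split(r"[-\s]+", full) if p]
    initials = "".join(p[0] for p in parts).lower()
    return len(parts) > 1 and normalize(short) == initials


def pair_features(a: str, b: str) -> dict:
    """Character-level features for one mention pair.

    Contextual vectors are deliberately not included in this sketch.
    """
    return {
        "char_similarity": char_similarity(a, b),
        "initialism": is_initialism(a, b) or is_initialism(b, a),
    }


# Illustrative mention pairs (hypothetical inputs).
print(pair_features("So-Cheong-Ryong-Tang", "SCRT"))
print(pair_features("Oryeong-san", "oryeong san"))
```

In a supervised setup like the one described, features of this kind would be combined with contextual vectors and passed to a classifier that labels each mention pair as same or different.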

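This paper studies methods for named entity normalization of Korean medicine formula names described in medical texts. Specifically, we studied how to determine whether formula mentions recognized as named entities in a given text, such as formula names and their abbreviations, refer to the same formula concept. We tried two approaches. First, for formula mentions appearing in medical texts, we built a supervised classification model that uses BERT-based contextual vectors and the character similarity of the mentions as machine learning features to decide whether two mentions are identical. Next, we performed the same task with prompt-based querying using GPT-4o mini and GPT-4o. Both methods scored above 0.9 in Precision, Recall, and F1-score, and the GPT-4o-based method recorded the highest Precision and F1-score. These results show that machine learning-based approaches can achieve meaningful performance for named entity normalization in Korean medicine texts, and the strong Precision and F1-score of the GPT-4o-based method suggest that it can serve as an important basis for developing intelligent information extraction systems in the Korean medicine domain.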
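The prompt-based approach can be pictured as composing a yes/no question about two mentions and mapping the model's reply to a label. The sketch below only builds the prompt and parses a reply; it does not call any model. The prompt wording, the helper names `build_prompt` and `parse_answer`, and the example passage are all hypothetical, since the prompts actually used in the study are not given here.

```python
import re
from typing import Optional


def build_prompt(context: str, mention_a: str, mention_b: str) -> str:
    """Compose a yes/no question about whether two mentions name the same formula.

    Assumption: this wording is illustrative, not the study's prompt.
    """
    return (
        "You are given a passage from a medical text and two herbal formula mentions.\n"
        f"Passage: {context}\n"
        f"Mention 1: {mention_a}\n"
        f"Mention 2: {mention_b}\n"
        "Do the two mentions refer to the same herbal formula? Answer 'yes' or 'no'."
    )


def parse_answer(reply: str) -> Optional[bool]:
    """Map a free-text reply to True/False, or None if the first word is neither."""
    words = re.findall(r"[a-z]+", reply.lower())
    first = words[0] if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


# Hypothetical passage and mentions, for illustration only.
prompt = build_prompt(
    "Patients received So-Cheong-Ryong-Tang (SCRT) for four weeks.",
    "So-Cheong-Ryong-Tang",
    "SCRT",
)
print(prompt)
print(parse_answer("Yes, both refer to the same formula."))
```

Returning `None` for replies that start with neither word keeps ambiguous answers separate from real negatives, so they can be reviewed or re-queried rather than counted as mismatches.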

Acknowledgement

This work was funded by the Korea Institute of Oriental Medicine (No. KSN1824130 and No. KSN1923111).
