Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification

Yoo, Sung Lim

doi:10.9718/JBER.2022.43.2.109

Journal of Biomedical Engineering Research (대한의용생체공학회:의공학회지)

Volume 43, Issue 2, Pages 109-115, 2022
pISSN 1229-0807, eISSN 2288-9396

The Korean Society of Medical and Biological Engineering (대한의용생체공학회)


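Korean title: 의무 기록 문서 분류를 위한 자연어 처리에서 최적의 벡터화 방법에 대한 비교 분석 (Comparative analysis of optimal vectorization methods in natural language processing for medical record document classification)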

Yoo, Sung Lim (Department of Medical Device Management and Research, SAIHST, Sungkyunkwan University)

유성림 (성균관대학교 SAIHST 의료기기산업학과)

Received : 2022.02.22
Accepted : 2022.04.15
Published : 2022.04.30

https://doi.org/10.9718/JBER.2022.43.2.109

Abstract

Purpose: Classification of medical records using vectorization techniques plays an important role in natural language processing. The purpose of this study was to identify suitable vectorization techniques for classifying electronic medical records.

Materials and methods: A total of 403 electronic medical documents were extracted retrospectively and classified using cosine similarity, calculated with Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Document vectors were produced by three vectorization techniques: TF-IDF, latent semantic analysis (LSA), and Word2Vec. The classification precision of each technique was evaluated, and the Kruskal-Wallis test was used to determine whether the three techniques differed significantly.

Results: The 403 medical documents covered 41 different diseases, with an average of 9.83 documents per diagnosis (standard deviation = 3.46). The classification precisions were 0.78 (TF-IDF), 0.87 (LSA), and 0.79 (Word2Vec). The difference among the three vectorization techniques was statistically significant.

Conclusions: The results suggest that, for medical document classification, removing irrelevant information (LSA) is a more effective vectorization approach than modifying the weights of vectorization models (TF-IDF, Word2Vec).
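The classification step described in the abstract, comparing document vectors by cosine similarity, can be illustrated with a minimal sketch. This is not the study's code: the study used Scikit-learn, whereas the sketch below is dependency-free Python covering only the TF-IDF case. The toy documents, the labels, the helper names (`tfidf_vectors`, `cosine`, `classify`), and the nearest-neighbour labelling rule are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(query_index, vectors, labels):
    """Assign the label of the most similar *other* document (assumed rule)."""
    candidates = [i for i in range(len(vectors)) if i != query_index]
    best = max(candidates, key=lambda i: cosine(vectors[query_index], vectors[i]))
    return labels[best]


# Hypothetical toy corpus; the real study used 403 documents and 41 diagnoses.
docs = [
    "chest pain shortness of breath myocardial infarction".split(),
    "acute myocardial infarction chest pain troponin elevated".split(),
    "wheezing cough asthma inhaler prescribed".split(),
    "asthma exacerbation wheezing nebulizer".split(),
]
labels = ["MI", "MI", "asthma", "asthma"]

vectors = tfidf_vectors(docs)
predicted = [classify(i, vectors, labels) for i in range(len(docs))]
# Fraction of documents whose predicted label matches the true label.
fraction_correct = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print(predicted, fraction_correct)
```

Under the same assumptions, an LSA variant would insert a truncated singular value decomposition between the TF-IDF step and the similarity step, and a Word2Vec variant would replace the TF-IDF vectors with (for example) averaged word embeddings; the cosine comparison itself would stay the same.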


