Extraction Method of Significant Clinical Tests Based on Data Discretization and Rough Set Approximation Techniques: Application to Differential Diagnosis of Cholecystitis and Cholelithiasis Diseases

  • Son, Chang-Sik (Dept. of Medical Informatics, School of Medicine, Keimyung Univ.) ;
  • Kim, Min-Soo (Biomedical Information Technology Center, School of Medicine, Keimyung Univ.) ;
  • Seo, Suk-Tae (Biomedical Information Technology Center, School of Medicine, Keimyung Univ.) ;
  • Cho, Yun-Kyeong (Dept. of Internal Medicine, School of Medicine, Keimyung Univ.) ;
  • Kim, Yoon-Nyun (Dept. of Medical Informatics, School of Medicine, Keimyung Univ.)
  • Received : 2010.12.10
  • Reviewed : 2011.03.21
  • Published : 2011.04.30


Selecting meaningful clinical tests and their reference values from high-dimensional clinical data with an imbalanced class distribution, in which one class is represented by a large number of examples and the other by only a few, is an important but difficult problem in the differential diagnosis of similar diseases. To address it, this study introduces methods that combine the discernibility matrix and discernibility function of rough set theory (RST) with two discretization approaches, equal-width and equal-frequency discretization. The discretization approaches define the reference values for the clinical tests, and the discernibility matrix and function extract a subset of significant clinical tests from the resulting nominal attribute values. To show its applicability to differential diagnosis, we applied the method to extract the significant clinical tests and their reference values that separate a normal group (N = 351) from an abnormal group (N = 101) with either cholecystitis or cholelithiasis. We also examined the selected significant clinical tests, the variation in their reference values, and the average predictive performance on four evaluation criteria (accuracy, sensitivity, specificity, and geometric mean) over 10-fold cross-validation. The experimental results confirmed that, for both discretization approaches, the rough set approximation methods using relative frequency gave better results than those using absolute frequency in terms of average geometric mean. This suggests that a prediction model using relative frequency can be used effectively for classification and prediction on clinical data with an imbalanced class distribution.
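As an illustration of two ingredients named in the abstract, the following Python sketch computes cut points for equal-width and equal-frequency discretization and the four evaluation criteria. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the bin count `k`, and the exact cut-point rules are hypothetical choices, and the rough set step (discernibility matrix and function) is not covered.

```python
import math
from bisect import bisect_right


def equal_width_cuts(values, k):
    """Return k-1 cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_cuts(values, k):
    """Return k-1 cut points so that each bin holds roughly len(values)/k samples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def discretize(value, cuts):
    """Map a numeric value to a bin index in 0..len(cuts) (assumed labeling)."""
    return bisect_right(cuts, value)


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, specificity, and their geometric mean."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "gmean": math.sqrt(sensitivity * specificity),
    }


if __name__ == "__main__":
    # Made-up test values with one outlier, to contrast the two cut-point rules.
    values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    print(equal_width_cuts(values, 2))
    print(equal_frequency_cuts(values, 2))

    # A classifier that always predicts the majority class (label 0).
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [0] * 10
    print(evaluate(y_true, y_pred))
```

In the made-up example, the always-majority classifier reaches high accuracy while its geometric mean is zero, which is one reason the geometric mean is a useful criterion on imbalanced data.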


