Rule Discovery for Cancer Classification using Genetic Programming based on Arithmetic Operators

산술 연산자 기반 유전자 프로그래밍을 이용한 암 분류 규칙 발견

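  • Jin-Hyuk Hong (Department of Computer Science, Yonsei University) ;
  • Sung-Bae Cho (Division of Computer and Industrial Engineering, Yonsei University)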
  • Published : 2004.08.01

Abstract

Bioinformatics has recently attracted great interest as a new approach to cancer diagnosis. Machine learning techniques have produced valuable results, but medicine requires not only highly accurate classifiers but also classifiers that can be analyzed and interpreted effectively. Since gene expression data in bioinformatics consist of tens of thousands of features, it is nearly impossible to represent the relations among them directly. In this paper, we propose a method that combines feature selection with genetic programming. Rank-based feature selection is used to select useful features, and genetic programming based on arithmetic operators is used to generate classification rules from the selected features. In experiments on a lymphoma dataset, the proposed method obtained 96.6% test accuracy and produced useful classification rules, which supports its validity.

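Recently, bioinformatics has attracted attention as a new approach to cancer diagnosis. Various machine learning techniques have been applied with good results, but the medical field requires not only highly accurate classifiers but also classification rules that people can analyze and understand. Gene expression data, widely used in bioinformatics, contain thousands to tens of thousands of variables, and it is very difficult to directly represent and understand the complex relationships among them. To overcome this difficulty, this paper proposes a method that extracts features useful for classification from gene expression data and generates cancer classification rules with genetic programming based on arithmetic operators. Experiments on lymphoma gene expression data achieved a recognition rate of 96.6%, and analysis of the obtained classification rules revealed a variety of knowledge.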
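The two stages named in the abstract, rank-based feature selection followed by rules built from arithmetic operators, can be illustrated with a minimal Python sketch. The abstract does not state the ranking measure, the rule representation, or the fitness function, so everything here is an assumption made for illustration: features are ranked by absolute Pearson correlation with the class label, a rule is a nested tuple `(op, left, right)` over feature indices and constants, a sample is assigned class 1 when the rule evaluates to a positive value, and the toy data are invented. The evolutionary search itself (selection, crossover, mutation) is omitted; only rule evaluation and scoring are shown.

```python
import operator


def protected_div(a, b):
    # Assumed convention: return 1.0 when the divisor is (near) zero.
    return a / b if abs(b) > 1e-9 else 1.0


# Arithmetic function set available to rule trees.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": protected_div}


def rank_features(X, y, k):
    """Return indices of the k features with the largest |Pearson correlation| to y."""
    n = len(X)
    mean_y = sum(y) / n
    var_y = sum((b - mean_y) ** 2 for b in y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean_x = sum(col) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(col, y))
        var_x = sum((a - mean_x) ** 2 for a in col)
        denom = (var_x * var_y) ** 0.5
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def evaluate(tree, sample):
    """Evaluate a rule tree: int = feature index, float = constant, tuple = (op, left, right)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, sample), evaluate(right, sample))
    if isinstance(tree, int):
        return sample[tree]
    return tree


def classify(tree, sample):
    # Assumed decision rule: positive output means class 1, otherwise class 0.
    return 1 if evaluate(tree, sample) > 0 else 0


def accuracy(tree, X, y):
    """Fraction of samples classified correctly; usable as a simple fitness score."""
    return sum(classify(tree, s) == t for s, t in zip(X, y)) / len(y)


# Invented toy data: 4 samples, 3 features, binary labels.
X = [[2.0, 0.5, 1.0], [1.8, 0.4, 0.2], [0.3, 0.6, 0.9], [0.1, 0.5, 0.3]]
y = [1, 1, 0, 0]

selected = rank_features(X, y, 2)
rule = ("-", 0, ("*", 2.0, 1))  # reads as: feature0 - 2.0 * feature1 > 0
print(selected, accuracy(rule, X, y))
```

A rule in this form stays readable as an ordinary arithmetic expression over a handful of selected genes, which is the kind of interpretability the abstract argues medical applications need.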
