Feature Selection for Classification of Mass Spectrometric Proteomic Data Using Random Forest

Korean title: A Random Forest-Based Feature Selection Algorithm for Classification of Proteomic Spectrum Data

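  • 온승엽 (Department of Computer Engineering, Korea Aerospace University) ;
  • 지승도 (Department of Computer Engineering, Korea Aerospace University) ;
  • 한미영 (Korea Foundation for the Advancement of Science and Creativity)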
  • Received : 2013.08.22
  • Accepted : 2013.11.27
  • Published : 2013.12.31

Abstract

This paper proposes a novel feature selection method for mass spectrometric proteomic data based on Random Forest. The method includes an effective preprocessing step that filters out a large number of redundant, highly correlated features, and applies a tournament strategy to find an optimal feature subset. Experiments on three public datasets, Ovarian 4-3-02, Ovarian 7-8-02 and Prostate, show that the new method achieves high performance compared with widely used methods, along with a balanced rate of specificity and sensitivity.

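This paper proposes a new feature selection method for the classification analysis of proteomic data produced by mass spectrometry. The method consists of i) a preprocessing step that effectively removes redundant features with high correlation and ii) a step that searches for an optimal feature subset using a tournament strategy. The proposed method was applied to publicly available blood proteomic data used in actual cancer diagnosis, and was shown to achieve superior performance and balanced specificity and sensitivity compared with other widely used methods.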
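
The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the correlation threshold, the tournament layout, or how Random Forest scores candidates, so `threshold`, `group_size`, `winners_per_group`, `target_size`, and the `score` callable are all hypothetical. Here `score` stands in for whatever Random Forest-derived measure (for example, a feature importance) ranks the candidates in practice.

```python
from math import sqrt
from typing import Callable, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def filter_correlated(columns: Sequence[Sequence[float]],
                      threshold: float = 0.95) -> List[int]:
    """Greedy redundancy filter (assumed form of the preprocessing step).

    A feature is kept only if its absolute correlation with every
    previously kept feature is at most `threshold`.
    """
    kept: List[int] = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(j)
    return kept


def tournament_select(features: Sequence[int],
                      score: Callable[[int], float],
                      group_size: int = 4,
                      winners_per_group: int = 2,
                      target_size: int = 2) -> List[int]:
    """Tournament rounds (assumed layout): split the candidates into groups,
    keep the top scorers of each group, and repeat until at most
    `target_size` candidates remain or a round eliminates nothing."""
    candidates = list(features)
    while len(candidates) > target_size:
        survivors: List[int] = []
        for start in range(0, len(candidates), group_size):
            group = candidates[start:start + group_size]
            ranked = sorted(group, key=score, reverse=True)
            survivors.extend(ranked[:winners_per_group])
        if len(survivors) == len(candidates):
            break  # no candidate was eliminated; stop to avoid looping forever
        candidates = survivors
    return candidates
```

In use, `filter_correlated` would run first on the intensity columns of the spectra, and the surviving column indices would then be passed to `tournament_select`. Grouping keeps each comparison small, which matters when the number of m/z features far exceeds the number of samples.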

