Set Covering-based Feature Selection of Large-scale Omics Data

  • Ma, Zhengyu (School of Industrial Management Engineering, Korea University) ;
  • Yan, Kedong (School of Information Management Engineering, Korea University) ;
  • Kim, Kwangsoo (Bioinformatics Institute, Seoul National University) ;
  • Ryoo, Hong Seo (School of Industrial Management Engineering, Korea University)
  • Received : 2014.09.06
  • Accepted : 2014.11.04
  • Published : 2014.11.30

Abstract

In this paper, we address the feature selection problem for large-scale, high-dimensional biological data such as omics data. Most previous approaches to this problem use a simple score function to reduce the number of original variables and then select features from the small number of remaining variables. Methods that do not rely on filtering techniques either ignore the interactions between variables or generate approximate solutions to a simplified problem. In contrast, by combining set covering and clustering techniques, we developed a new method that can handle the full set of variables and consider the combinatorial effects of variables when selecting good features. To demonstrate the efficacy and effectiveness of the method, we downloaded gene expression datasets from TCGA (The Cancer Genome Atlas) and compared our method with other algorithms, including the feature selection algorithms embedded in WEKA. The experimental results show that our method selects high-quality features that yield more accurate classifiers than those produced by other feature selection algorithms.
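
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how feature selection can be posed as a set covering problem, not the authors' method. It assumes binary features and two classes, treats each pair of samples from different classes as an element to be covered, and says a feature covers a pair when the two samples differ on it. It uses the classical greedy heuristic for set covering and omits the clustering step entirely. The function name `greedy_set_cover_features` and the toy data are hypothetical.

```python
from itertools import product


def greedy_set_cover_features(X, y):
    """Greedily pick binary features so that every pair of samples from
    different classes differs on at least one selected feature."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n_features = len(X[0])

    # covers[j] = set of (positive, negative) sample pairs separated by feature j
    covers = [
        {(p, q) for p, q in product(pos, neg) if X[p][j] != X[q][j]}
        for j in range(n_features)
    ]

    uncovered = set(product(pos, neg))
    selected = []
    while uncovered:
        # Choose the feature that separates the most still-uncovered pairs.
        best = max(range(n_features), key=lambda j: len(covers[j] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            break  # the remaining pairs are identical on every feature
        selected.append(best)
        uncovered -= gain
    return selected, uncovered


# Toy data (hypothetical): 3 samples, 3 binary features; feature 2 is uninformative.
X = [
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
]
y = [1, 1, 0]

selected, uncovered = greedy_set_cover_features(X, y)
print(selected)  # [0, 1]
```

In this toy case no single feature separates both positive samples from the negative one, so the cover needs two features. This is the kind of combinatorial effect that scoring each variable independently does not capture.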
