Combining Support Vector Machine Recursive Feature Elimination and Intensity-dependent Normalization for Gene Selection in RNAseq

An Intensity-dependent Normalization-based Support Vector Machine Combination Method for Gene Selection in RNAseq Big Data

  • Received : 2017.06.02
  • Accepted : 2017.08.03
  • Published : 2017.10.31

Abstract

Over the past few years, high-throughput sequencing, big-data generation, cloud computing, and computational biology have advanced dramatically. RNA sequencing is emerging as an attractive alternative to DNA microarrays, yet methods for constructing gene regulatory networks (GRNs) from RNA-Seq data are scarce and urgently needed. Because GRNs have attracted substantial attention in genomics and bioinformatics, a basic requirement for building a GRN is to maximize the number of distinguishable genes. Although RNA sequencing generates large amounts of data, few computational methods can exploit data at this scale. We therefore propose a novel gene selection algorithm that combines support vector machines with intensity-dependent normalization, using the log differential expression ratio in RNAseq data. It extends the support vector machine recursive feature elimination (SVM-RFE) algorithm. The algorithm achieves minimum relevancy on subsets of big data such as NCBI-GEO. We compared the proposed algorithm with an existing one that uses gene expression profiles from DNA microarrays. The proposed algorithm proved more convenient and faster than the previous one because it relies entirely on functions from R packages, and it improved both classification accuracy, assessed using gene ontology, and running time on big data. The comparison was based on the number of genes selected from RNAseq big data.
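The abstract does not spell out the normalization step, so the following is only a minimal sketch of what an intensity-dependent normalization of log expression ratios can look like. It computes the usual M (log2 ratio) and A (mean log2 intensity) values for two conditions and then removes the intensity-dependent trend in M. Intensity-dependent normalization is commonly done with a loess fit of M on A; here a binned median is swapped in as a simpler stand-in. The function names, the pseudocount, and the bin count are assumptions made for illustration and are not taken from the paper, which reports using R package functions.

```python
import math
from statistics import median


def ma_values(treated, control, pseudocount=1.0):
    """Return (M, A) lists for paired expression values.

    M is the log2 expression ratio and A is the mean log2 intensity.
    The pseudocount (an assumption) avoids taking log2 of zero counts.
    """
    m_vals, a_vals = [], []
    for t, c in zip(treated, control):
        log_t = math.log2(t + pseudocount)
        log_c = math.log2(c + pseudocount)
        m_vals.append(log_t - log_c)
        a_vals.append(0.5 * (log_t + log_c))
    return m_vals, a_vals


def intensity_dependent_normalize(m_vals, a_vals, n_bins=4):
    """Subtract the median M of each intensity (A) bin from the genes in it.

    This binned median is a crude substitute for a loess curve: genes are
    sorted by A, split into n_bins groups of roughly equal size, and each
    group's M values are centred on zero.
    """
    if not m_vals:
        return []
    order = sorted(range(len(a_vals)), key=lambda i: a_vals[i])
    bin_size = math.ceil(len(order) / n_bins)
    normalized = [0.0] * len(m_vals)
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        center = median(m_vals[i] for i in idx)
        for i in idx:
            normalized[i] = m_vals[i] - center
    return normalized
```

A typical call would be `m, a = ma_values(treated_counts, control_counts)` followed by `intensity_dependent_normalize(m, a)`. The normalized M values could then serve as input features for a gene selection step.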

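As high-throughput sequencing, big data, and cloud computing have undergone revolutionary change, RNA sequencing has also changed dramatically, and RNAseq is replacing conventional DNA microarrays and generating big data. Research using RNAseq now extends to gene regulatory networks (GRNs). One such area is distinguishing feature genes, the basic elements of a GRN, in big data and finding new roles for them beyond those already known. However, computational methods that can process big data in line with this research direction are still very scarce. In this paper, to process RNAseq big data, we therefore combine the existing SVM-RFE algorithm with intensity-dependent normalization, apply the improved algorithm to a subset of the public data available from big-data sources such as NCBI-GEO, and evaluate the performance of the results it produces.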
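For readers unfamiliar with the baseline, the sketch below illustrates plain SVM-RFE only. It does not show the combined algorithm, whose details the abstract does not give. SVM-RFE repeatedly trains a linear SVM, ranks each remaining feature (gene) by its squared weight, and removes the lowest-ranked one. To keep the example self-contained, the SVM is fitted by batch subgradient descent on the regularized hinge loss. That training routine, the function names, and the hyperparameters are assumptions chosen for illustration.

```python
def train_linear_svm(X, y, features, lam=0.01, epochs=200, lr=0.1):
    """Fit linear SVM weights on the given feature indices.

    Uses batch subgradient descent on the L2-regularized hinge loss.
    Labels in y must be +1 or -1. Returns a dict: feature index -> weight.
    """
    w = {j: 0.0 for j in features}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = {j: lam * w[j] for j in features}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[j] * xi[j] for j in features) + b)
            if margin < 1.0:  # sample violates the margin
                for j in features:
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        for j in features:
            w[j] -= lr * grad_w[j]
        b -= lr * grad_b
    return w


def svm_rfe(X, y, n_keep=1):
    """Recursive feature elimination with a linear SVM.

    Returns (kept feature indices, eliminated indices in removal order).
    One feature, the one with the smallest squared weight, is dropped
    per round until n_keep features remain.
    """
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        w = train_linear_svm(X, y, remaining)
        worst = min(remaining, key=lambda j: w[j] ** 2)
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated


# Toy data: feature 0 separates the classes strongly, feature 1 only
# weakly, and feature 2 is constant and therefore uninformative.
X = [[2.0, 0.2, 0.0], [1.5, 0.1, 0.0], [-2.0, -0.1, 0.0], [-1.5, -0.2, 0.0]]
y = [1, 1, -1, -1]
kept, eliminated = svm_rfe(X, y, n_keep=1)
```

Removing one feature per round is the simplest variant. With tens of thousands of genes, implementations usually drop a fraction of the features per round to reduce the number of SVM fits.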
