Feature Selection with Ensemble Learning for Prostate Cancer Prediction from Gene Expression

  • Abass, Yusuf Aleshinloye (Department of Computer Science, Nile University of Nigeria)
  • Adeshina, Steve A. (Department of Computer Science, Nile University of Nigeria)
  • Received : 2021.12.05
  • Published : 2021.12.30

Abstract

Machine learning and deep learning models are increasingly used to address prediction problems in biomedical data analysis. DNA sequence prediction is a critical problem that has attracted a great deal of attention in the biomedical domain, and machine and deep learning-based models have been shown to provide more accurate results than conventional regression-based models. Predicting the gene sequences that lead to cancerous diseases, such as prostate cancer, is crucial, but identifying the most important features in a gene sequence is a challenging task. Extracting the components of a gene sequence that give insight into the types of mutation in the gene is of great importance, as it can lead to effective drug design and promote the new concept of personalised medicine. In this work, we extracted the exons from the prostate gene sequences used in the experiment. We built a Deep Neural Network (DNN) model and a Bi-directional Long Short-Term Memory (Bi-LSTM) model, using k-mer encoding for the DNA sequences and one-hot encoding for the class labels. The models were evaluated using several classification metrics. Our experimental results show that the DNN model achieves a training accuracy of 99 percent and a validation accuracy of 96 percent, while the Bi-LSTM model achieves a training accuracy of 95 percent and a validation accuracy of 91 percent.
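The abstract does not give the details of the encoding step, so the following is only a minimal sketch of what k-mer encoding of a DNA sequence and one-hot encoding of a class label can look like. The value k = 3, the stride of 1, the integer vocabulary, the function names and the class names `class_0` and `class_1` are all assumptions made for illustration; they are not taken from the paper.

```python
from itertools import product


def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1, an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer index.

    Index 0 is reserved for k-mers containing other symbols (for example N).
    """
    return {"".join(p): i + 1 for i, p in enumerate(product(alphabet, repeat=k))}


def encode(seq, vocab, k=3):
    """Turn a DNA sequence into a list of k-mer indices."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]


def one_hot(label, classes):
    """One-hot encode a class label against a fixed, ordered list of classes."""
    return [1 if c == label else 0 for c in classes]


# Hypothetical inputs: a toy sequence and placeholder class names.
vocab = build_vocab(k=3)
encoded = encode("ATGCGT", vocab, k=3)
label_vec = one_hot("class_1", ["class_0", "class_1"])

print(kmers("ATGCGT", k=3))
print(encoded)
print(label_vec)
```

Integer indices of this kind would typically be fed to an embedding layer ahead of a recurrent model such as a Bi-LSTM, whereas a dense network might instead take k-mer counts; which of these the authors used is not stated in the abstract.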

Acknowledgement

The authors thank the National Information Technology Development Agency (NITDA) and Nile University of Nigeria (NUN) for supporting Y.A. Abass's studies in Nigeria. The authors thank the anonymous reviewers for their objective remarks and their suggestions on the paper.
