Review of statistical methods for survival analysis using genomic data

  • Lee, Seungyeoun (Department of Mathematics and Statistics, Sejong University) ;
  • Lim, Heeju (Department of Statistics, University of Connecticut)
  • Received : 2019.11.19
  • Accepted : 2019.11.24
  • Published : 2019.12.31

Abstract

Survival analysis mainly deals with the time to an event, such as death, onset of disease, or bankruptcy. The defining characteristic of survival data is that they contain "censored" observations, for which the time to event is not completely observed; the recorded time is instead a lower bound on the true time to event. For each subject, only the time to event or the censoring time is observed, whichever occurs first. Many traditional statistical methods have been used effectively to analyze survival data with censored observations. However, with the development of high-throughput technologies for producing "omics" data, more advanced statistical methods, such as regularization, are required to construct predictive survival models from high-dimensional genomic data. Furthermore, machine learning approaches have been adapted for survival analysis to fit nonlinear and complex interaction effects between predictors and to achieve more accurate prediction of individual survival probability. Because most clinicians and medical researchers can now easily access statistical programs for analyzing survival data, a review article is helpful for understanding the statistical methods used in survival analysis. We review traditional survival methods and regularization methods, with various penalty functions, for the analysis of high-dimensional genomic data, and describe machine learning techniques that have been adapted to survival analysis.
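
To make the notion of censoring concrete, the following minimal sketch represents right-censored data as pairs of an observed time and an event indicator, and computes a product-limit (Kaplan-Meier) survival estimate, one of the traditional methods for such data. The function name `kaplan_meier` and the toy data are assumptions made purely for illustration; they are not taken from the article.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Product-limit estimate of S(t) at each distinct observed event time.

    times[i]  : observed time for subject i (event time or censoring time)
    events[i] : True if the event was observed, False if the subject was censored
    """
    # Only times at which an event was actually observed change the estimate.
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Events observed exactly at time t.
        n_events = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: five subjects, two of them censored (event = False).
times = [2.0, 3.0, 3.0, 5.0, 8.0]
events = [True, False, True, True, False]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t:.1f}, estimated S(t) = {s:.3f}")
```

Censored subjects contribute to the number at risk up to their censoring time but never count as events, which is how the lower-bound information they carry enters the estimate.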
