Variable Selection in Frailty Models using FrailtyHL R Package: Breast Cancer Survival Data

Variable selection in frailty models using the frailtyHL statistical package: breast cancer survival data

  • Kim, Bohyeon (Department of Statistics, Pukyong National University) ;
  • Ha, Il Do (Department of Statistics, Pukyong National University) ;
  • Noh, Maengseok (Department of Statistics, Pukyong National University) ;
  • Na, Myung Hwan (Department of Statistics, Chonnam National University) ;
  • Song, Ho-Chun (Department of Nuclear Medicine, Chonnam National University Hospital) ;
  • Kim, Jahae (Department of Nuclear Medicine, Chonnam National University Hospital)
  • Received : 2015.07.27
  • Accepted : 2015.08.06
  • Published : 2015.10.31


Determining relevant variables for a regression model is important in regression analysis. Recently, variable selection methods using a penalized likelihood with various penalty functions (e.g., LASSO and SCAD) have been widely studied in simple statistical models such as linear models and generalized linear models. The advantage of these methods is that they select important variables and estimate regression coefficients simultaneously; they delete insignificant variables by estimating their coefficients as zero. We study how to select proper variables based on the penalized hierarchical likelihood (HL) in semi-parametric frailty models, allowing three penalty functions: LASSO, SCAD and HL. For the variable selection, we develop a new function in the "frailtyHL" R package. Our methods are illustrated with breast cancer survival data from the Medical Center at Chonnam National University in Korea. We compare the results from the three variable-selection methods and discuss their advantages and disadvantages.
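The way a penalty "estimates coefficients as zero" can be sketched with the LASSO and SCAD penalties in their standard forms (Tibshirani, 1996; Fan and Li, 2001). The Python below is only an illustration and is not the frailtyHL implementation: the function names are hypothetical, the SCAD constant a = 3.7 is the value conventionally suggested by Fan and Li, and the thresholding rules are the closed-form solutions for a single coefficient under an orthonormal design, not the frailty-model fit. The HL penalty is omitted.

```python
import math


def lasso_penalty(beta, lam):
    """LASSO penalty: lam * |beta|."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001); requires a > 2."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2  # constant: large effects are not penalized further


def soft_threshold(z, lam):
    """LASSO solution for one coefficient (orthonormal design assumed)."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


def scad_threshold(z, lam, a=3.7):
    """SCAD solution for one coefficient (orthonormal design assumed)."""
    b = abs(z)
    if b <= 2 * lam:
        return soft_threshold(z, lam)
    if b <= a * lam:
        return ((a - 1) * z - math.copysign(a * lam, z)) / (a - 2)
    return z  # large coefficients are left unshrunk


if __name__ == "__main__":
    lam = 1.0
    for z in (0.5, 1.5, 3.0, 5.0):
        print(z, soft_threshold(z, lam), scad_threshold(z, lam))
```

Both rules set a small unpenalized estimate to exactly zero, which is the deletion step. They differ for large estimates: LASSO keeps shrinking by the tuning parameter, while SCAD returns the estimate unchanged.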

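Selecting appropriate variables for a statistical model is very important in regression analysis. Recently, variable selection methods that use a penalized likelihood with penalty functions (e.g., LASSO and SCAD) have been widely studied in simple statistical models such as linear models and generalized linear models. The main advantage of these methods is that they select important variables and estimate the regression coefficients at the same time; they therefore delete unimportant variables by estimating their coefficients as zero. In this paper, we study how to select appropriate variables based on the penalized hierarchical likelihood (h-likelihood; HL) in semi-parametric frailty models, which are an extension of the Cox proportional hazards model. Three penalty functions, LASSO, SCAD and HL, are used for this purpose. To carry out variable selection efficiently, we developed a new function based on the "frailtyHL" R package (Ha et al., 2012). To illustrate the developed method, we use breast cancer survival data collected at the Chonnam National University Medical School hospital to compare the results of the three variable selection methods, and we discuss their relative advantages and disadvantages.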
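As a rough guide to the quantities involved, the semi-parametric frailty model and the penalized h-likelihood can be written as follows. This is a sketch in notation assumed here, following the general form used in the cited literature (Fan and Li, 2001; Ha et al., 2014): $v_i$ is the log-frailty of cluster $i$, $\lambda_0(t)$ the unspecified baseline hazard, $h$ the h-likelihood, $n$ the sample size, $\gamma$ the tuning parameter and $a > 2$ the SCAD constant. The explicit form of the HL penalty is not reproduced.

```latex
% Semi-parametric frailty model: Cox model with a cluster-level random effect v_i
\lambda_{ij}(t \mid v_i) = \lambda_0(t)\,\exp\!\left(x_{ij}^{\top}\beta + v_i\right)

% Penalized h-likelihood: h-likelihood minus a penalty on each regression coefficient
h_p = h - n \sum_{j=1}^{p} J_{\gamma}\!\left(|\beta_j|\right)

% LASSO penalty
J_{\gamma}(|\beta|) = \gamma\,|\beta|

% SCAD penalty, defined through its derivative
J'_{\gamma}(|\beta|) = \gamma\, I(|\beta| \le \gamma)
  + \frac{(a\gamma - |\beta|)_{+}}{a - 1}\, I(|\beta| > \gamma)
```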



Supported by: National Research Foundation of Korea


  1. Androulakis, E., Koukouvinos, C. and Vonta, F. (2012). Estimation and variable selection via frailty models with penalized likelihood, Statistics in Medicine, 31, 2223-2239.
  2. Breiman, L. (1996). Heuristics of instability and stabilization in model selection, The Annals of Statistics, 24, 2350-2383.
  3. Breslow, N. E. (1972). Discussion of Professor Cox's paper, Journal of the Royal Statistical Society B, 34, 216-217.
  4. Clayton, D. G. (1991). A Monte Carlo method for Bayesian inference in frailty models, Biometrics, 47, 467-480.
  5. Cox, D. R. (1972). Regression models and life tables (with Discussion), Journal of the Royal Statistical Society B, 34, 187-220.
  6. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association, 96, 1348-1360.
  7. Fan, J. and Li, R. (2002). Variable selection for Cox's proportional hazards model and frailty model, The Annals of Statistics, 30, 74-99.
  8. Ha, I. D. and Lee, Y. (2003). Estimating frailty models via Poisson hierarchical generalized linear models, Journal of Computational and Graphical Statistics, 12, 663-681.
  9. Ha, I. D. and Lee, Y. (2005). Comparison of hierarchical likelihood versus orthodox best linear unbiased predictor approaches for frailty models, Biometrika, 92, 717-723.
  10. Ha, I. D., Lee, Y. and Song, J.-K. (2001). Hierarchical likelihood approach for frailty models, Biometrika, 88, 233-243.
  11. Ha, I. D., Noh, M. and Lee, Y. (2012). frailtyHL: A package for fitting frailty models with h-likelihood, The R Journal, 4, 307-320.
  12. Ha, I. D., Pan, J., Oh, S. and Lee, Y. (2014). Variable selection in general frailty models using penalized h-likelihood, Journal of Computational and Graphical Statistics, 23, 1044-1060.
  13. Ha, I. D., Sylvester, R., Legrand, C. and MacKenzie, G. (2011). Frailty modelling for survival data from multi-centre clinical trials, Statistics in Medicine, 30, 28-37.
  14. Hougaard, P. (2000). Analysis of Multivariate Survival Data, Springer, New York.
  15. Lee, Y. and Nelder, J. A. (1996). Hierarchical generalized linear models (with discussion), Journal of the Royal Statistical Society B, 58, 619-678.
  16. Lee, Y. and Oh, H. S. (2014). A new sparse variable selection via random-effect model, Journal of Multivariate Analysis, 125, 89-99.
  17. Lee, Y., Nelder, J. A. and Pawitan, Y. (2006). Generalized Linear Models with Random Effects: Unified Analysis via H-Likelihood, Chapman and Hall, London.
  18. Legrand, C., Ducrocq, V., Janssen, P., Sylvester, R. and Duchateau, L. (2005). A Bayesian approach to jointly estimate centre and treatment by centre heterogeneity in a proportional hazards model, Statistics in Medicine, 24, 3789-3804.
  19. Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models, Journal of the Royal Statistical Society, Series A, 135, 370-384.
  20. Ripatti, S. and Palmgren, J. (2000). Estimation of multivariate frailty models using penalized partial likelihood, Biometrics, 56, 1016-1022.
  21. Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society Series B, 58, 267-288.
  22. Tibshirani, R. (1997). The LASSO method for variable selection in the Cox Model, Statistics in Medicine, 16, 385-395.
  23. Vaida, F. and Xu, R. (2000). Proportional hazards models with random effects, Statistics in Medicine, 19, 3309-3324.