A modified partial least squares regression for the analysis of gene expression data with survival information

  • Lee, So-Yoon (Credit Bureau Business Department, NICE Information Service)
  • Huh, Myung-Hoe (Department of Statistics, Korea University)
  • Park, Mira (Department of Preventive Medicine, Eulji University)
  • Received : 2014.06.30
  • Accepted : 2014.08.22
  • Published : 2014.09.30


In DNA microarray studies, the number of genes far exceeds the number of samples, and the gene expression measures are highly correlated. Partial least squares regression (PLSR) is a popular dimension reduction method, and several studies have shown it to be useful for classifying microarray data. In this study, we propose a modified version of PLSR for analyzing gene expression data with survival information. The method is designed as a new gene selection method that uses PLSR with an iterative procedure for imputing censored survival times. The mean square error of prediction criterion is used to determine the dimension of the model. To visualize the data, plots of variables superimposed with samples are used. The method is applied to two microarray data sets, both containing survival times. The results show that the proposed method works well for interpreting gene expression microarray data.
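
The general idea of combining PLSR with iterative imputation of censored survival times can be sketched as follows. This is only an illustration and not the authors' algorithm: the abstract does not state the imputation rule, so the sketch assumes a simple one in which each censored time is replaced by the larger of its censoring time and the current PLSR prediction. The function names (`pls1_fit`, `pls1_predict`, `impute_censored`) and the parameters `n_comp`, `max_iter`, and `tol` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pls1_fit(X, y, n_comp):
    """Fit single-response PLS by successive deflation of X and y."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                 # centered y
    comps = []
    for _ in range(n_comp):
        # Weight vector: direction in gene space with maximal covariance with y.
        w = [dot([E[i][j] for i in range(n)], f) for j in range(p)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # nothing left in X that covaries with y
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]  # sample scores
        tt = dot(t, t)
        load = [dot([E[i][j] for i in range(n)], t) / tt for j in range(p)]
        q = dot(f, t) / tt
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps


def pls1_predict(model, x):
    """Predict the response for one sample by replaying the deflation steps."""
    x_mean, y_mean, comps = model
    e = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in comps:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * l for a, l in zip(e, load)]
    return y_hat


def impute_censored(X, time, event, n_comp=1, max_iter=50, tol=1e-8):
    """Alternate between fitting PLS and updating censored times.

    Assumed rule (not taken from the paper): a censored time becomes
    max(censoring time, current prediction); observed times never change.
    """
    y = list(time)
    for _ in range(max_iter):
        model = pls1_fit(X, y, n_comp)
        new_y = [
            time[i] if event[i] else max(time[i], pls1_predict(model, X[i]))
            for i in range(len(y))
        ]
        change = max(abs(a - b) for a, b in zip(new_y, y))
        y = new_y
        if change < tol:
            break
    return y, pls1_fit(X, y, n_comp)
```

Here `X` is a list of samples (rows) by genes (columns), `time` holds observed or censoring times, and `event` is 1 for an observed event and 0 for a censored one. In practice the number of components would be chosen by a prediction criterion such as cross-validated mean square error, and survival times are often log-transformed first; both steps are omitted to keep the sketch short.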



Supported by : National Research Foundation of Korea (NRF)

