A Comparative Study of Covariance Matrix Estimators in High-Dimensional Data

  • Received : 2013.07.29
  • Reviewed : 2013.09.30
  • Published : 2013.10.31

Abstract

The covariance matrix plays an important role in multivariate statistical analysis, and in traditional multivariate analysis the sample covariance matrix has been the usual estimator of the true covariance matrix. In high-dimensional data, however, where the number of variables is much larger than the sample size, the sample covariance matrix is known to perform poorly and is singular (and therefore not invertible), so it may be unsuitable for conventional multivariate methods. Several covariance matrix estimators have recently been proposed to address this problem, following three approaches: shrinkage, thresholding, and modified Cholesky decomposition. In this paper we compare the performance of these estimators through simulation under a range of realistic settings that may affect their performance.
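To make the singularity problem concrete, the following is a minimal NumPy sketch, not the simulation design used in the paper. It draws n = 20 observations of p = 50 variables, checks that the sample covariance matrix is rank-deficient, and applies two simplified regularizers: linear shrinkage toward a scaled identity matrix and hard thresholding of off-diagonal entries. The AR(1)-type true covariance, the shrinkage weight `alpha`, the threshold `t`, and the Frobenius-norm loss are all illustrative assumptions; the published estimators choose their tuning parameters from the data, which this sketch does not attempt.

```python
import numpy as np


def sample_covariance(X):
    """Unbiased sample covariance of X, with rows as observations."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / (X.shape[0] - 1)


def linear_shrinkage(S, alpha):
    """Convex combination of S and a scaled identity target.

    alpha is fixed by hand here (an assumption); shrinkage estimators
    in the literature estimate the weight from the data.
    """
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - alpha) * S + alpha * target


def hard_threshold(S, t):
    """Set off-diagonal entries with |s_ij| <= t to zero; keep the diagonal."""
    T = np.where(np.abs(S) > t, S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T


rng = np.random.default_rng(0)
n, p = 20, 50  # far fewer observations than variables

# Hypothetical true covariance: sigma_ij = 0.5 ** |i - j|
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

S = sample_covariance(X)
S_shrunk = linear_shrinkage(S, alpha=0.5)
S_thr = hard_threshold(S, t=0.5)

# After centering, S has rank at most n - 1 < p, so it cannot be inverted.
rank_S = np.linalg.matrix_rank(S)
min_eig_shrunk = np.linalg.eigvalsh(S_shrunk).min()

print("rank of S:", rank_S, "of p =", p)
print("smallest eigenvalue after shrinkage:", min_eig_shrunk)
for name, est in [("sample", S), ("shrinkage", S_shrunk), ("threshold", S_thr)]:
    print(name, "Frobenius error:", np.linalg.norm(est - Sigma, "fro"))
```

Shrinkage toward a positive multiple of the identity guarantees a positive definite result whenever `alpha` is greater than zero, whereas hard thresholding preserves symmetry but does not by itself guarantee positive definiteness. Which estimator has the smaller error depends on the true covariance structure and on the tuning parameters, which is the kind of question a comparative simulation study addresses.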
