Network-based regularization for analysis of high-dimensional genomic data with group structure

A network-based regularization method for the analysis of high-dimensional genomic data with group structure

  • Kim, Kipoong (Department of Statistics, Pusan National University) ;
  • Choi, Jiyun (Department of Statistics, Pusan National University) ;
  • Sun, Hokeun (Department of Statistics, Pusan National University)
  • Received : 2016.08.09
  • Accepted : 2016.10.05
  • Published : 2016.10.31

Abstract

In genetic association studies with high-dimensional genomic data, regularization procedures based on penalized likelihood are often applied to identify genes or genetic regions associated with diseases or traits. A network-based regularization procedure can utilize biological network information, such as genetic pathways and signaling pathways, and has shown better selection performance than other regularization procedures such as lasso and elastic-net. However, network-based regularization has a limitation: it cannot be applied to high-dimensional genomic data with a group structure. In this article, we propose combining dimension reduction techniques, such as principal component analysis and partial least squares, with network-based regularization for the analysis of high-dimensional genomic data with a group structure. The selection performance of the proposed method was evaluated in extensive simulation studies. The proposed method was also applied to real DNA methylation data generated with the Illumina Infinium HumanMethylation27 BeadChip, where methylation beta values of around 20,000 CpG sites over 12,770 genes were compared between 123 ovarian cancer patients and 152 healthy controls. The analysis identified a few cancer-related genes.

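In genetic association studies that use high-dimensional genomic data, regularization of regression coefficients based on a penalized likelihood is widely used to discover genes that affect diseases and phenotypic traits. In particular, network-based regularization can use biological network information, such as genetic pathways and signaling pathways in genetic association studies, and therefore has the advantage of identifying the relevant genes more accurately than other regularization methods such as lasso and elastic-net. However, network-based regularization cannot be applied to high-dimensional genomic data with a group structure. Because most high-dimensional genomic data, such as real SNP data and DNA methylation data, have a group structure, this paper proposes a new method that combines network-based regularization with dimension reduction methods such as principal component analysis (PCA) and partial least squares (PLS) to analyze such data. The variable selection performance of the proposed method was demonstrated through several simulation studies, and the method was also applied to high-dimensional DNA methylation data from 152 healthy controls and 123 ovarian cancer patients. The DNA methylation data have a group structure in which roughly 20,000 CpG sites belong to 12,770 genes, and were generated with the Illumina Infinium HumanMethylation27 BeadChip. The analysis identified several genes that are in fact associated with cancer.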
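
The combination described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: each gene is summarized by the first principal component of its CpG sites, the outcome is quantitative with squared-error loss (the case-control analysis would use a logistic likelihood instead), the penalty has the form λ1‖β‖1 + λ2 βᵀLβ with L a normalized graph Laplacian of the gene network, and the fit uses plain proximal gradient descent. The function names `first_pc_scores`, `normalized_laplacian`, and `network_penalized_fit` are hypothetical and do not come from the authors' software.

```python
import numpy as np

def first_pc_scores(X, groups):
    """Summarize each group of columns (e.g. CpG sites of one gene)
    by its first principal component score."""
    labels = np.unique(groups)
    Z = np.empty((X.shape[0], labels.size))
    for j, g in enumerate(labels):
        Xg = X[:, groups == g]
        Xg = Xg - Xg.mean(axis=0)              # center before PCA
        U, s, _ = np.linalg.svd(Xg, full_matrices=False)
        Z[:, j] = U[:, 0] * s[0]               # scores on the first PC
    return Z

def normalized_laplacian(A):
    """Normalized Laplacian of a symmetric adjacency matrix without
    self-loops; isolated nodes get an all-zero row and column."""
    d = A.sum(axis=1)
    inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.where(d > 0, d, 1.0)), 0.0)
    return np.diag((d > 0).astype(float)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

def network_penalized_fit(Z, y, L, lam1, lam2, n_iter=2000):
    """Minimize (1/2n)||y - Z b||^2 + lam1*||b||_1 + lam2 * b'Lb
    by proximal gradient descent (assumed squared-error loss)."""
    n, p = Z.shape
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardize gene-level scores
    y = y - y.mean()
    H = Z.T @ Z / n + 2.0 * lam2 * L           # Hessian of the smooth part
    step = 1.0 / np.linalg.eigvalsh(H)[-1]     # 1 / largest eigenvalue
    c = Z.T @ y / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        b = beta - step * (H @ beta - c)       # gradient step
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam1, 0.0)  # soft-threshold
    return beta
```

In this sketch, genes whose fitted coefficient is nonzero would be treated as selected. The sign of a principal component is arbitrary, so a practical implementation would need to handle coefficient signs between linked genes with care. A partial least squares score could replace the principal component score in `first_pc_scores` without changing the rest.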
