Statistical Issues in Genomic Cohort Studies

Korean title (translated): Major Statistical Challenges in Genomic Cohort Studies

  • Park, So-Hee (Cancer Biostatistics Branch, Division of Cancer Registration and Epidemiology, National Cancer Center)
  • Published: 2007.03.31


Large-scale cohort studies raise numerous statistical issues at every stage, from study design and data collection to data analysis and interpretation. In genomic cohort studies these problems become more complicated and need to be handled carefully. Rapid technical advances in genomic research produce enormous amounts of data, and traditional statistical methods are no longer sufficient to handle them. In this paper, we review several important statistical issues that arise frequently in large-scale genomic cohort studies: measurement error and methods for correcting it, cost-efficient design strategies for main cohort and validation studies, inflated Type I error, gene-gene and gene-environment interactions, and time-varying hazard ratios. Appropriate statistical methods are essential for making the best use of valuable cohort data and producing valid and reliable study results.

