Identifying Statistically Significant Gene-Sets by Gene Set Enrichment Analysis Using Fisher Criterion

A Study on Detecting Significant Gene Sets Based on Gene Set Enrichment Analysis Using the Fisher Criterion

  • Kim, Jae-Young (김재영; Graduate School of Information and Communication Engineering, Kyungpook National University) ;
  • Shin, Mi-Young (신미영; School of Electrical Engineering and Computer Science, Kyungpook National University)
  • Published : 2008.07.25

Abstract

Gene set enrichment analysis (GSEA) is a computational method for identifying gene sets that show statistically significant differences between two groups of microarray expression profiles, while also uncovering their biological meaning by drawing on gene annotation databases such as Cytogenetic Band, KEGG pathways, and Gene Ontology. In GSEA, all genes in a given dataset are first ranked by the signal-to-noise ratio between the two groups, and the subsequent analyses are carried out on this ranking. Despite its impressive results in several previous studies, ranking genes by the signal-to-noise ratio makes it difficult to consider highly up-regulated and highly down-regulated genes simultaneously as candidate significant genes, even though such a combination may reflect situations that arise in metabolic and signaling pathways. To address this problem, we investigate GSEA with the Fisher criterion for gene ranking and evaluate its effect in leukemia-related pathway analyses.

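Gene set enrichment analysis (GSEA) is a method for analyzing two-class microarray experiment data: from among various gene sets constructed on the basis of biological characteristics, it extracts the significant gene sets whose expression values differ between the two classes in a statistically meaningful way. Specifically, it uses gene annotation databases that hold diverse biological information about genes (Cytogenetic Band, KEGG pathway, Gene Ontology, and others) to group the genes used in the microarray experiment by specific function, thereby deriving various gene sets. It then determines the significant genes within each gene set by referring to the difference in expression values between the two classes, and on that basis detects the statistically significant gene sets. In this paper, we propose replacing the signal-to-noise ratio based gene ranking that is currently the common choice in GSEA with gene ranking based on the Fisher criterion, in order to extract new, biologically meaningful significant gene sets that the existing GSEA method fails to extract. To examine the performance of the proposed method, we applied it to publicly available leukemia-related microarray data and sought to verify its usefulness by comparing the results with previously known findings.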
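
The difference between the two ranking criteria can be illustrated with a small sketch. The abstract does not give the exact formulas, so this assumes the commonly used two-class definitions: signal-to-noise ratio (mean_A - mean_B) / (sd_A + sd_B) and Fisher criterion (mean_A - mean_B)^2 / (var_A + var_B). The function names and the toy expression values are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def signal_to_noise(a, b):
    """Signed score (assumed form): (mean_a - mean_b) / (sd_a + sd_b)."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def fisher_criterion(a, b):
    """Unsigned score (assumed form): (mean_a - mean_b)^2 / (var_a + var_b)."""
    return (mean(a) - mean(b)) ** 2 / (stdev(a) ** 2 + stdev(b) ** 2)

def rank_genes(expr, score):
    """Return gene names sorted by descending score."""
    return sorted(expr, key=lambda g: score(*expr[g]), reverse=True)

# Hypothetical toy data: gene -> (class A values, class B values)
toy = {
    "UP":   ([9.0, 10.0, 11.0], [1.0, 2.0, 3.0]),    # strongly up in class A
    "DOWN": ([1.0, 2.0, 3.0], [9.0, 10.0, 11.0]),    # strongly down in class A
    "FLAT": ([5.0, 6.0, 7.0], [5.5, 6.0, 6.5]),      # no mean difference
}

snr_order = rank_genes(toy, signal_to_noise)
fisher_order = rank_genes(toy, fisher_criterion)
print("signal-to-noise ranking:", snr_order)
print("Fisher criterion ranking:", fisher_order)
```

Under these assumptions the signed signal-to-noise score places the up-regulated and down-regulated genes at opposite ends of the list, with the unchanged gene between them, whereas the squared numerator of the Fisher criterion places both strongly changed genes ahead of the unchanged one.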
