Standard-based Integration of Heterogeneous Large-scale DNA Microarray Data for Improving Reusability

  • Jung, Yong (Seoul National University Biomedical Informatics) ;
  • Seo, Hwa-Jeong (Medical Informatics, Graduate School of Public Health, Gachon University of Medicine and Science) ;
  • Park, Yu-Rang (Seoul National University Biomedical Informatics) ;
  • Kim, Ji-Hun (Seoul National University Biomedical Informatics) ;
  • Bien, Sang Jay (Seoul National University Biomedical Informatics) ;
  • Kim, Ju-Han (Seoul National University Biomedical Informatics)
  • Accepted : 2011.03.02
  • Published : 2011.03.31

Abstract

The Gene Expression Omnibus (GEO) holds the largest collection of gene-expression microarray data, and that collection has grown exponentially. Microarray data in GEO have been generated in many different formats and often lack standardized annotation and documentation. It is often hard to tell whether a dataset has been preprocessed and, if so, how. Standard-based integration of heterogeneous data formats and metadata is necessary for comprehensive data query, analysis and mining. We attempted to integrate the heterogeneous microarray data in GEO based on the Minimum Information About a Microarray Experiment (MIAME) standard. We unified the data fields of the GEO Data table and mapped the attributes of GEO metadata into MIAME elements. We also distinguished non-preprocessed raw datasets from processed and other datasets using a two-step classification method. Most of the procedures were implemented as semi-automated algorithms that rely in part on text mining techniques. We localized 2,967 Platforms, 4,867 Series and 103,590 Samples covering 279 organisms, integrated them into a standard-based relational schema and developed a comprehensive query interface for extracting them. Our tool, GEOQuest, is available at http://www.snubi.org/software/GEOQuest/.
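The idea of mapping GEO metadata attributes into MIAME elements can be illustrated with a minimal Python sketch. It is not the GEOQuest implementation. The attribute names follow the GEO SOFT convention of `!Key = value` lines, the section labels loosely follow the six parts of MIAME, and the mapping table `GEO_TO_MIAME`, the helper functions and the sample record are all hypothetical and chosen only for illustration.

```python
# Hypothetical mapping from GEO SOFT attribute names to MIAME-style sections.
# The assignments below are illustrative assumptions, not the mapping used
# by GEOQuest.
GEO_TO_MIAME = {
    "Series_title": "experiment_design",
    "Series_summary": "experiment_design",
    "Series_overall_design": "experiment_design",
    "Platform_technology": "array_design",
    "Platform_manufacturer": "array_design",
    "Sample_organism_ch1": "samples",
    "Sample_source_name_ch1": "samples",
    "Sample_hyb_protocol": "hybridizations",
    "Sample_scan_protocol": "measurements",
    "Sample_data_processing": "normalization_controls",
}


def parse_soft_attributes(text):
    """Collect '!Key = value' lines of a SOFT-style record into a dict of lists."""
    attrs = {}
    for line in text.splitlines():
        # Skip entity markers ('^...'), table rows and anything without '='.
        if not line.startswith("!") or "=" not in line:
            continue
        key, value = line[1:].split("=", 1)
        attrs.setdefault(key.strip(), []).append(value.strip())
    return attrs


def to_miame(attrs):
    """Group attributes by MIAME-style section; return (grouped, unmapped)."""
    grouped, unmapped = {}, {}
    for key, values in attrs.items():
        section = GEO_TO_MIAME.get(key)
        if section is None:
            unmapped[key] = values  # left for manual curation
        else:
            grouped.setdefault(section, {})[key] = values
    return grouped, unmapped


# Made-up record used only to exercise the functions above.
EXAMPLE = """^SAMPLE = GSM0000
!Sample_title = example sample
!Sample_organism_ch1 = Homo sapiens
!Sample_data_processing = RMA normalization
"""

if __name__ == "__main__":
    grouped, unmapped = to_miame(parse_soft_attributes(EXAMPLE))
    print(grouped)
    print(unmapped)
```

Attributes with no entry in the table are returned separately rather than dropped, which mirrors the semi-automated approach described above: whatever cannot be mapped automatically is left for manual review.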

