A Method for Microarray Data Analysis Based on Bayesian Networks Using an Efficient Structural Learning Algorithm and Data Dimensionality Reduction

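  • Kyu-Baek Hwang (School of Computer Science and Engineering, Seoul National University)
  • Jeong-Ho Chang (School of Computer Science and Engineering, Seoul National University)
  • Byoung-Tak Zhang (School of Computer Science and Engineering, Seoul National University)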
  • Published : 2002.12.01

Abstract

Microarray data, obtained with DNA chip technologies, measure the expression levels of thousands of genes in cells or tissues. They are used for gene function prediction and cancer diagnosis based on gene expression patterns. Among the diverse methods for data analysis, the Bayesian network represents the relationships among data attributes as a graph structure. This property makes it possible to discover relations among genes and tissue characteristics (e.g., the cancer type) through microarray data analysis. However, most current microarray data sets are so sparse that it is difficult to apply general analysis methods, including Bayesian networks, directly. In this paper, we use an efficient structural learning algorithm and data dimensionality reduction to analyze microarray data with Bayesian networks. The proposed method was applied to a real microarray data set, the NCI60 data set. Its usefulness was evaluated by how accurately the learned Bayesian networks represent known biological facts.
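To make the graph representation mentioned in the abstract concrete, the following is a minimal sketch of how a Bayesian network factorizes a joint distribution into one conditional probability per node given its parents. It is not the learning algorithm proposed in the paper: the three variables (`gene_a`, `gene_b`, `cancer_type`), the edges, and all probability values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy network: gene_a -> gene_b, gene_a -> cancer_type.
# Variable names, edges, and probabilities are illustrative assumptions.
parents = {
    "gene_a": (),
    "gene_b": ("gene_a",),
    "cancer_type": ("gene_a",),
}

# Conditional probability tables: cpt[var][parent_values][value].
cpt = {
    "gene_a": {(): {"low": 0.6, "high": 0.4}},
    "gene_b": {
        ("low",): {"low": 0.8, "high": 0.2},
        ("high",): {"low": 0.3, "high": 0.7},
    },
    "cancer_type": {
        ("low",): {"type1": 0.9, "type2": 0.1},
        ("high",): {"type1": 0.25, "type2": 0.75},
    },
}


def joint_probability(assignment):
    """Product of each node's conditional probability given its parents."""
    prob = 1.0
    for var, pars in parents.items():
        parent_values = tuple(assignment[p] for p in pars)
        prob *= cpt[var][parent_values][assignment[var]]
    return prob


def all_assignments():
    """Enumerate every full assignment of the network's variables."""
    # The value domain of each variable is read off its first CPT row.
    domains = {var: list(next(iter(cpt[var].values()))) for var in parents}
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        yield dict(zip(names, values))


# A valid factorization sums to 1 over all assignments.
total = sum(joint_probability(a) for a in all_assignments())

# Joint probability of one configuration: 0.4 * 0.7 * 0.75.
example = joint_probability(
    {"gene_a": "high", "gene_b": "high", "cancer_type": "type2"}
)
```

In this sketch the graph structure is fixed by hand. Structural learning, which is the subject of the paper, would instead search for the `parents` mapping that best explains the observed expression data.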
