Data Mining for High Dimensional Data in Drug Discovery and Development

  • Lee, Kwan R. (GlaxoSmithKline, Research & Development, Data Exploration Sciences, 1250 South Collegeville Road, Collegeville)
  • Park, Daniel C. (GlaxoSmithKline, Research & Development, Data Exploration Sciences, 1250 South Collegeville Road, Collegeville)
  • Lin, Xiwu (GlaxoSmithKline, Research & Development, Data Exploration Sciences, 1250 South Collegeville Road, Collegeville)
  • Eslava, Sergio (GlaxoSmithKline, Research & Development, Data Exploration Sciences, 1250 South Collegeville Road, Collegeville)
  • Published: 2003.12.01

Abstract

Data mining differs from traditional data analysis primarily along one important dimension: the scale of the data. For that reason, principles from computer science as well as statistics are needed to extract information from large data sets. In this paper we briefly review data mining, its characteristics, typical data mining algorithms, and potential and ongoing applications of data mining in the biopharmaceutical industry. The distinguishing characteristics of data mining lie in its understandability, its scalability, its problem-driven nature, and its analysis of retrospective or observational data rather than data from designed experiments. At a high level, one can identify three types of problems for which data mining is useful: description, prediction, and search. The data mining algorithms briefly reviewed include decision trees and rules, nonlinear classification methods, memory-based methods, model-based clustering, and graphical dependency models. The application areas covered are discovery compound libraries, clinical trial and disease management data, genomics and proteomics, structural databases for candidate drug compounds, and other applications of pharmaceutical relevance.
