The extension of the largest generalized-eigenvalue based distance metric $D_{ij}(\gamma_1)$ in arbitrary feature spaces to classify composite data points

  • Received : 2019.07.31
  • Accepted : 2019.10.14
  • Published : 2019.12.31

Abstract

Analyzing patterns in data points embedded in linear and non-linear feature spaces is a common research problem across areas such as data mining, machine learning, pattern recognition, and multivariate analysis. In this paper, the data points are heterogeneous sets of biosequences (composite data points). A composite data point is a set of ordinary data points (e.g., a set of feature vectors). We theoretically extend the derivation of the largest generalized eigenvalue-based distance metric $D_{ij}(\gamma_1)$ to any linear or non-linear feature space. We prove that $D_{ij}(\gamma_1)$ is a metric under any linear or non-linear feature transformation function. We show the sufficiency and efficiency of the decision rule $\bar{\delta}_{\Xi i}$ (i.e., the mean of $D_{ij}(\gamma_1)$) for classifying heterogeneous sets of biosequences, compared with the decision rules $\min_{\Xi i}$ and $\mathrm{median}_{\Xi i}$. We analyze the impact of linear and non-linear transformation functions on classifying and clustering collections of heterogeneous sets of biosequences. We also show empirically how the length of a sequence in a simulated heterogeneous sequence-set affects the classification and clustering results in linear and non-linear feature spaces. We propose a new concept: the limiting dispersion map of the existing clusters in heterogeneous sets of biosequences embedded in linear and non-linear feature spaces, which is based on the limiting distribution of nucleotide compositions estimated from real data sets. Finally, empirical conclusions and scientific evidence are drawn from the experiments to support the theoretical results stated in this paper.
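
The abstract names a distance built on the largest generalized eigenvalue and three decision rules (mean, min, median) but does not reproduce the formulas. The following Python sketch is therefore only an illustration under stated assumptions, not the paper's method: each composite data point is summarized by a covariance matrix, the Thompson metric (a known metric on positive definite matrices driven by the largest generalized eigenvalue of the pair) stands in for the paper's metric, and all function names are hypothetical.

```python
import numpy as np


def generalized_eigenvalues(a, b):
    """Eigenvalues gamma of a v = gamma b v for symmetric positive definite a, b."""
    chol = np.linalg.cholesky(b)              # b = chol @ chol.T
    inv = np.linalg.inv(chol)
    # inv @ a @ inv.T is symmetric and shares its eigenvalues with the pair (a, b).
    return np.linalg.eigvalsh(inv @ a @ inv.T)  # ascending order


def distance(a, b):
    """Stand-in (Thompson) metric: max of log gamma_max(a, b) and log gamma_max(b, a).

    ASSUMPTION: this is not the formula from the paper, which the abstract omits.
    """
    gammas = generalized_eigenvalues(a, b)
    # The eigenvalues of (b, a) are the reciprocals of those of (a, b).
    return float(max(np.log(gammas[-1]), -np.log(gammas[0])))


# The three decision rules compared in the abstract.
RULES = {"mean": np.mean, "min": np.min, "median": np.median}


def classify(query, classes, rule="mean"):
    """Assign `query` to the class whose members are closest under `rule`.

    `classes` maps a label to a list of covariance matrices (hypothetical layout).
    """
    scores = {
        label: float(RULES[rule]([distance(query, m) for m in members]))
        for label, members in classes.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    classes = {
        "A": [np.diag([1.0, 1.0]), np.diag([1.1, 0.9])],
        "B": [np.diag([5.0, 5.0]), np.diag([6.0, 4.0])],
    }
    query = np.diag([1.0, 1.05])
    for rule in RULES:
        print(rule, classify(query, classes, rule))
```

With well-separated classes the three rules agree; they can diverge when a class contains outlying members, since the min rule reacts to a single close member while the mean rule averages over all of them.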
