Machine Learning Using Template-Based-Predicted Structure of Haemagglutinin Predicts Pathogenicity of Avian Influenza

  • Jong Hyun Shin (Department of Microbiology, Sungkyunkwan University School of Medicine) ;
  • Sun Ju Kim (Department of Microbiology, Sungkyunkwan University School of Medicine) ;
  • Gwanghun Kim (Department of Biomedical Sciences; BK21 FOUR Biomedical Science Project; Medical Research Institute, Seoul National University College of Medicine) ;
  • Hang-Rae Kim (Department of Biomedical Sciences; Department of Anatomy & Cell Biology; BK21 FOUR Biomedical Science Project; Medical Research Institute, Seoul National University College of Medicine) ;
  • Kwan Soo Ko (Department of Microbiology, Sungkyunkwan University School of Medicine)
  • Received: 2024.05.20
  • Reviewed: 2024.07.30
  • Published: 2024.10.28

Abstract

Deep learning is a promising approach to complex biological classification, provided that well-curated datasets are available. This study addresses the challenge of analyzing three-dimensional protein structures by introducing a pipeline that uses open-source tools to convert protein structures into a format amenable to computational analysis. We applied a two-dimensional convolutional neural network (CNN) to a dataset of 12,143 avian influenza virus genomes from 64 countries, encompassing 119 hemagglutinin (HA) and neuraminidase (NA) types. Pathogenicity was assigned based on the presence of the H5 or H7 subtype. Among models with zero to six mid-layers, the four-layer model identified highly pathogenic strains most effectively, with accuracies above 0.9. To strengthen the approach, we incorporated principal component analysis (PCA) for dimensionality reduction and a one-class support vector machine (SVM) for anomaly detection, and used bootstrapping to improve model robustness. In addition, a K-nearest neighbor (K-NN) classifier was tuned via hyperparameter optimization to corroborate the findings. PCA revealed distinct clustering of pathogenic HA, yielding an AUC of up to 0.85, and the optimized K-NN model achieved an accuracy of 0.96 to 0.97. Together, these methods show that the framework can identify pathogenic avian influenza strains rapidly and accurately, providing a useful tool for managing global avian influenza threats.
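To make the labeling rule and the K-NN check described in the abstract concrete, the following is a minimal sketch and not the authors' implementation (their code is in the GitHub repository listed in the acknowledgments). It assumes subtype strings of the form "H5N1", and it uses made-up two-dimensional points as a stand-in for PCA-reduced structure features. The names `is_highly_pathogenic`, `knn_predict`, and `toy` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from math import dist

# Assumption: subtypes are written like "H5N1"; the paper's own parsing is not shown.
SUBTYPE_RE = re.compile(r"^H(\d+)N(\d+)$")


def is_highly_pathogenic(subtype: str) -> bool:
    """Label rule from the abstract: an H5 or H7 hemagglutinin counts as pathogenic."""
    match = SUBTYPE_RE.match(subtype.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized subtype: {subtype!r}")
    return int(match.group(1)) in {5, 7}


def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up 2-D features standing in for PCA-reduced structure data (illustrative only).
toy = [
    ((0.9, 1.1), "H5N1"), ((1.0, 0.8), "H7N9"), ((1.2, 1.0), "H5N8"),
    ((-1.0, -0.9), "H9N2"), ((-0.8, -1.2), "H3N8"), ((-1.1, -1.0), "H1N1"),
]
train = [(xy, is_highly_pathogenic(subtype)) for xy, subtype in toy]

print(knn_predict(train, (0.8, 0.9)))
print(knn_predict(train, (-0.9, -1.0)))
```

In this toy setup the value of `k` is fixed at 3; the hyperparameter optimization mentioned in the abstract would correspond to choosing `k` (and possibly the distance metric) by validation on held-out data.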

Acknowledgments

All sequence data were retrieved from the avian-flu database (http://avian-flu.org). The source code is available on GitHub at https://github.com/jhshin0714/ML_avian-flu/. This research was supported in part by the Bio & Medical Technology Development Program of the National Research Foundation (NRF), funded by the Korean government (MSIT) (NRF-2018M3A9H4055197).