HPV-type Prediction System using SVM and Partial Sequential Pattern

  • Kim, Jinsu (College of Liberal Arts, Anyang University)
  • Received: 2014.09.05
  • Reviewed: 2014.12.20
  • Published: 2014.12.28

Abstract

Existing systems generate patterns from whole sequences or from unaligned sequences, so the number of patterns grows exponentially and pattern extraction consumes considerable time and cost. Rather than searching the whole protein sequence for patterns, this paper proposes a classification system that uses multiple sequence alignment to divide protein sequences into partial sequence sections and extracts sequential patterns from each section. The extracted patterns are merged into a single set of motif candidates, from which the training set for an SVM classifier is selected. The trained SVM then predicts the HPV type (high/low) of known or unknown protein sequences. At a minimum support of 30%, the proposed system achieved higher precision and recall than the previous system.
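
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the fixed section width, and the use of contiguous k-mers in place of general sequential patterns are all assumptions made for illustration. The sketch splits already-aligned sequences into sections, keeps the patterns whose support within a section reaches the minimum support, and turns each sequence into a binary feature vector that an SVM could be trained on.

```python
from collections import Counter

def split_sections(aligned, width):
    """Cut aligned sequences into consecutive column windows of `width` (assumed fixed)."""
    length = len(aligned[0])
    return [[seq[i:i + width] for seq in aligned]
            for i in range(0, length, width)]

def frequent_patterns(section, k, min_support):
    """Contiguous k-mers present in at least `min_support` of the section's fragments.

    Simplification: contiguous k-mers stand in for general sequential patterns.
    """
    counts = Counter()
    for fragment in section:
        residues = fragment.replace("-", "")  # drop alignment gap characters
        kmers = {residues[i:i + k] for i in range(len(residues) - k + 1)}
        counts.update(kmers)  # each k-mer counted once per sequence
    threshold = min_support * len(section)
    return sorted(p for p, c in counts.items() if c >= threshold)

def motif_candidates(aligned, width, k, min_support=0.3):
    """Merge per-section frequent patterns into one candidate list of (section, pattern)."""
    candidates = []
    for idx, section in enumerate(split_sections(aligned, width)):
        candidates += [(idx, p) for p in frequent_patterns(section, k, min_support)]
    return candidates

def to_features(aligned_seq, candidates, width):
    """Binary vector: 1 if a candidate pattern occurs in its own section of the sequence."""
    vector = []
    for idx, pattern in candidates:
        fragment = aligned_seq[idx * width:(idx + 1) * width].replace("-", "")
        vector.append(1 if pattern in fragment else 0)
    return vector
```

Restricting the search to one section at a time bounds the number of patterns that have to be enumerated, which is the cost saving the abstract points to. The resulting vectors, paired with high/low labels, would form the SVM training set.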
