A Study on Clustering and Identifying Gene Sequences using Suffix Tree Clustering Method and BLAST

A Study on Clustering and Function Search of Gene Sequences Integrating the Suffix Tree Clustering Method and BLAST (translated Korean title)

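  • 한상일 (Department of Chemical Engineering, Pusan National University) ;
  • 이성근 (Department of Chemical Engineering, Pusan National University) ;
  • 김경훈 (Department of Chemical Engineering, Pusan National University) ;
  • 이주영 (Department of Chemical Engineering, Pusan National University) ;
  • 김영한 (Department of Chemical Engineering, Dong-A University) ;
  • 황규석 (Department of Chemical Engineering, Pusan National University)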
  • Published : 2005.10.01

Abstract

DNA and protein data from diverse species are discovered daily and deposited in public archives according to each archive's established format. The database systems of these archives provide not only an easy-to-use, flexible public interface but also in silico tools for analyzing unidentified sequence data. Among these tools, multiple sequence alignment methods [1], which rely on pairwise alignment and the Smith-Waterman algorithm [2], make it possible to identify unknown DNA or protein sequences and to infer phylogenetic relationships among species. In existing multiple alignment methods, however, the runtime increases exponentially as the number of sequences grows. To remedy this problem, we adopted a parallel-processing suffix tree algorithm that searches for common subsequences all at once, without pairwise alignment. Because some of the common subsequences found may be cross-matching subsequences that trigger inexact matching, this paper also proposes a cross-matching masking process. To identify the function of the clusters generated by suffix tree clustering, BLAST was combined with the clustering tool. The clustering and annotation tool works in the following steps: (1) construction of the suffix tree; (2) masking of cross-matching pairs; (3) clustering of gene sequences; and (4) annotation of the gene clusters by BLAST search. The system was successfully evaluated on 22 gene sequences from the pyruvate pathway of bacteria, producing 7 clusters and finding the representative common subsequences of each cluster.
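
As a rough illustration of steps (1) and (3), the sketch below groups sequences that share a common subsequence of at least a given length. It is not the authors' implementation: it uses a naive, uncompressed generalized suffix trie (quadratic in sequence length) rather than a linear-time or parallel suffix tree, it omits the cross-matching masking step because the abstract does not specify it, and the function names, the `min_len` threshold, and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict


def build_suffix_trie(seqs):
    """Naive generalized suffix trie (assumption: stands in for a real suffix tree).
    Every node records the ids of the sequences whose suffixes pass through it."""
    root = {"children": {}, "ids": set()}
    for sid, seq in enumerate(seqs):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "ids": set()})
                node["ids"].add(sid)
    return root


def shared_substrings(root, min_len):
    """Yield (substring, ids) for each substring of length >= min_len
    that occurs in at least two different sequences."""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            if len(child["ids"]) < 2:
                continue  # descendants can only be shared by fewer sequences
            sub = prefix + ch
            if len(sub) >= min_len:
                yield sub, frozenset(child["ids"])
            stack.append((child, sub))


def cluster_sequences(seqs, min_len):
    """Union sequences that share a substring; return (members, representative) pairs."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    shared = list(shared_substrings(build_suffix_trie(seqs), min_len))
    for _, ids in shared:
        first, *rest = sorted(ids)
        for other in rest:
            parent[find(other)] = find(first)

    groups = defaultdict(list)
    for i in range(len(seqs)):
        groups[find(i)].append(i)

    result = []
    for members in sorted(groups.values()):
        # Representative: longest shared substring present in every member, if any.
        candidates = [s for s, ids in shared if set(members) <= ids]
        result.append((members, max(candidates, key=len, default="")))
    return result


if __name__ == "__main__":
    toy = ["ATGCGTACC", "TTGCGTACA", "GGGAAATTT", "CCGAAATTC", "ACACACAC"]
    for members, rep in cluster_sequences(toy, min_len=5):
        print(members, rep or "(no shared subsequence)")
```

In a pipeline like the one described, the representative subsequence of each cluster would then be passed to a BLAST search for annotation (step 4); that call is left out here because it depends on the BLAST installation or service being used.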

References

  1. C. Notredame and D. G. Higgins, 'SAGA: sequence alignment by genetic algorithm,' Nucleic Acids Res., vol. 24, pp. 1515-1524, 1996 https://doi.org/10.1093/nar/24.8.1515
  2. T. F. Smith and M. S. Waterman, 'Identification of common molecular sequences,' J. Mol. Biol., vol. 147, pp. 195-197, 1981 https://doi.org/10.1016/0022-2836(81)90087-5
  3. J. Y. Chen and J. V. Carlis, 'Genomic data modeling,' Information Systems, vol. 28, pp. 287, 2003 https://doi.org/10.1016/S0306-4379(02)00071-6
  4. D. W. Mount, 'Bioinformatics: sequence and genome analysis,' Cold Spring Harbor Laboratory Press, New York, pp. 3-5, 2001
  5. J. M. Ostell, S. J. Wheelan and J. A. Kans, 'The NCBI data model,' Methods Biochem. Anal., vol. 43, pp. 19, 2001 https://doi.org/10.1002/0471223921.ch2
  6. A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White and S. L. Salzberg, 'Alignment of whole genomes,' Nucleic Acids Res., vol. 27(11), pp. 2369-2376, 1999 https://doi.org/10.1093/nar/27.11.2369
  7. A. L. Delcher, A. Phillippy, J. Carlton and S. L. Salzberg, 'Fast algorithms for large-scale genome alignment and comparison,' Nucleic Acids Res., vol. 30(11), pp. 2478-2483, 2002 https://doi.org/10.1093/nar/30.11.2478
  8. N. Volfovsky, B. J. Haas and S. L. Salzberg, 'A clustering method for repeat analysis in DNA sequences,' Genome Biol., vol. 2, pp. 1-11, 2001 https://doi.org/10.1186/gb-2001-2-8-research0027
  9. A. Kalyanaraman, S. Aluru and S. Kothari, 'Parallel EST clustering,' HICOMB, 185, 2002
  10. O. Zamir, O. Etzioni, O. Madani and R. M. Karp, 'Fast and intuitive clustering of Web documents,' In Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 287-290, 1997
  11. S. I. Han, S. G. Lee, B. K. Hou, S. H. Park, Y. H. Kim and K. S. Hwang, 'A gene clustering method with masking cross-matching fragments using modified suffix tree clustering method,' Korean J. Chem. Eng., vol. 22(3), pp. 345, 2005 https://doi.org/10.1007/BF02719409
  12. S. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman, 'Basic local alignment search tool,' J. Mol. Biol., vol. 215(3), pp. 403-410, 1990 https://doi.org/10.1016/S0022-2836(05)80360-2
  13. E. M. McCreight, 'A space-economical suffix tree construction algorithm,' J. ACM, vol. 23, pp. 262-272, 1976 https://doi.org/10.1145/321941.321946
  14. E. Ukkonen, 'On-line construction of suffix trees,' Algorithmica, vol. 14, pp. 249-260, 1995 https://doi.org/10.1007/BF01206331
  15. D. Gusfield, 'Algorithms on strings, trees, and sequences: computer science and computational biology,' Cambridge University Press, London, pp. 116, 1997