NOGSEC: A NOnparametric method for Genome SEquence Clustering

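  • 이영복 (Graphics Application Laboratory, Department of Computer Science, Pusan National University; Biological Research Information Center, Pohang University of Science and Technology) ;
  • 김판규 (Graphics Application Laboratory, Department of Computer Science, Pusan National University; Research Institute of Computer and Information Communication) ;
  • 조환규 (Graphics Application Laboratory, Department of Computer Science, Pusan National University; Research Institute of Computer and Information Communication)
  • Published: 2003.06.01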

Abstract

A major topic in comparative genomics is predicting functional annotation by classifying protein sequences. Computational approaches to function prediction include protein structure prediction, sequence alignment, domain prediction, and binding site prediction. This paper addresses another computational approach: searching for sets of homologous sequences in a sequence similarity graph. Methods based on a similarity graph do not require prior knowledge about the sequences, but they depend heavily on thresholds set subjectively by the researcher. We propose a genome sequence clustering method based on iterative testing and graph decomposition, together with a simple method for calculating a strict threshold that has biochemical meaning. The proposed method was applied to known bacterial genome sequences, and the results are presented alongside those of the BAG algorithm. The resulting clusters lack some completeness, but their confidence level is very high and the method does not need user-defined thresholds.

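One of the main topics in comparative genomics is classifying gene sequences and predicting protein function. Alongside methods such as protein structure, consensus sequence, and binding site prediction, there is a computational approach that searches for homologous genes by analyzing a similarity graph derived from whole-genome sequences. Methods that use a similarity graph have the advantage of not depending on existing knowledge about the sequences, but the disadvantage of requiring subjective thresholds such as a lower bound on similarity. In this paper, we generalize a previous method that iteratively decomposes the graph, and propose a similarity-graph-based methodology for gene sequence cluster analysis together with an objective and stable method for calculating parameter thresholds. We analyzed known microbial genome sequences with the proposed method and compared the results with those of the earlier BAG algorithm.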
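To make the threshold dependence of similarity-graph methods concrete, the following is a minimal sketch of the generic baseline idea only, not the NOGSEC or BAG procedure. It assumes pairwise similarity scores are already available, keeps edges whose score meets a user-chosen cutoff, and reports connected components as clusters. The function name `cluster_by_threshold`, the sequence labels, and all scores are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def cluster_by_threshold(scores, threshold):
    """Cluster sequences as connected components of a similarity graph.

    scores: dict mapping (seq_a, seq_b) -> pairwise similarity score.
    threshold: user-chosen cutoff; only edges with score >= threshold are kept.
    This is a generic baseline, not the method proposed in the paper.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving; registers unseen nodes.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)  # registers both endpoints
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical similarity scores between five sequences.
example_scores = {
    ("s1", "s2"): 95,
    ("s2", "s3"): 90,
    ("s3", "s4"): 40,
    ("s4", "s5"): 85,
}

for cutoff in (30, 80, 92):
    print(cutoff, cluster_by_threshold(example_scores, cutoff))
```

The same input yields one, two, or four clusters depending on the cutoff, which is the kind of sensitivity to subjective settings that the abstract describes as the weakness of similarity-graph methods.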
