An Efficient Local Alignment Algorithm for DNA Sequences including N and X

  • Jin Wook Kim (School of Computer and Information Engineering, Inha University)
  • Received : 2009.11.30
  • Accepted : 2010.01.11
  • Published : 2010.03.15

Abstract

A local alignment algorithm finds a pair of substrings, one from each of two given strings, that are similar to each other. A DNA sequence may contain not only A, C, G, and T but also N and X, which are used when base information has been lost, for experimental or other reasons, while the sequence was being read from the DNA. In this paper, we present an efficient local alignment algorithm under the affine gap penalty metric for two DNA sequences that contain both N and X. Our algorithm extends the Kim-Park algorithm, which handles only N, so that it handles both N and X, and it generalizes directly to additional characters with properties similar to those of N and X.
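For orientation, the problem setting can be illustrated with the standard quadratic-time local alignment recurrence with affine gaps (Smith-Waterman with Gotoh's refinement, references 7 and 8). This is a minimal baseline sketch, not the efficient algorithm of this paper or the Kim-Park algorithm. The scoring values are assumptions chosen only for illustration: N scores 0 against any character, X scores a fixed penalty against any character, and a gap of length k costs `gap_open + k * gap_extend`. The function and parameter names are hypothetical.

```python
NEG_INF = float("-inf")


def score(a, b, match=1, mismatch=-1, n_score=0, x_score=-1):
    """Score one aligned pair of characters.

    Assumed scoring (not taken from the paper): X against anything
    scores x_score, N against anything else scores n_score.
    """
    if a == "X" or b == "X":
        return x_score
    if a == "N" or b == "N":
        return n_score
    return match if a == b else mismatch


def local_align_affine(s, t, gap_open=2, gap_extend=1):
    """Return the best local alignment score of s and t.

    Baseline O(len(s) * len(t)) dynamic program with affine gaps,
    keeping only two rows of the table in memory.
    """
    n = len(t)
    prev_h = [0] * (n + 1)        # best score ending at (i-1, j)
    prev_f = [NEG_INF] * (n + 1)  # best score ending with a gap in t
    best = 0
    for i in range(1, len(s) + 1):
        cur_h = [0] * (n + 1)
        cur_f = [NEG_INF] * (n + 1)
        e = NEG_INF               # best score ending with a gap in s
        for j in range(1, n + 1):
            # Extend an existing gap or open a new one.
            e = max(e - gap_extend, cur_h[j - 1] - gap_open - gap_extend)
            cur_f[j] = max(prev_f[j] - gap_extend,
                           prev_h[j] - gap_open - gap_extend)
            diag = prev_h[j - 1] + score(s[i - 1], t[j - 1])
            # Local alignment: a score never drops below 0.
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

Under these assumed scores, `local_align_affine("ACNT", "ACGT")` treats the N as a neutral position, while an X at the same position lowers the score. The baseline visits every cell of the table, including cells inside long runs of N or X.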

References

  1. D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, New York, 1997.
  2. P. Green, PHRAP, http://www.phrap.org.
  3. E.W. Myers, G.G. Sutton, A.L. Delcher, I.M. Dew, D.P. Fasulo, et al., A Whole-Genome Assembly of Drosophila, Science, 287, pp.2196-2204, 2000. https://doi.org/10.1126/science.287.5461.2196
  4. S. Batzoglou, D.B. Jaffe, K. Stanley, J. Butler, et al., ARACHNE: A Whole-Genome Shotgun Assembler, Genome Research, 12, pp.177-189, 2002. https://doi.org/10.1101/gr.208902
  5. J. Wang, G.K. Wong, P. Ni, et al., RePS: A Sequence Assembler that Masks Exact Repeats Identified from the Shotgun Data, Genome Research, 12, pp.824-831, 2002. https://doi.org/10.1101/gr.165102
  6. J.W. Kim, K. Roh, K. Park, H. Park, J. Seo, MLP: Mate-Based Sequence Layout with PHRAP, Bioinformatics and Biosystems, 1(1), pp.61-66, 2006.
  7. T.F. Smith, M.S. Waterman, Identification of Common Molecular Subsequences, Journal of Molecular Biology, 147, pp.195-197, 1981. https://doi.org/10.1016/0022-2836(81)90087-5
  8. O. Gotoh, An Improved Algorithm for Matching Biological Sequences, Journal of Molecular Biology, 162, pp.705-708, 1982. https://doi.org/10.1016/0022-2836(82)90398-9
  9. J.W. Kim, K. Park, An Efficient Alignment Algorithm for Masked Sequences, Theoretical Computer Science, 370, pp.19-33, 2007. https://doi.org/10.1016/j.tcs.2006.10.003
  10. NC-IUB, Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences. Recommendations 1984, European Journal of Biochemistry, 150, pp.1-5, 1985. https://doi.org/10.1111/j.1432-1033.1985.tb08977.x
  11. J.W. Kim, A. Amir, G.M. Landau, K. Park, Computing Similarity of Run-Length Encoded Strings with Affine Gap Penalty, Theoretical Computer Science, 395, pp.268-282, 2008. https://doi.org/10.1016/j.tcs.2008.01.008