• Title/Summary/Keyword: DNA sequence (DNA 시퀀스)


Alternative Splicing Pattern Analysis from RNA-Seq data (RNA-Seq 데이터를 이용한 선택 스플라이싱 유형 분석)

  • Kong, Jin-Hwa;Lee, Jong-Keun;Lee, Un-Joo;Yoon, Jee-Hee
    • Proceedings of the Korean Information Science Society Conference / 2011.06a / pp.37-40 / 2011
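  • Alternative splicing is the process by which the exons of a pre-mRNA, the precursor of messenger RNA (mRNA), are rejoined in different patterns as the pre-mRNA is processed into mature mRNA. Through alternative splicing, a single gene gives rise to different mRNAs and therefore to different protein isoforms. About seven types of alternative splicing are known to date, and they are known to be closely associated with gene mutations and disease. In this study, we propose a new algorithm that classifies and extracts the alternative splicing type of each gene region from RNA-Seq data generated by next-generation sequencing (NGS). The proposed algorithm maps the RNA-Seq data simultaneously to the DNA sequence and to mRNA transcript sequences, and uses the coverage of the RNA-Seq reads aligned to each exon region together with exon junction information to determine which transcripts are expressed and in what quantity. To demonstrate the validity of the algorithm, we ran alternative splicing extraction experiments on human gene regions using simulated data and verified the results against a validated alternative splicing database.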

SNP Analysis Method for Next-generation Sequencing Data (차세대 시퀀싱 데이터를 위한 SNP 분석 방법)

  • Hong, Sang-kyoon;Lee, Deok-hae;Kong, Jin-hwa;Kim, Deok-Keun;Hong, Dong-wan;Yoon, Jee-hee
    • Proceedings of the Korea Information Processing Society Conference / 2010.11a / pp.95-98 / 2010
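  • With the recent rapid advance of next-generation sequencing technology, reading sequence information has become comparatively easy, and expectations for and interest in personalized medicine are growing. Individual sequences differ through various genetic structural variations such as SNPs (single nucleotide polymorphisms), indels, and CNVs (copy number variations), and these local differences are closely related to each individual's genetic traits and disease susceptibility. In this study, we propose an SNP extraction algorithm that uses read data, the large number of short DNA sequence fragments produced by next-generation sequencing. The proposed algorithm extracts SNP candidate regions based on the mapping information of the read sequences at each position of the reference sequence, and uses information such as quality scores to minimize the error rate. We also implement a visual analysis tool that supports efficient storage and retrieval of large-scale sequencing data and SNP variation data, and use it to verify the usefulness of the proposed approach.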
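
As a rough illustration of position-wise SNP candidate extraction from mapped reads, the sketch below piles up read bases on each reference position, ignores low-quality base calls, and reports positions where a non-reference base dominates. The function name `call_snp_candidates`, the read tuple layout, and the thresholds `min_qual`, `min_depth`, and `min_frac` are assumptions made for this example; the abstract does not specify the authors' actual criteria.

```python
from collections import Counter, defaultdict

def call_snp_candidates(reference, reads, min_qual=20, min_depth=3, min_frac=0.8):
    """Return {position: alt_base} where a non-reference base dominates the pileup.

    reads: iterable of (start, bases, quals) for gap-free alignments, 0-based start.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    pileup = defaultdict(Counter)
    for start, bases, quals in reads:
        for offset, (base, qual) in enumerate(zip(bases, quals)):
            if qual >= min_qual:  # skip low-quality base calls
                pileup[start + offset][base] += 1

    candidates = {}
    for pos, counts in sorted(pileup.items()):
        depth = sum(counts.values())
        base, n = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos] and n / depth >= min_frac:
            candidates[pos] = base
    return candidates

reference = "ACGTACGT"
reads = [
    (0, "ACGAAC", [30] * 6),
    (1, "CGAACG", [30] * 6),
    (2, "GAACGT", [30] * 6),
    (3, "TACGT", [5, 30, 30, 30, 30]),  # the reference base at position 3 is low quality
]
print(call_snp_candidates(reference, reads))  # {3: 'A'}
```

Filtering on base quality before counting keeps a single unreliable base call from masking or creating a candidate, which is the role the abstract assigns to quality information.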

Detecting Software Similarity Using API Sequences on Static Major Paths (정적 주요 경로 API 시퀀스를 이용한 소프트웨어 유사성 검사)

  • Park, Seongsoo;Han, Hwansoo
    • Journal of KIISE / v.41 no.12 / pp.1007-1012 / 2014
  • Software birthmarks are used to detect software plagiarism. For binaries, however, only a few birthmarks have been developed. In this paper, we propose a static approach that generates API sequences along major paths, which are identified by analyzing the control flow graphs of the binaries. Because our API sequences are extracted along the most plausible paths of the binary code, they can represent the actual API sequences produced by executing the binary, but in a more concise form. Our similarity measure uses the Smith-Waterman algorithm, a popular sequence alignment algorithm from DNA sequence analysis. We evaluate our static path-based API sequences on multiple versions of five applications. Our experiments indicate that the proposed method provides a quite reliable similarity birthmark for binaries.
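
To make the alignment step concrete, here is a minimal sketch of Smith-Waterman local alignment applied to two API call sequences. The scoring values, the normalization in `similarity`, and the example API names are assumptions chosen for illustration; the abstract does not give the authors' scoring scheme or how they turn alignment scores into a similarity value.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two token sequences (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def similarity(a, b, match=2):
    """Normalize by the best score the shorter sequence could reach (an assumption)."""
    shorter = min(len(a), len(b))
    return smith_waterman(a, b, match=match) / (match * shorter) if shorter else 0.0

# Hypothetical API sequences extracted from two binaries.
original = ["CreateFile", "ReadFile", "VirtualAlloc", "WriteFile", "CloseHandle"]
suspect = ["CreateFile", "ReadFile", "WriteFile", "CloseHandle", "ExitProcess"]
print(similarity(original, suspect))  # 0.7
```

Local alignment tolerates inserted or removed calls (here `VirtualAlloc`) at the cost of a gap penalty, so two sequences that share a long common run still score highly.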

A CNV detection algorithm based on statistical analysis of the aligned reads (정렬된 리드의 통계적 분석을 기반으로 하는 CNV 검색 알고리즘)

  • Hong, Sang-Kyoon;Hong, Dong-Wan;Yoon, Jee-Hee;Kim, Baek-Sop;Park, Sang-Hyun
    • The KIPS Transactions:PartD / v.16D no.5 / pp.661-672 / 2009
  • Recently it was found that various genetic structural variations such as CNV (copy number variation) exist in the human genome, and that these variations are closely related to disease susceptibility, response to treatment, and genetic characteristics. In this paper we propose a new CNV detection algorithm that uses millions of short DNA sequences generated by giga-sequencing technology. Our method maps the DNA sequences onto the reference sequence and obtains the occurrence frequency of each read in the reference sequence. It then analyzes the distribution of the occurrence frequencies and reports statistically significant regions longer than 1 kbp as candidate CNV regions. To select a proper read alignment method, we employ several alignment methods in our algorithm and compare their performance. To verify the superiority of our approach, we performed extensive experiments. Simulation experiments (using NCBI build 35 as the reference sequence) showed that our approach successfully finds all the CNV regions of various shapes and arbitrary length (small, intermediate, or large).
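
The idea of flagging statistically unusual regions in the read-frequency distribution can be sketched as follows. This is a simplified stand-in, not the authors' statistical test: it assumes reads have already been counted per fixed-size window, uses a plain z-score against the global mean, and the names `candidate_cnv_regions`, `window_size`, `z_cut`, and `min_len` are hypothetical.

```python
from statistics import mean, pstdev

def candidate_cnv_regions(window_counts, window_size=100, z_cut=2.0, min_len=1000):
    """Merge consecutive windows whose read count deviates from the mean by more
    than z_cut standard deviations; keep merged runs of at least min_len bp.

    Returns (start_bp, end_bp, "gain" | "loss") tuples with an exclusive end.
    """
    mu, sigma = mean(window_counts), pstdev(window_counts)
    if sigma == 0:
        return []  # perfectly flat coverage: nothing stands out
    regions, run_start, run_kind = [], None, None
    # A sentinel equal to the mean closes a run that reaches the last window.
    for i, count in enumerate(list(window_counts) + [mu]):
        z = (count - mu) / sigma
        kind = "gain" if z > z_cut else "loss" if z < -z_cut else None
        if kind != run_kind:
            if run_kind and (i - run_start) * window_size >= min_len:
                regions.append((run_start * window_size, i * window_size, run_kind))
            run_start, run_kind = i, kind
    return regions

# 100 bp windows: a 1.2 kbp stretch with doubled read counts.
counts = [50] * 100 + [100] * 12 + [50] * 100
print(candidate_cnv_regions(counts))  # [(10000, 11200, 'gain')]
```

The `min_len` filter mirrors the abstract's restriction to regions longer than 1 kbp; shorter runs of unusual windows are discarded.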

DNA Watermarking Method based on Random Codon Circular Code (랜덤 코돈 원형 부호 기반의 DNA 워터마킹)

  • Lee, Suk-Hwan;Kwon, Seong-Geun;Kwon, Ki-Ryong
    • Journal of Korea Multimedia Society / v.16 no.3 / pp.318-329 / 2013
  • This paper proposes a DNA watermarking method for privacy protection and the prevention of illegal copying. The proposed method allocates codons to random circular angles using a random mapping table and selects the triplet codons to embed in based on the Lipschitz regularity value of the local modulus maxima of the codon circular angles. The watermark is then embedded into the circular angles of the triplet codons without changing the amino acids encoded by the DNA. The length and location of the target triplet codons depend on the random mapping table for the 64 codons, which include the start and stop codons. This table serves as the watermark key and can be applied to any codon sequence regardless of its length. If the table is unknown, it is very difficult to determine the length and location of the target codons in order to extract the watermark. We compared our method with the DNA-crypt watermarking method of Heider at similar capacity. The evaluation showed that our method has a lower base change rate than DNA-crypt and a lower bit error rate under point mutations and insertions/deletions. Furthermore, we verified that the entropy of the random mapping table and of the locations of the triplet codons is high, which indicates a high level of watermark security.

Parallel Approximate String Matching with k-Mismatches for Multiple Fixed-Length Patterns in DNA Sequences on Graphics Processing Units (GPU을 이용한 다중 고정 길이 패턴을 갖는 DNA 시퀀스에 대한 k-Mismatches에 의한 근사적 병열 스트링 매칭)

  • Ho, ThienLuan;Kim, HyunJin;Oh, SeungRohk
    • The Transactions of The Korean Institute of Electrical Engineers / v.66 no.6 / pp.955-961 / 2017
  • In this paper, we propose a parallel approximate string matching algorithm with k-mismatches for multiple fixed-length patterns (PMASM) in DNA sequences. PMASM is developed from parallel single-pattern approximate string matching algorithms to efficiently calculate the Hamming distances for multiple patterns of a fixed length. In the preprocessing phase, all target patterns are binary encoded and stored in a look-up memory. With each character read from the input string, the Hamming distances between a substring and all patterns are updated at the same time, based on the binary encoding information in the look-up memory. PMASM also uses graphics processing units (GPUs) to perform the computations in parallel. This paper presents three GPU implementations of PMASM: thread PMASM, block-thread PMASM, and shared-mem PMASM. The shared-mem PMASM method illustrates how to make effective use of the GPU's parallel capacity and exploits special features of the CUDA (Compute Unified Device Architecture) memory structure to optimize performance. In experiments with DNA sequences, the proposed PMASM on GPU is 385, 77, and 64 times faster than the traditional naive algorithm, the shift-add algorithm, and the single-thread PMASM implementation on CPU, respectively. On the same NVIDIA GPU model, the proposed approach improves performance by up to 44% and 21% compared with the naive and shift-add algorithms, respectively.
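
The per-character update described above can be illustrated with a sequential CPU sketch. This is not the authors' GPU kernel and does no bit packing: it assumes a plain look-up table of mismatch flags per character and one running distance counter per active window. The function name `k_mismatch_search` and the table layout are hypothetical.

```python
def k_mismatch_search(text, patterns, k):
    """Yield (start, pattern_index, distance) for every window of `text` that is
    within Hamming distance k of a pattern. All patterns must share one length m."""
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns)
    # Preprocessing: for each character, the per-position mismatch flags of each pattern.
    alphabet = set(text) | set("".join(patterns))
    lookup = {c: [[int(p[j] != c) for j in range(m)] for p in patterns] for c in alphabet}
    # dist[p][s % m] accumulates the distance of the window that starts at s.
    dist = [[0] * m for _ in patterns]
    for i, c in enumerate(text):
        for p, flags in enumerate(lookup[c]):
            row = dist[p]
            # Position i is character j of the window that starts at i - j.
            for j in range(min(i, m - 1) + 1):
                row[(i - j) % m] += flags[j]
            if i >= m - 1:
                start = i - m + 1
                d = row[start % m]
                row[start % m] = 0  # this slot is reused by the window starting at i + 1
                if d <= k:
                    yield start, p, d

text = "ACGTACGTTT"
patterns = ["ACGT", "CGTA"]
print(list(k_mismatch_search(text, patterns, k=1)))
# [(0, 0, 0), (1, 1, 0), (4, 0, 0), (5, 1, 1)]
```

Each input character touches every pattern independently, which is the property a GPU implementation can exploit by assigning patterns or text segments to separate threads.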

A DNA Index Structure using Frequency and Position Information of Genetic Alphabet (염기문자의 빈도와 위치정보를 이용한 DNA 인덱스구조)

  • Kim Woo-Cheol;Park Sang-Hyun;Won Jung-Im;Kim Sang-Wook;Yoon Jee-Hee
    • Journal of KIISE:Databases / v.32 no.3 / pp.263-275 / 2005
  • In large DNA databases, indexing techniques are widely used for rapid approximate sequence searching. However, most indexing techniques require more space than the original database and are difficult to integrate seamlessly with a DBMS. In this paper, we propose a space-efficient, disk-based indexing and query processing algorithm for approximate DNA sequence searching, specifically for exact match queries, wildcard match queries, and k-mismatch queries. Our indexing method places a sliding window at every possible location of a DNA sequence and extracts its signature by considering the occurrence frequency of each nucleotide. It then stores the set of signatures in a multi-dimensional index such as an R*-tree. In particular, by assigning a weight to each position of a window, it prevents signatures from concentrating around a few spots in the index space. Our query processing algorithm converts a query sequence into a multi-dimensional rectangle and searches the index for the signatures that overlap the rectangle. Experiments with real biological data sets showed that the proposed method is at least three times, twice, and several orders of magnitude faster than the suffix-tree-based method for exact match, wildcard match, and k-mismatch queries, respectively.
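
A small sketch may help show why nucleotide-frequency signatures work as a filter for k-mismatch queries. It assumes unweighted counts of A, C, G, and T as the signature and replaces the R*-tree with a linear scan over the signature list, so it demonstrates the rectangle test only, not the disk-based index or the position weights from the paper. The names `signature`, `build_index`, and `k_mismatch_query` are hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def signature(window):
    """4-dimensional point: occurrence count of each nucleotide in the window."""
    counts = Counter(window)
    return tuple(counts[b] for b in BASES)

def build_index(sequence, w):
    """Signature of every length-w window, paired with its start position."""
    return [(signature(sequence[i:i + w]), i) for i in range(len(sequence) - w + 1)]

def k_mismatch_query(sequence, index, query, k):
    """Return (start, distance) for windows within Hamming distance k of the query.

    k substitutions change each nucleotide count by at most k, so any true match
    has its signature inside the rectangle [count - k, count + k] per dimension.
    """
    q = signature(query)
    hits = []
    for sig, start in index:
        if all(abs(s - c) <= k for s, c in zip(sig, q)):  # rectangle filter
            window = sequence[start:start + len(query)]
            d = sum(x != y for x, y in zip(window, query))  # verify the candidate
            if d <= k:
                hits.append((start, d))
    return hits

sequence = "ACGTACGTTT"
index = build_index(sequence, w=4)
print(k_mismatch_query(sequence, index, "CGTA", k=1))  # [(1, 0), (5, 1)]
```

The filter can admit false positives (windows with similar composition but different order), which is why each candidate is verified against the sequence, but it never drops a true match.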

Cloning and Characterization of a 5-Enolpyruvyl Shikimate 3-Phosphate Synthase (EPSPS) Gene from Korean Lawn Grass (Zoysia japonica) (들잔디 5-Enolpyruvyl Shikimate 3-Phosphate Synthase(EPSPS) 유전자 클로닝 및 특성)

  • Lee, Hye-Jung;Lee, Geung-Joo;Kim, Dong-Sub;Kim, Jin-Beak;Ku, Ja-Hyeong;Kang, Si-Yong
    • Horticultural Science & Technology / v.28 no.4 / pp.648-655 / 2010
  • This study is the first comprehensive report on the molecular cloning, structural characterization, sequence comparison between wild and mutant types, copy number in the genome, expression features, and activity of a gene encoding 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) in Korean lawn grass (Zoysia japonica). The full-length cDNA of the EPSPS from Korean lawn grass (zjEPSPS), obtained by 3' and 5' RACE, was 1540 bp long and contained a 1176 bp ORF, a 144 bp leader sequence (5' UTR), and a 220 bp 3' UTR; the ORF encodes a protein of 391 amino acid residues with a molecular mass of 41.74 kDa. Southern blot detection of zjEPSPS showed that the gene exists as a single copy in the Korean lawn grass genome. Sequence comparison of the zjEPSPS gene showed that the glyphosate-tolerant (GT) mutant carries a Pro-53 to Ser substitution, which appears to make the enzyme bind phosphoenolpyruvate (PEP) in preference to glyphosate and thus allows the continuous synthesis of aromatic amino acids in the shikimate pathway. Northern blot analysis showed that zjEPSPS was highly expressed, with a continuous increase until 36 hours after 0.5% glyphosate treatment in both wild and mutant samples, but 1.5-fold higher EPSP synthase activity was observed in the tolerant mutant when exposed to the glyphosate treatment. The molecular information on the zjEPSPS gene obtained in this study needs to be dissected further before it can be applied more effectively to the development of gene-specific DNA markers and zoysiagrass cultivars; nevertheless, the glyphosate-tolerant mutant carrying the featured zjEPSPS gene can be provided to turfgrass managers as a timely, adoptable management option for weed problems.
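
The lengths reported in the abstract can be cross-checked with a few lines of arithmetic. The only assumption here is that the 1176 bp ORF length includes the stop codon, which is what makes 392 codons correspond to 391 residues.

```python
# Lengths reported in the abstract, in base pairs.
utr5, orf, utr3 = 144, 1176, 220

cdna_length = utr5 + orf + utr3  # should equal the reported 1540 bp
codons = orf // 3                # an ORF is a whole number of codons
residues = codons - 1            # assumption: the ORF length includes the stop codon

# Average mass per residue implied by the reported 41.74 kDa.
avg_residue_mass_da = 41.74 * 1000 / residues

print(cdna_length, codons, residues, round(avg_residue_mass_da, 1))
```

The implied average of roughly 107 Da per residue is in the usual range for proteins, so the reported ORF length, residue count, and molecular mass are mutually consistent.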

An Efficient Algorithm for Determining Probe Specificity in DNA Chips (유전자 칩에서 Probe Specificity를 판별하기 위한 효율적인 알고리즘)

  • Kwon Young-Dae;Park Kyoung-Wook;Lim Hyeong-Seok
    • Proceedings of the Korean Information Science Society Conference / 2005.07a / pp.913-915 / 2005
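  • The accuracy of a DNA chip is determined by the probes that serve as identifiers for the individual genes. To avoid hybridization errors, probes placed on the chip are chosen with factors such as duplex structure, melting temperature, and CG composition in mind. Specificity is also considered, to minimize cross-hybridization with other genes. Because verifying a probe's specificity requires searching against all genes, it demands a great deal of computation for large chromosomes. In this paper, we present an efficient algorithm for specificity verification. The proposed algorithm uses a hash table so that only those gene sequences that could cause a probe to fail the specificity requirement are searched and compared. Experimental results show that the proposed algorithm is more efficient than existing algorithms.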
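
One way to picture the hash-table filtering is the sketch below. It assumes a specific, simplified specificity criterion (a probe fails if any non-target gene contains a window within `max_mismatches` substitutions of it) and uses exact k-mer seeds as the hash keys; neither detail is stated in the abstract, and the names `build_seed_table`, `min_hamming`, and `is_specific` are hypothetical.

```python
from collections import defaultdict

def build_seed_table(genes, k):
    """Hash table mapping each k-mer to the set of genes that contain it."""
    table = defaultdict(set)
    for name, seq in genes.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(name)
    return table

def min_hamming(probe, seq):
    """Smallest Hamming distance between the probe and any same-length window of seq."""
    m = len(probe)
    return min((sum(a != b for a, b in zip(probe, seq[i:i + m]))
                for i in range(len(seq) - m + 1)), default=m)

def is_specific(probe, target, genes, table, k, max_mismatches):
    """True if no non-target gene matches the probe within max_mismatches.

    Pigeonhole argument: a window with at most d mismatches shares an exact k-mer
    with the probe whenever k <= len(probe) // (d + 1), so only genes found
    through the hash table need a full comparison.
    """
    assert k <= len(probe) // (max_mismatches + 1)
    candidates = set()
    for i in range(len(probe) - k + 1):
        candidates |= table.get(probe[i:i + k], set())
    candidates.discard(target)
    return all(min_hamming(probe, genes[g]) > max_mismatches for g in candidates)

# Toy gene set; sequences are made up for the example.
genes = {"g1": "ACGTTGCAACGGTA", "g2": "TTTTACGTTGCTAAAA", "g3": "GGGGGGCCCCCCGGGG"}
table = build_seed_table(genes, k=4)
print(is_specific("ACGTTGCA", "g1", genes, table, k=4, max_mismatches=1))  # False
print(is_specific("TGCAACGG", "g1", genes, table, k=4, max_mismatches=1))  # True
```

For the second probe no other gene shares a seed, so it is accepted without a single full comparison, which is where the savings over scanning every gene come from.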


DNA Sequence Alignment Using a Graph-based Distributed System (그래프 기반 분산 시스템을 이용한 염기 서열 정렬)

  • Lee, Jun-Su;Ahn, Jae-Gyoon;Yeu, Yun-Ku;Roh, Hong-Chan;Park, Sang-Hyun
    • Proceedings of the Korea Information Processing Society Conference / 2013.05a / pp.894-897 / 2013
  • Sequence alignment is one of the most widely used tools in genomics. With the recent development of next-generation sequencing (NGS) technology, data production has grown sharply, and with it the need for sequence alignment algorithms with high throughput. The alignment algorithm proposed in this paper transforms sequence data into a graph and then performs the alignment using Trinity, Microsoft's graph-based in-memory distributed system. On the Trinity system, our algorithm successfully aligned simulated sequence data and ran faster as the number of slaves increased, demonstrating its scalability.