Search | Korea Science

Yoon Joo-Young;Park Kun-Soo
- Proceedings of the Korean Information Science Society Conference
- 2006.06a
- pp.406-408
- 2006
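The string B-tree is a data structure that supports efficient search over string data stored in external memory. In this paper, we present a complete new implementation of the string B-tree that uses a previously studied new branching algorithm. We also discuss, based on experiments, which implementation choices maximize efficiency.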

Choe, Jeong-Hyeon;Jin, Hui-Jeong;Jo, Hwan-Gyu
- Proceedings of the Korean Information Science Society Conference
- 2001.04a
- pp.748-750
- 2001
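The Human Genome Project (HGP) recently released a draft of the human genome sequence. There are many ways to analyze the sequence of an organism, and one of them is k-mer analysis. A k-mer is a contiguous subsequence of length k within a genomic sequence. k-mer analysis explores properties such as the frequency distribution and the symmetry of the k-mers in a sequence. However, because a genomic sequence is a very large text, existing in-memory algorithms cannot process it when k is large, so efficient data structures and algorithms are needed. In this paper, we present a k-mer analysis method based on the string B-tree, which is well suited to pattern matching and supports external memory. We describe our implementation and report several experimental results.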
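The definition of a k-mer above can be illustrated with a minimal in-memory sketch. This is not the string B-tree method the paper proposes; it is the kind of simple in-memory counting that stops scaling for genome-sized inputs and large k. The function name `count_kmers` and the sample sequence are hypothetical.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every contiguous length-k substring (k-mer) of `sequence`.

    Illustrative in-memory approach only: the whole sequence and the
    whole frequency table must fit in RAM.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n has n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy sequence, not data from the paper.
counts = count_kmers("ACGTACGT", 2)
for kmer, freq in sorted(counts.items()):
    print(kmer, freq)
```

Because a DNA alphabet admits up to 4^k distinct k-mers, the frequency table alone grows quickly with k, which is one reason an external-memory index becomes attractive.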

Choe, Jeong-Hyeon;Jo, Hwan-Gyu
- The KIPS Transactions:PartA
- v.8A no.4
- pp.509-516
- 2001
As a result of many genome projects, the genomic sequences of many organisms have been revealed. Various methods, such as global and local alignment, are used to analyze these sequences, and k-mer analysis is one of them. k-mer analysis explores the frequencies of all k-mers, or their symmetry, where a k-mer is a contiguous base sequence of length k. However, existing in-memory algorithms are not applicable to k-mer analysis because a whole genomic sequence is usually a very large text, so efficient data structures and algorithms are needed. The string B-tree is a data structure that supports external memory and is well suited to pattern matching. In this paper, we improve the string B-tree so that it can be applied efficiently to k-mer analysis, and we report the results of k-mer analysis for C. elegans and 30 other genomic sequences. We present a visualization system that lets users investigate the distribution and symmetry of the frequencies of all k-mers using CGR (Chaos Game Representation). We also describe a method for finding the signature, which is the part of a sequence that is similar to the whole genomic sequence.
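As a rough illustration of the CGR mentioned above, the following sketch maps a DNA sequence to points in the unit square: starting from the center, each base moves the current point halfway toward the corner assigned to that base. The corner assignment used here (A, C, G, T at the bottom-left, top-left, top-right, and bottom-right) is a commonly used convention and is an assumption; the abstract does not say which layout the authors used, and `cgr_points` is a hypothetical name.

```python
# Assumed corner layout; the paper's own layout is not given in the abstract.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(sequence: str) -> list[tuple[float, float]]:
    """Return one CGR point per recognized base in `sequence`."""
    x, y = 0.5, 0.5  # start at the center of the unit square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


points = cgr_points("AG")
print(points)
```

After k steps, a point falls inside a sub-square of side 2^-k determined by the last k bases, so the density of points in each sub-square reflects the frequency of the corresponding k-mer.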

Choe, Jeong-Hyeon;Jin, Hui-Jeong;Kim, Cheol-Min;Jang, Cheol-Hun;Jo, Hwan-Gyu
- The KIPS Transactions:PartA
- v.9A no.3
- pp.387-398
- 2002
As the whole genome sequences of many organisms have been revealed by small-scale genome projects, intensive research on individual genes and their functions has been carried out. However, in-memory algorithms are inefficient for analyzing whole genome sequences, since the size of a single whole genome ranges from several million base pairs to hundreds of billions of base pairs. To manipulate such huge sequence data effectively, an indexed data structure for external memory is needed. In this paper, we introduce a workbench system for the analysis and visualization of whole genome sequences using the string B-tree, which is suitable for analyzing huge data. The system consists of two parts: an analysis query part and a visualization part. The query system supports various operations such as sequence search, k-occurrence, and k-mer analysis. The visualization system helps biologists easily understand the overall structure and specific features of a genome through several kinds of visualization, such as the whole genome sequence, annotation, CGR (Chaos Game Representation), k-mer, and RWP (Random Walk Plot). Using our workbench, one can find relations among organisms, predict the genes in a genome, and study the function of junk DNA.
https://doi.org/10.3745/KIPSTA.2002.9A.3.387