• Title/Summary/Keyword: 시퀀스 데이타베이스

Search Result 37, Processing Time 0.022 seconds

A DNA Index Structure using Frequency and Position Information of Genetic Alphabet (염기문자의 빈도와 위치정보를 이용한 DNA 인덱스구조)

  • Kim Woo-Cheol;Park Sang-Hyun;Won Jung-Im;Kim Sang-Wook;Yoon Jee-Hee
    • Journal of KIISE:Databases
    • /
    • v.32 no.3
    • /
    • pp.263-275
    • /
    • 2005
  • In a large DNA database, indexing techniques are widely used for rapid approximate sequence searching. However, most indexing techniques require a space larger than original databases, and also suffer from difficulties in seamless integration with DBMS. In this paper, we suggest a space-efficient and disk-based indexing and query processing algorithm for approximate DNA sequence searching, specially exact match queries, wildcard match queries, and k-mismatch queries. Our indexing method places a sliding window at every possible location of a DNA sequence and extracts its signature by considering the occurrence frequency of each nucleotide. It then stores a set of signatures using a multi-dimensional index, such as R*-tree. Especially, by assigning a weight to each position of a window, it prevents signatures from being concentrated around a few spots in index space. Our query processing algorithm converts a query sequence into a multi-dimensional rectangle and searches the index for the signatures overlapped with the rectangle. The experiments with real biological data sets revealed that the proposed method is at least three times, twice, and several orders of magnitude faster than the suffix-tree-based method in exact match, wildcard match, and k- mismatch, respectively.

XML Document Filtering based on Segments (세그먼트 기반의 XML 문서 필터링)

  • Kwon, Joon-Ho;Rao, Praveen;Moon, Bong-Ki;Lee, Suk-Ho
    • Journal of KIISE:Databases
    • /
    • v.35 no.4
    • /
    • pp.368-378
    • /
    • 2008
  • In recent years, publish-subscribe (pub-sub) systems based on XML document filtering have received much attention. In a typical pub-sub system, subscribed users specify their interest in profiles expressed in the XPath language, and each new content is matched against the user profiles so that the content is delivered to only the interested subscribers. As the number of subscribed users and their profiles can grow very large, the scalability of the system is critical to the success of pub-sub services. In this paper, we propose a fast and scalable XML filtering system called SFiST which is an extension of the FiST system. Sharable segments are extracted from twig patterns and stored into the hash-based Segment Table in SFiST system. Segments are used to represent user profiles as Terse Sequences and stored in the Compact Segment Index during filtering. Our experimental study shows that SFiST system has better performance than FiST system in terms of filtering time and memory usage.

A Space Efficient Indexing Technique for DNA Sequences (공간 효율적인 DNA 시퀀스 인덱싱 방안)

  • Song, Hye-Ju;Park, Young-Ho;Loh, Woong-Kee
    • Journal of KIISE:Databases
    • /
    • v.36 no.6
    • /
    • pp.455-465
    • /
    • 2009
  • Suffix trees are widely used in similar sequence matching for DNA. They have several problems such as time consuming, large space usages of disks and memories and data skew, since DNA sequences are very large and do not fit in the main memory. Thus, in the paper, we present a space efficient indexing method called SENoM, allowing us to build trees without merging phases for the partitioned sub trees. The proposed method is constructed in two phases. In the first phase, we partition the suffixes of the input string based on a common variable-length prefix till the number of suffixes is smaller than a threshold. In the second phase, we construct a sub tree based on the disk using the suffix sets, and then write it to the disk. The proposed method, SENoM eliminates complex merging phases. We show experimentally that proposed method is effective as bellows. SENoM reduces the disk usage less than 35% and reduces the memory usage less than 20% compared with TRELLIS algorithm. SENoM is available to query efficiently using the prefix tree even when the length of query sequence is large.

Generalization of Window Construction for Subsequence Matching in Time-Series Databases (시계열 데이터베이스에서의 서브시퀀스 매칭을 위한 윈도우 구성의 일반화)

  • Moon, Yang-Sae;Han, Wook-Shin;Whang, Kyu-Young
    • Journal of KIISE:Databases
    • /
    • v.28 no.3
    • /
    • pp.357-372
    • /
    • 2001
  • In this paper, we present the concept of generalization in constructing windows for subsequence matching and propose a new subsequence matching method. GeneralMatch, based on the generalization. The earlier work of Faloutsos et al.(FRM in short) causes a lot of false alarms due to lack of the point-filtering effect. DualMatch, which has been proposed by the authors, improves performance significantly over FRM by exploiting the point filtering effect, but it has the problem of having a smaller maximum window size (half that FRM) given the minimum query length. GeneralMatch, an improvement of DualMatch, offers advantages of both methods: it can use large windows like FRM and, at the same time, can exploit the point-filtering effect like DualMatch. GeneralMatch divides data sequences into J-sliding windows (generalized sliding windows) and the query sequence into J-disjoint windows (generalized disjoint windows). We formally prove that our GeneralMatch is correct, i.e., it incurs no false dismissal. We also prove that, given the minimum query length, there is a maximum bound of the window size to guarantee correctness of GeneralMatch. We then propose a method of determining the value of J that minimizes the number of page accesses, Experimental results for real stock data show that, for low selectivities ($10^{-6}~10^{-4}$), GeneralMatch improves performance by 114% over DualMatch and by 998% iver FRM on the average; for high selectivities ($10^{-6}~10^{-4}$), by 46% over DualMatch and by 65% over FRM on the average.

  • PDF

Sequence Group Validation based on Boundary Locking for Valid XML Documents (유효한 XML 문서에 대한 경계 로킹에 기반한 시퀀스 그룹 검증 기법)

  • Choi, Yoon-Sang;Park, Seog
    • Journal of KIISE:Databases
    • /
    • v.32 no.6
    • /
    • pp.628-640
    • /
    • 2005
  • The XML is well accepted in several different Web application areas. As soon as many users and applications work concurrently on the same collection of XML documents, isolating accesses and modifications of different transactions becomes an important issue. When an XML document correctly corresponds to the rules laid out in a DTD or XML schema, it is also said to be valid. The valid XML document's validity should be guaranteed after the document is updated. The validation method mentioned above, however, results in lower degree of concurrency. For getting higher degree of concurrency and minimizing the range of the XML document validity, a new validation method based on a specific locking method is required. In this paper we propose the sequence group validation method for minimizing the range of the XML document validity. We also propose the boundary locking method for isolating accesses and modifications of different transactions while supporting the valid XML document's validity. Finally, the results of some experiments show the validation and locking methods increase the degree of transaction concurrency.

CNVDAT: A Copy Number Variation Detection and Analysis Tool for Next-generation Sequencing Data (CNVDAT : 차세대 시퀀싱 데이터를 위한 유전체 단위 반복 변이 검출 및 분석 도구)

  • Kang, Inho;Kong, Jinhwa;Shin, JaeMoon;Lee, UnJoo;Yoon, Jeehee
    • Journal of KIISE:Databases
    • /
    • v.41 no.4
    • /
    • pp.249-255
    • /
    • 2014
  • Copy number variations(CNVs) are a recently recognized class of human structural variations and are associated with a variety of human diseases, including cancer. To find important cancer genes, researchers identify novel CNVs in patients with a particular cancer and analyze large amounts of genomic and clinical data. We present a tool called CNVDAT which is able to detect CNVs from NGS data and systematically analyze the genomic and clinical data associated with variations. CNVDAT consists of two modules, CNV Detection Engine and Sequence Analyser. CNV Detection Engine extracts CNVs by using the multi-resolution system of scale-space filtering, enabling the detection of the types and the exact locations of CNVs of all sizes even when the coverage level of read data is low. Sequence Analyser is a user-friendly program to view and compare variation regions between tumor and matched normal samples. It also provides a complete analysis function of refGene and OMIM data and makes it possible to discover CNV-gene-phenotype relationships. CNVDAT source code is freely available from http://dblab.hallym.ac.kr/CNVDAT/.

An Efficient Algorithm for Streaming Time-Series Matching that Supports Normalization Transform (정규화 변환을 지원하는 스트리밍 시계열 매칭 알고리즘)

  • Loh, Woong-Kee;Moon, Yang-Sae;Kim, Young-Kuk
    • Journal of KIISE:Databases
    • /
    • v.33 no.6
    • /
    • pp.600-619
    • /
    • 2006
  • According to recent technical advances on sensors and mobile devices, processing of data streams generated by the devices is becoming an important research issue. The data stream of real values obtained at continuous time points is called streaming time-series. Due to the unique features of streaming time-series that are different from those of traditional time-series, similarity matching problem on the streaming time-series should be solved in a new way. In this paper, we propose an efficient algorithm for streaming time- series matching problem that supports normalization transform. While the existing algorithms compare streaming time-series without any transform, the algorithm proposed in the paper compares them after they are normalization-transformed. The normalization transform is useful for finding time-series that have similar fluctuation trends even though they consist of distant element values. The major contributions of this paper are as follows. (1) By using a theorem presented in the context of subsequence matching that supports normalization transform[4], we propose a simple algorithm for solving the problem. (2) For improving search performance, we extend the simple algorithm to use $k\;({\geq}\;1)$ indexes. (3) For a given k, for achieving optimal search performance of the extended algorithm, we present an approximation method for choosing k window sizes to construct k indexes. (4) Based on the notion of continuity[8] on streaming time-series, we further extend our algorithm so that it can simultaneously obtain the search results for $m\;({\geq}\;1)$ time points from present $t_0$ to a time point $(t_0+m-1)$ in the near future by retrieving the index only once. (5) Through a series of experiments, we compare search performances of the algorithms proposed in this paper, and show their performance trends according to k and m values. To the best of our knowledge, since there has been no algorithm that solves the same problem presented in this paper, we compare search performances of our algorithms with the sequential scan algorithm. The experiment result showed that our algorithms outperformed the sequential scan algorithm by up to 13.2 times. The performances of our algorithms should be more improved, as k is increased.