• 제목/요약/키워드: sequence database

검색결과 566건 처리시간 0.024초

A Novel Approach for Mining High-Utility Sequential Patterns in Sequence Databases

  • Ahmed, Chowdhury Farhan;Tanbeer, Syed Khairuzzaman;Jeong, Byeong-Soo
    • ETRI Journal
    • /
    • 제32권5호
    • /
    • pp.676-686
    • /
    • 2010
  • Mining sequential patterns is an important research issue in data mining and knowledge discovery with broad applications. However, the existing sequential pattern mining approaches consider only binary frequency values of items in sequences and equal importance/significance values of distinct items. Therefore, they are not applicable to actually represent many real-world scenarios. In this paper, we propose a novel framework for mining high-utility sequential patterns for more real-life applicable information extraction from sequence databases with non-binary frequency values of items in sequences and different importance/significance values for distinct items. Moreover, for mining high-utility sequential patterns, we propose two new algorithms: UtilityLevel is a high-utility sequential pattern mining with a level-wise candidate generation approach, and UtilitySpan is a high-utility sequential pattern mining with a pattern growth approach. Extensive performance analyses show that our algorithms are very efficient and scalable for mining high-utility sequential patterns.

시퀀스 유사도에 기반한 유전체 데이터베이스 압축 및 영향 분석 (The Analysis of Genome Database Compaction based on Sequence Similarity)

  • 권선영;이병한;박승현;조정희;윤성로
    • 정보과학회 컴퓨팅의 실제 논문지
    • /
    • 제23권4호
    • /
    • pp.250-255
    • /
    • 2017
  • 유전체 데이터의 급증 및 정밀의료 등 응용 분야 확대에 따라 유전체 데이터베이스의 효율적 관리에 대한 중요성이 커지고 있다. 전통적인 압축 기법을 통해 유전체 데이터를 압축할 경우, 압축효과는 크지만, 압축된 상태에서 데이터베이스를 비교하거나 검색하는 등의 작업이 용이하지 않게 된다. 유전체 데이터 분석에 소요되는 시간은 데이터베이스에 존재하는 시퀀스 수에 비례하며, 중복되거나 유사한 시퀀스가 다수 존재한다는 점에 착안하여, 본 논문에서는 유전체 데이터베이스 상에 존재하는 유사 시퀀스를 제거함으로써 전체 데이터베이스 크기를 줄이는 기법을 제안한다. 실험을 통해 시퀀스 유사도 1% 기준으로도 전체의 약 84% 시퀀스가 제거되며, 약 10배 빠른 분류분석이 가능함을 보인다. 또한 큰 폭의 압축효과에도 불구하고, 범주 다양성 및 분류 분석 등에 미치는 변화가 미미함을 확인함으로써, 시퀀스 유사도 기반의 제안 압축 기법이 유전체 데이터베이스 압축에 효과적인 방법임을 제시한다.

Theoretical Peptide Mass Distribution in the Non-Redundant Protein Database of the NCBI

  • Lim Da-Jeong;Oh Hee-Seok;Kim Hee-Bal
    • Genomics & Informatics
    • /
    • 제4권2호
    • /
    • pp.65-70
    • /
    • 2006
  • Peptide mass mapping is the matching of experimentally generated peptides masses with the predicted masses of digested proteins contained in a database. To identify proteins by matching their constituent fragment masses to the theoretical peptide masses generated from a protein database, the peptide mass fingerprinting technique is used for the protein identification. Thus, it is important to know the theoretical mass distribution of the database. However, few researches have reported the peptide mass distribution of a database. We analyzed the peptide mass distribution of non-redundant protein sequence database in the NCBI after digestion with 15 different types of enzymes. In order to characterize the peptide mass distribution with different digestion enzymes, a power law distribution (Zipfs law) was applied to the distribution. After constructing simulated digestion of a protein database, rank-frequency plot of peptide fragments was applied to generalize a Zipfs law curve for all enzymes. As a result, our data appear to fit Zipfs law with statistically significant parameter values.

Nucleotide sequence analysis of the 5S ribosomal RNA gene of the mushroom tricholoma matsutake

  • Hwang, Seon-Kap;Kim, Jong-Guk
    • Journal of Microbiology
    • /
    • 제33권2호
    • /
    • pp.136-141
    • /
    • 1995
  • From a cluster of structural rRNA genes which has previsouly been cloned (Hwang and Kim, in submission; J. Microbiol. Biotechnol.), a 1.0-kb Eco RI fragment of DNA which shows significant homology to the 25S and rRNA s of Tricholoma matsutake was used for sequence analysis. Nucleotide sequence was bidirectionally determined using delection series of the DNA fragment. Comparing the resultant 1016-base sequence with sequences in the database, both the 3'end of 25S-rRNA gene and 5S rRNA gene were searched. The 5S rRNA gene is 118-bp in length and is located 158-bp downstream of 3'end of the 25S rRNA gene. IGSI and IGS2 (partial) sequences are also contained in the fragment. Multiple alignment of the 5S rRNA sequences was carried out with 5S rRNA sequences from some members of the subdivision Basidiomycotina obtained from the database. Polygenetic analysis with distance matrix established by Kimura's 2-parameter method and phylogenetic tree by UPGMA method proposed that T. matsutake is closely related to efibulobasidium allbescens. Secondary structure of 5S rRNA was also hypothesized to show similar topology with its generally accepted eukaryotic counterpart.

  • PDF

Protein Sequence Search based on N-gram Indexing

  • Hwang, Mi-Nyeong;Kim, Jin-Suk
    • Bioinformatics and Biosystems
    • /
    • 제1권1호
    • /
    • pp.46-50
    • /
    • 2006
  • According to the advancement of experimental techniques in molecular biology, genomic and protein sequence databases are increasing in size exponentially, and mean sequence lengths are also increasing. Because the sizes of these databases become larger, it is difficult to search similar sequences in biological databases with significant homologies to a query sequence. In this paper, we present the N-gram indexing method to retrieve similar sequences fast, precisely and comparably. This method regards a protein sequence as a text written in language of 20 amino acid codes, adapts N-gram tokens of fixed-length as its indexing scheme for sequence strings. After such tokens are indexed for all the sequences in the database, sequences can be searched with information retrieval algorithms. Using this new method, we have developed a protein sequence search system named as ProSeS (PROtein Sequence Search). ProSeS is a protein sequence analysis system which provides overall analysis results such as similar sequences with significant homologies, predicted subcellular locations of the query sequence, and major keywords extracted from annotations of similar sequences. We show experimentally that the N-gram indexing approach saves the retrieval time significantly, and that it is as accurate as current popular search tool BLAST.

  • PDF

PC-Based Hybrid Grid Computing for Huge Biological Data Processing

  • Cho, Wan-Sup;Kim, Tae-Kyung;Na, Jong-Hwa
    • Journal of the Korean Data and Information Science Society
    • /
    • 제17권2호
    • /
    • pp.569-579
    • /
    • 2006
  • Recently, the amount of genome sequence is increasing rapidly due to advanced computational techniques and experimental tools in the biological area. Sequence comparisons are very useful operations to predict the functions of the genes or proteins. However, it takes too much time to compare long sequence data and there are many research results for fast sequence comparisons. In this paper, we propose a hybrid grid system to improve the performance of the sequence comparisons based on the LanLinux system. Compared with conventional approaches, hybrid grid is easy to construct, maintain, and manage because there is no need to install SWs for every node. As a real experiment, we constructed an orthologous database for 89 prokaryotes just in a week under hybrid grid; note that it requires 33 weeks on a single computer.

  • PDF

시계열 데이터베이스에서 DFT-기반 다차원 인덱스를 위한 물리적 데이터베이스 설계 (Physical Database Design for DFT-Based Multidimensional Indexes in Time-Series Databases)

  • 김상욱;김진호;한병일
    • 한국멀티미디어학회논문지
    • /
    • 제7권11호
    • /
    • pp.1505-1514
    • /
    • 2004
  • 시퀀스 매칭은 시계열 데이터베이스로부터 질의 시퀀스와 변화의 추세가 유사한 데이터 시퀀스들을 검색하는 연산이다. 기존의 대부분의 연구에서는 효과적인 시퀀스 매칭을 위하여 다차원 인덱스를 사용하며, 데이터 시퀀스를 이산 푸리에 변환(Discrete Fourier Transform: DFT)한 후, 단순히 앞의 두 개 내지 세 개의 DFT 계수만을 구성 속성 (organizing attributes)으로 사용함으로써 고차원의 경우 발생하는 차원 저주(dimensionality curse) 문제를 해결한다. 본 논문에서는 기존의 단순한 기법이 가지는 성능 상의 문제점들을 지적하고, 이러한 문제점들을 해결하는 최적의 다차원 인덱스 구성 기법을 제안한다. 제안된 기법은 대상이 되는 시계열 데이터베이스의 특성을 사전에 분석함으로써 변별력이 뛰어난 요소들을 다차원 인덱스의 구성 속성으로 선정하며, 비용 모델(cost model)을 기반으로 한 시퀀스 매칭 비용의 추정을 통하여 다차원 인덱스에 참여하는 최적의 구성 속성의 수를 결정한다. 제안된 기법의 우수성을 규명하기 위하여 실험을 통한기존 기법과의 성능 비교를 수행하였다 실험 결과에 의하면, 제안된 기법은 기존의 기법에 비교하여 매우 큰 성능 개선 효과를 가지는 것으로 나타났다.

  • PDF

유전자 및 유전체 연구 기술과 동향 (Trend and Technology of Gene and Genome Research)

  • 이진성;김기환;서동상;강석우;황재삼
    • 한국잠사곤충학회지
    • /
    • 제42권2호
    • /
    • pp.126-141
    • /
    • 2000
  • A major step towards understanding of the genetic basis of an organism is the complete sequence determination of all genes in target genome. The nucleotide sequence encoded in the genome contains the information that specifies the amino acid sequence of every protein and functional RNA molecule. In principle, it will be possible to identify every protein resposible for the structure and function of the body of the target organism. The pattern of expression in different cell types will specify where and when each protein is used. The amino acid sequence of the proteins encoded by each gene will be derived from the conceptional translation of the nucleotide sequence. Comparison of these sequences with those of known proteins, whose sequences are sorted in database, will suggest an approximate function for many proteins. This mini review describes the development of new sequencing methods and the optimization of sequencing strategies for whole genome, various cDNA and genomic analysis.

  • PDF

Effective Biological Sequence Alignment Method using Divide Approach

  • 최해원;김상진;피수영
    • 한국산업정보학회논문지
    • /
    • 제17권6호
    • /
    • pp.41-50
    • /
    • 2012
  • This paper presents a new sequence alignment method using the divide approach, which solves the problem by decomposing sequence alignment into several sub-alignments with respect to exact matching subsequences. Exact matching subsequences in the proposed method are bounded on the generalized suffix tree of two sequences, such as protein domain length more than 7 and less than 7. Experiment results show that protein sequence pairs chosen in PFAM database can be aligned using this method. In addition, this method reduces the time about 15% and space of the conventional dynamic programming approach. And the sequences were classified with 94% of accuracy.