• Title/Summary/Keyword: sequence databases

Making Cache-Conscious CCMR-trees for Main Memory Indexing (주기억 데이타베이스 인덱싱을 위한 CCMR-트리)

  • 윤석우;김경창
    • Journal of KIISE:Databases / v.30 no.6 / pp.651-665 / 2003
  • Reducing cache misses has become the most important issue for main memory databases, since CPU speeds have been increasing at 60% per year while memory speeds have been increasing at only 10% per year. Recent research has demonstrated that cache-conscious index structures such as the CR-tree outperform the R-tree variants. The search performance of the CR-tree can be poorer than that of the original R-tree, however, since it uses a lossy compression scheme. In this paper, we propose an alternative cache-conscious version of the R-tree, which we call the MR-tree. The MR-tree propagates node splits upward only if one of the internal nodes on the insertion path has empty room. Thus, the internal nodes of the MR-tree are almost 100% full. If there is no empty room on the insertion path, a newly created leaf simply becomes a child of the split leaf. The height of the MR-tree therefore grows depending on the order in which objects are inserted, so the HeightBalance algorithm is executed when unbalanced heights of child nodes are detected. Additionally, we propose the CCMR-tree in order to build a more cache-conscious MR-tree. Our experimental and analytical study shows that the two-dimensional MR-tree performs searches up to 2.4 times faster than the ordinary R-tree while maintaining slightly better update performance and using similar memory space.

Design of Heterogeneous Content Linkage Method by Analyzing Genbank (Genbank 분석을 통한 이종의 콘텐츠 연계 방안 설계)

  • Ahn, Bu-Young;Lee, Myung-Sun;Kim, Ji-Young;Oh, Chung-Shick
    • The Journal of the Korea Contents Association / v.10 no.6 / pp.49-54 / 2010
  • Because gene sequence information is both diverse and extremely large in volume, high-performance computing and information technology are required to build and analyze gene sequence databases. This has given rise to bioinformatics, a field of research in which computers are used to collect, manage, store, evaluate, and analyze biological data. In line with the continued development of bioinformatics, the Korea Institute of Science and Technology Information (KISTI) has built an IT-based infrastructure for biological information and provides that information to bioscience researchers. This paper analyzes the reference fields of Genbank, the gene database most frequently used by researchers worldwide among the biological information databases, and proposes a method for linking it to NDSL, the integrated science and technology information service provided by KISTI. To this end, after collecting Genbank data from the NCBI FTP site, we rebuilt the database by separating the Genbank text files into basic gene data and reference data. New tables were then generated by extracting paper and patent information from the Genbank reference fields. Finally, we suggest a method for connecting these tables with the paper DB and the patent DB operated by KISTI.

Subsequence Matching Under Time Warping in Time-Series Databases : Observation, Optimization, and Performance Results (시계열 데이터베이스에서 타임 워핑 하의 서브시퀀스 매칭 : 관찰, 최적화, 성능 결과)

  • Kim Man-Soon;Kim Sang-Wook
    • The KIPS Transactions:PartD / v.11D no.7 s.96 / pp.1385-1398 / 2004
  • This paper discusses effective processing of subsequence matching under time warping in time-series databases. Time warping is a transformation that enables finding sequences with similar patterns even when they are of different lengths. Through a preliminary experiment, we first point out that the performance bottleneck of Naive-Scan, a basic method for processing subsequence matching under time warping, lies in the CPU processing step. Then, we propose a novel method that optimizes the CPU processing step of Naive-Scan. The proposed method maximizes CPU performance by eliminating all the redundant calculations that occur in computing the time warping distance between the query sequence and data subsequences. We formally prove that the proposed method does not incur false dismissals and is the optimal one for processing Naive-Scan. We also discuss how to apply the proposed method to the post-processing step of LB-Scan and ST-Filter, the previous methods for processing subsequence matching under time warping. Then, we quantitatively verify the performance improvement obtained by the proposed method via extensive experiments. The results show that the performance of all three previous methods improves when they employ the proposed method. In particular, Naive-Scan, which is known to show the worst performance, performs much better than both LB-Scan and ST-Filter in all cases when it employs the proposed method for CPU processing. This result is meaningful in that optimizing the CPU processing step, which is their performance bottleneck, inverts the performance ranking among Naive-Scan, LB-Scan, and ST-Filter.
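The time warping distance mentioned in this abstract can be illustrated with the textbook dynamic-programming recurrence. The sketch below is not the paper's optimized method: it assumes the absolute difference as the element cost, and the names `time_warping_distance` and `brute_force_matches` are hypothetical. The brute-force scan rebuilds a full table for every candidate subsequence, which shows the kind of repeated computation that arises across overlapping subsequences.

```python
def time_warping_distance(s, q):
    """Textbook DTW distance with |a - b| as the element cost (an assumed base distance)."""
    inf = float("inf")
    n, m = len(s), len(q)
    # d[i][j] holds the warping distance between the prefixes s[:i] and q[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - q[j - 1])
            # Extend the cheapest of the three neighbouring alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def brute_force_matches(data, query, eps):
    """Return half-open (start, end) index pairs of subsequences within eps of the query."""
    hits = []
    for start in range(len(data)):
        for end in range(start + 1, len(data) + 1):
            # A fresh table is built for every subsequence, even overlapping ones.
            if time_warping_distance(data[start:end], query) <= eps:
                hits.append((start, end))
    return hits


if __name__ == "__main__":
    # Toy data for illustration only.
    print(brute_force_matches([5, 1, 2, 3, 9], [1, 2, 2, 3], 0))
```

Because warping lets one element align with several, the subsequence `[1, 2, 3]` matches the longer query `[1, 2, 2, 3]` at distance zero even though the lengths differ.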

Generalization of Window Construction for Subsequence Matching in Time-Series Databases (시계열 데이터베이스에서의 서브시퀀스 매칭을 위한 윈도우 구성의 일반화)

  • Moon, Yang-Sae;Han, Wook-Shin;Whang, Kyu-Young
    • Journal of KIISE:Databases / v.28 no.3 / pp.357-372 / 2001
  • In this paper, we present the concept of generalization in constructing windows for subsequence matching and propose a new subsequence matching method, GeneralMatch, based on this generalization. The earlier work of Faloutsos et al. (FRM for short) causes many false alarms because it lacks the point-filtering effect. DualMatch, which was proposed by the authors, improves performance significantly over FRM by exploiting the point-filtering effect, but it has the drawback of a smaller maximum window size (half that of FRM) for a given minimum query length. GeneralMatch, an improvement on DualMatch, offers the advantages of both methods: it can use large windows like FRM and, at the same time, can exploit the point-filtering effect like DualMatch. GeneralMatch divides data sequences into J-sliding windows (generalized sliding windows) and the query sequence into J-disjoint windows (generalized disjoint windows). We formally prove that GeneralMatch is correct, i.e., it incurs no false dismissal. We also prove that, given the minimum query length, there is an upper bound on the window size that guarantees the correctness of GeneralMatch. We then propose a method of determining the value of J that minimizes the number of page accesses. Experimental results for real stock data show that, for low selectivities ($10^{-6} \sim 10^{-4}$), GeneralMatch improves performance by 114% over DualMatch and by 998% over FRM on average; for high selectivities ($10^{-6} \sim 10^{-4}$), by 46% over DualMatch and by 65% over FRM on average.

Selectivity Estimation for Spatio-Temporal Overlap Join (시공간 겹침 조인 연산을 위한 선택도 추정 기법)

  • Lee, Myoung-Sul;Lee, Jong-Yun
    • Journal of KIISE:Databases / v.35 no.1 / pp.54-66 / 2008
  • A spatio-temporal join is an expensive operation that is commonly used in spatio-temporal database systems. To generate an efficient query plan for queries involving spatio-temporal join operations, it is crucial to estimate the selectivity of the join operations accurately. Given two datasets $S_1$ and $S_2$ of discrete data and a timestamp $t_q$, a spatio-temporal join retrieves all pairs of objects that intersect each other at $t_q$. The selectivity of the join operation equals the number of retrieved pairs divided by the cardinality of the Cartesian product $S_1 \times S_2$. In this paper, we propose a spatio-temporal histogram that estimates the selectivity of a spatio-temporal join by extending the existing geometric histogram. Using a wide spectrum of both uniform and skewed datasets, we show that the proposed method, called Spatio-Temporal Histogram, can accurately estimate the selectivity of a spatio-temporal join. Our contributions can be summarized as follows. First, this is the first attempt at selectivity estimation of a spatio-temporal join for discrete data. Second, we propose an efficient maintenance method that reconstructs histograms using compression of spatial statistical information during the lifespan of discrete data.
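The selectivity definition in this abstract (retrieved pairs divided by $|S_1 \times S_2|$) can be made concrete with an exact nested-loop computation. This is a minimal sketch of the quantity being estimated, not the paper's histogram estimator. The object layout (`STObject`, an axis-aligned rectangle with a validity interval) and all function names are assumptions made for illustration.

```python
from typing import NamedTuple, Sequence


class STObject(NamedTuple):
    """Hypothetical discrete spatio-temporal object: a rectangle valid on [t_start, t_end]."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    t_start: float
    t_end: float


def alive_at(o: STObject, t_q: float) -> bool:
    """True if the object exists at timestamp t_q."""
    return o.t_start <= t_q <= o.t_end


def overlaps(a: STObject, b: STObject) -> bool:
    """True if the two rectangles intersect (touching edges count as intersecting)."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def exact_join_selectivity(s1: Sequence[STObject], s2: Sequence[STObject], t_q: float) -> float:
    """Number of intersecting pairs at t_q divided by |S1 x S2|."""
    if not s1 or not s2:
        return 0.0
    hits = sum(
        1
        for a in s1 if alive_at(a, t_q)
        for b in s2 if alive_at(b, t_q) and overlaps(a, b)
    )
    return hits / (len(s1) * len(s2))


if __name__ == "__main__":
    # Toy data for illustration only.
    s1 = [STObject(0, 0, 2, 2, 0, 10), STObject(5, 5, 6, 6, 0, 10)]
    s2 = [STObject(1, 1, 3, 3, 0, 10), STObject(1, 1, 3, 3, 20, 30)]
    print(exact_join_selectivity(s1, s2, t_q=5))
```

The nested loop costs time proportional to $|S_1| \cdot |S_2|$, which is why a query optimizer would rely on an estimate rather than on this exact value.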

Optimal Construction of Multiple Indexes for Time-Series Subsequence Matching (시계열 서브시퀀스 매칭을 위한 최적의 다중 인덱스 구성 방안)

  • Lim, Seung-Hwan;Kim, Sang-Wook;Park, Hee-Jin
    • Journal of KIISE:Databases / v.33 no.2 / pp.201-213 / 2006
  • A time-series database is a set of time-series data sequences, each of which is a list of the changing values of an object over a given period of time. Subsequence matching is an operation that searches a time-series database for data subsequences whose changing patterns are similar to that of a query sequence. This paper addresses a performance issue of time-series subsequence matching. First, we quantitatively examine the performance degradation caused by the window size effect, and then show that the performance of subsequence matching with a single index is not satisfactory in real applications. We argue that index interpolation is useful for resolving this problem. Index interpolation performs subsequence matching by selecting the most appropriate index from multiple indexes, each built on windows of its own size. For index interpolation, we first need to decide the sizes of the windows for the multiple indexes to be built. In this paper, we solve the problem of selecting optimal window sizes from the perspective of physical database design. To this end, given a set of query sequences to be performed on a target time-series database and a set of window sizes for building multiple indexes, we devise a formula that estimates the cost of all the subsequence matchings. Based on this formula, we propose an algorithm that determines the optimal window sizes for maximizing the overall performance of subsequence matching. We formally prove the optimality as well as the effectiveness of the algorithm. Finally, we perform a series of extensive experiments with a real-life stock data set and a large synthetic data set. The results reveal that the proposed approach improves on the previous one by 1.5 to 7.8 times.

Draft Genome Assembly and Annotation for Cutaneotrichosporon dermatis NICC30027, an Oleaginous Yeast Capable of Simultaneous Glucose and Xylose Assimilation

  • Wang, Laiyou;Guo, Shuxian;Zeng, Bo;Wang, Shanshan;Chen, Yan;Cheng, Shuang;Liu, Bingbing;Wang, Chunyan;Wang, Yu;Meng, Qingshan
    • Mycobiology / v.50 no.1 / pp.66-78 / 2022
  • The identification of oleaginous yeast species capable of simultaneously utilizing xylose and glucose as substrates to generate value-added biological products is an area of key economic interest. We have previously demonstrated that the Cutaneotrichosporon dermatis NICC30027 yeast strain is capable of simultaneously assimilating both xylose and glucose, resulting in considerable lipid accumulation. However, as no high-quality genome sequencing data or associated annotations for this strain are available at present, it remains challenging to study the metabolic mechanisms underlying this phenotype. Herein, we report a 39,305,439 bp draft genome assembly for C. dermatis NICC30027 comprised of 37 scaffolds, with 60.15% GC content. Within this genome, we identified 524 tRNAs, 142 sRNAs, 53 miRNAs, 28 snRNAs, and eight rRNA clusters. Moreover, repeat sequences totaling 1,032,129 bp in length were identified (2.63% of the genome), as were 14,238 unigenes that were 1,789.35 bp in length on average (64.82% of the genome). The NCBI non-redundant protein sequences (NR) database was employed to successfully annotate 11,795 of these unigenes, while 3,621 and 11,902 were annotated with the Swiss-Prot and TrEMBL databases, respectively. Unigenes were additionally subjected to pathway enrichment analyses using the Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Cluster of Orthologous Groups of proteins (COG), Clusters of orthologous groups for eukaryotic complete genomes (KOG), and Non-supervised Orthologous Groups (eggNOG) databases. Together, these results provide a foundation for future studies aimed at clarifying the mechanistic basis for the ability of C. dermatis NICC30027 to simultaneously utilize glucose and xylose to synthesize lipids.

Characterization, Cloning and Expression of the Ferritin Gene from the Korean Polychaete, Periserrula leucophryna

  • Jeong Byeong Ryong;Chung Su-Mi;Baek Nam Joo;Koo Kwang Bon;Baik Hyung Suk;Joo Han-Seung;Chang Chung-Soon;Choi Jang Won
    • Journal of Microbiology / v.44 no.1 / pp.54-63 / 2006
  • Ferritin is a major eukaryotic protein and, in humans, is the iron storage protein. A partial gene fragment of ferritin (255 bp), taken from the total RNA of Periserrula leucophryna, was amplified by RT-PCR using oligonucleotide primers designed from the conserved metal binding domain of eukaryotic ferritin and confirmed by DNA sequencing. Using the $^{32}$P-labeled partial ferritin cDNA fragment, 28 different clones were obtained by screening the P. leucophryna cDNA library prepared in the Uni-ZAP XR vector, and these were sequenced and characterized. The longest clone was named the PLF (Periserrula leucophryna ferritin) gene, and the nucleotide and amino acid sequences of this novel gene were deposited in the GenBank databases with accession numbers DQ207752 and ABA55730, respectively. The entire cDNA of the PLF clone was 1109 bp (CDS: 129-653), including a coding nucleotide sequence of 525 bp, a 5'-untranslated region of 128 bp, and a 3'-noncoding region of 456 bp. The 5'-UTR contains a putative iron responsive element (IRE) sequence. Ferritin has an open reading frame encoding a polypeptide of 174 amino acids, including a hydrophobic signal peptide of 17 amino acids. The predicted molecular weights of the immature and mature ferritin were calculated to be 20.3 kDa and 18.2 kDa, respectively. The region encoding the mature ferritin was subcloned into the pT7-7 expression vector after PCR amplification using the designed primers, which included the initiation and termination codons; the recombinant clones were expressed in E. coli BL21(DE3) or E. coli BL21(DE3)pLysE. SDS-PAGE and western blot analysis showed that a ferritin of approximately 18 kDa (mature form) was produced, and iron staining in native PAGE indicated that the recombinant ferritin is likely correctly folded and assembled into a homopolymer composed of a single subunit.
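The lengths quoted in this abstract are mutually consistent, as a short worked check shows. Reading the final codon as the stop codon is an assumption; the abstract does not state it.

$$128 + 525 + 456 = 1109\ \text{bp}, \qquad 653 - 129 + 1 = 525\ \text{bp}, \qquad \frac{525}{3} = 175\ \text{codons} = 174\ \text{amino acids} + 1\ \text{stop codon}$$

Removing the 17-residue signal peptide from the 174-residue polypeptide leaves $174 - 17 = 157$ residues for the mature form.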

Molecular Cloning and Bioinformatic Analysis of SPATA4 Gene

  • Liu, Shang-Feng;Ai, Chao;Ge, Zhong-Qi;Liu, Hai-Luo;Liu, Bo-Wen;He, Shan;Wang, Zhao
    • BMB Reports / v.38 no.6 / pp.739-747 / 2005
  • Full-length cDNA sequences of four novel SPATA4 genes in chimpanzee, cow, chicken, and ascidian were identified by bioinformatic analysis using a mouse or human SPATA4 cDNA fragment as an electronic probe. All of these genes have 6 exons, encode proteins of similar molecular weight, and are not located on a sex chromosome. The mouse SPATA4 sequence, which is identified as significantly changed in cryptorchidism, shares no significant homology with any known protein in the Swiss-Prot database except for the homologous genes in various vertebrates. Our search results showed that all SPATA4 proteins have a putative conserved domain, DUF1042. The sequence identity among the putative SPATA4 proteins ranges from 30% to 99%. High similarity was also found in the 1 kb promoter regions of the human, mouse, and rat SPATA4 genes. The sequences upstream of the SPATA4 promoter also show a high degree of similarity. A search of SymAtlas (http://symatlas.gnf.org/SymAtlas/) showed that human SPATA4 is highly expressed in testis, especially in testis interstitial, Leydig cell, seminiferous tubule, and germ cell. Mouse SPATA4 was observed exclusively in adult mouse testis, and almost no signal was detected in other tissues. The pI values of the protein are negative, ranging from 9.44 to 10.15. The subcellular location of the protein is usually in the nucleus. The signal peptide probabilities for SPATA4 are always zero. Using the SNP data in NCBI, we found 33 SNPs in the genomic DNA region of the human SPATA4 gene, 29 of which are located in introns. A CpG island search shows that the CpG island regions are highly similar to one another, although their lengths differ. This research is a fundamental work in the field of bioinformatic analysis and also puts forward a new approach for the bioinformatic analysis of other genes.

Human Infections with Spirometra decipiens Plerocercoids Identified by Morphologic and Genetic Analyses in Korea

  • Jeon, Hyeong-Kyu;Park, Hansol;Lee, Dongmin;Choe, Seongjun;Kim, Kyu-Heon;Huh, Sun;Sohn, Woon-Mok;Chai, Jong-Yil;Eom, Keeseon S.
    • Parasites, Hosts and Diseases / v.53 no.3 / pp.299-305 / 2015
  • Tapeworms of the genus Spirometra are pseudophyllidean cestodes endemic in Korea. At present, it is unclear which Spirometra species are responsible for causing human infections, and little information is available on the epidemiological profiles of Spirometra species infecting humans in Korea. Between 1979 and 2009, a total of 50 spargana from human patients and 2 adult specimens obtained from experimentally infected carnivorous animals were analyzed according to genetic and taxonomic criteria and classified as Spirometra erinaceieuropaei or Spirometra decipiens depending on the morphology. Morphologically, S. erinaceieuropaei and S. decipiens differ in that the spirally coiled uterus in S. erinaceieuropaei has 5-7 complete coils, while in S. decipiens it has only 4.5 coils. In addition, there is a 9.3% (146/1,566) sequence difference between S. erinaceieuropaei and S. decipiens in the cox1 gene. Partial cox1 sequences (390 bp) from 35 Korean isolates showed 99.4% (388/390) similarity with the reference sequence of S. erinaceieuropaei from Korea (G1724; GenBank KJ599680), and an additional 15 Korean isolates showed 99.2% (387/390) similarity with the reference sequence of S. decipiens from Korea (G1657; GenBank KJ599679). Based on morphologic and molecular data, the estimated population ratio of S. erinaceieuropaei to S. decipiens was 35:15. Our results indicate that both S. erinaceieuropaei and S. decipiens found in Korea infect humans, with S. erinaceieuropaei being 2 times more prevalent than S. decipiens. This study is the first to report human sparganosis caused by S. decipiens in Korea.
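The similarity figures in this abstract are ratios of matching positions to aligned length. The sketch below shows that calculation, assuming the sequences are already aligned to equal length. The function names and the toy sequences are hypothetical and are not data from the study.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def percent_from_counts(matches: int, length: int) -> float:
    """Percentage computed from a match count and an aligned length."""
    return 100.0 * matches / length


if __name__ == "__main__":
    # Toy sequences for illustration only: 7 of 8 positions agree.
    print(percent_identity("ACGTACGT", "ACGTACGA"))
    # Counts quoted in the abstract.
    print(round(percent_from_counts(388, 390), 2))
    print(round(percent_from_counts(387, 390), 2))
    print(round(percent_from_counts(146, 1566), 2))
```

Applied to the quoted counts, 388/390 evaluates to about 99.49%, so the 99.4% in the abstract appears to be truncated rather than rounded; 387/390 and 146/1,566 agree with the stated 99.2% and 9.3%.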