• Title/Summary/Keyword: sequence databases

GWB: An integrated software system for Managing and Analyzing Genomic Sequences (GWB: 유전자 서열 데이터의 관리와 분석을 위한 통합 소프트웨어 시스템)

  • Kim In-Cheol;Jin Hoon
    • Journal of Internet Computing and Services / v.5 no.5 / pp.1-15 / 2004
  • In this paper, we explain the design and implementation of GWB (Gene WorkBench), a web-based, integrated system for efficiently managing and analyzing genomic sequences. Most existing software systems that handle genomic sequences rarely provide both management and analysis facilities. The analysis programs also tend to be single-purpose programs that cover only one or a few of the required functions. Moreover, these programs are widely distributed over the Internet and require different execution environments. Because using these programs together requires a great deal of manual work and data conversion, many life science researchers are seriously inconvenienced. To overcome the problems of existing systems and support genomic research more conveniently and effectively, this paper integrates both management and analysis facilities into a single system called GWB. The most important issues in the design of GWB are how to integrate many different analysis programs into a single software system, and how to provide the data or databases, in different formats, that these programs require. To address these issues, GWB integrates the analysis programs through common input/output interfaces called wrappers, proposes a common format for genomic sequence data, organizes local databases consisting of a relational database and an indexed sequential file, and provides facilities for converting data among several well-known formats and for exporting local databases into XML files.

The Performance Bottleneck of Subsequence Matching in Time-Series Databases: Observation, Solution, and Performance Evaluation (시계열 데이타베이스에서 서브시퀀스 매칭의 성능 병목 : 관찰, 해결 방안, 성능 평가)

  • 김상욱
    • Journal of KIISE:Databases / v.30 no.4 / pp.381-396 / 2003
  • Subsequence matching is an operation that finds, in time-series databases, subsequences whose changing patterns are similar to a given query sequence. This paper points out the performance bottleneck in subsequence matching and proposes an effective method that significantly improves the performance of the entire subsequence matching process by resolving that bottleneck. First, we analyze the disk access and CPU processing times required during the index searching and post-processing steps through preliminary experiments. Based on the results, we show that the post-processing step is the main performance bottleneck in subsequence matching, and then claim that its optimization is a crucial issue overlooked in previous approaches. To resolve the bottleneck, we propose a simple but quite effective method that performs the post-processing step in the optimal way. By rearranging the order in which candidate subsequences are compared with a query sequence, our method completely eliminates the redundant disk accesses and CPU processing that occur in the post-processing step. We formally prove that our method is optimal and does not incur any false dismissal. We show the effectiveness of our method through extensive experiments. The results show that our method speeds up the post-processing step by 3.91 to 9.42 times on a data set of real-world stock sequences and by 4.97 to 5.61 times on large-volume synthetic data sets. The results also show that our method reduces the share of the post-processing step in the entire subsequence matching process from about 90% to less than 70%. This implies that our method successfully resolves the performance bottleneck in subsequence matching. As a result, our method provides excellent overall subsequence matching performance: the experimental results reveal that it is 3.05 to 5.60 times faster than the previous method on a data set of real-world stock sequences and 3.68 to 4.21 times faster on large-volume synthetic data sets.
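
The abstract does not spell out the reordering, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that index search returns candidates as (sequence id, offset) pairs, that verification uses Euclidean distance against a tolerance `eps`, and that `load_sequence` is a hypothetical stand-in for a disk read. Grouping candidates by sequence before verification means each data sequence is fetched once, however many candidates refer to it.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def post_process(candidates, query, load_sequence, eps):
    """Verify candidates grouped by sequence so each sequence is loaded once.

    candidates    -- iterable of (seq_id, offset) pairs from the index search
    load_sequence -- hypothetical callable standing in for a disk read
    eps           -- assumed distance tolerance for a true match
    """
    offsets_by_seq = defaultdict(set)
    for seq_id, offset in candidates:
        offsets_by_seq[seq_id].add(offset)  # duplicates collapse here

    matches = []
    for seq_id in sorted(offsets_by_seq):
        data = load_sequence(seq_id)  # one simulated disk access per sequence
        for offset in sorted(offsets_by_seq[seq_id]):
            window = data[offset:offset + len(query)]
            if len(window) == len(query) and euclidean(window, query) <= eps:
                matches.append((seq_id, offset))
    return matches


# Toy in-memory "disk" that records every load.
store = {0: [1.0, 2.0, 3.0, 4.0, 5.0], 1: [5.0, 4.0, 3.0, 2.0, 1.0]}
loads = []


def load_sequence(seq_id):
    loads.append(seq_id)
    return store[seq_id]


candidates = [(0, 0), (1, 2), (0, 1), (1, 0), (0, 1)]
result = post_process(candidates, [2.0, 3.0, 4.0], load_sequence, eps=0.5)
```

In this toy run the five candidates touch two sequences, so only two loads are issued; processing them in the order returned by the index would have issued one load per candidate.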

Noise Control Boundary Image Matching Using Time-Series Moving Average Transform (시계열 이동평균 변환을 이용한 노이즈 제어 윤곽선 이미지 매칭)

  • Kim, Bum-Soo;Moon, Yang-Sae;Kim, Jin-Ho
    • Journal of KIISE:Databases / v.36 no.4 / pp.327-340 / 2009
  • To achieve the noise reduction effect in boundary image matching, we use the moving average transform of time-series matching. Our motivation is based on an intuition that using the moving average transform we may exploit the noise reduction effect in boundary image matching as in time-series matching. To confirm this simple intuition, we first propose $\kappa$-order image matching, which applies the moving average transform to boundary image matching. A boundary image can be represented as a sequence in the time-series domain, and our $\kappa$-order image matching identifies similar images in this time-series domain by comparing the $\kappa$-moving average transformed sequences. Next, we propose an index-based matching method that efficiently performs $\kappa$-order image matching on a large volume of image databases, and formally prove the correctness of the index-based method. Moreover, we formally analyze the relationship between an order $\kappa$ and its matching result, and present a systematic way of controlling the noise reduction effect by changing the order $\kappa$. Experimental results show that our $\kappa$-order image matching exploits the noise reduction effect, and our index-based matching method outperforms the sequential scan by one or two orders of magnitude.
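
The moving average transform at the core of $\kappa$-order image matching can be illustrated with a short sketch. It assumes a plain sliding window (a sequence of length n yields n - k + 1 averages) and Euclidean distance between the transformed sequences; the abstract does not state either detail, and the function names are illustrative only.

```python
import math


def moving_average(seq, k):
    """Return the k-moving average transform of seq (length n - k + 1)."""
    if not 1 <= k <= len(seq):
        raise ValueError("k must be between 1 and len(seq)")
    window_sum = sum(seq[:k])
    out = [window_sum / k]
    for i in range(k, len(seq)):
        # Slide the window: add the entering value, drop the leaving one.
        window_sum += seq[i] - seq[i - k]
        out.append(window_sum / k)
    return out


def k_order_distance(a, b, k):
    """Euclidean distance between the k-moving averages of a and b."""
    ma, mb = moving_average(a, k), moving_average(b, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ma, mb)))


# A smooth sequence and a copy with neighbouring values swapped (noise).
clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
noisy = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]

d1 = k_order_distance(clean, noisy, 1)  # order 1 is the raw distance
d2 = k_order_distance(clean, noisy, 2)  # order 2 smooths out the swaps
```

In this example the distance at order 2 is smaller than at order 1, which is the noise reduction effect the abstract describes: raising the order makes matching more tolerant of local distortions.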

Genomic and Proteomic Analysis of Microbial Function in the Gastrointestinal Tract of Ruminants - Review -

  • White, Bryan A.;Morrison, Mark
    • Asian-Australasian Journal of Animal Sciences / v.14 no.6 / pp.880-884 / 2001
  • Rumen microbiology research has undergone several evolutionary steps: the isolation and nutritional characterization of readily cultivated microbes; followed by the cloning and sequence analysis of individual genes relevant to key digestive processes; through to the use of small subunit ribosomal RNA (SSU rRNA) sequences for a cultivation-independent examination of microbial diversity. Our knowledge of rumen microbiology has expanded as a result, but the translation of this information into productive alterations of ruminal function has been rather limited. For instance, the cloning and characterization of cellulase genes in Escherichia coli has yielded some valuable information about this complex enzyme system in ruminal bacteria. SSU rRNA analyses have also confirmed that a considerable amount of the microbial diversity in the rumen is not represented in existing culture collections. However, we still have little idea of whether the key, and potentially rate-limiting, gene products and (or) microbial interactions have been identified. Technologies allowing high throughput nucleotide and protein sequence analysis have led to the emergence of two new fields of investigation, genomics and proteomics. Both disciplines can be further subdivided into functional and comparative lines of investigation. The massive accumulation of microbial DNA and protein sequence data, including complete genome sequences, is revolutionizing the way we examine microbial physiology and diversity. We describe here some examples of our use of genomics- and proteomics-based methods, to analyze the cellulase system of Ruminococcus flavefaciens FD-1 and explore the genome of Ruminococcus albus 8. At Illinois, we are using bacterial artificial chromosome (BAC) vectors to create libraries containing large (>75 kbases), contiguous segments of DNA from R. flavefaciens FD-1. 
Considering that not every bacterium is a candidate for whole genome sequencing, BAC libraries offer an attractive alternative method for performing physical and functional analyses of a bacterium's genome. Our first plan is to use these BAC clones to determine whether or not cellulases and accessory genes in R. flavefaciens exist in clusters of orthologous genes (COGs). Proteomics is also being used to complement the BAC library/DNA sequencing approach. Proteins differentially expressed in response to carbon source are being identified by 2-D SDS-PAGE, followed by in-gel digests and peptide mass mapping by MALDI-TOF mass spectrometry, as well as peptide sequencing by Edman degradation. At Ohio State, we have used a combination of functional proteomics, mutational analysis and differential display RT-PCR to obtain evidence suggesting that, in addition to a cellulosome-like mechanism, R. albus 8 possesses other mechanisms for adhesion to plant surfaces. Genome walking on either side of these differentially expressed transcripts has also resulted in two interesting observations: i) a relatively large number of genes with no matches in the current databases; and ii) the identification of genes with a high level of sequence identity to those identified, until now, in the archaebacteria. Genomics and proteomics will also accelerate our understanding of microbial interactions, and allow a greater degree of in situ analyses in the future. The challenge is to utilize genomics and proteomics to improve our fundamental understanding of microbial physiology, diversity and ecology, and overcome constraints to ruminal function.

Reinterpretation of the protein identification process for proteomics data

  • Kwon, Kyung-Hoon;Lee, Sang-Kwang;Cho, Kun;Park, Gun-Wook;Kang, Byeong-Soo;Park, Young-Mok
    • Interdisciplinary Bio Central / v.1 no.3 / pp.9.1-9.6 / 2009
  • Introduction: In mass spectrometry-based proteomics, biological samples are analyzed to identify proteins by mass spectrometry and database search. Database search is the process of selecting, from an amino acid sequence database, the best matches to the experimental mass spectra; the protein is identified as the matched sequence. A match score is defined to find matches in the database, and the highest-scoring hit is declared the most probable protein. The search result varies with the score definition. In this study, the differences among the search results of different search engines and different databases were investigated, in order to suggest a better way to identify more proteins with higher reliability. Materials and Methods: The protein extract of human mesenchymal stem cells was separated into several bands by one-dimensional electrophoresis. The bands of the one-dimensional gel were excised one by one, digested with trypsin, and analyzed with an FT LTQ mass spectrometer. The tandem mass (MS/MS) spectra of peptide ions were searched with the X!Tandem, Mascot, and Sequest search engines against the IPI human database and the SwissProt database. The search results were filtered at several threshold probability values using the Trans-Proteomic Pipeline (TPP) of the Institute for Systems Biology, and the output generated by TPP was analyzed. Results and Discussion: For each MS/MS spectrum, the peptide sequences identified under different conditions (search engine, threshold probability, and sequence database) were compared. At high threshold probability, the main differences in peptide identification were caused not by the difference in sequence database but by the difference in score. As the threshold probability decreased, the missed peptides appeared. Conversely, at an extremely high threshold level, we missed many true assignments.
Conclusion and Prospects: The different identification results of the search engines were mainly caused by their different scoring algorithms. In proteomics, high-scoring peptides are usually selected and low-scoring peptides are discarded. Many of them are true negatives. By integrating the search results from different parameters and different search engines, the protein identification process can be improved.

Construction of a Full-length cDNA Library from Korean Stewartia (Stewartia koreana Nakai) and Characterization of EST Dataset (노각나무(Stewartia koreana Nakai)의 cDNA library 제작 및 EST 분석)

  • Im, Su-Bin;Kim, Joon-Ki;Choi, Young-In;Choi, Sun-Hee;Kwon, Hye-Jin;Song, Ho-Kyung;Lim, Yong-Pyo
    • Horticultural Science & Technology / v.29 no.2 / pp.116-122 / 2011
  • In this study, we report the generation and analysis of 1,392 expressed sequence tags (ESTs) from Korean Stewartia (Stewartia koreana Nakai). A cDNA library was generated from young leaf tissue, and a total of 1,392 cDNAs were partially sequenced. EST and unigene sequence quality were determined by computational filtering, manual review, and BLAST analyses. Finally, 1,301 ESTs were obtained after removing the vector sequence and filtering for a minimum length of 100 nucleotides. A total of 893 unigenes, consisting of 150 contigs and 743 singletons, were identified after assembly. We also identified 95 new microsatellite-containing sequences from the unigenes and classified their structure according to the repeat unit. According to a homology search with BLASTX against the NCBI database, 65% of ESTs were homologous to sequences with known function and 11.6% of ESTs matched sequences with putative or unknown function. The remaining 23.2% of ESTs showed no significant similarity to any protein sequences found in the public database. Annotation-based searches against multiple databases, including wine grape and Populus sequences, helped to identify putative functions of ESTs and unigenes. Gene ontology (GO) classification showed that the most abundant GO terms were transport, nucleotide binding, and plastid, in terms of biological process, molecular function, and cellular component, respectively. The sequence data will be used to characterize the potential roles of new genes in Stewartia and will serve as a useful genetic resource.

An Integrated Genomic Resource Based on Korean Cattle (Hanwoo) Transcripts

  • Lim, Da-Jeong;Cho, Yong-Min;Lee, Seung-Hwan;Sung, Sam-Sun;Nam, Jung-Rye;Yoon, Du-Hak;Shin, Youn-Hee;Park, Hye-Sun;Kim, Hee-Bal
    • Asian-Australasian Journal of Animal Sciences / v.23 no.11 / pp.1399-1404 / 2010
  • We have created a Bovine Genome Database, an integrated genomic resource for Bos taurus, by merging bovine data from various databases and our own data. We produced 55,213 Korean cattle (Hanwoo) ESTs from cDNA libraries from three tissues. We concentrated on genomic information based on Hanwoo transcripts and provided user-friendly search interfaces within the Bovine Genome Database. The genome browser supported alignment results for the various types of data: Hanwoo EST, consensus sequence, human gene, and predicted bovine genes. The database also provides transcript data information, gene annotation, genomic location, sequence and tissue distribution. Users can also explore bovine disease genes based on comparative mapping of homologous genes and can conduct searches centered on genes within user-selected quantitative trait loci (QTL) regions. The Bovine Genome Database can be accessed at http://bgd.nabc.go.kr.

The Analysis of Genome Database Compaction based on Sequence Similarity (시퀀스 유사도에 기반한 유전체 데이터베이스 압축 및 영향 분석)

  • Kwon, Sunyoung;Lee, Byunghan;Park, Seunghyun;Jo, Jeonghee;Yoon, Sungroh
    • KIISE Transactions on Computing Practices / v.23 no.4 / pp.250-255 / 2017
  • Given the explosion of genomic data and the expansion of applications such as precision medicine, the importance of efficient genome-database management continues to grow. Traditional compression techniques may be effective in reducing the size of a database, but they raise a new challenge: performing operations such as comparisons and searches on the compressed database. Based on the observations that many genome databases contain numerous duplicated or similar sequences, and that the runtime of genome analyses is normally proportional to the number of sequences in a database, we propose a technique that compresses a genome database by eliminating similar entries from it. Through our experiments, we show that we can remove approximately 84% of sequences with a 1% similarity threshold, accelerating the downstream classification tasks by approximately 10 times. We also confirm that our compression method does not significantly affect the accuracy of taxonomy diversity assessments or classification.
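
The abstract does not say how similarity is measured or how entries are chosen for removal. The sketch below is therefore an illustration under stated assumptions rather than the authors' method: similarity is taken to be the Jaccard index of k-mer sets, and compaction is a greedy pass that keeps a sequence only if it is not too similar to any sequence already kept. The names `kmers`, `jaccard`, `compact`, and `min_similarity` are hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compact(sequences, min_similarity=0.9, k=3):
    """Greedily keep one representative per group of similar sequences.

    A sequence is dropped when its k-mer set has a Jaccard index of at
    least min_similarity with some representative that was kept earlier.
    """
    representatives = []  # list of (sequence, k-mer set) pairs
    for seq in sequences:
        profile = kmers(seq, k)
        if all(jaccard(profile, rep) < min_similarity
               for _, rep in representatives):
            representatives.append((seq, profile))
    return [seq for seq, _ in representatives]


db = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
kept = compact(db, min_similarity=0.75)
```

Here the exact duplicate and the near-duplicate are removed and two representatives remain. The pairwise comparison against every representative is quadratic in the worst case, so a database of realistic size would need an index or a sketching scheme in front of it.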

Sequence Group Validation based on Boundary Locking for Valid XML Documents (유효한 XML 문서에 대한 경계 로킹에 기반한 시퀀스 그룹 검증 기법)

  • Choi, Yoon-Sang;Park, Seog
    • Journal of KIISE:Databases / v.32 no.6 / pp.628-640 / 2005
  • XML is well accepted in several different Web application areas. As soon as many users and applications work concurrently on the same collection of XML documents, isolating the accesses and modifications of different transactions becomes an important issue. When an XML document conforms to the rules laid out in a DTD or XML schema, it is said to be valid. The validity of a valid XML document should be guaranteed after the document is updated. The validation method mentioned above, however, results in a lower degree of concurrency. To obtain a higher degree of concurrency and minimize the range of the XML document validity, a new validation method based on a specific locking method is required. In this paper, we propose the sequence group validation method for minimizing the range of the XML document validity. We also propose the boundary locking method for isolating the accesses and modifications of different transactions while preserving the validity of valid XML documents. Finally, the results of some experiments show that the validation and locking methods increase the degree of transaction concurrency.

Assessment of the Potential Allergenicity of Genetically Modified Soybeans and Soy-based Products

  • Kim, Jae-Hwan;Lieu, Hae-Youn;Kim, Tae-Woon;Kim, Dae-Ok;Shon, Dong-Hwa;Ahn, Kang-Mo;Lee, Sang-Il;Kim, Hae-Yeong
    • Food Science and Biotechnology / v.15 no.6 / pp.954-958 / 2006
  • A comprehensive safety evaluation was conducted to assess the potential allergenicity of newly introduced proteins in genetically modified (GM) crops. We assessed the allergenicity of CP4 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) in GM soybeans. This assessment was performed by IgE immunoblotting with soy-allergic children's sera, amino acid sequence homology with known allergens, and the digestibility of CP4 EPSPS. No differences in IgE-antigen binding by immunoblotting were found between GM soy samples and the corresponding non-GM samples. Based on the comparison of EPSPS amino acid sequence homology with current allergen databases, no known allergen was found. In addition, CP4 EPSPS protein was rapidly digested by simulated gastric fluid (SGF). Taken together, these results indicate that GM soybeans have no allergenicity in children and are as safe as conventional soybeans.