• Title/Summary/Keyword: Biological Sequence Database

Functional Annotation and Analysis of Korean Patented Biological Sequences Using Bioinformatics

  • Lee, Byung Wook;Kim, Tae Hyung;Kim, Seon Kyu;Kim, Sang Soo;Ryu, Gee Chan;Bhak, Jong
    • Molecules and Cells / v.21 no.2 / pp.269-275 / 2006
  • A recent report of the Korean Intellectual Property Office (KIPO) showed that the number of biological sequence-based patents is rapidly increasing in Korea. We present biological features of Korean patented sequences through bioinformatic analysis. The analysis is divided into two steps. The first is an annotation step in which the patented sequences were annotated with the Reference Sequence (RefSeq) database. The second is an association step in which the patented sequences were linked to genes, diseases, pathways, and biological functions. We used the Entrez Gene, Online Mendelian Inheritance in Man (OMIM), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Gene Ontology (GO) databases. Through the association analysis, we found that nearly 2.6% of human genes were associated with Korean patents, compared to 20% of human genes in U.S. patents. The association between the biological functions and the patented sequences indicated that genes whose products act as hormones on defense responses in extracellular environments were the most highly targeted for patenting. The analysis data are available at http://www.patome.net

Estimation of Substring Selectivity in Biological Sequence Database (생물학 서열 데이타베이스에서 부분 문자열의 선적도 추정)

  • 배진욱;이석호
    • Journal of KIISE:Databases / v.30 no.2 / pp.168-175 / 2003
  • Until now, substring selectivities have been estimated in two steps. The first step builds a count-suffix tree, which holds statistical information about substrings, and the second step uses it to estimate substring selectivity. However, it is practically impossible to build a count-suffix tree from biological sequences because the sequences are too long. This paper therefore proposes a novel data structure, the count q-gram tree, consisting of fixed-length substrings. The count q-gram tree retains the exact counts of all substrings whose lengths are equal to or less than q. It is generated in O(N) time, and its size does not depend on N, the total length of all sequences. This paper also presents an estimation technique, k-MO. k-MO can choose the overlap length of the substrings split from a query string, and this choice affects both the accuracy of the selectivity estimate and the query processing time. Experiments show that k-MO estimates selectivity very accurately.
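The idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a flat `Counter` stands in for the count q-gram tree, the build loop costs O(N·q) rather than the O(N) the paper reports, and the estimator is a generic overlap-based (Markov-style) formula that is only assumed to resemble k-MO. The names `build_qgram_counts` and `estimate_occurrences` are hypothetical.

```python
from collections import Counter


def build_qgram_counts(sequences, q):
    """Count every substring of length 1..q across all sequences.

    A flat Counter is used here in place of the paper's tree (assumption).
    The empty string stores the total number of characters so that an
    overlap of k = 0 falls back to an independence assumption.
    """
    counts = Counter()
    for seq in sequences:
        counts[""] += len(seq)
        for length in range(1, q + 1):
            for start in range(len(seq) - length + 1):
                counts[seq[start:start + length]] += 1
    return counts


def estimate_occurrences(query, counts, q, k):
    """Estimate how often `query` occurs, using pieces of length q that
    overlap their predecessor by k characters (0 <= k < q)."""
    if not 0 <= k < q:
        raise ValueError("overlap k must satisfy 0 <= k < q")
    if len(query) <= q:
        return float(counts[query])  # short queries are counted exactly
    estimate = float(counts[query[:q]])
    end = q  # number of query characters covered so far
    while end < len(query):
        overlap = query[end - k:end]
        piece = query[end - k:end - k + q]
        if counts[overlap] == 0:
            return 0.0
        # Conditional factor: count(piece) / count(overlap)
        estimate *= counts[piece] / counts[overlap]
        end = end - k + len(piece)
    return estimate


counts = build_qgram_counts(["ACGTACGT"], q=3)
print(estimate_occurrences("ACGT", counts, q=3, k=2))
print(estimate_occurrences("ACGT", counts, q=3, k=0))
```

A larger overlap k conditions each piece on more context, which tends to improve accuracy but needs more lookups per query. This is the trade-off between accuracy and processing time that the abstract describes.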

Development of Integrated Retrieval System of the Biology Sequence Database Using Web Service (웹 서비스를 이용한 바이오 서열 정보 데이터베이스 및 통합 검색 시스템 개발)

  • Lee, Su-Jung;Yong, Hwan-Seung
    • The KIPS Transactions:PartD / v.11D no.4 / pp.755-764 / 2004
  • Recently, the rapid development of biotechnology has brought an explosion of biological data and of hosts serving that data. Moreover, these data are highly distributed and heterogeneous, reflecting the distribution and heterogeneity of the molecular biology research community. As a consequence, the integration and interoperability of molecular biology databases are issues of considerable importance. Until now, however, most integrated systems, such as link-based and data-warehouse-based systems, have had difficulty keeping data up to date when the schema or data of a source changes. For this reason, integrated systems using web service technology, which allow biological data to be fully exploited, have been proposed. In this paper, we built an integrated system for bio sequence information based on web service technology. The developed system allows users to get data in many formats, such as BSML, GenBank, and FASTA, while traversing disparate data resources. It also has better retrieval performance because the retrieval modules for the external databases run in parallel.

Effective Biological Sequence Alignment Method using Divide Approach

  • Choi, Hae-Won;Kim, Sang-Jin;Pi, Su-Young
    • Journal of Korea Society of Industrial Information Systems / v.17 no.6 / pp.41-50 / 2012
  • This paper presents a new sequence alignment method using a divide approach, which decomposes sequence alignment into several sub-alignments around exactly matching subsequences. Exact matching subsequences in the proposed method are bounded on the generalized suffix tree of the two sequences, such as protein domain length more than 7 and less than 7. Experimental results show that protein sequence pairs chosen from the PFAM database can be aligned using this method. In addition, the method reduces the time by about 15%, and also reduces the space, compared with the conventional dynamic programming approach. The sequences were classified with 94% accuracy.
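The divide step can be sketched as follows. This is a simplified illustration and not the paper's algorithm: `difflib.SequenceMatcher` from the Python standard library stands in for the generalized suffix tree when locating the longest exact match, the sub-alignments use a plain Needleman-Wunsch routine with made-up scores, and the `min_anchor` threshold is a hypothetical parameter.

```python
from difflib import SequenceMatcher


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def divide_align(a, b, min_anchor=4):
    """Split both sequences at their longest exact common substring and
    align the left and right remainders recursively."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size < min_anchor:
        # No usable anchor: fall back to full dynamic programming.
        return needleman_wunsch(a, b)
    anchor = a[m.a:m.a + m.size]
    left_a, left_b = divide_align(a[:m.a], b[:m.b], min_anchor)
    right_a, right_b = divide_align(a[m.a + m.size:], b[m.b + m.size:], min_anchor)
    return left_a + anchor + right_a, left_b + anchor + right_b


top, bottom = divide_align("AAACGTACGTTT", "CCACGTACGGG")
print(top)
print(bottom)
```

Because each anchor is copied through unchanged, dynamic programming runs only on the shorter regions between anchors. That is the plausible source of the time and space savings the abstract reports, although the anchored result is not guaranteed to be globally optimal.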

Protein Sequence Search based on N-gram Indexing

  • Hwang, Mi-Nyeong;Kim, Jin-Suk
    • Bioinformatics and Biosystems / v.1 no.1 / pp.46-50 / 2006
  • With the advancement of experimental techniques in molecular biology, genomic and protein sequence databases are increasing in size exponentially, and mean sequence lengths are also increasing. As these databases grow, it becomes difficult to search them for sequences with significant homology to a query sequence. In this paper, we present an N-gram indexing method that retrieves similar sequences quickly and precisely. This method regards a protein sequence as a text written in a language of 20 amino acid codes, and adopts fixed-length N-gram tokens as its indexing scheme for sequence strings. After such tokens are indexed for all the sequences in the database, sequences can be searched with information retrieval algorithms. Using this new method, we have developed a protein sequence search system named ProSeS (PROtein Sequence Search). ProSeS is a protein sequence analysis system that provides overall analysis results such as similar sequences with significant homologies, predicted subcellular locations of the query sequence, and major keywords extracted from annotations of similar sequences. We show experimentally that the N-gram indexing approach significantly reduces retrieval time, and that it is as accurate as the currently popular search tool BLAST.
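The indexing scheme described here can be sketched as a small inverted index in Python. This is an illustration only and not ProSeS itself: the function names, the toy sequences, and the ranking by number of shared N-grams are assumptions, since the abstract does not specify which information retrieval scoring the system uses.

```python
from collections import Counter, defaultdict


def ngrams(sequence, n):
    """Return all overlapping fixed-length tokens of a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def build_index(database, n):
    """Map each N-gram token to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for token in ngrams(seq, n):
            index[token].add(seq_id)
    return index


def search(query, index, n, top=5):
    """Rank sequences by how many distinct query N-grams they share."""
    hits = Counter()
    for token in set(ngrams(query, n)):
        for seq_id in index.get(token, ()):
            hits[seq_id] += 1
    return hits.most_common(top)


# Toy database with made-up sequences (hypothetical data).
database = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYLLKQR",
    "p3": "GGGGGGGGGG",
}
index = build_index(database, n=3)
print(search("KTAYIAK", index, n=3))
```

The index is built once, and each query then touches only the posting sets of its own tokens instead of scanning every sequence. This is where the retrieval-time saving over alignment-based search comes from.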

Reinterpretation of the protein identification process for proteomics data

  • Kwon, Kyung-Hoon;Lee, Sang-Kwang;Cho, Kun;Park, Gun-Wook;Kang, Byeong-Soo;Park, Young-Mok
    • Interdisciplinary Bio Central / v.1 no.3 / pp.9.1-9.6 / 2009
  • Introduction: In mass spectrometry-based proteomics, biological samples are analyzed to identify proteins by mass spectrometry and database search. Database search is the process of selecting, from an amino acid sequence database, the best matches to the experimental mass spectra; the protein is then identified as the matched sequence. A match score is defined to find the matches in the database, and the highest-scoring hit is declared the most probable protein. The search result varies with the score definition. In this study, the differences among the search results of different search engines and different databases were investigated, in order to suggest a better way to identify more proteins with higher reliability. Materials and Methods: The protein extract of human mesenchymal stem cells was separated into several bands by one-dimensional electrophoresis. The bands of the one-dimensional gel were excised one by one, digested with trypsin, and analyzed by a mass spectrometer, FT LTQ. The tandem mass (MS/MS) spectra of peptide ions were searched with the X!Tandem, Mascot, and Sequest search engines against the IPI human database and the SwissProt database. The search results were filtered at several threshold probability values using the Trans-Proteomic Pipeline (TPP) of the Institute for Systems Biology, and the output generated by TPP was analyzed. Results and Discussion: For each MS/MS spectrum, the peptide sequences identified under different conditions, such as search engine, threshold probability, and sequence database, were compared. The main difference in peptide identification at high threshold probability was caused not by the difference in sequence database but by the difference in score. As the threshold probability decreased, the missed peptides appeared. Conversely, at an extremely high threshold level, we missed many true assignments. Conclusion and Prospects: The different identification results of the search engines were mainly caused by their different scoring algorithms. Usually in proteomics, high-scored peptides are selected and low-scored peptides are discarded. Many of them are true negatives. By integrating the search results from different parameters and different search engines, the protein identification process can be improved.

Mollusks Sequence Database: Version II (연체동물 전용 BLAST 서버 업데이트 (Version II))

  • Kang, Se Won;Hwang, Hee Ju;Park, So Young;Wang, Tae Hun;Park, Eun Bi;Lee, Tae Hee;Hwang, Ui Wook;Lee, Jun-Sang;Park, Hong Seog;Han, Yeon Soo;Lim, Chae Eun;Kim, Soonok;Lee, Yong Seok
    • The Korean Journal of Malacology / v.30 no.4 / pp.429-431 / 2014
  • Since we reported a BLAST server for the mollusk in 2004, no work has reported the usability or modification of the server. To improve its usability, the BLAST server for the mollusk has been updated as version II (http://www.malacol.or.kr/blast) in the present study. The database was constructed by using the Intel server Platform ZSS130 dual Xeon 3.20 GHz CPU and Linux CentOS system and with NCBI WebBLAST package. We downloaded the mollusk nucleotide, amino acid, EST, GSS and mitochondrial genome sequences which can be opened through NCBI web BLAST and used them to build up the database. The updated database consists of 520,977 nucleotide sequences, 229,857 amino acid sequences, 586,498 EST sequences, 23,112 GSS and 565 mitochondrial genome sequences. Total database size is 1.2 GB. Furthermore, we have added repeat sequences, Escherichia coli sequences and vector sequences to facilitate data validation. The newly updated BLAST server for the mollusk will be useful for many malacological researchers as it will save time to identify and study various molluscan genes.

Selection of Putative Iron-responsive Elements by Iron Regulatory Protein-2

  • Kim, Hae-Yeong
    • Journal of Applied Biological Chemistry / v.42 no.2 / pp.62-65 / 1999
  • Iron regulatory proteins (IRPs) 1 and 2 bind with equally high affinity to specific RNA stem-loop sequences known as iron-responsive elements (IREs), which mediate the post-transcriptional regulation of many genes of iron metabolism. To study putative IRE-like sequences in RNA transcripts using the IRP-IRE interaction, eight known genes were selected from a database and the binding activities of their IRE-like sequences to IRP-2 were compared. Among them, the IRE-like sequence in the 3'-untranslated region (UTR) of divalent cation transporter-1 (DCT-1) showed significant RNA binding affinity. This finding suggests that the IRE consensus sequence present within the 3'-UTR of DCT-1 might confer regulation by IRP-2.

Gene Rearrangement through 151 bp Repeated Sequence in Rice Chloroplast DNA (벼 엽록체 DNA내의 151 bp 반복염기서열에 의한 유전자 재배열)

  • Nahm, Baek-Hie;Kim, Han-Jip
    • Applied Biological Chemistry / v.36 no.3 / pp.208-214 / 1993
  • To investigate gene rearrangement via short repeated sequences in chloroplast DNA, the pattern of heterologous gene clusters containing the 151 bp repeated sequence was compared across stages of plastid development in rice, and homologous gene clusters from various plant sources were searched for comparative analysis. Southern blot analysis of rice DNA using the rp12 gene, which contains the 151 bp repeated sequence, as a probe showed the presence of heterologous gene clusters. These heterologous gene clusters varied with the development of the plastid. They were also observed in all of the rice cultivars used in this work. Finally, comparative analysis of the DNA sequences of the homologous gene clusters from various plants showed evolutionary gene rearrangement via short repeated sequences among plants. These results suggest a possible relationship between plastid development and gene rearrangement through short repeated sequences.

Bioinformatics for the Korean Functional Genomics Project

  • Kim, Sang-Soo
    • Proceedings of the Korean Society for Bioinformatics Conference / 2000.11a / pp.45-52 / 2000
  • The genomic approach produces massive amounts of data within a short time period. New high-throughput automatic sequencers can generate over a million nucleotides of sequence information overnight. A typical DNA chip experiment produces tens of thousands of expression measurements, not to mention tens of megabytes of image files. These data must be handled automatically by computer and stored in electronic databases. Thus there is a need for a systematic approach to data collection, processing, and analysis. DNA sequence information is translated into amino acid sequence and is analyzed for key motifs related to its biological and/or biochemical function. Functional genomics will play a significant role in identifying novel drug targets and diagnostic markers for serious diseases. As an enabling technology for functional genomics, bioinformatics is in great need worldwide. In Korea, a new functional genomics project has recently been launched, and it focuses on identifying genes associated with cancers prevalent in Korea, namely gastric and hepatic cancers. This involves gene discovery by high-throughput sequencing of cancer cDNA libraries, gene expression profiling by DNA microarray and proteomics, and SNP profiling in the Korean patient population. Our bioinformatics team will support all these activities by collecting, processing, and analyzing these data.