• Title/Summary/Keyword: sequence database

An Approach for a Substitution Matrix Based on Protein Blocks and Physicochemical Properties of Amino Acids through PCA

  • You, Youngki;Jang, Inhwan;Lee, Kyungro;Kim, Heonjoo;Lee, Kwanhee
    • Interdisciplinary Bio Central
    • /
    • v.6 no.4
    • /
    • pp.3.1-3.10
    • /
    • 2014
  • Amino acid substitution matrices are essential tools for protein sequence analysis, homology searches in protein databases, and multiple sequence alignment. The PAM matrix was the first widely used amino acid substitution matrix, and the BLOSUM series later succeeded it. Most substitution matrices were developed from the statistical frequency of substitution between amino acids in blocks representing groups of protein families or related proteins. However, amino acid substitution depends on the similarity of the physicochemical properties of the amino acids. In this study, a new approach was used to identify the major physicochemical properties in multiple sequence alignments. Amino acid substitution frequencies from a multiple sequence alignment database were merged with selected amino acid attributes from a physicochemical properties database. Principal component analysis of the merged data revealed the major physicochemical properties. Using factor analysis, the four principal components were interpreted as flexibility of electronic movement, polarity, negative charge, and structural flexibility. Applying these four components, BAPS was constructed and validated for accuracy. In a comparison of receiver operating characteristic ($ROC_{50}$) values, BAPS scored slightly lower than BLOSUM and PAM. However, when accuracy was evaluated by comparing multiple sequence alignment results with the structural alignments of two test data sets with known three-dimensional structures in the homologous structure alignment database, BAPS performed comparably to or better than prior matrices including PAM, Gonnet, Identity, and the Genetic code matrix.

Reinterpretation of the protein identification process for proteomics data

  • Kwon, Kyung-Hoon;Lee, Sang-Kwang;Cho, Kun;Park, Gun-Wook;Kang, Byeong-Soo;Park, Young-Mok
    • Interdisciplinary Bio Central
    • /
    • v.1 no.3
    • /
    • pp.9.1-9.6
    • /
    • 2009
  • Introduction: In mass spectrometry-based proteomics, biological samples are analyzed to identify proteins using a mass spectrometer and a database search. The database search selects the best matches to the experimental mass spectra from an amino acid sequence database, and the protein is identified as the matched sequence. A match score is defined to find matches in the database, and the highest-scoring hit is declared the most probable protein. The search result therefore varies with the score definition. In this study, the differences among search results from different search engines and different databases were investigated in order to suggest a better way to identify more proteins with higher reliability. Materials and Methods: A protein extract of human mesenchymal stem cells was separated into several bands by one-dimensional electrophoresis. The bands of the one-dimensional gel were excised one by one, digested with trypsin, and analyzed on an FT LTQ mass spectrometer. The tandem mass (MS/MS) spectra of the peptide ions were searched with the X!Tandem, Mascot, and Sequest search engines against the IPI human database and the SwissProt database. The search results were filtered at several threshold probability values with the Trans-Proteomic Pipeline (TPP) of the Institute for Systems Biology, and the output generated by TPP was analyzed. Results and Discussion: For each MS/MS spectrum, the peptide sequences identified under different conditions (search engine, threshold probability, and sequence database) were compared. At high threshold probability, the main difference in peptide identification was caused not by the difference in sequence database but by the difference in score. As the threshold probability decreased, the missed peptides appeared. Conversely, at extremely high threshold levels, many true assignments were missed.
Conclusion and Prospects: The different identification results of the search engines were mainly caused by their different scoring algorithms. In proteomics, high-scoring peptides are usually selected and low-scoring peptides are discarded, yet many of the discarded peptides are in fact true assignments. By integrating the search results from different parameters and different search engines, the protein identification process can be improved.
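
The integration step described in this abstract could be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the record layout (spectrum id, peptide, probability), the function name `integrate_hits`, and all sample values are hypothetical.

```python
from collections import defaultdict

def integrate_hits(engine_hits, threshold):
    """Combine peptide assignments from several search engines.

    engine_hits maps an engine name to a list of
    (spectrum_id, peptide, probability) tuples (a hypothetical layout).
    Returns {spectrum_id: {peptide: [engines reporting it]}} for all
    assignments whose probability is at least `threshold`.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for engine, hits in engine_hits.items():
        for spectrum_id, peptide, prob in hits:
            if prob >= threshold:
                merged[spectrum_id][peptide].append(engine)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {s: dict(p) for s, p in merged.items()}

# Made-up values, for illustration only.
hits = {
    "X!Tandem": [("s1", "PEPTIDEK", 0.97), ("s2", "LSVDAR", 0.62)],
    "Mascot": [("s1", "PEPTIDEK", 0.91), ("s2", "LSVDAR", 0.93)],
    "Sequest": [("s1", "PEPTLDEK", 0.55), ("s3", "AGNTK", 0.96)],
}
result = integrate_hits(hits, threshold=0.9)
```

In this toy input, spectrum `s2` would be lost if only the first engine were consulted, but it is retained once the engines are combined, which is the kind of gain the abstract argues for.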

Construction of EST Database for Comparative Gene Studies of Acanthamoeba

  • Moon, Eun-Kyung;Kim, Joung-Ok;Xuan, Ying-Hua;Yun, Young-Sun;Kang, Se-Won;Lee, Yong-Seok;Ahn, Tae-In;Hong, Yeon-Chul;Chung, Dong-Il;Kong, Hyun-Hee
    • Parasites, Hosts and Diseases
    • /
    • v.47 no.2
    • /
    • pp.103-107
    • /
    • 2009
  • The genus Acanthamoeba can cause severe infections in humans, such as granulomatous amebic encephalitis and amebic keratitis. However, little genomic information on Acanthamoeba has been reported. Here, we constructed an Acanthamoeba expressed sequence tag (EST) database (Acanthamoeba EST DB) derived from our 4 kinds of Acanthamoeba cDNA libraries. The Acanthamoeba EST DB contains 3,897 ESTs generated from amebae under various conditions (long-term in vitro culture, mouse brain passage, or encystation), together with Acanthamoeba data downloaded from the National Center for Biotechnology Information (NCBI) and the Taxonomically Broad EST Database (TBestDB). Almost all reported cDNA/genomic sequences of Acanthamoeba are provided through a stand-alone BLAST system with nucleotide (BLAST NT) and amino acid (BLAST AA) sequence databases. In the BLAST results, each gene links to relevant information including sequence data, gene orthology annotations, relevant references, and a BlastX result. This is the first attempt to construct an Acanthamoeba database with genes expressed under diverse conditions. These data were integrated into a database (http://www.amoeba.or.kr).

Predict Protein Secondary Structure based on Emerging Sequence Mining (출현 시퀀스 마이닝 기반의 단백질 2 차 구조 예측)

  • Li, Meijing;Lee, Heon Gyu;Saeed, Khalid E.K.;Shon, Ho Sun;Ryu, Keun Ho
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2009.04a
    • /
    • pp.379-382
    • /
    • 2009
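  • Recent sequence-comparison and structure-comparison techniques for protein function prediction enable accurate classification, but they incur considerable complexity when classifying new protein functions. This paper therefore proposes a classification method based on emerging sequences for faster classification and prediction of protein structure. The method first applies an emerging sequence mining algorithm to protein sequence data to discover emerging sequences for four types of protein secondary structure, and then uses an SVM to predict a protein's secondary structure from its emerging-sequence attributes.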
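
To give a rough feel for what mining emerging sequences involves, the sketch below selects contiguous k-mers whose support rises sharply from one class to another. It is a simplification under stated assumptions: restricting patterns to contiguous k-mers, the growth-rate criterion (support in the target class divided by support in the background class), and the names `support` and `emerging_kmers` are choices made here for illustration, not details taken from the paper. The SVM step is not shown.

```python
def support(pattern, sequences):
    """Fraction of sequences that contain `pattern` as a substring."""
    if not sequences:
        return 0.0
    return sum(pattern in s for s in sequences) / len(sequences)

def emerging_kmers(target, background, k, min_growth):
    """Return {k-mer: growth rate} for k-mers drawn from `target` whose
    support grows by at least `min_growth` relative to `background`.
    A k-mer absent from `background` gets an infinite growth rate."""
    candidates = {s[i:i + k] for s in target for i in range(len(s) - k + 1)}
    result = {}
    for kmer in candidates:
        supp_target = support(kmer, target)
        supp_background = support(kmer, background)
        if supp_background == 0:
            growth = float("inf")
        else:
            growth = supp_target / supp_background
        if growth >= min_growth:
            result[kmer] = growth
    return result

# Toy sequences, made up for illustration.
class_a = ["AAB", "AAC"]
class_b = ["AAD", "XYZ"]
patterns = emerging_kmers(class_a, class_b, k=2, min_growth=2.0)
```

The selected patterns could then be turned into binary presence/absence features for a classifier such as an SVM.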

A Design for Efficient Similar Subsequence Search with a Priority Queue and Suffix Tree in Image Sequence Databases (이미지 시퀀스 데이터베이스에서 우선순위 큐와 접미어 트리를 이용한 효율적인 유사 서브시퀀스 검색의 설계)

  • 김인범
    • Journal of the Korea Computer Industry Society
    • /
    • v.4 no.4
    • /
    • pp.613-624
    • /
    • 2003
  • This paper proposes a design for efficient and accurate retrieval of similar image subsequences from an image sequence database, using the multi-dimensional time warping distance as the similarity measure and two index structures built on a priority queue and a suffix tree, respectively. On receiving a query image sequence, the proposed method first searches the priority queue index structure for a candidate set of similar image subsequences. If this does not yield satisfactory results, it retrieves another candidate set from the suffix tree index structure in a second step. A lower-bound distance function removes dissimilar subsequences without false dismissals while the query image sequence is compared with the sequences stored in the two index structures.
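
The pruning idea in this abstract can be illustrated with a small sketch. It assumes a plain dynamic-programming time warping distance over feature vectors and a very simple lower bound based on the first and last elements; the paper's actual lower-bound function and index structures are not described here, so the bound and the names `dtw`, `lower_bound`, and `search` are illustrative assumptions only.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(q, s):
    """Time warping distance: minimum summed element distance over all
    monotone alignments of the non-empty sequences q and s."""
    n, m = len(q), len(s)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(q[i - 1], s[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def lower_bound(q, s):
    """Every alignment pairs first with first and last with last, so the
    larger of those two distances can never exceed dtw(q, s)."""
    return max(dist(q[0], s[0]), dist(q[-1], s[-1]))

def search(query, stored, eps):
    """Return indices of stored sequences whose distance to query is <= eps."""
    hits = []
    for idx, s in enumerate(stored):
        if lower_bound(query, s) > eps:
            continue  # safe to skip: dtw >= lower_bound > eps
        if dtw(query, s) <= eps:
            hits.append(idx)
    return hits

# Toy two-dimensional feature vectors, made up for illustration.
query = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
stored = [
    [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0)],
    [(5.0, 5.0), (6.0, 6.0)],
    [(0.0, 0.0), (2.0, 2.0)],
]
matches = search(query, stored, eps=1.5)
```

Because the bound never overestimates the true distance, skipping a candidate on the basis of the bound cannot discard a real match, which is what "without false dismissals" means.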

Applied Computational Tools for Crop Genome Research

  • Love Christopher G;Batley Jacqueline;Edwards David
    • Journal of Plant Biotechnology
    • /
    • v.5 no.4
    • /
    • pp.193-195
    • /
    • 2003
  • A major goal of agricultural biotechnology is the discovery of genes or genetic loci associated with characteristics beneficial to crop production. Knowledge of these genetic loci may then be applied to improve crop breeding. Agriculturally important genes may also benefit crop production through transgenic technologies. Recent years have seen high-throughput technologies applied to agricultural biotechnology, leading to the production of large amounts of genomic data. The challenge today is to structure these data effectively so that researchers can search, filter and, importantly, make robust associations within a wide variety of datasets. At the Plant Biotechnology Centre, Primary Industries Research Victoria in Melbourne, Australia, we have developed a series of tools and computational pipelines to assist in the processing and structuring of genomic data to aid its application to agricultural biotechnology research. These tools include a sequence database, ASTRA, for the processing and annotation of expressed sequence tag data. Tools have also been developed for the discovery of simple sequence repeat (SSR) and single nucleotide polymorphism (SNP) molecular markers from large sequence datasets. Application of these tools to Brassica research has assisted in the production of genetic and comparative physical maps as well as candidate gene discovery for a range of agronomically important traits.
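
As a hedged illustration of what SSR discovery involves, the sketch below finds perfect tandem repeats in a DNA string with a regular expression. It is not the tool described in the abstract: the function name `find_ssrs`, the default unit lengths, and the minimum repeat count are assumptions chosen for the example.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Find perfect simple sequence repeats in a DNA string.

    Returns (start, unit, repeat_count) tuples with 0-based start
    positions. The lazy quantifier makes the shortest repeating unit
    win at each position.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq.upper())
    ]

# Toy sequence containing an (AC)5 and an (ATC)5 repeat.
ssrs = find_ssrs("TTGACACACACACGGATCATCATCATCATCTT")
```

Real marker-discovery pipelines typically also handle imperfect repeats and design flanking primers; this sketch covers only the detection of perfect repeats.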

Identification of Viral Taxon-Specific Genes (VTSG): Application to Caliciviridae

  • Kang, Shinduck;Kim, Young-Chang
    • Genomics & Informatics
    • /
    • v.16 no.4
    • /
    • pp.23.1-23.5
    • /
    • 2018
  • Virus taxonomy was initially determined by clinical experiments based on phenotype. With the development of sequence analysis methods, genotype-based classification was also applied, and with advances in genome sequence analysis technology there is an increasing demand for virus taxonomy to be extended from in vivo and in vitro to in silico. In this study, we verified the consistency of the current International Committee on Taxonomy of Viruses taxonomy using an in silico approach, aiming to identify the specific sequence for each virus. We applied this approach to norovirus in Caliciviridae, which causes 90% of gastroenteritis cases worldwide. Based on the dogma "protein structure determines its function," we hypothesized that the specific sequence can be identified by the specific structure. First, we extracted the coding region (CDS). Second, the CDS protein sequences of each genus were annotated by a conserved domain database (CDD) search. Finally, the conserved domains of each genus in Caliciviridae were classified by RPS-BLAST with CDD. The analysis showed that the Caliciviridae share sequences that include RNA helicase. In the case of Norovirus, the Calicivirus coat protein C-terminal and viral polyprotein N-terminal domains appear as specific domains; they are not found in the other genera of Caliciviridae. If this method is used to detect specific conserved domains, they can serve as classification keywords based on protein functional structure. After the specific protein domains are determined, their sequences can be converted to gene sequences, which could then be reused as viral biomarkers.
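
Once each genus has a list of conserved-domain hits, separating shared domains from genus-specific ones reduces to set operations. The sketch below illustrates that last step only. The genus labels `GenusB` and `GenusC`, the placeholder `domain_X`, and the assignment of domains to genera are hypothetical inputs made up for the example, not results from the paper, and the function name `shared_and_specific` is likewise an assumption.

```python
def shared_and_specific(domains_by_genus):
    """Split domain annotations into those shared by every genus and
    those found in exactly one genus.

    domains_by_genus maps a genus name to a set of domain names.
    Returns (shared, specific) where specific maps genus -> set.
    """
    shared = set.intersection(*domains_by_genus.values())
    specific = {}
    for genus, domains in domains_by_genus.items():
        # Union of the domains seen in all other genera.
        others = set().union(
            *(d for g, d in domains_by_genus.items() if g != genus)
        )
        specific[genus] = domains - others
    return shared, specific

# Hypothetical domain hits, for illustration only.
hits = {
    "Norovirus": {
        "RNA helicase",
        "Calicivirus coat protein C terminal",
        "viral polyprotein N-terminal",
    },
    "GenusB": {"RNA helicase", "domain_X"},
    "GenusC": {"RNA helicase"},
}
shared, specific = shared_and_specific(hits)
```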

Analyses of Expressed Sequence Tags from Chironomus riparius Using Pyrosequencing : Molecular Ecotoxicology Perspective

  • Nair, Prakash M. Gopalakrishnan;Park, Sun-Young;Choi, Jin-Hee
    • Environmental Analysis Health and Toxicology
    • /
    • v.26
    • /
    • pp.10.1-10.7
    • /
    • 2011
  • Objectives: Chironomus riparius, a non-biting midge (Chironomidae, Diptera), is extensively used as a model organism in aquatic ecotoxicological studies. Despite the potential of C. riparius larvae as a bio-monitoring species, little is known about its genome sequences. This study reports the results of an expressed sequence tag (EST) sequencing project conducted on C. riparius larvae using 454 pyrosequencing. Methods: To gain a better understanding of the C. riparius transcriptome, we generated an EST database of C. riparius using the pyrosequencing method. Results: Sequencing runs using normalized cDNA collections from fourth instar larvae yielded 20,020 expressed sequence tags, which were assembled into 8,565 contigs and 11,455 singletons. Sequence analysis was performed by BlastX search against the National Center for Biotechnology Information (NCBI) non-redundant (nr) and UniProt protein databases. Based on the gene ontology classifications, 24% (E-value $\leq 10^{-5}$) of the sequences had known gene functions, 24% had unknown functions, and 52% did not match any known sequences in the existing databases. Sequence comparison revealed that 81% of the genes have homologs in other insects of the order Diptera, providing tools for comparative genome analyses. Targeted searches using these annotations identified genes associated with essential metabolic pathways, signaling pathways, detoxification of toxic metabolites, and stress responses of ecotoxicological interest. Conclusions: The results obtained from this study should eventually make ecotoxicogenomics possible in a truly environmentally relevant species such as C. riparius.

NBLAST: a graphical user interface-based two-way BLAST software with a dot plot viewer

  • Choi, Beom-Soon;Choi, Seon Kang;Kim, Nam-Soo;Choi, Ik-Young
    • Genomics & Informatics
    • /
    • v.20 no.3
    • /
    • pp.36.1-36.6
    • /
    • 2022
  • BLAST, a basic bioinformatics tool for searching local sequence similarity, has been one of the most widely used bioinformatics programs since its introduction in 1990. Users generally run BLAST analyses with the web-based NCBI-BLAST program. However, users with large sequence data often run into the upload size limit of the web-based program. This is inconvenient because scientists often want to run BLAST on their own data, such as transcriptome or whole genome sequences. To overcome this issue, we developed NBLAST, a graphical user interface-based BLAST program that employs a two-way system, allowing input sequences to be used as either "query" or "target" in the BLAST analysis. NBLAST is also equipped with a dot plot viewer, allowing researchers to create a custom database for BLAST and run a dot plot similarity analysis within a single program. NBLAST is available at http://nbitglobal.com/nblast.
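
For readers unfamiliar with dot plots, the sketch below shows the basic idea: mark every position pair where a short word of the query matches a word of the target. It is a generic illustration and says nothing about how NBLAST implements its viewer; the function names `dot_plot` and `render` and the word size are assumptions.

```python
def dot_plot(query, target, word=3):
    """Return sorted (i, j) pairs where query[i:i+word] == target[j:j+word]."""
    # Index every word of the target by its start positions.
    index = {}
    for j in range(len(target) - word + 1):
        index.setdefault(target[j:j + word], []).append(j)
    return sorted(
        (i, j)
        for i in range(len(query) - word + 1)
        for j in index.get(query[i:i + word], [])
    )

def render(query, target, points):
    """Draw the plot as text: rows follow the query, columns the target."""
    grid = [["." for _ in target] for _ in query]
    for i, j in points:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)

points = dot_plot("ACGTAC", "TACGTA", word=3)
picture = render("ACGTAC", "TACGTA", points)
```

Runs of matches along a diagonal indicate a shared region; here the diagonal is offset by one column because the target is a rotation of the query.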