• Title/Summary/Keyword: Protein sequence search


Protein Sequence Search based on N-gram Indexing

  • Hwang, Mi-Nyeong;Kim, Jin-Suk
    • Bioinformatics and Biosystems / v.1 no.1 / pp.46-50 / 2006
  • With advances in experimental techniques in molecular biology, genomic and protein sequence databases are growing exponentially in size, and mean sequence lengths are also increasing. As these databases grow, it becomes difficult to search them for sequences with significant homology to a query sequence. In this paper, we present an N-gram indexing method for retrieving similar sequences quickly and precisely, with accuracy comparable to existing tools. The method treats a protein sequence as text written in a language of 20 amino acid codes and adopts fixed-length N-gram tokens as its indexing scheme for sequence strings. Once these tokens are indexed for all sequences in the database, sequences can be searched with information retrieval algorithms. Using this method, we have developed a protein sequence search system named ProSeS (PROtein Sequence Search). ProSeS is a protein sequence analysis system that provides overall analysis results such as similar sequences with significant homology, predicted subcellular locations of the query sequence, and major keywords extracted from the annotations of similar sequences. We show experimentally that the N-gram indexing approach reduces retrieval time significantly and is as accurate as the popular search tool BLAST.

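The indexing idea in this abstract can be illustrated with a minimal sketch. It is not the ProSeS implementation: the function names, the choice of N = 3, the toy sequences, and the ranking by count of shared N-grams are all assumptions made for illustration.

```python
from collections import defaultdict

def ngrams(seq, n=3):
    """Return all overlapping fixed-length n-grams of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def build_index(db, n=3):
    """Inverted index: map each n-gram to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for gram in ngrams(seq, n):
            index[gram].add(seq_id)
    return index

def search(index, query, n=3):
    """Rank sequences by how many distinct query n-grams they share (assumed scoring)."""
    scores = defaultdict(int)
    for gram in set(ngrams(query, n)):
        for seq_id in index.get(gram, ()):
            scores[seq_id] += 1
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy database of protein sequences.
db = {"P1": "MKTAYIAKQR", "P2": "MKTAYLAKQR", "P3": "GGSGGSGGSG"}
index = build_index(db)
hits = search(index, "KTAYIAK")
print(hits)
```

Because the index is built once, each query touches only the posting lists of its own N-grams rather than scanning every sequence, which is where the retrieval-time saving would come from.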

Proteomics Data Analysis using Representative Database

  • Kwon, Kyung-Hoon;Park, Gun-Wook;Kim, Jin-Young;Park, Young-Mok;Yoo, Jong-Shin
    • Bioinformatics and Biosystems / v.2 no.2 / pp.46-51 / 2007
  • In mass spectrometry-based proteomics research, a protein database search yields protein information from the peptide sequences that best match the tandem mass spectra. The protein sequence database has been a powerful knowledge base for this protein identification. However, as protein sequence information accumulates, the database becomes very large, and it is now hard to consider every protein sequence in a database search because doing so consumes too much computing time. For high-throughput proteome analysis, a non-redundant refined database such as the IPI human database of the European Bioinformatics Institute is usually used. While a non-redundant database returns search results quickly, it misses variation in the protein sequences. In this study, we examined proteomics data from the viewpoint of protein similarity and used a network analysis tool to build a new analysis method. This method is expected to save computing time in the database search while retaining the sequence variation needed to detect modified peptides.


Reinterpretation of the protein identification process for proteomics data

  • Kwon, Kyung-Hoon;Lee, Sang-Kwang;Cho, Kun;Park, Gun-Wook;Kang, Byeong-Soo;Park, Young-Mok
    • Interdisciplinary Bio Central / v.1 no.3 / pp.9.1-9.6 / 2009
  • Introduction: In mass spectrometry-based proteomics, biological samples are analyzed to identify proteins by mass spectrometry and database search. Database search is the process of selecting, from an amino acid sequence database, the best matches to the experimental mass spectra; the protein is then identified as the matched sequence. A match score is defined to find matches in the database, and the highest-scoring hit is declared the most probable protein. The search result therefore varies with the score definition. In this study, the differences among the search results of different search engines and different databases were investigated in order to suggest a better way to identify more proteins with higher reliability. Materials and Methods: The protein extract of human mesenchymal stem cells was separated into several bands by one-dimensional electrophoresis. The gel bands were excised one by one, digested with trypsin, and analyzed by a mass spectrometer, FT LTQ. The tandem mass (MS/MS) spectra of peptide ions were searched with the X!Tandem, Mascot, and Sequest search engines against the IPI human database and the SwissProt database. The search results were filtered at several threshold probability values with the Trans-Proteomic Pipeline (TPP) of the Institute for Systems Biology, and the output generated by TPP was analyzed. Results and Discussion: For each MS/MS spectrum, the peptide sequences identified under different conditions (search engine, threshold probability, and sequence database) were compared. At high threshold probability, the main difference in peptide identification was caused not by the difference in sequence database but by the difference in score. As the threshold probability decreased, peptides that had been missed appeared. Conversely, at an extremely high threshold level, many true assignments were missed.
Conclusion and Prospects: The differences among the identification results of the search engines were caused mainly by their different scoring algorithms. In proteomics, high-scoring peptides are usually selected and low-scoring peptides are discarded, yet many of the discarded peptides are true assignments. By integrating the search results from different parameters and different search engines, the protein identification process can be improved.
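One way to picture the integration suggested in the conclusion is a simple voting scheme across engines. The sketch below is an assumption for illustration only: the engine names, spectrum ids, peptides, probabilities, and the "at least two engines agree" rule are hypothetical and are not taken from the paper or from TPP.

```python
from collections import Counter

def filter_hits(hits, threshold):
    """Keep the peptide assigned to each spectrum if its probability passes the threshold."""
    return {spec: pep for spec, (pep, prob) in hits.items() if prob >= threshold}

def consensus(engine_hits, threshold, min_votes=2):
    """Accept a peptide for a spectrum when at least min_votes engines agree on it."""
    votes = {}
    for hits in engine_hits.values():
        for spec, pep in filter_hits(hits, threshold).items():
            votes.setdefault(spec, Counter())[pep] += 1
    result = {}
    for spec, counter in votes.items():
        pep, count = counter.most_common(1)[0]
        if count >= min_votes:
            result[spec] = pep
    return result

# Hypothetical results: spectrum id -> (peptide, probability) per engine.
engine_hits = {
    "engine_a": {"s1": ("PEPTIDEK", 0.95), "s2": ("SAMPLER", 0.60)},
    "engine_b": {"s1": ("PEPTIDEK", 0.70), "s2": ("SAMPLER", 0.65)},
    "engine_c": {"s1": ("PEPTIDEK", 0.40), "s2": ("OTHERK", 0.92)},
}
strict = filter_hits(engine_hits["engine_a"], 0.9)
combined = consensus(engine_hits, 0.5)
print(strict, combined)
```

In this toy data, spectrum `s2` is discarded by a single engine at a strict threshold but recovered once agreement between two engines at a lower threshold is taken into account.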

A Performance Comparison of Protein Profiles for the Prediction of Protein Secondary Structures

  • Chi, Sang-Mun
    • Journal of the Korea Institute of Information and Communication Engineering / v.22 no.1 / pp.26-32 / 2018
  • Protein secondary structures provide important information for studying the evolution, structure, and function of proteins. Recently, deep learning methods have been actively applied to predict protein secondary structure using only protein sequence information. In these methods, the most widely used input features are protein profiles derived from protein sequences. In this paper, to obtain effective protein profiles, we constructed profiles using protein sequence search methods such as PSI-BLAST and HHblits. We adjusted the similarity threshold that determines which homologous protein sequences are used to construct the profile, as well as the number of iterations of profile construction using the homologous sequence information. We used the protein profiles as inputs to convolutional neural networks and recurrent neural networks to predict the secondary structures. The protein profile created by adding evolutionary information only once was effective.
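To make the notion of a protein profile concrete, the following sketch computes per-position amino acid frequencies from a set of already aligned homologs. It is a simplified stand-in, not the PSI-BLAST or HHblits procedure: the toy alignment is hypothetical, and real profiles add sequence weighting, pseudocounts, and log-odds scoring.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_profile(alignment):
    """Return one 20-element frequency vector per alignment column (gaps ignored)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        column = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        total = len(column) or 1  # avoid division by zero for all-gap columns
        profile.append([column.count(aa) / total for aa in AMINO_ACIDS])
    return profile

# Hypothetical alignment of a query with three homologs ('-' marks a gap).
alignment = ["MKV", "MRV", "MKI", "MK-"]
profile = frequency_profile(alignment)
print(profile[1][AMINO_ACIDS.index("K")])
```

Each row of the resulting matrix is the kind of per-residue feature vector that could be fed to a convolutional or recurrent network.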

Differentially Expressed Genes of Potentially Allelopathic Rice in Response against Barnyardgrass

  • Junaedi, Ahmad;Jung, Woo-Suk;Chung, Ill-Min;Kim, Kwang-Ho
    • Journal of Crop Science and Biotechnology / v.10 no.4 / pp.231-236 / 2007
  • Differentially expressed genes (DEGs) were identified in Sathi, an indica rice variety showing high allelopathic potential against barnyardgrass (Echinochloa crus-galli (L.) Beauv. var. frumentaceae). Rice plants were grown with and without barnyardgrass, and total RNA was extracted from rice leaves 45 days after seeding. Full DEG screening was performed by the GeneFishing™ method. The differentially expressed bands were re-amplified, sequenced, and then analyzed with the Basic Local Alignment Search Tool (BLAST) to identify homologous sequences. Gel electrophoresis showed nine possible genes associated with allelopathic potential in Sathi: six genes (DEG-1, 4, 5, 7, 8, and 9) showed higher expression and three genes (DEG-2, 3, and 6) showed lower expression than the control. cDNA sequence analysis showed that DEG-7 and DEG-9 had the same sequence. Based on the RT-PCR results, DEG-6 and DEG-7 were considered true DEGs, whereas DEG-1, 2, 3, 4, 5, and 8 were considered putative DEGs. Results from blastn and blastx searches suggested the following: DEG-1 is homologous to a gene for S-adenosylmethionine synthetase; DEG-2 is homologous to a chloroplast gene for the ribulose 1,5-bisphosphate carboxylase large subunit; DEG-8 is homologous to oxysterol-binding protein, with 85.7% sequence similarity; DEG-5 is homologous to histone 2B protein, with 47.9% sequence similarity; DEG-6 is homologous to nicotianamine aminotransferase, with 33.1% sequence similarity; DEG-3 has 98.8% similarity with a nucleotide sequence that has 33.1% similarity with the oxygen-evolving complex protein in photosystem II; DEG-7 is homologous to a nucleotide sequence that may be related to a putative serine/threonine protein kinase and a putative transposable element; and DEG-4 has 98.8% similarity with a nucleotide sequence for an unknown protein.


A Pattern Summary System Using BLAST for Sequence Analysis

  • Choi, Han-Suk;Kim, Dong-Wook;Ryu, Tae-W.
    • Genomics & Informatics / v.4 no.4 / pp.173-181 / 2006
  • Pattern finding is one of the important tasks in protein or DNA sequence analysis, and alignment is the most widely used technique for it. BLAST (Basic Local Alignment Search Tool) is one of the most popular tools in bioinformatics for exploring available DNA and protein sequence databases. For large sequence data containing various sequence patterns, BLAST may generate a huge output. However, BLAST does not provide a tool to summarize and analyze the patterns or matched alignments in its output file, and it lacks general and robust parsing tools for extracting the essential information from that output. This paper presents a pattern summary system, a comprehensive tool for discovering pattern structures in the large amounts of sequence data in BLAST output. The pattern summary system can identify clusters of patterns, extract the cluster pattern sequences from the BLAST subject database, and display the clusters graphically to show their distribution in the subject database.
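The clustering step described above can be sketched as merging overlapping hit intervals on a subject sequence. This is only an illustrative assumption about how clusters might be formed, not the paper's algorithm; the `(start, end)` coordinates are hypothetical values of the kind found in BLAST tabular output.

```python
def cluster_hits(hits):
    """Merge overlapping (start, end) subject intervals into (start, end, hit_count) clusters."""
    clusters = []
    for start, end in sorted(hits):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the current cluster: extend it and count the hit.
            last_start, last_end, count = clusters[-1]
            clusters[-1] = (last_start, max(last_end, end), count + 1)
        else:
            clusters.append((start, end, 1))
    return clusters

# Hypothetical subject coordinates of five alignments.
hits = [(100, 180), (150, 220), (400, 450), (210, 260), (430, 470)]
clusters = cluster_hits(hits)
print(clusters)
```

The per-cluster hit counts give a compact summary of where matches concentrate along the subject, which is the kind of distribution a graphical display would show.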

ORF Miner: a Web-based ORF Search Tool

  • Park, Sin-Gi;Kim, Ki-Bong
    • Genomics & Informatics / v.7 no.4 / pp.217-219 / 2009
  • The primary clue for locating protein-coding regions is the open reading frame (ORF), and determining ORFs is the first step toward gene prediction, especially for prokaryotes. To this end, we have developed a web-based ORF search tool called ORF Miner. ORF Miner is a graphical analysis utility that determines all possible open reading frames of a selectable minimum size in an input sequence. The tool identifies all open reading frames using alternative genetic codes as well as the standard one and reports a list of ORFs with the corresponding deduced amino acid sequences. ORF Miner can be employed for sequence annotation and provides a crucial clue for determining actual protein-coding regions.
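A six-frame ORF search of the kind described can be sketched as follows. This is not ORF Miner's code: the function names, the ATG-to-stop definition of an ORF, and the example sequence are assumptions. The `table` parameter stands in for the selectable genetic code, with the standard code as the default.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ('*' marks a stop).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD)}

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna, min_codons=3, table=CODON_TABLE):
    """Return (strand, frame, protein) for each ATG...stop ORF in all six frames."""
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = None  # None while no start codon has been seen
            for i in range(frame, len(seq) - 2, 3):
                aa = table.get(seq[i:i + 3], "X")
                if protein is None:
                    if aa == "M":
                        protein = "M"
                elif aa == "*":
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame, protein))
                    protein = None
                else:
                    protein += aa
    return orfs

orfs = find_orfs("CCATGAAATTTGGGTAACC")
print(orfs)
```

Frames on the "-" strand are counted from the start of the reverse complement, and `min_codons` counts residues excluding the stop codon. Passing a different codon dictionary as `table` would mimic the use of an alternative genetic code.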

Identification of Viral Taxon-Specific Genes (VTSG): Application to Caliciviridae

  • Kang, Shinduck;Kim, Young-Chang
    • Genomics & Informatics / v.16 no.4 / pp.23.1-23.5 / 2018
  • Virus taxonomy was initially determined by clinical experiments based on phenotype. With the development of sequence analysis methods, genotype-based classification was also applied, and as genome sequence analysis technology advances, there is increasing demand for virus taxonomy to be extended from in vivo and in vitro to in silico approaches. In this study, we verified the consistency of the current International Committee on Taxonomy of Viruses taxonomy using an in silico approach, aiming to identify the specific sequence for each virus. We applied this approach to norovirus in Caliciviridae, which causes 90% of gastroenteritis cases worldwide. Based on the dogma "protein structure determines its function," we hypothesized that the specific sequence can be identified by the specific structure. First, we extracted the coding region (CDS). Second, the CDS protein sequences of each genus were annotated by a conserved domain database (CDD) search. Finally, the conserved domains of each genus in Caliciviridae were classified by RPS-BLAST with CDD. The analysis showed that Caliciviridae sequences have an RNA helicase domain in common. In the case of Norovirus, the Calicivirus coat protein C-terminal and viral polyprotein N-terminal domains appear as specific domains that are not found in the other genera of Caliciviridae. If this method is used to detect specific conserved domains, the domains can serve as classification keywords based on protein functional structure. After the specific protein domains are determined, their sequences can be converted to gene sequences, which could then be reused as viral biomarkers.

A Protein Sequence Prediction Method by Mining Sequence Data

  • Cho, Sun-I;Lee, Do-Heon;Cho, Kwang-Hwi;Won, Yong-Gwan;Kim, Byoung-Ki
    • The KIPS Transactions:PartD / v.10D no.2 / pp.261-266 / 2003
  • A protein, a linear polymer of amino acids, is one of the most important biomolecules composing biological structures and regulating biochemical reactions. Since the characteristics and functions of proteins are in principle determined by their amino acid sequences, protein sequence determination is the starting point of protein function studies. This paper proposes a protein sequence prediction method based on data mining techniques that can overcome the limitations of previous biochemical sequencing methods. After applying multiple proteases to obtain overlapping protein fragments, we identify candidate fragment sequences by comparing fragment mass values with peptide databases. We propose a method that constructs a multi-partite graph and searches for maximal paths to determine the protein sequence by assembling the proper candidate sequences. In addition, experimental results based on the SWISS-PROT database are presented to show the validity of the proposed method.
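The assembly idea can be illustrated with a simplified stand-in: chain fragments whose suffix overlaps the prefix of another fragment and keep the longest chain found by depth-first search. This is not the paper's multi-partite graph construction; the fragment sets, the minimum overlap of two residues, and the function names are assumptions, and the mass-matching step is omitted.

```python
def overlap(a, b, min_len=2):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(fragments, min_len=2):
    """Return the longest sequence obtained by chaining overlapping fragments."""
    best = ""

    def extend(seq, last, used):
        nonlocal best
        if len(seq) > len(best):
            best = seq
        for i, frag in enumerate(fragments):
            if i in used:
                continue
            k = overlap(last, frag, min_len)
            if k:
                extend(seq + frag[k:], frag, used | {i})

    for i, frag in enumerate(fragments):
        extend(frag, frag, {i})
    return best

# Hypothetical candidate fragments from two different protease digests.
digest_a = ["MKTAY", "IAKQR", "LLG"]
digest_b = ["MKT", "AYIAK", "QRLLG"]
protein = assemble(digest_a + digest_b)
print(protein)
```

Fragments from a single digest abut without overlapping, so it is the second digest that supplies the overlaps needed to order them. The exhaustive search is exponential in the worst case and is only suitable for small fragment sets like this one.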

Leucine Rich Repeat Sequence of the δ-Endotoxin Family of Bacillus thuringiensis

  • Vudayagiri, Suvarchala;Jamil, Kaiser
    • BMB Reports / v.33 no.1 / pp.89-91 / 2000
  • In this investigation, we report our search for Leucine Rich Repeats (LRRs) in various Bacillus thuringiensis (Bt) subspecies. Leucine rich repeats are short sequence motifs present in some proteins. The consensus sequence corresponding to the LRR was present in the Crystal proteins of Bacillus thuringiensis subspecies. This LRR sequence has been predicted to be involved in protein-protein interactions or receptor binding functions, hence the importance of this study.
