• Title/Abstract/Keywords: protein sequences


A Management Technique for Protein Version Information Based on Local Sequence Alignment and Triggers

  • 정광수;박성희;류근호
    • The KIPS Transactions: Part D
    • /
    • Vol. 12D, No. 1
    • /
    • pp.51-62
    • /
    • 2005
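  • Once the function of one amino acid sequence is known, the function of sequences with a similar structure can be inferred. It is also possible to alter the amino acid sequence of a protein of known function, or to create useful proteins. In this process, several protein sequences with different compositions can arise from a single original protein sequence. A systematic technique is therefore needed to store and manage the protein version sequences derived from the original protein, together with the protein's annotation information. This paper presents a version management technique for protein amino acid sequences based on local sequence alignment, and a history management technique for protein annotation data based on triggers. The proposed techniques make it possible to measure the similarity between original and version sequences, automate version management, and reduce storage space. In addition, by storing the history of protein information and analyzing sequence changes, they can support the development of useful proteins and new drugs through mutation research.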
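
The similarity measurement between an original sequence and a version sequence can be illustrated with a minimal Smith-Waterman local alignment sketch. This is not the authors' implementation: the flat match/mismatch/gap scores and the `similarity` normalization are assumptions chosen for illustration, and a real system would more likely use a substitution matrix such as BLOSUM62.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


def similarity(original, version, match=2):
    """Hypothetical normalization: score divided by the best possible
    score of the shorter sequence, giving a value in [0, 1]."""
    shorter = min(len(original), len(version))
    if shorter == 0:
        return 0.0
    return smith_waterman(original, version, match=match) / (match * shorter)
```

A version sequence that differs from the original by a single substitution scores slightly below 1.0, so a threshold on `similarity` could decide whether a submitted sequence is stored as a new version of an existing protein.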

Protein Sequence Search based on N-gram Indexing

  • Hwang, Mi-Nyeong;Kim, Jin-Suk
    • Bioinformatics and Biosystems
    • /
    • Vol. 1, No. 1
    • /
    • pp.46-50
    • /
    • 2006
  • With advances in experimental techniques in molecular biology, genomic and protein sequence databases are growing exponentially in size, and mean sequence lengths are also increasing. As these databases grow, it becomes difficult to search them for sequences with significant homology to a query sequence. In this paper, we present an N-gram indexing method that retrieves similar sequences quickly and with accuracy comparable to existing tools. The method regards a protein sequence as a text written in a language of 20 amino acid codes and uses fixed-length N-gram tokens as the indexing scheme for sequence strings. After these tokens are indexed for all the sequences in the database, sequences can be searched with information retrieval algorithms. Using this method, we developed a protein sequence search system named ProSeS (PROtein Sequence Search). ProSeS is a protein sequence analysis system that provides overall analysis results such as similar sequences with significant homology, predicted subcellular locations of the query sequence, and major keywords extracted from the annotations of similar sequences. We show experimentally that the N-gram indexing approach significantly reduces retrieval time and that it is as accurate as the popular search tool BLAST.

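
The N-gram indexing idea in the abstract above can be sketched as a small inverted index. This is a minimal illustration, not ProSeS itself: the function names, the choice of `n=3`, and the ranking by number of shared tokens are all assumptions, and the paper's actual information retrieval scoring is not described in the abstract.

```python
from collections import Counter, defaultdict


def ngrams(seq, n=3):
    """Split a sequence into overlapping fixed-length tokens."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_index(db, n=3):
    """Map each N-gram token to the set of sequence IDs containing it.

    `db` is a dict of {sequence_id: sequence}; the layout is hypothetical.
    """
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for token in ngrams(seq, n):
            index[token].add(seq_id)
    return index


def search(query, index, n=3):
    """Rank database sequences by how many distinct query tokens they share."""
    hits = Counter()
    for token in set(ngrams(query, n)):
        for seq_id in index.get(token, ()):
            hits[seq_id] += 1
    return hits.most_common()
```

Because lookup touches only the postings of the query's tokens, the cost depends on the query length and the number of matching sequences rather than on a full scan of the database.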

Prediction of Protein Secondary Structure Using the Weighted Combination of Homology Information of Protein Sequences

  • 지상문
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • Vol. 20, No. 9
    • /
    • pp.1816-1821
    • /
    • 2016
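  • Proteins play a critical role in most biological processes, so much research is devoted to uncovering protein evolution, structure, and function; protein secondary structure is important basic information for such research. This study aims to predict the secondary structure of an unknown protein sequence by effectively extracting secondary structure information from large-scale protein structure data. To find sequences in the protein structure data that are homologous to the query sequence as broadly as possible, we used PSI-BLAST, which builds the search profile from sequences similar to the query sequence and supports iterative, gapped searches. The secondary structures of the homologous proteins contributed to the prediction with weights reflecting the strength of their homology to the query sequence. In prediction experiments classifying secondary structure into three and eight states, using homologous sequences together with a neural network yielded accuracies of 93.28% and 88.79%, respectively, an improvement over existing methods.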
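
The weighted combination step can be sketched as a per-position weighted vote over the secondary structures of homologs. This is only an assumed reading of the abstract: the input layout, the use of `-` for unaligned positions, the fallback to coil, and the weights themselves (which might be derived from PSI-BLAST scores) are hypothetical, and the neural network component of the paper is not modeled here.

```python
from collections import defaultdict


def predict_secondary_structure(query_len, homologs, default="C"):
    """Predict a three-state string (H, E, C) by weighted voting.

    `homologs` is a list of (weight, states) pairs, where `states` is a
    string of length `query_len` already aligned to the query and "-"
    marks positions the homolog does not cover. This layout is an
    assumption made for illustration.
    """
    prediction = []
    for pos in range(query_len):
        votes = defaultdict(float)
        for weight, states in homologs:
            state = states[pos]
            if state != "-":
                # Stronger homologs contribute more to the vote.
                votes[state] += weight
        # Fall back to coil when no homolog covers this position.
        prediction.append(max(votes, key=votes.get) if votes else default)
    return "".join(prediction)
```

For example, with homologs weighted 0.9, 0.5, and 0.3, a position where only the strongest homolog reports a helix is still predicted as helix, because 0.9 outweighs the combined 0.8 of the other two.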

Retrieving Protein Domain Encoding DNA Sequences Automatically Through Database Cross-referencing

  • Choi, Yoon-Sup;Yang, Jae-Seong;Ryu, Sung-Ho;Kim, Sang-Uk
    • Bioinformatics and Biosystems
    • /
    • Vol. 1, No. 2
    • /
    • pp.95-98
    • /
    • 2006
  • Recent proteomic studies of protein domains require high-throughput, systematic approaches. Because most experiments on protein domains, the modules of protein-protein interactions, require gene cloning, the first experimental step is to retrieve the DNA sequences of domain-encoding regions from databases. For large-scale proteomic research, however, manually extracting a large number of domain sequences from several interlinked databases is laborious. We present a new methodology that retrieves the DNA sequences of domain-encoding regions through automatic database cross-referencing. To extract protein domain-encoding regions, it traverses several interconnected databases with a validation process. We applied this method to retrieve all the EGF domain-encoding DNA sequences of Homo sapiens. The algorithm was implemented using the Python library PAMIE, which enables automatic cross-referencing across distinct databases.


Identification of Viral Taxon-Specific Genes (VTSG): Application to Caliciviridae

  • Kang, Shinduck;Kim, Young-Chang
    • Genomics & Informatics
    • /
    • Vol. 16, No. 4
    • /
    • pp.23.1-23.5
    • /
    • 2018
  • Virus taxonomy was initially determined by clinical experiments based on phenotype. With the development of sequence analysis methods, genotype-based classification was also applied. As genome sequence analysis technology advances, there is increasing demand for virus taxonomy to be extended from in vivo and in vitro to in silico. In this study, we verified the consistency of the current International Committee on Taxonomy of Viruses taxonomy using an in silico approach, aiming to identify the specific sequence for each virus. We applied this approach to norovirus in Caliciviridae, which causes 90% of gastroenteritis cases worldwide. Based on the dogma "protein structure determines its function," we hypothesized that the specific sequence can be identified by the specific structure. First, we extracted the coding regions (CDS). Second, the CDS protein sequences of each genus were annotated by a conserved domain database (CDD) search. Finally, the conserved domains of each genus in Caliciviridae were classified by RPS-BLAST with CDD. The analysis showed that members of Caliciviridae share sequences that include an RNA helicase. In Norovirus, the Calicivirus coat protein C-terminal and viral polyprotein N-terminal domains appear as specific domains that are not found in the other genera of Caliciviridae. If this method is used to detect specific conserved domains, they can serve as classification keywords based on protein functional structure. Once the specific protein domains are determined, their sequences can be converted to gene sequences, which could be reused as viral biomarkers.

Nucleotide and protein researches on anaerobic fungi during four decades

  • Chang, Jongsoo;Park, Hyunjin
    • Journal of Animal Science and Technology
    • /
    • Vol. 62, No. 2
    • /
    • pp.121-140
    • /
    • 2020
  • Anaerobic fungi inhabit the gastrointestinal tract of foregut and hindgut fermenters and degrade fibrous plant biomass through hydrolysis by a wide variety of cellulolytic enzymes and through physical penetration of the fiber matrix by their rhizoids. To date, seventeen genera have been described in family Neocallimasticaceae, class Neocallimastigomycetes, phylum Neocallimastigomycota and one genus has been described in phylum Neocallimastigomycota. In the National Center for Biotechnology Information (NCBI) database (DB), 23,830 nucleotide sequences and 59,512 protein sequences have been deposited, most of which originated from Piromyces, Neocallimastix and Anaeromyces. Most of the protein sequences (44,025) were acquired with the PacBio next-generation sequencing system. The whole genome sequences of Anaeromyces robustus, Neocallimastix californiae, Pecoramyces ruminantium, Piromyces finnis and Piromyces sp. E2 are available in the Joint Genome Institute (JGI) database. According to the results of protein prediction, average isoelectric points (pIs) ranged from 5.88 (Anaeromyces) to 6.57 (Piromyces), and average molecular weights ranged from 38.7 kDa (Orpinomyces) to 56.6 kDa (Piromyces). In the Carbohydrate-Active enZYmes (CAZY) database, glycoside hydrolases (36), carbohydrate-binding modules (11), carbohydrate esterases (8), glycosyltransferases (5) and polysaccharide lyases (3) from anaerobic fungi were registered. Over four decades, 1,031 research articles about anaerobic fungi were published, and 444 and 719 articles were available in the PubMed (PM) and PubMed Central (PMC) DBs, respectively.

Identification of Salmonella pullorum Genomic Sequences Using Suppression Subtractive Hybridization

  • Li, Qiuchun;Xu, Yaohui;Jiao, Xinan
    • Journal of Microbiology and Biotechnology
    • /
    • Vol. 19, No. 9
    • /
    • pp.898-903
    • /
    • 2009
  • Pullorum disease affecting poultry is caused by Salmonella enterica serovar Pullorum and results in severe economic loss every year, especially in countries with a developing poultry industry. The pathogenesis of S. Pullorum is not yet well defined, as the specific virulence factors still need to be identified. Thus, to isolate specific DNA fragments belonging to S. Pullorum, this study used suppression subtractive hybridization. As such, the genome of the S. Pullorum C79-13 strain was subtracted from the genome of Salmonella enterica serovar Gallinarum 9 and Salmonella enterica serovar Enteritidis CMCC(B) 50041, respectively, resulting in the identification of 20 subtracted fragments. A sequence homology analysis then revealed three types of fragment: phage sequences, plasmid sequences, and sequences with an unknown function. As a result, several important virulence-related genes encoding the IpaJ protein, colicin Y, tailspike protein, excisionase, and Rhs protein were identified that may play a role in the pathogenesis of S. Pullorum.

Bioinformatics Analysis of Hsp20 Sequences in Proteobacteria

  • Heine, Michelle;Chandra, Sathees B.C.
    • Genomics & Informatics
    • /
    • Vol. 7, No. 1
    • /
    • pp.26-31
    • /
    • 2009
  • Heat shock proteins are a class of molecular chaperones found in nearly all organisms of the Bacteria, Archaea and Eukarya domains. Heat shock proteins experience increased transcription during periods of heat-induced osmotic stress and are involved in protein disaggregation and refolding as part of a cell's danger signaling cascade. The heat shock protein Hsp20 is a small molecular chaperone of approximately 20 kDa that is hypothesized to prevent aggregation and denaturation. Hsp20 can be found in several strains of Proteobacteria, which constitutes the largest phylum of the Bacteria domain and contains several medically significant bacterial strains. Genomic analyses were performed to determine a common evolutionary pattern among Hsp20 sequences in Proteobacteria. Hsp20 was found to share a common ancestor within and among the five subclasses of Proteobacteria. This is readily apparent from the degree of sequence similarity within and between Hsp20 protein sequences, as well as from phylogenetic analysis of sequences from proteobacterial and non-proteobacterial species.

2-D graphical representation of protein sequences and its application to coronavirus phylogeny

  • Li, Chun;Xing, Lili;Wang, Xin
    • BMB Reports
    • /
    • Vol. 41, No. 3
    • /
    • pp.217-222
    • /
    • 2008
  • Based on a five-letter model of the 20 amino acids, we propose a new 2-D graphical representation of protein sequences. We then transform the 2-D graphical representation into a numerical characterization that facilitates quantitative comparison of protein sequences. As an application, we construct the phylogenetic tree of 56 coronavirus spike proteins. The resulting tree agrees well with the established taxonomic groups.
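
The general pipeline (reduce 20 amino acids to five letters, draw a 2-D curve, derive a numerical descriptor, compare descriptors) can be sketched as follows. Every specific here is a hypothetical stand-in, since the abstract does not give the authors' construction: the five classes in `CLASSES`, the unit-step walk, and the centroid descriptor are assumptions chosen only to make the idea concrete.

```python
import math

# Hypothetical five-letter reduction of the 20 amino acids
# (aliphatic/sulfur, aromatic, polar, positive, negative).
CLASSES = ["AVLIMC", "FWY", "STNQGP", "KRH", "DE"]

# Assign each class a unit step at angle 2*pi*k/5 (an assumed layout).
STEP = {}
for k, members in enumerate(CLASSES):
    angle = 2 * math.pi * k / len(CLASSES)
    for aa in members:
        STEP[aa] = (math.cos(angle), math.sin(angle))


def curve(seq):
    """Return the 2-D walk traced by the sequence, starting at the origin."""
    x = y = 0.0
    points = [(x, y)]
    for aa in seq:
        dx, dy = STEP[aa]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def descriptor(seq):
    """Numerical characterization: the centroid of the curve's points."""
    points = curve(seq)
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def distance(seq_a, seq_b):
    """Euclidean distance between the descriptors of two sequences."""
    (ax, ay), (bx, by) = descriptor(seq_a), descriptor(seq_b)
    return math.hypot(ax - bx, ay - by)
```

A pairwise matrix of such distances is the kind of input a standard tree-building method would need, which is how a numerical characterization can lead to a phylogenetic tree.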

Proteomics Data Analysis using Representative Database

  • Kwon, Kyung-Hoon;Park, Gun-Wook;Kim, Jin-Young;Park, Young-Mok;Yoo, Jong-Shin
    • Bioinformatics and Biosystems
    • /
    • Vol. 2, No. 2
    • /
    • pp.46-51
    • /
    • 2007
  • In proteomics research using mass spectrometry, a protein database search yields protein information from the peptide sequences that best match the tandem mass spectra. The protein sequence database has been a powerful knowledge base for this protein identification. However, as protein sequence information accumulates, the database becomes very large, and it becomes hard to consider every protein sequence in a database search because doing so consumes a great deal of computing time. For high-throughput analysis of the proteome, non-redundant refined databases, such as the IPI human database of the European Bioinformatics Institute, have usually been used. While a non-redundant database returns search results quickly, it misses the variation among protein sequences. In this study, we examined proteomics data from the viewpoint of protein similarity and used a network analysis tool to build a new analysis method. This method should save computing time in the database search while preserving the sequence variation needed to catch modified peptides.
