• Title/Abstract/Keywords: gene annotation

184 search results (processing time: 0.025 s)

OryzaGP: rice gene and protein dataset for named-entity recognition

  • Larmande, Pierre;Do, Huy;Wang, Yue
    • Genomics & Informatics
    • /
    • Vol. 17, No. 2
    • /
    • pp.17.1-17.3
    • /
    • 2019
  • Text mining has become an important research method in biology. Its original purpose was to extract biological entities, such as genes, proteins, and phenotypic traits, from scientific papers in order to extend knowledge. However, few thorough studies on text mining and application development have been performed for plant molecular biology data, especially for rice, resulting in a lack of datasets for named-entity recognition tasks in this species. Because few benchmarks are available for rice, we faced various difficulties in exploiting advanced machine learning methods for accurate analysis of the rice literature. To evaluate several approaches to automatically extracting information on gene/protein entities, we built a new dataset for rice as a benchmark. The dataset is composed of titles and abstracts, downloaded from PubMed, of scientific papers focusing on rice. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to offer a shared task of rice gene/protein name recognition through the BioNLP Open Shared Tasks framework using the dataset, to facilitate an open comparison and evaluation of different approaches to the task.

Effective Exon-Intron Structure Verification of a 1-Pyrroline-5-Carboxylate-Synthetase Gene from Halophytic Leymus chinensis (Trin.) Based on PCR, DNA Sequencing, and Alignment

  • Sun, Yan-Lin;Hong, Soon-Kwan
    • Korean Journal of Plant Resources (한국자원식물학회지)
    • /
    • Vol. 23, No. 6
    • /
    • pp.526-534
    • /
    • 2010
  • Genomes of clusters of related eukaryotes are now being sequenced at an increasing rate. In this paper, we developed an accurate, low-cost method for annotating gene predictions and exon-intron structure. The gene prediction was adapted for the delta 1-pyrroline-5-carboxylate-synthetase (p5cs) gene from a Chinese wild type of the halophytic Leymus chinensis (Trin.), which is naturally adapted to highly alkaline soils. Because of the complex adaptive mechanisms in halophytes, increasing attention is being paid to the regulatory elements of stress adaptation in these plants. P5CS encodes delta 1-pyrroline-5-carboxylate-synthetase, a key regulatory enzyme in the biosynthesis of proline that correlates directly with proline accumulation in vivo and positively with stress tolerance. Using reverse transcription-polymerase chain reaction (RT-PCR), PCR, and direct sequencing, 1076 base pairs (bp) of cDNA and 2396 bp of genomic DNA were obtained. Through gene prediction and exon-intron structure verification, the cDNA sequence was divided into eight parts, separated by seven introns. The average lengths of the determined coding and non-coding regions were 154.17 bp and 188.57 bp, respectively. Nearly all splice sites displayed GT as the donor site at the 5' end of the intron, and 71.43% displayed AG as the acceptor site at the 3' end of the intron. We conclude that this method is a cost-effective way of obtaining an experimentally verified genome annotation.
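
The splice-site figures quoted in this abstract (GT donors, AG acceptors, mean intron length) are the kind of summary that can be computed once intron coordinates are known. The following is a minimal Python sketch and not the authors' method; the function name, the toy sequence, and the coordinates are hypothetical and serve only as illustration.

```python
def splice_site_summary(genomic, introns):
    """Summarize donor/acceptor dinucleotides for a list of introns.

    genomic: genomic DNA string.
    introns: list of (start, end) pairs, 0-based and half-open (assumed convention).
    """
    if not introns:
        raise ValueError("at least one intron is required")
    seq = genomic.upper()
    # Donor = first two bases of the intron; acceptor = last two bases.
    donors = [seq[start:start + 2] for start, end in introns]
    acceptors = [seq[end - 2:end] for start, end in introns]
    n = len(introns)
    return {
        "donors": donors,
        "acceptors": acceptors,
        "gt_fraction": sum(d == "GT" for d in donors) / n,
        "ag_fraction": sum(a == "AG" for a in acceptors) / n,
        "mean_intron_length": sum(end - start for start, end in introns) / n,
    }


# Hypothetical toy gene: exon, intron (GT...AG), exon, intron (GT...AC), exon.
toy_genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GGATCC" + "GTATGTCCCTAC" + "TGA"
toy_introns = [(6, 18), (24, 36)]
summary = splice_site_summary(toy_genomic, toy_introns)
print(summary)
```

In this toy case both introns start with GT, but only one ends with AG, so the acceptor fraction is 0.5. Real intron coordinates would come from aligning the cDNA against the genomic sequence.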

Introduction to Gene Prediction Using HMM Algorithm

  • Kim, Keon-Kyun;Park, Eun-Sik
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 18, No. 2
    • /
    • pp.489-506
    • /
    • 2007
  • Gene structure prediction, which predicts the protein-coding regions in a given nucleotide sequence, is the most important step in annotating genes and greatly affects gene analysis and genome annotation. Because eukaryotic genes have more complicated structures than prokaryotic genes, programs for eukaryotic gene structure prediction use more diverse and more complicated computational models. Gene prediction methods for eukaryotic genes include ab initio, similarity-based, and ensemble methods, and each method uses various algorithms. This paper introduces how to predict genes using the hidden Markov model (HMM) algorithm and presents the process of gene prediction with well-known gene prediction programs.
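
To make the HMM idea concrete, here is a minimal Python sketch of Viterbi decoding over a two-state model (coding "C" versus non-coding "N"). It is an illustration only and is not taken from the paper: the state names, the probabilities, and the function name are all assumptions, and real gene finders use far richer state sets (exons, introns, splice signals).

```python
import math


def viterbi(seq, states, start, trans, emit):
    """Return the most probable state path for seq, as a string of state names."""
    if not seq:
        return ""
    # scores[s]: best log-probability of any path that ends in state s.
    scores = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    backpointers = []
    for symbol in seq[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            best_prev = max(states, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores[s] = (scores[best_prev]
                             + math.log(trans[best_prev][s])
                             + math.log(emit[s][symbol]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Hypothetical toy parameters: coding regions are GC-rich, non-coding AT-rich.
STATES = ("C", "N")
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

print(viterbi("ATATATGCGCGCGCGCGCATATAT", STATES, START, TRANS, EMIT))
```

Working in log space avoids numerical underflow on long sequences, which is why the sketch sums logarithms instead of multiplying probabilities.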

BINGO: Biological Interpretation Through Statistically and Graph-theoretically Navigating Gene Ontology™

  • Lee, Sung-Geun;Yang, Jae-Seong;Chung, Il-Kyung;Kim, Yang-Seok
    • Molecular & Cellular Toxicology
    • /
    • Vol. 1, No. 4
    • /
    • pp.281-283
    • /
    • 2005
  • Extraction of biologically meaningful data and their validation are very important in toxicogenomics studies, which deal with huge amounts of heterogeneous data. BINGO is an annotation mining tool for biological interpretation of gene groups. Several statistical modeling approaches using Gene Ontology (GO) have been employed in many programs for that purpose. These statistical methodologies are useful for identifying the most significant GO attributes in a gene group, but the coherence of the resulting GO attributes over the entire group is rarely assessed. BINGO complements the statistical methods with graph-theoretic measures using the GO directed acyclic graph (DAG) structure. In addition, BINGO visualizes the consistency of a gene group more intuitively with a group-based GO subgraph. The input group can be any list of genes or gene products of interest, regardless of how it was generated, provided the group is built under a functional congruency hypothesis, such as gene clusters from DNA microarray analysis.

GSnet: An Integrated Tool for Gene Set Analysis and Visualization

  • Choi, Yoon-Jeong;Woo, Hyun-Goo;Yu, Ung-Sik
    • Genomics & Informatics
    • /
    • Vol. 5, No. 3
    • /
    • pp.133-136
    • /
    • 2007
  • The Gene Set network viewer (GSnet) visualizes the functional enrichment of a given gene set with a protein interaction network and is implemented as a plug-in for the Cytoscape platform. The functional enrichment of a given gene set is calculated using a hypergeometric test based on the Gene Ontology annotation. The protein interaction network is estimated using public data. Set operations allow a complex protein interaction network to be decomposed into a functionally-enriched module of interest. GSnet provides a new framework for gene set analysis by integrating a priori knowledge of a biological network with functional enrichment analysis.
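
The hypergeometric enrichment test mentioned in this abstract can be sketched in a few lines of Python. This is a generic illustration, not GSnet's code; the function name and the example numbers are hypothetical. It computes the probability of seeing at least the observed number of annotated genes in the gene set when genes are drawn at random without replacement.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, sample, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    population: number of genes in the background.
    annotated:  background genes carrying the GO term.
    sample:     size of the gene set under study.
    overlap:    genes in the set that carry the GO term.
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 background genes, 4 annotated, set of 3 with 2 hits.
p_value = hypergeom_enrichment_pvalue(10, 4, 3, 2)
print(p_value)
```

In practice the test is repeated for every GO term, so the resulting p-values usually need a multiple-testing correction before they are interpreted.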

Sequencing analysis of the OFC1 gene on the nonsyndromic cleft lip and palate patient in Korean

  • 김성식;손우성
    • Korean Journal of Orthodontics (대한치과교정학회지)
    • /
    • Vol. 33, No. 3
    • /
    • pp.185-197
    • /
    • 2003
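  • This study investigated the characteristics in Koreans of the OFC1 gene (chromosomal location 6p24.3), which is presumed to be a major gene causing nonsyndromic cleft lip and palate. The subjects were 40 patients (20 male, 20 female; mean age 14.2 years) in whom nonsyndromic cleft lip and palate appeared for the first time in three generations, and 40 healthy adults (20 male, 20 female; mean age 25.6 years) with no congenital malformation of any kind, including nonsyndromic cleft lip and palate, across three generations. The OFC1 gene was isolated and amplified by polymerase chain reaction, the alleles were identified by sequencing, and protein homology searches were performed using BLAST and the Pedant-Pro database. The results were as follows. 1. The OFC1 gene was found to be a microsatellite marker with a 'CA' tandem repeat sequence. 2. No particular difference in the OFC1 gene was found between the patient and control groups. 3. Unlike TA(CA)11TA(CA)10 in 'ABI linkage map 2', the 'CA' tandem repeat in Koreans took the form TA(CA)n, with the number of repeats varying from 17 to 26. 4. Nine alleles were found according to the number of 'CA' repeats, and their frequencies were similar in the patient and control groups. 5. The base T between the 'CA' repeats in 'ABI linkage map 2' was replaced by C in Koreans, but ORF prediction showed no difference in the predicted amino acid sequence. 6. A BLAST search for proteins predicted from the Korean OFC1 sequence identified telomerase reverse transcriptase (TERT, locus 5p15.33, NCBI Genome Annotation; NT023089) and nucleotide binding protein 2 (NBP2, locus 17q22, NCBI Genome Annotation; NT010783) as proteins of similar structure. 7. A protein structure homology search in the Pedant-Pro database indicated that the OFC1 gene has a structure with at least one transmembrane region and a non-globular region.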
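
Alleles of a microsatellite marker such as the TA(CA)n repeat described here are distinguished by repeat count, which is straightforward to extract from a sequenced read. The snippet below is a minimal Python sketch under stated assumptions: the function name and the toy reads are hypothetical, and it is not the procedure used in the study.

```python
import re


def longest_ca_repeat(seq):
    """Return the largest n for which TA(CA)n occurs in seq (0 if absent)."""
    # Each captured group is a run of consecutive 'CA' units following 'TA'.
    runs = re.findall(r"TA((?:CA)+)", seq.upper())
    return max((len(run) // 2 for run in runs), default=0)


# Hypothetical reads: a single uninterrupted run, and an interrupted
# TA(CA)11TA(CA)10 pattern in which the longest run has 11 units.
read_single = "GG" + "TA" + "CA" * 17 + "GT"
read_interrupted = "TA" + "CA" * 11 + "TA" + "CA" * 10

print(longest_ca_repeat(read_single))
print(longest_ca_repeat(read_interrupted))
```

Tallying these counts across samples gives the allele frequencies that can then be compared between patient and control groups.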

Functional annotation of uncharacterized proteins from Fusobacterium nucleatum: identification of virulence factors

  • Kanchan Rauthan;Saranya Joshi;Lokesh Kumar;Divya Goel;Sudhir Kumar
    • Genomics & Informatics
    • /
    • Vol. 21, No. 2
    • /
    • pp.21.1-21.14
    • /
    • 2023
  • Fusobacterium nucleatum is a gram-negative bacterium associated with diverse diseases such as appendicitis and colorectal cancer. It mainly attacks the epithelial cells in the oral cavity and throat of the infected individual. It has a single circular genome of 2.7 Mb. Many proteins in the F. nucleatum genome are listed as "Uncharacterized." Annotation of these proteins is crucial for obtaining new facts about the pathogen, deciphering gene regulation, functions, and pathways, and discovering novel target proteins. In light of new genomic information, an armoury of bioinformatic tools was used to predict physicochemical parameters, search for domains, motifs, and patterns, and predict the localization of the uncharacterized proteins. Receiver operating characteristic (ROC) analysis estimated the efficacy of the databases employed for predicting the different parameters at 83.6%. Functions were successfully assigned to 46 uncharacterized proteins, which included enzymes, transporter proteins, membrane proteins, and binding proteins, among others. Apart from function prediction, the proteins were also subjected to STRING analysis to reveal their interacting partners. The annotated proteins were also put through homology-based structure prediction and modeling using the Swiss PDB and Phyre2 servers. Two probable virulence factors were also identified, which could be investigated further in potential drug-related studies. Assigning functions to uncharacterized proteins showed that some of these proteins are important for cell survival inside the host and can act as effective drug targets.

Complete Chloroplast Genome Assembly and Annotation of Milk Thistle (Silybum marianum) and Phylogenetic Analysis

  • Hwajin Jung;Yedomon Ange Bovys Zoclanclounon;Jeongwoo Lee;Taeho Lee;Jeonggu Kim;Guhwang Park;Keunpyo Lee;Kwanghoon An;Jeehyoung Shim;Joonghyoun Chin;Suyoung Hong
    • Proceedings of the Korean Society of Crop Science Conference (한국작물학회:학술대회논문집)
    • /
    • Korean Society of Crop Science 2022 Autumn Conference
    • /
    • pp.210-210
    • /
    • 2022
  • Silybum marianum is an annual or biennial plant of the Asteraceae family. It can grow in low-nutrient soil and under drought conditions, making it easy to cultivate. The seed produces a specialized plant metabolite called silymarin (a flavonolignan complex), which is known to alleviate liver damage caused by hepatitis and toxins. To infer the phylogenetic placement of a Korean milk thistle, we conducted a chloroplast genome assembly and annotation, followed by a comparison with the existing Chinese reference genome (NC_028027). The chloroplast genome structure was highly similar, with assembly sizes of 152,642 bp and 153,202 bp for the Korean and Chinese milk thistle, respectively. Moreover, there were similarities at the gene level: coding sequences (n = 82), transfer RNAs (n = 31), and ribosomal RNAs (n = 4). Using the full set of coding sequences, phylogenetic tree inference placed the Korean cultivar in the milk thistle clade, corroborating the expected tree. A tree based only on the ycf1 gene confirmed the same topology, suggesting that ycf1 is a potential marker for DNA barcoding and population diversity studies in the milk thistle genus. Overall, the data provided represent a valuable resource for population genomics and species-level determination, since several species have been reported in the Silybum genus.

Transcriptomic Profile Analysis of Jeju Buckwheat using RNA-Seq Data

  • 한송이;정성진;오대주;정용환;김찬식;김재훈
    • Journal of the Korea Academia-Industrial cooperation Society (한국산학기술학회논문지)
    • /
    • Vol. 19, No. 1
    • /
    • pp.537-545
    • /
    • 2018
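  • In this study, to collect diverse information on the transcripts expressed during early germination of buckwheat, RNA was extracted from Yangjeol buckwheat and Daegwan 3-3 and transcriptome analysis was performed. Total RNA was extracted from seeds of Jeju-grown Yangjeol buckwheat and Daegwan 3-3 and at 12, 24, and 36 hours after germination, and was sequenced on the Illumina HiSeq 2000 platform. The raw data were processed with the DynamicTrim and LengthSort programs of the SolexaQA package, followed by assembly and annotation. From the RNA-seq raw data, 16.5 Gb and 16.2 Gb of transcriptome data were obtained, corresponding to about 84.2% and 81.5%, respectively. A total of 43,494 representative transcripts, corresponding to 47 Mb, were obtained, of which 23,165 showed sequence similarity to entries in the annotation databases. Gene ontology analysis of the buckwheat representative transcripts showed that genes were most abundant in metabolic process (49.49%) for biological process, in cell (46.12%) for cellular component, and in catalytic activity (80.43%) for molecular function. For the germination-related gibberellin receptor GID1C, expression increased over time in both Yangjeol buckwheat and Daegwan 3-3. For gibberellin 20-oxidase 1, expression increased within 12 hours after germination in Yangjeol buckwheat, whereas in Daegwan 3-3 it continued to increase up to 36 hours. These stage-by-stage transcriptome data for early germination of Jeju buckwheat are expected to help elucidate the mechanisms that cause functional and morphological differences between species.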

A Eukaryotic Gene Structure Prediction Program Using Duration HMM

  • 태홍석;박기정
    • Korean Journal of Microbiology (미생물학회지)
    • /
    • Vol. 39, No. 4
    • /
    • pp.207-215
    • /
    • 2003
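  • Gene structure prediction, which predicts the protein-coding regions in a given nucleotide sequence, is the core of gene annotation and greatly affects gene analysis and entire genome projects. Because eukaryotic genes have more complex structures than prokaryotic genes, models for eukaryotic gene structure prediction are also more diverse and complex than those for prokaryotes. Our team developed EGSP, a eukaryotic gene structure prediction program based on a duration hidden Markov model. The program consists of a training function that generates the parameters needed for gene structure prediction in each organism, and a prediction function that takes a nucleotide sequence as input and outputs the predicted protein-coding regions. Following the trend in recent programs, it also supports prediction of multiple genes. To analyze how each parameter used in EGSP training and prediction affects overall performance, we examined the effects of the individual models for several signals. Using a human dataset, the most widely studied for eukaryotic gene structure prediction, we compared EGSP by several criteria with commonly used gene structure prediction programs such as GenScan, GeneID, and Morgan, and confirmed that it performs at a practical level. Tests with the eukaryotic microorganism Saccharomyces cerevisiae also showed satisfactory performance.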
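
One reason to use a duration HMM is that a plain HMM state with a self-transition implicitly assumes geometrically distributed segment lengths, whereas a duration HMM models the length of each segment explicitly. The Python sketch below contrasts the two. It is an illustration under stated assumptions and not EGSP's implementation; the function names and the toy exon lengths are hypothetical.

```python
from collections import Counter


def geometric_duration_pmf(p_stay, d):
    """P(duration == d) for a plain HMM state with self-transition p_stay."""
    if d < 1:
        return 0.0
    # Stay d - 1 times, then leave once.
    return (p_stay ** (d - 1)) * (1 - p_stay)


def empirical_duration_pmf(lengths):
    """Explicit duration distribution estimated from observed segment lengths."""
    counts = Counter(lengths)
    total = len(lengths)
    return {d: c / total for d, c in sorted(counts.items())}


# Hypothetical training exon lengths (bp); real training data would be far larger.
toy_exon_lengths = [90, 120, 120, 150, 150, 150, 210]
explicit = empirical_duration_pmf(toy_exon_lengths)

for d in (90, 150, 210):
    print(d, explicit[d], geometric_duration_pmf(0.99, d))
```

The geometric distribution always decreases with length, so it cannot place its peak at a typical exon length, which an explicit duration distribution can.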