Proceedings of the Korean Society for Bioinformatics Conference (한국생물정보학회:학술대회논문집)
Korean Society for Bioinformatics (ksbsb)
- Other
2000.11a
-
Microorganisms have been widely employed for the production of useful bioproducts, including primary metabolites such as ethanol, succinic acid, acetone and butanol; secondary metabolites, represented by antibiotics; and proteins, polysaccharides, lipids and many others. Since these products are obtained only in small quantities under natural conditions, mutation and selection have been used to improve strains. Recently, metabolic engineering strategies have been employed for more efficient production of these bioproducts. Metabolic engineering can be defined as the purposeful modification of cellular metabolic pathways, by introducing new pathways or by deleting or modifying existing ones, for the enhanced production of a desired product or a modified/new product, the degradation of xenobiotics, and the utilization of inexpensive raw materials. Metabolic flux analysis and metabolic control analysis, along with recombinant DNA techniques, are three important components in designing optimized metabolic pathways. This powerful technology is being further improved by genomics, proteomics, metabolomics and bioinformatics. Complete genome sequences give us the possibility of addressing complex biological questions, including metabolic control, regulation and flux. In silico analysis of microbial metabolic pathways is possible from the completed genome sequences. Transcriptome analysis using DNA chips allows us to examine the global pattern of gene expression at the mRNA level. Two-dimensional gel electrophoresis of cellular proteins can be used to examine the global proteome content, which provides information on gene expression at the protein level. Bioinformatics can help us understand the results obtained with these new techniques, and further provides a wide range of information contained in the genome sequences.
The strategies taken in our lab for the production of pharmaceutical proteins, polyhydroxyalkanoates (a family of completely biodegradable polymers), succinic acid and fine chemicals by employing metabolic engineering powered by genomics, proteomics, metabolomics and bioinformatics will be presented.
-
Structural genomics aims to provide a good experimental structure or computational model of every tractable protein in a complete genome. Underlying this goal is the immense value of protein structure, especially in permitting recognition of distant evolutionary relationships for proteins whose sequence analysis has failed to find any significant homolog. A considerable fraction of the genes in all sequenced genomes have no known function, and structure determination provides a direct means of revealing homology that may be used to infer their putative molecular function. The solved structures will be similarly useful for elucidating the biochemical or biophysical role of proteins that have previously been ascribed only phenotypic functions. More generally, knowledge of an increasingly complete repertoire of protein structures will aid structure prediction methods, improve understanding of protein structure, and ultimately lend insight into molecular interactions and pathways. We use computational methods to select families whose structures cannot be predicted and which are likely to be amenable to experimental characterization. The methods employed include modern sequence analysis and clustering algorithms. A critical component is consultation of the PRESAGE database for structural genomics, which records the community's experimental work under way and computational predictions. The protein families are ranked according to several criteria, including taxonomic diversity and known functional information. Individual proteins, often homologs from hyperthermophiles, are selected from these families as targets for structure determination. The solved structures are examined for structural similarity to other proteins of known structure. Homologous proteins in sequence databases are computationally modeled, to provide a resource of protein structure models complementing the experimentally solved protein structures.
-
Expressed sequence tags (ESTs) are partial cDNA segments produced by single-pass 5' or 3' sequencing of cDNA clones; they are error-prone and generated in highly redundant sets. The advancement and expansion of genomics have led biologists to generate huge numbers of ESTs from a variety of organisms (human, microorganisms and plants), and the cumulative number of ESTs now exceeds 5.3 million. As EST data accumulate ever more rapidly, the need grows for analysis tools that extract biological meaning from them. Among the various needs of EST analysis, extracting protein sequences or functional motifs from ESTs is important for identifying their function in vivo. For that purpose, precise and accurate identification of the regions containing coding sequences (CDSs) is the first problem to solve, and solving it helps in extracting and detecting genuine CDSs and protein motifs from EST collections. Although several public tools are available for EST analysis, none accomplishes this objective. Furthermore, they target human or microbial ESTs rather than plant ESTs. Thus, to meet the urgent needs of collaborators who work with plant ESTs, and to establish an analysis system that can serve as general-purpose public software, we constructed a pipelined EST analysis system by integrating public software components. The software used is as follows: Phred/Cross-match for quality control and vector screening, NCBI BLAST for similarity searching, ICATools for EST clustering, Phrap for EST contig assembly, and BLOCKS/Prosite for protein motif searching. The sample data set used for the construction and verification of this system was 1,386 ESTs from human intrathymic T-cells, verified using the UniGene and nr databases of NCBI.
CDSs were extracted from the sample data set by comparing the sample data with protein sequence and motif databases, determining the matched protein sequences/motifs that satisfy our defined parameters, and extracting the regions that show similarity. In the near future, software for peptide mass spectrometry fingerprint analysis, one of the fields of proteomics, is also expected to be integrated into our system and made available. This pipelined EST analysis system will extend our knowledge of plant ESTs and proteins through the identification of unknown genes.
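The CDS extraction step described above (match an EST against protein motifs, then pull out the matching region) can be illustrated with a minimal sketch. It is not the pipeline's actual code: the function names, the toy EST and the use of a regular expression in place of a BLOCKS/Prosite search are all assumptions made for illustration. The sketch translates an EST in all six reading frames and reports the nucleotide coordinates of each motif match.

```python
import re

# Standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Yield (strand, frame offset, protein) for all six reading frames."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            yield strand, offset, translate(s[offset:])


def find_motif_regions(est, motif_regex):
    """Return (strand, frame, nt_start, nt_end, peptide) for each motif hit.

    Coordinates are 0-based, end-exclusive, and refer to the strand that
    was translated (the reverse complement for '-' hits).
    """
    hits = []
    for strand, offset, protein in six_frames(est):
        for m in re.finditer(motif_regex, protein):
            start = offset + 3 * m.start()
            end = offset + 3 * m.end()
            hits.append((strand, offset, start, end, m.group()))
    return hits


# Hypothetical toy EST encoding M-N-G-T-A; the pattern is a regex rendering
# of an N-{P}-[ST]-{P} style motif, used here only as an example.
toy_est = "ATGAATGGTACTGCT"
print(find_motif_regions(toy_est, r"N[^P][ST][^P]"))
```

A real pipeline would work from similarity-search output rather than a regular expression and would need to tolerate the frameshifts that single-pass sequencing errors introduce; the sketch ignores both.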
-
Protein interaction is an important research topic in bioinformatics. A novel computational method for studying protein interaction was developed. It shows the diverse patterns of protein-protein interaction.
-
We describe a hidden Markov model, HMMSTR, for general protein sequence based on the I-sites library of sequence-structure motifs. Unlike the linear HMMs used to model individual protein families, HMMSTR has a highly branched topology and captures recurrent local features of protein sequences and structures that transcend protein family boundaries. The model extends the I-sites library by describing the adjacencies of different sequence-structure motifs as observed in the database, and achieves a great reduction in parameters by representing overlapping motifs in a much more compact form. The HMM attributes a considerably higher probability to coding sequence than does an equivalent dipeptide model, predicts secondary structure with an accuracy of 74.6% and backbone torsion angles better than any previously reported method, and predicts the structural context of beta strands and turns with an accuracy that should be useful for tertiary structure prediction. HMMSTR has been incorporated into a public, fully-automated protein structure prediction server.
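As a rough illustration of how a hidden Markov model attributes a probability to a sequence, the forward algorithm can be sketched as follows. The two-state model, its state names, its two-letter alphabet and all of its probabilities are invented for this example; they are unrelated to the branched topology or trained parameters of the model in the abstract.

```python
import math


def forward_log_prob(seq, states, start, trans, emit):
    """Log-probability of a non-empty sequence under an HMM (forward algorithm).

    alpha[s] holds P(symbols seen so far, current state = s); it is rescaled
    at every step so that long sequences do not underflow.
    """
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    log_scale = 0.0
    for symbol in seq[1:]:
        total = sum(alpha.values())
        log_scale += math.log(total)
        alpha = {s: a / total for s, a in alpha.items()}
        alpha = {
            t: emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in states)
            for t in states
        }
    return log_scale + math.log(sum(alpha.values()))


# Hypothetical toy model over a two-letter alphabet (A, G).
STATES = ("helix_like", "coil_like")
START = {"helix_like": 0.6, "coil_like": 0.4}
TRANS = {
    "helix_like": {"helix_like": 0.7, "coil_like": 0.3},
    "coil_like": {"helix_like": 0.2, "coil_like": 0.8},
}
EMIT = {
    "helix_like": {"A": 0.9, "G": 0.1},
    "coil_like": {"A": 0.3, "G": 0.7},
}

print(forward_log_prob("AGGA", STATES, START, TRANS, EMIT))
```

Comparing such log-probabilities between two models for the same sequence is one simple way to ask which model fits that sequence better.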
-
Computational chemistry is a discipline that uses computational methods to calculate molecular structures, properties and reactions, or to simulate molecular behavior. Relating the complex data from genomics, high-throughput screening, combinatorial chemical synthesis, gene-expression investigations, pharmacogenomics, and proteomics, and turning them into useful information and knowledge, is the primary goal of bioinformatics. In particular, structure-based molecular design is one of the essential fields in bioinformatics and can be called structural bioinformatics. Therefore, the conformational analysis of proteins and peptides using the techniques of computational chemistry is expected to play a role in structural bioinformatics. There are two major computational methods for the conformational analysis of proteins and peptides: one is the molecular orbital (MO) method and the other is the force field (or empirical potential function) method. The MO method can be classified into ab initio and semiempirical methods, which have been applied to relatively small and large molecules, respectively. However, improvements in computer hardware and software enable us to use the ab initio MO method for relatively larger biomolecules with up to ∼100 atoms or ∼800 basis functions. In order to show how computational chemistry can be used in structural bioinformatics, I will present (1) the cis-trans isomerization of proline dipeptide and its derivatives, (2) the positional preference of proline in $\alpha$-helices, and (3) the conformations and activities of Arg-Gly-Asp-containing tetrapeptides.
-
An early step toward evaluating a predicted RNA secondary structure is to visualize the predicted structure in graphical form. This talk will present an algorithm for efficiently drawing RNA secondary structures. The algorithm represents the direction and space for a structural element using vector and vector space, and generates nearly overlap-free polygonal displays. The algorithm and a graphical user interface have been implemented in a working program called VizQFolder on IBM PC compatibles.
-
Knowledge discovery has attracted increased attention in the biomedical industry in recent years, owing to the increased availability of huge amounts of biomedical data and the imminent need to turn such data into useful information and knowledge. In this talk, we discuss knowledge discovery techniques for gene expression analysis and MHC-peptide binding prediction in the context of discovering protein antigens and hot spots in these antigens.
-
With the advent of DNA microarray and "chip" technologies, gene expression in an organism can be monitored on a genomic scale, allowing the transcription levels of many genes to be measured simultaneously. Functional interpretation of massive expression data and linking such data to DNA sequences have become the new challenges for bioinformatics. I will use yeast cell cycle expression data analysis as an example to demonstrate how special databases and computational methods may be used for extracting functional information. I will also briefly describe a novel clustering algorithm which has been applied to the cell cycle data.
-
Genomic approaches produce massive amounts of data within a short time. New high-throughput automatic sequencers can generate more than a million nucleotides of sequence information overnight. A typical DNA chip experiment produces tens of thousands of expression measurements, not to mention tens of megabytes of image files. These data must be handled automatically by computer and stored in electronic databases. Thus there is a need for a systematic approach to data collection, processing, and analysis. DNA sequence information is translated into amino acid sequence and analyzed for key motifs related to its biological and/or biochemical function. Functional genomics will play a significant role in identifying novel drug targets and diagnostic markers for serious diseases. As an enabling technology for functional genomics, bioinformatics is in great need worldwide. In Korea, a new functional genomics project has recently been launched; it focuses on identifying genes associated with cancers prevalent in Korea, namely gastric and hepatic cancers. This involves gene discovery by high-throughput sequencing of cancer cDNA libraries, gene expression profiling by DNA microarray and proteomics, and SNP profiling in the Korean patient population. Our bioinformatics team will support all these activities by collecting, processing and analyzing these data.
-
Xenie is a Java application that integrates and represents 'gene to function' information for human genes. Xenie extracts data from several heterogeneous molecular biology databases and provides integrated information in a human-readable and machine-usable way. We defined seven semantic frame classes (Gene, Transcript, Polypeptide, Protein_complex, Isotype, Functional_object, and Cell) as a common schema for storing and integrating gene-to-function information and relationships. Each of the seven semantic frame classes has data fields that store biological data such as gene symbols, disease information, cofactors, and inhibitors. Using these semantic classes, Xenie can show, in general, how many transcripts and polypeptides are known and what the functions of the gene products are. In detail, Xenie provides functional information on a given human gene in the fields of semantic objects that store integrated data from several databases (Brenda, GDB, Genecards, HGMD, HUGO, LocusLink, OMIM, PIR, and SWISS-PROT). Although Xenie provides a fully readable XML document for human researchers, the main goal of the Xenie system is to provide integrated data for other bioinformatics applications. Technically, Xenie provides two output formats: Java persistent objects and XML documents, both of which are widely favored solutions for data exchange. Additionally, UML designs of Xenie and DTDs for the seven semantic frame classes are available for easy data binding to other bioinformatics application systems. We hope that Xenie's output can provide more detailed and integrated information to bioinformatics systems such as those for gene chips, 2D gels, and biopathways. Furthermore, through data integration, Xenie can also enable other bioinformatics systems to ask 'function-based queries' that could not previously be answered because the data were stored separately in heterogeneous databases.
-
Post-genomics may be defined in different ways depending on how one views the challenges after the genome. A popular view is to follow the concept of the central dogma in molecular biology, namely from genome to transcriptome to proteome. Projects are going on to analyze gene expression profiles both at the mRNA and protein levels and to catalog protein 3D structure families, which will no doubt help the understanding of information in the genome. However complete, such catalogs of genes, RNAs, and proteins only tell us about the building blocks of life. They do not tell us much about the wiring (interaction) of building blocks, which is essential for uncovering systemic functional behaviors of the cell or the organism. Thus, an alternative view of post-genomics is to go up from the molecular level to the cellular level, and to understand what I call the "interactome", or a complete picture of molecular interactions in the cell. KEGG (http://www.genome.ad.jp/kegg/) is our attempt to computerize current knowledge on various cellular processes as a collection of "generalized" protein-protein interaction networks, to develop new graph-based algorithms for predicting such networks from the genome information, and to actually reconstruct the interactomes for all the completely sequenced genomes and some partial genomes. During the reconstruction process, it becomes readily apparent that certain pathways and molecular complexes are present or absent in each organism, indicating modular structures of the interactome. In addition, the reconstruction uncovers missing components in an otherwise complete pathway or complex, which may result from misannotation of the genome or misrepresentation of the KEGG pathway. When combined with additional experimental data on protein-protein interactions, such as by yeast two-hybrid systems, the reconstruction possibly uncovers unknown partners for a particular pathway or complex.
Thus, the reconstruction is tightly coupled with the annotation of individual genes, which is maintained in the GENES database in KEGG. We are also trying to expand our literature survey to include in the GENES database the most up-to-date information about gene functions.
-
The past few years have seen a dramatic increase in gene expression data on the basis of DNA microarrays or DNA chips. Going beyond a generic view on the genome, microarray data are able to distinguish between gene populations in different tissues of the same organism and in different states of cells belonging to the same tissue. This affords a cell-wide view of the metabolic and regulatory processes under different conditions, building an effective basis for new diagnoses and therapies of diseases. In this talk we present machine learning techniques for effective mining of DNA microarray data. A brief introduction to the research field of machine learning from the computer science and artificial intelligence point of view is followed by a review of recently-developed learning algorithms applied to the analysis of DNA chip gene expression data. Emphasis is put on graphical models, such as Bayesian networks, latent variable models, and generative topographic mapping. Finally, we report on our own results of applying these learning methods to two important problems: the identification of cell cycle-regulated genes and the discovery of cancer classes by gene expression monitoring. The data sets are provided by the competition CAMDA-2000, the Critical Assessment of Techniques for Microarray Data Mining.
-
Now that the complete genomes of numerous organisms have been ascertained, key problems in molecular biology include determining the functions of the genes in each organism, the relationships that exist among these genes, and the regulatory mechanisms that control their operation. These problems can be partially addressed by using machine learning methods to induce predictive models from available data. My group is applying and developing machine learning methods for several tasks that involve characterizing gene regulation. In one project, for example, we are using machine learning methods to identify transcriptional control elements such as promoters, terminators and operons. In another project, we are using learning methods to identify and characterize sets of genes that are affected by tumor promoters in mammals. Our approach to these tasks involves learning multiple models for inter-related tasks, and applying learning algorithms to rich and diverse data sources including sequence data, microarray data, and text from the scientific literature.
-
The completion of human genome sequencing motivates us to map millions of human cDNAs onto a unique ruler, the genome sequence, in order to identify the exact address of each cDNA together with its exons, its promoter region, and its alternative splicing patterns. The expression patterns of some cDNAs could then be associated with these precise gene addresses, which would further accelerate studies that mine correlations between promoter motifs and gene expression in tissues. Toward this goal, we have developed time- and space-efficient software named SQUALL that can map one cDNA sequence a few thousand bases long onto a genome sequence thirty million bases long in a couple of minutes on average. Using SQUALL, we have mapped twenty thousand of our Bodymap (http://bodymap.ims.u-tokyo.ac.jp) cDNAs onto the genome sequences of chromosomes 21 and 22. In this talk, I will report the status of this ongoing project.
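The abstract does not describe how SQUALL works internally, so the following is only a generic sketch of one common way to locate a cDNA on a genome: index the genome's k-mers, look up each cDNA k-mer, and count hits per diagonal (genome position minus cDNA position). Each exon then shows up as a well-supported diagonal, and the shift between diagonals reflects intron length. All names, sequences and the choice of k are hypothetical.

```python
from collections import Counter, defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def seed_diagonals(cdna, index, k):
    """Count exact k-mer matches per diagonal (genome start - cDNA start)."""
    votes = Counter()
    for j in range(len(cdna) - k + 1):
        for i in index.get(cdna[j:j + k], ()):
            votes[i - j] += 1
    return votes


# Hypothetical toy data: two exons separated by one intron.
prefix = "TTTTT"
exon1 = "ACGTGCATCGAT"
intron = "GTAAG" + "T" * 9 + "AG"
exon2 = "CCGGATATCGCA"
genome = prefix + exon1 + intron + exon2 + "TTTTT"
cdna = exon1 + exon2

K = 8
votes = seed_diagonals(cdna, build_kmer_index(genome, K), K)
for diagonal, count in sorted(votes.items()):
    print(f"diagonal {diagonal}: {count} seed hits")
```

In this toy case the first exon falls on the diagonal equal to the prefix length and the second on a diagonal shifted by the intron length. A practical mapper would also need to tolerate sequencing errors, refine exon boundaries at splice sites, and keep the index within memory for a thirty-million-base sequence.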
-
Given the lack of standards for storing genome-related on-line documents, techniques in Natural Language Processing (NLP) are likely to become more and more important. It is necessary to extract useful information from the raw text and to store it in a computer-readable database format. Recent advances in NLP technologies raise new challenges and opportunities for tackling genome-related on-line text in information extraction tasks. For example, we can obtain much useful information related to genetic networks or metabolic pathways simply by analyzing verbs such as 'activate' or 'inhibit' in Medline abstracts in a fully automatic way. Thus, combining NLP techniques with genome informatics extends beyond the traditional realms of either technology to a variety of emerging applications.
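The idea of reading interactions off verbs such as 'activate' or 'inhibit' can be shown with a deliberately naive sketch. The pattern, the function name and the example sentences (with placeholder gene names) are assumptions made for illustration; a real system would need sentence parsing, named-entity recognition, and handling of passives and negation.

```python
import re

# One token, an activation/inhibition verb, then one token.
RELATION = re.compile(
    r"\b(?P<agent>[A-Za-z0-9-]+)\s+"
    r"(?P<verb>activat(?:es|ed|e)|inhibit(?:s|ed)?)\s+"
    r"(?P<target>[A-Za-z0-9-]+)"
)


def extract_relations(text):
    """Return (agent, normalized verb, target) triples found in the text."""
    relations = []
    for m in RELATION.finditer(text):
        verb = "activate" if m.group("verb").startswith("activat") else "inhibit"
        relations.append((m.group("agent"), verb, m.group("target")))
    return relations


# Made-up sentences with placeholder gene names.
abstract_text = "GeneA activates GeneB. In contrast, GeneC inhibits GeneA in these cells."
for triple in extract_relations(abstract_text):
    print(triple)
```

The triples can be read as directed, signed edges of a candidate network. Note that a passive phrase such as "is inhibited by" would be misread by this pattern, which is one reason fuller linguistic analysis is needed.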
-
Prostate cancer initially regresses in response to androgen depletion therapy, but most human prostate cancers eventually recur and re-grow as androgen-independent tumors. Once these tumors become hormone refractory, they are usually incurable and lead to the death of the patient. Little is known about the molecular details of how prostate cancer cells regress following androgen ablation, or about which genes are involved in androgen-independent growth following the development of resistance to therapy. Such knowledge would reveal putative drug targets useful in rational therapeutic design to prevent therapy resistance and control androgen-independent growth. The application of genome-scale technologies has permitted new insights into the molecular mechanisms associated with these processes. Specifically, we have applied functional genomics, using high-density cDNA microarray analysis for parallel gene expression analysis of prostate cancer in an experimental xenograft system during androgen withdrawal therapy and following therapy resistance. The large amount of expression data generated posed a formidable bioinformatics challenge. A novel template-based gene clustering algorithm was developed and applied to the data to discover the genes that respond to androgen ablation. The data show restoration of the expression of androgen-dependent genes and other signaling genes in the recurrent tumors. Together, the discovered genes appear to be involved in prostate cancer cell growth and therapy resistance in this system. We have also developed and applied tissue microarray (TMA) technology for high-throughput molecular analysis of hundreds to thousands of clinical specimens simultaneously. TMA analysis was used for rapid clinical translation of candidate genes discovered by cDNA microarray analysis, to determine their clinical utility as diagnostic, prognostic, and therapeutic targets.
Finally, we have developed a bioinformatic approach to combine pharmacogenomic data on the efficacy and specificity of various drugs to target the discovered prostate cancer growth associated candidate genes in an attempt to improve current therapeutics.
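The abstract does not give the details of its template-based clustering algorithm. As a generic illustration of the idea, the sketch below assigns each gene to the predefined template profile with which its expression profile correlates best, provided the Pearson correlation reaches a threshold. The templates, time points, gene names, expression values and the threshold are all invented for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def assign_to_templates(profiles, templates, min_r=0.8):
    """Group genes under the best-correlated template; weak matches are dropped."""
    clusters = {name: [] for name in templates}
    for gene, profile in profiles.items():
        best_name, best_r = None, min_r
        for name, template in templates.items():
            r = pearson(profile, template)
            if r >= best_r:
                best_name, best_r = name, r
        if best_name is not None:
            clusters[best_name].append(gene)
    return clusters


# Hypothetical time points: before withdrawal, early, late, recurrent tumor.
templates = {
    "restored_on_recurrence": [1, 0, 0, 1],
    "stays_down": [1, 0, 0, 0],
}
profiles = {
    "gene_x": [2.0, 0.5, 0.4, 1.9],
    "gene_y": [3.0, 1.0, 1.1, 0.9],
    "gene_z": [1.0, 1.0, 1.0, 1.0],
}
print(assign_to_templates(profiles, templates))
```

Because correlation ignores absolute scale, a template only has to encode the shape of the expected response. Genes with flat or poorly matching profiles, such as gene_z here, are left unassigned.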
-
Many microbial genome sequencing projects have been under way in genome centers around the world since the first genome, that of Haemophilus influenzae, was sequenced in 1995. The deluge of microbial genome sequence data demands a new, highly automated data flow system so that genome researchers can manage and analyze their own bulky sequence data from low level to high level. To this end, we developed an automatic data management system for microbial genome projects, which consists mainly of a local database, analysis programs, and a user-friendly interface. We designed and implemented the local database for large-scale sequencing projects; it makes systematic and consistent data management and retrieval possible and is tightly coupled with the analysis programs and the web-based user interface. That is, the results of the analysis programs can be parsed and stored in the local database, and users can retrieve the data at any stage of data processing through the web-based graphical user interface. The analysis programs in our system cover contig assembly, homology search, and ORF prediction, which are essential in genome projects. All but the contig assembly program are in the public domain. These programs are connected with each other by a number of utility programs. As a result, this system will maximize cost and time efficiency in genome research.
-
Protein folding is a fundamental problem in structural bioinformatics, and numerous studies have been devoted to the subject. As the most common regular secondary conformation in proteins, the helix has been an important ingredient of the protein folding problem. In particular, alanine-based polypeptides are widely studied to identify the helix folding process, because alanine is known to have one of the highest helix propensities. In principle, intrinsic helix propensities can be obtained from gas-phase measurements, where solvent effects are absent. Hudgins et al. studied alanine-based peptides in vacuo using a high-resolution ion mobility measurement technique. It was reported that the introduction of a single lysine at the C terminus resulted in the formation of very stable, monomeric polyalanine helices. We have also investigated helix formation in vacuo under different terminal charge conditions and have found a new type of helix motif. To the best of our knowledge, this type of helix conformation has not been characterized before, and we name it the I-helix.