Proceedings of the Korean Society for Bioinformatics Conference (한국생물정보학회:학술대회논문집)
Korean Society for Bioinformatics (ksbsb)
- Other
2005.09a
-
-
In this paper, we describe the development of a web application system whose main objective is to visualize pathways of inter-cytokine relationships, as well as other types of relationships, involving one or more cytokines of interest. A natural language processor first extracts information about the cytokine(s) of interest from a given web page. The extracted results are then processed further and displayed graphically to the user. The system shows how the cytokine(s) of interest interact with other cytokines and cells. Useful information, such as the type of reaction and the catalyst involved, if any, is also displayed. In addition, the system offers functions for graphical manipulation of the visualized pathways. The system has been shown to provide a better overview, and hence improved learning, for readers who are new to this field, by virtue of the accurate inputs obtained from the natural language processing module.
-
We developed a database system to enable efficient, high-throughput transposon analyses in rice. We grow large-scale mutant series of rice by taking advantage of the active MITE transposon mPing, and apply the transposon display method to them to study the correlation between genotypes and phenotypes. However, the analytical phase, in which we locate mutation sites from waveform data called fragment profiles, poses several problems in terms of labor, data management, and reliability of the results. As a solution, our database system manages all the analytical data throughout the experiments and provides several functions and well-designed web interfaces for performing the overall analyses reliably and efficiently.
-
The high dimensionality and insufficient sample size of gene expression and proteomic profiles make feature selection a critical step in efficiently building accurate models for cancer problems from such data sets. In this paper, we use a method called the Discrete Function Learning algorithm to find discriminatory feature vectors based on information theory. The target feature vectors contain all or most of the information (in terms of entropy) of the class attribute. Two data sets are selected to validate our approach: a leukemia subtype gene expression data set and an ovarian cancer proteomic data set. The experimental results show that our method generalizes well when applied to these insufficient, high-dimensional data sets. Furthermore, the obtained classifiers are highly understandable and accurate.
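The entropy criterion mentioned above can be illustrated with a short Python sketch. It is not the Discrete Function Learning algorithm itself; it only shows, on hypothetical toy data, what it means for a feature to carry all (or none) of the entropy of the class attribute, using the identity I(X; Y) = H(X) + H(Y) - H(X, Y). The function names and the sample values are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(features, labels):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two discrete sequences."""
    joint = list(zip(features, labels))
    return entropy(features) + entropy(labels) - entropy(joint)


# Hypothetical toy data: two discretized features and a binary class label.
labels = ["ALL", "ALL", "AML", "AML"]
gene_a = ["low", "low", "high", "high"]   # determines the class completely
gene_b = ["low", "high", "low", "high"]   # independent of the class

# A feature "contains all the information of the class attribute"
# when its mutual information with the class equals H(class).
print(mutual_information(gene_a, labels), entropy(labels))
print(mutual_information(gene_b, labels))
```

In this toy case the first feature reaches the full class entropy and the second contributes nothing, which is the kind of distinction an entropy-based selector relies on.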
-
DNA microarrays have become a major tool for investigating global gene expression in all aspects of cancer and biomedical research. A DNA microarray experiment generates enormous amounts of data, which are meaningful only in the context of a detailed description of the microarrays, the biomaterials, and the conditions under which the data were generated. The MicroArray Gene Expression Data (MGED) Society has established a microarray standard for the structured management of these large and diverse data. MGED MAGE-OM (MicroArray Gene Expression Object Model) is an object-oriented data model that attempts to define standard objects for gene expression. Assessing the relevance of DNA microarray analysis to cancer research requires combining clinical and genomic data. MAGE-OM, however, does not have an appropriate structure for describing clinical information on cancer. For the systematic integration of gene expression and clinical data, we created a new model, the Cancer Genomics Object Model.
-
The interactions of twelve erythromycin A analogues with the 50S ribosomal subunit were studied using AutoDock 3.0.5. The results showed that all active macrolides bound at the same binding site as erythromycin A, in contrast to the inactive analogues, which bound at locations slightly different from that of erythromycin A. The binding site was consistent with the X-ray data in terms of the hydrogen bonding and hydrophobic interactions formed by erythromycins, roxithromycin, azithromycin, cethromycin and telithromycin with the ribosome. The inactive derivatives of erythromycin A anhydride showed higher binding free energies, while 5-desosaminyl erythronolides A and B, despite having binding free energies quite similar to those of the active analogues, docked at binding sites quite different from those of the active analogues. These results suggest that molecular docking can be used to predict the binding of erythromycin A analogues to their ribosomal target.
-
Cytokines play a crucial role in immune and inflammatory responses. However, because of the high evolutionary rate of these proteins, the similarity between different members of a family is very low, which makes the identification of novel cytokines very difficult. To address this, the paper proposes a new bioinformatic strategy that uses hidden Markov models (HMMs) to identify novel cytokines of the short-chain and long-chain 4-${\alpha}$-helix cytokine families. Two motifs were built from the two training data sets and used to search three different databases. To improve the results, a strict criterion was established to filter novel cytokines among the subject proteins. Finally, according to their E-values, scores and the criterion, four subject proteins were predicted to be possible novel cytokines for each family.
-
SGS (Splicing Graph Server) is a web application based on the MVC architecture on the Java platform. The implemented design pattern is closely tied to the specific requirements of splicing graphs for analyzing alternative splice variants of a single gene. The paper presents the use of the MVC architecture for this bioinformatics web application, with JavaBeans as the model, JSP as the view, and a servlet as the controller, running on the open-source Apache Tomcat application server with a MySQL database management system.
-
cDNA microarray experiments contain many sources of systematic variation that affect the measured gene expression levels, such as differences in labeling efficiency between the two fluorescent dyes. Print-tip lowess normalization is used in situations where dye biases can depend on overall spot intensity and/or spatial location within the array. However, print-tip lowess normalization performs poorly when the error variability for each gene is heterogeneous over intensity ranges. We propose new print-tip normalization methods based on support vector machine regression (SVMR) and support vector machine quantile regression (SVMQR). SVMQR was derived by applying the basic principle of the support vector machine (SVM) to the estimation of linear and nonlinear quantile regressions. We applied the proposed methods to previously reported cDNA microarray data from apolipoprotein-AI-knockout (apoAI-KO) mice, diet-induced obese mice, and genistein-fed obese mice. Our statistical analysis shows that the proposed methods perform better than the existing print-tip lowess normalization method.
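The general shape of a print-tip normalization can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' SVMR or SVMQR method: spots are described by the usual log-ratio M = log2(R/G) and mean log-intensity A = 0.5 * log2(R * G), and within each print-tip group a fitted intensity-dependent trend is subtracted from M. The trend here is an ordinary least-squares line, chosen only as a simple stand-in for lowess or support vector regression; all function names are hypothetical.

```python
from collections import defaultdict
from math import log2


def ma_values(red, green):
    """Return (M, A) lists: M = log2(R/G), A = 0.5 * log2(R*G)."""
    m = [log2(r) - log2(g) for r, g in zip(red, green)]
    a = [0.5 * (log2(r) + log2(g)) for r, g in zip(red, green)]
    return m, a


def linear_fit(x, y):
    """Least-squares line; a stand-in for lowess or SVM regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda value: intercept + slope * value


def normalize_by_print_tip(m, a, tips, fit=linear_fit):
    """Subtract the fitted M-versus-A trend separately in each print-tip group."""
    groups = defaultdict(list)
    for index, tip in enumerate(tips):
        groups[tip].append(index)
    normalized = [0.0] * len(m)
    for indices in groups.values():
        trend = fit([a[i] for i in indices], [m[i] for i in indices])
        for i in indices:
            normalized[i] = m[i] - trend(a[i])
    return normalized


# Hypothetical intensities with a constant two-fold dye bias on two print tips.
red = [200.0, 400.0, 800.0, 1600.0]
green = [100.0, 200.0, 400.0, 800.0]
m, a = ma_values(red, green)
print(normalize_by_print_tip(m, a, tips=[0, 0, 1, 1]))
```

Passing a different `fit` callable is the point at which a lowess or support vector regression smoother would be substituted.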
-
Microarray gene expression profiling is one of the most important research topics in the clinical diagnosis of disease. Among thousands of genes, only a small number show a strong correlation with a given phenotype. Identifying such an optimal subset from thousands of genes is intractable, yet it plays a crucial role in classifying multi-class gene expression profiles of tumor samples. This paper proposes an efficient classifier design method that simultaneously selects the most relevant genes using an intelligent genetic algorithm (IGA) and designs an accurate classifier using a support vector machine (SVM). IGA, with an intelligent crossover operation based on orthogonal experimental design, can efficiently solve large-scale parameter optimization problems. Therefore, the parameters of the SVM and the binary parameters for gene selection are all encoded in one chromosome to achieve simultaneous optimization of gene selection and the associated SVM for accurate tumor classification. The effectiveness of the proposed IGA/SVM method is evaluated using four benchmark datasets. Computer simulation shows that IGA/SVM performs better than the existing method in terms of classification accuracy.
-
Transcription factors regulate gene expression by binding to the upstream regions of genes. Each transcription factor has a specific binding site in the promoter region. Analysis of gene upstream sequences is therefore necessary for understanding the regulatory mechanisms of genes, under the plausible assumption that DNA sequence motif profiles are closely related to the expression behavior of the corresponding genes. Here, we present an effective approach to analyzing the relation between gene expression profiles and gene upstream sequences on the basis of kernel canonical correlation analysis (kernel CCA). Kernel CCA is a useful method for finding relationships underlying two different data sets. Applied to a yeast cell cycle data set, it shows that gene upstream sequence profiles are closely related to gene expression patterns in terms of canonical correlation scores. By further analyzing the contributing values, or weights, of sequence motifs in the construction of a pair of sequence motif profiles and expression profiles, we show that the proposed method can identify significant DNA sequence motifs involved in specific gene expression patterns during the yeast cell cycle, including both well-known motifs and putative ones.
-
Orthologs are genes with the same function in different species that derive from a single gene in the last common ancestor of those species. Orthologous groups are useful in genome annotation, studies of gene evolution, and comparative genomics. However, the construction of orthologous groups is difficult to automate and very time-consuming. It is also hard to guarantee the accuracy of the constructed groups. We propose a system that constructs orthologous groups for many genomes automatically and rapidly. We use grid computing to reduce the sequence alignment time, and we apply a clustering algorithm within the database application to automate the whole process. Thanks to grid computing, we generated orthologous groups for 20 complete prokaryotic genomes in just one day. Furthermore, new genomes can be accommodated easily by the clustering algorithm and grid computing. We compared the generated orthologous groups with COGs (Clusters of Orthologous Groups of proteins) and KO (KEGG Orthology). The comparison shows about 85 percent similarity with these well-known orthologous databases.
-
Three species of the genus Pseudomonas (P. aeruginosa, P. putida, P. syringae) show highly different phenotypic characteristics. Two of the three are pathogenic and the other is non-pathogenic. Comparative analyses of the complete genomes can elucidate the genomic similarities and differences among them. We analyzed the three genomes and their genes to reveal the degree of chromosome conservation and the similarity of the genes. In the two-dimensional dot plots, the pathogenic P. aeruginosa and the non-pathogenic P. putida shared a higher proportion of nucleotide sequence than the other two pairs. Comparison of nucleotide composition through genome-scale plots of G+C content and GC skew showed positional variation. Comparison of metabolic capabilities using the functional classification of KEGG orthology revealed that differences in the number of genes in specific functional categories result in the phenotypic differences. Finally, the combined analyses using protein homologs supported the evolutionary distance of P. putida obtained from the other genome-scale comparisons.
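The two composition measures named above are simple to compute. The Python sketch below uses the standard definitions, G+C content = (G + C) / length and GC skew = (G - C) / (G + C), evaluated over sliding windows. The window size, step, and toy sequence are arbitrary assumptions; the abstract does not state the settings actually used.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g + c) / len(seq) if seq else 0.0


def gc_skew(seq):
    """(G - C) / (G + C); defined as 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding_windows(seq, size, step):
    """Yield (start, window) pairs for full-length windows only."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


# Hypothetical toy sequence; a real genome would use windows of many kilobases.
genome = "GGGCATATGCCCATGGGG"
for start, window in sliding_windows(genome, size=6, step=6):
    print(start, gc_content(window), gc_skew(window))
```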
-
In this paper, we first report experimental results on applying information extraction (IE) methodology to the task of summarizing clinical trial design information, focusing on ‘Compared Treatment’, ‘Endpoint’ and ‘Patient Population’, from clinical trial MEDLINE abstracts. From these results, we have come to see this problem as one that can be decomposed into a sentence classification subtask and an IE subtask. We hypothesize that classifying sentences from clinical trial abstracts and performing IE only on the sentences most likely to contain relevant information can increase the accuracy of the extracted information. In preparation for testing this hypothesis in the next stage, we conducted an experiment applying state-of-the-art sentence classification techniques to clinical trial abstracts and evaluated their potential for the original task of summarizing clinical trial design information.
-
In the multidimensional protein identification technology of high-throughput proteomics, samples are separated by one-dimensional gel electrophoresis and then by two-dimensional liquid chromatography, and are analyzed by tandem mass spectrometry. In this study, we analyzed the proteins of Pseudomonas putida KT2440. For protein identification, the protein database was combined with its reversed-sequence database. With peptides selected at an error rate below 1%, the SEQUEST database search of the tandem mass spectral data identified 2,045 proteins. For each protein, we compared the molecular weight calibrated from the 1D-gel band position with the theoretical molecular weight computed from the amino acid sequence by defining a variable MW$_{corr}$. Since the bacterial proteome is simpler than the human proteome in terms of complexity and modifications, the proteome analysis results for Pseudomonas putida KT2440 could suggest a guideline for building a protocol to analyze human proteome data.
-
This paper studies the problem of inferring a chemical compound from a feature vector consisting of the numbers of occurrences of vertex-labeled paths, which has potential future applications for designing new chemical compounds based on the kernel methods. This paper shows that the problem for outerplanar graphs of bounded degree can be solved in polynomial time if an alphabet is fixed and the maximum length of paths and the number of edges of each face are bounded by a constant. It is also shown that the problem is strongly NP-hard even for trees of unbounded degree.
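The forward direction of this problem, mapping a vertex-labeled graph to its vector of path counts, can be sketched in Python as follows. The sketch rests on assumptions the abstract does not spell out: paths are simple (no repeated vertices), each path is counted once per traversal direction, and path length is bounded by a hypothetical `max_vertices` parameter. The inverse problem studied in the paper, recovering a graph from such a vector, is not attempted here.

```python
from collections import Counter


def path_feature_vector(adjacency, labels, max_vertices):
    """Count label sequences of simple paths with at most `max_vertices` vertices.

    adjacency: dict mapping each vertex to a list of neighbouring vertices.
    labels:    dict mapping each vertex to its label (e.g. an atom type).
    Each undirected path is counted once per traversal direction.
    """
    counts = Counter()

    def extend(path):
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) == max_vertices:
            return
        for neighbour in adjacency[path[-1]]:
            if neighbour not in path:      # keep the path simple
                extend(path + [neighbour])

    for vertex in adjacency:
        extend([vertex])
    return counts


# Hypothetical three-atom chain C-C-O.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
features = path_feature_vector(adjacency, labels, max_vertices=2)
print(dict(features))
```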
-
Recently, as genetic knowledge has grown ever faster, automated analysis and systematization into high-throughput databases have become a major issue. One essential task is to recognize and identify genomic entities and discover their relations. However, the ambiguity of named entities is a serious problem because of their multiple meanings and types. Many effective techniques have been proposed for analyzing documents, yet their accuracy is high only when the data fit the model well. The purpose of this paper is to design and implement a document classification system for identifying entity problems that combines text mining and data mining, supplemented by rich data mining algorithms to enhance its performance. We propose the RTP ost system, which differs in style from traditional methods and adopts a fault-tolerant system approach and a data mining strategy. This feedback cycle can enhance the accuracy of text mining. To verify its feasibility, we tested our system on classifying RB-related documents among PubMed abstracts.
-
Identifying orthologous paralogenes is a fundamental problem in comparative genomics and can facilitate the study of the evolutionary history of species. Existing approaches for locating paralogs use local-alignment-based algorithms such as BLAST. However, there are cases in which genes with high alignment scores are not paralogenes. Whole genome alignment tools, on the other hand, are designed to locate orthologs. Most of these tools identify an orthologous pair through unique substrings (called anchors) in that pair. Intuitively, these tools may not seem useful for identifying orthologous paralogenes, because paralogenes are very similar and there may not be enough unique anchors. However, our study shows that this is not true: although paralogenes are similar, they have undergone different mutations, so there are enough unique anchors to identify them. Our contributions are as follows. Based on this counter-intuitive finding, we propose employing whole genome alignment tools to help verify paralogenes. Our experiments on five pairs of human-mouse chromosomes show that our approach is effective and can identify most (more than 80%) of the misclassified paralog groups. We also verify, through a simulation study, our finding that whole genome alignment tools are able to locate orthologous paralogenes.
-
In comparative genome analyses, synteny blocks play important roles in finding orthologous genes, reconstructing phylogenetic trees, and predicting genome rearrangement events. In this paper, we propose a novel method for finding biologically plausible synteny blocks, not only from the viewpoint of finding highly preserved regions but also from the viewpoint of analyzing genome rearrangements. We applied the method to four fungal organisms and succeeded in obtaining some biologically interesting results.
-
With the increasing availability of mammalian genome sequences, it has become possible to use large-scale phylogenetic analysis to locate potentially functional regions. In this paper, we describe a new probabilistic method for characterizing phylogenetic conservation in mammalian DNA sequences. We used this method to analyze Hox gene clusters, based on the alignment of 6 species, and constructed a map indicating short and long conserved fragments and their positions with respect to the known locations of Hox genes and other elements, sometimes showing surprising layouts.
-
The Growing Self-Organizing Map (GSOM), an extended type of the Self-Organizing Map, is a widely accepted tool for clustering high dimensional data. It is also suitable for the clustering of short DNA sequences of phylogenetic genomes by their oligonucleotide frequency. The GSOM presents the result of the clustering process visually on a coloured map, where the clusters can be identified by the user. This paper describes a proposal for automatic cluster detection on this map without any participation by the user. It has been applied with good success on 20 different data sets for the purpose of species separation.
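The input representation mentioned above, an oligonucleotide frequency vector for each short DNA sequence, can be sketched in Python as below. The oligonucleotide length k = 4 is an assumption chosen for illustration, since the abstract does not state it, and the function name is hypothetical. The GSOM training and the automatic cluster detection are not shown.

```python
from itertools import product


def oligo_frequencies(seq, k=4):
    """Relative frequency of every DNA k-mer in `seq`, in lexicographic order.

    Windows containing symbols other than A, C, G, T (for example N)
    are skipped. The result has 4**k entries.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# Hypothetical fragment; each such vector would be one input sample for the map.
vector = oligo_frequencies("ACGTACGTTTGACCA", k=4)
print(len(vector), sum(vector))
```

Because every fragment maps to a vector of the same length regardless of its own length, fragments from different genomes become directly comparable points for clustering.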
-
Ahn, Ji-Young;Nam, Ky-Youb;Chang, Byung-Ha;Yoon, Jeong-Hyeok;Cho, Seung-Joo;Koh, Hun-Yeong;No, Kyoung-Tai 134
Designing an ion-channel-targeted library is a valuable methodology that can aid in selecting and prioritizing compounds with potential ion-channel-likeness for ion-channel-targeted bio-screening from a large pool of commercially available chemicals. Differences in property profiles between the 93 ion-channel-active compounds from the MDDR and CMC databases and the ACDSC compounds were characterized using suitable descriptors calculated with the preADME software. Through PCA, clustering, and similarity analysis, compounds capable of ion channel activity were identified in the ACDSC compound pool. The designed library tended to follow the property profile of ion-channel-active compounds and can be applied, with great savings in time and cost, to ligand-based drug design or virtual high-throughput screening of an enormous small-molecule space.
-
Lee, Joo-Youn;Nam, Ky-Youb;Min, Yong-Ki;Park, Chan-Koo;Lee, Hyun-Gul;Kim, Bum-Tae;No, Kyoung-Tai 139
Cytochrome P450 14${\alpha}$-sterol demethylase (CYP51) is the target of azole-type antifungals. Azoles block ergosterol synthesis and thereby inhibit fungal growth. A three-dimensional (3D) homology model of CYP51 from Candida albicans was constructed based on the X-ray crystal structure of CYP51 from Mycobacterium tuberculosis. Using this model, the binding modes of the substrate (24-methylene-24,25-dihydrolanosterol) and of known inhibitors (fluconazole, voriconazole, oxiconazole, miconazole) were predicted by docking. Virtual screening was performed using Structure Based Focusing (SBF). In this procedure, the pharmacophore models for the database search were generated from the protein-ligand interactions. The initial structure-based virtual screening selected 15 compounds from a commercially available 3D database of approximately 50,000 molecules. When evaluated in a cell-based assay, 5 of these compounds were identified as potent inhibitors of Candida albicans CYP51 (CACYP51) with low minimal inhibitory concentrations (MIC). The MIC was 0.5 ${\mu}$g/ml for BMD-09-01 through BMD-09-04 and 1 ${\mu}$g/ml for BMD-09-05. These new inhibitors provide a basis for the rational design of new, more efficacious non-azole antifungal agents.
-
Proper primer design determines the success or failure of a Polymerase Chain Reaction (PCR). In this project, we develop GENE-PRIMER, a genome-specific PCR primer design program that is amenable to genome-wide scale. To achieve this, we incorporated various biologically significant parameters into the program, namely primer length, primer melting temperature (Tm), guanine/cytosine (GC) content, homopolymeric runs, and self-hybridization tendency. In addition, the BLAST algorithm is used to check primer specificity. In summary, the selected primers adhere to the physico-chemical criteria and also display specificity for the intended binding site in the genome.
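Several of the physico-chemical checks listed above can be sketched in a few lines of Python. This is an illustration only, not GENE-PRIMER's implementation: the melting temperature uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C), because the abstract does not say which formula the program applies, and every threshold in `passes_filters` is a hypothetical default. Self-hybridization and the BLAST specificity check are omitted.

```python
from itertools import groupby


def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough Tm in degrees Celsius by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_homopolymer(primer):
    """Length of the longest run of a single repeated base."""
    return max(len(list(run)) for _, run in groupby(primer.upper()))


def passes_filters(primer, length=(18, 25), tm=(50, 65), gc=(0.4, 0.6), max_run=4):
    """Apply all checks; the threshold values are illustrative assumptions."""
    return (length[0] <= len(primer) <= length[1]
            and tm[0] <= wallace_tm(primer) <= tm[1]
            and gc[0] <= gc_fraction(primer) <= gc[1]
            and longest_homopolymer(primer) <= max_run)


# Hypothetical candidates: the second fails on its run of six A bases.
for candidate in ("ATGCATGCATGCATGCATGC", "AAAAAATGCATGCATGCATG"):
    print(candidate, wallace_tm(candidate), passes_filters(candidate))
```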
-
Saithong, Treenut;Saraboon, Piyaporn;Meechai, Asawin;Cheevadhanarak, Supapon;Bhumiratana, Sakarindr 151
Recently, systems biology has been increasingly applied to gain insight into the complexity of living organisms. Much otherwise inaccessible biological information and hidden evidence, for example the flux distribution of metabolites, can be revealed simply by investigating the behavior of artificial cells. Most bio-models are models of single-cell organisms and cannot handle multicellular organisms such as plants. Herein, a structured, multicellular model of potato was developed to understand root starch biosynthesis. On the basis of the simplest plant cell biology, the structured potato model, built on the Berkeley Madonna platform, was divided into three parts: photosynthetic (leaf), non-photosynthetic (tuber) and transportation (phloem) cells. The model of starch biosynthesis begins with the fixation of atmospheric CO$_2$ in the Calvin cycle. Through a series of reactions, triose phosphate from the Calvin cycle is converted to sucrose, which is transported to the sink cells and eventually forms amylose and amylopectin (the constituents of starch). Validation against data from the literature shows that the structured model is a good representation of the studied system. One illustration of the validation is the elevation of triose phosphates (DHAP and GAP) when aldolase activity is reduced. Furthermore, the model was used to gain more understanding of the starch production process, such as the effect of CO$_2$ uptake on qualitative and quantitative aspects of starch biosynthesis.
-
We introduce a fully automatic clustering method to classify candidate paralog clusters from a set of protein sequences within one genome. The set of protein sequences is represented as a graph of protein relationships in which each node is the amino acid sequence of a protein and the sequence similarities among them constitute the edges. We use graph-based clustering methods to identify structurally consistent sets of nodes that are strongly connected with each other. Our results are consistent with those from current leading systems based on manual curation, such as COG/KOG and KEGG. All the results are viewable at http://www.cs.rutgers.edu/~seabee.
-
Data mining techniques can be applied to identify patterns of interest in gene expression data. One goal in mining gene expression data is to determine how the expression of a particular gene might affect the expression of other genes. To find relationships between different genes, association rules have been applied to gene expression data sets [1]. A notable limitation of association rule mining is that it detects associations only within a single profile experiment. It cannot be used to find rules across different condition profiles or across different time points. However, with the appearance of time-series microarray data, it has become possible to analyze the temporal relationships between genes. In this paper, we analyze time-series microarray gene expression data to extract sequential patterns, which are similar to association rules, between genes at different time points in the yeast cell cycle. The sequential patterns found in our work capture associations between different genes that are expressed or repressed at different time points. We applied a sequential pattern mining method to time-series microarray gene expression data and discovered a number of sequential patterns in two groups of genes (test and control); more sequential patterns were discovered in the test group (genes sharing the same GO term) than in the control group (genes with different GO terms). This result supports the potential of sequential patterns to capture biologically meaningful associations between genes.
-
This paper introduces a method that improves the identification of splice sites in the genomic DNA sequences of eukaryotes. The method combines a low-order Markov model in series with a neural network to predict splice sites. The low-order Markov model incorporates biological knowledge about the regions surrounding the splice sites as probabilistic parameters. The neural network takes the Markov-encoded parameters as inputs and produces the prediction. Two types of neural networks are used for comparison. The method reduces computational complexity and shows encouraging accuracy in predicting splice sites when applied to several standard splice site datasets.
-
A three-dimensional (3D) model of the catalytic region of Type II Pseudomonas sp. USM 4-55 PHA synthase 1 (PhaC1$_{P.sp\;USM\;4-55}$), from residue 267 to residue 484, was developed. Sequence analysis demonstrated that PhaC1$_{P.sp\;USM\;4-55}$ lacked homology with any entry in the known structural databases. PSI-BLAST and HMM Superfamily analyses demonstrated that this enzyme belongs to the ${\alpha}/{\beta}$ hydrolase fold family. A threading approach revealed that the most suitable template was human gastric lipase (1HLG). Superimposition of the predicted PhaC1$_{P.sp\;USM\;4-55}$ model on the 1HLG template structure, covering 86.2% of the backbone atoms, gave an RMSD of 1.15 ${\AA}$. The catalytic residues, comprising Cys296, Asp451, His452 and His479, were found to be conserved and located adjacent to each other. We propose that the catalytic mechanism involves the formation of two tetrahedral intermediates.
-
In this paper, we propose a heuristic method for selecting features using a Two-Phase Markov Blanket-based (TPMB) algorithm. The first phase of the TPMB algorithm, the filtering phase, removes the obviously redundant features. A non-linear correlation measure based on information theory is used as the metric for the redundancy of a feature [1]. In the second phase, the approximating phase, the Markov Blanket (MB) of a system is estimated using the concept of cross entropy. We performed experiments on microarray data and report results for two popular datasets, AML-ALL [3] and colon tumor [4], in this paper. The experimental results show that the TPMB algorithm can significantly reduce the number of features while maintaining the accuracy of the classifiers.
-
Genetic networks are a key to unraveling the dynamic properties of biological processes, and the regulation of genes plays an essential role in the dynamic behavior of genetic networks. A popular way to characterize gene regulation is a kinetic model. However, many kinetic parameters of genetic regulation are not yet available. To overcome this difficulty, this report presents a state-space approach to modeling gene regulation. Second-order systems are used to characterize gene regulation. The interpretation of the coefficients of the second-order systems as resistance, capacitance and inductance is studied. Mathematical methods for analyzing the transient response of gene regulation to external perturbation are investigated. A criterion for classifying genes into three categories (underdamped, overdamped and critically damped) is discussed. The proposed models are applied to yeast cell cycle gene expression data.
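The three-way classification mentioned above follows from standard second-order system theory, which can be sketched in Python as below. The sketch assumes a generic model $a\ddot{x} + b\dot{x} + cx = u(t)$ with $a, c > 0$ and $b \ge 0$; the coefficient names are hypothetical, and the paper's own parameterization in terms of resistance, capacitance and inductance is not reproduced. The damping ratio is $\zeta = b / (2\sqrt{ac})$, and the response is underdamped for $\zeta < 1$, critically damped for $\zeta = 1$, and overdamped for $\zeta > 1$.

```python
from math import isclose, sqrt


def damping_ratio(a, b, c):
    """zeta = b / (2 * sqrt(a * c)) for a*x'' + b*x' + c*x = u(t)."""
    if a <= 0 or c <= 0 or b < 0:
        raise ValueError("this sketch assumes a > 0, c > 0 and b >= 0")
    return b / (2.0 * sqrt(a * c))


def classify(a, b, c, tol=1e-9):
    """Return the damping category of the second-order system."""
    zeta = damping_ratio(a, b, c)
    if isclose(zeta, 1.0, abs_tol=tol):
        return "critically damped"
    return "underdamped" if zeta < 1.0 else "overdamped"


# Hypothetical coefficient sets, one per category.
for coefficients in ((1.0, 1.0, 1.0), (1.0, 2.0, 1.0), (1.0, 3.0, 1.0)):
    print(coefficients, classify(*coefficients))
```

An underdamped gene in this picture would show an oscillating transient after a perturbation, while an overdamped one would relax without overshoot.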
-
When we want to discover the regulatory relationships between genes from gene expression data, dimensionality is one of the major problems. In general, the size of the search space in modeling regulatory relationships grows as O(n$^2$) as the number of genes n increases. However, it can be reduced to O(kn), for a selected k, by applying divide-and-conquer heuristics that depend on certain assumptions about the genetic network. In this paper, we approach the modeling problem in a divide-and-conquer manner. We apply clustering to break the problem into small sub-problems, and then apply a hierarchical modeling process to those sub-problems.
-
Cell phenotypes are determined by the concerted activity of thousands of genes and their products. This activity is coordinated by a complex network that regulates the expression of genes. Understanding this organization is crucial for elucidating cellular activities, and many researchers have tried to construct gene regulatory networks from mRNA expression data, which are currently the most widely available data and carry a great deal of information about cellular processes. Several computational tools, such as Boolean networks, qualitative networks, and Bayesian networks, have been applied to infer these networks. Among them, Bayesian networks, which we chose as our inference tool, have often been used in this field recently because of their well-established theoretical foundation and statistical robustness. However, the small number of experiments relative to the number of genes leads to many false positive inferences. To alleviate this problem, we previously developed MONET (MOdularized NETwork learning), a new method for inferring modularized gene networks by using two complementary sources of information: biological annotations and gene expression. We have since packaged and improved MONET by combining dispersed functional blocks, extending the range of species that the system accepts as input, reducing time complexity through improved algorithms, and simplifying input/output formats and parameters so that it can be used in practice. In this paper, we present the architecture of the improved MONET system.
-
Immunoglobulins (IG), T cell receptors (TR) and major histocompatibility complex (MHC) proteins are major components of the immune system. Their experimentally determined three-dimensional (3D) structures are numerous, and their retrieval and comparison are problematic. IMGT, the international ImMunoGeneTics information system® (http://imgt.cines.fr), has devised a controlled vocabulary and annotation rules for the sequences and 3D structures of IG, TR and MHC. Annotated data from IMGT/3Dstructure-DB, the IMGT 3D structure database, are used in this paper to compare the 3D structures of domains and receptors, and to characterize IG/antigen, peptide/MHC and TR/peptide/MHC interfaces. The analysis includes angle measurements to assess receptor flexibility, structural superimposition, and contact analysis. Up-to-date data and analysis results are available at the IMGT Web site, http://imgt.cines.fr.
-
Vorapreeda, Tayvich;Kittichotirat, Weerayuth;Meechai, Asawin;Bhumiratana, Sakarindr;Cheevadhanarak, Supapon 215
Enzymes in the starch biosynthesis pathway generally exist in many isoforms, which makes it difficult to dissect their specific roles in controlling starch properties. In this study, we present an algorithm as an alternative method for classifying isoforms of starch biosynthesis enzymes based on their conserved secondary structures. Analysis of the predicted secondary structure of plant soluble starch synthase I (SSI) and soluble starch synthase II (SSII) demonstrates that these two classes of isoform can be reclassified into three subsets, SS-A, SS-B and SS-C, according to differences in the secondary structure of the protein at the C-terminus. SS-A shows unique structural features that are conserved only in cereal plants, whereas those of SS-B are found in all plants and SS-C is restricted to barley. These findings enable us to estimate the evolutionary distance between isoforms of starch synthases more accurately. Moreover, they facilitate the elucidation of correlations between the functions of each enzyme isoform and the properties of starches. Our secondary structure analysis tool can be applied to study the functions of other plant enzyme isoforms of economic importance.
-
THEMATICS is a simple computational method for predicting functional sites in proteins. The method computes the theoretical titration curves of the ionizable residues of a protein using its 3D structure, determines the residues with perturbed, non-Henderson-Hasselbalch titration behavior, and identifies clusters of these perturbed residues in physical proximity. We have shown previously that this method is highly successful in predicting catalytic sites in enzymes. In the present study, we apply the method to non-catalytic ligand-binding proteins. It is shown that THEMATICS can predict non-catalytic binding sites. The success rate is better than 80 % for a set of 30 non-catalytic, ligand-binding proteins. The application of the method to Glutamine-binding protein from E. coli is discussed in detail.
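As a point of reference for the "non-Henderson-Hasselbalch" behavior mentioned above, the following minimal sketch computes the ideal Henderson-Hasselbalch titration curve of a single isolated ionizable group. The function name and the pKa value are illustrative assumptions and are not part of THEMATICS; the method itself derives titration curves from the 3D structure, which is not attempted here.

```python
def protonated_fraction(ph: float, pka: float) -> float:
    """Ideal Henderson-Hasselbalch fraction of a group that is protonated at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


if __name__ == "__main__":
    pka = 4.0  # illustrative pKa, chosen only for the demonstration
    for ph in range(0, 15, 2):
        # An unperturbed residue follows this sigmoid; a perturbed one deviates from it.
        print(ph, round(protonated_fraction(float(ph), pka), 4))
```

A residue whose computed curve departs markedly from this sigmoid shape is the kind of residue the abstract describes as perturbed.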
-
The protein side-chain packing problem (SCPP) is known to be NP-complete. Various graph-theoretic side-chain packing algorithms have been proposed. However, as the size of the protein grows, the sampling space increases exponentially. Hence, one approach to coping with the time complexity is to decompose the graph of the protein into smaller subgraphs. Some existing approaches decompose the graph into biconnected components at an articulation point (resulting in subgraphs of at most 21 residues) or solve the SCPP by tree decomposition (subgraphs of 4 or 5 residues). We previously presented a deterministic approach called SPWCQ, based on the notion of a maximum edge-weight clique, in which we reduce the SCPP to a graph and then obtain the maximum edge-weight clique of that graph. This algorithm performs well for proteins of fewer than 500 residues. However, it fails to produce a feasible solution for larger proteins because of the size of the search space. In this paper, we present a new heuristic approach to the side-chain packing problem, based on the maximum edge-weight clique finding algorithm, that enables us to compute the side-chain packing of much larger proteins. Our new approach can compute the side-chain packing of a protein of 874 residues with an RMSD of 1.423 ${\AA}$.
-
Predicting the cellular location of an unknown protein gives valuable information for inferring the possible function of the protein. A more accurate prediction system requires a good feature extraction method that transforms the raw sequence data into a numerical feature vector while minimizing information loss. In this paper, we propose new methods for extracting underlying features from the sequence data alone by computing pairwise sequence alignment scores. In addition, we use composition-based features to improve prediction accuracy. To construct an SVM ensemble from separately trained SVM classifiers, we propose specificity-based weighted majority voting. The overall prediction accuracy evaluated by 5-fold cross-validation reached 88.53% for the eukaryotic animal data set. By comparing the prediction accuracy of various feature extraction methods, we gained biological insight into the location of targeting information. Our numerical experiments confirm that the new feature extraction methods are very useful for predicting the subcellular localization of proteins.
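The voting step can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the code assumes one plausible reading: each classifier's vote for a label is weighted by that classifier's specificity for that label. The function name, the data layout and the numbers are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, specificity):
    """Combine classifier outputs by weighted majority voting.

    predictions: list of predicted labels, one per classifier.
    specificity: list of dicts; specificity[i][label] is the (assumed)
                 specificity of classifier i for that label, used as its vote weight.
    """
    scores = defaultdict(float)
    for i, label in enumerate(predictions):
        scores[label] += specificity[i].get(label, 0.0)
    # sorted() makes tie-breaking deterministic (alphabetical order of labels).
    return max(sorted(scores), key=lambda lab: scores[lab])


if __name__ == "__main__":
    preds = ["nuclear", "cytoplasmic", "cytoplasmic"]
    spec = [{"nuclear": 0.95}, {"cytoplasmic": 0.4}, {"cytoplasmic": 0.5}]
    print(weighted_vote(preds, spec))
```

With these illustrative weights a single highly specific classifier can outvote two less specific ones, which is the main difference from plain majority voting.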
-
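Protein interaction prediction based on domain combinations is known to achieve high accuracy for particular species such as yeast. Before it can be applied to the proteins of higher organisms such as humans, however, the suitability of the method must be verified for multiple species, and ways of constructing an optimal training set must be studied. In this paper, we examine whether the domain combination method generalizes by verifying its prediction accuracy on fruit fly proteins, and, through cross-species interaction prediction experiments and accuracy verification, we describe how to construct training sets for predicting protein interactions in relatively less-studied species. In the fruit fly experiment, 80% and 20% of 10,351 interacting protein pairs were used as the training set and the test set, respectively, and prediction accuracy was observed while the training set of non-interacting protein pairs was varied from one to five times that size. This yielded a sensitivity of 77.58% and a specificity of 92.61%. In the cross-species experiments, protein interactions of Human, Mouse, E. coli, C. elegans and others were predicted using each of the training sets derived from yeast and fruit fly. The results show that accuracy increases as the domains of the training set overlap more with those of the test set, and that the Domain Overlapping Rate (DOR), which we devised to express the similarity between domain sets, is an important factor in interaction prediction accuracy.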
-
Angiotensin-converting enzyme (ACE) is primarily responsible for human hypertension. Current ACE drugs cause serious side effects such as cough and angioedema because of their nonspecific activity toward the ACE protein. The availability of the ACE crystal structure (1UZF) provides a plausible biological orientation of inhibitors in the ACE active site (C-domain). Three-dimensional quantitative structure-activity relationship (3D-QSAR) models have been constructed using comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA) for a series of 28 ACE inhibitors. The alignment for CoMFA obtained by docking ligands to the 1UZF protein with the FlexX program gave a better statistical model than superposition of corresponding atoms. The statistical parameters indicate reasonable models for both CoMFA (q$^2$ = 0.530, r$^2$ = 0.998) and CoMSIA (q$^2$ = 0.518, r$^2$ = 0.990). The 3D-QSAR analyses provide valuable information for the design of ACE inhibitors with potent activity towards the C-domain of ACE. The group substitutions involving the phenyl ring and carbon chain at the propionyl and sulfonyl moieties of captopril are essential for specific activity towards ACE.
-
Among non-synonymous SNPs, which cause an amino acid change in the protein product, the selection of disease-causing SNPs has been of great interest. We compare evolutionary information (SIFT score) with structural information (binding pocket) and show that combining the two provides an advantage in separating disease-causing SNPs from normal SNPs. To set up the procedure, we apply a machine learning method to a test data set obtained from laboratory experiments.
-
We took a first step toward the study of asthma by analyzing an asthma-specific protein-protein interaction network. The network follows a power-law degree distribution, and its hub nodes and skeleton agree with prior knowledge about the asthma pathway. This study provides a systematic approach for analyzing the complex effects of genes and for representing the framework of their relations in a specific disease.
-
The goal of data mining is to extract new and useful knowledge from large-scale datasets. As the amount of available data grows explosively, it has become vitally important to develop faster data mining algorithms for various types of data. Recently, interest in developing data mining algorithms that operate on graphs has increased. In particular, mining frequent patterns from structured data such as graphs has attracted the attention of many research groups. A graph is a highly adaptable representation scheme that is used in many domains, including chemistry, bioinformatics and physics. For example, the chemical structure of a given substance can be modelled by an undirected labelled graph in which each node corresponds to an atom and each edge corresponds to a chemical bond between atoms. The Internet can also be modelled as a directed graph in which each node corresponds to a web site and each edge corresponds to a hypertext link between web sites. Notably, in bioinformatics, various kinds of newly discovered data such as gene regulation networks or protein interaction networks can be modelled as graphs. There have been a number of attempts to extract useful knowledge from such graph-structured data. One of the most powerful analysis tools for graph-structured data is frequent subgraph analysis. Recurring patterns in graph data can provide unique insights into that data. However, finding recurring subgraphs is computationally very expensive. At the core of the problem are two computationally challenging subproblems: 1) subgraph isomorphism and 2) enumeration of subgraphs. Problems related to the former are the subgraph isomorphism problem (does graph A contain graph B?) and the graph isomorphism problem (are two graphs A and B the same or not?). Even these simplified versions of the subgraph mining problem are known to be NP-complete or isomorphism-complete, and no polynomial-time algorithm is known so far. The latter is also a difficult problem. Without any constraint, we would have to generate all 2$^n$ subgraphs, where n is the number of vertices of the input graph. In order to find frequent subgraphs in a larger graph database, it is essential to impose appropriate constraints on the subgraphs to be found. Most current approaches focus on the frequency of a subgraph: the higher the frequency of a graph, the more attention should be given to that graph. Recently, several algorithms that use level-by-level approaches to find frequent subgraphs have been developed. Some recently emerging applications suggest that other constraints, such as connectivity, could also be useful in mining subgraphs: more strongly connected parts of a graph are more informative. If we restrict the set of subgraphs to be mined to the more strongly connected parts, the computational complexity can be decreased significantly. In this paper, we present an efficient algorithm for mining frequent subgraphs that are more strongly connected. An experimental study shows that the algorithm scales to larger graphs with more than ten thousand vertices.
-
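The 2$^n$ growth mentioned in the frequent-subgraph abstract above, and the effect of a connectivity constraint, can be seen in a minimal sketch. It enumerates all vertex subsets of a toy graph and then keeps only those that induce a connected subgraph. Plain connectivity is used here as a simplified stand-in for the "more strongly connected" constraint, whose exact definition the abstract does not give; the function names and the example graph are assumptions.

```python
from itertools import combinations


def vertex_subsets(vertices):
    """Yield every subset of the vertex list (2**n subsets for n vertices)."""
    for k in range(len(vertices) + 1):
        for subset in combinations(vertices, k):
            yield subset


def is_connected(subset, edges):
    """Return True if the subgraph induced by `subset` is non-empty and connected."""
    if not subset:
        return False
    nodes = set(subset)
    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adjacency[u].add(v)
            adjacency[v].add(u)
    # Depth-first search from an arbitrary vertex of the subset.
    seen, stack = set(), [subset[0]]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(adjacency[v] - seen)
    return seen == nodes


if __name__ == "__main__":
    vertices = ["a", "b", "c", "d"]
    edges = [("a", "b"), ("b", "c"), ("c", "d")]  # a simple path graph
    all_subsets = list(vertex_subsets(vertices))
    connected = [s for s in all_subsets if is_connected(s, edges)]
    print(len(all_subsets), len(connected))
```

Even on this four-vertex path the constraint removes a share of the candidates; on sparse graphs with thousands of vertices the reduction is what makes mining feasible at all.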
We introduce a new approach to optimize the parameters in biological kinetic models by quantifier elimination (QE), in combination with numerical simulation methods. The optimization method was applied to a model for the inhibition kinetics of HIV proteinase with ten parameters and nine variables, and attained the goodness of fit to 300 points of observed data with the same magnitude as that obtained by the previous optimization methods, remarkably by using only one or two points of data. Furthermore, the utilization of QE demonstrated the feasibility of the present method for elucidating the behavior of the parameters in the analyzed model. The present symbolic-numeric method is therefore a powerful approach to reveal the fundamental mechanisms of kinetic models, in addition to being a computational engine.
-
The cell cycle is regulated cooperatively by several genes. The dynamic regulatory mechanism of the protein interaction network of the cell cycle is presented, taking budding yeast as a sample system. Based on the mathematical model developed by Chen et al. (MBC, 11, 369), we first investigate the dynamic role of the feedback loops. Second, using a bifurcation diagram, we illustrate a dynamic analysis of cell cycle regulation. The bifurcation diagram is a kind of ‘dynamic road map’ with stable and unstable solutions. On the map, a stable solution denotes a ‘road’ that attracts the state, and an unstable solution a ‘repelling road’. The ‘START’ transition, the initiation of the cell cycle, occurs at the point where the dynamic road changes from a fixed point to an oscillatory solution. The ‘FINISH’ transition, the completion of a cell cycle, is the return to the initial state. Bifurcation analysis of mutants can be used to uncover the roles of proteins in the cell cycle regulation network.
-
Predicting RNA secondary structure as accurately as possible is very important in the functional analysis of RNA molecules. However, different prediction methods and related parameters, including the terminal GU pair of helices, the minimum length of helices, and the free energy system, often give different prediction results for the same RNA sequence. Which structure, then, should be preferred over the others; that is, which combination of methods and related parameters is optimal? To investigate these questions, first, three prediction methods, namely random stacking of helical regions (RS), helical regions distribution (HD), and Zuker's minimum free energy algorithm (ZMFE), were compared using 1139 tRNA sequences from the Rfam database as samples, with different combinations of parameters, and the optimal parameters were derived. Second, Zuker's dynamic programming method for the prediction of RNA secondary structure was revised using these optimal parameters, and the related software BJRNAFold was developed. Third, the effects of short-range interaction were studied. The results indicate that prediction accuracy improves considerably if a proper short-range factor is introduced. However, the optimal short-range factor is difficult to determine, so a user-adjustable parameter for the short-range factor was introduced in the BJRNAFold software.
-
We have studied unbound docking for 12 protein-protein complexes using conformational space annealing (CSA) combined with statistical pair potentials. CSA, a powerful global optimization tool, is used to search the conformational space represented by a translational vector and three Euler angles between two proteins. The energy function consists of three statistical pair-wise energy terms: one from the distance-scaled finite ideal-gas reference state (DFIRE) approach by Zhou, and the other two derived from residue-residue contacts. The residue-residue contact terms describe both attractive and repulsive interactions between two residues in contact. The performance of CSA docking is compared with that of ZDOCK, a well-established protein-protein docking method. The results show that the application of CSA to protein-protein docking is quite successful, indicating that CSA combined with a good scoring function is a promising method for the study of protein-protein interaction.
-
Molecular dynamics on the nanosecond timescale can provide significant results for the structural analysis of oligopeptides. Most surprisingly, the Chou-Fasman parameters, which have no direct relationship with nanosecond molecular dynamics, were indicative of the results to a certain extent. A novel parameter termed X%-stickiness, another measure of the compactness of a molecule, is introduced here for the first time and used effectively.
-
Comparison of Protein Internal Motion by Inter-helical Motional Correlations and Hydrogen Bond Ratio
Internal motion of proteins has been described in many papers using C$_{\alpha}$ correlation coefficients to find motional correlations and functional characteristics. To describe secondary structural motion and stability in proteins, we have performed molecular dynamics (MD) simulations of the FADD Death Domain and the FADD Death Effector Domain, which have similar structures but different functional characteristics. After 10 ns MD simulations, the inter-helical motional correlations and the hydrogen bond ratios were compared between the two domains. From these data we could clearly compare their internal motions and explain the differences in their experimental thermodynamic melting behaviors at the molecular level.
-
Estimating the reliability of protein-protein interaction data sets obtained by high-throughput technologies such as yeast two-hybrid assays and mass spectrometry is of great importance. We develop a maximum likelihood estimation method that uses both protein localization and gene expression data to estimate the reliability of protein interaction data sets. By integrating protein localization data and gene expression data, we can obtain more accurate estimates of the reliability of various interaction data sets. We apply the method to protein physical interaction data sets and protein complex data sets. The reliability of the yeast two-hybrid interactions by Ito et al. (2001) is 27%, and that by Uetz et al. (2000) is 68%. The reliability of the protein complex data sets using tandem affinity purification-mass spectrometry (TAP) by Gavin et al. (2002) is 45%, and that using high-throughput mass spectrometric protein complex identification (HMS-PCI) by Ho et al. (2002) is 20%. The method is general and can be applied to analyze any protein interaction data set.
-
Hwang, Grace J.;Huang, Chuan-Ching;Chen, Ta Jen;Yue, Jack C.;Ivan Chang, Yuan-Chin;Adam, Bao-Ling 319
An integrated approach for prostate cancer detection using proteomic data is presented. Because proteomic data are high-dimensional, the discrete wavelet transform (DWT) is used in the first stage for data reduction as well as noise removal. After the DWT, the dimensionality is reduced from 43,556 to 1,599, so each sample of proteomic data can be represented by 1,599 wavelet coefficients. In the second stage, a voting method is used to select a common set of wavelet coefficients for all samples together. This produces a 987-dimension subspace of wavelet coefficients. In the third stage, the Autoassociator algorithm reduces the dimensionality from 987 to 400. Finally, an artificial neural network (ANN) is applied to the 400-dimension space for prostate cancer detection. The integrated approach is examined on 9 categories of 2-class experiments, and also on 3- and 4-class experiments. All experiments were run as 10 repetitions of ten-fold cross-validation (i.e., 10 partitions with 100 runs). For the 9 categories of 2-class experiments, the average testing accuracies are between 81% and 96%, and the average testing accuracies of the 3- and 4-way classifications are 85% and 84%, respectively. The integrated approach achieves promising results for the early detection and diagnosis of prostate cancer.
-
We are generally interested in the analysis, detection and prediction of structural motifs in proteins, in order to infer the compatibility of amino acid sequence to structure in proteins of known three-dimensional structure available in the Protein Data Bank. In this context, we are analyzing some of the well-characterized structural motifs in proteins. We have analyzed simple structural motifs, such as ${\beta}$-turns and ${\gamma}$-turns, by evaluating the statistically significant type-dependent amino acid positional preferences in enlarged representative protein datasets, and have revised the amino acid preferences. In doing so, we identified a number of ‘unexpected’ isolated ${\beta}$-turns with a proline amino acid residue at the (i+2) position. We extended our study to the identification of multiple turns, continuous turns and peptides that correspond to combinations of individual ${\beta}$- and ${\gamma}$-turns in proteins, and examined the hydrogen-bond interactions likely to stabilize these peptides. This led us to develop a database of structural motifs in proteins (DSMP) that primarily allows queries based on the various fields in the database for some well-characterized structural motifs, such as helices, ${\beta}$-strands, turns, ${\beta}$-hairpins, ${\beta}$-${\alpha}$-${\beta}$, ${\psi}$-loops, ${\beta}$-sheets and disulphide bridges. We have recently implemented this information for all entries in the current PDB in a relational database called ODSMP, using Oracle9i, that is easy to update and maintain, and have added a few additional structural motifs. We have also developed another relational database, called PSSARD, corresponding to amino acid sequences and their associated secondary structure for representative proteins in the PDB. This database allows flexible queries to be made on the compatibility of amino acid sequences in the PDB with a ‘user-defined’ super-secondary structure conformation and vice versa. Currently, we have extended this database to include nearly 23,000 protein crystal structures available in the PDB. Further, we have analyzed the ‘structural plasticity’ associated with the ${\beta}$-propeller structural motif. We have developed a method to automatically detect ${\beta}$-propellers from the PDB codes. We evaluated the accuracy and consistency of predicting ${\beta}$- and ${\gamma}$-turns in proteins using the residue-coupled model. I will discuss the results of our work and describe the databases and software applications that have been developed.
-
The discovery of post-translational modifications (PTMs) is an important problem in proteomics. In the past, PTMs were discovered by tandem mass spectrometry using a ‘bottom-up’ strategy. However, this strategy can fail to discover all PTMs. Recently, owing to improvements in proteomic technology, Taylor et al. proposed database software that discovers PTMs with a ‘top-down’ strategy using FTMS, which avoids the disadvantages of the ‘bottom-up’ approach. However, their algorithm runs in exponential time, requires a database of proteins, and needs prior knowledge about PTM sites. In this paper, a new algorithm is proposed that works without a protein database and identifies modifications in polynomial time. Moreover, no prior knowledge about PTM sites is needed.
-
Boolean network (BN) construction is one of the commonly used methods for building gene networks from time-series microarray data. However, BN has two major drawbacks. First, it requires heavy computing time. Second, the binary transformation of the microarray data may cause a loss of information. This paper proposes two methods that use linear regression to construct gene regulatory networks. The first is a regression-based BN variable selection method, which significantly reduces the computing time of BN construction. The second is a regression-based network method that can flexibly incorporate the interactions of genes using continuous gene expression data. We construct the network structure from simulated data to compare the computing times of Boolean networks and the proposed method. The regression-based network method is evaluated using microarray data on the cell cycle of Caulobacter crescentus.
-
Helper T (Th) cells regulate the immune response by producing various kinds of cytokines in response to antigen stimulation. The regulatory functions of Th cells are promoted by their differentiation into two distinct subsets, Th1 and Th2 cells. Th1 cells are involved in inducing the cellular immune response by activating cytotoxic T cells. Th2 cells trigger B cells to produce antibodies, protective proteins used by the immune system to identify and neutralize foreign substances. Because cellular and humoral immune responses have quite different roles in protecting the host from foreign substances, Th cell differentiation is a crucial event in the immune response. The destiny of a naive Th cell is mainly controlled by cytokines such as IL-4, IL-12, and IFN-${\gamma}$. To understand the mechanism of Th cell differentiation, many mathematical models have been proposed. One of the most difficult problems in mathematical modeling is finding the appropriate kinetic parameters needed to complete a model. However, it is relatively easy to obtain qualitative or linguistic knowledge of a model's dynamics. To incorporate such knowledge into a model, we propose a novel approach, fuzzy continuous Petri nets, which extends the traditional continuous Petri net by adding new types of places and transitions called fuzzy places and fuzzy transitions. This extension makes it possible to perform fuzzy inference, with fuzzy places acting as kinetic parameters and fuzzy transitions acting as fuzzy inference systems between input and output places.
-
In this paper we discuss the formulation and analysis of signaling pathways by Petri nets. In order to describe the molecular mechanisms and pathological characteristics of signaling pathways explicitly and formally, we propose a new modeling method that constructs signaling pathways on the basis of the formal representation of Petri nets. Our extended algorithm finds the basic enzymic components of signaling pathways more effectively than existing methods by employing T-invariants of Petri nets while taking into account the origins that lead to the occurrence of inhibition functions. An application of the proposed algorithm is given using the Interleukin-1 and Interleukin-6 signaling pathways as examples.
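For readers unfamiliar with T-invariants, a minimal sketch of the standard definition may help: a T-invariant is a non-negative, non-zero vector of transition firing counts x such that C·x = 0, where C is the incidence matrix of the net, so firing the transitions that many times returns the net to its starting marking. The code below only checks a candidate vector against this definition; it is not the extended algorithm of the abstract, and the toy binding/release net is an assumption made for illustration.

```python
def is_t_invariant(incidence, x):
    """Check whether firing-count vector x is a T-invariant of a Petri net.

    incidence[p][t] = tokens added to place p minus tokens removed from it
    when transition t fires once.
    """
    if any(count < 0 for count in x) or not any(x):
        return False  # must be non-negative and not the zero vector
    # Every place must end with the same token count it started with.
    return all(
        sum(row[t] * x[t] for t in range(len(x))) == 0
        for row in incidence
    )


if __name__ == "__main__":
    # Toy net (hypothetical): places = [free receptor, ligand-receptor complex],
    # transitions = [bind, release].
    incidence = [
        [-1, 1],   # free receptor: consumed by bind, produced by release
        [1, -1],   # complex: produced by bind, consumed by release
    ]
    print(is_t_invariant(incidence, [1, 1]))
    print(is_t_invariant(incidence, [1, 0]))
```

In pathway models, each minimal T-invariant corresponds to a cycle of reactions that can repeat indefinitely, which is why such vectors are natural candidates for basic functional components.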
-
Bayesian Variable Selection in the Proportional Hazard Model with Application to DNA Microarray Data
In this paper we consider the well-known semiparametric proportional hazards (PH) models for survival analysis. These models are usually used with few covariates and many observations (subjects). For a typical setting of gene expression data from DNA microarrays, however, we need to consider the case where the number of covariates p exceeds the number of samples n. Given a vector of response values, which are times to event (death or censored times), and p gene expressions (covariates), we address the issue of how to reduce the dimension by selecting the significant genes. This approach enables us to estimate the survival curve when n << p. In our approach, rather than fixing the number of selected genes, we assign a prior distribution to this number. The approach creates additional flexibility by allowing the imposition of constraints, such as bounding the dimension via a prior, which in effect works as a penalty. To implement our methodology, we use a Markov Chain Monte Carlo (MCMC) method. We demonstrate the methodology on diffuse large B-cell lymphoma (DLBCL) complementary DNA (cDNA) data.
-
Over the past few years, the complex and subtle roles of microRNA (miRNA) in gene regulation have been increasingly appreciated. Computational approaches have played an important role in identifying miRNAs from plants and animals, as well as in predicting their putative gene targets. We present a new approach that combines a comprehensive analysis of evolutionarily conserved element scores with a data compression technique to detect putative miRNA genes. We used the evolutionarily conserved elements [19] (see the methods and materials for more detail) to calculate scores base by base along the candidate pre-miRNA gene region by detecting common conserved patterns in the target sequence. We applied the data compression technique [20] to detect unknown miRNA genes. This zipping method provides, without loss of generality with respect to the nature of the character strings, a way to measure the similarity between the strings under consideration [20]. Our experience with the new computational method for miRNA gene prediction has been satisfactory, and we were able to find 28 putative miRNA genes.
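The idea of measuring string similarity with a compressor can be sketched with the normalized compression distance (NCD), computed here with Python's standard `zlib` module. NCD is a related, widely used compression-based measure that is swapped in for illustration; it is not claimed to be the exact zipping method of [20], and the sequences below are synthetic examples.

```python
import random
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: small for similar strings, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    repeat = "UGAGGUAGUAGGUUGUAUAGUU" * 20          # synthetic, highly regular sequence
    rng = random.Random(0)                          # fixed seed for reproducibility
    unrelated = "".join(rng.choice("ACGU") for _ in range(len(repeat)))
    print("self:", ncd(repeat, repeat))
    print("unrelated:", ncd(repeat, unrelated))
```

The compressor reuses patterns from the first string when encoding the second, so a candidate that shares structure with known sequences yields a smaller distance than an unrelated one.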
-
Tissue microarray is one of the high-throughput technologies of the post-genomic era. Using tissue microarrays, researchers are able to investigate large amounts of gene expression at the level of DNA, RNA, and protein. An important aspect of tissue microarrays is their ability to assess many biomarkers that have been used in clinical practice. To handle the categorical data of tissue microarrays, we applied a Bayesian network classifier algorithm. We found that the Bayesian network classifier algorithm could analyze tissue microarray data and that integrating prior knowledge about gastric cancer achieved better performance. The results showed that relevant integration of prior knowledge improves the accuracy of predicting survival status from immunohistochemical tissue microarray data on 18 tumor suppressor genes. In conclusion, the Bayesian network classifier appears appropriate for the analysis of tissue microarray data with clinical information.
-
The diagrammatic language for pathways is widely used for representing systems knowledge as a network of causal relations. Biologists infer and hypothesize with pathways to design experiments and verify models, and to identify potential drug targets. Although there have been many approaches to formalize pathways to simulate a system, reasoning with incomplete and high level knowledge has not been possible. We present a qualitative formalization of a pathway language with incomplete causal descriptions and its translation into propositional temporal logic to automate the reasoning process. Such automation accelerates the identification of drug targets in pathways.
-
We demonstrate workflows for biological data retrieval and analysis using the DDBJ Web Service; specifically, we introduce a workflow for the analysis of proteins or proteomics data sets. The workflow automatically extracts, from all the genes of the human genome in Ensembl (http://www.ensembl.org/), the genes whose protein structure and function are known, based on cross-references among Ensembl, Swiss-Prot (http://www.ebi.ac.uk/swissprot) and PDB (Protein Data Bank; http://www.wwpdb.org/). The workflow discovered ‘hidden’ linkages among databases. We will be able to integrate distributed and heterogeneous data systems into workflows if they are provided according to standards for Web services.
-
One of the most important issues in the post-genome era is identifying the functions of genes and understanding the interactions among them. Such interactions form complex biochemical pathways, which are very useful for understanding the organism as a system. We present an integrated biochemical pathway database system with a set of software tools for reconstruction, visualization, and simulation of the pathways from the database. The novel features of the presented system include: (a) automatic integration of heterogeneous biochemical pathway databases, (b) gene ontology for high database quality in integration and query, (c) various biochemical simulations on the pathway database, (d) dynamic pathway reconstruction for a gene list or sequence data, (e) graphical tools that enable users to view the reconstructed pathways in a dynamic form, and (f) importing/exporting SBML documents, a data exchange standard for systems biology.
-
Protein structural and functional conservation needs a common language for data definition. With the common language provided by Protein Ontology, the high level of sequence and functional conservation can be extended to all organisms, with the likelihood that proteins that carry out core biological processes will again be probable orthologues. The structural and functional conservation in these proteins presents both opportunities and challenges. The main opportunity lies in the possibility of automated transfer of protein data annotations from experimentally tractable model organisms to less tractable organisms based on protein sequence similarity. Such information can be used to improve human health or agriculture. The challenge lies in using a common language to transfer protein data annotations among different species of organisms. The first step in meeting this challenge is to produce a structured, precisely defined common vocabulary using Protein Ontology. The Protein Ontology described in this paper covers the sequence, structure and biological roles of protein complexes in any organism.
-
Genetic association case-control studies using DNA pools are efficient ways of detecting association between a marker allele and disease status. DNA pooling is an efficient screening method for locating susceptibility genes associated with the disease. However, DNA pooling is efficient only when allele frequency estimation is done precisely and accurately. Through the evaluation of empirical type I errors and empirical powers by simulation, we will evaluate the methods that correct for preferential amplification of nucleotides when estimating the allele frequency of single-nucleotide polymorphisms.
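One commonly used correction for preferential amplification can be sketched as follows. It assumes the familiar k-correction: the mean ratio of the two allele signals in heterozygous individuals, k, rescales the pooled signals so that the frequency of allele A is estimated as A / (A + k·B). The abstract does not say which correction methods are evaluated, so this is an illustrative example only; the function names and signal values are hypothetical.

```python
def amplification_ratio(heterozygote_signals):
    """Mean ratio A/B of allele signals measured in heterozygous individuals.

    A true heterozygote carries both alleles 1:1, so any deviation of this
    ratio from 1 reflects unequal (preferential) amplification.
    """
    ratios = [a / b for a, b in heterozygote_signals]
    return sum(ratios) / len(ratios)


def corrected_allele_frequency(pool_a, pool_b, k):
    """Estimate the frequency of allele A in a DNA pool, corrected by k."""
    return pool_a / (pool_a + k * pool_b)


if __name__ == "__main__":
    # Hypothetical peak heights (allele A, allele B) from two heterozygotes.
    k = amplification_ratio([(150.0, 100.0), (300.0, 200.0)])
    uncorrected = 60.0 / (60.0 + 40.0)
    corrected = corrected_allele_frequency(60.0, 40.0, k)
    print(k, uncorrected, corrected)
```

In this made-up example allele A amplifies more strongly than allele B, so the raw pooled signals overstate its frequency and the correction pulls the estimate back down.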
-
Park, Jun-Hyung;Park, Hee-Kyung;Song, Eunsil;Jang, Hyun-Jung;Kang, Byeong-Chul;Lee, Seung-Won;Kim, Hyun-Jin;Kim, Cheol-Min 399
MIPROBE is a web-based tool for the design of universal, genus-specific, and species-specific primers and probes. The main functions of MIPROBE are collection of target gene sequences, construction of consensus sequences, collection of candidate primers and probes, and evaluation of candidates by BLAST. Biologists with limited computer skills can easily use MIPROBE to design universal, genus-specific, and species-specific primers and probes on a large scale. The software is available at http://www.miprobe.com, where detailed descriptions of how to use the program can also be found.
-
In this paper, we present a tool to calculate the distribution of amino acid contacts in proteins as well as in protein domains. The proteins are grouped according to the classification by Yanay Ofran and Burkhard Rost[1]. In addition, each protein's distribution is compared with that of the proteins in the same group as well as with that of the entire collection of proteins across all groups. With these statistics, biologists can pick out proteins whose characteristics differ from the norm.
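The abstract does not say how a contact is defined or how distributions are compared. The sketch below is a minimal illustration under stated assumptions: one representative atom per residue, a distance cutoff of 8.0 angstroms, sequence neighbours excluded, and total variation distance as the comparison measure. All names, coordinates, and thresholds are hypothetical and are not taken from the tool.

```python
import math
from collections import Counter
from itertools import combinations


def contact_distribution(residues, cutoff=8.0, min_separation=2):
    """Relative frequency of residue-type pairs found in contact.

    residues: list of (residue_name, (x, y, z)) in sequence order.
    Two residues count as a contact when they are at least
    min_separation apart in sequence and within cutoff in space.
    """
    counts = Counter()
    for (i, (name_i, xyz_i)), (j, (name_j, xyz_j)) in combinations(
        enumerate(residues), 2
    ):
        if j - i < min_separation:
            continue  # skip sequence neighbours
        if math.dist(xyz_i, xyz_j) <= cutoff:
            counts[tuple(sorted((name_i, name_j)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def deviation(p, q):
    """Total variation distance between two contact distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)


# Hypothetical five-residue fragment with made-up coordinates.
protein = [
    ("ALA", (0.0, 0.0, 0.0)),
    ("GLY", (3.8, 0.0, 0.0)),
    ("LEU", (7.6, 0.0, 0.0)),
    ("ALA", (7.6, 3.8, 0.0)),
    ("GLY", (30.0, 0.0, 0.0)),
]
dist = contact_distribution(protein)
print(dist)
```

Calling deviation on a protein's distribution and on the pooled distribution of its group gives a single number that could be used to flag proteins far from the group norm.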
-
Yang, Jin-Ok;Hahn, Yoon-Soo;Kim, Nam-Soon;Yu, Ung-Sik;Woo, Hyun-Goo;Chu, In-Sun;Kim, Yong-Sung;Yoo, Hyang-Sook;Kim, Sang-Soo 407
The KUGI (Korean UniGene Information) database contains annotation information for cDNA sequences obtained from samples of diseases prevalent in Koreans. A total of about 157,000 high-throughput 5'-EST sequences, collected from cDNA libraries of stomach, liver, and some cancer tissues or established cell lines from Korean patients, were clustered into about 35,000 contigs. From each cluster, a representative clone having the longest high-quality sequence or the start codon was selected. We stored the sequences of the representative clones and the clustered contigs in the KUGI database, together with annotation obtained by running BLAST against the RefSeq, human mRNA, and UniGene databases from NCBI. We provide a web-based search engine for the KUGI database with two types of user interface: attribute-based search and sequence similarity search. Attribute-based search uses DBMS technology, while similarity search uses BLAST, which supports various similarity search options. The search system allows not only multiple queries but also various query types. The search results include: 1) information on clones and libraries, 2) accession keys, genome location, gene ontology, and pathway links to public databases, 3) links to external programs, and 4) sequence information for contigs and the 5'-ends of clones. We believe that the KUGI database and search system can provide very useful information for studies that aim to elucidate the causes of diseases prevalent in Koreans.
-
Mathematical modeling and simulation of biochemical reaction networks have gained a lot of attention recently, since they can provide valuable insights into the interrelationships and interactions of genes, proteins, and metabolites in a reaction network. A number of attempts have been made to model and store biochemical reaction networks without their dynamical properties, but the storage and efficient querying of dynamic (mathematical) models have not yet been studied extensively. In this paper, we present a novel nested relational data schema to store a pathway with its dynamic properties. We then show how to map between this dynamic pathway schema and the corresponding static pathway representation.
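The abstract does not give the schema itself. The sketch below only illustrates the general idea of nesting dynamic properties inside a pathway record and projecting them away to obtain a static view. Every class, field, and value here is a hypothetical choice made for illustration, not the schema proposed in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KineticLaw:
    """Dynamic part of a reaction: a rate formula and its parameters."""
    formula: str
    parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class Reaction:
    """One reaction; the kinetic law is a nested, optional attribute."""
    name: str
    substrates: List[str]
    products: List[str]
    kinetics: Optional[KineticLaw] = None


@dataclass
class Pathway:
    """A pathway holds a nested collection of reactions."""
    name: str
    reactions: List[Reaction] = field(default_factory=list)


def to_static(pathway: Pathway) -> Pathway:
    """Map a dynamic pathway to a static one by dropping kinetic laws."""
    return Pathway(
        pathway.name,
        [Reaction(r.name, list(r.substrates), list(r.products))
         for r in pathway.reactions],
    )


# Hypothetical example; the parameter values are placeholders.
dynamic = Pathway("glycolysis (fragment)", [
    Reaction(
        "hexokinase",
        ["glucose", "ATP"],
        ["glucose-6-phosphate", "ADP"],
        KineticLaw("Vmax * glucose / (Km + glucose)",
                   {"Vmax": 1.0, "Km": 0.1}),
    ),
])
static = to_static(dynamic)
print(static.reactions[0])
```

Keeping the kinetic law as a nested attribute means the static representation is a simple projection of the dynamic one, which is one way to read the mapping mentioned in the abstract.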
-
In this paper, we investigate the nonlinear dynamic behavior of TPP (thiamine pyrophosphate) riboswitches in E. coli (Escherichia coli). TPP riboswitches are highly conserved RNA regulatory elements embedded within the 5' untranslated region of three TPP biosynthesis operons. The three operons thiCEFSGH, thiMD, and thiBPQ are involved in the biosynthesis, salvage, and transport of TPP, respectively. TPP riboswitches modulate the expression of these operons in response to changing TPP concentration, without involving protein cofactors. Interestingly, the expression of thiMD is regulated at the translational level, while that of thiCEFSGH is regulated at both the transcriptional and translational levels. We develop a mathematical model of the TPP riboswitch regulatory system of thiCEFSGH and thiMD in order to simulate time-course experiments of TPP biosynthesis in E. coli. The simulation results are validated against three sets of reported experimental data in order to gain insight into the nature of the steady states and the stability of TPP riboswitches, and to explain the biological significance of regulating at the level of transcription, translation, or both. Our findings suggest that in the TPP biosynthesis pathway of E. coli, the biological effect of down-regulating the thiCEFSGH operon at the translational level by the TPP riboswitch is less prominent than that at the transcriptional level.
-
Kaminuma, Tsuguchika;Takai-Igarashi, Takako;Yukawa, Masumi;Tanaka, Yoshitomo;Tanaka, Hiroshi 427
Studies on cellular pathways and networks are now among the most actively researched topics in all fields of biomedicine, ranging from developmental biology to etiology. Many databases have been developed and quantitative simulation models have been proposed. One of the eventual goals of pathway/network studies is to integrate different types of pathway/network models and databases to simulate overall cellular responses. A bottleneck to this goal is modeling gene expression, since the mechanism of this process is not yet fully understood. We are developing a small-scale computer program called CiRMU (Cis-Regulatory Machinery Unit model) for describing, viewing, analyzing, and modeling the process of gene expression. A prototype system is being designed and implemented for analyzing the functions of nuclear receptors.
-
Living cells are sustained not by individual activities but by the coordinated, collective efforts of different biological functional modules. While recent research has focused largely on finding individual functional modules, this paper explores the connections between different cellular functions through cross-function domain interaction maps. Exploring such a map can help in understanding the underlying inter-function communication mechanisms. To construct a cross-function domain interaction map from existing genome-wide protein-protein interaction datasets, we propose a two-step procedure. First, we infer conserved domain-domain interactions from genome-wide protein-protein interactions of yeast, worm, and fly. We then build a cross-function domain interaction map that shows how different functions are connected through various conserved domain interactions. The domain interaction maps reveal that conserved domain-domain interactions can be found in most detected cross-functional relationships and that a few domains play pivotal roles in these relationships. Another important finding of the paper is that conserved domains correspond to highly connected protein hubs that connect different functional modules.
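The two-step procedure can be sketched as follows. The abstract does not specify the inference method, so this sketch makes simple assumptions: a domain pair is a candidate whenever the two domains occur in interacting proteins, a pair is "conserved" when it is a candidate in every species supplied, and two functions are linked when a conserved pair joins domains annotated with different functions. All identifiers and annotations below are hypothetical.

```python
from itertools import product


def domain_pairs(ppis, domains_of):
    """Candidate domain pairs implied by one species' protein interactions."""
    pairs = set()
    for p, q in ppis:
        for a, b in product(domains_of.get(p, ()), domains_of.get(q, ())):
            pairs.add(frozenset((a, b)))
    return pairs


def conserved_domain_pairs(per_species_pairs):
    """Step 1: keep only the pairs seen in every species (assumption)."""
    return set.intersection(*per_species_pairs)


def cross_function_map(conserved, function_of):
    """Step 2: link different functions through conserved domain pairs."""
    links = {}
    for pair in conserved:
        doms = sorted(pair)
        a, b = doms[0], doms[-1]  # also handles a domain paired with itself
        for fa, fb in product(function_of.get(a, ()), function_of.get(b, ())):
            if fa != fb:
                links.setdefault(frozenset((fa, fb)), set()).add(pair)
    return links


# Hypothetical toy data for three species.
yeast = domain_pairs([("Y1", "Y2")], {"Y1": {"SH3"}, "Y2": {"Pkinase"}})
worm = domain_pairs([("W1", "W2")], {"W1": {"SH3"}, "W2": {"Pkinase", "PDZ"}})
fly = domain_pairs([("F1", "F2")], {"F1": {"Pkinase"}, "F2": {"SH3"}})

conserved = conserved_domain_pairs([yeast, worm, fly])
links = cross_function_map(
    conserved, {"SH3": {"signaling"}, "Pkinase": {"cell cycle"}}
)
print(conserved, links)
```

In this toy example only the SH3 and Pkinase pair survives the intersection, because the PDZ pairing appears in worm alone. A real analysis would likely need a statistical filter on candidate pairs, since co-occurrence in interacting proteins does not by itself show that two domains interact.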