• Title/Abstract/Keywords: genomic data

Search results: 626 items (processing time: 0.03 seconds)

Comparison of Distributed and Parallel NGS Data Analysis Methods based on Cloud Computing

  • Kang, Hyungil;Kim, Sangsoo
    • International Journal of Contents / Vol. 14, No. 1 / pp. 34-38 / 2018
  • With the rapid growth of genomic data, new requirements have emerged that are difficult to handle with big data storage and analysis techniques. Regardless of the size of an organization performing genomic data analysis, it is becoming increasingly difficult for an institution to build a computing environment for storing and analyzing genomic data. Recently, cloud computing has emerged as a computing environment that meets these new requirements. In this paper, we analyze and compare existing distributed and parallel NGS (Next Generation Sequencing) analysis methods based on cloud computing environments, as a basis for future research.

Computational Detection of Prokaryotic Core Promoters in Genomic Sequences

  • Kim Ki-Bong;Sim Jeong Seop
    • Journal of Microbiology / Vol. 43, No. 5 / pp. 411-416 / 2005
  • The high-throughput sequencing of microbial genomes has resulted in the relatively rapid accumulation of an enormous amount of genomic sequence data. In this context, the problem of detecting promoters in genomic DNA sequences via computational methods has attracted considerable research attention in recent years. This paper addresses the development of a predictive model, known as the dependence decomposition weight matrix model (DDWMM), which was designed to detect the core promoter region, including the -10 region and the transcription start sites (TSSs), in prokaryotic genomic DNA sequences. This is an issue of some importance with regard to genome annotation efforts. Our predictive model captures the most significant dependencies between positions (allowing for non-adjacent as well as adjacent dependencies) via the maximal dependence decomposition (MDD) procedure, which iteratively decomposes data sets into subsets based on the significant dependence between positions in the promoter region to be modeled. Such dependencies may be intimately related to biological and structural concerns, since promoter elements are present in a variety of combinations, which are separated by various distances. In this respect, the DDWMM may prove to be appropriate for the detection of core promoter regions and TSSs in long microbial genomic contigs. To demonstrate the effectiveness of our predictive model, we performed 10-fold cross-validation experiments on 607 experimentally verified promoter sequences, which showed good performance in terms of sensitivity.
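
As a rough illustration of the weight-matrix idea that the DDWMM builds on, the sketch below scores sequence windows with a plain log-odds position weight matrix. It is not the authors' DDWMM: the MDD step, which splits the training set by position dependencies and fits one matrix per subset, is omitted. The function names, the pseudocount, the uniform background, and the toy hexamers are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_log_odds_matrix(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a position weight matrix of log2-odds scores from equal-length sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases keep a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in aligned_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return matrix

def best_window(sequence, matrix):
    """Return (start, score) of the highest-scoring window, or None if too short."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Toy -10-like hexamers, invented for illustration only (A/C/G/T input assumed).
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_log_odds_matrix(toy_sites)
start, score = best_window("GGCCGTATAATGCC", pwm)
```

In a dependence-aware model, the single matrix here would be replaced by several matrices, one per subset produced by the decomposition.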

Application of deep learning with bivariate models for genomic prediction of sow lifetime productivity-related traits

  • Joon-Ki Hong;Yong-Min Kim;Eun-Seok Cho;Jae-Bong Lee;Young-Sin Kim;Hee-Bok Park
    • Animal Bioscience / Vol. 37, No. 4 / pp. 622-630 / 2024
  • Objective: Pig breeders cannot obtain phenotypic information at the time of selection for sow lifetime productivity (SLP). They would benefit from obtaining genetic information on candidate sows. Genomic data interpreted using deep learning (DL) techniques could contribute to the genetic improvement of SLP to maximize farm profitability because DL models capture nonlinear genetic effects such as dominance and epistasis more efficiently than conventional genomic prediction methods based on linear models. This study aimed to investigate the usefulness of DL for the genomic prediction of two SLP-related traits: lifetime number of litters (LNL) and lifetime pig production (LPP). Methods: Two bivariate DL models, convolutional neural network (CNN) and local convolutional neural network (LCNN), were compared with conventional bivariate linear models (i.e., genomic best linear unbiased prediction, Bayesian ridge regression, Bayes A, and Bayes B). Phenotype and pedigree data were collected from 40,011 sows that had husbandry records. Among these, 3,652 pigs were genotyped using the PorcineSNP60K BeadChip. Results: The best predictive correlation for LNL was obtained with CNN (0.28), followed by LCNN (0.26) and conventional linear models (approximately 0.21). For LPP, the best predictive correlation was also obtained with CNN (0.29), followed by LCNN (0.27) and conventional linear models (approximately 0.25). A similar trend was observed with the mean squared error of prediction for the SLP traits. Conclusion: This study provides an example of a CNN that can outperform linear model-based genomic prediction approaches when nonlinear interaction components are important, as LNL and LPP exhibited strong epistatic interaction components. Additionally, our results suggest that applying bivariate DL models could also improve prediction accuracy by utilizing the genetic correlation between LNL and LPP.

Detection of hydin Gene Duplication in Personal Genome Sequence Data

  • Kim, Jong-Il;Ju, Young-Seok;Kim, Shee-Hyun;Hong, Dong-Wan;Seo, Jeong-Sun
    • Genomics & Informatics / Vol. 7, No. 3 / pp. 159-162 / 2009
  • Human personal genome sequencing can be done with high efficiency by aligning a huge number of short reads derived from various next generation sequencing (NGS) technologies to the reference genome sequence. One of the major obstacles is the incompleteness of the human reference genome. We analyzed the effect of hidden gene duplication on NGS data using the known example of the hydin gene. Hydin2, a duplicated copy of hydin on chromosome 16q22, has recently been found to be localized to chromosome 1q21, and is not included in the current version of the standard human genome reference. We found that none of the eight personal genome data sets published so far contains hydin2, and that there is a large number of nsSNPs in hydin. The heterozygosity of those nsSNPs was significantly higher than expected. The sequence coverage depth in the hydin gene was about twofold the average depth. We believe that these unique findings for hydin can be used as useful indicators to discover new hidden multiplications in the human genome.
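
The two indicators described above (coverage depth near twice the genome average, and an excess of heterozygous nsSNP calls) can be combined into a simple screening rule. The following is a minimal sketch under stated assumptions: the function name, the input layout, and both thresholds are hypothetical and are not taken from the paper, which reports a statistical comparison rather than fixed cut-offs.

```python
def flag_possible_duplication(region_depths, genome_mean_depth, het_calls,
                              min_depth_ratio=1.5, min_het_fraction=0.8):
    """Flag a region whose depth and heterozygosity both look inflated.

    region_depths: per-base (or per-window) read depths inside the gene.
    genome_mean_depth: average depth across the genome.
    het_calls: one boolean per SNP in the region, True if called heterozygous.
    The two thresholds are illustrative assumptions, not published values.
    """
    depth_ratio = (sum(region_depths) / len(region_depths)) / genome_mean_depth
    het_fraction = sum(1 for is_het in het_calls if is_het) / len(het_calls)
    flagged = depth_ratio >= min_depth_ratio and het_fraction >= min_het_fraction
    return {"depth_ratio": depth_ratio, "het_fraction": het_fraction, "flagged": flagged}

# Toy numbers: depth about twice the genome mean, 9 of 10 SNPs heterozygous.
result = flag_possible_duplication([60, 62, 58], 30.0, [True] * 9 + [False])
```

Reads from an unrepresented paralog pile up on the single reference copy, so paralogous differences appear as heterozygous SNPs and depth roughly doubles; this is why both signals are checked together.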

Multiple Group Testing Procedures for Analysis of High-Dimensional Genomic Data

  • Ko, Hyoseok;Kim, Kipoong;Sun, Hokeun
    • Genomics & Informatics / Vol. 14, No. 4 / pp. 187-195 / 2016
  • In genetic association studies with high-dimensional genomic data, multiple group testing procedures are often required in order to identify disease/trait-related genes or genetic regions, where multiple genetic sites or variants are located within the same gene or genetic region. However, statistical testing procedures based on an individual test suffer from multiple testing issues such as the control of family-wise error rate and dependent tests. Moreover, detecting only a few genes associated with a phenotype outcome among tens of thousands of genes is of main interest in genetic association studies. For this reason, regularization procedures, where a phenotype outcome is regressed on all genomic markers and regression coefficients are then estimated based on a penalized likelihood, have been considered a good alternative approach to the analysis of high-dimensional genomic data. However, the selection performance of regularization procedures has rarely been compared with that of statistical group testing procedures. In this article, we performed extensive simulation studies in which commonly used group testing procedures such as principal component analysis, Hotelling's $T^2$ test, and the permutation test are compared with the group lasso (least absolute shrinkage and selection operator) in terms of true positive selection. We also applied all methods considered in the simulation studies to identify genes associated with ovarian cancer from over 20,000 genetic sites generated from the Illumina Infinium HumanMethylation27K Beadchip. We found a large discrepancy in the selected genes between the multiple group testing procedures and the group lasso.
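
To make the idea of a gene-level permutation test concrete, here is a minimal sketch that tests all sites of one gene jointly. The abstract does not state which test statistic was permuted, so the statistic used here (sum of squared differences in group means across sites) is an assumption, as are the function names and the toy methylation-like values.

```python
import random

def group_statistic(cases, controls):
    """Sum over sites of the squared difference between group means."""
    n_sites = len(cases[0])
    total = 0.0
    for j in range(n_sites):
        case_mean = sum(row[j] for row in cases) / len(cases)
        control_mean = sum(row[j] for row in controls) / len(controls)
        total += (case_mean - control_mean) ** 2
    return total

def permutation_p_value(cases, controls, n_permutations=1000, seed=0):
    """Estimate a p-value by shuffling sample labels within one gene."""
    rng = random.Random(seed)
    observed = group_statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if group_statistic(pooled[:n_cases], pooled[n_cases:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Toy data: 4 cases and 4 controls measured at 2 sites of one hypothetical gene.
toy_cases = [[0.90, 0.80], [0.85, 0.90], [0.95, 0.85], [0.90, 0.95]]
toy_controls = [[0.10, 0.20], [0.15, 0.10], [0.20, 0.15], [0.10, 0.10]]
p_value = permutation_p_value(toy_cases, toy_controls)
```

Because whole samples are shuffled rather than individual sites, the correlation among sites within the gene is preserved under the null, which is the main appeal of permutation for dependent tests.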

Comparison of the MGISEQ-2000 and Illumina HiSeq 4000 sequencing platforms for RNA sequencing

  • Jeon, Sol A;Park, Jong Lyul;Kim, Jong-Hwan;Kim, Jeong Hwan;Kim, Yong Sung;Kim, Jin Cheon;Kim, Seon-Young
    • Genomics & Informatics / Vol. 17, No. 3 / pp. 32.1-32.6 / 2019
  • Currently, Illumina sequencers are the globally leading sequencing platform in the next-generation sequencing market. Recently, MGI Tech launched a series of new sequencers, including the MGISEQ-2000, which promise to deliver high-quality sequencing data faster and at lower prices than Illumina's sequencers. In this study, we compared the performance of two major sequencers (MGISEQ-2000 and HiSeq 4000) to test whether the MGISEQ-2000 sequencer delivers high-quality sequence data as suggested. We performed RNA sequencing of four human colon cancer samples with the two platforms, and compared the sequencing quality and expression values. The data produced from the MGISEQ-2000 and HiSeq 4000 showed high concordance, with Pearson correlation coefficients ranging from 0.98 to 0.99. Various quality control (QC) analyses showed that the MGISEQ-2000 data fulfilled the required QC measures. Our study suggests that the performance of the MGISEQ-2000 is comparable to that of the HiSeq 4000 and that the MGISEQ-2000 can be a useful platform for sequencing.
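
The platform concordance reported above is a Pearson correlation between per-gene expression values from the two sequencers. A minimal sketch of that computation follows; the log2 transform with a pseudocount, the function names, and the toy expression values are assumptions for illustration, since the abstract does not say how expression values were scaled.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def log2_expression(values, pseudocount=1.0):
    """Log-transform expression values; the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]

# Toy per-gene expression values for one sample on two platforms (invented).
platform_a = [10.0, 250.0, 0.0, 1200.0, 35.0]
platform_b = [12.0, 240.0, 1.0, 1150.0, 30.0]
r = pearson(log2_expression(platform_a), log2_expression(platform_b))
```

Correlating on a log scale keeps a handful of very highly expressed genes from dominating the coefficient.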

Generation and analysis of whole-genome sequencing data in human mammary epithelial cells

  • Jong-Lyul Park;Jae-Yoon Kim;Seon-Young Kim;Yong Sun Lee
    • Genomics & Informatics / Vol. 21, No. 1 / pp. 11.1-11.5 / 2023
  • Breast cancer is the most common cancer worldwide, and advanced breast cancer with metastases is largely incurable with currently available therapies. Therefore, it is essential to understand the molecular characteristics of breast carcinogenesis as it progresses. Here, we report a dataset of whole genomes from the human mammary epithelial cell system derived from a reduction mammoplasty specimen. This system comprises pre-stasis 184D cells, considered normal, and seven cell lines along a cancer progression series that are immortalized or have additionally acquired anchorage-independent growth. Our analysis of the whole-genome sequencing (WGS) data indicates that the seven cancer progression series cell lines carry between 8,393 and 39,564 somatic mutations (30,591 on average) relative to 184D cells. These WGS data and our mutation analysis will provide helpful information for identifying driver mutations and elucidating the molecular mechanisms of breast carcinogenesis.

Genetic evaluation of sheep for resistance to gastrointestinal nematodes and body size including genomic information

  • Torres, Tatiana Saraiva;Sena, Luciano Silva;dos Santos, Gleyson Vieira;Filho, Luiz Antonio Silva Figueiredo;Barbosa, Bruna Lima;Junior, Antonio de Sousa;Britto, Fabio Barros;Sarmento, Jose Lindenberg Rocha
    • Animal Bioscience / Vol. 34, No. 4 / pp. 516-524 / 2021
  • Objective: The genetic evaluation of Santa Inês sheep was performed for resistance to gastrointestinal nematode infection (RGNI) and body size using different relationship matrices to assess the efficiency of including genomic information in the analyses. Methods: There were 1,637 animals in the pedigree and 500, 980, and 980 records of RGNI, thoracic depth (TD), and rump height (RH), respectively. The genomic data consisted of 42,748 SNPs and 388 samples genotyped with the OvineSNP50 BeadChip. The (co)variance components were estimated in single- and multi-trait analyses using the numerator relationship matrix (A) and the hybrid matrix H, which blends A with the genomic relationship matrix (G). The BLUP and single-step genomic BLUP methods were used. The accuracies of estimated breeding values and Spearman rank correlation were also used to assess the feasibility of incorporating genomic information in the analyses. Results: The heritability estimates ranged from 0.11±0.07, for TD (in single-trait analysis using the A matrix), to 0.38±0.08, for RH (using the H matrix in multi-trait analysis). The estimates of genetic correlation ranged from -0.65±0.31 to 0.59±0.19 using A, and from -0.42±0.30 to 0.57±0.16 using H. The gains in accuracy of estimated breeding values ranged from 2.22% to 75.00% with the inclusion of genomic information in the analyses. Conclusion: The inclusion of genomic information will benefit direct selection for the traits in this study, especially RGNI and TD. More information is necessary to improve the understanding of the genetic relationship between resistance to nematode infection and body size in Santa Inês sheep. The genetic evaluation for the evaluated traits was more efficient when genomic information was included in the analyses.
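
For readers unfamiliar with the hybrid matrix, the form most commonly used in single-step genomic BLUP is shown below as a sketch. The abstract does not give the exact formulation or any scaling and weighting factors, so this is the textbook expression rather than a statement of what the authors computed. Here $A_{22}$ denotes the block of A for the genotyped animals, an assumed notation that does not appear in the abstract.

```latex
H^{-1} = A^{-1} +
\begin{bmatrix}
0 & 0 \\
0 & G^{-1} - A_{22}^{-1}
\end{bmatrix}
```

Only the block for genotyped animals is modified, which is how pedigree and genomic relationships are combined in one evaluation that covers genotyped and non-genotyped animals.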

Single-step genomic evaluation for growth traits in a Mexican Braunvieh cattle population

  • Jonathan Emanuel Valerio-Hernandez;Agustin Ruiz-Flores;Mohammad Ali Nilforooshan;Paulino Perez-Rodriguez
    • Animal Bioscience / Vol. 36, No. 7 / pp. 1003-1009 / 2023
  • Objective: The objective was to compare (pedigree-based) best linear unbiased prediction (BLUP), genomic BLUP (GBLUP), and single-step GBLUP (ssGBLUP) methods for genomic evaluation of growth traits in a Mexican Braunvieh cattle population. Methods: Birth (BW), weaning (WW), and yearling weight (YW) data of a Mexican Braunvieh cattle population were analyzed with BLUP, GBLUP, and ssGBLUP methods. These methods differ in the additive genetic relationship matrix included in the model and in the animals under evaluation. The predictive ability of the model was evaluated using random partitions of the data into training and testing sets, consistently predicting about 20% of genotyped animals on all occasions. For each partition, the Pearson correlation coefficient between phenotypes adjusted for fixed effects and non-genetic random effects and the estimated breeding values (EBV) was computed. Results: The random contemporary group (CG) effect explained about 50%, 45%, and 35% of the phenotypic variance in BW, WW, and YW, respectively. For the three methods, the CG effect explained the highest proportion of the phenotypic variances (except for YW-GBLUP). The heritability estimate obtained with GBLUP was the lowest for BW, while the highest heritability was obtained with BLUP. For WW, the highest heritability estimate was obtained with BLUP; the estimates obtained with GBLUP and ssGBLUP were similar. For YW, the heritability estimates obtained with GBLUP and BLUP were similar, and the lowest heritability was obtained with ssGBLUP. Pearson correlation coefficients between phenotypes adjusted for non-genetic effects and EBVs were the highest for BLUP, followed by ssGBLUP and GBLUP. Conclusion: The successful implementation of genetic evaluations that include genotyped and non-genotyped animals in our study indicates a promising method for use in genetic improvement programs of Braunvieh cattle. Our findings showed that simultaneous evaluation of genotyped and non-genotyped animals improved prediction accuracy for growth traits even with a limited number of genotyped animals.
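
The validation scheme described above (repeated random partitions that hold out about 20% of genotyped animals) can be sketched as follows. The number of partitions, the seed, the function name, and the animal identifiers are assumptions for illustration; the abstract states only the approximate test fraction.

```python
import random

def random_partitions(genotyped_ids, n_partitions=5, test_fraction=0.2, seed=0):
    """Yield (training_ids, testing_ids) pairs from repeated random splits."""
    rng = random.Random(seed)
    ids = list(genotyped_ids)
    # Hold out roughly test_fraction of the animals, but always at least one.
    n_test = max(1, round(len(ids) * test_fraction))
    for _ in range(n_partitions):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]

# Hypothetical identifiers for 50 genotyped animals.
animal_ids = [f"animal_{i:03d}" for i in range(50)]
partitions = list(random_partitions(animal_ids))
```

For each partition, the phenotypes of the test animals would be masked, breeding values predicted from the training set, and the correlation between adjusted phenotypes and EBVs computed on the held-out animals.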