• Title/Summary/Keyword: genomic data

Design of Distributed Cloud System for Managing large-scale Genomic Data

  • Seine Jang;Seok-Jae Moon
    • International Journal of Internet, Broadcasting and Communication / v.16 no.2 / pp.119-126 / 2024
  • The volume of genomic data is constantly increasing in various modern industries and research fields. This growth presents new challenges and opportunities in terms of the quantity and diversity of genetic data. In this paper, we propose a distributed cloud system for integrating and managing large-scale gene databases. By introducing a distributed data storage and processing system based on the Hadoop Distributed File System (HDFS), various formats and sizes of genomic data can be efficiently integrated. Furthermore, by leveraging Spark on YARN, efficient management of distributed cloud computing tasks and optimal resource allocation are achieved. This establishes a foundation for the rapid processing and analysis of large-scale genomic data. Additionally, by utilizing BigQuery ML, machine learning models are developed to support genetic search and prediction, enabling researchers to more effectively utilize data. It is expected that this will contribute to driving innovative advancements in genetic research and applications.

A maximum likelihood approach to infer demographic models

  • Chung, Yujin
    • Communications for Statistical Applications and Methods / v.27 no.3 / pp.385-395 / 2020
  • We present a new maximum likelihood approach to estimate demographic history using genomic data sampled from two populations. A demographic model such as an isolation-with-migration (IM) model explains the genetic divergence of two populations that split from their common ancestral population. The standard probability model for an IM model contains a latent variable called a genealogy, which represents gene-specific evolutionary paths and links the genetic data to the IM model. Under an IM model, a genealogy consists of two kinds of evolutionary paths of genetic data: vertical inheritance paths (coalescent events) through generations and horizontal paths (migration events) between populations. The computational complexity of IM model inference is one of the major limitations in analyzing genomic data. We propose a fast maximum likelihood approach to estimate IM models from genomic data. The first step analyzes genomic data and maximizes the likelihood of a coalescent tree that contains the vertical paths of a genealogy. The second step analyzes the estimated coalescent trees and finds the parameter values of an IM model that maximize the distribution of the coalescent trees after taking possible migration events into account. We evaluate the performance of the new method by analyzing simulated data and genomic data from two subspecies of common chimpanzees in Africa.
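
The two-step procedure described in this abstract can be written out schematically. The notation below is an assumption introduced for illustration and is not taken from the paper: $D_l$ is the data at locus $l$, $G$ a genealogy, $\lambda$ a coalescent tree (the vertical paths of a genealogy), $\Theta$ the IM model parameters, and $n$ the number of loci. The standard likelihood integrates over the latent genealogies, and the two steps replace that integral with one maximization per locus followed by one maximization over $\Theta$:

$$
L(\Theta) = \prod_{l=1}^{n} \int P(D_l \mid G)\, P(G \mid \Theta)\, dG
$$

$$
\hat{\lambda}_l = \arg\max_{\lambda} P(D_l \mid \lambda),
\qquad
\hat{\Theta} = \arg\max_{\Theta} \prod_{l=1}^{n} P(\hat{\lambda}_l \mid \Theta)
$$

Here $P(\lambda \mid \Theta)$ is understood as the distribution of a coalescent tree after summing over the migration events compatible with it, which is how the abstract describes the second step.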

Ethical Considerations in Genomic Cohort Study (유전체 코호트 연구의 윤리적 고려 사항)

  • Choi, Eun-Kyung;Kim, Ock-Joo
    • Journal of Preventive Medicine and Public Health / v.40 no.2 / pp.122-129 / 2007
  • During the last decade, genomic cohort studies have been developed in many countries by linking health data with genetic data from stored samples. Genomic cohort studies are expected to find key genetic components that contribute to common diseases, thereby promising great advances in genomic medicine. While many countries endeavor to build biobank systems, biobank-based genome research has raised important ethical concerns, including genetic privacy, confidentiality, discrimination, and informed consent. Informed consent for biobanks poses an important question: whether true informed consent is possible in population-based genomic cohort research, where the nature of future studies is unforeseeable when consent is obtained. Due to the sensitive character of genetic information, protecting privacy and maintaining confidentiality become important topics. To minimize ethical problems and achieve scientific goals to the fullest extent, each country strives to build population-based genomic cohort research projects by organizing public consultations, seeking public and expert consensus on research, and providing safeguards to protect privacy and confidentiality.

Statistical Issues in Genomic Cohort Studies (유전체 코호트 연구의 주요 통계학적 과제)

  • Park, So-Hee
    • Journal of Preventive Medicine and Public Health / v.40 no.2 / pp.108-113 / 2007
  • When conducting large-scale cohort studies, numerous statistical issues arise across study design, data collection, data analysis, and interpretation. In genomic cohort studies, these statistical problems become more complicated and need to be dealt with carefully. Rapid technical advances in genomic studies produce an enormous amount of data to be analyzed, and traditional statistical methods are no longer sufficient to handle these data. In this paper, we review several important statistical issues that occur frequently in large-scale genomic cohort studies, including measurement error and its relevant correction methods, cost-efficient design strategies for main cohort and validation studies, inflated Type I error, gene-gene and gene-environment interactions, and time-varying hazard ratios. It is very important to employ appropriate statistical methods in order to make the best use of valuable cohort data and produce valid and reliable study results.
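
One of the issues listed in this abstract, inflated Type I error, can be illustrated with a little arithmetic. The sketch below assumes the inflation comes from running many independent tests at a fixed significance level, which is a common situation in genomic studies but is not spelled out in the abstract. The function names are hypothetical, and the Bonferroni correction is shown only as one familiar remedy, not as the method the paper recommends.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise
    error rate at or below alpha (Bonferroni correction)."""
    return alpha / n_tests


if __name__ == "__main__":
    alpha = 0.05
    for n_tests in (1, 10, 100, 1000):
        uncorrected = familywise_error_rate(alpha, n_tests)
        per_test = bonferroni_threshold(alpha, n_tests)
        corrected = familywise_error_rate(per_test, n_tests)
        print(f"tests={n_tests:5d}  uncorrected FWER={uncorrected:.4f}  "
              f"Bonferroni alpha={per_test:.6f}  corrected FWER={corrected:.4f}")
```

With ten independent tests at the 0.05 level, the chance of at least one false positive is already about 40%, which is why per-test thresholds are tightened as the number of tested markers grows.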

Prediction of Genomic Relationship Matrices using Single Nucleotide Polymorphisms in Hanwoo (한우의 유전체 표지인자 활용 개체 혈연관계 추정)

  • Lee, Deuk-Hwan;Cho, Chung-Il;Kim, Nae-Soo
    • Journal of Animal Science and Technology / v.52 no.5 / pp.357-366 / 2010
  • The emergence of next-generation sequencing technologies has led to the application of new computational and statistical methodologies that incorporate genetic information from the entire genomes of the many individuals that make up a population. For example, using single-nucleotide polymorphisms (SNPs) obtained from whole genome amplification platforms such as the Illumina BovineSNP50 chip, many researchers are actively engaged in the genetic evaluation of cattle using whole-genome relationship analyses. In this study, we estimated the genomic relationship matrix (GRM) for a population of Hanwoo and compared it with the pedigree relationship matrix (PRM). This project is a preliminary study that will eventually include future work on genomic selection and prediction. The data were obtained from 187 blood samples from the progeny of 20 young bulls, collected after parentage testing from the Hanwoo improvement center, National Agriculture Cooperative Federation, and from 103 blood samples from the progeny of 12 proven bulls, collected from farms around the Kyong-buk area in South Korea. The data set was analyzed in two cases: in the first, missing genotypes were included; in the second, they were excluded. The effect of missing genotypes on the accuracy of genomic relationship estimation was investigated. Relationships were also estimated from genomic information chromosome by chromosome, and from whole-genome SNP markers, using the regression method based on allele frequencies across loci. For data verified by parentage testing and with missing genotypes excluded, the correlation between pedigree-based relationships and chromosome-level genomic relationships averaged $0.81{\pm}0.04$ (mean ± standard deviation), whereas the correlation with relationships based on whole-genome information was higher, at 0.98. Variation in relationships between non-inbred half sibs was $0.22{\pm}0.17$ on chromosomal and $0.22{\pm}0.04$ on whole-genome SNP markers. The variations were larger, and unusual values were observed, when data without parentage testing were included. Thus, a relationship matrix based on genomic information can be useful for genetic evaluation in animal breeding.
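
To make the idea of a genomic relationship matrix concrete, here is a minimal sketch in plain Python. It is not the regression method used in the paper: it swaps in the simpler and widely used centered-genotype formulation, G = ZZ' / (2 Σ p_j (1 - p_j)), where Z holds genotypes coded 0/1/2 and centered by twice the allele frequency. The handling of missing genotypes (set to the locus mean, so they contribute zero after centering) and all function names are assumptions made for illustration.

```python
from typing import List, Optional

# One genotype: copies (0, 1, or 2) of the counted allele; None means missing.
Genotype = Optional[int]


def allele_frequencies(genotypes: List[List[Genotype]]) -> List[float]:
    """Frequency of the counted allele at each locus, from observed genotypes.
    Assumes every locus has at least one observed genotype."""
    n_loci = len(genotypes[0])
    freqs = []
    for j in range(n_loci):
        observed = [row[j] for row in genotypes if row[j] is not None]
        freqs.append(sum(observed) / (2 * len(observed)))
    return freqs


def genomic_relationship_matrix(genotypes: List[List[Genotype]]) -> List[List[float]]:
    """Centered-genotype GRM: G = Z Z' / (2 * sum_j p_j * (1 - p_j)).
    Assumes at least one locus is polymorphic, so the scale is non-zero."""
    p = allele_frequencies(genotypes)
    # Center each genotype by its expected value 2p; a missing genotype is
    # treated as equal to that expectation, so it becomes 0 after centering.
    z = [
        [0.0 if g is None else g - 2.0 * p[j] for j, g in enumerate(row)]
        for row in genotypes
    ]
    scale = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    n = len(genotypes)
    return [
        [sum(a * b for a, b in zip(z[i], z[k])) / scale for k in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three animals, three SNPs (toy data, not from the study).
    toy = [[0, 1, 2], [2, 1, 0], [1, 1, 1]]
    for row in genomic_relationship_matrix(toy):
        print([round(value, 3) for value in row])
```

Restricting the genotype columns to the SNPs on a single chromosome gives a chromosome-level matrix, while passing all columns gives the whole-genome matrix.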

Iterative integrated imputation for missing data and pathway models with applications to breast cancer subtypes

  • Linder, Henry;Zhang, Yuping
    • Communications for Statistical Applications and Methods / v.26 no.4 / pp.411-430 / 2019
  • Tumor development is driven by complex combinations of biological elements. Recent advances suggest that molecularly distinct subtypes of breast cancers may respond differently to pathway-targeted therapies. Thus, it is important to dissect pathway disturbances by integrating multiple molecular profiles, such as genetic, genomic and epigenomic data. However, missing data are often present in the -omic profiles of interest. Motivated by genomic data integration and imputation, we present a new statistical framework for pathway significance analysis. Specifically, we develop a new strategy for imputation of missing data in large-scale genomic studies, which adapts low-rank, structured matrix completion. Our iterative strategy enables us to impute missing data in complex configurations across multiple data platforms. In turn, we perform large-scale pathway analysis integrating gene expression, copy number, and methylation data. The advantages of the proposed statistical framework are demonstrated through simulations and real applications to breast cancer subtypes. We demonstrate superior power to identify pathway disturbances, compared with other imputation strategies. We also identify differential pathway activity across different breast tumor subtypes.
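
The general idea of iterative low-rank imputation mentioned in this abstract can be sketched as follows. This is not the authors' structured, multi-platform method: it is a deliberately small stand-in that assumes a single data matrix, uses a rank-1 approximation computed by power iteration, and refills the missing cells from that approximation on each round. All function names and defaults are hypothetical.

```python
import math
from typing import List, Optional

Matrix = List[List[Optional[float]]]  # None marks a missing entry


def rank_one_approximation(a: List[List[float]], n_power_iters: int = 100) -> List[List[float]]:
    """Best rank-1 approximation sigma * u * v^T via power iteration.
    Assumes the start vector is not orthogonal to the leading singular
    vector, which holds for non-negative data such as the example below."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    u = [0.0] * n_rows
    sigma = 0.0
    for _ in range(n_power_iters):
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            break
        u = [x / norm_u for x in u]
        v = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        if sigma == 0.0:
            break
        v = [x / sigma for x in v]
    return [[sigma * u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


def impute_low_rank(x: Matrix, n_rounds: int = 50) -> List[List[float]]:
    """Fill missing entries by alternating between a rank-1 fit of the
    completed matrix and refilling only the missing cells from that fit."""
    n_rows, n_cols = len(x), len(x[0])
    filled = [list(row) for row in x]
    # Start from column means of the observed values (0.0 if none observed).
    for j in range(n_cols):
        observed = [x[i][j] for i in range(n_rows) if x[i][j] is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        for i in range(n_rows):
            if x[i][j] is None:
                filled[i][j] = mean
    for _ in range(n_rounds):
        approx = rank_one_approximation(filled)
        for i in range(n_rows):
            for j in range(n_cols):
                if x[i][j] is None:  # observed entries are never overwritten
                    filled[i][j] = approx[i][j]
    return filled


if __name__ == "__main__":
    # Rows are multiples of (1, 2, 4); the hidden entry would be 8.
    toy = [[1.0, 2.0, 4.0], [2.0, 4.0, None], [3.0, 6.0, 12.0], [1.0, 2.0, 4.0]]
    for row in impute_low_rank(toy):
        print([round(value, 3) for value in row])
```

A practical version would choose the rank from the data, add regularization, and share structure across platforms such as expression, copy number, and methylation, none of which this sketch attempts.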

One Step Cloning of Defined DNA Fragments from Large Genomic Clones

  • Scholz, Christian;Doderlein, Gabriele;Simon, Horst H.
    • BMB Reports / v.39 no.4 / pp.464-467 / 2006
  • Recently, the nucleotide sequences of entire genomes became available. This information, combined with older sequencing data, discloses the exact chromosomal location of millions of nucleotide markers stored in the databases at NCBI, EMBO or DDBJ. Besides resolving, at the stroke of a pen, the intron/exon structures of all described genes within these genomes, the sequencing data opens up other interesting possibilities. For example, the genomic mapping of the end sequences of the human, murine and rat BAC libraries generated at The Institute for Genomic Research (TIGR) now reveals the entire encompassed sequence of the inserts for more than a million of these clones. Since these clones are individually stored, they are now an invaluable source for experiments that depend on genomic DNA. Isolating smaller fragments from such clones with standard methods is a time-consuming process. We describe here a reliable one-step cloning technique to obtain a DNA fragment with a defined size and sequence from larger genomic clones in less than 48 hours, using a standard vector with a multiple cloning site and common restriction enzymes and equipment. The only prerequisites are the sequences of the ends of the insert and of the underlying genome.

BaSDAS: a web-based pooled CRISPR-Cas9 knockout screening data analysis system

  • Park, Young-Kyu;Yoon, Byoung-Ha;Park, Seung-Jin;Kim, Byung Kwon;Kim, Seon-Young
    • Genomics & Informatics / v.18 no.4 / pp.46.1-46.4 / 2020
  • We developed the BaSDAS (Barcode-Seq Data Analysis System), a GUI-based pooled knockout screening data analysis system, to facilitate the analysis of pooled knockout screen data easily and effectively by researchers with limited bioinformatics skills. The BaSDAS supports the analysis of various pooled screening libraries, including yeast, human, and mouse libraries, and provides many useful statistical and visualization functions with a user-friendly web interface for convenience. We expect that BaSDAS will be a useful tool for the analysis of genome-wide screening data and will support the development of novel drugs based on functional genomics information.

A ChIP-Seq Data Analysis Pipeline Based on Bioconductor Packages

  • Park, Seung-Jin;Kim, Jong-Hwan;Yoon, Byung-Ha;Kim, Seon-Young
    • Genomics & Informatics / v.15 no.1 / pp.11-18 / 2017
  • Nowadays, huge volumes of chromatin immunoprecipitation-sequencing (ChIP-Seq) data are generated to increase our knowledge of DNA-protein interactions in the cell, and accordingly, many tools have been developed for ChIP-Seq analysis. Here, we provide an example of a streamlined workflow for ChIP-Seq data analysis composed of only four packages in Bioconductor: dada2, QuasR, mosaics, and ChIPseeker. 'dada2' performs trimming of the high-throughput sequencing data. 'QuasR' performs quality control and mapping of the input reads to the reference genome, and 'mosaics' performs peak calling. Finally, 'ChIPseeker' performs annotation and visualization of the called peaks. This workflow runs well independently of the operating system (e.g., Windows, Mac, or Linux) and processes the input fastq files into various results in one run. R code is available on GitHub: https://github.com/ddhb/Workflow_of_Chipseq.git.

Non-Synteny Regions in the Human Genome

  • Lee, Ki-Chan;Kim, Sang-Soo
    • Genomics & Informatics / v.8 no.2 / pp.86-89 / 2010
  • Closely related species share large genomic segments called syntenic regions, where genomic elements such as genes are arranged co-linearly among the species. While synteny is an important criterion in establishing orthologous regions between species, non-syntenic regions may display species-specific features. As the first step in cataloging human- or primate-specific genomic elements, we surveyed human genomic regions that are not syntenic with any of the non-primate mammalian genomes sequenced so far. Based on the data compiled in Ensembl databases, we were able to identify 10 such regions located on eight different human chromosomes. Interestingly, most of these highly human- or primate-specific loci are concentrated in subtelomeric or pericentromeric regions. It has been reported that subtelomeric regions in human chromosomes are highly plastic and filled with recently shuffled genomic elements. Pericentromeric regions also show a great deal of segmental duplication. Such genomic rearrangements may have given rise to these large human- or primate-specific genome segments.