Analysis of unmapped regions associated with long deletions in Korean whole genome sequences based on short read data

  • Lee, Yuna (Department of Biomedical Informatics, Hanyang University) ;
  • Park, Kiejung (Cheonan Industry-Academic Collaboration Foundation, Sangmyung University) ;
  • Koh, Insong (Department of Biomedical Informatics, Hanyang University)
  • Received : 2019.10.10
  • Accepted : 2019.11.13
  • Published : 2019.12.31

Abstract

Single nucleotide polymorphisms and short indels in human genomic sequences have been studied extensively, but long insertions and deletions remain difficult to detect. Over the last 10 years, long read data of human genomes from the PacBio and Nanopore platforms have become more widely available, which makes long insertions/deletions easier to detect. However, because long read sequencing is still relatively expensive, most next-generation sequencing data are produced by short read sequencing machines. Here, we constructed programs to detect so-called unmapped regions (UMRs, regions of the reference genome onto which no reads are mapped), scanned 40 Korean genomes to select UMRs as long deletion candidates, and compared the candidates with the long deletion break points in the genomes available from the 1000 Genomes Project (1KGP). On average, about 36,000 UMRs were found in the 40 Korean genomes tested; 284 UMRs were common to all 40 genomes, and 37,943 UMRs were found in total. Of these, 30,698 UMRs overlapped with the 74,045 break points provided by the 1KGP. As the number of compared samples increased from 1 to 40, the number of UMRs that overlapped with the break points also increased, eventually reaching 80.9% of the total UMRs found in this study (30,698 of 37,943). Because the number of overlapping UMRs would probably continue to grow toward the 74,045 break points as more Korean genomes are included, this approach could be practically useful for studies of long deletions that rely on short read data.
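The idea behind UMR detection and its comparison with known break points can be illustrated with a minimal sketch. This is not the authors' program: the function names, the half-open 0-based interval convention, the `min_len` threshold, and the rule that a UMR "overlaps" when a break point falls inside it are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# Half-open, 0-based interval [start, end) on one chromosome (assumed convention).
Interval = Tuple[int, int]


def find_umrs(depth: List[int], min_len: int = 1) -> List[Interval]:
    """Return runs of zero read depth that are at least min_len bases long.

    `depth` is the per-base coverage of one chromosome; `min_len` is a
    hypothetical length threshold, not a value taken from the paper.
    """
    umrs: List[Interval] = []
    start = None
    for pos, d in enumerate(depth):
        if d == 0:
            if start is None:
                start = pos  # a zero-coverage run begins here
        elif start is not None:
            if pos - start >= min_len:
                umrs.append((start, pos))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(depth) - start >= min_len:
        umrs.append((start, len(depth)))
    return umrs


def overlaps_breakpoint(umr: Interval, breakpoints: List[int]) -> bool:
    """True if any break point position lies inside the UMR (assumed criterion)."""
    start, end = umr
    return any(start <= bp < end for bp in breakpoints)


def overlap_fraction(umrs: List[Interval], breakpoints: List[int]) -> float:
    """Fraction of UMRs that contain at least one break point."""
    if not umrs:
        return 0.0
    hits = sum(overlaps_breakpoint(u, breakpoints) for u in umrs)
    return hits / len(umrs)


if __name__ == "__main__":
    toy_depth = [3, 2, 0, 0, 0, 5, 0, 4, 0, 0]
    toy_umrs = find_umrs(toy_depth, min_len=2)
    print(toy_umrs)                            # [(2, 5), (8, 10)]
    print(overlap_fraction(toy_umrs, [4, 7]))  # 0.5
```

With the figures reported in the abstract, the same ratio is 30,698 / 37,943, which rounds to 80.9%. For real data, the linear scan over break points would likely be replaced with a sorted or indexed lookup.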
