Analysis of unmapped regions associated with long deletions in Korean whole genome sequences based on short read data

Lee, Yuna; Park, Kiejung; Koh, Insong

doi:10.5808/GI.2019.17.4.e40

Genomics & Informatics

Volume 17, Issue 4, Pages 40.1-40.9, 2019. pISSN 1598-866X, eISSN 2234-0742

Korea Genome Organization (한국유전체학회)


Lee, Yuna (Department of Biomedical Informatics, Hanyang University) ;
Park, Kiejung (Cheonan Industry-Academic Collaboration Foundation, Sangmyung University) ;
Koh, Insong (Department of Biomedical Informatics, Hanyang University)

Received : 2019.10.10
Accepted : 2019.11.13
Published : 2019.12.31

https://doi.org/10.5808/GI.2019.17.4.e40

Abstract

While the detection and analysis of indels and single nucleotide polymorphisms in human genomic sequences have been studied actively, studies that detect long insertions/deletions remain difficult to carry out. Over the last 10 years, long read data of human genomes from the PacBio and Nanopore platforms have become more available, which makes long insertions/deletions easier to detect. However, because long read data are still relatively costly, much next-generation sequencing data is produced by short read sequencing machines. Here, we constructed programs to detect so-called unmapped regions (UMRs; regions of the reference genome to which no reads are mapped), scanned 40 Korean genomes to select UMRs as long deletion candidates, and compared the candidates with the long deletion break points of the genomes available from the 1000 Genomes Project (1KGP). An average of about 36,000 UMRs were found in the 40 Korean genomes tested, 284 UMRs were common across the 40 genomes, and 37,943 UMRs were found in total. Of these, 30,698 UMRs overlapped with the 74,045 break points provided by the 1KGP. As the number of compared samples increased from 1 to 40, the number of UMRs that overlapped with the break points also increased, eventually reaching 80.9% of the total UMRs found in this study. Because the number of overlapping UMRs could probably grow to encompass the 74,045 break points as more Korean genomes are included, this approach could be practically useful for studies of long deletions that rely on short read data.
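The two operations described in the abstract, finding regions with no mapped reads and checking them against known break points, can be illustrated with a minimal Python sketch. This is not the authors' program. It assumes the per-base read depth of one chromosome is already available as a sequence of integers, that a UMR is a maximal run of zero depth, and that a UMR "overlaps" a break point when the break point position lies inside the run. The function names, the `min_len` parameter, and the overlap rule are all hypothetical choices made for illustration.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

# Half-open, 0-based interval [start, end). This convention is an assumption.
Interval = Tuple[int, int]


def find_umrs(depth: Sequence[int], min_len: int = 1) -> List[Interval]:
    """Return maximal runs of zero read depth that are at least min_len long."""
    umrs: List[Interval] = []
    start = None  # start of the zero-depth run currently being scanned
    for pos, d in enumerate(depth):
        if d == 0:
            if start is None:
                start = pos
        elif start is not None:
            # The run ended at pos; keep it only if it is long enough.
            if pos - start >= min_len:
                umrs.append((start, pos))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(depth) - start >= min_len:
        umrs.append((start, len(depth)))
    return umrs


def hits_breakpoint(umr: Interval, sorted_breakpoints: Sequence[int]) -> bool:
    """True if any break point b satisfies start <= b < end (assumed rule)."""
    start, end = umr
    i = bisect_left(sorted_breakpoints, start)  # first break point >= start
    return i < len(sorted_breakpoints) and sorted_breakpoints[i] < end


def overlap_fraction(umrs: Sequence[Interval], breakpoints: Sequence[int]) -> float:
    """Fraction of UMRs that contain at least one break point."""
    if not umrs:
        return 0.0
    bps = sorted(breakpoints)
    hits = sum(hits_breakpoint(u, bps) for u in umrs)
    return hits / len(umrs)


if __name__ == "__main__":
    toy_depth = [3, 0, 0, 0, 5, 0, 2, 0, 0]       # made-up depths, not real data
    toy_umrs = find_umrs(toy_depth, min_len=2)
    print(toy_umrs)                               # [(1, 4), (7, 9)]
    print(overlap_fraction(toy_umrs, [2, 6]))     # 0.5
```

Applied to the figures reported in the abstract, the same ratio is 30,698 / 37,943, which rounds to the stated 80.9%.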

