Comparison of Distributed and Parallel NGS Data Analysis Methods based on Cloud Computing

Kang, Hyungil; Kim, Sangsoo

doi:10.5392/IJoC.2018.14.1.034

International Journal of Contents

Volume 14 Issue 1 / Pages 34-38 / 2018 / 1738-6764 (pISSN) / 2093-7504 (eISSN)

The Korea Contents Association (한국콘텐츠학회)

Kang, Hyungil (Dept. of Semiconductor Electronics Engineering, Chungbuk Health & Science University);
Kim, Sangsoo (Dept. of Course-based Qualification Exam Team 2, Human Resources Development Service of Korea)

Received : 2018.01.03
Accepted : 2018.04.02
Published : 2018.03.28

Abstract

With the rapid growth of genomic data, new requirements have emerged that are difficult to handle with big data storage and analysis techniques. Regardless of its size, an institution that performs genomic data analysis finds it increasingly difficult to build its own computing environment for storing and analyzing genomic data. Cloud computing has recently emerged as a computing environment that meets these new requirements. In this paper, we analyze and compare existing distributed and parallel NGS (Next Generation Sequencing) analysis methods based on cloud computing environments, as a basis for future research.

Fig. 1. Cost of DNA analysis

Fig. 2. NGS analysis process [28]

Fig. 3. Overall Architecture of Halvade [24]

Fig. 4. Workflow of SparkGA

Table 1. Tools for each NGS step

References

[1] M. Choi, "Development Trends of Medical Genomics Using Next Generation Sequencing Techniques," Molecular Cell Biology Newsletter, Apr. 2014.
[2] https://www.genome.gov/sequencingcostsdata/
[3] M. C. Schatz, B. Langmead, and S. L. Salzberg, "Cloud Computing and the DNA Data Race," Nature Biotechnology, vol. 28, no. 7, 2010, pp. 691-693. https://doi.org/10.1038/nbt0710-691
[4] M. Baker, "Next-generation Sequencing: Adjusting to Data Overload," Nature Methods, vol. 7, no. 7, 2010, pp. 495-499. https://doi.org/10.1038/nmeth0710-495
[5] B. Calabrese and M. Cannataro, "Bioinformatics and Microarray Data Analysis on the Cloud," Methods in Molecular Biology, vol. 1375, 2016, pp. 25-39.
[6] http://ngenebio.com/
[7] C. Lee, Bioinformatics Analysis of Next-Generation Sequence Data, BRIC View Trend Report, 2016.
[8] G. A. Van der Auwera, M. O. Carneiro, C. Hartl, R. Poplin, G. del Angel, A. Levy-Moonshine, T. Jordan, K. Shakir, D. Roazen, J. Thibault, E. Banks, K. V. Garimella, D. Altshuler, S. Gabriel, and M. A. DePristo, "From FastQ Data to High Confidence Variant Calls: the Genome Analysis Toolkit Best Practices Pipeline," Current Protocols in Bioinformatics, vol. 43, 2013, pp. 11.10.1-11.10.33.
[9] https://www.bioin.or.kr/board.do?cmd=view&bid=tech&num=216321
[10] BWA, https://github.com/lh3/bwa
[11] GATK, https://software.broadinstitute.org/gatk/
[12] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg, "Ultrafast and Memory-efficient Alignment of Short DNA Sequences to the Human Genome," Genome Biology, vol. 10, no. 3, 2009.
[13] http://broadinstitute.github.io/picard/
[14] https://github.com/GregoryFaust/samblaster
[15] https://github.com/broadinstitute/mutect
[16] https://hpc.nih.gov/apps/MutSig.html
[17] https://github.com/ekg/freebayes
[18] https://github.com/WGLab/doc-ANNOVAR/
[19] https://www.ensembl.org/vep
[20] https://gencore.bio.nyu.edu/variant-calling-pipeline/
[21] https://wikis.utexas.edu/display/bioiteam/DNAseq+Variant+Calling+Pipeline
[22] https://hadoop.apache.org/
[23] https://spark.apache.org/
[24] D. Decap, J. Reumers, C. Herzeel, P. Costanza, and J. Fostier, "Halvade: Scalable Sequence Analysis with MapReduce," Bioinformatics, vol. 31, no. 15, 2015, pp. 2482-2488. https://doi.org/10.1093/bioinformatics/btv179
[25] https://github.com/citiususc/BigBWA
[26] https://github.com/citiususc/SparkBWA
[27] J. Lee, H. Lee, J. Moon, H. Kang, S. Song, and S. Yu, "Parallel and Distributed PCR Duplication Marking Algorithm Integrated with Genome Sequence Alignment by Using Streaming Technology," Proceedings of TBC 2017, 2017.
[28] H. Mushtaq and Z. Al-Ars, "Cluster-based Apache Spark Implementation of the GATK DNA Analysis Pipeline," in Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2015, pp. 1471-1477.
[29] H. Mushtaq, F. Liu, C. Costa, G. Liu, P. Hofstee, and Z. Al-Ars, "SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale," in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2017, pp. 148-157.