Parallelization of Genome Sequence Data Pre-Processing on Big Data and HPC Framework

Byun, Eun-Kyu; Kwak, Jae-Hyuck; Mun, Jihyeob

doi:10.3745/KTCCS.2019.8.10.231

KIPS Transactions on Computer and Communication Systems (정보처리학회논문지:컴퓨터 및 통신 시스템)

Volume 8, Issue 10, Pages 231-238, 2019
pISSN 2287-5891, eISSN 2734-049X

Korea Information Processing Society (한국정보처리학회)

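Eun-Kyu Byun (Korea Institute of Science and Technology Information);
Jae-Hyuck Kwak (Korea Institute of Science and Technology Information);
Jihyeob Mun (Korea Institute of Science and Technology Information)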

Received : 2018.12.10
Accepted : 2019.09.09
Published : 2019.10.31

https://doi.org/10.3745/KTCCS.2019.8.10.231

Abstract

Analyzing raw genome data produced by next-generation sequencing in the conventional way, on a single server, can take several tens of hours depending on the data size. However, some situations, such as diagnosing an emergency patient, require results within a few hours, so the time needed to analyze a single genome must be reduced. In this paper, we propose a method for parallelizing the pre-processing stage of genome sequence data analysis. The method shortens analysis time by combining the parallelization techniques of big data technology with a high-performance computing cluster whose nodes are connected by a high-speed network and share a parallel file system. To keep the analysis results reliable, we chose a strategy of parallelizing existing, validated analysis tools and algorithms to fit the new environment. We developed techniques for process parallelization, data distribution, and parallel merging, and confirmed the performance improvement through experiments.
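The abstract names three ingredients: data distribution, parallelized processing, and merging of partial results. The following Python sketch illustrates only that general split/process/merge pattern, not the authors' implementation. All names (`Record`, `split_into_chunks`, `process_chunk`, `parallel_preprocess`) are hypothetical, `process_chunk` is a placeholder for a real tool such as an aligner, and the final step is a sequential k-way merge standing in for the parallel merging technique, whose details the abstract does not give.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Hypothetical record: (mapped position, read name). Real pipelines
# would carry full alignment records instead.
Record = Tuple[int, str]


def split_into_chunks(reads: List[Record], n_chunks: int) -> List[List[Record]]:
    """Data distribution: hand out reads round-robin so chunks are balanced."""
    return [reads[i::n_chunks] for i in range(n_chunks)]


def process_chunk(chunk: List[Record]) -> List[Record]:
    """Placeholder for per-chunk work done by an existing analysis tool.

    Here it only sorts by position so the partial outputs can be merged.
    """
    return sorted(chunk)


def parallel_preprocess(reads: List[Record], n_workers: int = 4) -> List[Record]:
    """Split the input, process chunks concurrently, then merge the results."""
    chunks = split_into_chunks(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # k-way merge of already sorted partial outputs (sequential in this sketch)
    return list(heapq.merge(*partials))


if __name__ == "__main__":
    sample = [(30, "r3"), (10, "r1"), (20, "r2"), (5, "r0"), (40, "r4")]
    print(parallel_preprocess(sample, n_workers=3))
```

Because each chunk is processed independently by an unchanged tool, the per-chunk output is the tool's own output for those reads; only the distribution and merge steps are new code, which is consistent with the reliability argument in the abstract.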
