Capturing Data from Untapped Sources using Apache Spark for Big Data Analytics

Nichie, Aaron; Koo, Heung-Seo

doi:10.5370/KIEE.2016.65.7.1277

The Transactions of The Korean Institute of Electrical Engineers (전기학회논문지)

Volume 65, Issue 7 / Pages 1277-1282 / 2016 / 1975-8359 (pISSN) / 2287-4364 (eISSN)

The Korean Institute of Electrical Engineers (대한전기학회)



Korean title: 빅데이터 분석을 위해 아파치 스파크를 이용한 원시 데이터 소스에서 데이터 추출 (Data Extraction from Raw Data Sources using Apache Spark for Big Data Analytics)

Nichie, Aaron (Dept. of Computer and Information Engineering, Cheong-Ju Univ.);
Koo, Heung-Seo (구흥서) (Dept. of Computer and Information Engineering, Cheong-Ju Univ.)

Received : 2016.05.15
Accepted : 2016.06.20
Published : 2016.07.01

https://doi.org/10.5370/KIEE.2016.65.7.1277


Abstract

The term "Big Data" encapsulates a broad spectrum of data sources and data formats. It is often described as unstructured data because of the variety of its formats. Even though the traditional method of structuring data in rows and columns has been reinvented as column families and key-value pairs, or replaced entirely by JSON documents in document-based databases, data still have to be reshaped to conform to some structure before they can be stored persistently on disk. ETL processes are key to this restructuring. However, ETL processes incur additional processing overhead and require that data sources be maintained in predefined formats. Consequently, data in certain formats are ignored entirely, because designing ETL processes that cater for every possible data format is almost impossible. These unconsidered data sources could provide useful insights if they were incorporated into big data analytics. In this project, using the big data solution Apache Spark, we tapped into other sources of data stored in their raw formats, such as various text files and compressed files, and combined that data with enterprise data stored persistently in MongoDB for overall data analytics using the MongoDB Aggregation Framework and MapReduce. This differs significantly from traditional ETL systems in that it is compatible with the data regardless of its format at the source.
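The idea in the abstract, reading raw text and compressed files as they are and combining the result with documents that are already stored, can be illustrated with a minimal sketch. This is not the authors' implementation: plain Python stands in for Apache Spark and an in-memory list stands in for a MongoDB collection, so that the example runs without a cluster or a database. The file names, the `customer action` line layout, and the `_id`, `name`, and `raw_events` fields are all hypothetical.

```python
import gzip
import tempfile
from collections import Counter
from pathlib import Path


def read_raw_lines(path):
    """Yield non-empty text lines from a plain or gzip-compressed file."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def count_by_customer(paths):
    """Count raw lines per customer id (assumed to be the first token)."""
    counts = Counter()
    for path in paths:
        for line in read_raw_lines(path):
            counts[line.split()[0]] += 1
    return counts


def enrich(stored_docs, counts):
    """Attach the raw-source counts to the already stored documents."""
    return [dict(doc, raw_events=counts.get(doc["_id"], 0)) for doc in stored_docs]


# Hypothetical sample data: one plain text file and one gzip file.
workdir = Path(tempfile.mkdtemp())
plain = workdir / "events.txt"
plain.write_text("c1 login\nc2 login\nc1 purchase\n", encoding="utf-8")
packed = workdir / "events.log.gz"
with gzip.open(packed, "wt", encoding="utf-8") as handle:
    handle.write("c1 logout\n")

# Stand-in for documents persisted in a MongoDB collection.
stored = [{"_id": "c1", "name": "Kim"}, {"_id": "c2", "name": "Lee"}]

result = enrich(stored, count_by_customer([plain, packed]))
print(result)
```

No ETL step converts the files to a predefined format first; the reader only needs to recognise how each source is encoded. In Spark itself the same role would be played by a call such as `SparkContext.textFile`, which reads gzip-compressed text files directly.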

Keywords

