Capturing Data from Untapped Sources using Apache Spark for Big Data Analytics

Extracting Data from Raw Data Sources Using Apache Spark for Big Data Analytics

  • Nichie, Aaron (Dept. of Computer and Information Engineering, Cheong-Ju Univ.)
  • Koo, Heung-Seo (Dept. of Computer and Information Engineering, Cheong-Ju Univ.)
  • Received : 2016.05.15
  • Accepted : 2016.06.20
  • Published : 2016.07.01

Abstract

The term "Big Data" encompasses a broad spectrum of data sources and data formats. It is often described as unstructured data because of the variety of its formats. Although the traditional method of structuring data in rows and columns has been reinvented as column families and key-value stores, or replaced entirely with JSON documents in document-based databases, data must still be reshaped to conform to some structure before it can be stored persistently on disk. ETL processes are key to this restructuring. However, ETL processes incur additional processing overhead and require that data sources be maintained in predefined formats. Consequently, data in certain formats are ignored entirely, because designing ETL processes that cater for all possible data formats is almost impossible. These unconsidered data sources could provide useful insights if incorporated into big data analytics. In this project, using the big data solution Apache Spark, we tapped into other sources of data stored in their raw formats, such as various text files and compressed files, and combined the data with persistently stored enterprise data in MongoDB for overall data analytics using the MongoDB Aggregation Framework and MapReduce. This differs significantly from traditional ETL systems in that it is compatible with the data regardless of its format at the source.
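The flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and uses neither Spark nor MongoDB: plain Python stands in for both so that the example runs on its own. The `customer_id,amount` line format, the file names, and the helpers `read_raw_lines`, `parse_record`, `total_by_customer`, `enrich`, and `demo` are all assumptions made for illustration. The sketch reads plain and gzip-compressed text files as they are, aggregates them, and attaches the result to already stored documents.

```python
import gzip
import tempfile
from collections import defaultdict
from pathlib import Path


def read_raw_lines(path):
    """Yield non-empty lines from a plain or gzip-compressed text file."""
    # The opener is chosen from the file suffix; no prior conversion step is needed.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def parse_record(line):
    """Parse one line of the assumed raw format 'customer_id,amount'."""
    customer_id, amount = line.split(",")
    return {"customer_id": customer_id, "amount": float(amount)}


def total_by_customer(paths):
    """Sum the amounts per customer across all raw files."""
    totals = defaultdict(float)
    for path in paths:
        for line in read_raw_lines(path):
            record = parse_record(line)
            totals[record["customer_id"]] += record["amount"]
    return dict(totals)


def enrich(stored_docs, totals):
    """Attach raw-file totals to stored documents (a stand-in for MongoDB documents)."""
    return [{**doc, "raw_total": totals.get(doc["_id"], 0.0)} for doc in stored_docs]


def demo():
    """Build two small raw files, aggregate them, and merge with stored documents."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = Path(tmp) / "sales.txt"
        plain.write_text("c1,10.5\nc2,4.0\n", encoding="utf-8")
        packed = Path(tmp) / "sales.txt.gz"
        with gzip.open(packed, "wt", encoding="utf-8") as handle:
            handle.write("c1,2.5\n")
        totals = total_by_customer([plain, packed])
    stored = [{"_id": "c1", "name": "Acme"}, {"_id": "c3", "name": "Zenith"}]
    return enrich(stored, totals)
```

Calling `demo()` returns the two stored documents, each with a `raw_total` field computed from the raw files. In a real deployment the reading and aggregation steps would be distributed by Spark and the stored documents would come from a MongoDB collection; the sketch only shows the shape of the data flow.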
