Efficient K-Anonymization Implementation with Apache Spark

Kim, Tae-Su; Kim, Jong Wook

doi:10.9708/jksci.2018.23.11.017

Journal of the Korea Society of Computer and Information (한국컴퓨터정보학회논문지)

Volume 23, Issue 11, Pages 17-24, 2018
pISSN 1598-849X, eISSN 2383-9945

Korean Society of Computer Information (한국컴퓨터정보학회)

Kim, Tae-Su (Dept. of Computer Science, Sangmyung University);
Kim, Jong Wook (Dept. of Computer Science, Sangmyung University)

Received : 2018.08.31
Accepted : 2018.10.22
Published : 2018.11.30

https://doi.org/10.9708/jksci.2018.23.11.017

Abstract

Today, we are living in the era of data and information. With the advent of the Internet of Things (IoT), the popularity of social networking sites, and the development of mobile devices, a large amount of data is being produced in diverse areas. The collection of such data generated in various areas is called big data. As the importance of big data grows, so does the need to share big data that contains information about individuals. Because such data contains sensitive information about individuals, directly releasing it for public use may violate existing privacy requirements. Thus, privacy-preserving data publishing (PPDP) has been actively studied as a way to share big data containing personal information for public use while preserving the privacy of the individual. K-anonymity, the most popular method in the area of PPDP, transforms each record in a table such that at least k records have the same values for the given quasi-identifier attributes, so each record is indistinguishable from the other records in the same class. As big data continues to grow in size, there is increasing demand for methods that can efficiently anonymize vast amounts of data. Thus, in this paper, we develop an efficient k-anonymity method using the Spark distributed framework. Experimental results show that the developed method achieves significant gains in processing time.
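The k-anonymity condition described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' Spark implementation: the generalization hierarchies for Age and Gender, the helper names (`generalize_age`, `generalize_gender`, `map_record`, `class_sizes`, `is_k_anonymous`), and the sample records are all hypothetical and chosen only for illustration. The sketch generalizes the quasi-identifier attributes of each record (a map step), counts the records that share each generalized value (a reduce step), and checks that every group contains at least k records.

```python
from collections import Counter

# Hypothetical generalization hierarchies (assumptions, not taken from the paper).
# Level 0 keeps the original value; higher levels are coarser.
def generalize_age(age, level):
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"  # ten-year range, e.g. "20-29"
    return "*"  # fully suppressed

def generalize_gender(gender, level):
    return gender if level == 0 else "*"

def map_record(record, levels):
    """Map step: emit (generalized quasi-identifier, 1) for one record."""
    age, gender = record
    key = (generalize_age(age, levels[0]), generalize_gender(gender, levels[1]))
    return key, 1

def class_sizes(records, levels):
    """Reduce step: sum the counts per generalized quasi-identifier."""
    sizes = Counter()
    for key, count in (map_record(r, levels) for r in records):
        sizes[key] += count
    return sizes

def is_k_anonymous(records, levels, k):
    """True if every group of identical quasi-identifiers has at least k records."""
    return all(size >= k for size in class_sizes(records, levels).values())

# Illustrative (Age, Gender) records; not data from the paper.
records = [(23, "M"), (27, "M"), (25, "F"), (29, "F"), (41, "F"), (45, "F")]

print(is_k_anonymous(records, levels=(0, 0), k=2))  # raw values
print(is_k_anonymous(records, levels=(1, 0), k=2))  # ages as ten-year ranges
```

In a distributed setting, the same two steps would plausibly correspond to a `map` followed by a `reduceByKey` over an RDD, so that the per-group counts are computed in parallel across partitions. That mapping is an assumption about the general approach, not a description of the paper's pseudocode.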

Fig. 1. Example taxonomy trees of Age and Gender
Fig. 2. HDFS Architecture
Fig. 3. Hadoop Yarn Architecture
Fig. 4. Spark Workflow
Fig. 5. The Pseudocode for k-Anonymity in Spark Distributed Environment
Fig. 6. Load Files to HDFS
Fig. 7. Generalization Lattice Tree
Fig. 8. Make RDD and Partition, Cache
Fig. 9. K-Anonymity on Spark Map & Reduce
Fig. 10. The Pseudocode of Map for k-Anonymity
Fig. 11. The Pseudocode of Reduce for k-Anonymity
Fig. 12. Execution time comparison between non-distributed k-anonymity and distributed k-anonymity for varying numbers of records (k = 500)
Fig. 13. Execution time comparison between non-distributed k-anonymity and distributed k-anonymity for varying values of k (data size = 0.5 GB)
Fig. 14. K-Anonymity on Spark Distributed System
Table 1. Original table and 2-Anonymized table
Table 2. Server Specs and Spark Options used in the experiments
