Fig. 1. Example taxonomy trees of Age and Gender
Fig. 2. HDFS Architecture
Fig. 3. Hadoop Yarn Architecture
Fig. 4. Spark Workflow
Fig. 5. The Pseudocode for k-Anonymity in Spark Distributed Environment
Fig. 6. Load Files to HDFS
Fig. 7. Generalization Lattice Tree
Fig. 8. Make RDD and Partition, Cache
Fig. 9. K-Anonymity on Spark Map & Reduce
Fig. 10. The Pseudocode of Map for k-Anonymity
Fig. 11. The Pseudocode of Reduce for k-Anonymity
Fig. 12. Execution time comparison between non-distributed k-anonymity and distributed k-anonymity for varying number of records (k = 500)
Fig. 13. Execution time comparison between non-distributed k-anonymity and distributed k-anonymity for varying values of k (data size = 0.5 GB)
Fig. 14. K-Anonymity on Spark Distributed System
Table 1. Original table and 2-Anonymized table
Table 2. Server Specs and Spark Options used in the experiments
References
- A. Narayanan and V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets", In Proceedings of the 2008 IEEE Symposium on Security and Privacy, 2008.
- J. Kim, K. Jung, H. Lee, S. Kim, J.W. Kim and Y.D. Chung, "Models for Privacy-preserving Data Publishing: A Survey", Journal of KIISE, Vol. 44, No. 2, pp. 195-207, 2017. https://doi.org/10.5626/JOK.2017.44.2.195
- B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, "Privacy-preserving data publishing: A survey of recent developments", ACM Computing Surveys, 42(4), June 2010.
- N. Mohammed, B.C.M. Fung, P.C.K. Hung, and C.K. Lee, "Centralized and distributed anonymization for high-dimensional healthcare data", ACM Transactions on Knowledge Discovery from Data, 4(4), October 2010.
- L. Sweeney, "k-anonymity: A model for protecting privacy", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, No. 05, pp. 557-570 (2002) https://doi.org/10.1142/S0218488502001648
- L. Sweeney, "Achieving k-Anonymity Privacy Protection using Generalization and Suppression", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, No. 05, pp. 571-588 (2002) https://doi.org/10.1142/S021848850200165X
- K. LeFevre, D.J. DeWitt and R. Ramakrishnan, "Incognito: Efficient full domain k-anonymity", In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2005.
- J. Byun, A. Kamra, E. Bertino, N. Li, "Efficient k-Anonymization Using Clustering Technique", DASFAA 2007: Advances in Databases: Concepts, Systems and Applications, pp. 188-200, 2007.
- S. Kim, H. Lee, Y.D. Chung, "Privacy-preserving data cube for electronic medical records: An experimental evaluation", International Journal of Medical Informatics, 2017.
- Apache Hadoop 2.8.4 API docs [Website]. https://hadoop.apache.org/docs/r2.8.4/ (2018)
- Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce", Proceedings of the VLDB Endowment, Volume 5 Issue 12, August 2012, Pages 2014-2015
- Apache Spark 2.3.0 API docs [Website]. https://spark.apache.org/docs/2.3.0/index.html (2018)
- Health Insurance Review and Assessment Service in Korea. http://opendata.hira.or.kr (2012).