Systematic Research on Privacy-Preserving Distributed Machine Learning

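  • 이민섭 (Graduate School of Information Security, Korea University);
  • 신영아 (Graduate School of Information Security, Korea University);
  • 천지영 (Department of Big Data and Information Security, Seoul Cyber University)
  • Received: 2023.12.05
  • Reviewed: 2023.12.21
  • Published: 2024.02.29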

Abstract

Artificial intelligence (AI) is regarded as highly promising in domains such as smart cities, autonomous driving, and healthcare, but its use is limited by concerns about exposing the personal and sensitive information of data subjects. In response, the concept of distributed machine learning has emerged: instead of collecting data on a central server, each participant first trains on the dataset it holds, and a global model is then trained from those local results. However, threats to data privacy still arise while the participants train collaboratively. In this paper, we systematically analyze recent research on privacy protection in distributed machine learning according to the classification criteria proposed to date, such as the presence or absence of a central server, the distribution of the training datasets, and performance differences among participants. In particular, we focus on three representative distributed machine learning techniques (horizontal federated learning, vertical federated learning, and swarm learning), examine the privacy protection mechanisms used in them, and explore directions for future research.
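
The local-then-global training flow described in the abstract can be illustrated with a minimal sketch. It assumes a toy linear model, plain gradient descent on each participant, and a dataset-size-weighted average as the aggregation rule (commonly called federated averaging); the abstract does not prescribe any of these, and the names `local_update`, `aggregate`, and `train` are hypothetical. No privacy protection mechanism is applied here, so the exchanged model weights are exactly the kind of shared information that the surveyed techniques aim to protect.

```python
from typing import List, Tuple

Vector = List[float]
Dataset = List[Tuple[Vector, float]]  # (features, target) pairs held by one participant


def local_update(weights: Vector, data: Dataset, lr: float = 0.1, epochs: int = 5) -> Vector:
    """Run a few epochs of gradient descent on one participant's own data."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / len(data)
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


def aggregate(updates: List[Vector], sizes: List[int]) -> Vector:
    """Average the local models, weighting each by its local dataset size (assumed rule)."""
    total = sum(sizes)
    return [
        sum(u[j] * n for u, n in zip(updates, sizes)) / total
        for j in range(len(updates[0]))
    ]


def train(clients: List[Dataset], dim: int, rounds: int = 20) -> Vector:
    """Alternate local training and aggregation; only model weights are exchanged."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in clients]
        global_w = aggregate(updates, [len(d) for d in clients])
    return global_w


# Toy data: two participants whose samples all follow y = 2*x + 1
# (the second feature is a constant 1 acting as the bias term).
clients = [
    [([-1.0, 1.0], -1.0), ([1.0, 1.0], 3.0)],
    [([-2.0, 1.0], -3.0), ([0.0, 1.0], 1.0), ([2.0, 1.0], 5.0)],
]
model = train(clients, dim=2)
print(model)
```

On this toy data the printed weights should approach `[2.0, 1.0]`, since both participants share the same underlying relationship. The raw samples never leave `local_update`, but the weights returned to `aggregate` are visible to whoever performs the aggregation.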

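Funding Information

This research was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. 2021R1F1A1063992).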
