Data Quality Measurement on a De-identified Data Set Based on Statistical Modeling

Chun, Heuiju;Yi, Hyun Jee;Yeon, Kyupil;Kim, Dongrae;

doi:10.5392/JKCA.2019.19.05.553

The Journal of the Korea Contents Association (한국콘텐츠학회논문지)

Volume 19 Issue 5 / Pages 553-561 / 2019 / 1598-4877 (pISSN) / 2508-6723 (eISSN)

The Korea Contents Association (한국콘텐츠학회)


Quality Measurement of De-identified Data Based on the Accuracy of Statistical Models (통계모형의 정확도에 기반한 비식별화 데이터의 품질 측정)

Chun, Heuiju (Dongduk Women's University);
Yi, Hyun Jee (Dongguk University);
Yeon, Kyupil (Hoseo University);
Kim, Dongrae (Easycerti Co., Ltd.)

Received : 2019.03.04
Accepted : 2019.04.18
Published : 2019.05.28

https://doi.org/10.5392/JKCA.2019.19.05.553

Abstract

This study examines how to measure the statistical usefulness of de-identified data in terms of the prediction accuracy of statistical models. In the era of the 4th industrial revolution, effective use of big data is essential to innovation through information and communication technology, but personal information issues constrain its active use. To address this, de-identification guidelines have been established, and the use of various de-identification methods has made actual re-identification of personal information very unlikely. Strong de-identification, however, can have the side effect of degrading the usefulness of the data. We studied the statistical usefulness of data de-identified by the KLT model, a representative de-identification method, and conducted a case study to see how much the accuracy of a statistical prediction model is degraded by de-identification. We also propose a new measure of the usefulness of de-identified data: the amount of non-de-identified data that must be added to the de-identified data for the predictive model to recover its accuracy.

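This study examines methods for measuring the quality of de-identified personal data with respect to statistical usefulness, from the standpoint of prediction accuracy under statistical modeling. In the era of the 4th industrial revolution, effective use of big data is indispensable for innovation through information and communication technology, yet personal information issues restrict the active use of big data. To resolve this, de-identification guidelines were established, and as a variety of de-identification methods came into use, the practical possibility of re-identifying personal information became very low. Strong de-identification, on the other hand, can have the side effect of reducing the usefulness of the data. Whereas research so far has centered on de-identification methods that make re-identification impossible, this study investigates quality measurement, in terms of statistical usefulness, of data de-identified with the KLT model, a representative de-identification method. Based on the accuracy of statistical prediction models fitted to de-identified data, a case analysis was carried out to determine how far the statistical usefulness of the de-identified data is impaired. In addition, by examining how much non-de-identified data must be added to the de-identified data for the prediction model to recover its accuracy, a new indicator of the usefulness of de-identified data is proposed.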
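The measurement idea in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration and not the authors' procedure: the data are synthetic, `generalize` is a crude binning stand-in for KLT-based de-identification, a one-feature threshold rule (`fit_stump`) replaces the paper's statistical model, and `restoration_percent` is a hypothetical name for the proposed usefulness measure.

```python
def generalize(x, width=20, origin=20):
    """Coarsen x to the midpoint of its bin (toy stand-in for de-identification)."""
    return origin + width * ((x - origin) // width) + width // 2


def fit_stump(xs, ys):
    """Fit the rule 'predict 1 if x > t' by minimizing training errors."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((x > t) != (y == 1) for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def accuracy(t, xs, ys):
    """Share of rows that the threshold rule classifies correctly."""
    return sum((x > t) == (y == 1) for x, y in zip(xs, ys)) / len(xs)


def restoration_percent(orig_x, orig_y, test_x, test_y, tol=0.03, step=10):
    """Smallest share (in %) of original rows that, added to the de-identified
    training set, brings test accuracy within `tol` of the original model."""
    n = len(orig_x)
    target = accuracy(fit_stump(orig_x, orig_y), test_x, test_y) - tol
    deid_x = [generalize(x) for x in orig_x]
    for pct in range(0, 101, step):
        m = n * pct // 100
        idx = [i * n // m for i in range(m)]  # evenly spaced original rows
        xs = deid_x + [orig_x[i] for i in idx]
        ys = orig_y + [orig_y[i] for i in idx]
        if accuracy(fit_stump(xs, ys), test_x, test_y) >= target:
            return pct
    return None


# Synthetic example: the true class boundary sits at x = 47.
train_x = list(range(20, 70))
train_y = [int(x >= 47) for x in train_x]
test_x = [x + 0.5 for x in train_x]
test_y = [int(x >= 47) for x in test_x]

acc_orig = accuracy(fit_stump(train_x, train_y), test_x, test_y)
acc_deid = accuracy(
    fit_stump([generalize(x) for x in train_x], train_y), test_x, test_y
)
needed = restoration_percent(train_x, train_y, test_x, test_y)
print(acc_orig, acc_deid, needed)
```

On this toy data the model trained on original values reaches a test accuracy of 1.0, the model trained on generalized values drops to 0.86, and adding 20% of the original rows brings accuracy back within the 0.03 tolerance. A smaller percentage would indicate de-identified data that retains more of its usefulness.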

Keywords

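Table 1. Variables in the original DB (image: CCTHCV_2019_v19n5_553_t0001.png)

Table 2. Estimated regression coefficients before and after de-identification (image: CCTHCV_2019_v19n5_553_t0002.png)

Table 3. Classification performance on the validation data (image: CCTHCV_2019_v19n5_553_t0003.png)

Table 4. Classification performance of the two models (image: CCTHCV_2019_v19n5_553_t0004.png)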

