Multiple imputation and synthetic data

Multiple imputation and the generation of synthetic data

  • Kim, Joungyoun (Department of Information & Statistics, Chungbuk National University) ;
  • Park, Min-Jeong (Statistical Research Institute, Statistics Korea)
  • Received : 2018.11.26
  • Accepted : 2019.01.04
  • Published : 2019.02.28

Abstract

As society develops, the dissemination of microdata has increased in response to users' diverse analytical needs. Analyzing microdata for policy making, academic research, and similar purposes is highly desirable in terms of value creation. However, releasing microdata that remain useful carries a risk of disclosing personal information. Several methods have been considered for protecting personal information while preserving the usefulness of the data; one of them is to generate and release synthetic data. This paper aims to promote understanding of synthetic data by reviewing the methodologies and precautions involved in generating them. To this end, we first explain multiple imputation, the Bayesian predictive model, and the Bayesian bootstrap, which are the foundations of synthetic data. We then link these concepts to the construction of fully and partially synthetic data. To illustrate how synthetic data are created, we review a real example in which longitudinal data are synthesized using sequential regression multivariate imputation.

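As society develops, the provision of microdata, which consist of individual-level records, has increased to meet users' diverse analytical needs. Demand is also growing for research access, in microdata form, to complete-enumeration data such as censuses and administrative records. Analyzing microdata for policy making, academic research, and similar purposes is highly desirable in terms of value creation. However, providing microdata whose usefulness is preserved inevitably carries the risk that personal information may be disclosed. Accordingly, several methods have been considered that can guarantee the protection of personal information while preserving the usefulness of the data. One of these methods, which has been actively studied, is to generate and use synthetic data. This paper introduces the methodologies and precautions related to generating synthetic data in order to promote understanding of synthetic data. To this end, we first explain concepts essential to the construction of synthetic data, namely multiple imputation, the Bayesian predictive model, and the Bayesian bootstrap, and then examine fully synthetic and partially synthetic data. In particular, to understand the construction of synthetic data in depth, we review a concrete case in which longitudinal data are synthesized using sequential regression multivariate imputation.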
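
As an illustration of one of the building blocks named in the abstract, the following is a minimal sketch of the Bayesian bootstrap for the mean of a small sample. It is not taken from the paper. The function names, the toy sample, the number of draws, and the seed are all hypothetical choices made for this example. The sketch assumes the standard formulation in which each replicate reweights the observed values with weights drawn from a flat Dirichlet(1, ..., 1) distribution, rather than resampling them with replacement as in the ordinary bootstrap.

```python
import random


def bayesian_bootstrap_weights(n, rng):
    """Draw one Dirichlet(1, ..., 1) weight vector of length n."""
    # Independent Exponential(1) draws divided by their sum follow
    # a Dirichlet(1, ..., 1) distribution.
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def bayesian_bootstrap_mean(values, n_draws=2000, seed=0):
    """Return n_draws posterior draws of the mean of `values`."""
    rng = random.Random(seed)
    posterior = []
    for _ in range(n_draws):
        weights = bayesian_bootstrap_weights(len(values), rng)
        # Each draw is a weighted mean of the observed values.
        posterior.append(sum(w * v for w, v in zip(weights, values)))
    return posterior


# Toy sample (made-up numbers, for illustration only).
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]
posterior_means = bayesian_bootstrap_mean(sample)
```

Each element of `posterior_means` can be read as one plausible value of the population mean given the observed sample. In a synthetic data setting, the same weights could instead be used to draw records, but that step is left out of this sketch.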

Figure 2.1. An example of multiple imputation.

Figure 2.2. Scatter plots of the coefficient estimates from the analysis of the complete data (x-axis) and the mean of the coefficient estimates from the analysis of multiple imputation data (y-axis).
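
The mean of the per-dataset estimates mentioned in the caption above is the usual point estimate in multiple imputation. The following is a minimal sketch of how such estimates are commonly combined, assuming the standard combining rules. It is not code from the paper. The function name, the `kind` argument, and the example numbers are hypothetical. The three variance formulas are the commonly cited ones for multiply imputed data, partially synthetic data, and fully synthetic data, and should be checked against the original methodological papers before use.

```python
from statistics import mean, variance


def combine_estimates(estimates, variances, kind="imputation"):
    """Combine m point estimates and their variance estimates.

    kind: "imputation" (missing-data multiple imputation),
          "partial" (partially synthetic data), or
          "full" (fully synthetic data).
    Returns (combined point estimate, combined variance estimate).
    """
    m = len(estimates)
    q_bar = mean(estimates)   # average of the m point estimates
    u_bar = mean(variances)   # average within-dataset variance
    b = variance(estimates)   # between-dataset variance (divisor m - 1)
    if kind == "imputation":
        total = u_bar + (1 + 1 / m) * b
    elif kind == "partial":
        total = u_bar + b / m
    elif kind == "full":
        # This estimate can be negative in small simulations.
        total = (1 + 1 / m) * b - u_bar
    else:
        raise ValueError(f"unknown kind: {kind}")
    return q_bar, total


# Made-up estimates from m = 3 datasets, for illustration only.
q, t = combine_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The point estimate is the same in all three settings; only the variance formula changes, which is why a single function with a `kind` switch is used here.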

Figure 3.1. An example of generating fully synthetic data.

Table 2.1. Sequential regression multivariate imputation simulation results
