• Title/Summary/Keyword: Microdata

Study on a Measurement of Disclosure Risk of Microdata by Similarity

  • Cho, Hyeon-Kwan;Kwon, Dae-Hong;Lee, Suk-Hoon
    • The Korean Journal of Applied Statistics / v.25 no.5 / pp.743-755 / 2012
  • Researchers who use statistical data often want microdata for detailed analysis, and institutes need to mask sensitive data before providing it. Many researchers have used the proportion of records with a unique identity to measure disclosure risk. We propose a new measure of disclosure risk that also considers cases in which identities are the same or similar. As an application, we compare the proposed measure with the existing one using 10,667 records from the 'Korea Household Income and Expenditure Survey data for 2010'.
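
The existing measure mentioned in this abstract, the proportion of unique records, can be sketched as follows. This is only an illustration: the key variables (`region`, `age_group`, `household_size`) and the toy records are assumptions, not fields of the actual survey, and the similarity-based measure proposed in the paper is not reproduced here because the abstract does not define it.

```python
from collections import Counter


def uniqueness_risk(records, key_vars):
    """Share of records whose combination of key variables occurs exactly once."""
    keys = [tuple(rec[v] for v in key_vars) for rec in records]
    counts = Counter(keys)
    n_unique = sum(1 for k in keys if counts[k] == 1)
    return n_unique / len(keys)


# Toy records with hypothetical key variables (not the survey's real fields).
records = [
    {"region": "Seoul", "age_group": "30s", "household_size": 2},
    {"region": "Seoul", "age_group": "30s", "household_size": 2},
    {"region": "Busan", "age_group": "40s", "household_size": 4},
    {"region": "Daegu", "age_group": "20s", "household_size": 1},
]

risk = uniqueness_risk(records, ["region", "age_group", "household_size"])
print(risk)  # two of the four records are unique on the key variables
```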

A Method of Masking for 2005 Korean Census Microdata (인구주택총조사 마이크로자료의 개인정보 노출제한방법)

  • Jeong, Dong-Myeong;Jeong, Mi-Ock
    • The Korean Journal of Applied Statistics / v.21 no.2 / pp.313-325 / 2008
  • Large amounts of information on individuals are available to many organizations and data users, and government agencies release microdata files from their survey data or administrative records. However, if a microdata file is released without any restriction, an invasion of privacy is likely to occur. Therefore, in creating a microdata file, agencies attempt to eliminate the disclosure risk of the file while maintaining maximum utility of the data. In this paper, we introduce the concepts of disclosure risk, identification, and uniqueness. We also show a method for creating a 2% microdata file from the 2005 Korean census microdata.

Multiple imputation and synthetic data (다중대체와 재현자료 작성)

  • Kim, Joungyoun;Park, Min-Jeong
    • The Korean Journal of Applied Statistics / v.32 no.1 / pp.83-97 / 2019
  • As society develops, the dissemination of microdata has increased in response to the diverse analytical needs of users. Analysis of microdata for policy making, academic purposes, etc. is highly desirable in terms of value creation. However, providing microdata that remains useful carries a risk of exposing personal information. Several methods have been considered to protect personal information while preserving the usefulness of the data. One such method is to generate and use synthetic data. This paper aims to build an understanding of synthetic data by exploring the related methodologies and precautions. To this end, we first explain multiple imputation, the Bayesian predictive model, and the Bayesian bootstrap, which are the basic foundations of synthetic data. We then link these concepts to the construction of fully and partially synthetic data. To illustrate the creation of synthetic data, we review a real longitudinal synthetic data example based on sequential regression multivariate imputation.
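
Of the building blocks named in this abstract, the Bayesian bootstrap is the simplest to sketch. The version below follows the usual textbook form: draw Dirichlet(1, ..., 1) probabilities over the observed values (here via the gaps between sorted uniform draws) and then sample from the observed values with those probabilities. The function name, the toy values, and the choice of three synthetic copies are assumptions for illustration; this is not the procedure reviewed in the paper.

```python
import random


def bayesian_bootstrap_sample(values, size, rng):
    """Draw `size` values using one set of Dirichlet(1, ..., 1) probabilities."""
    n = len(values)
    # Gaps between n - 1 sorted uniforms on [0, 1] are Dirichlet(1, ..., 1) distributed.
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]
    weights = [edges[i + 1] - edges[i] for i in range(n)]
    return rng.choices(values, weights=weights, k=size)


rng = random.Random(0)          # fixed seed so the sketch is repeatable
observed = [21, 35, 35, 48, 52, 60]  # toy observed values (hypothetical)

# Several synthetic copies, in the spirit of multiple imputation.
synthetic_sets = [
    bayesian_bootstrap_sample(observed, len(observed), rng) for _ in range(3)
]
for s in synthetic_sets:
    print(s)
```

Each copy uses a fresh set of probabilities, so the copies differ from one another by more than ordinary resampling noise, which is the property multiple-imputation style analyses rely on.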

Investigations into Coarsening Continuous Variables

  • Jeong, Dong-Myeong;Kim, Jay-J.
    • The Korean Journal of Applied Statistics / v.23 no.2 / pp.325-333 / 2010
  • Protection against disclosure of survey respondents' identifiable and/or sensitive information is a prerequisite for statistical agencies that release microdata files from their sample surveys. Coarsening is one of the popular methods for protecting the confidentiality of the data. Grouped data can be released in the form of microdata or tabular data. Compared with releasing the data in tabular form only, making microdata available to the public with interval codes and their representative values greatly enhances the utility of the data. It allows researchers to compute covariances between variables, build statistical models, and run a variety of statistical tests on the data. It may be conjectured that the variance of the interval data is lower than that of the ungrouped data, in the sense that the coarsened data do not have the within-interval variance. This conjecture is investigated using the uniform and triangular distributions. Traditionally, the midpoint is used to represent all the values in an interval. This approach implicitly assumes that the data are uniformly distributed within each interval. However, this assumption may not hold, especially in the last interval of economic data. In this paper, we use three distributional assumptions (uniform, Pareto, and lognormal) in the last interval and either the midpoint or the median for the other intervals, apply them to wage and food costs in Statistics Korea's 2006 Household Income and Expenditure Survey (HIES) data, and compare these approaches in terms of the first two moments.
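
The conjecture about variance can be checked on a toy case. The sketch below assumes integer values 0 to 99 (a discrete stand-in for a uniform distribution) and intervals of width 25, both chosen only for illustration; it replaces each value by its interval midpoint and compares population variances. It does not use the HIES data or the Pareto and lognormal assumptions discussed in the paper.

```python
from statistics import pvariance


def coarsen_to_midpoint(x, width):
    """Replace x by the midpoint of the interval [lower, lower + width) containing it."""
    lower = (x // width) * width
    return lower + width / 2


raw = list(range(100))  # toy data: evenly spread integers 0..99
coarsened = [coarsen_to_midpoint(x, 25) for x in raw]

var_raw = pvariance(raw)
var_coarsened = pvariance(coarsened)

# The gap between the two is the within-interval variance that coarsening discards.
print(var_raw, var_coarsened, var_raw - var_coarsened)
```

In this toy case the coarsened variance is smaller, and the difference equals the variance of the values inside a single interval, which matches the intuition stated in the abstract.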

Limiting Attribute Disclosure in Randomization Based Microdata Release

  • Guo, Ling;Ying, Xiaowei;Wu, Xintao
    • Journal of Computing Science and Engineering / v.5 no.3 / pp.169-182 / 2011
  • Privacy preserving microdata publication has received wide attention. In this paper, we investigate the randomization approach and focus on attribute disclosure under linking attacks. We give efficient solutions to determine optimal distortion parameters, such that we can maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.
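
A generic form of randomization can be sketched as follows, assuming a uniform perturbation scheme: each sensitive value is kept with probability `p` and otherwise replaced by a value drawn uniformly from the domain. Under that assumption the observed frequency of a category is `p * true + (1 - p) / m` for a domain of size `m`, which can be inverted to reconstruct the original distribution. The scheme, the parameter `p = 0.7`, and the toy domain are assumptions; the paper's method for choosing optimal distortion parameters is not shown.

```python
import random
from collections import Counter


def randomize(values, domain, p, rng):
    """Keep each value with probability p, else replace it by a uniform draw from domain."""
    return [v if rng.random() < p else rng.choice(domain) for v in values]


def reconstruct_distribution(perturbed, domain, p):
    """Invert observed = p * true + (1 - p) / m to estimate the true frequencies."""
    n = len(perturbed)
    m = len(domain)
    counts = Counter(perturbed)
    return {c: (counts[c] / n - (1 - p) / m) / p for c in domain}


rng = random.Random(1)                      # fixed seed for repeatability
domain = ["flu", "cancer", "diabetes"]      # hypothetical sensitive attribute
original = ["flu"] * 600 + ["cancer"] * 300 + ["diabetes"] * 100

perturbed = randomize(original, domain, p=0.7, rng=rng)
estimate = reconstruct_distribution(perturbed, domain, p=0.7)
print(estimate)  # estimates of the original shares 0.6, 0.3, 0.1
```

Estimates from small samples can fall slightly below zero for rare categories; a production implementation would need to handle that, which this sketch does not.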

Comparisons of Young Renter Households' Housing Situation by Locations Reflected in the 2012 Korea Housing Survey (2012년 주거실태조사에 나타난 청년 임차가구의 지역별 주거 실태 비교)

  • Lee, Hyun-Jeong
    • Journal of the Korean housing association / v.26 no.1 / pp.81-90 / 2015
  • The purpose of this study was to investigate the housing characteristics of young renter households by location using licensed microdata of the 2012 Korea Housing Survey. There were 1,020,216 renter households (weighted count) headed by persons between 20 and 34 years of age, and their housing characteristics were compared statistically across residential locations (Capital Region, metropolitan cities, other areas). Major findings are as follows: (1) Capital Region young renters had the worst housing quality, with the greatest proportion of households living in units that failed to meet national minimum housing standards and/or in basement or semi-basement units; (2) Capital Region young renters had the greatest proportion of households with housing cost burdens; and (3) 37.3% of young renter households in metropolitan areas and 33.5% in the Capital Region were found to receive family support in order to afford current rental costs.

Release of Microdata and Statistical Disclosure Control Techniques (마이크로데이터 제공과 통계적 노출조절기법)

  • Kim, Kyu-Seong
    • Communications for Statistical Applications and Methods / v.16 no.1 / pp.1-11 / 2009
  • When microdata are released to users, record-by-record data are disclosed and some risk of disclosing respondents' information is inevitable. Statistical disclosure control techniques are statistical tools that reduce the risk of disclosure while increasing data utility when data are released. In this paper, we review the concepts of disclosure and disclosure risk as well as statistical disclosure control techniques, and then investigate strategies for selecting a statistical disclosure control technique in relation to data utility. The risk-utility frontier map method is illustrated as an example. Finally, we list check points for each step of releasing microdata.

A Study on Performing Join Queries over K-anonymous Tables

  • Kim, Dae-Ho;Kim, Jong Wook
    • Journal of the Korea Society of Computer and Information / v.22 no.7 / pp.55-62 / 2017
  • Recently, there has been an increasing need to share microdata containing information about individual entities. As microdata usually contains sensitive information on individuals, releasing it directly for public use may violate existing privacy requirements. Thus, to avoid the privacy problems that arise from releasing microdata for public use, extensive studies have been conducted in the area of privacy-preserving data publishing (PPDP). The k-anonymity algorithm, which is the most popular method, guarantees that, for each record, the released data include at least k-1 other records that have the same values for a set of quasi-identifier attributes. Given an original table, the corresponding k-anonymous table is obtained by generalizing each record in the table into an indistinguishable group, called the equivalent class, by replacing the specific values of the quasi-identifier attributes with more general values. However, query processing over the anonymized data is a very challenging task because of the generalized attribute values. In particular, the problem becomes more challenging with an equi-join query (the most common type of query in data analysis tasks) over k-anonymous tables, since with generalized attribute values it is hard to determine whether two records are joinable. To address this challenge, we develop a novel scheme that can effectively perform an equi-join between k-anonymous tables. The experimental results show that the proposed method achieves significant gains in accuracy over a naive scheme.
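
The k-anonymity condition described in this abstract can be checked with a few lines of code. The sketch below assumes two quasi-identifiers (`age`, `zip`), a hypothetical generalization rule (10-year age bands and 3-digit ZIP prefixes), and toy records; it only verifies that every group of identical generalized values has at least k records. The equi-join scheme proposed in the paper is not reproduced.

```python
from collections import Counter


def generalize(record):
    """Map quasi-identifiers to coarser values (hypothetical generalization rule)."""
    decade = (record["age"] // 10) * 10
    return (f"{decade}-{decade + 9}", record["zip"][:3] + "**")


def is_k_anonymous(records, k):
    """True if every group of identical generalized quasi-identifiers has >= k records."""
    class_sizes = Counter(generalize(r) for r in records)
    return all(size >= k for size in class_sizes.values())


# Toy table: age and zip are quasi-identifiers, disease is the sensitive attribute.
records = [
    {"age": 23, "zip": "13053", "disease": "flu"},
    {"age": 27, "zip": "13068", "disease": "cold"},
    {"age": 34, "zip": "14850", "disease": "flu"},
    {"age": 38, "zip": "14853", "disease": "asthma"},
]

print(is_k_anonymous(records, 2))  # each generalized group holds two records
print(is_k_anonymous(records, 3))  # no group reaches three records
```

After generalization, a value such as `"20-29"` no longer identifies a single age, which is why deciding whether two generalized records match in an equi-join becomes uncertain.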

Influences on Housing Satisfaction of Multifamily Housing Renter Households in the U.S. Metropolitan Statistical Areas (미국 대도시권역 공동주택 임차가구의 주거 만족도 영향 요인)

  • Lee, Hyun-Jeong
    • Journal of the Korean housing association / v.23 no.2 / pp.125-133 / 2012
  • The purpose of this study was to explore the characteristics and housing satisfaction of multifamily renter households in metropolitan areas using 2009 American Housing Survey public-use microdata. A total of 8,139 multifamily renter households residing in metropolitan statistical areas were selected for data analysis. The findings are as follows: (1) in comparison with other types of households in metropolitan areas, multifamily renter households tended to have a smaller household size, younger householders, and a greater proportion of householders who had never married or had been widowed, divorced or separated; (2) housing cost related variables such as monthly rent or rent per square foot were found not to have a significant influence on the housing satisfaction of multifamily renter households in metropolitan areas; (3) the factors influencing housing satisfaction of multifamily renter households with householders aged 34 years or younger were neighborhood satisfaction, householder's race, structure age and per-person unit size; and (4) neighborhood satisfaction was found to have the strongest influence on the housing satisfaction of multifamily renter households in metropolitan areas.

Housing Cost Burden of Single- or Two-person Households in Their 20s and 30s in the United States (미국 20-30대 1-2인가구의 주거비 부담 실태)

  • Lee, Hyun-Jeong
    • Journal of the Korean housing association / v.23 no.2 / pp.69-77 / 2012
  • The purpose of this study was to explore the housing cost burden of young single- or two-person households in the United States that had recently moved for job-related reasons. A total of 580 households were selected from 2009 American Housing Survey public-use microdata for data analysis. The findings are as follows: (1) the targeted single-person households were characterized as younger households with higher educational attainment, lower household income, and a greater proportion of renters, multifamily housing residents and households with a housing cost burden than other households; (2) two-person households showed a higher income level and a lower housing cost burden; (3) the characteristics that showed significant influences on housing cost burden were household size, householder's age, gender, race and educational attainment, household income level and tenure type; and (4) a linear combination of household size, household income, low-income status, residency in a metropolitan area, and home structural type was found to be the most efficient predictor of a single- or two-person household's housing cost burden regardless of household size.