Personal Information Detection by Using Naïve Bayes Methodology

Classification of Personal Information Using the Naïve Bayes Methodology

  • Kim, Nam-Won (College of Business Administration, Seoul National University) ;
  • Park, Jin-Soo (Graduate School of Business, Seoul National University)
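  • 김남원 (Department of Business Administration, Graduate School, Seoul National University) ;
  • 박진수 (Graduate School of Business, Seoul National University)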
  • Received : 2012.03.10
  • Accepted : 2012.03.18
  • Published : 2012.03.31

Abstract

As the Internet has grown more popular, many people use it to communicate. With the increasing number of personal homepages, blogs, and social network services, people often expose their personal information online. Although the value of these services cannot be denied, their negative aspects, such as personal information leakage, are a concern. Because it is impossible to manually review every past record posted by every user, an automatic method for detecting personal information is needed. This study proposes a method to detect or classify online documents that contain personal information by analyzing features common to such documents and learning them with the Naïve Bayes algorithm. To select the document classification algorithm, the Naïve Bayes classification algorithm was compared with the Vector Space classification algorithm. The results showed that Naïve Bayes achieves higher precision, recall, F-measure, and accuracy than Vector Space. However, the performance of the Naïve Bayes classification algorithm is still insufficient for real-world application. Lewis, a learning algorithm researcher, states that it is important to improve the quality of category features when applying learning algorithms to a specific domain. He proposes incrementally adding features that depend on related documents, in a step-wise manner. In a second experiment, the algorithm learns these additional dependent features, thereby reducing noise in the features. As a result, the second experiment shows better performance on these measures than the first.
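The classification and evaluation steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy documents, the labels `personal` and `other`, the function names `train_nb`, `predict_nb`, and `evaluate`, and the use of add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Collect class and term counts from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior score."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior plus add-one smoothed log likelihoods (assumed variant).
        score = math.log(class_counts[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(gold, predicted, positive):
    """Precision, recall, F-measure, and accuracy for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}


# Hypothetical toy corpus; the study's real training data is not shown here.
train = [
    ("my phone number and home address".split(), "personal"),
    ("email me at my home address".split(), "personal"),
    ("the weather is sunny today".split(), "other"),
    ("stock market news today".split(), "other"),
]
model = train_nb(train)
test_docs = ["call my phone number".split(), "sunny weather news".split()]
gold = ["personal", "other"]
predicted = [predict_nb(model, d) for d in test_docs]
scores = evaluate(gold, predicted, positive="personal")
print(predicted, scores)
```

Treating `personal` as the positive class mirrors the detection framing of the abstract: precision then measures how many flagged documents really contain personal information, and recall measures how many such documents were found.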

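The growth of the Internet and of individual participation in it has raised awareness of how inefficiently the protection of private information is managed, and a number of studies are addressing this problem. This study uses existing document classification methods to detect or classify documents that contain private information, among the items of privacy that represent an individual's private sphere, that can identify an individual or that an individual may consider sensitive. In the experiments, we attempted to raise the performance measures of an automatic document classification algorithm by adding sub-level training data related to the types of personal information to the existing training data. We also analyzed the types of personal information reported in prior studies in order to suggest how each type can be applied effectively to the algorithm. By additionally training on the personal information types that are influential, together with the training documents classified as related to personal information, we raised the quality of the document features that the algorithm learns. Because of this improvement in feature quality, the evaluation measures obtained with the existing Naïve Bayes method increased.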
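The role of feature quality can be made concrete with the standard multinomial Naïve Bayes decision rule with add-one smoothing, as found in textbook treatments. This is an assumed formulation, since the abstract does not state which variant was used. The symbols are introduced here for illustration: $C$ is the set of classes, $V$ the vocabulary, $T_{ct}$ the count of term $t$ in training documents of class $c$, and $t_1, \dots, t_{n_d}$ the terms of document $d$.

```latex
$$
\hat{c}(d) = \arg\max_{c \in C} \left[ \log P(c) + \sum_{k=1}^{n_d} \log P(t_k \mid c) \right],
\qquad
P(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct'} + |V|}
$$
```

Each feature contributes one log-likelihood term to the score, so features whose counts do not separate the classes add little evidence. This is one way to read why training on more informative, type-specific features could raise the evaluation measures.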

References

  1. 권건보, 개인정보보호와 자기정보통제권, 서울 : 경인문화사, 2005.
  2. 방송통신위원회, 트위터에 노출된 나의 정보는 얼마나 될까?, 방송통신위원회, 2011.
  3. 윤상오, "전자정부 구현을 위한 개인정보보호 정책에 관한 연구 : 정부신뢰 구축의 관점에서", 한국지역정보화학회지, 12권 2호(2009), 1-29.
  4. 이강신, 이기혁, 박진식, 최일훈, 개인정보보호 기초와 활용, 서울 : 미디어그룹 인포더, 2010.
  5. 이창범, 조정현, "APT(Asia-Pacific Telecommunity)개인정보 및 프라이버시 보호 가이드라인 제정 방안 연구", 개인정보분쟁조정위원, 2003.
  6. 조동기, 김성우, "인터넷의 일상화와 개인정보 보호", KISDI 이슈리포트, 11권(2003), 10-11.
  8. 황인호, "개인정보보호 제도에서의 규제에 관한 연구", 공법연구, 30권 4호(2002), 232-232.
  9. Bayes, T., "An essay towards solving a problem in the doctrine of chances", Philosophical Transactions, Vol.53(1763), 370-418. https://doi.org/10.1098/rstl.1763.0053
  10. Boyd, D. M. and N. B. Ellison, "Social Network Sites : Definition, History, and Scholarship", Journal of Computer-Mediated Communication, Vol.13, No.1(2008), 210-230.
  11. Clarke, R., Beyond the OECD Guidelines : Privacy Protection for the 21st Century, (2000), Roger Clarke's Web-Site : http://www.rogerclarke.com/DV/PP21C.html.
  12. Cooley, T. C., Laws of Torts, New York : Praeger, 1888.
  13. Davies, S., Big Brother : Britain's web of surveillance and the new technological order, London : Pan, 1996.
  14. Domingos, P. and M. Pazzani, "Beyond Independence : Conditions for the Optimality of the Simple Bayesian Classifier", Proceedings of the 13th International Conference on Machine Learning, (1996), 105-112.
  15. Gross, R. and A. Acquisti, "Information revelation and privacy in online social networks", WPES '05 Proceedings of the 2005 ACM workshop on Privacy in the electronic society, 2005.
  16. Information Commissioner's Office, Notification Handbook-A Complete Guide to Notification. Information Commissioner, 2001.
  17. Jagatic, T., N. Johnson, M. Jakobsson, and F. Menczer, "Social phishing", Communications of the ACM, Vol.50, No.10(2007), 94-100.
  18. Kobsa, A., "Personalized Hypermedia and International Privacy", Communications of the ACM, Vol.45, No.5(2002), 64-67.
  19. Lampe, C. and N. B. Ellison, "Changes in use and perception of facebook", Proceedings of the 2008 ACM conference on Computer supported cooperative work CSCW '08, (2008), 721-730.
  20. Lee, D. L., H. Chuang, and K. Seamons, "Document Ranking and the Vector-Space Model", IEEE Software, Vol.14, No.2(1997), 67-75. https://doi.org/10.1109/52.582976
  21. Lewis, D. D., Representation and Learning in Information Retrieval, Doctoral Dissertation : The Graduate School of the University of Massachusetts, 1992.
  22. Livingstone, S., "Taking risky opportunities in youthful content creation : teenagers' use of social networking sites for intimacy, privacy and self-expression", New Media & Society, Vol.10, No.3(2008), 393-411. https://doi.org/10.1177/1461444808089415
  23. LoPucki, L. M., "Human Identification Theory and the Identity Theft Problem", Tex. L. Rev., Vol.80(2001), 89-135.
  24. Manning, C. D., P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
  25. Mason, R. O., "Four Ethical Issues of the Information Age", MIS Quarterly, Vol.10, No.1(1986), 5-12. https://doi.org/10.2307/248873
  26. Meeder, B., J. Tam, P. G. Kelley, and L. F. Cranor, "RT@IWantPrivacy : Widespread Violation of Privacy Settings in the Twitter Social Network", Web 2.0 Privacy and Security Workshop, IEEE Symposium on Security and Privacy, 2010.
  27. Mitchell, T. M., Machine Learning, McGraw Hill, 2010.
  28. Parent, W. A., "Privacy : A Brief Survey of the Conceptual Landscape", Santa Clara Computer and High Tech. L. J., Vol.11(1995).
  29. Peng, H., F. Long, and C. Ding, "Feature selection based on mutual information : criteria of max-dependency, max-relevance, and min-redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.27, No.8(2005), 1226-1238. https://doi.org/10.1109/TPAMI.2005.159
  30. Schauer, F., "Internet Privacy and The Public-Private Distinction", Jurimetrics, Vol.38, No.4(1998), 555-564.
  31. Smith, R. E., Ben Franklin's Web Site : Privacy and Curiosity from Plymouth Rock to the Internet, Sheridan Books, 2000.
  32. Solove, D. J., "A taxonomy of privacy", University of Pennsylvania Law Review, Vol.154, No.3(2006), 477-560. https://doi.org/10.2307/40041279
  33. Steinbach, M., G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques", KDD Workshop on Text Mining, 2000.
  34. Tong, S. T., B. Van Der Heide, L. Langwell, and J. B. Walther, "Too Much of a Good Thing? The Relationship Between Number of Friends and Interpersonal Impressions on Facebook", Journal of Computer-Mediated Communication, Vol.13, No.3(2008), 531-549. https://doi.org/10.1111/j.1083-6101.2008.00409.x
  35. Wacks, R., Privacy, New York : Oxford University Press, 1993.
  36. Warren, S. D. and L. D. Brandeis, "The Right to Privacy", Harvard Law Review, Vol.4, No.5(1890), 193-220. https://doi.org/10.2307/1321160
  37. Weible, R. J., Privacy and data : An empirical study of the influence of types of data and situational context upon privacy perceptions, D.B.A. : Mississippi State University, 1993.
  38. Wolak, J., D. Finkelhor, K. J. Mitchell, and M. L. Ybarra, "Online 'Predators' and Their Victims", Psychology of Violence, Vol.1, No.1(2010), 13-35. https://doi.org/10.1037/2152-0828.1.S.13
