Privacy-Preserving Clustering on Time-Series Data Using Fourier Magnitudes

Fourier Magnitude-Based Privacy Preservation in Time-Series Data Clustering

  • 김혜숙 (Department of Computer Science, Kangwon National University)
  • 문양세 (Department of Computer Science, Kangwon National University)
  • Published: 2008.12.15

Abstract

In this paper, we propose a privacy-preserving clustering method for time-series data based on Fourier magnitudes. The previous privacy-preserving method, called the DFT coefficient method, has a critical weakness in privacy preservation itself, since the original time-series data may be reconstructed from the privacy-preserved data. In contrast, the proposed DFT magnitude method makes reconstructing the original data almost impossible, because it uses only the DFT magnitudes and discards the DFT phases. We first explain why reconstruction is easy in the DFT coefficient method and why it is difficult in the DFT magnitude method. We then propose the notion of distance-order preservation, which can be used both to estimate clustering accuracy and to select DFT magnitudes. The degree of distance-order preservation measures how many time-series preserve their relative distance orders before and after the privacy-preserving transformation. Using this degree, we present greedy strategies for selecting magnitudes in the DFT magnitude method: the strategies select the DFT magnitudes that maximize the degree of distance-order preservation and thereby achieve relatively high clustering accuracy. Finally, we show empirically that the degree of distance-order preservation is a measure that reflects clustering accuracy well. In addition, experimental results show that the greedy strategies of the DFT magnitude method are comparable to the DFT coefficient method in clustering accuracy. These results indicate that, compared with the DFT coefficient method, the DFT magnitude method provides a much higher degree of privacy preservation together with comparable clustering accuracy.
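The contrast between the two methods can be illustrated with a minimal sketch. It is not the paper's implementation: the sample series is made up, the helper names (`dft`, `idft`, `rmse`) are hypothetical, and all coefficients are used for simplicity, whereas a real deployment might publish only some of them. The zero-phase guess is just one assumed attacker strategy.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (unnormalized)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(X):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(X)
    return [(sum(X[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)) / n).real
            for t in range(n)]

def rmse(a, b):
    """Root-mean-square error between two equal-length series."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

# Made-up example series (an assumption, not data from the paper).
series = [2.0, 4.0, 7.0, 5.0, 3.0, 1.0, 0.0, 2.0]
coeffs = dft(series)

# Case 1: complex coefficients (magnitude and phase) are published.
# The inverse DFT recovers the original series.
from_coeffs = idft(coeffs)

# Case 2: only magnitudes are published. The phases must be guessed;
# here every phase is guessed to be zero.
magnitudes = [abs(c) for c in coeffs]
from_magnitudes = idft([complex(m, 0.0) for m in magnitudes])

err_coeffs = rmse(series, from_coeffs)          # close to zero
err_magnitudes = rmse(series, from_magnitudes)  # clearly non-zero
print(err_coeffs, err_magnitudes)
```

With the complex coefficients, the inverse DFT returns the original series up to rounding error. With magnitudes alone, the zero-phase guess yields a series that has the same magnitudes but a different shape, which is the intuition behind discarding the phases.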

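This paper proposes a DFT magnitude-based privacy-preserving technique for clustering time-series data. The DFT coefficient method, an existing privacy-preserving approach, has a serious problem in terms of privacy preservation because data similar to the original can be reconstructed. In contrast, the proposed DFT magnitude method uses only the magnitudes, excluding the phases, after the DFT, which makes the original data very difficult to reconstruct. We first explain systematically that the existing DFT coefficient method is a function that is easy to invert, whereas the proposed DFT magnitude method is a function that is hard to invert. Next, we propose the concept of the degree of distance-order preservation as a measure that stands in for clustering accuracy and guides the selection of magnitudes. The degree of distance-order preservation measures how well the relative order of objects is preserved before and after the privacy-preserving function is applied. Using this concept, we present greedy strategies for selecting magnitudes in the DFT magnitude method. That is, the proposed greedy strategies select DFT magnitudes so as to maximize the degree of distance-order preservation, with the ultimate aim of raising clustering accuracy. Finally, we show through experiments that the proposed degree of distance-order preservation can serve as a substitute measure for clustering accuracy. We also confirm that the greedy strategies of the proposed DFT magnitude method are not much less accurate than the existing DFT coefficient method. Given these results, the proposed DFT magnitude method not only greatly improves the degree of privacy preservation over the DFT coefficient method but also achieves relatively accurate clustering.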
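The degree of distance-order preservation and the greedy selection of magnitudes can be sketched as follows. The abstract does not give the exact definition, so this sketch assumes one plausible formalization: the fraction of triples (o, a, b) for which the comparison d(o, a) < d(o, b) has the same outcome on the raw series and on the selected DFT magnitudes. The function names, the sample data, and the one-index-at-a-time greedy loop are all assumptions made for illustration, not the paper's algorithm.

```python
import cmath
import math
from itertools import combinations

def dft_magnitudes(x):
    """Magnitudes of the (unnormalized) DFT of a real series."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)))
            for f in range(n)]

def dist(a, b, idx=None):
    """Euclidean distance, optionally restricted to the positions in idx."""
    idx = range(len(a)) if idx is None else idx
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in idx))

def order_preservation(data, mags, idx):
    """Assumed definition: fraction of (o, a, b) triples whose distance
    order is the same on raw series and on the selected magnitudes."""
    kept = total = 0
    for o in range(len(data)):
        others = [i for i in range(len(data)) if i != o]
        for a, b in combinations(others, 2):
            before = dist(data[o], data[a]) < dist(data[o], data[b])
            after = dist(mags[o], mags[a], idx) < dist(mags[o], mags[b], idx)
            kept += before == after
            total += 1
    return kept / total

def greedy_select(data, k):
    """Add, one at a time, the magnitude index that gives the highest degree."""
    mags = [dft_magnitudes(x) for x in data]
    # Magnitudes of a real series are symmetric, so only the first
    # n // 2 + 1 indices carry distinct information.
    candidates = list(range(len(data[0]) // 2 + 1))
    chosen = []
    for _ in range(k):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: order_preservation(data, mags, chosen + [c]))
        chosen.append(best)
    return chosen, order_preservation(data, mags, chosen)

# Made-up example data: five series of length eight.
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0],
    [2.0, 3.0, 4.0, 5.0, 6.0, 5.0, 4.0, 3.0],
    [5.0, 1.0, 5.0, 1.0, 5.0, 1.0, 5.0, 1.0],
    [0.0, 0.0, 1.0, 3.0, 6.0, 3.0, 1.0, 0.0],
    [4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0],
]
chosen, degree = greedy_select(data, 2)
print(chosen, degree)
```

A degree close to 1 means that clustering on the published magnitudes should see nearly the same neighbor relationships as clustering on the raw data, which is why such a measure can stand in for clustering accuracy when choosing which magnitudes to publish.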
