An Efficient Algorithm for Streaming Time-Series Matching that Supports Normalization Transform

  • Published : 2006.11.15

Abstract

With recent technical advances in sensors and mobile devices, processing the data streams that these devices generate has become an important research issue. A data stream of real values obtained at consecutive time points is called a streaming time-series. Because streaming time-series have unique features that distinguish them from traditional time-series, the similarity matching problem on streaming time-series must be solved in a new way. In this paper, we propose an efficient algorithm for the streaming time-series matching problem that supports the normalization transform. While existing algorithms compare streaming time-series without any transform, the proposed algorithm compares them after they have been normalization-transformed. The normalization transform is useful for finding time-series that have similar fluctuation trends even though their element values are far apart. The major contributions of this paper are as follows. (1) Using a theorem presented in the context of subsequence matching that supports the normalization transform [4], we propose a simple algorithm for solving the problem. (2) To improve search performance, we extend the simple algorithm to use $k\ (\geq 1)$ indexes. (3) For a given $k$, to achieve optimal search performance of the extended algorithm, we present an approximation method for choosing the $k$ window sizes used to construct the $k$ indexes. (4) Based on the notion of continuity [8] on streaming time-series, we further extend our algorithm so that it can simultaneously obtain the search results for $m\ (\geq 1)$ time points, from the present $t_0$ to a time point $(t_0+m-1)$ in the near future, by retrieving the index only once. (5) Through a series of experiments, we compare the search performance of the algorithms proposed in this paper and show their performance trends according to the values of $k$ and $m$.
To the best of our knowledge, no existing algorithm solves the problem presented in this paper, so we compare the search performance of our algorithms with that of the sequential scan algorithm. The experimental results show that our algorithms outperform the sequential scan algorithm by up to 13.2 times, and that search performance improves as the number of indexes $k$ increases.
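
To make the role of the normalization transform concrete, the following is a minimal Python sketch of a sequential-scan style comparison. It is not the algorithm proposed in the paper. It assumes the common mean and standard-deviation form of normalization and a Euclidean distance with a tolerance `eps`, neither of which the abstract specifies. The names `normalize`, `normalized_distance`, `sequential_scan`, `patterns`, and `window` are hypothetical and chosen only for illustration.

```python
import math


def normalize(seq):
    """Shift seq to zero mean and scale it to unit standard deviation (assumed definition)."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0.0:
        # Assumption: a constant sequence has no fluctuation, so map it to all zeros.
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def normalized_distance(a, b):
    """Euclidean distance between the normalized forms of two equal-length sequences."""
    na, nb = normalize(a), normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))


def sequential_scan(patterns, window, eps):
    """Return the positions of all patterns within eps of window after normalization."""
    return [
        i
        for i, p in enumerate(patterns)
        if len(p) == len(window) and normalized_distance(p, window) <= eps
    ]


# Two hypothetical stored sequences with opposite fluctuation trends.
patterns = [[1.0, 2.0, 3.0, 2.0], [3.0, 2.0, 1.0, 2.0]]
# A recent window with the same shape as patterns[0] but offset by 100.
window = [101.0, 102.0, 103.0, 102.0]

matches = sequential_scan(patterns, window, eps=0.1)
print(matches)
```

Without normalization, `window` is roughly 200 away from both stored sequences in Euclidean distance, so neither would match under a small tolerance. After normalization only the sequence with the same fluctuation trend remains within `eps`. An index-based method such as the one the abstract describes aims to avoid comparing against every stored sequence in this way.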

References

  1. R. Agrawal, C. Faloutsos, and A. Swami, 'Efficient Similarity Search in Sequence Databases,' In Proc. the 4th Int'l Conf. on Foundations of Data Organization and Algorithms (FODO), Chicago, Illinois, pp. 69-84, Oct. 1993 https://doi.org/10.1007/3-540-57301-1_5
  2. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, 'Fast Subsequence Matching in Time-Series Databases,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Minneapolis, Minnesota, pp. 419-429, May 1994 https://doi.org/10.1145/191839.191925
  3. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann, 2005
  4. W.-K. Loh, S.-W. Kim, and K.-Y. Whang, 'A Subsequence Matching Algorithm that Supports Normalization Transform in Time-Series Databases,' Data Mining and Knowledge Discovery, Vol. 9, No. 1, pp. 5-28, July 2004 https://doi.org/10.1023/B:DAMI.0000026902.89522.a3
  5. Y.-S. Moon, K.-Y. Whang, and W.-K. Loh, 'Duality-Based Subsequence Matching in Time-Series Databases,' In Proc. the 17th Int'l Conf. on Data Engineering (ICDE), IEEE, Heidelberg, Germany, pp. 263-272, April 2001 https://doi.org/10.1109/ICDE.2001.914837
  6. B.-K. Yi, H. V. Jagadish, and C. Faloutsos, 'Efficient Retrieval of Similar Time Sequences Under Time Warping,' In Proc. the 14th Int'l Conf. on Data Engineering (ICDE), IEEE, Orlando, Florida, pp. 201-208, Feb. 1998
  7. L. Gao and X. S. Wang, 'Continually Evaluating Similarity-Based Pattern Queries on a Streaming Time Series,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Madison, Wisconsin, pp. 370-381, June 2002 https://doi.org/10.1145/564691.564734
  8. L. Gao, Z. Yao, and X. S. Wang, 'Evaluating Continuous Nearest Neighbor Queries for Streaming Time Series via Pre-Fetching,' In Proc. ACM Int'l Conf. on Information and Knowledge Management (CIKM), McLean, Virginia, pp. 485-492, Nov. 2002 https://doi.org/10.1145/584792.584872
  9. H. Wu, B. Salzberg, and D. Zhang, 'Online Event-driven Subsequence Matching Over Financial Data Streams,' In Proc. of Int'l Conf. on Management of Data, ACM SIGMOD, Paris, France, pp. 23-34, June 2004 https://doi.org/10.1145/1007568.1007574
  10. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, 'Models and Issues in Data Stream Systems,' In Proc. ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (PODS), Madison, Wisconsin, pp. 1-16, June 2002 https://doi.org/10.1145/543613.543615
  11. D. Q. Goldin and P. C. Kanellakis, 'On Similarity Queries for Time-Series Data: Constraint Specification and Implementation,' In Proc. Int'l Conf. on Principles and Practices of Constraint Programming, Cassis, France, pp. 137-153, Sept. 1995 https://doi.org/10.1007/3-540-60299-2_9
  12. E. J. Keogh, 'A Decade of Progress in Indexing and Mining Large Time Series Databases,' In Proc. Int'l Conf. on Very Large Data Bases (VLDB), Tutorial, Seoul, Korea, p. 1268, Sept. 2006
  13. A. J. Frost, R. R. Prechter, and C. J. Collins, Elliott Wave Principle: Key to Market Behavior, John Wiley & Sons, 2001
  14. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, 'The R*-tree: An Efficient and Robust Access Method for Points and Rectangles,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Atlantic City, New Jersey, pp. 322-331, May 1990 https://doi.org/10.1145/93597.98741
  15. S. Berchtold, C. Böhm, and H.-P. Kriegel, 'The Pyramid-Technique: Towards Breaking the Curse of Dimensionality,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Seattle, Washington, pp. 142-153, June 1998
  16. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, 2nd Ed., 1992
  17. S.-H. Lim, H.-J. Park, and S.-W. Kim, 'Using Multiple Indexes for Efficient Subsequence Matching in Time-Series Databases,' In Proc. Int'l Conf. on Database Systems for Advanced Applications (DASFAA), pp. 65-79, Singapore, Apr. 2006 https://doi.org/10.1007/11733836_7
  18. W. R. Stevens and S. A. Rago, Advanced Programming in the UNIX Environment, 2nd Ed., Addison-Wesley, 2005
  19. E. J. Keogh, L. Wei, X. Xi, S.-H. Lee, and M. Vlachos, 'LB_Keogh Supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures,' In Proc. Int'l Conf. on Very Large Data Bases (VLDB), Seoul, Korea, pp. 882-893, Sept. 2006