An Efficient M-way Stream Join Algorithm Exploiting a Bit-vector Hash Table

  • 권태형 (Department of Computer Science, KAIST);
  • 김현규 (Department of Computer Science, KAIST);
  • 이유원 (Department of Computer Science, KAIST);
  • 김명호 (Department of Computer Science, KAIST)
  • Published: 2008.08.15

Abstract

MJoin was proposed as an algorithm for efficiently joining multiple data streams whose characteristics change unpredictably. It extends the symmetric hash join to handle multiple data streams: whenever a tuple arrives from a remote stream source, MJoin checks the hash tables one after another to see whether every one of them holds a tuple with the same key. However, when a join involves many data streams and has low join selectivity, the cost of this check depends heavily on the order in which the hash tables are probed. In this paper, we propose BiHT-Join, an algorithm that is built on the symmetric hash join like MJoin but performs this check in constant time regardless of the join order. BiHT-Join maintains a bit-vector that records the existence of tuples in the streams and decides whether a join succeeds or fails by comparing bit-vectors. Based on this decision, it performs the hash join only for tuples whose join will succeed, which improves join efficiency. Our experimental results show that BiHT-Join outperforms MJoin when processing multiple data streams.
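
The bit-vector check described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `BitVectorJoin`, the `insert` method, and the choice of one bit-vector per join key are assumptions made for illustration, and window expiry and memory limits are left out entirely. Each stream has its own hash table, and a separate table maps every join key to a bit-vector in which bit i is set once stream i holds a tuple with that key. A single integer comparison then decides whether the join can succeed, so no hash table is probed for a tuple that cannot join.

```python
from collections import defaultdict
from itertools import product


class BitVectorJoin:
    """Hypothetical sketch of an m-way equi-join guarded by per-key bit-vectors."""

    def __init__(self, num_streams):
        self.num_streams = num_streams
        # Bit-vector with one bit set for every stream.
        self.full_mask = (1 << num_streams) - 1
        # One hash table per stream: join key -> list of payloads.
        self.tables = [defaultdict(list) for _ in range(num_streams)]
        # Join key -> bit-vector; bit i is set if stream i holds that key.
        self.presence = defaultdict(int)

    def insert(self, stream_id, key, payload):
        """Store an arriving tuple and return the join results it produces."""
        self.tables[stream_id][key].append(payload)
        self.presence[key] |= 1 << stream_id

        # One comparison, independent of the number of streams and of any
        # probing order: the join fails unless every stream holds the key.
        if self.presence[key] != self.full_mask:
            return []

        # The join is known to succeed, so only now touch the hash tables.
        # The arriving tuple is combined with the matches of all other streams.
        candidates = [
            [payload] if i == stream_id else self.tables[i][key]
            for i in range(self.num_streams)
        ]
        return [tuple(combo) for combo in product(*candidates)]
```

For example, with three streams, inserting a tuple with key `"k"` into streams 0 and 1 returns an empty list each time, and the first tuple with key `"k"` that arrives on stream 2 returns one combined result. An MJoin-style operator would instead probe the other hash tables in some order and stop at the first miss, which is why its cost depends on that order.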
