An Update-Efficient, Disk-Based Inverted Index Structure for Keyword Search on Data Streams

  • Received : 2016.03.07
  • Accepted : 2016.03.15
  • Published : 2016.04.30

Abstract

As social networking services such as Twitter grow in popularity, data streams have become pervasive. Searching the data accumulated from such streams efficiently requires an index structure. In this paper, we propose an update-efficient, disk-based inverted index structure for keyword search on data streams. Whenever new data arrive on the stream, the index must be updated to incorporate them. The traditional inverted index is very inefficient to update in terms of disk I/O, because all index data stored on disk must be read and written back each time the index is updated. To solve this problem, we divide the inverted index into a sequence of inverted indices of exponentially increasing size. New data are first inserted into the smallest index; the small indices are later merged into the larger ones, which yields a small amortized update cost per data item. Furthermore, when indices stored on disk are merged, we minimize the disk I/O cost of the merge operation, reducing the update cost even further. Through various experiments, we compare the update efficiency of the proposed index structure with that of the previous one and show that the proposed structure has a lower update cost.
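The idea of a sequence of indices with exponentially increasing size can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the merge policy or the on-disk layout, so the sketch assumes a simple doubling scheme in which a full buffer is merged upward through every occupied level until it reaches an empty one. The class name `GeometricInvertedIndex`, the parameter `buffer_capacity`, and the counter `postings_written` (a stand-in for disk write cost) are all hypothetical, and plain dictionaries take the place of disk-resident indices.

```python
from collections import defaultdict


class GeometricInvertedIndex:
    """Illustrative only: level i holds an index over buffer_capacity * 2**i documents."""

    def __init__(self, buffer_capacity=2):
        self.buffer_capacity = buffer_capacity
        self.buffer = defaultdict(list)  # smallest index: term -> list of doc ids
        self.buffer_docs = 0
        self.levels = []                 # levels[i] is None (empty) or a dict term -> doc ids
        self.postings_written = 0        # assumed proxy for disk writes

    def insert(self, doc_id, text):
        """Add a document to the smallest index; flush when that index is full."""
        for term in set(text.lower().split()):
            self.buffer[term].append(doc_id)
        self.buffer_docs += 1
        if self.buffer_docs == self.buffer_capacity:
            self._flush()

    def _flush(self):
        """Merge the full buffer with occupied levels until an empty level is found."""
        carry = dict(self.buffer)
        self.buffer = defaultdict(list)
        self.buffer_docs = 0
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append(None)
            if self.levels[level] is None:
                # Only the final merged index is written out, once.
                self.levels[level] = carry
                self.postings_written += sum(len(ids) for ids in carry.values())
                return
            carry = self._merge(self.levels[level], carry)
            self.levels[level] = None
            level += 1

    @staticmethod
    def _merge(older, newer):
        """Concatenate posting lists; older doc ids come first."""
        merged = {term: list(ids) for term, ids in older.items()}
        for term, ids in newer.items():
            merged.setdefault(term, []).extend(ids)
        return merged

    def search(self, term):
        """Collect postings from every index, largest (oldest) level first."""
        term = term.lower()
        result = []
        for index in reversed(self.levels):
            if index is not None:
                result.extend(index.get(term, []))
        result.extend(self.buffer.get(term, []))
        return result


index = GeometricInvertedIndex(buffer_capacity=2)
docs = ["stream index", "keyword search", "stream search", "disk index", "stream"]
for doc_id, text in enumerate(docs):
    index.insert(doc_id, text)
print(index.search("stream"))  # [0, 2, 4]
```

Under this assumed policy, a document takes part in one merge each time its index doubles in size, so its postings are rewritten a number of times that grows logarithmically with the collection size. A single monolithic index, by contrast, would rewrite all earlier postings on every update. A query has to consult every level, which is the price paid for the cheaper updates.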

Keywords

Acknowledgement

Supported by: National Research Foundation of Korea
