An Efficient Block Index Scheme with Segmentation for Spatio-Textual Similarity Join

  • Xiang, Yiming (College of Computer & Information Engineering, Zhejiang Gongshang University)
  • Zhuang, Yi (College of Computer & Information Engineering, Zhejiang Gongshang University)
  • Jiang, Nan (Hangzhou First People's Hospital)
  • Received : 2016.09.24
  • Accepted : 2017.03.29
  • Published : 2017.07.31

Abstract

Given two collections of objects that carry both spatial and textual information in the form of tags, a Spatio-Textual-based object Similarity JOIN (ST-SJOIN) retrieves the pairs of objects that are textually similar and spatially close. In this paper, we propose a block index-based approach called BIST-JOIN for efficient ST-SJOIN processing. In this approach, a dual-feature distance plane (DFDP) is first partitioned into blocks based on four segmentation schemes, and the ST-SJOIN is then transformed into a search for the object pairs that fall in the affected blocks of the DFDP. Extensive experiments on real and synthetic datasets demonstrate that the proposed join method outperforms state-of-the-art solutions.
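To make the join predicate concrete, the following is a minimal brute-force sketch of an ST-SJOIN. It is not the BIST-JOIN method and does not use the DFDP or any block index. The abstract does not state which measures are used, so Jaccard similarity over tag sets, Euclidean distance, the thresholds `eps` and `theta`, and the names `jaccard` and `naive_st_sjoin` are all assumptions made for illustration.

```python
from itertools import product
from math import hypot


def jaccard(a, b):
    """Jaccard similarity of two tag sets (assumed textual measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def naive_st_sjoin(R, S, eps, theta):
    """Return index pairs (i, j) whose objects are spatially close and textually similar.

    R, S  : lists of ((x, y), set_of_tags) objects (hypothetical layout)
    eps   : maximum Euclidean distance (assumed spatial threshold)
    theta : minimum Jaccard similarity (assumed textual threshold)
    """
    result = []
    # Nested-loop baseline: every pair from R x S is checked against both predicates.
    for (i, (loc_r, tags_r)), (j, (loc_s, tags_s)) in product(enumerate(R), enumerate(S)):
        spatial = hypot(loc_r[0] - loc_s[0], loc_r[1] - loc_s[1])
        if spatial <= eps and jaccard(tags_r, tags_s) >= theta:
            result.append((i, j))
    return result


# Toy data, invented purely for illustration.
R = [((0.0, 0.0), {"lake", "sunset"}), ((5.0, 5.0), {"city", "night"})]
S = [((0.3, 0.4), {"lake", "sunset", "boat"}), ((5.1, 5.0), {"forest"})]
pairs = naive_st_sjoin(R, S, eps=1.0, theta=0.5)
print(pairs)
```

This baseline compares all |R| x |S| pairs, which is the quadratic cost that an index-based approach such as the one described above aims to avoid.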
