Practical Datasets for Similarity Measures and Their Threshold Values

(Korean title: Similarity Measurement Datasets and Thresholds)

  • Yang, Byoungju (양병주) (Security Solution Division, Samsung Techwin Co.) ;
  • Shim, Junho (심준호) (Division of Computer Science, Sookmyung Women's University)
  • Received : 2013.01.24
  • Accepted : 2013.02.15
  • Published : 2013.02.28

Abstract

In the e-business domain, where the number of data objects is large, measuring similarity to find identical or similar objects is important. Doing so requires computing and comparing the features of objects in pairs, so the running time grows as the amount of data grows. Recent studies have proposed various algorithms that perform this computation more efficiently, and most demonstrate their performance advantage through empirical tests on real data sets. In this paper, we introduce those data sets and analyze their characteristics, together with the meaning of the threshold values used in the experiments and the values that are inherently meaningful for each data set. This analysis of practical data sets with respect to their threshold values may serve as a reference baseline for future experiments on newly developed similarity measurement algorithms.
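To make the pairwise cost and the role of the threshold concrete, the following is a minimal sketch of a naive all-pairs similarity join. It is not the method of any of the studies the abstract refers to. It assumes cosine similarity as the measure and sparse vectors stored as Python dicts. The function names `cosine` and `all_pairs_join` and the toy data are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def all_pairs_join(objects, threshold):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold.

    Every unordered pair is compared, so n objects cost n * (n - 1) / 2
    similarity computations.
    """
    results = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(objects.items(), 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            results.append((id_a, id_b, sim))
    return results


# Toy data (hypothetical): three objects described by weighted features.
objects = {
    "a": {"rock": 1.0, "jazz": 1.0},
    "b": {"rock": 2.0, "jazz": 2.0},
    "c": {"pop": 1.0},
}

similar = all_pairs_join(objects, threshold=0.9)
```

With a threshold of 0.9, only the pair ("a", "b") is kept, because the two vectors point in the same direction. Lowering the threshold to 0.0 keeps all three pairs, which illustrates how the chosen threshold value determines the size of the result.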
