A Dynamic Locality Sensitive Hashing Algorithm for Efficient Security Applications

  • Mohammad Y. Khanafseh (Computer Science Department, Birzeit University)
  • Ola M. Surakhi (Computer Science Department, American University of Madaba)
  • Received : 2024.05.05
  • Published : 2024.05.30

Abstract

The information retrieval domain deals with the retrieval of unstructured data such as text documents. Document search is a core component of modern information retrieval systems. Locality Sensitive Hashing (LSH) is one of the most popular methods for searching documents in a high-dimensional space. The main benefit of LSH is its theoretical guarantee of query accuracy in a multi-dimensional space. LSH can be enhanced further with small additions to its steps. In this paper, a new Dynamic Locality Sensitive Hashing (DLSH) algorithm is proposed as an improved version of the LSH algorithm. It relies on hierarchical selection of the LSH parameters (number of bands, number of shingles, and number of permutation lists), based on the similarity achieved by the algorithm, to optimize search accuracy and increase its score. The technique was applied to several tampered file structures, and its performance was evaluated. In some circumstances, the matching accuracy of DLSH exceeds 95% when the optimal values are selected for the number of bands, the number of shingles, and the number of permutation lists. This result makes the DLSH algorithm suitable for critical applications that depend on accurate searching, such as forensics technology.
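
To make the three parameters named above concrete, the following is a minimal sketch of standard MinHash-based LSH with banding. It is not the DLSH algorithm itself: the abstract does not describe the hierarchical selection procedure, so none is shown. All function names are hypothetical, "number of shingles" is assumed here to mean the shingle length `k`, and random permutations are approximated by hash functions of the form `(a*x + b) mod PRIME`.

```python
import hashlib
import random

PRIME = 4294967311  # a prime just above 2**32, used as the hash modulus


def shingles(text, k):
    """Return the set of overlapping k-character shingles of text."""
    if len(text) < k:
        return {text}  # assumption: a short text is treated as one shingle
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def stable_hash(shingle):
    """Deterministic 32-bit hash (Python's built-in hash() is salted per run)."""
    digest = hashlib.md5(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def minhash_signature(shingle_set, num_perm, seed=0):
    """One minimum per simulated permutation; the same seed must be used for all documents."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_perm)]
    hashed = [stable_hash(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in hashed) for a, b in coeffs]


def band_keys(signature, num_bands):
    """Split the signature into num_bands equal slices (leftover rows are ignored)."""
    rows = len(signature) // num_bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows]))
            for b in range(num_bands)]


def is_candidate_pair(sig_a, sig_b, num_bands):
    """Two documents are candidates if they agree on every row of at least one band."""
    keys_a = band_keys(sig_a, num_bands)
    keys_b = band_keys(sig_b, num_bands)
    return any(x == y for x, y in zip(keys_a, keys_b))


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_probability(s, rows, bands):
    """Probability that a pair with Jaccard similarity s shares at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands
```

In this sketch, a longer shingle makes matching stricter, more permutations reduce the variance of the similarity estimate, and the band count shifts the similarity threshold at which pairs become candidates. A tuning procedure would presumably vary these values and compare the resulting similarity scores, but how DLSH does so is not specified in the abstract.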
