k-NN Join Based on LSH in Big Data Environment

  • Ji, Jiaqi (Department of Information Center, Hebei Normal University for Nationalities)
  • Chung, Yeongjee (Department of Computer Engineering, Wonkwang University)
  • Received : 2017.12.10
  • Accepted : 2018.01.19
  • Published : 2018.06.30


k-Nearest neighbor join (k-NN Join) is a computationally intensive operation that finds, for every object in a dataset R, its k nearest neighbors in another dataset S. Most related studies on k-NN Join assume a single machine. As data dimensionality and volume increase, a single machine can no longer produce results in reasonable time. To address this scalability problem, we introduce a locality-sensitive hashing (LSH) k-NN Join algorithm implemented in Spark for high-dimensional big data. LSH maps similar objects to the same bucket with high probability, which narrows the search scope for each object. To parallelize the algorithm across multiple machines, the Spark framework is used to accelerate the computation of distances between objects on a cluster. Results show that the proposed approach is fast and accurate on high-dimensional, large-scale data.
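The bucketing idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' Spark implementation: the hash family (a p-stable random projection, h(v) = floor((a·v + b) / w)), the function names `make_hash` and `lsh_knn_join`, and the parameters `num_projections` and `bucket_width` are all assumptions chosen for illustration. Objects of S are grouped by hash signature, and each object of R is compared only against the S objects in its own bucket, so the result is approximate and may contain fewer than k neighbors.

```python
import math
import random
from collections import defaultdict


def make_hash(dim, num_projections, bucket_width, seed=0):
    """Build a hypothetical p-stable LSH signature function.

    Each component is floor((a . v + b) / w) with Gaussian a and uniform b.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(num_projections)]
    offsets = [rng.uniform(0.0, bucket_width) for _ in range(num_projections)]

    def signature(v):
        return tuple(
            math.floor((sum(a * x for a, x in zip(plane, v)) + b) / bucket_width)
            for plane, b in zip(planes, offsets)
        )

    return signature


def lsh_knn_join(R, S, k, num_projections=2, bucket_width=4.0, seed=0):
    """Approximate k-NN join: for each r in R, up to k nearest S indices.

    Only S objects that share r's bucket are considered (an assumption of
    this sketch; real systems often probe several hash tables).
    """
    signature = make_hash(len(R[0]), num_projections, bucket_width, seed)

    # Group S by signature; in a cluster setting this key could drive the shuffle.
    buckets = defaultdict(list)
    for j, s in enumerate(S):
        buckets[signature(s)].append(j)

    result = {}
    for i, r in enumerate(R):
        candidates = buckets.get(signature(r), [])
        # Exact Euclidean distances are computed only within the bucket.
        ranked = sorted(candidates, key=lambda j: math.dist(r, S[j]))
        result[i] = ranked[:k]
    return result


if __name__ == "__main__":
    R = [(0.0, 0.0), (10.0, 10.0)]
    S = [(0.1, 0.0), (0.0, 0.2), (10.1, 9.9), (50.0, 50.0)]
    print(lsh_knn_join(R, S, k=2))
```

In a distributed setting, the signature would plausibly serve as the join key so that each bucket's distance computations run on a separate worker. Spark MLlib ships a `BucketedRandomProjectionLSH` class built on the same hash family; whether the paper uses it is not stated here.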


Supported by : Wonkwang University

