SSF: Sentence Similar Function Based on word2vector Similar Elements

  • Yuan, Xinpan (School of Computer Science, Hunan University of Technology) ;
  • Wang, Songlin (School of Computer Science, Hunan University of Technology) ;
  • Wan, Lanjun (School of Computer Science, Hunan University of Technology) ;
  • Zhang, Chengyuan (School of Information Science and Engineering, Central South University)
  • Received : 2018.11.07
  • Accepted : 2019.03.30
  • Published : 2019.12.31

Abstract

To improve the accuracy of similarity calculation for long sentences, we propose a sentence similarity method based on a system similarity function. The algorithm treats word2vector word vectors as the system elements from which sentence similarity is computed. Its higher accuracy derives from two characteristics: the negative effect of a penalty item, and the fact that the sentence similar function (SSF) based on word2vector similar elements does not satisfy the exchange rule (that is, it is not commutative). In later studies, we found that the time complexity of the algorithm depends on the process of calculating similar elements, so we build an index of potentially similar elements while training the word vectors. Finally, experimental results show that our algorithm is more accurate than the word mover's distance (WMD) and has the shortest query time of the three SSF calculation methods.
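
The ideas named in the abstract (similar elements found through word vectors, a penalty item, a score that changes when the two sentences are swapped, and a precomputed index of potentially similar elements) can be illustrated with a minimal sketch. The abstract does not give the SSF formula, so the scoring rule below is a hypothetical stand-in rather than the authors' function. The toy 2-D vectors, the threshold of 0.8, the penalty weight, the greedy matching, and the names `build_index`, `similar_elements`, and `toy_similarity` are all assumptions made for illustration.

```python
import math

# Toy 2-D stand-ins for word vectors (assumption); real word2vec vectors
# would be learned from a corpus and have far more dimensions.
VECTORS = {
    "cat": (1.0, 0.0),
    "kitten": (0.9, 0.1),
    "dog": (0.0, 1.0),
    "puppy": (0.1, 0.9),
    "car": (-1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_index(vectors, threshold=0.8):
    """Precompute, for every word, the words whose similarity meets the threshold."""
    index = {}
    for word, u in vectors.items():
        neighbours = {}
        for other, v in vectors.items():
            sim = cosine(u, v)
            if sim >= threshold:
                neighbours[other] = sim
        index[word] = neighbours
    return index


def similar_elements(a, b, index):
    """Greedily pair each word of `a` with its best still-unused word of `b`."""
    used = set()
    pairs = []
    for word in a:
        candidates = index.get(word, {})
        best = None
        for j, other in enumerate(b):
            if j in used or other not in candidates:
                continue
            if best is None or candidates[other] > candidates[b[best]]:
                best = j
        if best is not None:
            used.add(best)
            pairs.append((word, b[best], candidates[b[best]]))
    return pairs


def toy_similarity(a, b, index, penalty=0.5):
    """Hypothetical score: matched similarity minus a penalty for unmatched words.

    Normalising by len(a) only is what makes the score asymmetric here;
    this is an illustrative choice, not the formula from the paper.
    """
    if not a:
        return 0.0
    pairs = similar_elements(a, b, index)
    matched = sum(sim for _, _, sim in pairs)
    unmatched = (len(a) - len(pairs)) + (len(b) - len(pairs))
    return max(0.0, (matched - penalty * unmatched) / len(a))


if __name__ == "__main__":
    index = build_index(VECTORS)
    s1 = ["cat", "dog"]
    s2 = ["kitten", "puppy", "car"]
    print(toy_similarity(s1, s2, index))  # s1 compared against s2
    print(toy_similarity(s2, s1, index))  # swapped order gives a different value
```

In this sketch the index turns each similar-element lookup into a dictionary access instead of a fresh cosine computation at query time, which is the kind of saving the abstract attributes to building the index during word vector training.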

Acknowledgement

We are grateful for the support of the research fund of Hunan Province (2018JJ2099, 19C0558, CX1911), the Hunan open fund (projects: Similarity detection cloud platform; Clothing material aggregation and prediction), and the Hunan Education Reform fund (The whole process performance management and evaluation system of innovation and entrepreneurship center).
