A Method for Measuring Similarity Measure of Thesaurus Transformation Documents using DBSCAN

A Method for Measuring the Similarity of Synonym-Converted Documents Using DBSCAN (translated from the Korean title)

  • Kim, Byeongsik (Dept. of Software Convergence Engineering, Chosun University)
  • Shin, Juhyun (Dept. of ICT Convergence, Chosun University)
  • Received : 2018.04.09
  • Accepted : 2018.06.20
  • Published : 2018.09.30

Abstract

In some cases, an author presents the core content of another person's work as their own thoughts by rewording it without citing the source. The free CopyKiller service, which is widely used for plagiarism checking, flags plagiarism when six or more words match. However, a six-word match is not a sufficient criterion when words have been replaced with synonyms. Therefore, in this paper, we construct word clusters using the DBSCAN algorithm, find synonyms, convert the words in each cluster into a representative synonym, and construct L-R tables through L-R parsing. We then propose a method for determining the similarity of documents by applying weights to the thesaurus and weights to each paragraph of the thesis.
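As a rough illustration of the first part of this pipeline (cluster words, map each cluster to a representative synonym, then compare documents), the following Python sketch may help. It is not the authors' implementation: the 2-D word vectors, the `eps` and `min_pts` values, the alphabetical choice of representative, and the use of cosine similarity over word counts are all assumptions made for illustration, and the L-R parsing and weighting steps are omitted.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors; a real system would learn embeddings from a corpus.
VECTORS = {
    "big": (1.0, 1.0), "large": (1.1, 1.0), "huge": (1.0, 1.1),
    "fast": (5.0, 5.0), "quick": (5.1, 5.0),
    "house": (9.0, 1.0),
}

def dbscan(points, eps, min_pts):
    """Plain DBSCAN over a {key: vector} dict; returns {key: cluster_id}, -1 = noise."""
    labels = {}
    cluster_id = -1

    def neighbors(key):
        # All points within eps of `key`, including `key` itself.
        return [k for k in points if math.dist(points[key], points[k]) <= eps]

    for key in points:
        if key in labels:
            continue
        seeds = neighbors(key)
        if len(seeds) < min_pts:
            labels[key] = -1  # provisional noise; may become a border point later
            continue
        cluster_id += 1
        labels[key] = cluster_id
        queue = [k for k in seeds if k != key]
        while queue:
            other = queue.pop()
            if labels.get(other) == -1:
                labels[other] = cluster_id  # former noise becomes a border point
            if other in labels:
                continue
            labels[other] = cluster_id
            reachable = neighbors(other)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # `other` is a core point: keep expanding
    return labels

def representatives(labels):
    """Map each clustered word to one representative (assumed: alphabetically first)."""
    clusters = {}
    for word, cid in labels.items():
        if cid != -1:
            clusters.setdefault(cid, []).append(word)
    return {w: min(words) for words in clusters.values() for w in words}

def convert(text, rep):
    """Lowercase, split on whitespace, and replace words by their representative."""
    return [rep.get(w, w) for w in text.lower().split()]

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

labels = dbscan(VECTORS, eps=0.5, min_pts=2)
rep = representatives(labels)

doc_a = "the big house is fast"
doc_b = "the huge house is quick"

raw_score = cosine(doc_a.split(), doc_b.split())                   # about 0.6
converted_score = cosine(convert(doc_a, rep), convert(doc_b, rep))  # about 1.0
print(raw_score, converted_score)
```

In this toy example the two sentences differ only by synonyms, so the score rises from about 0.6 to about 1.0 once both are rewritten with the cluster representatives. The word "house" has no neighbor within `eps`, so DBSCAN labels it as noise and it is left unchanged.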
