Implementation of a Plagiarism Detecting System with Sentence and Syntactic Word Similarities

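  • 맹주수 (Department of e-Learning, Korea National Open University) ;
  • 박지수 (Convergence Software Education Institute, Dongguk University) ;
  • 손진곤 (Department of Computer Science, Korea National Open University)
  • Received : 2018.10.08
  • Reviewed : 2018.11.20
  • Published : 2019.03.31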

Abstract

Most existing plagiarism detecting systems measure document similarity from the frequency of shared words obtained through morphological analysis. This method has limits in measuring similarity accurately in three cases: when documents on the same topic naturally share many similar words, when only some sentences are excerpted and plagiarized, and when postpositions and word endings are also similar. To overcome this problem, we designed and implemented a plagiarism detecting system that measures sentence similarity and syntactic word (eojeol) similarity in addition to the conventional shared-word frequency similarity, and so provides more reliable similarity information. We compared our system with a conventional system that uses only word similarity. In the experiment, measuring sentence similarity revealed cases in which plagiarism was carried out sentence by sentence, and additionally measuring syntactic word similarity revealed cases of partial plagiarism in which even the postpositions and endings were copied unchanged. The comparison showed that our system detects plagiarized documents that the conventional system detects as well as ones that it misses.
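The three measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the formulas, so cosine similarity over frequency counts, verbatim sentence matching, the hard-coded `PARTICLES` list (a crude stand-in for a real morphological analyzer), and all function names are assumptions made only for illustration.

```python
from collections import Counter
import math
import re

# Hypothetical, tiny list of Korean postpositions used as a stand-in
# for morphological analysis; a real system would use an analyzer.
PARTICLES = ("은", "는", "이", "가", "을", "를", "에", "의", "도")


def split_sentences(text):
    """Split after sentence-final punctuation (a simple assumed segmenter)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def eojeols(text):
    """Whitespace-delimited tokens with particles and endings left attached."""
    tokens = (t.strip(".,!?\"'()") for t in text.split())
    return [t for t in tokens if t]


def strip_particle(token):
    """Remove one trailing particle, approximating a word stem."""
    for suffix in PARTICLES:
        if len(token) > len(suffix) and token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm = math.sqrt(
        sum(v * v for v in freq_a.values()) * sum(v * v for v in freq_b.values())
    )
    return dot / norm if norm else 0.0


def sentence_similarity(doc_a, doc_b):
    """Fraction of sentences in doc_a that appear verbatim in doc_b."""
    sents_a = split_sentences(doc_a)
    sents_b = set(split_sentences(doc_b))
    if not sents_a:
        return 0.0
    return sum(s in sents_b for s in sents_a) / len(sents_a)


def similarity_report(doc_a, doc_b):
    """Return the three scores side by side; no combined score is assumed."""
    eoj_a, eoj_b = eojeols(doc_a), eojeols(doc_b)
    words_a = [strip_particle(t) for t in eoj_a]
    words_b = [strip_particle(t) for t in eoj_b]
    return {
        "word": cosine_similarity(words_a, words_b),
        "sentence": sentence_similarity(doc_a, doc_b),
        "eojeol": cosine_similarity(eoj_a, eoj_b),
    }
```

For the pair "표절은 나쁘다. 학생이 보고서를 썼다." and "표절이 나쁘다. 학생이 보고서를 썼다.", this sketch returns a word score of 1.0, a sentence score of 0.5, and an eojeol score of 0.8. The word score alone cannot distinguish the two documents, while the eojeol score drops because one postposition differs, and the sentence score shows that exactly one sentence was copied unchanged.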

Fig. 1. The Entire System Chart

Fig. 2. Similarity Analysis based on the Frequency of Shared Words

Fig. 3. Similarity Analysis based on the Frequency of Shared Syntactic Words

Fig. 4. Database Structure

Fig. 5. The Examples of Detecting Similar Sentences

Fig. 6. The Examples of Detecting the Same Sentences

Table 1. Similarity Analysis based on the Frequency of Shared Sentences

Table 2. Syntactic Word Similarity and Final Point

Table 3. System Environment

Table 4. Similarities of Matched Documents

Table 5. Word Similarity Result Comparison

Table 6. Sentence Similarity

Table 7. Syntactic Word Similarity
