Multi-Level Sequence Alignment : An Adaptive Control Method Between Speed and Accuracy for Document Comparison

Seo, Jong-Kyu;Tak, Haesung;Cho, Hwan-Gue;

doi:10.5626/JOK.2014.41.9.728

Journal of KIISE (정보과학회 논문지)

Volume 41 Issue 9 / Pages 728-743 / 2014 / 2383-630X (pISSN) / 2383-6296 (eISSN)

Korean Institute of Information Scientists and Engineers (한국정보과학회)

Korean title (translated): A Multi-Level Document Comparison System with Adaptive Control of Computation Speed and Accuracy

Seo, Jong-Kyu (Department of Electrical and Computer Engineering, Pusan National Univ.) ;
Tak, Haesung (Department of Electrical and Computer Engineering, Pusan National Univ.) ;
Cho, Hwan-Gue (School of Computer Science and Engineering, Pusan National Univ.)

Received : 2014.05.20
Accepted : 2014.08.08
Published : 2014.09.15

https://doi.org/10.5626/JOK.2014.41.9.728

Abstract

Fingerprinting and sequence alignment are well-known approaches to document similarity comparison. Fingerprinting is simple and fast, but it cannot locate the specific regions that are similar. Sequence alignment identifies similar regions by arranging the sequences of two strings against each other; it can locate those regions, but at the cost of more computing time. Multi-Level Alignment (MLA) is a new method designed to combine the advantages of both. MLA divides the input documents into blocks of uniform length, extracts fingerprints from each block, and computes the similarity of each block pair by comparing their fingerprints, which produces a similarity table. Sequence alignment is then applied to the similarity table to identify the longest similar regions. MLA lets users change the block size to control the balance between the fingerprinting algorithm and sequence alignment. Because a document is divided into blocks, a similar region can be fragmented across two or more blocks. To solve this fragmentation problem, we propose a united block method. Experimentally, we show that computing document similarity with united blocks is more accurate than the original MLA method, with a minor loss in time.
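The pipeline described in the abstract (blocks, per-block fingerprints, a similarity table, then alignment over that table) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the character 3-gram fingerprints, the Jaccard similarity, the `threshold` and `gap` values, and all function names are assumptions chosen for illustration, and a Smith-Waterman-style local alignment stands in for whatever alignment variant the paper uses.

```python
from typing import List, Optional, Set, Tuple


def split_blocks(text: str, block_size: int) -> List[str]:
    """Cut text into consecutive blocks of block_size characters."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def fingerprints(block: str, k: int = 3) -> Set[str]:
    """Assumed fingerprint: the set of character k-grams in the block."""
    if len(block) < k:
        return {block} if block else set()
    return {block[i:i + k] for i in range(len(block) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two fingerprint sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_table(doc_a: str, doc_b: str, block_size: int, k: int = 3) -> List[List[float]]:
    """table[i][j] = similarity of block i of doc_a and block j of doc_b."""
    fa = [fingerprints(b, k) for b in split_blocks(doc_a, block_size)]
    fb = [fingerprints(b, k) for b in split_blocks(doc_b, block_size)]
    return [[jaccard(x, y) for y in fb] for x in fa]


def local_align(table: List[List[float]], threshold: float = 0.5,
                gap: float = 0.5) -> Tuple[float, List[Tuple[int, int]]]:
    """Local alignment over the block similarity table.

    Matching block i with block j scores table[i][j] - threshold, so only
    sufficiently similar blocks extend a region. Returns the best score and
    the aligned (block_a, block_b) index pairs of the best region.
    """
    n = len(table)
    m = len(table[0]) if n else 0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move: List[List[Optional[str]]] = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0.0, None),  # start a new region here
                (score[i - 1][j - 1] + table[i - 1][j - 1] - threshold, "diag"),
                (score[i - 1][j] - gap, "up"),
                (score[i][j - 1] - gap, "left"),
            ]
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell to recover the aligned block pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = best_pos
    while i > 0 and j > 0 and move[i][j] is not None:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs
```

In this sketch, `block_size` is the single knob: very small blocks make the table large and shift the work toward alignment, while one block per document reduces the method to a single fingerprint comparison.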

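Korean abstract (translated): Fingerprinting and sequence alignment are widely known methods for comparing similar documents. Fingerprinting is fast but less accurate, whereas sequence alignment is slow but more accurate. Multi-level alignment is a new document similarity measure that compares documents while adjusting the relative weight of the two methods; it is designed to obtain the advantages of each while compensating for their drawbacks. In particular, the core of the proposed system is that the weight of the two comparison methods can be controlled through a single variable, the "block size". Multi-level alignment divides a document into blocks of fixed length, extracts fingerprints, computes the similarity between blocks, and then searches the result once more with sequence alignment. During this division, a similar region may be split across two or more blocks. This paper describes the multi-level alignment method, presents a fragmentation removal technique and a heuristic comparison method for improving similarity comparison performance, and shows the results experimentally.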
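The fragmentation problem can be made concrete with a small Python sketch. The abstract does not say how united blocks are built, so the construction here (joining each pair of adjacent blocks before fingerprinting) is only one hypothetical reading. The 3-gram fingerprints, the containment measure, and all function names are likewise assumptions for illustration.

```python
from typing import List, Set


def split_blocks(text: str, block_size: int) -> List[str]:
    """Cut text into consecutive blocks of block_size characters."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def kgrams(text: str, k: int = 3) -> Set[str]:
    """Assumed fingerprint: the set of character k-grams in the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def containment(query: Set[str], target: Set[str]) -> float:
    """Fraction of the query's fingerprints that also occur in the target."""
    return len(query & target) / len(query) if query else 0.0


def best_plain(block: str, doc: str, block_size: int) -> float:
    """Best containment of `block` in any single block of `doc`."""
    q = kgrams(block)
    return max(containment(q, kgrams(b)) for b in split_blocks(doc, block_size))


def best_united(block: str, doc: str, block_size: int) -> float:
    """Best containment of `block` in any pair of adjacent blocks of `doc`.

    Hypothetical united block: two neighbouring blocks joined into one
    string, so k-grams that span the block boundary are recovered.
    """
    blocks = split_blocks(doc, block_size)
    united = [a + b for a, b in zip(blocks, blocks[1:])] or blocks
    q = kgrams(block)
    return max(containment(q, kgrams(u)) for u in united)
```

For example, with `block_size = 10`, the passage `"ABCDEFGHIJ"` embedded in `"zzzzzABCDEFGHIJyyyyy"` falls across the boundary between the two blocks `"zzzzzABCDE"` and `"FGHIJyyyyy"`. Each single block contains only 3 of the passage's 8 trigrams, while the united block contains all of them.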
