String Matching Algorithm on Multi-byte Character Set Texts

Kim, Eun-Sang;Kim, Jin-Wook;Park, Kun-Soo;

한국정보과학회논문지:컴퓨팅의 실제 및 레터 (Journal of KIISE:Computing Practices and Letters)

제16권10호
/
Pages.1015-1019
/
2010
/
1229-7712(pISSN)

한국정보과학회 (Korean Institute of Information Scientists and Engineers)

다중바이트 문자집합 텍스트에서의 문자열 검색 알고리즘

String Matching Algorithm on Multi-byte Character Set Texts

김은상 (서울대학교 컴퓨터공학부) ;
김진욱 (서울대학교병원 의료정보센터) ;
박근수 (서울대학교 컴퓨터공학부)

투고 : 2010.07.28
심사 : 2010.08.24
발행 : 2010.10.15

PDF KSCI

PDF 다운로드

⟨ 이전 논문 다음 논문 ⟩

초록

문자열 완전일치 검색 알고리즘용 지금까지 많은 연구가 되어왔지만, EUC-KR 용 다중바이트 문자집합에 대해서는 연구원 것이 부족한 상황이다. 이 논문에서는 기존의 KMP 알고리즘을 사용할 때 EUC-KR과 같은 다중바이트 문자집합 텍스트에서 오검색이 발생할 수 있음을 보이며, 문자 단위의 접두사 함수를 적용하여 오검색이 발생하지 않도록 개선한 KMP 알고리즘을 제안한다. 또한, 널리 사용되고 있는 편집기인 Vim과 Emacs의 검색 알고리즘 및 기존의 오토마타 방식의 연구 결과에 비해 논문에서 제안한 알고리즘이 더 빠른 속도를 보이는 실험 결과를 제시한다.

An extensive research on exact string matching has been done, but there have been few researches on the matching in multi-byte character set texts such as EUC~KR. This paper shows that false matches may occur in multi-byte character set texts such as EUC-KR when using KMP algorithm, and presents a refined KMP algorithm without false matches applying a character-based prefix function. And also, Experimental results show that our algorithm is faster than string matching algorithms of widely used editors, Vim and Emacs, and the existing automata-based algorithm.

키워드

참고문헌

D. E. Knuth, J. H. Morris Jr, and V. R. Pratt. Fast Pattern Matching in Strings. SIAM Journal on Computing, vol.6, pp.323-350, 1977. https://doi.org/10.1137/0206024
Robert S. Boyer and J. Strother Moore. A Fast String Searching Algorithm. Communications of the ACM, vol.20, no.10, pp.762-772, 1977. https://doi.org/10.1145/359842.359859
Cyril Allauzen, Maxime Crochemore, and Mathieu Raffinot. Efficient experimental string matching by weak factor recognition. 12th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol.2089, pp.51-72, 2001.
Masayuki Takeda, Satoru Miyamoto, Takuya Kida, Ayumi Shinohara, Shuichi Fukamachi, Takeshi Shinohara, and Setsuo Arikawa. Processing Text Files as Is: Pattern matching over Compressed Texts, Multibyte Character Texts, and Semi- structured Texts. String Processing and Information Retrieval (SPIRE) 2002, LNCS 2476, pp.170-186, 2002.
Heikki Hyyro, Jun Takaba, Ayumi Shinohara, and Masayuki Takeda. On Bit-parallel Processing of Multi-byte Text. Proceedings of the 1st Asia Information Retrieval Symposium (AIRS) 2004, LNCS 3411, pp.289-300, 2005.
Vim Editor, http://www.vim.org
GNU Emacs Editor, http://www.gnu.org/software/emacs
R. Nigel Horspool. Practical Fast Searching in Strings. Software Practice and Experience, vol.10, no.6, pp.501-506, 1980. https://doi.org/10.1002/spe.4380100608
Korean Industrial Standards, http://www.standard.go.kr
Daniel M. Sunday. A Very Fast Substring Search Algorithm. Communications of the ACM, vol.33, no.8, pp.132-142, 1990 https://doi.org/10.1145/79173.79184
G. De V. Smit. A comparison of three string matching algorithms. Software: Practice and Experience, vol.12, no.1, pp.57-66, 1982. https://doi.org/10.1002/spe.4380120106

한국정보과학회논문지:컴퓨팅의 실제 및 레터 (Journal of KIISE:Computing Practices and Letters)

다중바이트 문자집합 텍스트에서의 문자열 검색 알고리즘

String Matching Algorithm on Multi-byte Character Set Texts

초록

키워드

참고문헌

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

자세히 찾기

이미지 검색 (β)