• Title/Summary/Keyword: string edit distance

Word Similarity Calculation by Using the Edit Distance Metrics with Consonant Normalization

  • Kang, Seung-Shik
    • Journal of Information Processing Systems / v.11 no.4 / pp.573-582 / 2015
  • Edit distance metrics are widely used in applications such as string comparison and spelling error correction. Hamming distance is a metric for two equal-length strings, and Damerau-Levenshtein distance is a well-known metric for making spelling corrections through string-to-string comparison. These metrics seem to be appropriate for alphabetic languages such as English and other European languages. However, the conventional edit distance criterion is not the best method for agglutinative languages like Korean. The reason is that a Korean character, called a syllable, is composed of two or more letter units. This syllable-based word construction makes edit distance calculation inefficient for Korean. We therefore explore a new edit distance method that uses consonant normalization and a normalization factor.
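
The two conventional metrics named in this abstract can be sketched as follows. This is a minimal illustration only: it implements the restricted Damerau-Levenshtein variant (optimal string alignment), which is an assumption since the abstract does not say which variant is meant, and it does not implement the paper's consonant normalization. The function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Allowed unit-cost operations: insert, delete, substitute, and
    transpose two adjacent characters.
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "teh" -> "the"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

Counting a transposition as one operation is what makes this family of metrics attractive for spelling correction, since swapped adjacent letters are a common typing error.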

Modified Edit Distance Method for Finding Similar Words in Various Smartphone Keypad Environment (다양한 스마트폰 키패드 환경에서 유사 단어 검색을 위한 수정된 편집 거리 계산 방법)

  • Song, Yeong-Kil;Kim, Hark-Soo
    • The Journal of the Korea Contents Association / v.11 no.12 / pp.12-18 / 2011
  • Most smartphones use touch-based virtual keypads. Virtual keypads often cause typographical errors because of the physical limitations of the device, such as a small screen and limited input methods. To resolve this problem, many methods for finding similar words have been studied. In this paper, we propose an edit distance method (a well-known string similarity measure) that is modified to consider various types of virtual keypads. The proposed method effectively covers typographical errors on various keypads by converting an input string into a physical key sequence and by reflecting the characteristics of virtual keypads in the edit scores. In experiments with various keypads, the proposed method performed better than a typical edit distance method.
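
One way to reflect keypad characteristics in edit scores is to make substitutions between neighbouring keys cheaper than substitutions between distant keys. The sketch below is an assumption-laden illustration, not the authors' scoring scheme: the QWERTY grid, the adjacency rule (ignoring row stagger), and the 0.5 cost are all hypothetical choices.

```python
# Hypothetical layout: a plain QWERTY grid, ignoring the physical row stagger.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def sub_cost(x: str, y: str) -> float:
    """Substitution cost: 0 if equal, 0.5 for neighbouring keys, else 1."""
    if x == y:
        return 0.0
    if x in POS and y in POS:
        (r1, c1), (r2, c2) = POS[x], POS[y]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5  # assumed discount for a likely "fat finger" slip
    return 1.0


def keypad_distance(a: str, b: str) -> float:
    """Edit distance with keypad-aware substitution costs (two-row DP)."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1.0,               # delete x
                cur[j - 1] + 1.0,            # insert y
                prev[j - 1] + sub_cost(x, y) # match or substitute
            ))
        prev = cur
    return prev[-1]
```

With these assumed costs, a candidate word that differs from the input only by a neighbouring-key slip ranks ahead of one that differs by an unrelated key.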

Edit Distance Problem for the Korean Alphabet with Phoneme Classification System (음소의 분류 체계를 이용한 한글 편집 거리 알고리즘)

  • Roh, Kang-Ho;Park, Kun-Soo;Cho, Hwan-Gue;Chang, So-Won
    • Journal of KIISE:Computer Systems and Theory / v.37 no.6 / pp.323-329 / 2010
  • The edit distance problem is to find the minimum number of edit operations needed to transform one string into another. It is one of the important problems in algorithm research, and there are algorithms that compute an optimal edit distance for one-dimensional languages such as the English alphabet. However, there has been little research on the edit distance for more complicated languages such as the Korean or Chinese alphabet. In this paper, we define a measure of the edit distance for the Korean alphabet that uses the phoneme classification system to improve the previous edit distance algorithm, and we present an algorithm for the edit distance problem for the Korean alphabet.

Parallel Computation For The Edit Distance Based On The Four-Russians' Algorithm (4-러시안 알고리즘 기반의 편집거리 병렬계산)

  • Kim, Young Ho;Jeong, Ju-Hui;Kang, Dae Woong;Sim, Jeong Seop
    • KIPS Transactions on Computer and Communication Systems / v.2 no.2 / pp.67-74 / 2013
  • Approximate string matching problems have been studied in diverse fields. Recently, fast approximate string matching algorithms have been used to reduce the time and cost of next-generation sequencing. To measure the amount of error between two strings, we use a distance function such as the edit distance. Given two strings X (|X| = m) and Y (|Y| = n) over an alphabet $\Sigma$, the edit distance between X and Y is the minimum number of edit operations needed to convert X into Y. The edit distance between X and Y can be computed using the well-known dynamic programming technique in O(mn) time and space. The edit distance can also be computed using the Four-Russians' algorithm, whose preprocessing step runs in $O((3|\Sigma|)^{2t}t^2)$ time and $O((3|\Sigma|)^{2t}t)$ space and whose computation step runs in O(mn/t) time and O(mn) space, where t is the block size. In this paper, we present a parallelized version of the computation step of the Four-Russians' algorithm. Our algorithm computes the edit distance between X and Y in O(m+n) time using m/t threads. We implemented both the sequential version and our parallelized version of the Four-Russians' algorithm using CUDA to compare their execution times. When t = 1 and t = 2, our algorithm runs about 10 times and 3 times faster than the sequential algorithm, respectively.
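
The O(mn) dynamic programming table mentioned in this abstract, and the reason a parallel version can finish in O(m+n) steps, can be sketched as follows. This is a sequential Python simulation under stated assumptions: it processes single cells rather than Four-Russians blocks, it is not the authors' CUDA implementation, and the function names are hypothetical. The point it illustrates is that all cells on one anti-diagonal depend only on the two previous anti-diagonals, so they could be filled concurrently.

```python
def _init_table(m: int, n: int) -> list:
    """Table with the first row and column set to the trivial distances."""
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    return d


def edit_distance(x: str, y: str) -> int:
    """Row-by-row dynamic programming in O(mn) time and space."""
    m, n = len(x), len(y)
    d = _init_table(m, n)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    return d[m][n]


def edit_distance_wavefront(x: str, y: str) -> int:
    """Same table, filled one anti-diagonal (i + j = k) at a time.

    There are at most m + n - 1 anti-diagonals; the inner loop is the
    part a parallel implementation would hand out to threads.
    """
    m, n = len(x), len(y)
    d = _init_table(m, n)
    for k in range(2, m + n + 1):
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            j = k - i
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    return d[m][n]
```

Both functions fill exactly the same cells with the same recurrence; only the visiting order differs.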

Edit Distance Problem for the Korean Alphabet (한글에 대한 편집 거리 문제)

  • Roh, Kang-Ho;Kim, Jin-Wook;Kim, Eun-Sang;Park, Kun-Soo;Cho, Hwan-Gue
    • Journal of KIISE:Computer Systems and Theory / v.37 no.2 / pp.103-109 / 2010
  • The edit distance problem is to find the minimum number of edit operations needed to transform one string into another. It is one of the important problems in algorithm research, and there are algorithms that compute an optimal edit distance for one-dimensional languages such as the English alphabet. However, there has been little research on the edit distance for more complicated languages such as the Korean or Chinese alphabet. In this paper, we define a measure of the edit distance for the Korean alphabet and present an algorithm for the edit distance problem for the Korean alphabet.
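
The abstract does not spell out its distance measure. As a rough illustration of why syllable blocks complicate the problem, the sketch below decomposes precomposed Hangul syllables into their letter units (jamo) using the standard Unicode arithmetic and compares syllable-level and jamo-level edit distances. The function names are hypothetical, and this is not the measure defined in the paper.

```python
def to_jamo(text: str) -> list:
    """Split precomposed Hangul syllables (U+AC00..U+D7A3) into jamo.

    Unicode lays syllables out as 19 leads x 21 vowels x 28 tails,
    where tail index 0 means "no final consonant".
    """
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 19 * 21 * 28:
            lead, rest = divmod(code, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(chr(0x1100 + lead))   # initial consonant
            out.append(chr(0x1161 + vowel))  # vowel
            if tail:
                out.append(chr(0x11A7 + tail))  # final consonant
        else:
            out.append(ch)  # leave non-Hangul characters untouched
    return out


def levenshtein(a, b) -> int:
    """Unit-cost edit distance over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


# Syllable level: one substitution. Jamo level: the lead consonant
# changes and a final consonant is added.
syllable_level = levenshtein("가", "한")
jamo_level = levenshtein(to_jamo("가"), to_jamo("한"))
```

The two levels give different answers for the same pair of words, which is why a distance measure for Korean has to decide how much weight a change inside a syllable should carry.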

Parallel Computation for Extended Edit Distances Using the Shared Memory on GPU (GPU의 공유메모리를 활용한 확장편집거리 병렬계산)

  • Kim, Youngho;Na, Joong Chae;Sim, Jeong Seop
    • KIPS Transactions on Computer and Communication Systems / v.4 no.7 / pp.213-218 / 2015
  • Given two strings X and Y (|X| = m, |Y| = n) over an alphabet $\Sigma$, the extended edit distance between X and Y can be computed using dynamic programming in O(mn) time and space. Recently, a parallel algorithm was presented that computes the extended edit distance between X and Y in O(m+n) time and O(mn) space using m threads. In this paper, we present an improved parallel algorithm that uses the shared memory on a GPU. The experimental results show that our parallel algorithm runs about 19 to 25 times faster than the previous parallel algorithm.

Approximate Periods of Strings based on Distance Sum for DNA Sequence Analysis (DNA 서열분석을 위한 거리합기반 문자열의 근사주기)

  • Jeong, Ju Hui;Kim, Young Ho;Na, Joong Chae;Sim, Jeong Seop
    • KIPS Transactions on Software and Data Engineering / v.2 no.2 / pp.119-122 / 2013
  • Repetitive strings such as periods have been studied vigorously in such diverse fields as data compression, computer-assisted music analysis, and bioinformatics. In bioinformatics, periods are closely related to repetitive patterns in DNA sequences, so-called tandem repeats. In some cases, similar but not identical patterns are repeated, and thus we need approximate string matching algorithms to study tandem repeats in DNA sequences. In this paper, we propose a new definition of approximate periods of strings based on distance sum. Given two strings $p$ ($|p| = m$) and $x$ ($|x| = n$), we propose an algorithm that computes the minimum approximate period distance based on distance sum. Our algorithm runs in $O(mn^2)$ time for the weighted edit distance, in $O(mn)$ time for the edit distance, and in $O(n)$ time for the Hamming distance.
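
The abstract does not give the definition itself. The sketch below assumes one plausible reading for the Hamming case only: x is cut into consecutive blocks of length m, each block is compared with p (the last, possibly shorter block with a prefix of p), and the per-block distances are summed. Under that assumption a single pass over x suffices, which is consistent with the O(n) bound stated above. The function name is hypothetical.

```python
def hamming_period_distance(p: str, x: str) -> int:
    """Sum of Hamming distances between p and each length-m block of x.

    Assumed definition (not taken from the abstract): the final block may
    be shorter than p and is compared with the prefix of p of equal length.
    """
    m = len(p)
    if m == 0:
        raise ValueError("the candidate period must be non-empty")
    total = 0
    for start in range(0, len(x), m):
        block = x[start:start + m]
        # zip stops at the shorter input, so a short last block is
        # compared with a prefix of p.
        total += sum(a != b for a, b in zip(p, block))
    return total
```

For the edit distance and weighted edit distance the block boundaries are no longer fixed, which is where the higher running times in the abstract come from.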

An Efficient String Similarity Search Technique based on Generating Inverted Lists of Variable-Length Grams (가변길이 그램의 역리스트 생성을 이용한 효율적인 유사 문자열 검색 기법)

  • Kim, Jongik
    • Journal of KIISE / v.43 no.11 / pp.1275-1280 / 2016
  • Existing techniques for string similarity search first generate a set of candidate strings and then verify the candidates. The efficiency of string similarity search depends heavily on the candidate generation method. State-of-the-art techniques select fixed-length q-grams from a query string and generate candidates using inverted lists of the selected q-grams. In this paper, we propose a technique that generates candidates using variable-length grams of a query string, and we develop a dynamic programming algorithm that selects an optimal combination of variable-length grams from a query string. Experimental results show that the proposed technique improves the performance of string similarity search compared with existing techniques.
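
The generate-then-verify pipeline described here can be sketched with fixed-length q-grams, i.e. the baseline the abstract attributes to existing techniques, not the proposed variable-length method. The sketch assumes the well-known count filter: if two strings are within edit distance k, they share at least |query| - q + 1 - k*q q-grams (counted with multiplicity). All function names are hypothetical.

```python
from collections import Counter, defaultdict


def qgram_counts(s: str, q: int) -> Counter:
    """Multiset of the overlapping q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def build_index(strings: list, q: int) -> dict:
    """Inverted lists: q-gram -> {string id: occurrences in that string}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for gram, cnt in qgram_counts(s, q).items():
            index[gram][sid] = cnt
    return index


def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def search(query: str, strings: list, index: dict, q: int, k: int) -> list:
    """Return all strings within edit distance k of query, sorted."""
    need = len(query) - q + 1 - k * q  # count-filter threshold
    if need <= 0:
        # The filter cannot prune anything; fall back to checking everything.
        candidates = range(len(strings))
    else:
        shared = Counter()
        for gram, cnt in qgram_counts(query, q).items():
            for sid, scnt in index.get(gram, {}).items():
                shared[sid] += min(cnt, scnt)
        candidates = [sid for sid, c in shared.items() if c >= need]
    # Verification step: only candidates pay for the full edit distance.
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= k)
```

A possible use, with a small made-up word list:

```python
words = ["levenshtein", "levenstein", "hamming", "lavenshtein", "einstein"]
idx = build_index(words, q=2)
matches = search("levenshtein", words, idx, q=2, k=1)
```

The fewer candidates the filter lets through, the fewer edit distance computations are needed, which is why the choice of grams matters for overall efficiency.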

Finding Approximate Covers of Strings (문자열의 근사커버 찾기)

  • Sim, Jeong-Seop;Park, Kun-Soo;Kim, Sung-Ryul;Lee, Jee-Soo
    • Journal of KIISE:Computer Systems and Theory / v.29 no.1 / pp.16-21 / 2002
  • Repetitive strings have been studied in such diverse fields as molecular biology and data compression. Some important regularities that have been studied are periods, covers, seeds, and squares. A natural extension of the repetition problems is to allow errors. Among the four notions above, approximate squares and approximate periods have been studied. In this paper, we introduce the notion of approximate covers, which is an approximate version of covers. Given two strings P (|P| = m) and T (|T| = n), we propose an algorithm that finds the minimum distance t such that P is a t-approximate cover of T. The algorithm takes O(mn) time for the edit distance and $O(mn^2)$ time for the weighted edit distance. We also show that the problem of finding a string which is an approximate cover of T with minimum distance is NP-complete.

Construction of Linearly Aligned Corpus Using Unsupervised Learning (자율 학습을 이용한 선형 정렬 말뭉치 구축)

  • Lee, Kong-Joo;Kim, Jae-Hoon
    • The KIPS Transactions:PartB / v.11B no.3 / pp.387-394 / 2004
  • In this paper, we propose a modified unsupervised linear alignment algorithm for building an aligned corpus. The original algorithm inserts null characters into both of the two aligned strings (the source string and the target string) because the two strings differ in length. This can cause difficulties for applications that use the aligned corpus with null characters, such as search space explosion and the inability to apply several machine learning algorithms. To alleviate these difficulties, we modify the algorithm so that the aligned source strings contain no null characters. We show the usability of our approach by applying it to different areas such as Korean-English back-transliteration, English grapheme-phoneme conversion, and Korean morphological analysis.