A Korean Homonym Disambiguation System Based on Statistical Model Using Weights

  • Kim, Jun-Su (Dept. of Computer Engineering & Information Technology, University of Ulsan, 680-749, San29, Mugeo-dong, Nam-gu, Ulsan)
  • Lee, Wang-Woo (Dept. of Computer Engineering & Information Technology, University of Ulsan, 680-749, San29, Mugeo-dong, Nam-gu, Ulsan)
  • Kim, Chang-Hwan (Dept. of Computer Engineering & Information Technology, University of Ulsan, 680-749, San29, Mugeo-dong, Nam-gu, Ulsan)
  • Ock, Cheol-young (Dept. of Computer Engineering & Information Technology, University of Ulsan, 680-749, San29, Mugeo-dong, Nam-gu, Ulsan)
  • Published : 2002.02.01

Abstract

A homonym can be disambiguated by other words in its context, such as the nouns and predicates used with it. This paper uses semantic information (co-occurrence data) obtained from the definitions in the part-of-speech (POS) tagged UMRD-S. We analyze the results of an experiment on a homonym disambiguation system based on a statistical model to which Bayes' theorem is applied, and propose a model that adds a weight for the sense rate and a weight for the distance to adjacent words in order to improve accuracy. Applying the system, with this semantic information, to homonyms appearing in dictionary definition sentences yielded an average accuracy of 98.32% for the 200 most frequent homonyms. We then selected 49 of those 200 homonyms (31 substantives and 18 predicates) and ran an experiment on 50,703 sentences, each containing one of the 49 homonyms, extracted from the 3.5-million-word Sejong Project tagged corpus (i.e., a corpus of morphologically analyzed words). Assigning the weight of the sense rate (prior probability) and the weight of distance for the 5 words before and after the homonym to be disambiguated improved accuracy by 2.93% over disambiguation systems based on existing statistical models.
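The abstract does not give the weighting formulas, so the following is only a minimal sketch of how a Bayes-style sense scorer with a prior weight and a distance weight might look. The function names, the `prior_weight` parameter, the `1 / distance` decay, the add-one smoothing, and the romanized toy data are all illustrative assumptions, not the authors' model.

```python
import math
from collections import Counter, defaultdict

WINDOW = 5  # words considered on each side of the homonym, as in the abstract


def train(examples):
    """Collect counts from (sense, context_words) pairs.

    In the paper the co-occurrence data comes from tagged dictionary
    definitions; here any iterable of pairs will do (toy assumption).
    """
    sense_counts = Counter()
    cooc = defaultdict(Counter)  # cooc[sense][word] = co-occurrence count
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        cooc[sense].update(words)
        vocab.update(words)
    return sense_counts, cooc, vocab


def disambiguate(tokens, index, model, prior_weight=1.0):
    """Pick the highest-scoring sense for the homonym at tokens[index].

    score = prior_weight * log P(sense)
            + sum over context words of (1 / distance) * log P(word | sense)
    Both weights are hypothetical stand-ins for the paper's weights.
    """
    sense_counts, cooc, vocab = model
    total = sum(sense_counts.values())
    lo = max(0, index - WINDOW)
    hi = min(len(tokens), index + WINDOW + 1)
    scores = {}
    for sense, count in sense_counts.items():
        # Weighted log prior (the "sense rate").
        score = prior_weight * math.log(count / total)
        # Add-one smoothing; the extra 1 reserves mass for unseen words.
        denom = sum(cooc[sense].values()) + len(vocab) + 1
        for pos in range(lo, hi):
            if pos == index:
                continue
            distance_weight = 1.0 / abs(pos - index)  # nearer words count more
            likelihood = (cooc[sense][tokens[pos]] + 1) / denom
            score += distance_weight * math.log(likelihood)
        scores[sense] = score
    return max(scores, key=scores.get)


# Toy training data for the homonym "bae" (ship vs. pear), romanized.
model = train([
    ("bae_ship", ["hanggu", "bada", "seonwon"]),
    ("bae_ship", ["bada", "tada"]),
    ("bae_pear", ["gwail", "meokda", "dalda"]),
    ("bae_pear", ["gwail", "sagwa"]),
])
ship_sense = disambiguate(["bada", "bae", "tada"], 1, model)
pear_sense = disambiguate(["gwail", "bae", "meokda"], 1, model)
```

Setting `prior_weight` to 1 and replacing `distance_weight` with a constant 1 reduces the score to an unweighted naive Bayes model, which is one plausible reading of the baseline the abstract compares against.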

Keywords