• Title/Summary/Keyword: Korean POS Tagging

A Hidden Markov Model Imbedding Multiword Units for Part-of-Speech Tagging

  • Kim, Jae-Hoon;Jungyun Seo
    • Journal of Electrical Engineering and Information Science / v.2 no.6 / pp.7-13 / 1997
  • Morphological analysis of Korean is known to be a very complicated problem. In particular, the degree of part-of-speech (POS) ambiguity is much higher than in English. Many researchers have used a hidden Markov model (HMM) to solve the POS tagging problem and have reported accuracy of around 95%. However, the lack of lexical information makes it difficult to improve the performance of an HMM-based POS tagger further. To alleviate this problem, this paper proposes a method for incorporating multiword units, which are a type of lexical information, into a hidden Markov model for POS tagging. It also proposes a method for extracting multiword units from a POS-tagged corpus. Here, a multiword unit is defined as a unit that consists of more than one word. We found that these multiword units are the major source of POS tagging errors. Our experiment shows that the proposed method achieves an error reduction rate of about 13%.
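
For readers unfamiliar with HMM tagging, the decoding step can be sketched as follows. This is a minimal illustration of standard Viterbi decoding for a bigram HMM, not the method of the paper above: it does not model multiword units, and the function name `viterbi`, the two-tag tag set, and all probabilities are made-up assumptions.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-8):
    """Most probable tag sequence for `words` under a bigram HMM (log space)."""
    if not words:
        return []

    def lp(p):
        # Floor zero probabilities so that log() is always defined.
        return math.log(p if p > 0 else floor)

    # best[i][t] = (log-probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy model with made-up probabilities (not taken from any corpus).
TAGS = ["N", "V"]
START = {"N": 0.8, "V": 0.2}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"time": 0.6, "flies": 0.4}, "V": {"time": 0.1, "flies": 0.9}}

print(viterbi(["time", "flies"], TAGS, START, TRANS, EMIT))  # ['N', 'V']
```

Working in log space avoids numeric underflow on long sentences; the `floor` value is a crude stand-in for proper smoothing.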

Syllable-based POS Tagging without Korean Morphological Analysis (형태소 분석기 사용을 배제한 음절 단위의 한국어 품사 태깅)

  • Shim, Kwang-Seob
    • Korean Journal of Cognitive Science / v.22 no.3 / pp.327-345 / 2011
  • In this paper, a new approach to Korean part-of-speech (POS) tagging is proposed. In previous work, a Korean POS tagger was regarded as a post-processor of a morphological analyzer: the tagger was used to determine the most likely morpheme/POS sequence from the results of morphological analysis. In the proposed approach, however, the POS tagger generates the most likely sequence of morpheme and POS pairs directly from the given sentences. A POS-tagged corpus of 398,632 eojeols and test data of 33,467 eojeols are used for training and evaluation, respectively. The proposed approach achieves a POS tagging accuracy of 96.31%.

A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors

  • Son, Jeong-Woo;Noh, Tae-Gil;Park, Seong-Bae
    • International Journal of Fuzzy Logic and Intelligent Systems / v.12 no.1 / pp.6-14 / 2012
  • All types of part-of-speech (POS) tagging errors have been equally treated by existing taggers. However, the errors are not equally important, since some errors affect the performance of subsequent natural language processing seriously while others do not. This paper aims to minimize these serious errors while retaining the overall performance of POS tagging. Two gradient loss functions are proposed to reflect the different types of errors. They are designed to assign a larger cost for serious errors and a smaller cost for minor errors. Through a series of experiments, it is shown that the classifier trained with the proposed loss functions not only reduces serious errors but also achieves slightly higher accuracy than ordinary classifiers.
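
The idea of charging more for serious errors than for minor ones can be illustrated with a generic expected-cost loss. This is only a sketch under stated assumptions: it is not the pair of gradient loss functions proposed in the paper above, and the function name `expected_cost`, the cost table, and the choice of which confusions count as serious are hypothetical.

```python
def expected_cost(probs, gold, cost, default_cost=1.0):
    """Expected misclassification cost of a predicted tag distribution.

    probs: dict mapping each tag to its predicted probability
    gold:  the correct tag
    cost:  cost[gold][pred] is the penalty for predicting `pred` instead of `gold`
    """
    return sum(
        cost.get(gold, {}).get(pred, default_cost) * p
        for pred, p in probs.items()
        if pred != gold
    )


# Hypothetical costs: confusing a singular noun with a plural noun is treated as
# minor, confusing a noun with a verb as serious.
COST = {"NN": {"NNS": 0.2, "VB": 1.0}}

minor = expected_cost({"NN": 0.6, "NNS": 0.4, "VB": 0.0}, "NN", COST)
serious = expected_cost({"NN": 0.6, "NNS": 0.0, "VB": 0.4}, "NN", COST)
print(minor < serious)  # True: the same probability mass costs more on a serious error
```

Both predictions put 0.4 of their probability on a wrong tag, yet the loss differs, which is what pushes a classifier trained on such a loss away from the serious confusions first.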

Syllable-based Korean POS Tagging Based on Combining a Pre-analyzed Dictionary with Machine Learning (기분석사전과 기계학습 방법을 결합한 음절 단위 한국어 품사 태깅)

  • Lee, Chung-Hee;Lim, Joon-Ho;Lim, Soojong;Kim, Hyun-Ki
    • Journal of KIISE / v.43 no.3 / pp.362-369 / 2016
  • This study presents a hybrid algorithm for syllable-based Korean POS tagging. Previous syllable-based work on Korean POS tagging has treated the task as sequence labeling and has mostly relied on machine learning alone. We present a new algorithm that integrates a machine learning method with a pre-analyzed dictionary. We used the Sejong tagged corpus for training and evaluation. While the machine learning engine achieved an eojeol precision of 0.964, the proposed hybrid engine achieved an eojeol precision of 0.990. In a Quiz domain test, the machine learning engine and the proposed hybrid engine obtained 0.961 and 0.972, respectively. These results indicate that our method is effective for Korean POS tagging.
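
One simple way to combine a pre-analyzed dictionary with a learned tagger is lookup first, model second. The sketch below assumes that arrangement; the abstract above does not say how the two components are actually integrated, and `tag_eojeol`, `tag_sentence`, the dictionary entry, and the `dummy_tagger` stand-in are all hypothetical.

```python
def tag_eojeol(eojeol, pre_analyzed, fallback_tagger):
    """Return a list of (morpheme, tag) pairs for one eojeol."""
    analysis = pre_analyzed.get(eojeol)
    if analysis is not None:
        # The dictionary already stores a full analysis for this eojeol.
        return analysis
    # Otherwise defer to the machine-learned tagger.
    return fallback_tagger(eojeol)


def tag_sentence(sentence, pre_analyzed, fallback_tagger):
    """Tag a whitespace-separated sentence eojeol by eojeol."""
    return [tag_eojeol(e, pre_analyzed, fallback_tagger) for e in sentence.split()]


# Hypothetical dictionary entry; NNG and JKB are Sejong-style tags.
PRE_ANALYZED = {"학교에": [("학교", "NNG"), ("에", "JKB")]}


def dummy_tagger(eojeol):
    # Stand-in for a trained syllable-based model; it only marks the eojeol
    # as unanalyzed so that the control flow is visible.
    return [(eojeol, "UNK")]


print(tag_sentence("학교에 간다", PRE_ANALYZED, dummy_tagger))
```

Here the first eojeol is resolved by the dictionary and the second falls through to the stand-in tagger.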

Resolving Prepositional Phrase Attachment and POS Tagging Ambiguities using a Maximum Entropy Boosting Model (최대 엔트로피 부스팅 모델을 이용한 영어 전치사구 접속과 품사 결정 모호성 해소)

  • Park, Seong-Bae
    • Journal of KIISE: Software and Applications / v.30 no.5_6 / pp.570-578 / 2003
  • Maximum entropy models are promising candidates for natural language modeling. However, there are two major hurdles in applying maximum entropy models to real-life language problems such as prepositional phrase attachment: feature selection and high computational complexity. In this paper, we propose a maximum entropy boosting model to overcome these limitations, as well as the problem of imbalanced data in natural language resources, and apply it to prepositional phrase (PP) attachment and part-of-speech (POS) tagging. In experiments on the Wall Street Journal corpus, the model achieves 84.3% accuracy for PP attachment and 96.78% accuracy for POS tagging, which are close to the state-of-the-art performance on these tasks, with only a small modeling effort.

A Korean POS Tagging System with Handling Corpus Errors (말뭉치 오류를 고려한 HMM 한국어 품사 태깅 시스템)

  • Seol, Yong-Soo;Kim, Dong-Joo;Kim, Kyu-Sang;Kim, Han-Woo
    • KSCI Review / v.15 no.1 / pp.117-124 / 2007
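  • In POS tagging with statistical approaches, tagging accuracy depends on the amount of training data. Moreover, even when the corpus is large enough, a manually constructed corpus always carries the possibility of errors, and the nature of language makes it difficult to collect statistically reliable data. Because the corpora used as training data are built by hand by many people, they may contain tagging mistakes caused by some annotators' insufficient linguistic knowledge or by subjective judgment. They can therefore include errors beyond the noise associated with simple low frequency, and such errors cannot be resolved by re-estimation or smoothing techniques. This paper addresses tagging errors in Korean POS tagging with an HMM (Hidden Markov Model) that are caused by corpus noise still present after re-estimation. It proposes a method that, at the stage where the Viterbi algorithm is applied, identifies the parts that are problematic because of data sparseness and corpus errors and corrects them with rules to improve the tagging result. The experimental results show that accuracy improves after the error-correction process, compared with the tagging accuracy obtained by applying the Viterbi algorithm with an HMM built from the error-containing corpus.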

Morpheme Recovery Based on Naïve Bayes Model (NB 모델을 이용한 형태소 복원)

  • Kim, Jae-Hoon;Jeon, Kil-Ho
    • The KIPS Transactions: Part B / v.19B no.3 / pp.195-200 / 2012
  • In Korean morphological analysis, spelling changes in their various forms must be recovered into base forms, and part-of-speech (POS) tagging is difficult without morphological analysis because Korean is agglutinative. Recovering these forms is a notorious problem in Korean morphological analysis and has been handled by morpheme recovery rules, which generate morphological ambiguity that is then resolved by POS tagging. In this paper, we propose a morpheme recovery scheme based on machine learning methods such as Naïve Bayes models. The input features of the models are the surrounding context of the syllable in which the spelling change occurs, and the categories of the models are the recovered syllables. The POS tagging system with the proposed model achieved an F1-score of 97.5% on the ETRI tree-tagged corpus. We therefore conclude that the proposed model is very useful for handling morpheme recovery in Korean.
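
The setup described above (context features around a syllable, recovered syllables as classes) can be sketched with a small Naïve Bayes classifier. This is an illustrative toy under assumptions, not the authors' system: the class `NaiveBayes`, the feature strings such as `cur=갔`, the label format `가/았`, and the four training samples are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over symbolic features with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, samples):
        # samples: iterable of (features, label) pairs
        for feats, label in samples:
            self.class_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())

        def score(label):
            denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
            s = math.log(self.class_counts[label] / total)
            for f in feats:
                s += math.log((self.feat_counts[label][f] + self.alpha) / denom)
            return s

        return max(self.class_counts, key=score)


# Hypothetical training data: the contracted syllable, its left and right
# neighbours, and the recovered syllables as the class label.
TRAIN = [
    (["cur=갔", "prev=<s>", "next=다"], "가/았"),
    (["cur=갔", "prev=<s>", "next=어"], "가/았"),
    (["cur=했", "prev=<s>", "next=다"], "하/였"),
    (["cur=가", "prev=<s>", "next=다"], "가"),
]

model = NaiveBayes().fit(TRAIN)
print(model.predict(["cur=갔", "prev=<s>", "next=지"]))  # 가/았
```

The unseen feature `next=지` is handled by smoothing, so the prediction is driven by the syllable itself and its left context.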

Sequence-to-sequence based Morphological Analysis and Part-Of-Speech Tagging for Korean Language with Convolutional Features (Sequence-to-sequence 기반 한국어 형태소 분석 및 품사 태깅)

  • Li, Jianri;Lee, EuiHyeon;Lee, Jong-Hyeok
    • Journal of KIISE / v.44 no.1 / pp.57-62 / 2017
  • Traditional Korean morphological analysis and POS tagging methods usually consist of two steps: (1) generate hypotheses of all possible combinations of morphemes for the given input, and (2) perform POS tagging to search for the optimal result. These steps require additional resource dictionaries, and errors in the first step can propagate to the second step. In this paper, we tried to solve this problem in an end-to-end fashion using a sequence-to-sequence model with convolutional features. Experimental results on the Sejong corpus show that our approach achieved a 97.15% F1-score at the morpheme level and 95.33% and 60.62% precision at the word and sentence levels, respectively; a second set of reported results is a 96.91% F1-score at the morpheme level and 95.40% and 60.62% precision at the word and sentence levels, respectively.

A knowledge-based pronunciation generation system for French (지식 기반 프랑스어 발음열 생성 시스템)

  • Kim, Sunhee
    • Phonetics and Speech Sciences / v.10 no.1 / pp.49-55 / 2018
  • This paper aims to describe a knowledge-based pronunciation generation system for French. It has been reported that a rule-based pronunciation generation system outperforms most of the data-driven ones for French; however, only a few related studies are available due to existing language barriers. We provide basic information about the French language from the point of view of the relationship between orthography and pronunciation, and then describe our knowledge-based pronunciation generation system, which consists of morphological analysis, Part-of-Speech (POS) tagging, grapheme-to-phoneme generation, and phone-to-phone generation. The evaluation results show that the word error rate of POS tagging, based on a sample of 1,000 sentences, is 10.70% and that of phoneme generation, using 130,883 entries, is 2.70%. This study is expected to contribute to the development and evaluation of speech synthesis or speech recognition systems for French.

Performance Comparison Analysis on Named Entity Recognition system with Bi-LSTM based Multi-task Learning (다중작업학습 기법을 적용한 Bi-LSTM 개체명 인식 시스템 성능 비교 분석)

  • Kim, GyeongMin;Han, Seunggnyu;Oh, Dongsuk;Lim, HeuiSeok
    • Journal of Digital Convergence / v.17 no.12 / pp.243-248 / 2019
  • Multi-task learning (MTL) is a training method in which a single neural network is trained on multiple tasks that influence each other. In this paper, we compare the performance of an MTL named entity recognition (NER) model trained on a Korean traditional culture corpus with that of other NER models. During training, the Bi-LSTM layers for part-of-speech (POS) tagging and NER are each fed from a Bi-LSTM layer to obtain the joint loss. As a result, the MTL-based Bi-LSTM model shows a performance improvement of 1.1% to 4.6% over single Bi-LSTM models.
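
A joint loss of the kind mentioned above is commonly formed as a weighted sum of the per-task losses. The sketch below assumes that form for a single token; the abstract does not state how the two losses are combined, and `joint_loss`, the weights, the tags, and the probabilities are hypothetical.

```python
import math

def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold label under a predicted distribution."""
    return -math.log(probs[gold])


def joint_loss(pos_probs, pos_gold, ner_probs, ner_gold,
               pos_weight=1.0, ner_weight=1.0):
    """Weighted sum of the POS tagging loss and the NER loss for one token."""
    return (pos_weight * cross_entropy(pos_probs, pos_gold)
            + ner_weight * cross_entropy(ner_probs, ner_gold))


# Made-up output distributions of the two task heads for one token.
pos_probs = {"NNP": 0.7, "NNG": 0.3}
ner_probs = {"B-PER": 0.5, "O": 0.5}

loss = joint_loss(pos_probs, "NNP", ner_probs, "B-PER")
print(round(loss, 4))
```

Because both terms depend on the layers the tasks have in common, minimizing the sum updates those layers with signal from both tasks.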