• Title/Summary/Keyword: POS-tagged corpus

Search Result 27, Processing Time 0.022 seconds

Word sense disambiguation using dynamic sized context and distance weighting (가변 크기 문맥과 거리가중치를 이용한 동형이의어 중의성 해소)

  • Lee, Hyun Ah
    • Journal of Advanced Marine Engineering and Technology
    • /
    • v.38 no.4
    • /
    • pp.444-450
    • /
    • 2014
  • Most researches on word sense disambiguation have used static sized context regardless of sentence patterns. This paper proposes to use dynamic sized context considering sentence patterns and distance between words for word sense disambiguation. We evaluated our system 12 words in 32,735sentences with Sejong POS and sense tagged corpus, and dynamic sized context showed 92.2% average accuracy for predicates, which is better than accuracy of static sized context.

A Crowdsourcing-based Emotional Words Tagging Game for Building a Polarity Lexicon in Korean (한국어 극성 사전 구축을 위한 크라우드소싱 기반 감성 단어 극성 태깅 게임)

  • Kim, Jun-Gi;Kang, Shin-Jin;Bae, Byung-Chull
    • Journal of Korea Game Society
    • /
    • v.17 no.2
    • /
    • pp.135-144
    • /
    • 2017
  • Sentiment analysis refers to a way of analyzing the writer's subjective opinions or feelings through text. For effective sentiment analysis, it is essential to build emotional word polarity lexicon. This paper introduces a crowdsourcing-based game that we have developed for efficiently building a polarity lexicon in Korean. First, we collected a corpus from the relating Internet communities using a crawler, and we classified them into words using the Twitter POS analyzer. These POS-tagged words are provided as a form of mobile platform based tagging game in which the players voluntarily tagged the polarities of the words, and then the result was collected into the database. So far we have tagged the polarities of about 1200 words. We expect that our research can contribute to the Korean sentiment analysis research especially in the game domain by collecting more emotional word data in the future.

Coreference Resolution for Korean using Mention Pair with SVM (SVM 기반의 멘션 페어 모델을 이용한 한국어 상호참조해결)

  • Choi, Kyoung-Ho;Park, Cheon-Eum;Lee, Changki
    • KIISE Transactions on Computing Practices
    • /
    • v.21 no.4
    • /
    • pp.333-337
    • /
    • 2015
  • In this paper, we suggest a Coreference Resolution system for Korean using Mention Pair with SVM. The system introduced in this paper, also be able to extract Mention from document which is including automatically tagged name entity information, dependency trees and POS tags. We also built a corpus, including 214 documents with Coreference tags, referencing online news and Wikipedia for training the system and testing the system's performance. The corpus had 14 documents from online news, along with 200 question-and-answer documents from Wikipedia. When we tested the system by corpus, the performance of the system was extracted by MUC-F1 55.68%, B-cube-F1 57.19%, and CEAFE-F1 61.75%.

Syllable-based Korean POS Tagging Based on Combining a Pre-analyzed Dictionary with Machine Learning (기분석사전과 기계학습 방법을 결합한 음절 단위 한국어 품사 태깅)

  • Lee, Chung-Hee;Lim, Joon-Ho;Lim, Soojong;Kim, Hyun-Ki
    • Journal of KIISE
    • /
    • v.43 no.3
    • /
    • pp.362-369
    • /
    • 2016
  • This study is directed toward the design of a hybrid algorithm for syllable-based Korean POS tagging. Previous syllable-based works on Korean POS tagging have relied on a sequence labeling method and mostly used only a machine learning method. We present a new algorithm integrating a machine learning method and a pre-analyzed dictionary. We used a Sejong tagged corpus for training and evaluation. While the machine learning engine achieved eojeol precision of 0.964, the proposed hybrid engine achieved eojeol precision of 0.990. In a Quiz domain test, the machine learning engine and the proposed hybrid engine obtained 0.961 and 0.972, respectively. This result indicates our method to be effective for Korean POS tagging.

Morpheme Recovery Based on Naïve Bayes Model (NB 모델을 이용한 형태소 복원)

  • Kim, Jae-Hoon;Jeon, Kil-Ho
    • The KIPS Transactions:PartB
    • /
    • v.19B no.3
    • /
    • pp.195-200
    • /
    • 2012
  • In Korean, spelling change in various forms must be recovered into base forms in morphological analysis as well as part-of-speech (POS) tagging is difficult without morphological analysis because Korean is agglutinative. This is one of notorious problems in Korean morphological analysis and has been solved by morpheme recovery rules, which generate morphological ambiguity resolved by POS tagging. In this paper, we propose a morpheme recovery scheme based on machine learning methods like Na$\ddot{i}$ve Bayes models. Input features of the models are the surrounding context of the syllable which the spelling change is occurred and categories of the models are the recovered syllables. The POS tagging system with the proposed model has demonstrated the $F_1$-score of 97.5% for the ETRI tree-tagged corpus. Thus it can be decided that the proposed model is very useful to handle morpheme recovery in Korean.

A Korean Homonym Disambiguation System Based on Statistical, Model Using weights

  • Kim, Jun-Su;Lee, Wang-Woo;Kim, Chang-Hwan;Ock, Cheol-young
    • Proceedings of the Korean Society for Language and Information Conference
    • /
    • 2002.02a
    • /
    • pp.166-176
    • /
    • 2002
  • A homonym could be disambiguated by another words in the context as nouns, predicates used with the homonym. This paper using semantic information (co-occurrence data) obtained from definitions of part of speech (POS) tagged UMRD-S$^1$), In this research, we have analyzed the result of an experiment on a homonym disambiguation system based on statistical model, to which Bayes'theorem is applied, and suggested a model established of the weight of sense rate and the weight of distance to the adjacent words to improve the accuracy. The result of applying the homonym disambiguation system using semantic information to disambiguating homonyms appearing on the dictionary definition sentences showed average accuracy of 98.32% with regard to the most frequent 200 homonyms. We selected 49 (31 substantives and 18 predicates) out of the 200 homonyms that were used in the experiment, and performed an experiment on 50,703 sentences extracted from Sejong Project tagged corpus (i.e. a corpus of morphologically analyzed words) of 3.5 million words that includes one of the 49 homonyms. The result of experimenting by assigning the weight of sense rate(prior probability) and the weight of distance concerning the 5 words at the front/behind the homonym to be disambiguated showed better accuracy than disambiguation systems based on existing statistical models by 2.93%,

  • PDF

Dealing with Compouds in the Construction of a POS Tagged Korean Corpus (형태 분석 말뭉치 구축을 위한 합성어의 처리 방법 - 띄어쓰기를 고려하여 -)

  • Cho, Jin-Hyun;Kim, Il-Hwan;Lee, Hyun-Hee;Lee, Young-Je;Kang, Beom-Mo
    • Annual Conference on Human and Language Technology
    • /
    • 2002.10e
    • /
    • pp.9-15
    • /
    • 2002
  • 이 연구는 형태 분석 정보가 부착된 말뭉치를 구축할 때 합성어를 처리하기 위한 방법론을 제시하고, 그 타당성을 검증해 보는 데 있다. 그동안 합성어 처리를 위해서 합성어 선정 기준을 이용하거나 목록을 이용하는 방법이 이용되었는데, 본고에서는 ${\ulcorner}$표준국어대사전${\lrcorner}$의 합성어 목록을 참조하는 것이 적절한 방법이 될 수 있음을 보이고자 한다. 또한 이 방법을 실제 말뭉치 구축에 활용할 경우, 원문의 띄어쓰기 정보가 합성어 처리에서 중요한 요인이 될 수 있다는 점을 지적하고, 이러한 처리가 가지는 한계와 의의에 대해서도 논의하고자 한다.

  • PDF

Construction and Evaluation of a Sentiment Dictionary Using a Web Corpus Collected from Game Domain (게임 도메인 웹 코퍼스를 이용한 감성사전 구축 및 평가)

  • Jeong, Woo-Young;Bae, Byung-Chull;Cho, Sung Hyun;Kang, Shin-Jin
    • Journal of Korea Game Society
    • /
    • v.18 no.5
    • /
    • pp.113-122
    • /
    • 2018
  • This paper describes an approach to building and evaluating a sentiment dictionary using a Web corpus in the game domain. To build a sentiment dictionary, we collected vocabulary based on game-related web documents from a domestic portal site, using the Twitter Korean Processor. From the collected vocabulary, we selected the words whose POS are tagged as either verbs or adjectives, and assigned sentiment score for each selected word. To evaluate the constructed sentiment dictionary, we calculated F1 score with precision and recall, using Korean-SWN that is based on English Senti-word Net(SWN). The evaluation results show that average F1 scores are 0.85 for adjectives and 0.77 for verbs, respectively.

Sentiment Analysis System Using Stanford Sentiment Treebank (스탠포드 감성 트리 말뭉치를 이용한 감성 분류 시스템)

  • Lee, Songwook
    • Journal of Advanced Marine Engineering and Technology
    • /
    • v.39 no.3
    • /
    • pp.274-279
    • /
    • 2015
  • The main goal of this research is to build a sentiment analysis system which automatically determines user opinions of the Stanford Sentiment Treebank in terms of three sentiments such as positive, negative, and neutral. Firstly, sentiment sentences are POS tagged and parsed to dependency structures. All nodes of the Treebank and their polarities are automatically extracted from the Treebank. We train two Support Vector Machines models. One is for a node level classification and the other is for a sentence level. We have tried various type of features such as word lexicons, POS tags, Sentiment lexicons, head-modifier relations, and sibling relations. Though we acquired 74.2% in accuracy on the test set for 3 class node level classification and 67.0% for 3 class sentence level classification, our experimental results for 2 class classification are comparable to those of the state of art system using the same corpus.

The Method for the Construction of POS Tagged Corpus based on Morpheme Ready Made Dictionary and RDBMS (형태소 기분석 사전과 DBMS를 이장한 형태소 분석 말뭉치 구축의 한 방법)

  • Cho, Jin-Hyun;Kang, Beom-Mo
    • Annual Conference on Human and Language Technology
    • /
    • 2001.10d
    • /
    • pp.33-40
    • /
    • 2001
  • 본 논문은 1999년도에 구축된 '150만 세종 형태소 분석 말뭉치'를 바탕으로 형태소 기분석 사전을 구축하고, 이를 토대로 후처리의 수작업을 고려한 반자동 태거를 구축하는 방법론에 대해 연구한 것이다. 분석말뭉치 구축에 있어 기존 자동 태거에 의한 자동 태깅의 문제점을 분석하고, 이미 구축된 형태분석 말뭉치를 이용해 후처리 작업이 보다 용이한 1차 가공말뭉치를 구축하는 반자동 태거의 개발과 그 방법론을 제시하는데 목적을 두고 있다. 이와 같은 논의에 따라 분석 말뭉치의 구축을 위한 태거는 일반적인 언어 처리를 위한 태거와는 다르다는 점을 주장하였고, 태거에 전적으로 의존하는 태깅 방식보다는 수작업의 편의를 제공할 수 있는 태깅 방식이 필요함을 강조하였다. 본 연구에서 제안된 반자동 태거는 전체적인 태깅 성공률과 정확도가 기존의 태거에 비해 떨어지지만 정확한 단일 분석 결과를 텍스트의 장르에 따른 편차 없이 50% 이상으로 산출하고, 해결이 어려운 어절 유형에 대해서 완전히 작업자의 판단에 맡김으로써 오류의 가능성을 줄인다. 또한 분석 어절에 대해 여러 표지를 부착함으로써 체계적이고 단계적인 후처리 작업이 가능하도록 하였다.

  • PDF