• Title/Summary/Keyword: Korean Language Information Processing (국어 정보처리)


Graph-Based Word Sense Disambiguation Using Iterative Approach (반복적 기법을 사용한 그래프 기반 단어 모호성 해소)

  • Kang, Sangwoo
    • The Journal of Korean Institute of Next Generation Computing
    • /
    • v.13 no.2
    • /
    • pp.102-110
    • /
    • 2017
  • Current word sense disambiguation techniques employ various machine-learning-based methods. Among the approaches proposed to address this problem is the knowledge-base approach, which determines the sense of an ambiguous word from knowledge-base information alone, with no training corpus. Within unsupervised techniques that use a knowledge base, graph-based and similarity-based methods have been the main research areas. The graph-based method has the advantage of constructing a semantic graph that delineates all paths between the different senses an ambiguous word may have. However, unnecessary semantic paths may be introduced, increasing the risk of errors. To solve this problem and construct a fine-grained graph, this paper proposes a model that iteratively constructs the graph while eliminating unnecessary nodes and edges, i.e., senses and semantic paths. A hybrid similarity-estimation model is then applied to select a more accurate sense from the constructed semantic graph. Because the proposed model uses BabelNet, a multilingual lexical knowledge base, it is not limited to a specific language.
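The iterative graph-pruning idea in the abstract can be sketched in plain Python. This is only a toy illustration, not the paper's model: the graph here is a hand-built adjacency map rather than BabelNet, the pruning criterion (below-average degree) and the final degree-based sense choice are simplifying assumptions, and the hybrid similarity estimation is omitted.

```python
def iterative_disambiguate(sense_graph, word_senses, rounds=3):
    """Iteratively prune weakly connected nodes from a sense graph,
    then pick, for each ambiguous word, its best-connected sense."""
    graph = {n: set(nbrs) for n, nbrs in sense_graph.items()}
    for _ in range(rounds):
        degrees = {n: len(nbrs) for n, nbrs in graph.items()}
        mean_deg = sum(degrees.values()) / len(degrees)
        removable = {n for n, d in degrees.items() if d < mean_deg}
        for senses in word_senses.values():
            # never prune away a word's last surviving candidate sense
            alive = [s for s in senses if s in graph and s not in removable]
            if not alive:
                removable -= set(senses)
        graph = {n: nbrs - removable for n, nbrs in graph.items()
                 if n not in removable}
    return {w: max((s for s in senses if s in graph),
                   key=lambda s: len(graph[s]))
            for w, senses in word_senses.items()}
```

Each round removes senses and context nodes whose connectivity falls below the mean, so spurious semantic paths disappear before the final sense choice is made.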

The Automatic Extraction of Hypernyms and the Development of WordNet Prototype for Korean Nouns using Korean MRD (Machine Readable Dictionary) (국어사전을 이용한 한국어 명사에 대한 상위어 자동 추출 및 WordNet의 프로토타입 개발)

  • Kim, Min-Soo;Kim, Tae-Yeon;Noh, Bong-Nam
    • The Transactions of the Korea Information Processing Society
    • /
    • v.2 no.6
    • /
    • pp.847-856
    • /
    • 1995
  • When a human recognizes nouns in a sentence, s/he associates them with the hyper concepts of those nouns. For a computer to simulate human word recognition, it should build a knowledge base (WordNet) for the hyper concepts of words. Until now, no WordNet work had been performed in Korea, because it requires a great deal of human effort and time. But as the power of computers has radically improved and common MRDs have become available, automatically constructing a WordNet is now more feasible. This paper proposes a method that automatically builds a WordNet of Korean nouns by using the descriptions of nouns in a Korean MRD, and it proposes rules for extracting the hyper concepts (hypernyms) by analyzing structural characteristics of Korean. The rules reflect such characteristics as that a headword lies in the rear part of a sentence and that the descriptive sentences of nouns have a special structure. In addition, a WordNet prototype of Korean nouns is developed by combining the hypernyms produced by the rules mentioned above. It extracts the hypernyms of about 2,500 sample words, and the results show that about 92 percent of the hypernyms are correct.
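The headword-position characteristic the abstract mentions can be illustrated with a tiny Python sketch. It is only a cartoon of the paper's rules: real dictionary definitions need morphological analysis, and the whitespace split plus the hard-coded copula endings below are illustrative assumptions.

```python
def extract_hypernym(definition):
    """Toy structural rule: in Korean dictionary definitions the genus
    term (hypernym) tends to be the final noun, so take the last token
    after stripping the sentence-final period and copula ending."""
    text = definition.strip().rstrip(".")
    for ending in ("이다", "임"):  # toy list of copula endings
        if text.endswith(ending):
            text = text[: -len(ending)]
            break
    tokens = text.split()
    return tokens[-1] if tokens else None
```

For example, from a definition ending in "...교목의 열매." ("...fruit of a deciduous tree"), the rule extracts "열매" (fruit) as the hypernym.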


CKFont2: An Improved Few-Shot Hangul Font Generation Model Based on Hangul Composability (CKFont2: 한글 구성요소를 이용한 개선된 퓨샷 한글 폰트 생성 모델)

  • Jangkyoung, Park;Ammar, Ul Hassan;Jaeyoung, Choi
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.12
    • /
    • pp.499-508
    • /
    • 2022
  • A lot of research has been carried out on Hangul generation models using deep learning, and recently on how to minimize the number of input characters needed to generate one complete set of Hangul (few-shot learning). In this paper, we propose the CKFont2 model, which uses only 14 letters, by analyzing and improving the CKFont model (hereafter CKFont1), which uses 28 letters. Whereas CKFont1 generates all Hangul by extracting 51 Hangul components from 28 characters, CKFont2 improves on it by generating all Hangul from only 14 characters containing the 24 basic components (14 consonants and 10 vowels); this is the minimum number of characters among currently known models. From the basic consonants and vowels, the 27 remaining components (5 double consonants, 11 compound consonants, and 11 compound vowels) are learned and generated by deep learning, and the generated 27 components are combined with the 24 basic consonants and vowels. All Hangul characters are then automatically generated from the combined 51 components. The superiority of its performance was verified by comparative analysis against the zi2zi, CKFont1, and MX-Font models. It is an efficient and effective model with a simple structure that saves time and resources, and it can be extended to Chinese, Thai, and Japanese.
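The composability that CKFont2 exploits, every Hangul syllable decomposing into an initial, a medial, and an optional final component, is fixed by the Unicode Hangul syllable block layout. A minimal sketch of that mapping (not part of the font model itself):

```python
# Unicode Hangul syllables are laid out algorithmically from jamo
# indices: 19 initials (cho), 21 medials (jung), 28 finals (jong,
# where index 0 means "no final consonant").
CHO, JUNG, JONG = 19, 21, 28

def compose(cho, jung, jong=0):
    """Map (initial, medial, final) indices to a precomposed syllable."""
    assert 0 <= cho < CHO and 0 <= jung < JUNG and 0 <= jong < JONG
    return chr(0xAC00 + (cho * JUNG + jung) * JONG + jong)

def decompose(syllable):
    """Inverse mapping: a syllable back to its component indices."""
    idx = ord(syllable) - 0xAC00
    return idx // (JUNG * JONG), (idx // JONG) % JUNG, idx % JONG
```

Since the mapping is purely algorithmic, a model that can render each component can in principle render all 11,172 precomposed syllables.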

Differences of Reading the Pure Hangul Text and the Hangul Plus Hanja Text in Reading Speed, Comprehension, and Memory (한글 전용과 국한 혼용의 언어 심리학적 고찰(I): 읽기 시간, 이해, 기억에서의 차이)

  • Nam, Ki-Chun;Kim, Tae-Hoon;Lee, Kyung-In;Park, Young-Chan;Seo, Kwang-Jun;Choi, Key-Sun
    • Annual Conference on Human and Language Technology
    • /
    • 1997.10a
    • /
    • pp.469-476
    • /
    • 1997
  • This study was conducted to investigate how Hangul-only text and mixed Hangul-Hanja text affect reading speed, comprehension, and memory of content. The Hangul-only and mixed-script camps have long debated, each with its own logic. Advocates of Hangul-only writing argue that Hanja are hard to learn and inconvenient to write, so the easily learned Hangul should be used; that using Hanja hinders the development of native Korean words; and that Hanja complicate the mechanization of writing and the development of publishing culture. Advocates of mixed script argue that Sino-Korean words are grasped more quickly and accurately when written in Hanja; that since Sino-Korean words make up more than half of the Korean vocabulary, teaching Hanja is a shortcut to Korean language education; that Korean has many homophones that are hard to distinguish when written only in Hangul; that because each Hanja character carries meaning, new words can easily be coined by combination and what must be written at length in Hangul can be written briefly; and that abandoning Hanja would cut off Korea's traditional culture and could isolate it within the East Asian cultural sphere that shares Hanja. Reflecting this diversity of opinion, Korea's policy on Hanja has repeatedly shifted. After independence, the government prohibited the use of Hanja in all official documents by statute and directive and recommended Hangul-only writing in general society, but the guideline ultimately remained limited to official documents and later became ineffective even there. Secondary-school Hanja education policy also changed several times, so generations taught only Hangul had difficulty adapting to a society where Hanja were still in use. This study examined the long-running debate over the use of Hangul and Hanja through psycholinguistic methods: measuring reading speed, how accurately the meaning of a text was understood, and which condition was remembered longer, in order to judge which position is right. The results showed that reading time was markedly faster for Hangul-only text, whereas memory for content was better in the mixed-script condition. In the comprehension test, a ceiling effect left no difference between the two conditions. According to these results, Hangul-only text is preferable for documents where reading speed matters, while mixing in Hanja is more efficient where memory for content is emphasized.


Application and Process Standardization of Terminology Dictionary for Defense Science and Technology (국방과학기술 전문용어 사전 구축을 위한 프로세스 표준화 및 활용 방안)

  • Choi, Jung-Hwoan;Choi, Suk-Doo;Kim, Lee-Kyum;Park, Young-Wook;Jeong, Jong-Hee;An, Hee-Jung;Jung, Han-Min;Kim, Pyung
    • The Journal of the Korea Contents Association
    • /
    • v.11 no.8
    • /
    • pp.247-259
    • /
    • 2011
  • It is necessary to collect, manage, and standardize the defense science and technology terminology used by defense-related agencies. Standardizing the terminology dictionary can eliminate confusion about terms and increase their accessibility through offline and online services. This study focuses on building national defense science and technology terminology, publishing a dictionary containing it, and improving information analysis in the defense area, while also taking advantage of offline and online services for easy access to the terminology. Based on the results of this study, the terminology data will be used as follows: 1) defense science and technology terminology databases and their publication; 2) information analysis in military fields; 3) multilingual information analysis of translated terms in the thesauri; 4) verification of the consistency of information processing; and 5) language resources for terminology extraction.

Building a Korean Sentiment Lexicon Using Collective Intelligence (집단지성을 이용한 한글 감성어 사전 구축)

  • An, Jungkook;Kim, Hee-Woong
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.2
    • /
    • pp.49-67
    • /
    • 2015
  • Recently, the emergence of big data and social media has ushered in a big bang of data. Social networking services are widely used by people around the world and have become a major communication tool for all ages. Over the last decade, as online social networking sites have become increasingly popular, companies have focused on advanced social media analysis for their marketing strategies. Beyond such analysis, companies are concerned about the propagation of negative opinions on social networking sites such as Facebook and Twitter, as well as on e-commerce sites. Online word of mouth (WOM) such as product ratings, reviews, and recommendations is very influential, and negative opinions have a significant impact on product sales. This trend has drawn researchers' attention to natural language processing tasks such as sentiment analysis. Sentiment analysis, also referred to as opinion mining, is the process of identifying the polarity of subjective information and has been applied to various research and practical fields. However, obstacles arise when Korean (Hangul) is processed, because it is an agglutinative language whose rich morphology poses problems. As a result, Korean natural language processing resources such as sentiment lexicons are scarce, which significantly limits researchers and practitioners considering sentiment analysis. Our study builds a Korean sentiment lexicon with collective intelligence and provides an API (Application Programming Interface) service to open and share the lexicon data with the public (www.openhangul.com). For pre-processing, we created a Korean lexicon database of over 517,178 words and classified them into sentiment and non-sentiment words.
To classify them, we first identified stop words, which are quite likely to play a negative role in sentiment analysis, and excluded them from our sentiment scoring. In general, sentiment words are nouns, adjectives, verbs, and adverbs, as they carry sentimental expressions such as positive, neutral, and negative. On the other hand, non-sentiment words are interjections, determiners, numerals, postpositions, etc., as they generally carry no sentimental expression. To build a reliable sentiment lexicon, we adopted collective intelligence as a model for crowdsourcing. In addition, the concept of folksonomy was implemented in the taxonomy process to support collective intelligence. To compensate for an inherent weakness of folksonomy, we adopted a majority rule by building a voting system. Participants, as voters, were offered three voting options (positivity, negativity, and neutrality), and the voting was conducted on one of the largest social networking sites for college students in Korea. More than 35,000 votes have been cast by college students, and we keep this voting system open by maintaining the project as a perpetual study. Moreover, any change in the sentiment score of a word is an important observation, because it enables us to track temporal changes in Korean as a natural language. Lastly, our study offers a RESTful, JSON-based API service through a web platform to better support users such as researchers, companies, and developers. Finally, our study makes important contributions to both research and practice. In terms of research, our Korean sentiment lexicon serves as a resource for Korean natural language processing. In terms of practice, practitioners such as managers and marketers can implement sentiment analysis effectively by using the Korean sentiment lexicon we built.
Moreover, our study sheds new light on the value of folksonomy combined with collective intelligence, and we expect it to give a new direction and a new start to the development of Korean natural language processing.
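The majority-rule voting described above can be sketched in a few lines. The vote labels and the tie-breaks-to-neutral choice are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def score_word(votes):
    """Aggregate crowd votes ('pos' / 'neg' / 'neu') by majority rule;
    a tie between the top two options falls back to neutral."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "neu"
    return counts[0][0]
```

Running this per lexicon entry over the accumulated votes yields the polarity label, and re-running it as new votes arrive captures the temporal drift the abstract mentions.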

Exploratory Research on Automating the Analysis of Scientific Argumentation Using Machine Learning (머신 러닝을 활용한 과학 논변 구성 요소 코딩 자동화 가능성 탐색 연구)

  • Lee, Gyeong-Geon;Ha, Heesoo;Hong, Hun-Gi;Kim, Heui-Baik
    • Journal of The Korean Association For Science Education
    • /
    • v.38 no.2
    • /
    • pp.219-234
    • /
    • 2018
  • In this study, we explored the possibility of automating the analysis of the elements of scientific argument in the context of a Korean classroom. To gather training data, we collected 990 sentences from science education journals that illustrate the coding of argumentation elements according to Toulmin's argumentation structure framework. We extracted 483 sentences as a test data set from transcriptions of students' discourse in scientific argumentation activities. The words and morphemes of each argument were analyzed using the Python 'KoNLPy' package and its 'Kkma' module for Korean natural language processing. After constructing the 'argument-morpheme:class' matrix for the 1,473 sentences, five machine learning techniques were applied to generate predictive models relating each sentence to its corresponding argument element. The accuracy of the predictive models was investigated by comparing them with the researchers' pre-coding and measuring the degree of agreement. The model generated by the k-nearest neighbor (KNN) algorithm showed the highest agreement, 54.04% (κ = 0.22), when machine learning was performed on the morphemes of each sentence, and higher agreement, 55.07% (κ = 0.24), when the coding result of the previous sentence was added to the prediction process. The results also indicate the importance of considering discourse context by reflecting the codes of previous sentences in the analysis. The significance of this work is that it shows the possibility of automating the analysis of students' argumentation activities in Korean by applying machine learning.
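The KNN classification step can be sketched as follows. This is a minimal stand-in, not the paper's pipeline: whitespace tokens replace the KoNLPy/Kkma morphemes, the cosine similarity and hand-rolled neighbor search are simplifying choices, and the example sentences and labels are invented.

```python
from collections import Counter
from math import sqrt

def bow(sentence):
    """Bag-of-words features; whitespace tokens stand in for the
    KoNLPy/Kkma morphemes used in the paper."""
    return Counter(sentence.split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, sentence, k=3):
    """Label a sentence with the majority argument element among its
    k most similar training sentences (k-nearest neighbors)."""
    vec = bow(sentence)
    ranked = sorted(train, key=lambda ex: cosine(bow(ex[0]), vec),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Extending the feature vector with the previous sentence's predicted code, as the paper reports, would simply mean appending that code as one more feature before the similarity computation.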