• Title/Summary/Keyword: 최소단어 (minimal word)


Sentiment Analysis of Korean Reviews Using CNN: Focusing on Morpheme Embedding (CNN을 적용한 한국어 상품평 감성분석: 형태소 임베딩을 중심으로)

  • Park, Hyun-jung;Song, Min-chae;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.2
    • /
    • pp.59-83
    • /
    • 2018
  • With the increasing importance of sentiment analysis for grasping the needs of customers and the public, various types of deep learning models have been actively applied to English texts. In deep-learning-based sentiment analysis of English texts, the natural language sentences in the training and test datasets are usually converted into sequences of word vectors before being fed to the model. Here, word vectors generally refer to vector representations of the words obtained by splitting a sentence on space characters. There are several ways to derive word vectors; one is Word2Vec, which was used to produce the 300-dimensional Google word vectors from about 100 billion words of Google News data. These vectors have been widely used in studies of sentiment analysis of reviews from various fields such as restaurants, movies, laptops, and cameras. Unlike in English, the morpheme plays an essential role in sentiment analysis and sentence structure analysis in Korean, a typical agglutinative language with well-developed postpositions and endings. A morpheme can be defined as the smallest meaningful unit of a language, and a word consists of one or more morphemes. For example, the word '예쁘고' consists of the morphemes '예쁘' (adjective stem) and '고' (connective ending). Given the significance of Korean morphemes, it seems reasonable to adopt the morpheme as the basic unit in Korean sentiment analysis. Therefore, in this study we use 'morpheme vectors' as the input to a deep learning model rather than the 'word vectors' mainly used for English text. A morpheme vector is a vector representation of a morpheme and can be derived by applying an existing word vector derivation mechanism to sentences divided into their constituent morphemes. This choice raises several questions. What is the desirable range of POS (Part-Of-Speech) tags when deriving morpheme vectors so as to improve the classification accuracy of a deep learning model? Is it proper to apply a typical word vector model, which relies primarily on the form of words, to Korean, which has a high ratio of homonyms? Will text preprocessing such as correcting spelling or spacing errors affect classification accuracy, especially when deriving morpheme vectors from Korean product reviews containing many grammatical mistakes and variations? We seek empirical answers to these fundamental issues, which are likely to be encountered first when applying deep learning models to Korean texts. As a starting point, we summarize them as three central research questions. First, which is more effective as the initial input of a deep learning model: morpheme vectors derived from grammatically correct texts of a domain other than the analysis target, or morpheme vectors derived from considerably ungrammatical texts of the same domain? Second, what is an appropriate morpheme vector derivation method for Korean with respect to the range of POS tags, homonyms, text preprocessing, and minimum frequency? Third, can a satisfactory level of classification accuracy be achieved when applying deep learning to Korean sentiment analysis? To address these questions, we generate various types of morpheme vectors reflecting them and compare classification accuracy using a non-static CNN (Convolutional Neural Network) model that takes the morpheme vectors as input. As training and test datasets, 17,260 cosmetics product reviews from Naver Shopping are used. To derive the morpheme vectors, we use data from the same domain as the target and data from another domain: about 2 million cosmetics product reviews from Naver Shopping, and about 520,000 Naver News articles, arguably corresponding to Google's News data. The six primary sets of morpheme vectors constructed in this study differ in terms of three criteria. First, they come from two types of data source: Naver News, with high grammatical correctness, and Naver Shopping's cosmetics product reviews, with low grammatical correctness. Second, they differ in the degree of preprocessing, namely sentence splitting only, or additional spelling and spacing corrections after sentence separation. Third, they vary in the form of input fed into the word vector model: either the morphemes themselves, or the morphemes with their POS tags attached. The morpheme vectors further vary in the range of POS tags considered, the minimum frequency of morphemes included, and the random initialization range. All morpheme vectors are derived with the CBOW (Continuous Bag-Of-Words) model, with a context window of 5 and a vector dimension of 300. The results suggest that using text from the same domain even with a lower degree of grammatical correctness, performing spelling and spacing corrections as well as sentence splitting, and incorporating morphemes of all POS tags, including the 'incomprehensible' category, lead to better classification accuracy. POS tag attachment, devised for the high proportion of homonyms in Korean, and the minimum frequency threshold for a morpheme to be included appear to have no definite influence on classification accuracy.
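One of the comparison axes above is whether morphemes are fed to the word-vector model bare or with their POS tags attached. The sketch below is illustrative only (it is not the authors' code, and the morphological analysis shown is hand-written rather than produced by a real analyzer); it shows how the two token formats would differ:

```python
# Minimal sketch: preparing morpheme tokens for a word-vector model,
# with or without POS tags attached (one of the paper's comparison axes).
# The analyzed sentence below is hand-written for illustration; in practice
# a Korean morphological analyzer would produce the (morpheme, POS) pairs.

def to_tokens(analyzed_sentence, attach_pos=True):
    """Turn (morpheme, POS) pairs into the token strings fed to CBOW.

    attach_pos=True keeps homographic morphemes with different POS tags
    distinct; attach_pos=False lets the word-vector model rely on
    surface form alone.
    """
    if attach_pos:
        return [f"{m}/{pos}" for m, pos in analyzed_sentence]
    return [m for m, _ in analyzed_sentence]

# '예쁘고 좋아요' analyzed into morphemes (illustrative analysis):
analyzed = [("예쁘", "VA"), ("고", "EC"), ("좋", "VA"), ("아요", "EF")]

print(to_tokens(analyzed, attach_pos=True))   # ['예쁘/VA', '고/EC', '좋/VA', '아요/EF']
print(to_tokens(analyzed, attach_pos=False))  # ['예쁘', '고', '좋', '아요']
```

Token lists of either form would then be passed as "sentences" to a CBOW trainer with the settings the paper describes (context window 5, vector dimension 300).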

A Study on Project Information Integrated Management Measures Using Life Cycle Information in Road Construction Projects (도로건설사업의 생애주기별 정보를 이용한 건설사업정보 통합관리방안 연구)

  • Kim, Seong-Jin;Kim, Bum-Soo;Kim, Tae-Hak;Kim, Nam-Gon
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.20 no.11
    • /
    • pp.208-216
    • /
    • 2019
  • Construction projects generate a massive amount of diverse information. A project takes at least five years, and sometimes more than ten, to complete, so it is important to manage information on the project's history, including processes and costs. Furthermore, it is necessary to determine whether construction projects have been carried out according to the planned goals, and to turn the construction information management system (CALS) into a virtuous cycle. Integrated information management is easy to ensure in private construction projects, because the constructor can oversee the whole process from planning to completion, whereas it is difficult in public construction projects, because various agencies are involved. A CALS manages the project information of public road construction, but that information is managed separately by the CALS subsystems, resulting in disconnected information among the subsystems and making integrated monitoring impossible. Thus, this study proposes integrated information management measures to ensure comprehensive management of the information generated over the construction life cycle. To that end, the CALS is improved by standardizing and integrating the system databases, integrating the individually managed user information, and connecting the system with dBrain, the government's digital budget and accounting system, to ensure information management based on the project budget.

Characteristics and Natural Habitats of the Rare Plant Lasianthus japonicus Miquel (희귀식물 무주나무(Lasianthus japonicus Miquel)의 특성과 자생지)

  • 이은주;문명옥;강영제;김문홍
    • Proceedings of the Plant Resources Society of Korea Conference
    • /
    • 2002.11b
    • /
    • pp.76-76
    • /
    • 2002
  • Lasianthus japonicus Miquel (무주나무) is distributed in Japan, Taiwan, and China, where it is known to grow only in tropical and subtropical evergreen broad-leaved forests; in Korea it is a rare plant found only in valleys on the southern side of Jeju Island. Although the species is currently protected as a wild plant designated for protection by the Ministry of Environment, no accurate survey of its individual characteristics and natural habitats has been conducted. This study was carried out to assess the status of its natural habitats and its growth characteristics. Two habitats were confirmed: the east-facing slope of a valley at 250 m above sea level in Harye-ri, Namwon-eup, Namjeju-gun, and the west-facing slope of the Donnaeko valley at 350 m in Seogwipo. Only nine individuals were found in total: four in Harye-ri and five in the Donnaeko valley. The habitats were humus over rocks beneath the evergreen forest understory of the valleys, or moist valley slopes, in evergreen broad-leaved forest whose tree layer was dominated by 구실잣밤나무 (Castanopsis sieboldii), 비쭈기나무 (Cleyera japonica), 황칠나무 (Dendropanax), and 동백나무 (Camellia japonica), and whose shrub layer was dominated by 사스레피나무 (Eurya japonica), 백량금 (Ardisia crenata), and 산호수 (Ardisia pusilla). The height of the individuals ranged from a minimum of 0.4 m to a maximum of 1.55 m, with a mean of 1.5 m. Observation of growth characteristics showed that the stem is square when young, gradually becoming round, with regular nodes and no hairs; the leaves are opposite and coriaceous, with distinct midribs and lateral veins. The fruit is a berry, dark blue when ripe and glabrous, about 6-7 mm in diameter, containing 4-5 seeds. The seeds are half-moon shaped with three grooves. The current habitats are areas undergoing severe natural soil erosion, and suppression by overtopping trees and other shrubs was judged to make the environment unfavorable for the species' growth. Therefore, appropriate vegetation management of the habitats, continued habitat surveys, and research on ex situ conservation are needed.


A Study on the Optimization of State Tying Acoustic Models using Mixture Gaussian Clustering (혼합 가우시안 군집화를 이용한 상태공유 음향모델 최적화)

  • Ann, Tae-Ock
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.42 no.6
    • /
    • pp.167-176
    • /
    • 2005
  • This paper describes how the state tying model based on decision trees, one of the acoustic models used for speech recognition, can be optimized by reducing the number of mixture Gaussians in the output probability distributions. State tying modeling uses a finite set of questions that can incorporate phonological knowledge, together with a likelihood-based decision criterion, and the recognition rate can be improved by increasing the number of mixture Gaussians in the output probability distributions. In this paper, we reduce the number of mixture Gaussians at the point of highest recognition rate by clustering the Gaussians. The Bhattacharyya and Euclidean methods are used as the distance measures needed for clustering. After the pair of Gaussians with the lowest distance is found, a new Gaussian is created; its mean and variance are derived from the parameters of the pair of Gaussians being merged. Experiments were performed using the STOCKNAME (1,680) database. The test results show that the proposed method with the Bhattacharyya distance measure maintains the recognition rate at 97.2% and reduces the ratio of the number of mixture Gaussians by 1.0%, while the method with the Euclidean distance measure maintains the recognition rate at 96.9% and likewise reduces the ratio of the number of mixture Gaussians by 1.0%. These methods can thus optimize the state tying model.
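The mixture-reduction step described above pairs up the two closest Gaussians and replaces them with a single merged one. The following is a minimal univariate sketch of that idea, not the paper's implementation (the merge shown is standard moment matching, and the equal default weights are an assumption):

```python
import math

# Illustrative sketch: Bhattacharyya distance between two univariate
# Gaussians, and a moment-matched merge of a pair, as in the described
# mixture-reduction step (not the paper's actual code).

def bhattacharyya(m1, v1, m2, v2):
    """Bhattacharyya distance between N(m1, v1) and N(m2, v2); v = variance."""
    return (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))

def merge(m1, v1, m2, v2, w1=0.5, w2=0.5):
    """Moment-matched merge of two weighted Gaussians into one Gaussian."""
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (v1 + m1 ** 2) + w2 * (v2 + m2 ** 2)) / w - m ** 2
    return m, v

# Identical Gaussians are at distance 0; merging two equal-weight,
# equal-variance components gives the midpoint mean and inflated variance.
print(bhattacharyya(0.0, 1.0, 0.0, 1.0))  # 0.0
print(merge(0.0, 1.0, 2.0, 1.0))          # (1.0, 2.0)
```

In a full reducer, these two functions would be applied repeatedly: compute pairwise distances, merge the closest pair, and stop when accuracy on a development set begins to drop.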

The Association of Institutional Information on Websites with Present and Future Financial Performance (웹사이트에 게시된 기업의 소개글 분석을 통한 기업의 현재 및 미래 가치 예측 분석 방법)

  • Na, Hyung Jong;Choi, Sukjae;Kwon, Ohbyung
    • The Journal of Society for e-Business Studies
    • /
    • v.23 no.4
    • /
    • pp.63-85
    • /
    • 2018
  • The "About Us" page on a corporation's website provides information regarding the organization's vision, philosophy, and values. We examine the association between the institutional information provided on corporate websites (i.e., the "About Us" section) and present and future financial performance. Utilizing a text mining technique, we analyze the institutional information of S&P 500 firms in the year 2016. We conduct a factor analysis including words that are intentionally repeated in the introductory text of corporate websites. The results of the analysis reveal that keywords from this institutional information can be grouped into six factors. We then carry out an ordinary least squares regression analysis to determine the associations between these six factors and present financial performance. The results show that keywords in Factor 2 (those related to Purchasing experience) are positively associated with ROE, a variable representing present financial performance, while keywords in Factor 1 (those related to Note to customers) show a negative relationship with ROE. On the other hand, keywords in Factor 1 have a positive relationship with Tobin's Q, a variable representing future financial performance. These results indicate that there is some relationship between the words used in this section of corporate websites and firms' financial performance. Hence, the institutional information on a website may be a useful indicator of current firm performance and future firm value.
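The final step of the pipeline above regresses a performance measure on keyword-factor scores. As a toy sketch of that step only (the factor scores and ROE values below are fabricated for illustration, and the paper's actual regression includes multiple factors and controls), a simple closed-form OLS looks like this:

```python
# Toy sketch of the regression step: simple ordinary least squares of a
# performance measure (e.g. ROE) on one keyword-factor score.
# All numbers are fabricated; the paper uses factor scores derived from
# S&P 500 "About Us" texts.

def ols_slope_intercept(x, y):
    """Closed-form simple OLS: slope = cov(x, y) / var(x)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    slope = cov / var
    return slope, my - slope * mx

factor2_score = [0.0, 1.0, 2.0, 3.0]  # hypothetical factor scores per firm
roe = [1.0, 3.0, 5.0, 7.0]            # hypothetical ROE values

slope, intercept = ols_slope_intercept(factor2_score, roe)
print(slope, intercept)  # 2.0 1.0 -> a positive association, as Factor 2 showed
```

A positive fitted slope corresponds to the paper's finding of a positive association between Factor 2 keywords and ROE; the sign of the slope is what the study interprets, not the toy magnitudes here.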

Korean Morphological Analysis Method Based on BERT-Fused Transformer Model (BERT-Fused Transformer 모델에 기반한 한국어 형태소 분석 기법)

  • Lee, Changjae;Ra, Dongyul
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.4
    • /
    • pp.169-178
    • /
    • 2022
  • Morphemes are the most primitive units in a language, losing their original meaning when segmented into smaller parts. In Korean, a sentence is a sequence of eojeols (words) separated by spaces, and each eojeol comprises one or more morphemes. Korean morphological analysis (KMA) divides the eojeols of a given Korean sentence into morpheme units and assigns an appropriate part-of-speech (POS) tag to each resulting morpheme. KMA is one of the most important tasks in Korean natural language processing (NLP), and improving its performance is closely related to increasing the performance of Korean NLP tasks in general. Recent research on KMA has begun to adopt the approach of machine translation (MT) models. MT converts a sequence (sentence) of units of one domain into a sequence of units of another domain, and neural machine translation (NMT) denotes the MT approaches that exploit neural network models. From the MT perspective, KMA transforms an input sequence of units in the eojeol domain into a sequence of units in the morpheme domain. In this paper, we propose a deep learning model for KMA. The backbone of our model is the BERT-fused model, which has been shown to achieve high performance on NMT. The BERT-fused model combines Transformer, a representative model employed in NMT, with BERT, a language representation model that has enabled significant advances in NLP. The experimental results show that our model achieves an F1-score of 98.24.
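The abstract reports performance as an F1-score over the analyzed morphemes. As a minimal sketch of how such a score can be computed, the function below scores predicted against gold morpheme/POS tokens as multisets; this simple scoring scheme is our own illustration, not necessarily the paper's exact evaluation protocol:

```python
from collections import Counter

# Toy sketch of morpheme-level F1: compare predicted and gold
# morpheme/POS tokens as multisets (an illustrative simplification,
# not necessarily the paper's exact evaluation script).

def f1(pred, gold):
    """F1 over the multiset overlap of predicted and gold tokens."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["먹/VV", "었/EP", "다/EF"]
pred = ["먹/VV", "었/EP", "다/EC"]  # one morpheme given a wrong POS tag
print(round(f1(pred, gold), 4))     # 0.6667
```

Under this scheme a morpheme only counts as correct if both its form and its POS tag match, which is why a single wrong tag in the three-token example drops F1 to two thirds.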