• Title/Summary/Keyword: Two-level Lexicon

Search Result 9, Processing Time 0.017 seconds

Automatic Construction of Korean Two-level Lexicon using Lexical and Morphological Information (어휘 및 형태 정보를 이용한 한국어 Two-level 어휘사전 자동 구축)

  • Kim, Bogyum;Lee, Jae Sung
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.2 no.12
    • /
    • pp.865-872
    • /
    • 2013
  • Two-level morphology analysis method is one of rule-based morphological analysis method. This approach handles morphological transformation using rules and analyzes words with morpheme connection information in a lexicon. It is independent of language and Korean Two-level system was also developed. But, it was limited in practical use, because of using very small set of lexicon built manually. And it has also a over-generation problem. In this paper, we propose an automatic construction method of Korean Two-level lexicon for PC-KIMMO from morpheme tagged corpus. We also propose a method to solve over-generation problem using lexical information and sub-tags. The experiment showed that the proposed method reduced over-generation by 68% compared with the previous method, and the performance increased from 39% to 65% in f-measure.

Maximum Likelihood-based Automatic Lexicon Generation for AI Assistant-based Interaction with Mobile Devices

  • Lee, Donghyun;Park, Jae-Hyun;Kim, Kwang-Ho;Park, Jeong-Sik;Kim, Ji-Hwan;Jang, Gil-Jin;Park, Unsang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.9
    • /
    • pp.4264-4279
    • /
    • 2017
  • In this paper, maximum likelihood-based automatic lexicon generation using mixed-syllables is proposed for unlimited vocabulary voice interface for East Asian languages (e.g. Korean, Chinese and Japanese) in AI-assistant based interaction with mobile devices. The conventional lexicon has two inevitable problems: 1) a tedious repetition of out-of-lexicon unit additions to the lexicon, and 2) the propagation of errors during a morpheme analysis and space segmentation. The proposed method provides an automatic framework to solve the above problems. The proposed method produces a level of overall accuracy similar to one of previous methods in the presence of one out-of-lexicon word in a sentence, but the proposed method provides superior results with the absolute improvements of 1.62%, 5.58%, and 10.09% in terms of word accuracy when the number of out-of-lexicon words in a sentence was two, three and four, respectively.

PC-KIMMO-based Description of Mongolian Morphology

  • Jaimai, Purev;Zundui, Tsolmon;Chagnaa, Altangerel;Ock, Cheol-Young
    • Journal of Information Processing Systems
    • /
    • v.1 no.1 s.1
    • /
    • pp.41-48
    • /
    • 2005
  • This paper presents the development of a morphological processor for the Mongolian language, based on the two-level morphological model which was introduced by Koskenniemi. The aim of the study is to provide Mongolian syntactic parsers with more effective information on word structure of Mongolian words. First hand written rules that are the core of this model are compiled into finite-state transducers by a rule tool. Output of the compiler was edited to clarity by hand whenever necessary. The rules file and lexicon presented in the paper describe the morphology of Mongolian nouns, adjectives and verbs. Although the rules illustrated are not sufficient for accounting all the processes of Mongolian lexical phonology, other necessary rules can be easily added when new words are supplemented to the lexicon file. The theoretical consideration of the paper is concluded in representation of the morphological phenomena of Mongolian by the general, language-independent framework of the two-level morphological model.

A Study on the Lexicon-Use Behaviour of Architects & the Basic Lexicons in House Design (주택디자인에서 건축가들의 어휘 사용행태 및 기본어휘에 관한 연구)

  • Youn, Dae-Han
    • Journal of the Korean housing association
    • /
    • v.17 no.5
    • /
    • pp.27-37
    • /
    • 2006
  • This paper analyzed statistically two corpora that were constructed from the texts about house designs written by Korean architects and PA Awards architects. The main results are as follows; (1) The numbers of words in Korean house-design corpus were 9,352 and those of words in PA Awards house design corpus were 2,379. The former were 18.7% and the latter 4.8% of about 50,000 words regarded as the rest using scale in actual life. (2) When the architects described their house designs, the lexicon-concentration phenomenon was pervasive in both groups. Therefore, we can infer that the high-frequency lexicons are very important in house design. (3) The architects' behaviour patterns of using the house-design lexicons, went by rules according to the word frequency order. The tendency formulas of them had the $R^{2}$ values which were more than 90%. (4) In Korean house design corpus, the high frequency lexicons were '공간', '층', '주택', '집', '대지', '거실', and '실'. In PA awards house design corpus, they were 'house','room','space','living','wall','level' and 'area'. From these results, We can tell that 'space' is the highest frequency word in house design of the two groups, and that '대지 ' and 'wall' are the words that reveal well the differences between the two groups.

An Automatic Extraction of English-Korean Bilingual Terms by Using Word-level Presumptive Alignment (단어 단위의 추정 정렬을 통한 영-한 대역어의 자동 추출)

  • Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.2 no.6
    • /
    • pp.433-442
    • /
    • 2013
  • A set of bilingual terms is one of the most important factors in building language-related applications such as a machine translation system and a cross-lingual information system. In this paper, we introduce a new approach that automatically extracts candidates of English-Korean bilingual terms by using a bilingual parallel corpus and a basic English-Korean lexicon. This approach can be useful even though the size of the parallel corpus is small. A sentence alignment is achieved first for the document-level parallel corpus. We can align words between a pair of aligned sentences by referencing a basic bilingual lexicon. For unaligned words between a pair of aligned sentences, several assumptions are applied in order to align bilingual term candidates of two languages. A location of a sentence, a relation between words, and linguistic information between two languages are examples of the assumptions. An experimental result shows approximately 71.7% accuracy for the English-Korean bilingual term candidates which are automatically extracted from 1,000 bilingual parallel corpus.

Orthographic and phonological links in Korean lexical processing (한국어 어휘 처리 과정에서 글짜 정보와 발음 정보의 연결성)

  • Kim, Jee-Sun;Taft, Marcus
    • Annual Conference on Human and Language Technology
    • /
    • 1995.10a
    • /
    • pp.211-214
    • /
    • 1995
  • At what level of orthographic representation is phonology linked in thelexicon? Is it at the whole word level, the syllable level, letter level, etc? This question can be addressed by comparing the two scripts used in Korean, logographic Hanmoon and alphabetic/syllabic Hangul, on a task where judgements must be made about the phonology of a visually presented word. Four experiments are reported using a "homophone decision task" and manipulating the sub-lexical relationship between orthography and phonology in Hanmoon and Hangul, and the lexical status of the stimuli. Hangul words showed a much higher error rate in judging whether there was another word identically pronounced than both Hangul nonwords and Hanmoon words. It is concluded that the relationship between orthography and phonology in the lexicon differs according tn the type of script owing to the availability of sub-lexical information: the process of making a homophone derision is based on a spread of activation exclusively among lexical entries, from orthography to phonology and vice versa (called "Orthography-Phonology-Orthography Rebound" or "OPO Rebound"). The results are explained within the mulitilevel interactive activation model with orthographic units linked to phonological units at each level.

  • PDF

A Quantitative Linguistic Study for the Appreciation of the Lexical Richness (어휘 풍부성 평가에 대한 계량언어학적 연구 (프랑스어 텍스트를 중심으로))

  • Bae, Hee-Sook
    • Speech Sciences
    • /
    • v.7 no.3
    • /
    • pp.139-149
    • /
    • 2000
  • Studying language by the quantitative linguistic method is not a recent development. Lately however, the interest in the quantitative linguistics has increased according to the demand on communication between human and human or between human and machine. We are required to transfer the system of the natural language onto machine. This requires the study of quantitative linguistics because we are unable to seize the characters of the tiny linguistic units and their structure in an intuitive way. In fact, the quantitative linguistics treats the internal structure of the language by the relation between the linguitic units and their quantitative characters. It is natural then that there is this growing interest in quantitative linguistics. In addition, Korean linguists take interest in the quantitative linguistics, although quantitative linguistics in Korea is not advanced by the level of the statistical analysis. Therefore, this present study shows how statistics can be applied in the field of linguistics through the two texts written in French: Lovers of the Subway and Our life's A. B. C.

  • PDF

The Semantic System in Late Korean-English Bilinguals (후기 한국어-영어 이중언어자의 의미체계)

  • Jeong, Woo-Rim;Kim, Min-Jung;Lee, Seung-Bok
    • Korean Journal of Cognitive Science
    • /
    • v.19 no.2
    • /
    • pp.177-203
    • /
    • 2008
  • The present study was aimed to compare the semantic systems represented by the lexicon between L1 and L2 in late Korean-English bilinguals. The participants performed the word-picture matching task. the task was to decide whether the pictures represent the previously presented words' meaning. The words were the basic level categories. The stimuli were consisted of common object belonged to two different semantic categories (natural and artificial). To control the translation strategies, the SOA were manipulated as 650ms(Exp. 1) and 250ms(Exp. 2). No translation effort was found in the comparison of the two experiments. In both experiment, the RTs were faster in L1 rendition, and it took longer to decide the stimuli in natural categories than with artificial ones in L1. However, this category effect was not observed in L2. The results showed the differences in the organization of semantic representations in the brain through the bilinguals' two languages. While L1 semantic knowledge might be more systematically organized, that of L2 seems to be less well organized, at least by late bilinguals who participated in the present study.

  • PDF

Influence analysis of Internet buzz to corporate performance : Individual stock price prediction using sentiment analysis of online news (온라인 언급이 기업 성과에 미치는 영향 분석 : 뉴스 감성분석을 통한 기업별 주가 예측)

  • Jeong, Ji Seon;Kim, Dong Sung;Kim, Jong Woo
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.4
    • /
    • pp.37-51
    • /
    • 2015
  • Due to the development of internet technology and the rapid increase of internet data, various studies are actively conducted on how to use and analyze internet data for various purposes. In particular, in recent years, a number of studies have been performed on the applications of text mining techniques in order to overcome the limitations of the current application of structured data. Especially, there are various studies on sentimental analysis to score opinions based on the distribution of polarity such as positivity or negativity of vocabularies or sentences of the texts in documents. As a part of such studies, this study tries to predict ups and downs of stock prices of companies by performing sentimental analysis on news contexts of the particular companies in the Internet. A variety of news on companies is produced online by different economic agents, and it is diffused quickly and accessed easily in the Internet. So, based on inefficient market hypothesis, we can expect that news information of an individual company can be used to predict the fluctuations of stock prices of the company if we apply proper data analysis techniques. However, as the areas of corporate management activity are different, an analysis considering characteristics of each company is required in the analysis of text data based on machine-learning. In addition, since the news including positive or negative information on certain companies have various impacts on other companies or industry fields, an analysis for the prediction of the stock price of each company is necessary. Therefore, this study attempted to predict changes in the stock prices of the individual companies that applied a sentimental analysis of the online news data. Accordingly, this study chose top company in KOSPI 200 as the subjects of the analysis, and collected and analyzed online news data by each company produced for two years on a representative domestic search portal service, Naver. In addition, considering the differences in the meanings of vocabularies for each of the certain economic subjects, it aims to improve performance by building up a lexicon for each individual company and applying that to an analysis. As a result of the analysis, the accuracy of the prediction by each company are different, and the prediction accurate rate turned out to be 56% on average. Comparing the accuracy of the prediction of stock prices on industry sectors, 'energy/chemical', 'consumer goods for living' and 'consumer discretionary' showed a relatively higher accuracy of the prediction of stock prices than other industries, while it was found that the sectors such as 'information technology' and 'shipbuilding/transportation' industry had lower accuracy of prediction. The number of the representative companies in each industry collected was five each, so it is somewhat difficult to generalize, but it could be confirmed that there was a difference in the accuracy of the prediction of stock prices depending on industry sectors. In addition, at the individual company level, the companies such as 'Kangwon Land', 'KT & G' and 'SK Innovation' showed a relatively higher prediction accuracy as compared to other companies, while it showed that the companies such as 'Young Poong', 'LG', 'Samsung Life Insurance', and 'Doosan' had a low prediction accuracy of less than 50%. In this paper, we performed an analysis of the share price performance relative to the prediction of individual companies through the vocabulary of pre-built company to take advantage of the online news information. In this paper, we aim to improve performance of the stock prices prediction, applying online news information, through the stock price prediction of individual companies. Based on this, in the future, it will be possible to find ways to increase the stock price prediction accuracy by complementing the problem of unnecessary words that are added to the sentiment dictionary.