• Title/Summary/Keyword: Word Corpus

A Study of Headwords in Chinese-Korean Dictionaries Using Chinese Corpora and the Internet - Focusing on gu~guang

  • Park, Yeong-Jong
    • 중국학논총 / no.67 / pp.25-41 / 2020
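  • When we open a Chinese-Korean dictionary, it is not hard to find a considerable number of puzzling entries, and the definitions of some words are also problematic. This paper examines whether these words are appropriate for inclusion in the dictionary and whether their definitions are correct. To this end, we first selected from the Chinese-Korean dictionary those words that occur with low frequency in the modern Chinese corpora provided by the Institute of Applied Linguistics of the Ministry of Education of China and the Center for Chinese Linguistics at Peking University. Given that these two corpora represent a major effort to collect modern Chinese comprehensively, and accepting that scholarly achievement, we can infer that the words selected here are likely nonstandard or seldom used today. To verify this inference more accurately, the author also searched Baidu for usage examples of these words and compared the dictionary definitions in detail against actual usage. Many of the words turned out to have various problems: they are either unsuitable for inclusion in the dictionary or in need of revised definitions.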
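
The screening step described in this abstract (keeping headwords that are rare in both reference corpora) can be sketched as follows. This is a minimal illustration, not the author's procedure: the function name `low_frequency_headwords`, the threshold of 5, the rule that a word must be rare in every corpus, and the toy counts are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Sequence


def low_frequency_headwords(
    headwords: Iterable[str],
    corpora: Sequence[Dict[str, int]],
    threshold: int = 5,  # hypothetical cutoff; the abstract does not state one
) -> List[str]:
    """Return headwords whose count is below `threshold` in every corpus.

    A word missing from a corpus is treated as having a count of zero.
    """
    return [
        word
        for word in headwords
        if all(counts.get(word, 0) < threshold for counts in corpora)
    ]


# Toy frequency tables, invented for illustration only.
corpus_a = {"alpha": 1200, "beta": 3, "gamma": 0}
corpus_b = {"alpha": 950, "beta": 40, "gamma": 1}

headwords = ["alpha", "beta", "gamma", "delta"]
candidates = low_frequency_headwords(headwords, [corpus_a, corpus_b])
print(candidates)  # ['gamma', 'delta']
```

Words flagged this way would still need the manual check against real usage that the abstract describes; a low count alone only marks a candidate.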

A Study of Headwords in Chinese-Korean Dictionaries Using Chinese Corpora and the Internet - Focusing on Part of huan~hui

  • Park, Yeong-Jong
    • 중국학논총 / no.70 / pp.39-60 / 2021
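  • When we open a Chinese-Korean dictionary, it is not hard to find a considerable number of puzzling entries, and the definitions of some words are also problematic. This paper examines whether these words are appropriate for inclusion in the dictionary and whether their definitions are correct. To this end, we first selected from the Chinese-Korean dictionary those words that occur with low frequency in the modern Chinese corpora provided by the Institute of Applied Linguistics of the Ministry of Education of China and the Center for Chinese Linguistics at Peking University. Given that these two corpora represent a major effort to collect modern Chinese comprehensively, and accepting that scholarly achievement, we can infer that the words selected here are likely nonstandard or seldom used today. To verify this inference more accurately, the author also searched Baidu for usage examples of these words and compared the dictionary definitions in detail against actual usage. Many of the words turned out to have various problems: they are either unsuitable for inclusion in the dictionary or in need of revised definitions.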

A Study of the Suitability of Headwords in Chinese-Korean Dictionaries Using Chinese Corpora and the Internet - Focusing on 'ge~gou'

  • Park, Yeong-Jong
    • 중국학논총 / no.61 / pp.1-18 / 2019
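  • When we open a Chinese-Korean dictionary, it is not hard to find a considerable number of puzzling entries, and the definitions of some words are also problematic. This paper examines whether these words are appropriate for inclusion in the dictionary and whether their definitions are correct. To this end, we first selected from the Chinese-Korean dictionary those words that occur with low frequency in the modern Chinese corpora provided by the Institute of Applied Linguistics of the Ministry of Education of China and the Center for Chinese Linguistics at Peking University. Given that these two corpora represent a major effort to collect modern Chinese comprehensively, and accepting that scholarly achievement, we can infer that the words selected here are likely nonstandard or seldom used today. To verify this inference more accurately, the author also searched Baidu for usage examples of these words and compared the dictionary definitions in detail against actual usage. Many of the words turned out to have various problems: they are either unsuitable for inclusion in the dictionary or in need of revised definitions.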

Identification of Profane Words in Cyberbullying Incidents within Social Networks

  • Ali, Wan Noor Hamiza Wan;Mohd, Masnizah;Fauzi, Fariza
    • Journal of Information Science Theory and Practice / v.9 no.1 / pp.24-34 / 2021
  • The popularity of social networking sites (SNS) has facilitated communication between users. SNS help users in their daily lives in various ways, such as sharing opinions, keeping in touch with old friends, making new friends, and getting information. However, some users misuse SNS to belittle or hurt others using profanities, which is typical in cyberbullying incidents. Thus, in this study, we aim to identify profane words in the ASKfm corpus and, based on a lexicon dictionary, analyze the profane word distribution across four roles involved in cyberbullying. These four roles are: harasser, victim, bystander who assists the bully, and bystander who defends the victim. The evaluation focused on occurrences of profane words for each role in the corpus. The top 10 common words used in the corpus are also identified and represented in a graph. The analysis shows that these four roles used profane words in their conversations with different weightage and distribution, even though the profane words used are mostly similar. The harasser ranked first in the use of profane words in conversation compared to the other roles. The results can be further explored and considered as a potential feature in a cyberbullying detection model using a machine learning approach. The results of this work will contribute to formulating a suitable representation. They are also useful for future work on modeling cyberbullying detection based on the identification of profane word distribution across different cyberbullying roles in social networks.
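
The lexicon-based counting of profane words per role can be sketched as below. This is only an illustration under stated assumptions: the function name `profane_counts_by_role`, the role labels, the whitespace tokenization, and the mild placeholder lexicon and posts are all invented here and do not come from the ASKfm corpus or the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def profane_counts_by_role(
    posts: Iterable[Tuple[str, str]],
    lexicon: Set[str],
) -> Dict[str, Counter]:
    """Count lexicon words per role from (role, text) pairs.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption of this sketch rather than the paper's preprocessing.
    """
    counts: Dict[str, Counter] = {}
    for role, text in posts:
        tokens = text.lower().split()
        counts.setdefault(role, Counter()).update(
            token for token in tokens if token in lexicon
        )
    return counts


# Placeholder lexicon and posts, invented for illustration only.
lexicon = {"darn", "heck"}
posts = [
    ("harasser", "you darn fool"),
    ("harasser", "heck no darn"),
    ("victim", "leave me alone"),
    ("bystander_defender", "heck stop it"),
]

by_role = profane_counts_by_role(posts, lexicon)
for role, counter in by_role.items():
    print(role, sum(counter.values()), counter.most_common(10))
```

Per-role totals and `most_common(10)` correspond to the two quantities the abstract reports: occurrences per role and the top 10 words.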

Chatbot Design Method Using Hybrid Word Vector Expression Model Based on Real Telemarketing Data

  • Zhang, Jie;Zhang, Jianing;Ma, Shuhao;Yang, Jie;Gui, Guan
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.4 / pp.1400-1418 / 2020
  • In commercial promotion, the chatbot is a significant application of natural language processing (NLP). Conventional design methods use the bag-of-words (BOW) model alone, based on the Google database and other online corpora. For one thing, in the bag-of-words model, the vectors are unrelated to one another. Even though this method is friendly to discrete features, it does not help the machine understand continuous statements, because the connection between words is lost in the encoded word vector. For another, existing methods are tested on state-of-the-art online corpora but are hard to apply to real applications such as telemarketing data. In this paper, we propose an improved chatbot design method using a hybrid of the bag-of-words model and the skip-gram model, based on real telemarketing data. Specifically, we first collect real data in the telemarketing field and perform data cleaning and data classification on the constructed corpus. Second, word representation adopts the hybrid bag-of-words and skip-gram model. The skip-gram model maps synonyms to nearby positions in the vector space. Because the correlation between words is expressed, the amount of information contained in the word vector increases, making up for the shortcomings of using the bag-of-words model alone. Third, we use the term frequency-inverse document frequency (TF-IDF) weighting method to increase the weight of key words, then output the final word expression. Finally, the answer is produced using a hybrid of a retrieval model and a generative model. The retrieval model can accurately answer questions in the field. The generative model supplements it by answering open-domain questions, where the final reply is produced by long short-term memory (LSTM) training and prediction. Experimental results show that the hybrid word vector expression model can improve the accuracy of the response and that the whole system can communicate with humans.
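
The TF-IDF weighting step mentioned in this abstract can be illustrated with a small standard-library sketch. It uses the textbook form, term frequency times log(N / document frequency); the paper's exact variant, its tokenization, and how the weights are combined with the skip-gram vectors are not given here, so the function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute textbook TF-IDF weights for tokenized documents.

    tf  = count of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights: List[Dict[str, float]] = []
    for doc in docs:
        length = len(doc) or 1  # avoid division by zero on empty documents
        term_counts = Counter(doc)
        weights.append(
            {
                term: (count / length) * math.log(n_docs / doc_freq[term])
                for term, count in term_counts.items()
            }
        )
    return weights


# Toy tokenized utterances, invented for illustration only.
docs = [
    ["price", "plan", "price"],
    ["plan", "call"],
    ["plan", "refund"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy data "plan" appears in every document, so its weight is zero, while "price" is specific to the first document and gets a positive weight. That is the sense in which TF-IDF raises the weight of key words.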

Sentiment Analysis using Robust Parallel Tri-LSTM Sentence Embedding in Out-of-Vocabulary Word (Out-of-Vocabulary 단어에 강건한 병렬 Tri-LSTM 문장 임베딩을 이용한 감정분석)

  • Lee, Hyun Young;Kang, Seung Shik
    • Smart Media Journal / v.10 no.1 / pp.16-24 / 2021
  • Existing word embedding methods such as word2vec represent only the words that occur in the raw training corpus as fixed-length vectors in a continuous vector space, so the out-of-vocabulary (OOV) problem often arises when mapping words to fixed-length vectors in morphologically rich languages. Even for sentence embedding, where the meaning of a sentence is represented as a fixed-length vector by combining the word vectors that constitute it, OOV words make it challenging to represent the sentence meaningfully. In particular, since Korean, an agglutinative language, combines lexical and grammatical morphemes, handling OOV words is an important factor in improving performance. In this paper, we propose a parallel Tri-LSTM sentence embedding that is robust to the OOV problem by extending the use of word-level morphological information to the sentence level. In a sentiment analysis task on a Korean corpus, we found empirically that the character unit is better than the morpheme unit as an embedding unit for Korean sentence embedding. We achieved 86.17% accuracy on the sentiment analysis task with the parallel bidirectional Tri-LSTM sentence encoder.

Korean Probabilistic Dependency Grammar Induction by morpheme (형태소 단위의 한국어 확률 의존문법 학습)

  • Choi, Seon-Hwa;Park, Hyuk-Ro
    • The KIPS Transactions:PartB / v.9B no.6 / pp.791-798 / 2002
  • In this thesis, we present a new method for inducing a probabilistic dependency grammar (PDG) from a text corpus. As words in Korean are composed of a set of more basic morphemes, there exist various dependency relations within a word. So, if the induction process does not take these in-word dependency relations into account, the accuracy of the resulting grammar may be poor. In comparison with previous PDG induction methods, the main difference of the proposed method is that it takes into account in-word dependency relations as well as inter-word dependency relations. To assess the performance of the proposed method, we conducted an experiment using a manually tagged corpus of 25,000 sentences compiled by the Korea Advanced Institute of Science and Technology (KAIST). The grammar induction produced 2,349 dependency rules. The parser with these dependency rules showed 69.77% accuracy in terms of the number of correct dependency relations relative to the total number of dependency relations for the best-1 parse trees of sample sentences. The result shows that taking in-word dependency relations into account in the course of grammar induction results in a more accurate dependency grammar.
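
The accuracy measure described in this abstract, correct dependency relations divided by total dependency relations, can be written out as a short sketch. The representation of a relation as a `(head_index, dependent_index)` pair, the function name `dependency_accuracy`, and the toy parses are assumptions for illustration; the paper's data format is not given here.

```python
from typing import List, Set, Tuple

Relation = Tuple[int, int]  # (head index, dependent index); hypothetical encoding


def dependency_accuracy(
    predicted: List[Set[Relation]],
    gold: List[Set[Relation]],
) -> float:
    """Return correct relations / total gold relations over all sentences."""
    correct = 0
    total = 0
    for pred_relations, gold_relations in zip(predicted, gold):
        correct += len(pred_relations & gold_relations)
        total += len(gold_relations)
    if total == 0:
        raise ValueError("gold standard contains no dependency relations")
    return correct / total


# One toy sentence with three gold relations, invented for illustration.
gold = [{(2, 1), (0, 2), (2, 3)}]
predicted = [{(2, 1), (0, 2), (1, 3)}]  # the third attachment is wrong

accuracy = dependency_accuracy(predicted, gold)
print(f"{accuracy:.2%}")  # 66.67%
```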

A Hybrid Method of Verb disambiguation in Machine Translation (기계번역에서 동사 모호성 해결에 관한 하이브리드 기법)

  • Moon, Yoo-Jin;Martha Palmer
    • The Transactions of the Korea Information Processing Society / v.5 no.3 / pp.681-687 / 1998
  • This paper presents a hybrid method for disambiguating verb meaning in machine translation. The presented verb translation algorithm performs the concept-based method and the statistics-based method simultaneously. It uses a collocation dictionary, WordNet, and statistical information extracted from a corpus. In the transfer phase of machine translation, it tries to find the target word for the source verb. If it fails, it refers to WordNet and tries to find the target word by calculating word similarities between the logical constraints of the source sentence and those in the collocation dictionary. At the same time, it refers to the statistical information extracted from the corpus and tries to find the target word by calculating co-occurrence similarity. The experimental results show that the algorithm translates verbs more accurately than the other algorithms and improves verb translation accuracy by 24.8% compared to the collocation-based method.

Segmenting and Classifying Korean Words based on Syllables Using Instance-Based Learning (사례기반 학습을 이용한 음절기반 한국어 단어 분리 및 범주 결정)

  • Kim, Jae-Hoon;Lee, Kong-Joo
    • The KIPS Transactions:PartB / v.10B no.1 / pp.47-56 / 2003
  • Korean delimits words by white space like English, but words in Korean are a little different in structure from those in English. A white-space-delimited unit in English generally consists of one word, but one in Korean is composed of one or more words and/or morphemes. Because of this difference, a unit between white spaces is called an Eojeol in Korean. We propose a method for segmenting and classifying Korean words and/or morphemes based on syllables using instance-based learning. In this paper, the elements of the feature set for instance-based learning are one previous syllable, the current syllable, two next syllables, the final consonant of the current syllable, and two previous categories. Our method achieves an F-measure of more than 97% for word segmentation on the ETRI corpus and the KAIST corpus.
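
The feature set listed in this abstract can be sketched as a small extraction function. The final consonant is recovered from the standard Unicode layout of precomposed Hangul syllables (U+AC00 to U+D7A3, 28 final-consonant slots per vowel, index 0 meaning no final consonant). Everything else is an assumption for illustration: the function names, the `<PAD>` symbol, the dictionary keys, and the category labels are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Optional, Union

HANGUL_FIRST = 0xAC00  # first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable
FINAL_SLOTS = 28       # final-consonant slots per vowel, slot 0 = none
PAD = "<PAD>"          # hypothetical padding symbol for out-of-range positions


def final_consonant_index(syllable: str) -> Optional[int]:
    """Return the final-consonant index (0 = none), or None if not Hangul."""
    code = ord(syllable)
    if HANGUL_FIRST <= code <= HANGUL_LAST:
        return (code - HANGUL_FIRST) % FINAL_SLOTS
    return None


def syllable_at(syllables: List[str], position: int) -> str:
    """Return the syllable at `position`, or PAD outside the sequence."""
    if 0 <= position < len(syllables):
        return syllables[position]
    return PAD


def syllable_features(
    syllables: List[str],
    i: int,
    previous_categories: List[str],
) -> Dict[str, Union[str, int, None]]:
    """Build one instance: 1 previous, current, 2 next syllables,
    the current final consonant, and the 2 previous categories."""
    padded = [PAD, PAD] + list(previous_categories)
    return {
        "prev_syllable": syllable_at(syllables, i - 1),
        "cur_syllable": syllable_at(syllables, i),
        "next_syllable_1": syllable_at(syllables, i + 1),
        "next_syllable_2": syllable_at(syllables, i + 2),
        "cur_final": final_consonant_index(syllables[i]),
        "prev_category_2": padded[-2],
        "prev_category_1": padded[-1],
    }


eojeol = list("한국어를")
features = syllable_features(eojeol, 1, ["B-N"])  # "B-N" is a made-up label
print(features)
```

An instance-based learner would compare such feature dictionaries against stored training instances. The choice of similarity measure is outside what the abstract states, so it is left out of the sketch.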

Patterns of consonant deletion in the word-internal onset position: Evidence from spontaneous Seoul Korean speech

  • Kim, Jungsun;Yun, Weonhee;Kang, Ducksoo
    • Phonetics and Speech Sciences / v.8 no.1 / pp.45-51 / 2016
  • This study examined the deletion of onset consonants in word-internal position in spontaneous Seoul Korean speech. It used a dataset of speakers in their 20s extracted from the Korean Corpus of Spontaneous Speech (Yun et al., 2015). The proportion of deletion of word-internal onset consonants was analyzed using a linear mixed-effects regression model. The factors that promoted the deletion of onsets were primarily the types of consonants and their phonetic contexts. The results showed that onset deletion was more likely to occur for the lenis velar stop [k] than for the other consonants and, among phonetic contexts, when the preceding vowel was the low central vowel [a]. Moreover, some speakers tended to delete onset consonants (e.g., [k] and [n]) more frequently than other speakers, which reflected individual differences. This study implies that word-internal onsets undergo a process of gradient reduction within individuals' articulatory strategies.