• Title/Summary/Keyword: Dictionary learning


Study on Automatic Mapping Method for Reference of Scholarly Papers (학술논문의 참고문헌 자동매핑 방법에 관한 연구)

  • Han, Jeong-Min;Jang, Hyun-Chul;Kim, Jin-Hyun;Yea, Sang-Jun;Kim, Sang-Kyun;Kim, Chul;Song, Mi-Young
    • Journal of Information Management / v.41 no.3 / pp.155-173 / 2010
  • As learning advances and topics diversify, researchers in every field keenly feel the need to find the information they require precisely and quickly at any time. This study presents a way of constructing an automatic mapping system that compares and analyzes duplicated data and reports the result, built on an effective reference extraction method, together with a way of correcting wrongly used Chinese characters with a Traditional Korean Medicine dictionary. With this approach, duplicated reference data and Chinese character errors can be fixed. The method targets the situation in which references from newly published papers are continuously extracted.

Inference of Korean Public Sentiment from Online News (온라인 뉴스에 대한 한국 대중의 감정 예측)

  • Matteson, Andrew Stuart;Choi, Soon-Young;Lim, Heui-Seok
    • Journal of the Korea Convergence Society / v.9 no.7 / pp.25-31 / 2018
  • Online news has replaced the traditional newspaper and has profoundly transformed the way we access and share information. News websites have long allowed users to post comments, and some have also begun to crowdsource reactions to news articles. The field of sentiment analysis seeks to computationally model the emotions and reactions experienced when presented with text. In this work, we analyze more than 100,000 news articles over ten categories with five user-generated emotional annotations to determine whether these reactions correlate mathematically with the news body text, and we propose a simple sentiment analysis algorithm that requires minimal preprocessing and no machine learning. We show that it is effective even for a morphologically complex language like Korean.

Automatic Construction of a Negative/positive Corpus and Emotional Classification using the Internet Emotional Sign (인터넷 감정기호를 이용한 긍정/부정 말뭉치 구축 및 감정분류 자동화)

  • Jang, Kyoungae;Park, Sanghyun;Kim, Woo-Je
    • Journal of KIISE / v.42 no.4 / pp.512-521 / 2015
  • Internet users purchase goods on the Internet and express their positive or negative emotions about the goods in product reviews. Analysis of product reviews yields critical data both for potential consumers and for enterprise decision making. The importance of opinion mining techniques, which derive opinions by analyzing meaningful data from large numbers of Internet reviews, is therefore growing. Existing studies were mostly based on comments written in English, and analysis of Korean has not been actively pursued. Unlike English, Korean is characterized by complex adjectives and suffixes. Existing studies also did not consider the characteristics of Internet language. This study proposes an emotional classification method that increases classification accuracy by analyzing the characteristics of Internet language that connote feelings. Positive and negative comments about products can be classified automatically using Internet emoticons. The validity of the proposed algorithm is confirmed by the high precision, recall, and coverage obtained in the evaluation of this method.
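The idea of using Internet emoticons as labels can be illustrated with a minimal sketch. It is not the authors' algorithm: the emoticon lists, the function names `label_by_emoticon` and `build_corpus`, and the simple majority rule are all assumptions made for illustration.

```python
# Hypothetical emoticon lists; the paper's actual sign inventory is not given here.
POSITIVE_SIGNS = ("^^", "ㅎㅎ", "ㅋㅋ", ":)")
NEGATIVE_SIGNS = ("ㅠㅠ", "ㅜㅜ", ":(", "-_-")


def label_by_emoticon(comment):
    """Return "positive", "negative", or None when the signs are absent or tied."""
    positive = sum(comment.count(sign) for sign in POSITIVE_SIGNS)
    negative = sum(comment.count(sign) for sign in NEGATIVE_SIGNS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return None


def build_corpus(comments):
    """Group comments that carry an unambiguous emoticon signal by polarity."""
    corpus = {"positive": [], "negative": []}
    for comment in comments:
        label = label_by_emoticon(comment)
        if label is not None:
            corpus[label].append(comment)
    return corpus
```

Comments without any sign are left out of the corpus in this sketch, which is one way the coverage measure mentioned in the abstract could fall below 100%.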

An Opinionated Document Retrieval System based on Hybrid Method (혼합 방식에 기반한 의견 문서 검색 시스템)

  • Lee, Seung-Wook;Song, Young-In;Rim, Hae-Chang
    • Journal of the Korean Society for Information Management / v.25 no.4 / pp.115-129 / 2008
  • With its recent growth and popularization, the Web has changed from a space for seeking information into a place where people express, share, and debate their opinions. Accordingly, the need to search for opinions expressed on the Web is also increasing. However, it is difficult to meet this need with a classical information retrieval system that considers only the relevance between the user's query and documents. Instead, a more advanced system that captures the subjective information in documents is required. The proposed system effectively retrieves opinionated documents by building on an existing information retrieval system. This paper proposes a hybrid method that uses both a dictionary-based opinion analysis technique and a machine learning based opinion analysis technique. Experimental results show that the proposed method is effective in improving performance.
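One way to picture such a hybrid is a weighted mix of a lexicon score and a classifier probability that then rescales the relevance score from the underlying retrieval system. The sketch below is only an assumption about how the pieces could fit together: the lexicon, the linear interpolation, the `weight` parameter, and all function names are hypothetical and not taken from the paper.

```python
# Hypothetical opinion lexicon used by the dictionary-based component.
OPINION_LEXICON = {"good", "bad", "love", "hate", "terrible", "great"}


def dictionary_score(text):
    """Fraction of tokens that appear in the opinion lexicon."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in OPINION_LEXICON)
    return hits / len(tokens)


def hybrid_opinion_score(text, ml_probability, weight=0.5):
    """Interpolate the lexicon score with a classifier's opinion probability."""
    return weight * dictionary_score(text) + (1.0 - weight) * ml_probability


def rerank(results, weight=0.5):
    """Reorder (text, relevance, ml_probability) triples by relevance times opinion score."""
    scored = [
        (relevance * hybrid_opinion_score(text, probability, weight), text)
        for text, relevance, probability in results
    ]
    return [text for _, text in sorted(scored, key=lambda item: item[0], reverse=True)]
```

Here `ml_probability` stands in for the output of any trained opinion classifier; the sketch deliberately leaves that model unspecified.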

Korean Semantic Role Labeling Using Semantic Frames and Synonym Clusters (의미 프레임과 유의어 클러스터를 이용한 한국어 의미역 인식)

  • Lim, Soojong;Lim, Joon-Ho;Lee, Chung-Hee;Kim, Hyun-Ki
    • Journal of KIISE / v.43 no.7 / pp.773-780 / 2016
  • Semantic information and features are very important for Semantic Role Labeling (SRL), though many machine learning based SRL systems mainly adopt lexical and syntactic features. Little previous SRL research has been based on semantic information because its use is very restricted. We propose an SRL system that adopts semantic information such as named entities, word sense disambiguation, sense-based filtering of adjunct roles, synonym clusters, frame extension based on a synonym dictionary, joint rules of syntactic-semantic information, and modified verb-specific numbered roles. In our experiments, the proposed method outperforms lexical-syntactic approaches by about 3.77 (Korean Propbank) to 8.05 (Exobrain Corpus) points of F1-score.

Research on Subword Tokenization of Korean Neural Machine Translation and Proposal for Tokenization Method to Separate Jongsung from Syllables (한국어 인공신경망 기계번역의 서브 워드 분절 연구 및 음절 기반 종성 분리 토큰화 제안)

  • Eo, Sugyeong;Park, Chanjun;Moon, Hyeonseok;Lim, Heuiseok
    • Journal of the Korea Convergence Society / v.12 no.3 / pp.1-7 / 2021
  • Since Neural Machine Translation (NMT) uses only a limited vocabulary, words that are not registered in the dictionary may appear in the input. A common remedy for this Out of Vocabulary (OOV) problem is subword tokenization, a methodology that builds words from subword units smaller than words by splitting sentences into those units. In this paper, we review general subword tokenization algorithms. Furthermore, in order to create a vocabulary that can handle the effectively unlimited conjugation of Korean adjectives and verbs, we propose a new methodology for subword tokenization training that separates the Jongsung (coda) from Korean syllables (which consist of Chosung-onset, Jungsung-nucleus, and Jongsung-coda). In the experiment, the proposed methodology outperforms the existing subword tokenization methodology.
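Separating the Jongsung from a precomposed Hangul syllable can be done with Unicode arithmetic, since each syllable is encoded as 0xAC00 + (onset × 21 + nucleus) × 28 + coda. The sketch below is an illustration under that encoding only; the function name `split_jongsung` and the choice to emit the coda as a conjoining trailing-consonant jamo are assumptions, and the paper's own preprocessing may differ.

```python
HANGUL_BASE = 0xAC00         # first precomposed syllable, "가"
HANGUL_LAST = 0xD7A3         # last precomposed syllable, "힣"
JONGSUNG_COUNT = 28          # 27 codas plus the "no coda" slot
JONGSUNG_JAMO_BASE = 0x11A7  # U+11A8 is the first trailing-consonant jamo


def split_jongsung(text):
    """Rewrite each syllable that has a coda as (coda-less syllable, coda jamo)."""
    pieces = []
    for char in text:
        code = ord(char)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            coda = (code - HANGUL_BASE) % JONGSUNG_COUNT
            if coda:
                pieces.append(chr(code - coda))                # same onset and nucleus, no coda
                pieces.append(chr(JONGSUNG_JAMO_BASE + coda))  # the separated coda
                continue
        pieces.append(char)  # non-Hangul characters and coda-less syllables pass through
    return "".join(pieces)
```

The transformed text could then be passed to any subword tokenizer trainer, so that conjugated forms differing only in their coda share the same coda-less syllable unit.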

Extracting mood metadata through sound effects of video (영상의 효과음을 통한 분위기 메타데이터 추출)

  • You, Yeon-Hwi;Park, Hyo-Gyeong;Yong, Sung-Jung;Lee, Seo-Young;Moon, Il-Young
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.05a / pp.453-455 / 2022
  • Metadata is structured data that describes the attributes and features of other data. Video metadata, in particular, refers to data extracted from the information constituting a video to enable accurate content-based search. As the number of users of video content has recently grown, the number of OTT providers has also increased, and metadata is becoming more important for OTT providers that need to recommend large amounts of video content to individual users or to support appropriate search. This paper studies a method of automatically extracting metadata for mood attributes through the sound effects of videos. In order to classify the sound effects of a video and generate metadata about its mood attributes, we propose establishing a terminology dictionary for mood and extracting information through supervised learning.


A Two-Stage Learning Method of CNN and K-means RGB Cluster for Sentiment Classification of Images (이미지 감성분류를 위한 CNN과 K-means RGB Cluster 이-단계 학습 방안)

  • Kim, Jeongtae;Park, Eunbi;Han, Kiwoong;Lee, Junghyun;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.27 no.3 / pp.139-156 / 2021
  • The biggest reason for using a deep learning model in image classification is that it can consider the relationship between regions by extracting each region's features from the overall information of the image. However, a CNN model may not be suitable for emotional image data that lacks regional features. To address the difficulty of classifying emotion images, researchers propose CNN-based architectures suited to emotion images every year. Studies on the relationship between color and human emotion have also been conducted, and they found that different colors induce different emotions. Among studies using deep learning, some have applied color information to image emotion classification. Using the image's color information in addition to the image itself improves the accuracy of classifying image emotions compared with training the classification model on the image alone. This study proposes two ways to increase accuracy by adjusting the result value after the model classifies an image's emotion. Both methods improve accuracy by modifying the result value based on statistics of the colors in the picture. The two-color combinations most distributed across all training data are found first; at test time, the two-color combination most distributed in each test image is found, and the result values are corrected according to the color combination distribution. The methods weight the result value obtained after the model classifies an image's emotion using expressions based on the log function and the exponential function. Emotion6, classified into six emotions, and Artphoto, classified into eight categories, were used as the image data. Densenet169, Mnasnet, Resnet101, Resnet152, and Vgg19 architectures were used for the CNN model, and performance was compared before and after applying the two-stage learning to the CNN model.
Inspired by color psychology, which deals with the relationship between colors and emotions, we studied how to improve accuracy when building a model that classifies an image's sentiment by modifying the result values based on color. Sixteen colors were used: red, orange, yellow, green, blue, indigo, purple, turquoise, pink, magenta, brown, gray, silver, gold, white, and black, each of which carries a meaning. Using Scikit-learn's clustering, the seven colors that are primarily distributed in the image are identified. Then, the RGB coordinate values of these colors are compared with the RGB coordinate values of the 16 colors listed above; that is, each is converted to the closest of the 16 colors. If combinations of three or more colors are selected, too many combinations occur and the distribution becomes scattered, so each combination has less influence on the result value. To solve this problem, two-color combinations were found and used to weight the model's output. Before training, the most distributed color combinations were found for all training data images. The distribution of color combinations for each class was stored in a Python dictionary to be used during testing. During the test, the two-color combination that is most distributed in each test image is found. We then checked how that color combination was distributed in the training data and corrected the result accordingly. We devised several equations to weight the result value from the model based on the extracted color as described above. The data set was randomly divided 80:20, and the model was verified using 20% of the data as a test set. The remaining 80% was split into five folds for 5-fold cross-validation, and the model was trained five times, each time with a different validation fold. Finally, performance was checked using the test dataset that had been set aside.
Adam was used as the optimizer, and the learning rate was set to 0.01. Training ran for up to 20 epochs, and if the validation loss did not decrease for five consecutive epochs, training was stopped. Early stopping was set to load the model with the best validation loss. Classification accuracy was better when the information extracted from color properties was used together with the CNN architecture than when only the CNN architecture was used.
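The color-combination bookkeeping described in this abstract can be sketched in plain Python. The sketch assumes the dominant colors of each image have already been extracted (for example as K-means cluster centers with their pixel shares); the RGB values assigned to the 16 color names, the function names, and the input layout are all hypothetical, and the paper's log and exponential weighting equations are not reproduced.

```python
from collections import Counter, defaultdict

# Hypothetical RGB coordinates for the 16 named colors (common web-color values).
NAMED_COLORS = {
    "red": (255, 0, 0), "orange": (255, 165, 0), "yellow": (255, 255, 0),
    "green": (0, 128, 0), "blue": (0, 0, 255), "indigo": (75, 0, 130),
    "purple": (128, 0, 128), "turquoise": (64, 224, 208), "pink": (255, 192, 203),
    "magenta": (255, 0, 255), "brown": (165, 42, 42), "gray": (128, 128, 128),
    "silver": (192, 192, 192), "gold": (255, 215, 0), "white": (255, 255, 255),
    "black": (0, 0, 0),
}


def nearest_color(rgb):
    """Name of the reference color with the smallest squared RGB distance."""
    return min(
        NAMED_COLORS,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, NAMED_COLORS[name])),
    )


def top_color_pair(dominant_colors):
    """Most-distributed two-color combination from (rgb, pixel_share) pairs."""
    shares = Counter()
    for rgb, share in dominant_colors:
        shares[nearest_color(rgb)] += share
    return tuple(sorted(name for name, _ in shares.most_common(2)))


def pair_distribution(training_items):
    """Per-class counts of two-color combinations, stored as a dictionary."""
    distribution = defaultdict(Counter)
    for label, dominant_colors in training_items:
        distribution[label][top_color_pair(dominant_colors)] += 1
    return distribution
```

At test time, the pair found for a test image would be looked up in this dictionary, and the per-class counts would feed whatever weighting expression is applied to the model's output.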

Sentiment Analysis of Movie Review Using Integrated CNN-LSTM Model (CNN-LSTM 조합모델을 이용한 영화리뷰 감성분석)

  • Park, Ho-yeon;Kim, Kyoung-jae
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.141-154 / 2019
  • Internet technology and social media are growing rapidly. Data mining technology has evolved to enable unstructured document representations in a variety of applications. Sentiment analysis is an important technology that can distinguish poor-quality from high-quality content through text data about products, and it has proliferated within text mining. Sentiment analysis mainly analyzes people's opinions in text data by assigning predefined categories such as positive and negative. It has been studied in various directions in terms of accuracy, from simple rule-based approaches to dictionary-based approaches using predefined labels. In fact, sentiment analysis is one of the most active research areas in natural language processing and is widely studied in text mining. Because real online reviews are openly available to others, they are not only easy to collect but also affect business. In marketing, real-world information from customers is gathered on websites, not through surveys. Whether a website's posts are positive or negative, the customer response is reflected in sales, so firms try to identify this information. However, reviews on a website are not always good and can be difficult to identify. Earlier studies in this research area used review data from the Amazon.com shopping mall, but recent studies use data on stock market trends, blogs, news articles, weather forecasts, IMDB, Facebook, etc. However, a lack of accuracy is recognized because sentiment calculations change according to the subject, paragraph, sentiment lexicon direction, and sentence strength. This study aims to classify sentiment polarity into positive and negative categories and to increase the prediction accuracy of polarity analysis using the pretrained IMDB review data set.
First, for text classification related to sentiment analysis, popular machine learning algorithms such as NB (naive Bayes), SVM (support vector machines), XGBoost, RF (random forests), and Gradient Boost are adopted as comparative models. Second, deep learning has demonstrated the ability to extract complex, discriminative features from data. Representative algorithms are CNN (convolutional neural networks), RNN (recurrent neural networks), and LSTM (long short-term memory). CNN can be used similarly to BoW when processing a sentence in vector format, but it does not consider sequential data attributes. RNN handles ordered data well because it takes the time information of the data into account, but it suffers from the long-term dependency problem. LSTM is used to solve the problem of long-term dependence. For the comparison, CNN and LSTM were chosen as simple deep learning models. In addition to classical machine learning algorithms, CNN, LSTM, and the integrated models were analyzed. Although the algorithms have many parameters, we examined the relationship between parameter values and precision to find the optimal combination. We also tried to determine how well the models work for sentiment analysis and how they work. This study proposes an integrated CNN and LSTM algorithm to extract the positive and negative features of text. The reasons for combining these two algorithms are as follows. CNN can extract features for classification automatically by applying convolution layers and massively parallel processing. LSTM is not capable of highly parallel processing. Like faucets, the LSTM has input, output, and forget gates that can be moved and controlled at a desired time. These gates have the advantage of placing memory blocks on hidden nodes. The memory block of the LSTM may not store all the data, but it can solve the CNN's long-term dependency problem.
Furthermore, when LSTM is used after CNN's pooling layer, the model has an end-to-end structure, so that spatial and temporal features can be designed simultaneously. The combined CNN-LSTM model achieved 90.33% accuracy. It is slower than CNN but faster than LSTM. The presented model was more accurate than the other models. In addition, each word embedding layer can be improved when training the kernel step by step. CNN-LSTM can compensate for the weaknesses of each model, and the end-to-end structure of LSTM has the advantage of improving learning layer by layer. For these reasons, this study tries to enhance the classification accuracy of movie reviews using the integrated CNN-LSTM model.
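For readers unfamiliar with the input, output, and forget gates mentioned above, the standard LSTM cell can be written as follows. This is the textbook formulation, not a formula taken from the paper; treating $x_t$ as the feature vector produced by the CNN's pooling layer at step $t$ is an assumption about how the two parts connect, and the weight names $W$, $U$, $b$ are generic.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The additive update of $c_t$ is what lets information persist across many steps, which is the sense in which the LSTM addresses the long-term dependency problem.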

A quantitative study on the minimal pair of Korean phonemes: Focused on syllable-initial consonants (한국어 음소 최소대립쌍의 계량언어학적 연구: 초성 자음을 중심으로)

  • Jung, Jieun
    • Phonetics and Speech Sciences / v.11 no.1 / pp.29-40 / 2019
  • This paper investigates the minimal pairs of Korean phonemes quantitatively. To this end, I calculated the number of consonant minimal pairs in syllable-initial position as both raw counts and relative counts, and analyzed the part-of-speech relations between the two words in each minimal pair. "Urimalsaem" was chosen as the object of study because minimal pair analysis should be done on a dictionary and it is the largest Korean dictionary. The results are summarized as follows. First, there were 153 types of minimal pairs out of 337,135 examples. The ranking of phoneme pairs from highest to lowest was 'ㅅ-ㅈ, ㄱ-ㅅ, ㄱ-ㅈ, ㄱ-ㅂ, ㄱ-ㅎ, …, ㅆ-ㅋ, ㄸ-ㅋ, ㅉ-ㅋ, ㄹ-ㅃ, ㅃ-ㅋ'. The phonemes that played a major role in the formation of minimal pairs were /ㄱ, ㅅ, ㅈ, ㅂ, ㅊ/, in that order, which showed a high proportion of palatals. The correlation between the raw count and the relative count of minimal pairs was quite high (r = 0.937). Second, 87.91% of the minimal pairs shared the same part of speech (syntactic category). The most frequently observed type was the 'noun-noun' pair (70.25%), followed by the 'vowel-vowel' pair (14.77%). This suggests that minimal pairs could be grouped into similar categories in terms of semantics. The results of this study can serve as basic data on Korean phonemes for research in Korean linguistics, speech-language pathology, language education, language acquisition, speech synthesis, and artificial intelligence and machine learning.
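Counting syllable-initial consonant minimal pairs over a word list can be sketched as follows. This is an illustrative approach rather than the study's procedure: it works on precomposed Hangul spelling instead of phonemic transcription, skips the silent onset ㅇ, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # the 19 onsets in Unicode order
HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3
ONSET_SPAN = 21 * 28  # number of syllables that share one onset


def count_minimal_pairs(words):
    """Count word pairs that differ only in the onset consonant of one syllable."""
    groups = defaultdict(set)
    for word in set(words):
        for position, char in enumerate(word):
            code = ord(char)
            if not HANGUL_BASE <= code <= HANGUL_LAST:
                continue
            onset = CHOSEONG[(code - HANGUL_BASE) // ONSET_SPAN]
            if onset == "ㅇ":  # assumption: the silent onset is not treated as a consonant
                continue
            rime = (code - HANGUL_BASE) % ONSET_SPAN  # nucleus and coda of this syllable
            key = (word[:position], rime, word[position + 1:])
            groups[key].add(onset)
    counts = Counter()
    for onsets in groups.values():
        for first, second in combinations(sorted(onsets), 2):
            counts[(first, second)] += 1
    return counts
```

For example, the words 사다, 자다, and 가다 share everything except the first onset, so they contribute one pair each to ㄱ-ㅅ, ㄱ-ㅈ, and ㅅ-ㅈ.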