• Title/Summary/Keyword: Text data


Developing and Evaluating Damage Information Classifier of High Impact Weather by Using News Big Data (재해기상 언론기사 빅데이터를 활용한 피해정보 자동 분류기 개발)

  • Su-Ji, Cho;Ki-Kwang Lee
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.46 no.3
    • /
    • pp.7-14
    • /
    • 2023
  • Recently, the importance of impact-based forecasting has increased along with the growing socio-economic impact of severe weather. Because news articles contain unstructured information closely related to people's lives, this study developed and evaluated a binary classification algorithm for snowfall damage information by applying text mining to media articles. We collected news articles published from 2009 to 2021 that contain 'heavy snow' in the body text and labelled whether each article corresponds to a specific damage field, such as car accidents. As the classifier, we proposed a probability-based model built on the ratio of two conditional probabilities, defined in this study as the I/O Ratio. During construction, we also adopted an n-gram approach to capture the contextual meaning of each keyword. The accuracy of the classifier was 75%, supporting the applicability of news big data to impact-based forecasting. We expect the performance of the classifier to improve in further research as more training data are accumulated. The results of this study can be readily extended by applying the same methodology to other disasters in the future. Furthermore, they can help reduce the social and economic damage of high-impact weather by supporting the establishment of an integrated meteorological decision support system.
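The I/O Ratio idea, a per-keyword ratio of two conditional probabilities, can be sketched as a naive probabilistic classifier. The estimator, smoothing, and decision threshold below are assumptions for illustration, not the paper's exact formulation:

```python
import math
from collections import Counter

def train_io_ratio(damage_docs, other_docs):
    # Estimate per-token I/O ratios: P(token | damage) / P(token | other),
    # with Laplace smoothing to avoid division by zero.
    d = Counter(t for doc in damage_docs for t in doc)
    o = Counter(t for doc in other_docs for t in doc)
    vocab = set(d) | set(o)
    nd, no = sum(d.values()), sum(o.values())
    return {t: ((d[t] + 1) / (nd + len(vocab))) /
               ((o[t] + 1) / (no + len(vocab))) for t in vocab}

def classify(doc, ratios):
    # Sum log-ratios over tokens; a positive total => damage-related.
    score = sum(math.log(ratios.get(t, 1.0)) for t in doc)
    return score > 0
```

In the study, the tokens would be n-grams rather than single words, so that contextual meaning is preserved.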

Robust Sentiment Classification of Metaverse Services Using a Pre-trained Language Model with Soft Voting

  • Haein Lee;Hae Sun Jung;Seon Hong Lee;Jang Hyun Kim
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.9
    • /
    • pp.2334-2347
    • /
    • 2023
  • Metaverse services generate text data, a form of ubiquitous-computing data, in real time, and analyzing user emotions from this data is an important task for such services. This study classifies user sentiment using deep learning and pre-trained language models based on the transformer architecture. Whereas previous studies collected data from a single platform, the current study gathered review data containing the keyword "Metaverse" from both the YouTube and Google Play Store platforms for greater generality. As a result, the Bidirectional Encoder Representations from Transformers (BERT) and Robustly optimized BERT approach (RoBERTa) models combined through a soft voting mechanism achieved the highest accuracy of 88.57%. In addition, the area under the curve (AUC) score of the ensemble comprising RoBERTa, BERT, and A Lite BERT (ALBERT) was 0.9458. The results demonstrate that the ensemble combined with the RoBERTa model performs well; therefore, the RoBERTa model can be applied on platforms that provide metaverse services. The findings contribute to the advancement of natural language processing techniques in metaverse services, which are increasingly important in digital platforms and virtual environments. Overall, this study provides empirical evidence that sentiment analysis using deep learning and pre-trained language models is a promising approach to improving user experiences in metaverse services.
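Soft voting simply averages each model's class-probability outputs before taking the argmax. A minimal sketch, assuming each fine-tuned model already returns per-class probabilities (the model names and probability arrays below are illustrative, not the paper's actual outputs):

```python
import numpy as np

def soft_vote(prob_arrays):
    # Average the class-probability outputs of several classifiers,
    # then take the argmax of the averaged distribution per sample.
    avg = np.mean(np.stack(prob_arrays), axis=0)
    return avg.argmax(axis=1)

# Illustrative outputs of two fine-tuned models on two samples
# (columns: [negative, positive]); real models would supply these.
bert_probs = np.array([[0.7, 0.3], [0.2, 0.8]])
roberta_probs = np.array([[0.4, 0.6], [0.3, 0.7]])
labels = soft_vote([bert_probs, roberta_probs])
```

Averaging probabilities (soft voting) rather than majority labels (hard voting) lets a confident model outvote an uncertain one, which often improves ensemble robustness.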

Automatic Poster Generation System Using Protagonist Face Analysis

  • Yeonhwi You;Sungjung Yong;Hyogyeong Park;Seoyoung Lee;Il-Young Moon
    • Journal of information and communication convergence engineering
    • /
    • v.21 no.4
    • /
    • pp.287-293
    • /
    • 2023
  • With the rapid development of domestic and international over-the-top markets, a large amount of video content is being created. As the volume of video content increases, consumers increasingly check information about a video before watching it. To address this demand, video summaries are provided to consumers in the form of plot descriptions, thumbnails, posters, and other formats. This study proposes an approach that automatically generates posters to convey video content effectively while reducing the cost of video summarization. In the automatic generation of posters, face recognition and clustering are used to gather and classify character data, and keyframes are extracted from the video to learn its overall atmosphere. This study used the facial data of the characters and the keyframes as training data and employed technologies such as DreamBooth, a text-to-image generation model, to automatically generate video posters. This process significantly reduces the time and cost of video-poster production.
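The face-clustering step can be illustrated with a simple greedy cosine-similarity grouping of face embeddings, where the largest cluster is taken as the protagonist. This is a stand-in sketch under assumed inputs; the paper's actual recognition model, clustering method, and threshold are not specified here:

```python
import numpy as np

def cluster_faces(embeddings, thresh=0.6):
    # Greedy cosine-similarity grouping: assign each embedding to the
    # most similar existing cluster center, or start a new cluster.
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels, centers = [], []
    for v in norm:
        sims = [c @ v for c in centers]
        if sims and max(sims) >= thresh:
            labels.append(int(np.argmax(sims)))
        else:
            labels.append(len(centers))
            centers.append(v)
    return labels

def protagonist(labels):
    # The most frequent cluster is assumed to be the protagonist.
    return max(set(labels), key=labels.count)
```

The protagonist's face images, together with extracted keyframes, would then serve as training data for a model such as DreamBooth.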

Perception and Trend Differences between Korea, China, and the US on Vegan Fashion -Using Big Data Analytics- (빅데이터를 이용한 비건 패션 쟁점의 분석 -한국, 중국, 미국을 중심으로-)

  • Jiwoon Jeong;Sojung Yun
    • Journal of the Korean Society of Clothing and Textiles
    • /
    • v.47 no.5
    • /
    • pp.804-821
    • /
    • 2023
  • This study examines current trends and perceptions of veganism and vegan fashion in Korea, China, and the United States. Using the big data tools Textom and UCINET, we conducted keyword extraction, frequency analysis, and CONCOR cluster analysis among keywords, obtaining the following results. First, the three nations' perceptions of veganism and vegan fashion differ significantly, although Korea and the United States generally share a similar understanding of vegan fashion. Second, industrial structures, such as products and businesses, shaped how Korea perceives veganism. Third, owing to its ongoing sociopolitical tensions, the United States views veganism as an ethical form of consumption tied to activism. In contrast, China views veganism as a healthy diet rather than a lifestyle and associates it with Buddhist vegetarianism, a perception rooted in its religious history and culinary culture. Fundamentally, this study is meaningful for using big data to extract keywords related to vegan fashion in Korea, China, and the United States, and it deepens our understanding of vegan fashion by comparing perceptions across nations.

Sentiment Analysis of Korean Reviews Using CNN: Focusing on Morpheme Embedding (CNN을 적용한 한국어 상품평 감성분석: 형태소 임베딩을 중심으로)

  • Park, Hyun-jung;Song, Min-chae;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.2
    • /
    • pp.59-83
    • /
    • 2018
  • With the increasing importance of sentiment analysis for grasping the needs of customers and the public, various types of deep learning models have been actively applied to English texts. In deep-learning-based sentiment analysis of English texts, the natural language sentences in the training and test datasets are usually converted into sequences of word vectors before being fed into the models. In this case, word vectors generally refer to vector representations of the words obtained by splitting a sentence on space characters. There are several ways to derive word vectors, one of which is Word2Vec, used to produce the 300-dimensional Google word vectors from about 100 billion words of Google News data; these have been widely used in studies on sentiment analysis of reviews from various fields such as restaurants, movies, laptops, and cameras. Unlike in English, morphemes play an essential role in sentiment analysis and sentence structure analysis in Korean, a typical agglutinative language with developed postpositions and endings. A morpheme can be defined as the smallest meaningful unit of a language, and a word consists of one or more morphemes. For example, the word '예쁘고' consists of the morphemes '예쁘' (adjective stem) and '고' (connective ending). Reflecting the significance of Korean morphemes, it seems reasonable to adopt the morpheme as the basic unit in Korean sentiment analysis. Therefore, in this study, we use 'morpheme vectors' as input to a deep learning model rather than the 'word vectors' mainly used for English text. A morpheme vector is a vector representation of a morpheme and can be derived by applying an existing word vector derivation mechanism to sentences divided into their constituent morphemes. Several questions arise at this point. What is the desirable range of POS (part-of-speech) tags when deriving morpheme vectors for improving the classification accuracy of a deep learning model?
Is it appropriate to apply a typical word vector model, which relies primarily on the surface form of words, to Korean, which has a high ratio of homonyms? Will text preprocessing such as correcting spelling or spacing errors affect the classification accuracy, especially when deriving morpheme vectors from Korean product reviews containing many grammatical mistakes and variations? We seek empirical answers to these fundamental issues, which are likely to be encountered first when applying deep learning models to Korean texts, and summarize them as three central research questions. First, which is more effective as the initial input of a deep learning model: morpheme vectors derived from grammatically correct texts of a domain other than the analysis target, or morpheme vectors derived from considerably ungrammatical texts of the same domain? Second, what is an appropriate morpheme vector derivation method for Korean with regard to the range of POS tags, homonyms, text preprocessing, and minimum frequency? Third, can we reach a satisfactory level of classification accuracy when applying deep learning to Korean sentiment analysis? To address these questions, we generate various types of morpheme vectors reflecting them and then compare classification accuracy using a non-static CNN (convolutional neural network) model that takes the morpheme vectors as input. As training and test datasets, 17,260 cosmetics product reviews from Naver Shopping are used. To derive morpheme vectors, we use data from the same domain as the target and data from another domain: about 2 million cosmetics product reviews from Naver Shopping and 520,000 Naver News articles, roughly corresponding to Google's news data. The six primary sets of morpheme vectors constructed in this study differ in terms of three criteria.
First, they come from two data sources: Naver News, with high grammatical correctness, and Naver Shopping's cosmetics product reviews, with low grammatical correctness. Second, they differ in the degree of data preprocessing, namely, only splitting sentences, or additionally correcting spelling and spacing after sentence separation. Third, they differ in the form of input fed into the word vector model: whether the morphemes are entered by themselves or with their POS tags attached. The morpheme vectors further vary in the range of POS tags considered, the minimum frequency for a morpheme to be included, and the random initialization range. All morpheme vectors are derived through the CBOW (continuous bag-of-words) model with a context window of 5 and a vector dimension of 300. The results suggest that using same-domain text even with lower grammatical correctness, performing spelling and spacing corrections in addition to sentence splitting, and incorporating morphemes of all POS tags, including the 'incomprehensible' category, lead to better classification accuracy. The POS tag attachment, devised for the high proportion of homonyms in Korean, and the minimum frequency threshold for a morpheme to be included appear to have no definite influence on the classification accuracy.
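The morpheme-vector input described above, morphemes optionally carrying their POS tags, can be sketched as a formatting step ahead of a CBOW word-vector model. The helper below is a hypothetical illustration; a real pipeline would obtain the (morpheme, tag) pairs from a Korean morphological analyzer such as those bundled with KoNLPy:

```python
def to_morpheme_tokens(analyzed_sentences, attach_pos=True):
    # Format morpheme-analyzed sentences as input for a word-vector model.
    # Each sentence is a list of (morpheme, POS-tag) pairs, e.g. as
    # produced by a Korean morphological analyzer. Attaching the POS tag
    # keeps homonyms with different parts of speech distinct.
    return [[f"{m}/{p}" if attach_pos else m for m, p in sent]
            for sent in analyzed_sentences]
```

The resulting token lists would then be fed to a CBOW model (context window 5, vector dimension 300 in the study) to produce the morpheme vectors.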

Predicting the Direction of the Stock Index by Using a Domain-Specific Sentiment Dictionary (주가지수 방향성 예측을 위한 주제지향 감성사전 구축 방안)

  • Yu, Eunji;Kim, Yoosin;Kim, Namgyu;Jeong, Seung Ryul
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.1
    • /
    • pp.95-110
    • /
    • 2013
  • Recently, the amount of unstructured data being generated through a variety of social media has been increasing rapidly, resulting in an increasing need to collect, store, search, analyze, and visualize this data. Such data cannot be handled appropriately by the traditional methodologies used for structured data because of its vast volume and unstructured nature. In this situation, many attempts are being made to analyze unstructured data such as text files and log files through various commercial or noncommercial analytical tools. Among the contemporary issues dealt with in the literature on unstructured text data analysis, the concepts and techniques of opinion mining have been attracting much attention from pioneering researchers and business practitioners. Opinion mining, or sentiment analysis, refers to a series of processes that analyze participants' opinions, sentiments, evaluations, attitudes, and emotions about selected products, services, organizations, social issues, and so on. In other words, many attempts based on various opinion mining techniques are being made to resolve complicated issues that could not otherwise be solved by existing traditional approaches. One of the most representative attempts using opinion mining may be recent research that proposed an intelligent model for predicting the direction of the stock index. This model works mainly on the basis of opinions extracted from an overwhelming number of economic news reports. News content published in various media is a traditional example of unstructured text data: every day, a large volume of new content is created, digitized, and distributed to us via online or offline channels. Many studies have revealed that we make better decisions on political, economic, and social issues by analyzing news and other related information.
In this sense, we expect to predict the fluctuation of stock markets partly by analyzing the relationship between economic news reports and the pattern of stock prices. So far, in the literature on opinion mining, most studies, including ours, have utilized a sentiment dictionary to elicit sentiment polarity or sentiment values from a large number of documents. A sentiment dictionary consists of pairs of selected words and their sentiment values, and sentiment classifiers refer to the dictionary to determine the sentiment polarity of words, sentences, and whole documents. However, most traditional approaches share a common limitation: they do not consider the flexibility of sentiment polarity; that is, the sentiment polarity or sentiment value of a word is fixed and cannot be changed in a traditional sentiment dictionary. In the real world, however, the sentiment polarity of a word can vary depending on the time, situation, and purpose of the analysis, and can even be contradictory in nature. This flexibility of sentiment polarity motivated our study. In this paper, we argue that sentiment polarity should be assigned not merely on the basis of the inherent meaning of a word but on the basis of its ad hoc meaning within a particular context. To implement this idea, we present an intelligent investment decision-support model based on opinion mining that scrapes and parses massive volumes of economic news on the web, tags sentiment words, classifies the sentiment polarity of the news, and finally predicts the direction of the next day's stock index. In addition, we apply a domain-specific sentiment dictionary instead of a general-purpose one to classify each piece of news as either positive or negative. For performance evaluation, we conducted intensive experiments and investigated the prediction accuracy of our model.
For the experiments to predict the direction of the stock index, we gathered and analyzed 1,072 articles about stock markets published by "M" and "E" media between July 2011 and September 2011.
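The dictionary-lookup step described above can be sketched minimally: an article's tokens are scored against a domain-specific sentiment dictionary, and the sign of the total decides the polarity. The dictionary entries and token lists below are illustrative assumptions, not the paper's actual lexicon:

```python
def classify_news(tokens, sentiment_dict):
    # Sum the polarity values of tokens found in the domain-specific
    # dictionary; the sign of the total labels the article.
    score = sum(sentiment_dict.get(t, 0.0) for t in tokens)
    return "positive" if score > 0 else "negative"

# Illustrative domain-specific entries: in stock-market news, words
# like "surge" carry positive polarity even if neutral elsewhere.
stock_dict = {"surge": 1.0, "rally": 1.0, "plunge": -1.0, "slump": -1.0}
```

In the paper's pipeline, the polarity labels aggregated over a day's news would then feed the prediction of the next day's index direction.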

A study on the improving and constructing the content for the Sijo database in the Period of Modern Enlightenment (계몽기·근대시조 DB의 개선 및 콘텐츠화 방안 연구)

  • Chang, Chung-Soo
    • Sijohaknonchong
    • /
    • v.44
    • /
    • pp.105-138
    • /
    • 2016
  • Recently, the "XML Digital Collection of Sijo Texts in the Period of Modern Enlightenment" database, equipped with a search function, has been made available through the Korean Research Memory (http://www.krm.or.kr), laying the foundation for building content from Sijo texts of the Period of Modern Enlightenment. In this paper, by reviewing the characteristics and problems of this digital collection and searching for improvements, I tried to find a way to turn it into content. The database's primary significance lies in integrating and providing an overview of the vast body of Sijo from the period, reaching 12,500 pieces. In addition, it is the first Sijo database to provide a variety of search features, by literature, poet name, work title, original text, period, and so on. However, the database has limits in conveying the overall aspects of Sijo in the period. Titles and original texts written in archaic words or Chinese characters cannot be searched, because no standardized modern-language version of the text has been prepared, and works and individual Sijo released after 1945 are missing from the database. It is also inconvenient to extract data by poet, because poets are recorded in various ways, such as by real name or pen name. To solve these problems and improve the utilization of the database, I proposed providing standardized modern-language texts, assigning index terms for content, providing information on work format, and so on. Furthermore, if a Sijo database of the period with the character of a Sijo culture information system could be built, it could be connected with academic and educational content.
As specific plans, I suggested the following: learning support materials for modern history and recognition of the national territory in the modern age; source materials for studying indigenous animal and plant characters and creating commercial characters; and applicability as a Sijo learning tool, such as a Sijo game.


Trend Analysis of Barrier-free Academic Research using Text Mining and CONCOR (텍스트 마이닝과 CONCOR을 활용한 배리어 프리 학술연구 동향 분석)

  • Jeong-Ki Lee;Ki-Hyok Youn
    • Journal of Internet of Things and Convergence
    • /
    • v.9 no.2
    • /
    • pp.19-31
    • /
    • 2023
  • The importance of barrier-free environments is being highlighted worldwide. This study attempted to identify barrier-free research trends using text mining, in order to support research and policies for creating a barrier-free environment. The analysis data comprise 227 papers published in domestic academic journals from 1996, when barrier-free research began, to 2022. The researchers converted the titles, keywords, and abstracts of the papers into text and then analyzed the patterns in the papers and the meaning of the data. The results are summarized as follows. First, barrier-free research began to increase after 2009, with an annual average of 17.1 papers published; this is related to the implementation guidelines for the barrier-free certification system that took effect on July 15, 2008. Second, the text-mining results are as follows: i) word frequency analysis of top keywords identified important keywords such as barrier free, disabled, design, universal design, access, elderly, certification, improvement, evaluation, space, facility, and environment; ii) TF-IDF analysis identified universal design, design, certification, house, access, elderly, installation, disabled, park, evaluation, architecture, and space as main keywords; iii) N-gram analysis yielded related term pairs such as barrier free + certification, barrier free + design, barrier free + barrier free, elderly + disabled, disabled + elderly, disabled + convenience facilities, the disabled + the elderly, society + the elderly, convenience facilities + installation, certification + evaluation index, physical + environment, and life + quality. Third, the CONCOR analysis produced four clusters: cluster 1, barrier-free issues and challenges; cluster 2, universal design and space utilization; cluster 3, improving accessibility for the disabled; and cluster 4, barrier-free certification and evaluation.
Based on these results, this study presents policy implications for vitalizing barrier-free research and establishing a desirable barrier-free environment.
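The TF-IDF weighting used to rank keywords in studies like this can be sketched as follows. This is a minimal textbook implementation (raw term frequency times log inverse document frequency), not the exact tool the study used:

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. Returns one {token: weight} dict per
    # document; a token appearing in every document gets weight zero,
    # while tokens distinctive to few documents are weighted up.
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

Ranking each document's tokens by these weights yields the "main keywords" reported in the abstract; library implementations (e.g., with extra normalization) differ in detail but follow the same idea.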

Development of Integrated Biomedical Signal Management System Based on XML Web Technology

  • Lee Joo-sung;Yoon Young-ro
    • Journal of Biomedical Engineering Research
    • /
    • v.26 no.6
    • /
    • pp.399-406
    • /
    • 2005
  • Nowadays, hospital information systems (HIS) raise the quality of medical services through the effective management of medical records, and as computing environments have developed, it has become possible to search information quickly. However, standardized medical data exchange between clinics and other organizations has not yet been achieved; when a patient is transferred, past medical records are not transmitted efficiently, which can cause treatment delays or medical accidents. It is also troublesome that medical records must be transferred by a person and communicated manually. Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML. Originally designed to meet the challenges of large-scale electronic publishing, XML also plays an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. Because the systems and products of the organizations that handle bio-signal data differ from one another, this research uses XML to overcome this bottleneck in integrating and transmitting such data. This study discusses sharing medical data using XML Web technology to standardize medical records between hospitals and related organizations. A data structure model was designed to manage bio-signal data and patient records. We experimented with data transmission and integration between different systems (one using an MS-SQL database system, the other managing existing bio-signal data in its own file format). A web-based system was implemented to search and reference medical records, and the system for sharing medical data was tested to assess the merits of XML. The implemented XML schema confirms data transmission between different data systems and the result of their integration.
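The XML-based exchange idea can be illustrated with a small serialize/parse round trip. The element names below are assumptions for illustration, not the paper's actual schema:

```python
import xml.etree.ElementTree as ET

def record_to_xml(patient_id, signal_name, samples):
    # Serialize a bio-signal record as an XML document for exchange
    # between otherwise incompatible hospital systems.
    root = ET.Element("BioSignalRecord")
    ET.SubElement(root, "PatientID").text = patient_id
    sig = ET.SubElement(root, "Signal", name=signal_name)
    sig.text = " ".join(str(s) for s in samples)
    return ET.tostring(root, encoding="unicode")

def xml_to_record(xml_text):
    # Parse the record back on the receiving system.
    root = ET.fromstring(xml_text)
    sig = root.find("Signal")
    return (root.findtext("PatientID"), sig.get("name"),
            [float(v) for v in sig.text.split()])
```

Because the wire format is plain text validated against a shared schema, each side can keep its own internal storage (an MS-SQL database on one, proprietary files on the other) while still exchanging records losslessly.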

A Study for Electronic Trading Business System Using Big Data (빅데이터를 활용한 전자무역시스템에 대한 연구)

  • Lee, Cheol-Woong;Cho, Sung-Woo;Cho, Sae-Hong;Hwang, Dae-Hoon
    • Journal of Digital Contents Society
    • /
    • v.14 no.4
    • /
    • pp.573-580
    • /
    • 2013
  • With the growth of smart devices and information and communication technology, the information society has developed, and information can be produced, spread, and consumed easily and at a much faster pace. Individuals can use wireless communication and smart devices to create, share, and consume information anytime and anywhere. The growth of technology has enabled the large-scale transfer and sharing of image, sound, and video data, changing users' data consumption patterns, which had mainly consisted of text, and significantly increasing the amount of data an individual consumes. The importance of finding and analyzing practical and necessary data within this huge volume has therefore arisen. In this study, the current status of Big Data is researched and analyzed, and a method to utilize Big Data in the electronic trading field is suggested.