• Title/Summary/Keyword: Text Collection

Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation (XGBoost와 교차검증을 이용한 품사부착말뭉치에서의 오류 탐지)

  • Choi, Min-Seok;Kim, Chang-Hyun;Park, Ho-Min;Cheon, Min-Ah;Yoon, Ho;Namgoong, Young;Kim, Jae-Kyun;Kim, Jae-Hoon
    • KIPS Transactions on Software and Data Engineering / v.9 no.7 / pp.221-228 / 2020
  • A part-of-speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its POS tag, and it is widely used as training data for natural language processing. Training data is generally assumed to be free of errors, but in reality it contains various types of errors, which degrade the performance of systems trained on it. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier, with cross-validation as the evaluation technique. We first train a POS-tagging classifier on the POS-tagged corpus, which contains some errors, and then apply it to the same corpus using cross-validation. The classifier cannot detect errors directly, however, because there is no training data for detecting POS-tagging errors. We therefore detect errors by comparing the outputs of the classifier (the POS probabilities) while adjusting hyperparameters. The hyperparameters are estimated from a small error-tagged corpus, whose text is sampled from the POS-tagged corpus and whose POS errors are marked up by experts. We use recall and precision, which are widely used in information retrieval, as evaluation metrics. Because not all detected errors can be checked, we show that the proposed method is valid by comparing the distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In the near future, we will apply the proposed method to a dependency tree-tagged corpus and a semantic role-tagged corpus.
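
The error-detection idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes out-of-fold POS probabilities have already been produced by some classifier (the abstract uses XGBoost with cross-validation), and the function names `flag_suspect_tags` and `precision_recall`, the `min_margin` hyperparameter, and the toy data are all hypothetical.

```python
def flag_suspect_tags(probs, annotated, min_margin=0.5):
    """Return indices of tokens whose annotated tag looks suspicious.

    probs: list of dicts mapping tag -> out-of-fold probability.
    annotated: list of tags as they appear in the corpus.
    min_margin: hypothetical hyperparameter; a token is flagged when the
        classifier's best tag differs from the annotated tag and beats it
        by at least this probability margin.
    """
    suspects = []
    for i, (dist, tag) in enumerate(zip(probs, annotated)):
        best_tag = max(dist, key=dist.get)
        margin = dist[best_tag] - dist.get(tag, 0.0)
        if best_tag != tag and margin >= min_margin:
            suspects.append(i)
    return suspects


def precision_recall(flagged, true_errors):
    """Precision and recall of flagged indices against expert-marked errors."""
    flagged, true_errors = set(flagged), set(true_errors)
    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_errors) if true_errors else 0.0
    return precision, recall


# Toy data (illustrative only, not from the paper).
probs = [
    {"NOUN": 0.9, "VERB": 0.1},
    {"NOUN": 0.2, "VERB": 0.8},
    {"NOUN": 0.55, "VERB": 0.45},
    {"NOUN": 0.05, "VERB": 0.95},
]
annotated = ["NOUN", "NOUN", "VERB", "NOUN"]
expert_errors = [1, 2]  # token indices marked as errors by experts

flagged = flag_suspect_tags(probs, annotated, min_margin=0.5)
precision, recall = precision_recall(flagged, expert_errors)
print(flagged, precision, recall)
```

In a setup like this, `min_margin` would be tuned on the small expert-annotated sample by trading precision against recall, which roughly mirrors the role the abstract gives to the error-tagged corpus.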

Key Determinants of Dissatisfaction on COVID-19 Contact Tracing and Exposure Notification Apps (COVID-19 접촉추적과 노출알림 앱사용자의 항의 및 불만요인 탐색)

  • Leem, Byung-hak;Hong, Han-Kook
    • The Journal of the Korea Contents Association / v.21 no.9 / pp.176-183 / 2021
  • Digital medical technology is highly effective, but it also faces the challenge of protecting privacy. For contact-tracing and exposure notification apps in the COVID-19 environment, there is always a trade-off between privacy measures and the effectiveness of the app. Many countries have developed and used contact-tracing and exposure notification apps in various forms to prevent the spread of COVID-19, but suspicion of digital surveillance (a digital panopticon) is unavoidable. Therefore, this study aims to identify the factors behind personal information infringement and dissatisfaction by applying text mining analysis to user reviews of the "Self-Quarantine Safety Protection" app in Korea. The text mining analysis yielded four groups: 'Address recognition error', 'Exit warning error', 'Access error', and 'App program error'. Since 'Address recognition error' and 'Exit warning error' can give app users a strong perception that they are being kept under surveillance by the app, transparent management of personal information protection and of consent procedures for personal information collection is required. In addition, if the other two groups, which stem from errors in app functions or program bugs, are not corrected immediately, user complaints can escalate and protests against the monitoring can arise.

Vietnamese Syncretism and the Characteristics of Caodaism's Chief Deity: Problematising Đức Cao Đài as a 'Monotheistic' God Within an East Asian Heavenly Milieu

  • HARTNEY, Christopher
    • Journal of Daesoon Thought and the Religions of East Asia / v.1 no.2 / pp.41-59 / 2022
  • Caodaism is a new religion from Vietnam which began in late 1925 and spread rapidly across the French colony of Indochina. With a broad syncretic aim, the new faith sought to revivify Vietnamese religious traditions whilst also incorporating religious, literary, and spiritist influences from France. Like Catholicism, Caodaism kept a strong focus on its monotheistic nature and today Caodaists are eager to label their religion a monotheism. It will be argued here, however, that the syncretic nature of this new faith complicates this claim to a significant degree. To make this argument, we will consider here the nature of God in Caodaism through two central texts from two important stages in the life of the religion. The first is the canonized Compilation of Divine Messages which collects a range of spirit messages from God and some other divine voices. These were received in the early years of the faith. The second is a collection of sermons from 1948/9 that takes Caodaist believers on a tour of heaven, and which is entitled The Divine Path to Eternal Life. It will be shown that in the first text, God speaks in the mode of a fully omnipotent and omniscient supreme being. In the second text, however, we are given a view of paradise that is much more akin to the court of a Jade Emperor within an East Asian milieu. In these realms, the personalities of other beings and redemptive mechanisms claim much of our attention, and seem to be a competing center of power to that of God. Furthermore, God's consort, the Divine Mother, takes on a range of sacred creative prerogatives that do something similar. Additionally, cadres of celestial administrators (buddhas, immortals, and saints) help with the operation of a cosmos which spins on with guidance from its own laws. These laws form sacred mechanisms, such as cycles of reincarnation and judgement. These operate not in the purview of God, but as part of the very nature of the cosmos itself. In this context, the dualistic, polytheistic, and even automatic nature of Caodaism's cosmos will be considered in terms of the way in which they complicate this religion's monotheistic claims. To conclude, this article seeks to demonstrate the precise relevance of the term 'monotheism' for this religion.

Quantification of Schedule Delay Risk of Rain via Text Mining of a Construction Log (공사일지의 텍스트 마이닝을 통한 우천 공기지연 리스크 정량화)

  • Park, Jongho;Cho, Mingeon;Eom, Sae Ho;Park, Sun-Kyu
    • KSCE Journal of Civil and Environmental Engineering Research / v.43 no.1 / pp.109-117 / 2023
  • Schedule delays are a major risk factor because they can adversely affect construction projects through increased construction costs, claims from the client, and/or reduced construction quality when stages are cut short to make up for lost time. Risk management has been conducted according to the importance and priority of schedule delay risk, but quantification of the depth of schedule delay tends to be inadequate because of limitations in data collection. Therefore, this research used the BERT (Bidirectional Encoder Representations from Transformers) language model to convert the contents of a construction log, which consists of unstructured data, into WBS (Work Breakdown Structure)-based structured data, and to build a model for classifying and quantifying risk. The process was applied to eight highway construction sites, and 75 cases of rain-related schedule delay risk were obtained for 8 of 39 detailed work types. Through a K-S test, a significant probability distribution was derived for four work types, and their risk impacts were compared. The process presented in this study can be used to derive various schedule delay risks in construction projects and to quantify their depth.
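
The K-S test step mentioned in the abstract above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the abstract does not say which candidate distributions were fitted, so the exponential distribution, the helper names `ks_statistic` and `exponential_cdf`, and the toy delay values are all hypothetical.

```python
import math


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D for `sample` against `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the candidate CDF with the empirical CDF just after
        # and just before the observation x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def exponential_cdf(rate):
    """Return the CDF of an exponential distribution with the given rate."""
    def cdf(x):
        return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0
    return cdf


# Toy rain-delay durations in days (illustrative only, not data from the study).
delay_days = [1, 2, 2, 3, 5]
rate = len(delay_days) / sum(delay_days)  # maximum-likelihood rate estimate
d_stat = ks_statistic(delay_days, exponential_cdf(rate))
print(f"K-S statistic D = {d_stat:.3f}")
```

A smaller D indicates a closer fit. Note that when the distribution's parameters are estimated from the same sample, the standard K-S critical values no longer apply exactly, so a significance decision would need adjusted critical values.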

A Model for Evaluating Technology Importance of Patents under Incomplete Citation (불완전 인용정보 하에서의 특허의 기술적 중요도 평가 모형)

  • Kim, Heon;Baek, Dong-Hyun;Shin, Min-Ju;Han, Dong-Seok
    • Journal of Intelligence and Information Systems / v.14 no.2 / pp.121-136 / 2008
  • Although domestic research funding organizations require patented technologies as an outcome of financial aid, they have great difficulty evaluating the qualitative value of patented technology because systematic methods are lacking. In particular, because citation data is not required for patent applications in Korea, it is very difficult to evaluate a patent using incomplete citation data. This study proposes a method for evaluating the technology importance of a patent when its citation data is missing or insufficient. The proposed method evaluates technology importance objectively and quantitatively in five steps: selection of a target patent, collection of related patents, preparation of keyword vectors, clustering of patents, and assessment of technological importance. The method was applied to a patent on 'user identification method for payment using mobile terminal' to evaluate its technology importance and demonstrate how the method works.

Germination Characteristics by Temperature and Production Time to Poaceae Plant Seed (녹화용 벼과식물 종자의 채종시기 및 온도별 발아특성)

  • Kang, Hee-Kyoung;Yi, Ja-Yeon;Cho, Yong-Hyeon;Song, Hong-Seon
    • Journal of the Korean Society of Environmental Restoration Technology / v.19 no.2 / pp.71-81 / 2016
  • This study investigated the optimum seed production time and germination characteristics of seeds collected in Chungnam, Korea, to provide basic information for slope restoration and revegetation using Poaceae plants. The optimum seed production time was mid to late October for Miscanthus sinensis and Pennisetum alopecuroides, late October to early November for Themeda triandra var. japonica and M. sacchariflorus, and mid-November for Phragmites communis. The amount of fully ripe seed set was greatest in mid-October for M. sinensis, P. alopecuroides, and M. sacchariflorus, in early November for T. triandra var. japonica, and in mid-November for P. communis. The seed maturity rate was highest in early November for M. sinensis, P. alopecuroides, and M. sacchariflorus, and in mid-November for T. triandra var. japonica and P. communis. By seed collection time, the germination rate was highest in early October for M. sacchariflorus, mid-October for T. triandra var. japonica, early November for M. sinensis, and mid-November for P. communis. The number of days to germination was shortest for seed produced in early November for M. sinensis and M. sacchariflorus, and in mid-November for T. triandra var. japonica and P. communis. The optimum germination temperature was 20-25°C for M. sinensis and P. alopecuroides, 25°C for T. triandra var. japonica, and 25-30°C for P. communis.

Informatics analysis of consumer reviews for 「Frozen 2」 fashion collaboration products - Semantic networks and sentiment analysis - (「겨울왕국2」의 콜라보레이션 패션제품에 대한 소비자 리뷰 - 의미 네트워크와 감성분석 -)

  • Choi, Yeong-Hyeon;Lee, Kyu-Hye
    • The Research Journal of the Costume Culture / v.28 no.2 / pp.265-284 / 2020
  • This study aimed to analyze the performance of Disney-collaborated fashion lines based on online consumer reviews. To do so, the researchers employed text mining and network analysis to identify keywords in the reviews of these products. Blogs, internet cafes, and web documents provided by Naver, Daum, and YouTube were selected as subjects for the analysis. The analysis period was limited to one year from 2019. Data collection and analysis were conducted using Python 3.7, Textom, and NodeXL. The search terms were 'Disney fashion collaboration' and 'Frozen fashion collaboration'. Preliminary survey results indicated that 'Elsa's dress' was the most frequently mentioned term and that the domestic fashion brand Eland Retail was the most active in selling Disney-branded clothing through its own brand. The writers of reviews for Disney-collaborated fashion products were primarily mothers with daughters. Their decision to purchase these products was based on the following factors: price, size, stability of decoration, shipping, laundry, and retailer. The motives for purchasing the product were the positive response of the consumer's child and the parents' satisfaction with that response. The problems to be solved included insufficient supply, delivery delays, a high price considering how few times children's clothes are worn, poor glitter decoration, faded color, contamination from laundry, and undesirable smells immediately after purchase.

Sentimental Analysis of SW Education News Data (SW 교육 뉴스데이터의 감성분석)

  • Park, SunJu
    • Journal of The Korean Association of Information Education / v.21 no.1 / pp.89-96 / 2017
  • Recently, as smartphones and SNS have gained popularity, a number of studies have focused on the content and sentiment of information distributed through SNS. In this paper, we collected online news data about SW education, extracted words after morphological analysis, and analyzed the emotions of the collected news data by calculating a sentiment score for each news item. The accuracy of the calculated sentiment scores was also examined. As a result, the number of news items related to 'SW education' in the collection period was about 189 per month, and the average sentiment score was 0.7, which indicates that news related to 'SW education' was emotionally positive. The news was positive about the importance of SW education and the implementation of the policy, but there were negative views on the specific methods for realizing it. Specifically, the lack of an SW education environment and education methods, problems related to the improvement of SW developers and of their labor conditions, and the increase in private coding education were the factors behind the negative views.

Provenance and Concept of 10 Symptoms of Para-cold Damage(類傷寒) from Yixuerumen(醫學入門) (『의학입문(醫學入門)·상한편(傷寒篇)』 유상한(類傷寒) 병증(病症) 10종의 유래와 개념에 대한 연구)

  • Jo, Hak-jun
    • Journal of Korean Medical classics / v.29 no.4 / pp.17-36 / 2016
  • Objectives: This paper studies the provenance of the 10 symptoms of para-cold damage in Yixuerumen and examines the rationality and originality of para-cold damage as recognized by Li Chan. Methods: The books referenced by Yixuerumen were examined to identify the source texts for its Para-cold Damage chapter, and their contents were compared with those of Yixuerumen. Results: The sources of the 10 symptoms of para-cold damage in the Cold Damage chapter of Yixuerumen can be traced to Leizhenghuorenshu (expectoration, indigestion, rising temperature due to lack of energy, and beriberi), Wang Zhen's Shanghanzhengzhimingtiao Xinzengxuleishanghansizheng (infection in a boil, blood stasis, overexertion, and pox), and the chapter on Common Cold in Liu Chun's Yujiweiyi. Here, Li Chan made his own addition of internal damage and damage from overworking. Li Chan seems to have considered para-cold damage not as a type of cold damage but as a "concept relative to the broader range of cold damage." Reflecting this understanding, Li Chan limited the number of categories to ten, following the line of development from Leizhenghuorenshu to Shanghanzhengzhimingtiao. When para-cold damage is understood as a "concept relative to the narrower range of cold damage," it could be used as a means for "Wenbin diagnosis," but Yixuerumen's para-cold damage is used only for the "diagnosis of a broader concept of cold damage." Conclusions: Li Chan deserves credit for the academic rationality and originality shown in the way he formed para-cold damage: he used real cold damage and para-cold damage as principles, divided para-cold damage symptoms into ten provenances, created the concept of para-cold damage, and broadly collected prescriptions and treatments corresponding to each type.

A Study of Perception of Golfwear Using Big Data Analysis (빅데이터를 활용한 골프웨어에 관한 인식 연구)

  • Lee, Areum;Lee, Jin Hwa
    • Fashion & Textile Research Journal / v.20 no.5 / pp.533-547 / 2018
  • The objective of this study is to examine the perception of golfwear and related trends based on major keywords and associated words related to golfwear, using big data. Data were collected from blogs, Jisikin and Tips, news articles, and web cafés on the two most commonly used search engines (Naver and Daum) using the keywords 'Golfwear' and 'Golf clothes'. Frequency and matrix data were extracted through Textom for the period from January 1, 2016 to December 31, 2017. From the matrix created by Textom, degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality were calculated and analyzed using Netminer 4.0. The analysis found that the keyword 'brand' ranked highest in web visibility, followed by 'woman', 'size', 'man', 'fashion', 'sports', 'price', 'store', 'discount', and 'equipment' in the top 10 frequency rankings. For the centrality calculations, only the top 30 keywords were included because the network density was extremely high owing to the high frequency of co-occurring keywords. The centrality calculations showed that the top-ranked keywords were similar to those in the raw frequency data. When the frequency was adjusted by subtracting 100 and 500 words, the results differed: keywords that ranked low in the frequency analysis, such as J.Lindeberg, ranked high, and the rankings changed for all centrality calculations. These findings provide a basis for marketing strategies and for ways to increase the awareness and web visibility of golfwear brands.
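
As a rough illustration of the centrality step described in the abstract above, the following Python sketch computes normalized degree centrality from a keyword co-occurrence matrix. It is an assumption-laden toy: the study used Textom and Netminer 4.0, whose exact definitions may differ, and the function name `degree_centrality` and the co-occurrence counts below are hypothetical.

```python
def degree_centrality(cooccurrence):
    """Normalized degree centrality for an undirected keyword network.

    cooccurrence: dict mapping keyword -> dict of neighbor -> co-occurrence
        count. A keyword's degree is the number of distinct other keywords
        it co-occurs with; it is divided by (n - 1) to fall in [0, 1].
    """
    n = len(cooccurrence)
    if n < 2:
        return {keyword: 0.0 for keyword in cooccurrence}
    return {
        keyword: sum(
            1 for other, count in neighbors.items()
            if other != keyword and count > 0
        ) / (n - 1)
        for keyword, neighbors in cooccurrence.items()
    }


# Toy symmetric co-occurrence counts (illustrative only, not study data).
cooccurrence = {
    "brand": {"size": 12, "price": 9, "discount": 4},
    "size": {"brand": 12, "price": 3},
    "price": {"brand": 9, "size": 3},
    "discount": {"brand": 4},
}

centrality = degree_centrality(cooccurrence)
ranking = sorted(centrality, key=centrality.get, reverse=True)
print(ranking[0], centrality)
```

Only degree centrality is shown here; closeness, betweenness, and eigenvector centrality require shortest-path or eigenvector computations and are usually delegated to a network analysis library.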