• Title/Summary/Keyword: Text data

Analysis of Impact Between Data Analysis Performance and Database

  • Kyoungju Min;Jeongyun Cho;Manho Jung;Hyangbae Lee
    • Journal of information and communication convergence engineering / v.21 no.3 / pp.244-251 / 2023
  • Engineering or humanities data are stored in databases and are often used for search services. While the latest deep-learning technologies, such as BART and BERT, are utilized for data analysis, humanities data still rely on traditional databases. Representative analysis methods include n-gram and lexical statistical extraction. However, when a database is used, it often limits the performance of the result calculations. This study presents an experimental process using MariaDB on a PC, which is easily accessible in a laboratory, to analyze the impact of the database on data analysis performance. The findings show that the database becomes a bottleneck when analyzing large-scale text data, particularly beyond hundreds of thousands of records. To address this issue, a method was proposed to provide real-time humanities data analysis web services by leveraging the open source database, with a focus on the Seungjeongwon-Ilgy, one of the largest datasets in the humanities fields.
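
    The n-gram statistics named in this abstract can be illustrated with a minimal in-memory sketch. It is not the authors' MariaDB-based process; the function name `char_ngrams` and the sample string are assumptions made only for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams in a string (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n across the text, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    # Made-up sample text; a real corpus would be read record by record.
    counts = char_ngrams("abcabc", 2)
    for gram, freq in counts.most_common():
        print(gram, freq)
```

    Counting in application memory like this avoids a database round trip per query, which is one plausible way to sidestep the kind of bottleneck the abstract describes, at the cost of holding the counts in RAM.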

Analysis of Rice Blast Outbreaks in Korea through Text Mining (텍스트 마이닝을 통한 우리나라의 벼 도열병 발생 개황 분석)

  • Song, Sungmin;Chung, Hyunjung;Kim, Kwang-Hyung;Kim, Ki-Tae
    • Research in Plant Disease / v.28 no.3 / pp.113-121 / 2022
  • Rice blast is a major plant disease that occurs worldwide and significantly reduces rice yields. Rice blast disease occurs periodically in Korea, causing significant socio-economic damage due to the unique status of rice as a major staple crop. A disease outbreak prediction system is required for preventing rice blast disease. Epidemiological investigations of disease outbreaks can aid in decision-making for plant disease management. Currently, plant disease prediction and epidemiological investigations are mainly based on quantitatively measurable, structured data such as crop growth and damage, weather, and other environmental factors. At the same time, text data related to the occurrence of plant diseases accumulate alongside the structured data. However, epidemiological investigations using these unstructured data have not been conducted. The useful information extracted from unstructured data can be used for more effective plant disease management. This study analyzed news articles related to rice blast disease through text mining to investigate the years and provinces where rice blast disease occurred most in Korea. Moreover, the average temperature, total precipitation, sunshine hours, and supplied rice varieties in the regions were also analyzed. From these data, it was estimated that the primary causes of the nationwide outbreak in 2020 and the major outbreak in the Jeonbuk region in 2021 were meteorological factors. These results obtained through text mining can be combined with deep learning technology and used as a tool to investigate the epidemiology of rice blast disease in the future.

Inferring Undiscovered Public Knowledge by Using Text Mining-driven Graph Model (텍스트 마이닝 기반의 그래프 모델을 이용한 미발견 공공 지식 추론)

  • Heo, Go Eun;Song, Min
    • Journal of the Korean Society for information Management / v.31 no.1 / pp.231-250 / 2014
  • Due to the recent development of Information and Communication Technologies (ICT), the number of research publications has increased exponentially. In response to this rapid growth, the demand for automated text processing methods has risen to deal with massive amounts of text data. Biomedical text mining, which discovers hidden biological meanings and treatments from biomedical literature, has become a pivotal methodology, and it helps medical disciplines reduce time and cost. Many researchers have conducted literature-based discovery studies to generate new hypotheses. However, existing approaches require either an intensive manual process during the procedures or a semi-automatic procedure to find and select biomedical entities. In addition, they were limited to showing one dimension, that is, the cause-and-effect relationship between two concepts. Thus, this study proposed a novel approach to discover various relationships among source and target concepts and their intermediate concepts by expanding intermediate concepts to multiple levels. This study provided distinct perspectives for literature-based discovery by not only discovering the meaningful relationships among concepts in biomedical literature through graph-based path inference but also being able to generate feasible new hypotheses.

Analysis on Research Trend of Productivity Using Text Mining - Focusing on KSCE Journal - (텍스트 마이닝을 통한 건설 생산성 분야의 연구동향 분석 - KSCE 저널을 중심으로 -)

  • Gu, Bongil;Huh, Youngki
    • Korean Journal of Construction Engineering and Management / v.21 no.2 / pp.15-21 / 2020
  • The relationships between keywords found in all productivity-related papers published in the KSCE journal over the last 15 years were analyzed using text mining and the A-Priori algorithm in order to reveal a research trend in the area. As a result, it was found that the word 'productivity' is most closely related to the words 'work' and 'labor'. Furthermore, the word is somewhat related to 'factor', 'model', 'simulation', and 'work time'. It was also revealed that, on the other hand, the words 'machine' and 'equipment' have little relationship with the keyword. This research will help academia understand the research trend in the area of construction productivity.
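
    As a rough illustration of how an A-Priori style analysis relates keywords, the sketch below runs the first two passes of the algorithm: it keeps only keywords that meet a minimum support, then counts pairs built from those survivors. The function name, the threshold, and the keyword lists are invented for illustration and are not the study's data or implementation.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {(item_a, item_b): support} for pairs meeting min_support.

    A-Priori pruning: a pair can only be frequent if both of its items
    are frequent on their own, so infrequent items are dropped first.
    """
    n = len(transactions)
    if n == 0:
        return {}
    sets = [set(t) for t in transactions]

    # Pass 1: support of individual keywords.
    item_counts = Counter(item for s in sets for item in s)
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}

    # Pass 2: count only pairs made of frequent keywords.
    pair_counts = Counter()
    for s in sets:
        kept = sorted(s & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}


if __name__ == "__main__":
    # Hypothetical keyword lists, one per paper.
    papers = [
        ["productivity", "labor", "work"],
        ["productivity", "work"],
        ["productivity", "model"],
        ["equipment", "machine"],
    ]
    print(frequent_pairs(papers, min_support=0.5))
```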

Research on Korea Text Recognition in Images Using Deep Learning (딥 러닝 기법을 활용한 이미지 내 한글 텍스트 인식에 관한 연구)

  • Sung, Sang-Ha;Lee, Kang-Bae;Park, Sung-Ho
    • Journal of the Korea Convergence Society / v.11 no.6 / pp.1-6 / 2020
  • This study addresses character recognition, one of the fields of computer vision. Optical character recognition, one of the most widely used character recognition techniques, suffers from a reduced recognition rate when the recognition target deviates from a certain standard and format. Hence, this study aimed to address this limitation by applying deep learning techniques to character recognition. In addition, as most character recognition studies have been limited to English or number recognition, the recognition range was expanded through additional training on Korean text data. As a result, this study derived a deep learning-based character recognition algorithm for Korean text recognition. The algorithm obtained a score of 0.841 on the 1-NED evaluation method, which is similar to the result for English recognition. Further, based on the analysis of the results, major issues with Korean text recognition and possible future study tasks are introduced.
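
    The 1-NED score mentioned above is commonly defined as one minus the average normalized edit distance between each predicted string and its reference, where each Levenshtein distance is divided by the length of the longer string. The sketch below assumes that common definition; the abstract does not state the exact variant used, and the function names and sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def one_minus_ned(predictions, references):
    """1 - mean(edit_distance / max(len(pred), len(ref))) over all pairs."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for pred, ref in zip(predictions, references):
        longest = max(len(pred), len(ref))
        if longest:  # two empty strings count as a perfect match
            total += levenshtein(pred, ref) / longest
    return 1.0 - total / len(predictions)


if __name__ == "__main__":
    # Made-up recognition outputs compared against made-up ground truth.
    print(one_minus_ned(["한글", "text"], ["한국", "text"]))
```

    Python strings compare by Unicode code point, so a precomposed Hangul syllable counts as one character here; a jamo-level comparison would require decomposing the strings first.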

The Course of Schema Activation in Processing of Humor Text (유머텍스트 처리에서 스키마의 활성화 과정)

  • Choi, Young-Geon;Shin, Hyun-Jung
    • The Journal of the Korea Contents Association / v.15 no.9 / pp.425-435 / 2015
  • Though most researchers studying humor in recent years agree that 'incongruity' is an essential factor in humor elicitation, they hold different views on the course of schema activation in processing humor text. One point of disagreement is whether schemata are activated concurrently or selectively. The concurrent activation view suggests that different schemata are activated concurrently because we perceive them at the same time, whereas the selective activation view suggests that different schemata are activated selectively because we use selective attention to perceive them. This study was conducted to test these two views. We verified that different schemata were activated in the processing of humor text, and we examined whether Vaid's experiment failed or not. The experiment used a mixed 2 (schema1, schema2) × 3 (setup, incongruity, resolution) × 2 (humor, control) factorial design with Latin square counterbalancing. The data showed that different schemata were activated at the same time during the 'Incongruity' stage. Above all, the activated schemata remained active during the 'Resolution' stage. This result suggests that different schemata were activated concurrently.

Ontology and Text Mining-based Advanced Historical People Finding Service (온톨로지와 텍스트 마이닝 기반 지능형 역사인물 검색 서비스)

  • Jeong, Do-Heon;Hwang, Myunggwon;Cho, Minhee;Jung, Hanmin;Yoon, Soyoung;Kim, Kyungsun;Kim, Pyung
    • Journal of Internet Computing and Services / v.13 no.5 / pp.33-43 / 2012
  • The semantic web is utilized to construct advanced information services by using semantic relationships between entities. Text mining can be applied to generate semantic relationships from unstructured data resources. In this study, an ontology schema guideline, ontology instance generation, disambiguation of identical names by text mining, and an advanced historical people finding service based on reasoning are proposed. Various relationships between historical events, organizations, and people, which are created by domain experts, are linked to the literature of the National Institute of Korean History (NIKH). This improves the effectiveness of user access and enables an advanced people finding service based on relationships. In order to distinguish between people with the same name, we compare the structure, edges, and nodes of their personal social networks. To provide additional information, external resources including a thesaurus and the web are also linked to all related internal resources.

Development of On-line Judge System based on Block Programming Environment (블록 프로그래밍 환경 기반 온라인 평가 시스템 개발)

  • Shim, Jaekwoun;Chae, Jeong Min
    • The Journal of Korean Association of Computer Education / v.21 no.4 / pp.1-10 / 2018
  • The block programming environment, represented by Scratch in elementary and middle school programming education, suits learners' characteristics and cognitive level, and is recommended not only for beginners. Transfer to a text programming environment after block programming is essential for understanding how data are processed, understanding the accuracy and efficiency of algorithms, and carrying out software creation activities. In addition, it is presented step by step in the programming curriculum. This study developed WithBlock, an online evaluation system intended to support the transfer from a block programming environment to a text programming environment. The developed system allows the same algorithm problem to be solved in both block and text programming environments, and it can be used for elementary and secondary programming education by automatically scoring the written code and providing immediate feedback. To assess its applicability to elementary and secondary programming education, the usability, learning possibility, interest, and satisfaction of WithBlock were surveyed. The results of the survey showed that it can be used for programming education.

Parametric and Non Parametric Measures for Text Similarity (텍스트 유사성을 위한 파라미터 및 비 파라미터 측정)

  • Mlyahilu, John;Kim, Jong-Nam
    • Journal of the Institute of Convergence Signal Processing / v.20 no.4 / pp.193-198 / 2019
  • The wide spread of genuine and fake information on the internet has led to various studies on text analysis. Copying and pasting others' work without acknowledgement and manipulating research results without proof have been common for a while in the era of data science. Various tools have been developed to reduce, combat, and possibly eradicate plagiarism in various research fields. Text similarity can be measured manually using both parametric and non-parametric methods; this study implements cosine similarity and Pearson correlation as parametric methods and Spearman correlation as a non-parametric method. The cosine similarity and Pearson correlation metrics achieved the highest similarity coefficients, while Spearman correlation showed low similarity coefficients. We recommend the use of non-parametric methods in measuring text similarity because they do not assume normality, as opposed to the parametric methods, which rely on normality assumptions and are subject to bias.
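
    The three measures compared in this abstract can be sketched on simple term-frequency vectors as follows. The whitespace tokenization, the helper names, and the two sample sentences are assumptions made for illustration; the abstract does not specify how the authors vectorized their texts.

```python
import math
from collections import Counter


def tf_vectors(doc_a: str, doc_b: str):
    """Build aligned term-frequency vectors over the shared vocabulary."""
    ca, cb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    return [ca[w] for w in vocab], [cb[w] for w in vocab]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    n = len(x)
    if n == 0:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up sentences standing in for two documents under comparison.
    x, y = tf_vectors("text mining of text data", "mining of large text data")
    print(cosine(x, y), pearson(x, y), spearman(x, y))
```

    Term-frequency vectors tend to contain many tied counts (mostly zeros and ones), so the rank transform compresses them heavily, which is one plausible reason a rank-based coefficient can come out lower than cosine or Pearson on the same pair of texts.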

The Effects of User Involvement on Internet Ad Preference Based on Presentation Type and Content

  • Joo Hoo Kim
    • The Journal of Society for e-Business Studies / v.8 no.4 / pp.33-51 / 2003
  • The primary objective of this study was to determine, using data from Internet users in Korea, users' preferences for banner ads in terms of two ad properties: ad presentation type (text vs. image) and ad content (product information vs. prize information), incorporating the level of involvement into the research design. Using a within-group experimental design in which subjects participated via the web, the study showed that image-based banner ads were significantly preferred to text-based banner ads. The level of ad involvement had a significant impact on the preference for banner ads. In addition, image-based banner ads had a greater effect on ad preference than text-based banner ads only in the low-involvement situation. Finally, image-based banner ads were consistently preferred to text-based banner ads regardless of involvement level when the banner ad was product oriented. The findings suggest that decisions regarding banner ad presentation type and banner ad content should be based on knowledge of both the level of the consumer's ad involvement and the interactive effects between ad presentation and ad content.