• Title/Summary/Keyword: Text data

Generating and Controlling an Interlinking Network of Technical Terms to Enhance Data Utilization (데이터 활용률 제고를 위한 기술 용어의 상호 네트워크 생성과 통제)

  • Jeong, Do-Heon
    • Journal of the Korean Society for information Management
    • /
    • v.35 no.1
    • /
    • pp.157-182
    • /
    • 2018
  • As data management and processing techniques have developed rapidly in the era of big data, many companies and researchers have become interested in long-tail data that were ignored in the past. This study proposes methods, based on text mining techniques, for generating and controlling a network of technical terms in order to enhance the utilization of data in the long tail of the distribution. In particular, an edit distance technique from text mining provides an efficient way to automatically create an interlinking network of technical terms in the scholarly field. We also used a linked open data (LOD) system to gather experimental data, and we propose effective methods for using LOD data together with an algorithm for recognizing term patterns. Finally, a performance evaluation of the network of technical terms showed that the proposed methods were useful for raising the rate of data utilization.
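
The abstract above names edit distance as the basis for linking technical terms. The following is a minimal sketch of that general idea, not the author's method: it computes the Levenshtein distance between every pair of terms and links pairs whose length-normalized distance falls under a threshold. The function names `edit_distance` and `link_terms` and the `max_ratio` threshold are assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def link_terms(terms, max_ratio=0.2):
    """Return (term_a, term_b, distance) for pairs of near-identical terms.

    max_ratio is a hypothetical threshold on distance / longer length.
    """
    links = []
    for a, b in combinations(terms, 2):
        d = edit_distance(a.lower(), b.lower())
        longest = max(len(a), len(b)) or 1  # avoid dividing by zero
        if d / longest <= max_ratio:
            links.append((a, b, d))
    return links


if __name__ == "__main__":
    print(link_terms(["text mining", "text-mining", "suffix array"]))
```

Comparing all pairs is quadratic in the number of terms, so a real term network over a large vocabulary would likely need blocking or indexing first; the sketch only shows the linking criterion.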

A Big Data Learning for Patent Analysis (특허분석을 위한 빅 데이터학습)

  • Jun, Sunghae
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.23 no.5
    • /
    • pp.406-411
    • /
    • 2013
  • Big data has become an issue in diverse fields, and big data learning is needed in areas ranging from engineering to social science. Statistics and machine learning algorithms are the representative tools for big data learning. In this paper, we study learning tools for big data and propose an efficient methodology for big data learning that extends from legacy data to practical application. We apply our big data learning to patent analysis, because patents are a form of big data, and we use the patent analysis results for technology forecasting. To illustrate how the proposed methodology can be applied in a real domain, we retrieve patents related to big data from patent databases around the world. Using the retrieved patent data, we perform a case study with text mining preprocessing and multiple linear regression.

A Deep Learning Application for Automated Feature Extraction in Transaction-based Machine Learning (트랜잭션 기반 머신러닝에서 특성 추출 자동화를 위한 딥러닝 응용)

  • Woo, Deock-Chae;Moon, Hyun Sil;Kwon, Suhnbeom;Cho, Yoonho
    • Journal of Information Technology Services
    • /
    • v.18 no.2
    • /
    • pp.143-159
    • /
    • 2019
  • Machine learning (ML) is a method of fitting given data to a mathematical model in order to derive insights or make predictions. In the age of big data, where the amount of available data is increasing exponentially owing to the development of information technology and smart devices, ML shows high prediction performance because it detects patterns without bias. Feature engineering, which generates the features that explain the problem to be solved, has a great influence on ML performance, and its importance is continually emphasized. Nevertheless, it is still considered a difficult task because it requires a thorough understanding of the domain characteristics and the source data, as well as an iterative procedure. Therefore, we propose methods that apply deep learning to reduce the complexity and difficulty of feature extraction and to improve the performance of ML models. The most commonly cited reason for the superior performance of deep learning on complex unstructured data is that, unlike other techniques, it can extract features from the source data itself. To bring this advantage to business problems, we propose deep learning based methods that can automatically extract features from transaction data or directly predict and classify target variables. In particular, based on the structural similarity between transaction data and text data, we applied techniques that show high performance in existing text processing. We also verified the suitability of each method according to the characteristics of the transaction data. Our study makes it possible not only to explore the feasibility of automated feature extraction but also to obtain a benchmark model that achieves a certain level of performance before a human performs the feature extraction task. In addition, it is expected to provide guidelines for choosing a suitable deep learning model based on the business problem and the data characteristics.
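
The structural similarity between transaction data and text data mentioned above can be illustrated by treating each customer's purchases as a "sentence" of item tokens. The sketch below is only an assumed preprocessing step, not the authors' pipeline: the record layout `(customer_id, timestamp, item)` and the helper names are hypothetical, and the output is the kind of padded integer sequence that text models typically consume.

```python
from collections import defaultdict


def transactions_to_sequences(transactions):
    """Group (customer_id, timestamp, item) records into time-ordered item lists."""
    by_customer = defaultdict(list)
    for customer_id, timestamp, item in transactions:
        by_customer[customer_id].append((timestamp, item))
    return {
        cid: [item for _, item in sorted(rows, key=lambda r: r[0])]
        for cid, rows in by_customer.items()
    }


def build_vocab(sequences):
    """Map each distinct item to an integer id; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for seq in sequences.values():
        for item in seq:
            vocab.setdefault(item, len(vocab))
    return vocab


def encode(seq, vocab, max_len):
    """Convert an item sequence to a fixed-length list of ids (truncate or pad)."""
    ids = [vocab[item] for item in seq][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    records = [("c1", 2, "milk"), ("c1", 1, "bread"), ("c2", 1, "milk")]
    seqs = transactions_to_sequences(records)
    vocab = build_vocab(seqs)
    print({cid: encode(seq, vocab, 4) for cid, seq in seqs.items()})
```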

A Study on the Meaning of The First Slam Dunk Based on Text Mining and Semantic Network Analysis

  • Kyung-Won Byun
    • International journal of advanced smart convergence
    • /
    • v.12 no.1
    • /
    • pp.164-172
    • /
    • 2023
  • In this study, we examine how 'The First Slam Dunk', which is gaining popularity as a sports-based cartoon, is perceived, through big data analysis of social media channels, and we provide basic data for the development of various content in the sports industry. Social big data were collected from news provided on Naver and Google. Data were collected from January 1, 2023 to February 15, 2023, with reference to the release date of 'The First Slam Dunk' in Korea. In total, 2,106 Naver news items and 1,019 Google news items were collected. TF and TF-IDF were analyzed for these data through text mining, and semantic network analysis was then conducted on 60 keywords. Textom and UCINET were used for the social big data analysis, and NetDraw was used for visualization. Considering TF and TF-IDF, the most frequent keyword related to the subject was 'The First Slam Dunk', which appeared 4,079 times. Next were 'Slam Dunk', 'Movie', 'Premiere', 'Animation', 'Audience', and 'Box-Office'. Based on these results, 60 high-frequency keywords were extracted, after which semantic metrics and centrality analysis were conducted. Finally, a total of six clusters (competing movie, cartoon, passion, premiere, attention, Box-Office) were formed through CONCOR analysis. Based on this analysis of the semantic network of 'The First Slam Dunk', basic data for a development plan for sports content were provided.
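
The TF and TF-IDF measures used in the abstract above can be sketched in a few lines. This is an illustrative calculation under stated assumptions, not a reproduction of Textom: TF is taken as the corpus-wide count of a term, IDF as the natural log of (number of documents / number of documents containing the term), and a term's score as their product. The abstract does not specify the tool's exact weighting, and the sample tokens are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists; returns (tf, tfidf) keyed by term."""
    n_docs = len(docs)
    # Corpus-wide term frequency.
    tf = Counter(tok for doc in docs for tok in doc)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Terms that occur in every document get a score of zero.
    tfidf = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    return tf, tfidf


if __name__ == "__main__":
    docs = [
        ["slam", "dunk", "movie"],
        ["slam", "dunk", "premiere"],
        ["movie", "audience"],
    ]
    tf, tfidf = tf_idf(docs)
    for term, score in sorted(tfidf.items(), key=lambda kv: -kv[1]):
        print(term, tf[term], round(score, 3))
```

Ranking by TF alone favors words that appear everywhere, while TF-IDF discounts them, which is why studies of this kind usually report both.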

Fast Construction of Suffix Arrays for DNA Strings (DNA 스트링에 대하여 써픽스 배열을 구축하는 빠른 알고리즘)

  • Jo, Jun-Ha;Kim, Nam-Hee;Kwon, Ki-Ryong;Kim, Dong-Kyue
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.34 no.8
    • /
    • pp.319-326
    • /
    • 2007
  • To search quickly in massive data such as DNA strings, the most efficient approach is to construct a full-text index data structure over the given strings. The widely used full-text index structures are suffix trees and suffix arrays. Since the suffix array uses less space than the suffix tree, the suffix array is well suited to DNA strings. Previously developed suffix array construction algorithms are not suitable for DNA strings because they are designed for integer alphabets. We propose a fast algorithm for constructing suffix arrays on DNA strings, whose alphabet size is fixed at 4. We reduce the construction time by improving the encoding and merging steps of the algorithm of Kim et al.[1]. Experimental results show that our algorithm constructs suffix arrays on DNA strings 1.3-1.6 times faster than Kim et al.'s algorithm, and faster than other algorithms in most cases.
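
To make the suffix array concrete, here is a minimal sketch of what the data structure is and how it supports searching. It deliberately uses naive construction (sorting all suffixes), not the encoding-and-merging algorithm proposed in the paper or Kim et al.'s algorithm; the function names are assumptions for illustration.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction: simple, but far slower than specialized algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of pattern, using binary search on sa."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    dna = "ACGTACGA"
    sa = build_suffix_array(dna)
    print(sa)
    print(find_occurrences(dna, sa, "ACG"))
```

Because the suffixes are stored only as integer positions, the index needs one integer per character of the text, which reflects the space advantage over suffix trees noted in the abstract.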

Pedagogical effectiveness of algorithm visualizations in teaching the data structures and algorithms in elementary schools (초등학교의 자료구조와 알고리즘 수업에서 알고리즘 시각화의 교육적 효과)

  • Chun, Seok-Ju
    • Journal of The Korean Association of Information Education
    • /
    • v.16 no.2
    • /
    • pp.255-263
    • /
    • 2012
  • Early algorithm education is very important for nurturing excellent software developers in an information society. However, learning algorithms is a great challenge for elementary school students, because it is difficult to understand what a computer algorithm written in a static text format is meant to do. Animations are expected to help students visualize an algorithm easily. In this study, we evaluate the pedagogical effectiveness of algorithm visualizations in teaching fundamental data structures and algorithms in elementary schools. To this end, we defined a new measure called the 'Algorithm Visualization Factor (AVF)' and developed both text-oriented and animation-oriented PPTs for the following algorithm education elements: Stack, Queue, Bubble Sort, Heap Sort, BFS, and DFS. We conducted experiments and evaluations on diverse student groups. Extensive experimental results show that the average score of the student groups using the animation-oriented PPT was 22% higher than that of the student groups using the text-oriented PPT.

A Topic Modeling Analysis for Online News Article Comments on Nurses' Workplace Bullying (간호사의 직장 내 괴롭힘 관련 온라인 뉴스기사 댓글에 대한 토픽 모델링 분석)

  • Kang, Jiyeon;Kim, Soogyeong;Roh, Seungkook
    • Journal of Korean Academy of Nursing
    • /
    • v.49 no.6
    • /
    • pp.736-747
    • /
    • 2019
  • Purpose: This study aimed to explore public opinion on workplace bullying in the nursing field by analyzing the keywords and topics of online news comments. Methods: This was a text-mining study that collected, processed, and analyzed text data. A total of 89,951 comments on 650 online news articles, reported between January 1, 2013 and July 31, 2018, were collected via web crawling. The collected unstructured text data were preprocessed, and keyword analysis and topic modeling were performed using R programming. Results: The 10 most important keywords were "work" (37121.7), "hospital" (25286.0), "patients" (24600.8), "woman" (24015.6), "physician" (20840.6), "trouble" (18539.4), "time" (17896.3), "money" (16379.9), "new nurses" (14056.8), and "salary" (13084.1). The 22,572 preprocessed keywords were categorized into four topics: "poor working environment", "culture among women", "unfair oppression", and "society-level solutions". Conclusion: Public interest in workplace bullying among nurses has continued to increase. The public agreed that a negative work environment and the nursing shortage could cause workplace bullying. They also considered nurse bullying a problem that should be resolved at a societal level. Further research is needed on nurse workplace bullying from the perspective of gender discrimination and on the social value of nursing work.

Topic Model Analysis of Research Trend on Renewable Energy (신재생에너지 동향 파악을 위한 토픽 모형 분석)

  • Shin, KyuSik;Choi, HoeRyeon;Lee, HongChul
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.16 no.9
    • /
    • pp.6411-6418
    • /
    • 2015
  • In response to climate change and environmental pollution, studies on renewable energy policy are increasing. Renewable energy is a new growth engine technology that represents the green industry and green technology. At present, investments in renewable energy supply and technology development projects are being made in our country in three main strategic sectors: sunlight, wind power, and hydrogen fuel cells. Because these are still at an early stage, reducing the uncertainty about research directions and investment fields is the most urgent issue. Thus, this study applied text mining and a multinomial topic model, among big data analysis methods, to our country's newspaper articles on renewable energy from the last 10 years, analyzed the core issues and global research trends, and forecast the renewable energy fields with growth potential. The results of this study, which is based on information and communication technology, are expected to be actively applied in renewable energy fields.

Analysis of Unstructured Data on Detecting of New Drug Indication of Atorvastatin (아토바스타틴의 새로운 약물 적응증 탐색을 위한 비정형 데이터 분석)

  • Jeong, Hwee-Soo;Kang, Gil-Won;Choi, Woong;Park, Jong-Hyock;Shin, Kwang-Soo;Suh, Young-Sung
    • Journal of health informatics and statistics
    • /
    • v.43 no.4
    • /
    • pp.329-335
    • /
    • 2018
  • Objectives: In recent years, there has been an increased need for a way to extract desired information from many medical publications at once. This study was conducted to confirm the usefulness of unstructured data analysis of previously published medical literature in searching for new indications. Methods: New indications were searched for through text mining, network analysis, and topic modeling analysis of 5,057 articles, published from 1990 to 2017, on atorvastatin, a treatment for hyperlipidemia. Results: A total of 273 keywords were extracted. In the frequency analysis from text mining and in the network analysis, the existing indications of atorvastatin ranked at the top. The novel indications identified by Term Frequency-Inverse Document Frequency (TF-IDF) were atrial fibrillation, heart failure, breast cancer, rheumatoid arthritis, combined hyperlipidemia, arrhythmias, multiple sclerosis, non-alcoholic fatty liver disease, contrast-induced acute kidney injury, and prostate cancer. Conclusions: Unstructured data analysis for discovering new indications from massive medical literature is expected to be used in the drug repositioning industry.

The Study on the patient safety culture convergence research topics through text mining and CONCOR analysis (텍스트마이닝 및 CONCOR 분석을 활용한 환자안전문화 융복합 연구주제 분석)

  • Baek, Su Mi;Moon, Inn Oh
    • Journal of Digital Convergence
    • /
    • v.19 no.12
    • /
    • pp.359-367
    • /
    • 2021
  • The purpose of this study is to analyze domestic research topics on patient safety culture using text mining and CONCOR analysis. The research proceeded in the stages of data collection, data preprocessing, text mining and social network analysis, and CONCOR analysis. A total of 136 articles were analyzed, excluding unpublished papers. Data analysis was performed using the Textom and UCINET programs. In studies related to patient safety culture, TF (frequency) was highest for "patient safety", and TF-IDF (importance in documents) was highest for "nursing". The CONCOR analysis yielded a total of seven clusters that constitute the patient safety culture: knowledge and attitude, communication, medical service, team, work environment, structure, and organization and management. Future research should examine the relationship between the establishment of a patient safety culture and patient outcomes.