• Title/Summary/Keyword: web visualization


A Study on Unstructured text data Post-processing Methodology using Stopword Thesaurus (불용어 시소러스를 이용한 비정형 텍스트 데이터 후처리 방법론에 관한 연구)

  • Won-Jo Lee
    • The Journal of the Convergence on Culture Technology
    • /
    • v.9 no.6
    • /
    • pp.935-940
    • /
    • 2023
  • Most text data collected through web scraping for artificial intelligence and big data analysis is large and unstructured, so a refining process is required before analysis. Through a heuristic pre-processing refining step and a post-processing machine refining step, the data become structured data that can be analyzed. In this study, the post-processing machine refining step uses a Korean dictionary and a stopword dictionary to extract the vocabulary for frequency analysis and word cloud analysis. For this step, we propose a methodology that applies a "user-defined stopword thesaurus" to efficiently remove stopwords that the stopword dictionary fails to remove. Through a case analysis with R's word cloud technique, we examine the pros and cons of the proposed refining method, compare the "user-defined stopword thesaurus" technique, which complements the problems of the existing "stopword dictionary" method, against the existing approach, and demonstrate the effectiveness of applying the proposed methodology in practice.
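To make the post-processing step concrete, the sketch below shows the general pattern of applying a user-defined stopword thesaurus before word-cloud frequency analysis. It is only an illustration of the idea: the study itself works with Korean text and R's word cloud technique, whereas this sketch uses Python's wordcloud package, and the thesaurus entries and tokens are invented placeholders.

```python
# Minimal sketch (not the paper's R implementation): applying a user-defined
# stopword thesaurus before word-cloud frequency analysis.
from collections import Counter
from wordcloud import WordCloud  # pip install wordcloud

# Hypothetical thesaurus: each entry maps a stopword to its variant forms,
# so variants missed by a flat stopword dictionary are removed as well.
stopword_thesaurus = {
    "however": ["however", "howevr", "how-ever"],
    "thing": ["thing", "things", "thingy"],
}
stopwords = {v for variants in stopword_thesaurus.values() for v in variants}

def refine_tokens(tokens):
    """Post-processing step: drop every token covered by the thesaurus."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "the thing is that things change however results stay".split()
freq = Counter(refine_tokens(tokens))

# The frequency table feeds the word cloud exactly as a plain dictionary would.
WordCloud(width=400, height=300).generate_from_frequencies(freq).to_file("cloud.png")
```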

A Study on the User Experience at Unmanned Cafe Using Big Data Analysis: Focus on text mining and semantic network analysis (빅데이터를 활용한 무인카페 소비자 인식에 관한 연구: 텍스트 마이닝과 의미연결망 분석을 중심으로)

  • Seung-Yeop Lee;Byeong-Hyeon Park;Jang-Hyeon Nam
    • Asia-Pacific Journal of Business
    • /
    • v.14 no.3
    • /
    • pp.241-250
    • /
    • 2023
  • Purpose - The purpose of this study was to investigate the perception of "unmanned cafes" on the network through big data analysis and to identify the latest trends in rapidly changing consumer perception. Based on this, we suggest that the results can serve as basic data for revitalizing unmanned cafes and for differentiated marketing strategies. Design/methodology/approach - This study collected documents containing unmanned cafe keywords over about three years, and the data collected using text mining techniques were analyzed with methods such as keyword frequency analysis, centrality analysis, and keyword network analysis. Findings - First, the top 10 words by frequency of appearance were identified, in order: unmanned cafes, unmanned cafes, start-up, operation, coffee, time, coffee machine, franchise, and robot cafes. Second, visualization of the semantic network confirmed that the key keyword "unmanned cafe" was at the center of the keyword cluster. Research implications or Originality - By using big data to collect and analyze keywords with high web visibility, we tried to identify new issues and trends in the perception of unmanned cafes; the resulting keyword network consists largely of start-up-related keywords, indicating that mentions of unmanned cafes on the network mainly deal with start-up topics.
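The pipeline described here (keyword frequency, centrality, and keyword network analysis) can be illustrated with a small sketch. The documents and keywords below are placeholders rather than the study's collected data, and the networkx-based implementation is an assumption for illustration, not the tooling used by the authors.

```python
# Illustrative sketch of keyword frequency analysis, a keyword co-occurrence
# (semantic) network, and centrality analysis on toy documents.
from collections import Counter
from itertools import combinations
import networkx as nx

docs = [
    ["unmanned cafe", "start-up", "coffee machine"],
    ["unmanned cafe", "franchise", "start-up"],
    ["unmanned cafe", "robot cafe", "coffee", "operation"],
]

# Keyword frequency analysis
freq = Counter(k for doc in docs for k in doc)

# Semantic (co-occurrence) network: keywords appearing in the same document
G = nx.Graph()
for doc in docs:
    for a, b in combinations(sorted(set(doc)), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

# Centrality analysis to locate the hub keyword of the cluster
centrality = nx.degree_centrality(G)
print(freq.most_common(5))
print(max(centrality, key=centrality.get))  # hub of the keyword cluster
```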

INTRA-AND INTERGOVERNMENTAL INFORMATION SYSTEM TO MANAGE INFORMATION IN URBAN RENEWAL PROJECT

  • Dong-bum Kim;Jin-Won Kim;Ju-Hyung Kim;Jae-Jun Kim
    • International conference on construction engineering and project management
    • /
    • 2011.02a
    • /
    • pp.561-566
    • /
    • 2011
  • In general, the early stages of urban renewal in Korea, such as preparing a master plan and processing administrative work including planning permission, are handled by local governments. Local governments need to review the status of projects being undertaken in other local governments' territories, yet no integrated national-level information system exists in Korea to manage such information. If such a system were developed, it could help the central government obtain information on required resources at the national level. In addition, local governments could gain guidance on the process and recognize potentially problematic situations from others' experience. The system should include functions to collect data on the project summary, cost, and schedule of projects by local government. The expected effects of using the information system are as follows. First, information generated from project practice becomes more credible on account of management at the national level, because an authorized party such as a government system administration agent is responsible for collecting and managing the data. Second, a unified information system, regardless of where projects take place, reduces the effort of accumulating reference data and aids local governments' decision making by providing appropriate information in a timely manner. Also, enhanced information accessibility makes the project process more transparent to stakeholders. Finally, oversight management is reinforced by the visualization technology adopted in the system, which graphically presents the master plan and mass model, including information on usage by floor, together with progress information. Ultimately, potential challenges can be anticipated by considering records accumulated from other local governments' projects. This paper presents the concept, functionalities, and architecture of an information system that enables local and central governments to manage data from individual projects and to aggregate them for oversight management. As part of the systems analysis, general requirements of a briefing system for governments and the data fields necessary for this purpose are identified.

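As a rough illustration of the data fields the paper identifies (project summary, cost, and schedule per local government) and their aggregation for national-level oversight, here is a hedged sketch; the field names and figures are hypothetical, not the system's actual schema.

```python
# Hypothetical project record and national-level aggregation for oversight.
from dataclasses import dataclass

@dataclass
class RenewalProject:
    local_government: str
    summary: str
    cost_krw: float      # reported project cost
    progress_pct: float  # schedule progress, 0-100

def national_overview(projects):
    """Aggregate per-local-government figures for central-government oversight."""
    overview = {}
    for p in projects:
        entry = overview.setdefault(p.local_government, {"count": 0, "cost_krw": 0.0})
        entry["count"] += 1
        entry["cost_krw"] += p.cost_krw
    return overview

projects = [
    RenewalProject("Local government A", "Block renewal", 1.2e9, 35.0),
    RenewalProject("Local government B", "Harbor district", 0.8e9, 60.0),
]
print(national_overview(projects))
```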

Application of Terrestrial LiDAR for Reconstructing 3D Images of Fault Trench Sites and Web-based Visualization Platform for Large Point Clouds (지상 라이다를 활용한 트렌치 단층 단면 3차원 영상 생성과 웹 기반 대용량 점군 자료 가시화 플랫폼 활용 사례)

  • Lee, Byung Woo;Kim, Seung-Sep
    • Economic and Environmental Geology
    • /
    • v.54 no.2
    • /
    • pp.177-186
    • /
    • 2021
  • For disaster management and mitigation of earthquakes in the Korean Peninsula, active fault investigation has been conducted for the past 5 years. In particular, the investigation of sediment-covered active faults integrates geomorphological analysis of airborne LiDAR data, surface geological survey, and geophysical exploration, and uncovers subsurface active faults by trench survey. However, the fault traces revealed by trench surveys are available for investigation only for a limited time before the site is restored to its previous condition. Thus, the geological data describing fault trench sites remain only as qualitative descriptions in research articles and reports. To overcome this limitation imposed by the temporal nature of geological field studies, we utilized a terrestrial LiDAR to produce 3D point clouds of the fault trench sites and restored them in a digital space. The terrestrial LiDAR scanning was conducted at two trench sites located near the Yangsan Fault and acquired amplitude and reflectance from the surveyed area, as well as color information by combining photogrammetry with the LiDAR system. The scanned data were merged to form 3D point clouds with an average geometric error of 0.003 m, accurate enough to restore the details of the surveyed trench sites. However, we found that more post-processing of the scanned data would be necessary, because the amplitudes and reflectances of the point clouds varied depending on the scan positions, and the colors of the trench surfaces were captured differently depending on the light exposure at the time. Such point clouds are quite large and can be visualized only with a limited set of software packages, which restricts data sharing among researchers. As an alternative, we suggest Potree, an open-source web-based platform, for visualizing the point clouds of the trench sites. As a result, we find that terrestrial LiDAR data can be a practical way to increase the reproducibility of geological field studies while remaining easily accessible to researchers and students in the Earth Sciences.
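A minimal sketch of the scan-merging and geometric-error check described above is given below, using Open3D as a stand-in for the authors' processing software; the file names, the registration transform, and the export format are placeholders, and Potree conversion (e.g., with PotreeConverter) would follow as a separate step.

```python
# Assumed workflow, not the authors' processing chain: merge registered
# terrestrial LiDAR scans and estimate geometric error with Open3D.
import numpy as np
import open3d as o3d  # pip install open3d

# Placeholder file names; in practice each scan position is exported separately.
scan_a = o3d.io.read_point_cloud("trench_scan_a.ply")
scan_b = o3d.io.read_point_cloud("trench_scan_b.ply")

# Assume scan_b has already been registered into scan_a's frame; an identity
# transform stands in for the real registration result here.
T = np.eye(4)
scan_b.transform(T)

# Geometric error: mean nearest-neighbour distance in the overlap region.
distances = np.asarray(scan_a.compute_point_cloud_distance(scan_b))
print(f"mean geometric error: {distances.mean():.4f} m")

# Merge and export; the merged cloud can then be converted for web viewing
# (e.g., with PotreeConverter) and served through Potree.
merged = scan_a + scan_b
o3d.io.write_point_cloud("trench_merged.ply", merged)
```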

Twitter Issue Tracking System by Topic Modeling Techniques (토픽 모델링을 이용한 트위터 이슈 트래킹 시스템)

  • Bae, Jung-Hwan;Han, Nam-Gi;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.2
    • /
    • pp.109-122
    • /
    • 2014
  • People nowadays create a tremendous amount of data on Social Network Services (SNS). In particular, the incorporation of SNS into mobile devices has resulted in massive amounts of data generation, thereby greatly influencing society. This is an unmatched phenomenon in history, and now we live in the Age of Big Data. SNS data qualifies as Big Data in that it satisfies the conditions of volume (the amount of data), velocity (data input and output speeds), and variety (the variety of data types). Trends of issues discovered in SNS Big Data can be used as an important new source for the creation of new value, because this information covers the whole of society. In this study, a Twitter Issue Tracking System (TITS) is designed and built to meet the need to analyze SNS Big Data. TITS extracts issues from Twitter texts and visualizes them on the web. The proposed system provides the following four functions: (1) provide the topic keyword set that corresponds to the daily ranking; (2) visualize the daily time-series graph of a topic over the duration of a month; (3) convey the importance of a topic through a treemap based on a score system and frequency; and (4) visualize the daily time-series graph of keywords retrieved by keyword search. The present study analyzes the Big Data generated by SNS in real time. SNS Big Data analysis requires various natural language processing techniques, including the removal of stopwords and noun extraction, to process various unrefined forms of unstructured data. In addition, such analysis requires the latest big data technology to process a large amount of real-time data rapidly, such as the Hadoop distributed system or NoSQL, which is an alternative to relational databases. We built TITS on Hadoop to optimize the processing of big data, because Hadoop is designed to scale up from single-node computing to thousands of machines. Furthermore, we use MongoDB, which is classified as a NoSQL database. MongoDB is an open-source, document-oriented database that provides high performance, high availability, and automatic scaling. Unlike existing relational databases, MongoDB has no schemas or tables, and its most important goals are data accessibility and data processing performance. In the Age of Big Data, visualization is attractive to the Big Data community because it helps analysts examine such data easily and clearly. Therefore, TITS uses the d3.js library as a visualization tool. This library is designed for creating Data-Driven Documents that bind the document object model (DOM) to data; interaction with the data is easy, and it is useful for managing a real-time data stream with smooth animation. In addition, TITS uses Bootstrap, a set of pre-configured style sheets and JavaScript plug-ins, to build the web system. The TITS Graphical User Interface (GUI) is designed using these libraries and is capable of detecting issues on Twitter in an easy and intuitive manner. The proposed work demonstrates the superiority of our issue detection techniques by matching detected issues with corresponding online news articles. The contributions of the present study are threefold. First, we suggest an alternative approach to real-time big data analysis, which has become an extremely important issue. Second, we apply a topic modeling technique that is used in various research areas, including Library and Information Science (LIS); based on this, we confirm the utility of storytelling and time-series analysis. Third, we develop a web-based system and make it available for the real-time discovery of topics. The present study conducted experiments with nearly 150 million tweets collected in Korea during March 2013.
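The daily topic-keyword extraction that TITS performs can be sketched roughly as follows. The paper's pipeline runs on Hadoop, stores results in MongoDB, and visualizes them with d3.js; this sketch only illustrates the topic-modeling step with scikit-learn, and the tweets and parameters are invented placeholders.

```python
# Sketch of extracting topic keyword sets from a day's worth of tweets,
# the kind of output that would feed the daily ranking and treemap views.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "subway delay downtown commute",
    "new phone release price battery",
    "subway strike commute delay",
    "phone camera review battery life",
]

vec = CountVectorizer(stop_words="english")  # stopword removal
X = vec.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()

# Top keywords per topic for the daily ranking.
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")
```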

Learning Material Bookmarking Service based on Collective Intelligence (집단지성 기반 학습자료 북마킹 서비스 시스템)

  • Jang, Jincheul;Jung, Sukhwan;Lee, Seulki;Jung, Chihoon;Yoon, Wan Chul;Yi, Mun Yong
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.2
    • /
    • pp.179-192
    • /
    • 2014
  • In keeping with recent changes in the information technology environment, online learning environments that support multiple users' participation, such as MOOCs (Massive Open Online Courses), have become important. One of the largest professional associations in information technology, the IEEE Computer Society, announced that "Supporting New Learning Styles" is a crucial trend in 2014. Popular MOOC services, Coursera and edX, have continued to build active learning environments with a large number of lectures accessible anywhere using smart devices, and have been used by an increasing number of users. In addition, collaborative web services (e.g., blogs and Wikipedia) support the creation of various user-uploaded learning materials, resulting in a vast amount of new lectures and learning materials being created every day in the online space. However, it is difficult for an online educational system to keep a learner motivated, because learning occurs remotely, with limited opportunities to share knowledge among learners. Thus, it is essential to understand which materials each learner needs and how to motivate learners to participate actively in the online learning system. To address these issues, leveraging constructivism theory and collective intelligence, we developed a social bookmarking system called WeStudy, which supports learning-material sharing among users and provides personalized learning-material recommendations. Constructivism theory argues that knowledge is constructed while learners interact with the world. Collective intelligence can be separated into two types: (1) collaborative collective intelligence, which is built through direct collaboration among participants (e.g., Wikipedia), and (2) integrative collective intelligence, which produces new forms of knowledge by combining independent and distributed information through advanced technologies and algorithms (e.g., Google PageRank, recommender systems). Recommender systems, one example of integrative collective intelligence, utilize users' online activities to recommend what users may be interested in. Our system includes both collaborative and integrative collective intelligence functions. We analyzed well-known collective-intelligence-based web services such as Wikipedia, SlideShare, and VideoLectures to identify the main design factors that support collective intelligence. Based on this analysis, in addition to sharing online resources through social bookmarking, we selected three essential functions for our system: 1) multimodal visualization of learning materials in two forms (list and graph), 2) personalized recommendation of learning materials, and 3) explicit designation of learners of interest. After developing the web-based WeStudy system, we conducted usability testing through the heuristic evaluation method with seven heuristic indices: features and functionality, cognitive page, navigation, search and filtering, control and feedback, forms, and context and text. We recruited 10 experts who majored in Human-Computer Interaction and worked in the same field, and requested both quantitative and qualitative evaluations of the system. The evaluation results show that, relative to the other functions evaluated, the list/graph page produced higher scores on all indices except context and text. In the case of context and text, the learning-material page produced the best score compared with the other functions. In general, the explicit designation of learners of interest, one of the distinctive functions, received lower scores on all usability indices because its functionality was unfamiliar to users. In summary, the evaluation results show that our system achieved high usability and good performance, with some minor issues that need to be fully addressed before releasing the system to large-scale users. The study findings provide practical guidelines for the design and development of various systems that utilize collective intelligence.

User-Perspective Issue Clustering Using Multi-Layered Two-Mode Network Analysis (다계층 이원 네트워크를 활용한 사용자 관점의 이슈 클러스터링)

  • Kim, Jieun;Kim, Namgyu;Cho, Yoonho
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.2
    • /
    • pp.93-107
    • /
    • 2014
  • In this paper, we report what we have observed with regard to user-perspective issue clustering based on multi-layered two-mode network analysis. This work is significant in the context of companies' efforts to collect data about customer needs. Most companies have failed to uncover such needs for products or services properly from demographic data such as age, income level, and purchase history. Because of excessive reliance on limited internal data, most recommendation systems do not provide decision makers with appropriate business information for current business circumstances. Part of the problem is the increasing regulation of personal data gathering and privacy. This makes demographic or transaction data collection more difficult and is a significant hurdle for traditional recommendation approaches, because these systems demand a great deal of personal data or transaction logs. Our motivation for presenting this paper is our strong belief, and evidence, that most customers' requirements for products can be effectively and efficiently analyzed from unstructured textual data such as Internet news text. In order to derive users' requirements from textual data obtained online, the approach proposed in this paper constructs double two-mode networks, a user-news network and a news-issue network, and integrates them into one quasi-network as the input for issue clustering. One of the contributions of this research is the development of a methodology that utilizes enormous amounts of unstructured textual data for user-oriented issue clustering by leveraging existing text mining and social network analysis. In order to build multi-layered two-mode networks of news logs, we need tools such as text mining and topic analysis. We used SAS Enterprise Miner 12.1, which provides a text miner module and a cluster module for textual data analysis, as well as NetMiner 4 for network visualization and analysis. Our approach for user-perspective issue clustering is composed of six main phases: crawling, topic analysis, access pattern analysis, network merging, network conversion, and clustering. In the first phase, we collect visit logs for news sites with a crawler. After gathering unstructured news article data, the topic analysis phase extracts issues from each news article in order to build an article-issue network; for simplicity, 100 topics are extracted from 13,652 articles. In the third phase, a user-article network is constructed from access patterns derived from web transaction logs. The double two-mode networks are then merged into a user-issue quasi-network. Finally, in the user-oriented issue-clustering phase, we classify issues through structural equivalence and compare these with the clustering results from statistical tools and network analysis. An experiment with a large dataset was performed to build a multi-layered two-mode network, after which we compared the issue-clustering results from SAS with those of the network analysis. The experimental dataset came from a website-ranking service and the biggest portal site in Korea; it contains 150 million transaction logs and 13,652 news articles from 5,000 panel members over one year. User-article and article-issue networks were constructed and merged into a user-issue quasi-network using NetMiner. Our issue-clustering results, obtained with the Partitioning Around Medoids (PAM) algorithm and Multidimensional Scaling (MDS), are consistent with the results from SAS clustering. In spite of extensive efforts to serve users through recommendation systems, most projects succeed only when companies have sufficient data about users and transactions. Our proposed methodology, user-perspective issue clustering, can provide practical support for decision-making in companies because it enriches user-related data with information drawn from unstructured textual data. To overcome the problem of insufficient data in traditional approaches, our methodology infers customers' real interests by utilizing web transaction logs. In addition, we suggest topic analysis and issue clustering as a practical means of issue identification.
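The network-merging and clustering steps can be sketched as follows: two two-mode incidence matrices (user-article and article-issue) are multiplied into a user-issue quasi-network, and issues are grouped by structural equivalence with a PAM-style k-medoids. The matrices are toy placeholders, and the scikit-learn-extra implementation stands in for the SAS and NetMiner tooling the study actually uses.

```python
# Merging two-mode networks into a user-issue quasi-network, then PAM clustering.
import numpy as np
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

user_article = np.array([  # rows: users, cols: articles (visits)
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
])
article_issue = np.array([  # rows: articles, cols: issues (topic assignment)
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
])

# Merged quasi-network: how strongly each user is linked to each issue.
user_issue = user_article @ article_issue

# Structural equivalence: issues with similar user-link profiles cluster together.
issue_profiles = user_issue.T  # one row per issue
labels = KMedoids(n_clusters=2, method="pam", random_state=0).fit_predict(issue_profiles)
print(labels)
```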

Design of Cloud-Based Data Analysis System for Culture Medium Management in Smart Greenhouses (스마트온실 배양액 관리를 위한 클라우드 기반 데이터 분석시스템 설계)

  • Heo, Jeong-Wook;Park, Kyeong-Hun;Lee, Jae-Su;Hong, Seung-Gil;Lee, Gong-In;Baek, Jeong-Hyun
    • Korean Journal of Environmental Agriculture
    • /
    • v.37 no.4
    • /
    • pp.251-259
    • /
    • 2018
  • BACKGROUND: Various culture media have been used for the hydroponic culture of horticultural plants in smart greenhouses with natural and artificial light. In a smart farm system, the culture medium is managed with ICT (Information and Communication Technology) and/or IoT (Internet of Things) to control the amounts of medium and the components absorbed by plants during the cultivation period. This study was conducted to develop a cloud-based data analysis system for the effective management of culture media applied to hydroponic culture and plant growth in smart greenhouses. METHODS AND RESULTS: A conventional inorganic Yamazaki medium and organic media derived from agricultural byproducts such as immature fruits, leaves, or stems were used as hydroponic culture media. Component changes of the solutions according to growth stage were monitored and plant growth was observed. Red and green lettuce seedlings (Lactuca sativa L.) that had developed 2-3 true leaves were used as plant materials. The seedlings were grown hydroponically for 35 days in a smart greenhouse under fluorescent and light-emitting diode (LED) lights at a light intensity of 150 μmol/m²/s. Growth data of the seedlings were classified and stored, by growth parameter, in a relational database on a virtual machine generated from an OpenStack cloud system. The relationship between plant growth and the absorption pattern of nine inorganic components of the media during the cultivation period was investigated. The stored data on component changes and growth parameters were visualized on the web through a web framework and Node.js. CONCLUSION: Time-series changes of inorganic components in the culture media were observed. The increase in unfolded leaves and fresh weight of the seedlings depended mainly on macroelements such as NO3-N and was affected by the different inorganic and organic media. With the developed data analysis system, actual measurement data can be submitted from the user's smart device, and the data can be analyzed, compared, and visualized graphically as time series based on the cloud database. The system can thus support agricultural management through data visualization and plant growth analysis across agricultural sites regardless of changes in the culture environment.
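A minimal Python analogue of the described data flow is sketched below: measurements are stored in a relational table and summarized as time series per culture medium. The actual system stores the data in a cloud database on an OpenStack virtual machine and visualizes it on the web via Node.js; the table layout and measurement values here are invented for illustration.

```python
# Toy relational store of medium-component and growth measurements, queried as
# a time series per culture medium.
import sqlite3
import pandas as pd

conn = sqlite3.connect("greenhouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS medium_log (
    day INTEGER, medium TEXT, no3_n_mg_l REAL, leaves INTEGER, fresh_weight_g REAL)""")
rows = [
    (7,  "inorganic_yamazaki", 120.0, 4, 1.2),
    (14, "inorganic_yamazaki",  95.0, 6, 3.8),
    (21, "inorganic_yamazaki",  70.0, 9, 9.5),
]
conn.executemany("INSERT INTO medium_log VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Time-series view of nutrient depletion versus growth, per culture medium.
df = pd.read_sql_query("SELECT * FROM medium_log ORDER BY day", conn)
print(df.groupby("medium")[["no3_n_mg_l", "fresh_weight_g"]].describe())
```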

An Analysis of Eye Movement in Observation According to University Students' Cognitive Style (대학생들의 인지양식에 따른 관찰에서의 안구 운동 분석)

  • Lim, Sung-Man;Choi, Hyun-Dong;Yang, Il-Ho;Jeong, Mi-Yeon
    • Journal of The Korean Association For Science Education
    • /
    • v.33 no.4
    • /
    • pp.778-793
    • /
    • 2013
  • The purpose of this study is to analyze observation characteristics through eye movement according to cognitive style. To do this, we developed observation tasks designed to reveal differences between a wholistic cognitive style group and an analytic cognitive style group, and measured the eye movements of university students with different cognitive styles while they performed an observation task. The difference between the two cognitive style groups was confirmed by analyzing the collected statistics and visualization data. The findings of this study are as follows. First, to compare fixation time and frequency, we compared the two groups' average total time spent on the observation task, the total number of fixations, and the number of fixations in the first 30 seconds. The wholistic cognitive style group showed more fixations on both measures, which means it observed more points or overall features, whereas the analytic cognitive style group tended to focus on particular details and observed fewer points. Second, to compare observation objects and areas by cognitive style, the analysis of the visualization data shows that the wholistic cognitive style group observed the spider's surrounding environment and web over a wider area, whereas the analytic cognitive style group focused its observation on the spider itself. The results therefore show differences in observation time, frequency, object, area, and ratio between the two cognitive styles. They also show that students obtain different outcomes because they attend to different information according to their cognitive style, and they help determine what kind of observation activity is more suitable for each student.
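The group comparison reported here (fixation counts for the wholistic versus analytic groups) amounts to a simple two-sample comparison, sketched below; the fixation counts are invented for illustration and the test choice is an assumption, not the authors' exact statistical procedure.

```python
# Comparing total fixation counts between two cognitive style groups.
from scipy import stats

wholistic_fixations = [212, 198, 225, 240, 205]  # invented example values
analytic_fixations  = [160, 175, 152, 168, 181]

t, p = stats.ttest_ind(wholistic_fixations, analytic_fixations, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")  # more fixations would suggest broader scanning
```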

Prediction of Key Variables Affecting NBA Playoffs Advancement: Focusing on 3 Points and Turnover Features (미국 프로농구(NBA)의 플레이오프 진출에 영향을 미치는 주요 변수 예측: 3점과 턴오버 속성을 중심으로)

  • An, Sehwan;Kim, Youngmin
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.1
    • /
    • pp.263-286
    • /
    • 2022
  • This study acquired NBA statistical information for a total of 32 years, from 1990 to 2022, using web crawling, observed the variables of interest through exploratory data analysis, and generated related derived variables. Unused variables were removed from the input data through a refinement process, and correlation analysis, t-tests, and ANOVA were performed on the remaining variables. For the variables of interest, the difference in means between the groups that advanced to the playoffs and those that did not was tested, and to supplement this, the mean differences among three groups (upper/middle/lower) based on ranking were re-examined. Of the input data, only the current season's data was used as a test set, and 5-fold cross-validation was performed by splitting the remainder into training and validation sets for model training. The overfitting problem was addressed by comparing the cross-validation results with the final results on the test set and confirming that there was no difference in the performance metrics. Because the quality of the raw data is high and the statistical assumptions are satisfied, most of the models showed good results despite the small dataset. This study not only predicts NBA game results and classifies playoff advancement using machine learning, but also examines whether the variables of interest are among the major variables with high importance by analyzing the importance of the input attributes. Through the visualization of SHAP values, it was possible to overcome the limited interpretability of feature importance alone and to compensate for the inconsistency of importance calculations when variables are added or removed. A number of variables related to three-pointers and turnovers, classified as the subjects of interest in this study, were found to be among the major variables affecting playoff advancement in the NBA. Although this study is similar to existing sports data analysis research in that it covers topics such as match results, playoffs, and championship prediction and comparatively analyzes several machine learning models, it differs in that the features of interest are set in advance and statistically verified before being compared with the machine learning results. It is also differentiated from existing studies by presenting explanatory visualization results using SHAP, one of the XAI methods.
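The modelling-and-explanation pattern described above (5-fold cross-validation of a tree-based classifier followed by SHAP-based interpretation) can be sketched as follows; the features are random stand-ins rather than the crawled NBA statistics, and the specific model is an assumption for illustration.

```python
# Cross-validated classification of playoff advancement plus SHAP explanation.
import numpy as np
import shap  # pip install shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # stand-ins for features such as 3-point and turnover stats
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=300) > 0).astype(int)  # playoffs yes/no

model = GradientBoostingClassifier(random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())  # 5-fold cross-validation

# SHAP values complement plain feature importance for interpretation.
model.fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(np.abs(shap_values).mean(axis=0))  # mean |SHAP| per feature
```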