• Title/Summary/Keyword: Data Mining Technique

Understanding the Categories and Characteristics of Depressive Moods in Chatbot Data (챗봇 데이터에 나타난 우울 담론의 범주와 특성의 이해)

  • Chin, HyoJin; Jung, Chani; Baek, Gumhee; Cha, Chiyoung; Choi, Jeonghoi; Cha, Meeyoung
    • KIPS Transactions on Software and Data Engineering, v.11 no.9, pp.381-390, 2022
  • Influenced by a culture that preferred non-face-to-face activity during the COVID-19 pandemic, chatbot usage has been accelerating. Chatbots have been used for various purposes: not only for customer service in businesses and social conversation for fun, but also for mental health. Chatbots are a platform where users can easily talk about their depressed moods because anonymity is guaranteed. However, most relevant research has been on social media data, especially Twitter data, and few studies have analyzed data from commercially used chatbots. In this study, we identified the characteristics of depressive discourse in user-chatbot interaction data by analyzing chats containing the word 'depress' with a topic modeling algorithm and text-mining techniques. We then compared these characteristics with those of depressive moods in Twitter data. Finally, we derive several design guidelines and suggest avenues for future research based on the study findings.
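
The paper's code is not included in this listing. As a minimal sketch of the general approach it names (topic modeling over chat text), the following uses scikit-learn's LDA; the sample messages, topic count, and parameters are all illustrative assumptions, not the study's pipeline.

```python
# Hedged sketch: LDA topic modeling over chat messages, in the spirit of the
# abstract's method. Sample messages and parameters are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

chats = [
    "i feel depressed and tired all the time",
    "work stress makes me so depressed lately",
    "talking to someone helps when i feel down",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(chats)          # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```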

Efficient Topic Modeling by Mapping Global and Local Topics (전역 토픽의 지역 매핑을 통한 효율적 토픽 모델링 방안)

  • Choi, Hochang; Kim, Namgyu
    • Journal of Intelligence and Information Systems, v.23 no.3, pp.69-94, 2017
  • Recently, increasing demand for big data analysis has been driving the vigorous development of related technologies and tools. In addition, the development of IT and the growing penetration of smart devices are producing large amounts of data. As a result, data analysis technology is rapidly becoming popular, and attempts to acquire insights through data analysis keep increasing; big data analysis will become more important in various industries for the foreseeable future. Big data analysis is generally performed by a small number of experts and delivered to each requester of the analysis. However, growing interest in big data analysis has stimulated computer programming education and the development of many data analysis programs. Accordingly, the entry barriers to big data analysis are gradually lowering and data analysis technology is spreading, so big data analysis is expected to be performed by the requesters themselves. Along with this, interest in various kinds of unstructured data is continually increasing, and much attention is focused on text data in particular. The emergence of new web-based platforms and techniques has brought about the mass production of text data and active attempts to analyze it, and the results of text analysis have been utilized in various fields. Text mining is a concept that embraces various theories and techniques for text analysis. Among the many text mining techniques used for various research purposes, topic modeling is one of the most widely used and studied. Topic modeling is a technique that extracts the major issues from a large set of documents, identifies the documents that correspond to each issue, and provides the identified documents as clusters. It is considered very useful in that it reflects the semantic elements of the documents. Traditional topic modeling is based on the distribution of key terms across the entire document collection, so it is essential to analyze the whole collection at once to identify the topic of each document. This makes the analysis time-consuming when topic modeling is applied to a large number of documents, and it creates a scalability problem: processing time increases exponentially with the number of analysis objects. This problem is particularly noticeable when the documents are distributed across multiple systems or regions. To overcome these problems, a divide-and-conquer approach can be applied: a large number of documents is divided into sub-units, and topics are derived by repeating topic modeling on each unit. This method enables topic modeling on a large number of documents with limited system resources and can improve processing speed. It can also significantly reduce analysis time and cost, since documents can be analyzed in each location without first combining them. Despite these advantages, however, the method has two major problems. First, the relationship between local topics derived from each unit and global topics derived from the entire collection is unclear: local topics can be identified in each unit, but global topics cannot. Second, a method for measuring the accuracy of the proposed methodology must be established; that is, assuming the global topics are the ideal answer, the deviation of local topics from global topics needs to be measured. Because of these difficulties, this approach has not been studied sufficiently compared with other work on topic modeling. In this paper, we propose a topic modeling approach that solves the above two problems. First, we divide the entire document cluster (the global set) into sub-clusters (local sets) and generate a reduced global set (RGS) consisting of delegated documents extracted from each local set. We address the first problem by mapping RGS topics to local topics. We then verify the accuracy of the proposed methodology by detecting whether documents are assigned to the same topic in the global and local results. Using 24,000 news articles, we conduct experiments to evaluate the practical applicability of the proposed methodology. Through an additional experiment, we also confirmed that the proposed methodology provides results similar to topic modeling over the entire collection, and we propose a reasonable method for comparing the results of both approaches.
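
As a rough illustration of the divide-and-conquer idea described above (local topic models plus a mapping to topics from a reduced global set), here is a hedged sketch with scikit-learn. The toy corpus, the way delegated documents are sampled, and the cosine-similarity mapping are all assumptions standing in for the paper's actual method.

```python
# Hedged sketch: run LDA on local subsets, then map each local topic to its
# nearest reduced-global topic by cosine similarity of topic-word vectors.
# Data and parameters are illustrative, not the paper's pipeline.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stock market finance trading", "bank interest rate finance",
    "soccer game score win", "baseball team game season",
] * 10  # toy corpus so the sketch runs end to end

vec = CountVectorizer()
X = vec.fit_transform(docs)            # shared vocabulary across all sets

# Two local sets (halves of the corpus), each with its own LDA model.
halves = [X[: len(docs) // 2], X[len(docs) // 2 :]]
local_topics = []
for local in halves:
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(local)
    local_topics.append(lda.components_)

# Reduced global set: here simply a sample of rows standing in for the
# "delegated documents" the abstract describes.
rgs = X[::4]
global_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(rgs)

# Map each local topic to the most similar global topic.
for i, topics in enumerate(local_topics):
    sim = cosine_similarity(topics, global_lda.components_)
    print(f"local set {i}: local->global mapping {sim.argmax(axis=1)}")
```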

Estimation of Drought Index Using CART Algorithm and Satellite Data (CART기법과 위성자료를 이용한 향상된 공간가뭄지수 산정)

  • Kim, Gwang-Seob; Park, Han-Gyun
    • Journal of the Korean Association of Geographic Information Studies, v.13 no.1, pp.128-141, 2010
  • Drought indices such as the SPI (Standardized Precipitation Index) and PDSI (Palmer Drought Severity Index) estimated from ground observations are not sufficient to describe the detailed spatial distribution of drought conditions. In this study, a drought index with improved spatial resolution was estimated using the CART algorithm and ancillary data such as MODIS NDVI, MODIS LST, land cover, rainfall, average air temperature, SPI, and PDSI. The drought index estimated with the proposed approach for the year 2008 provides better spatial information than traditional approaches. The results show that the availability of satellite imagery and various associated data allows us to obtain improved spatial drought information using a data mining technique and ancillary data, and to gain a better understanding of drought conditions and their prediction.
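
As a sketch of the kind of model the abstract describes (a CART regression tree predicting a drought index from satellite-derived predictors), here is a hedged example; the synthetic arrays merely stand in for MODIS NDVI/LST, rainfall, and temperature, and the target is a made-up SPI-like quantity.

```python
# Hedged sketch: CART-style regression tree predicting an SPI-like drought
# index from satellite and weather predictors. All data are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500
ndvi = rng.uniform(0.1, 0.9, n)        # stand-in for MODIS NDVI
lst = rng.uniform(280, 320, n)         # stand-in for MODIS LST (K)
rain = rng.gamma(2.0, 30.0, n)         # monthly rainfall (mm)
temp = rng.uniform(-5, 30, n)          # mean air temperature (C)
X = np.column_stack([ndvi, lst, rain, temp])

# Illustrative target loosely tied to rainfall and NDVI, plus noise.
spi = (0.02 * (rain - rain.mean()) + 2.0 * (ndvi - ndvi.mean())
       + rng.normal(0, 0.2, n))

tree = DecisionTreeRegressor(max_depth=5).fit(X, spi)
print("R^2 on training data:", tree.score(X, spi))
```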

OLAP and Decision Tree Analysis of Productivity Affected by Construction Duration Impact Factors (공사기간 영향요인에 따른 생산성의 OLAP 분석과 의사결정트리 분석)

  • Ryu, Han-Guk
    • Journal of the Korea Institute of Building Construction, v.11 no.2, pp.100-107, 2011
  • As construction duration significantly influences the performance and success of construction projects, it is necessary to appropriately manage the factors affecting it. Interest in construction duration has recently been rising in the construction industry, due to changes in the construction legal system and competition among construction companies over construction time. However, the impact factors are extremely diverse, and the existing productivity data on them are not sufficient to properly identify each factor and measure productivity from various perspectives, such as subcontractor, time, crew, and work type. In this respect, multidimensional analysis based on a data warehouse is very helpful for viewing how productivity is affected by impact factors from various perspectives. This research therefore proposes a method that effectively takes in the diverse productivity data on impact factors and performs a multidimensional (OLAP) analysis. Decision tree analysis, a data mining technique, is also applied in order to supply construction managers with appropriate productivity data on impact factors during the construction management process.
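
A hedged sketch of the decision-tree side of this analysis follows; the column names, factor values, and productivity figures are hypothetical placeholders, not the paper's dataset.

```python
# Hedged sketch: a decision tree relating construction-duration impact
# factors to productivity, in the spirit of the abstract. Toy data only.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

data = pd.DataFrame({
    "subcontractor": ["A", "A", "B", "B", "C", "C"],
    "crew_size":     [5, 8, 5, 8, 6, 10],
    "work_type":     ["frame", "finish", "frame", "finish", "frame", "finish"],
    "productivity":  [1.2, 0.9, 1.0, 0.8, 1.1, 0.7],   # units per man-hour
})

# One-hot encode the categorical impact factors, then fit the tree.
X = pd.get_dummies(data[["subcontractor", "crew_size", "work_type"]])
tree = DecisionTreeRegressor(max_depth=3).fit(X, data["productivity"])

# Which factors does the tree consider most influential on productivity?
print(dict(zip(X.columns, tree.feature_importances_)))
```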

Analysis of a Structure of the Kunsan Basin in Yellow Sea Using Gravity and Magnetic Data (중자력 자료를 이용한 황해 군산분지의 지질 구조 해석)

  • Park, Gye-Soon; Choi, Jong-Keun; Koo, June-Mo; Kwon, Byung-Doo
    • Journal of the Korean Earth Science Society, v.30 no.1, pp.49-57, 2009
  • We studied the structure of the Kunsan Basin in the Yellow Sea using ship-borne magnetic data and altimetry-satellite-derived gravity data provided by the Scripps Institution of Oceanography in 2006. The gravity data were analyzed via power spectrum analysis and gravity inversion, and the magnetic data via the analytic signal technique, pseudo-gravity transformation, and its inversion. The results showed that the depth of bedrock tends to increase toward the center of the South Central Sag in the Kunsan Basin, with maximum and minimum depths estimated at about 6-8 km and 2 km, respectively. In addition, the observed high gravity and magnetic anomalies were attributed to the intrusion of igneous rock of higher density than the surrounding basement rock in the center of the South Central Sag, which is consistent with the interpretation of seismic data obtained in the same region.
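
The abstract mentions power spectrum analysis of gridded gravity data for depth estimation. As a loose illustration only, here is a sketch of a radially averaged power spectrum on a synthetic grid; the grid, spacing, wavenumber band, and the standard spectral depth relation applied at the end are assumptions about the general technique, not the authors' processing.

```python
# Hedged sketch: radially averaged power spectrum of a gridded anomaly. In
# spectral depth estimation, the slope of ln(power) vs. wavenumber relates
# to mean source depth (depth ~ -slope / (4*pi) for wavenumber in cycles/km).
import numpy as np

grid = np.random.default_rng(0).normal(size=(128, 128))  # stand-in grid
dx = 1.0                                                 # grid spacing (km)

spec = np.abs(np.fft.fftshift(np.fft.fft2(grid))) ** 2
f = np.fft.fftshift(np.fft.fftfreq(128, d=dx))           # cycles/km
ky, kx = np.meshgrid(f, f, indexing="ij")
kr = np.hypot(kx, ky)                                    # radial wavenumber

# Radially average the spectrum in wavenumber bins.
bins = np.linspace(0, kr.max(), 30)
idx = np.digitize(kr.ravel(), bins)
radial = np.array([spec.ravel()[idx == i].mean() for i in range(1, len(bins))])

# Fit ln(power) vs. wavenumber over a low-wavenumber band; the slope gives
# an estimate of mean depth to the deeper sources.
k_mid = 0.5 * (bins[:-1] + bins[1:])
slope = np.polyfit(k_mid[1:10], np.log(radial[1:10]), 1)[0]
print("estimated mean source depth (km):", -slope / (4 * np.pi))
```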

Emotion Prediction of Paragraph using Big Data Analysis (빅데이터 분석을 이용한 문단 내의 감정 예측)

  • Kim, Jin-su
    • Journal of Digital Convergence, v.14 no.11, pp.267-273, 2016
  • The creation and sharing of structured as well as various unstructured data have been progressing actively with the spread of mobile devices. Recently, big data techniques have been used to extract semantic information from SNS, and data mining is one such technique. In particular, general emotion analysis, which expresses the collective intelligence of the masses, is performed using large and varied corpora. In this paper, we propose an emotion prediction system architecture that extracts significant keywords from social network paragraphs using n-grams and a Korean morphological analyzer, and predicts the emotion with an SVM over the extracted emotion features. The proposed system achieved an average recall rate of 82.25%, an improvement over previous systems, and it should help extract semantic keywords using morphological analysis.
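
A hedged sketch of the pipeline shape the abstract describes (n-gram features plus an SVM) follows. A real reproduction would tokenize Korean text with a morphological analyzer such as KoNLPy; plain whitespace tokenization and toy English labels are used here instead.

```python
# Hedged sketch: n-gram features + linear SVM for emotion classification,
# analogous to the abstract's pipeline. Texts and labels are toy data.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["so happy today", "this is terrible news", "what a joyful day",
         "i am angry and sad", "feeling great and happy", "awful, very sad"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram and bigram features
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["today was terrible"]))
```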

A Study on analysis of severity-adjustment length of stay in hospital for community-acquired pneumonia (지역사회획득 폐렴 환자의 중증도 보정 재원일수 분석)

  • Kim, Yoo-Mi; Choi, Yun-Kyoung; Kang, Sung-Hong; Kim, Won-Joong
    • Journal of the Korea Academia-Industrial cooperation Society, v.12 no.3, pp.1234-1243, 2011
  • This study was carried out to develop a severity-adjustment model for length of stay (LOS) in hospital for community-acquired pneumonia and to analyze the factors behind variation in LOS. The subjects were 5,353 community-acquired pneumonia inpatients from the Korean National Hospital Discharge In-depth Injury Survey data from 2004 through 2006. The data were analyzed using t-tests and ANOVA, and the severity-adjustment model was developed using a data mining technique. LOS differed by gender, age, type of insurance, and type of admission, but not by whether the patient died in hospital. After computing the standardized difference between crude and expected LOS, we analyzed the variation in LOS for community-acquired pneumonia. LOS varied by region and insurance type, though not according to whether patients received care in their region of residence. Variation in LOS that remains after controlling for case mix or severity of illness can be explained by provider factors. These supply-side factors in LOS variation should be studied further with respect to individual practice styles, patient management practices, and healthcare resources and environment. We expect that severity-adjustment models built on administrative databases can be adapted to other diseases in practice.
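
As a minimal sketch of the severity-adjustment idea the abstract outlines (predict an expected LOS from patient factors, then standardize the crude-minus-expected difference), here is a hedged example on synthetic data; the columns and the decision-tree model are illustrative stand-ins for the paper's data-mining model.

```python
# Hedged sketch: severity-adjusted LOS. Predict expected LOS from patient
# factors, then standardize the difference between crude and expected LOS.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(1, 95, n),
    "male": rng.integers(0, 2, n),
    "emergency_admission": rng.integers(0, 2, n),
})
# Synthetic LOS loosely driven by age and admission type, plus noise.
df["los"] = 3 + 0.05 * df["age"] + 2 * df["emergency_admission"] + rng.gamma(2, 1, n)

features = ["age", "male", "emergency_admission"]
model = DecisionTreeRegressor(max_depth=4).fit(df[features], df["los"])
df["expected_los"] = model.predict(df[features])

# Standardized difference; large positive values flag stays longer than
# severity alone would predict.
diff = df["los"] - df["expected_los"]
df["z_los"] = (diff - diff.mean()) / diff.std()
print(df[["los", "expected_los", "z_los"]].head())
```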

Classification of Very High Concerns HRCT Images using Extended Bayesian Networks (확장 베이지안망을 적용한 고위험성 HRCT 영상 분류)

  • Lim, Chae-Gyun; Jung, Yong-Gyu
    • Journal of the Institute of Electronics Engineers of Korea CI, v.49 no.2, pp.7-12, 2012
  • Recently, the medical field has been investigating the application of various data mining techniques, including decision trees, neural networks, and Bayesian networks, to process vast amounts of information efficiently. In addition to basic personal information, patient history, and family history, additional information such as MRI and HRCT images is commonly collected and leveraged in disease diagnosis to improve diagnostic accuracy. In real-world situations, however, many variables affect the results, so the information obtainable through any single data mining technique is fairly limited. Even minor features of medical images can affect the diagnosis, and the growing share of subjective judgment makes full automation a difficult problem. To deal with such complex situations, extended Bayesian network models, such as TAN or models improved by the K2 search algorithm, which handle relative probabilities in a multivariate setting, have been proposed. Since the search algorithm applied significantly influences the performance characteristics of an extended Bayesian network, an evaluation of the performance and suitability of each technique is required. In this paper, we applied extended Bayesian networks to disease diagnosis on the same data, measuring classification accuracy while varying the search algorithm between K2 and TAN. A 10-fold cross-validation experiment was performed to compare performance, and the analysis showed that high-risk patients could be identified from HRCT images.
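
As a hedged sketch of the evaluation setup the abstract describes (10-fold cross-validated classification), the example below uses a plain naive Bayes model as a simplified stand-in for the extended Bayesian networks (TAN, K2-searched structures), which would require a dedicated Bayesian-network library; the feature matrix is a synthetic stand-in for HRCT-derived features.

```python
# Hedged sketch: 10-fold cross-validated classification. GaussianNB stands
# in for the paper's extended Bayesian networks; data are synthetic.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                 # 8 image-derived features
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)  # risk flag

scores = cross_val_score(GaussianNB(), X, y, cv=10)   # 10-fold CV accuracy
print("mean accuracy:", scores.mean())
```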

The development of symmetrically and attributably pure confidence in association rule mining (연관성 규칙에서 활용 가능한 대칭적 기여 순수 신뢰도의 개발)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society, v.25 no.3, pp.601-609, 2014
  • The most widely used data mining technique for big data analysis is the generation of meaningful association rules. This method finds relationships between sets of items based on association criteria such as support, confidence, and lift. Among them, confidence is the most frequently used, but it has the drawback that it does not indicate the direction of the association. The attributably pure confidence was developed to compensate for this drawback, but its value changes depending on the positions of the two item sets. In this paper, we propose four symmetrically and attributably pure confidence measures that compensate for the shortcomings of confidence and the attributably pure confidence. We then prove the three conditions for an interestingness measure given by Piatetsky-Shapiro, and present comparative studies of confidence, attributably pure confidence, and the four proposed measures using numerical examples. The results show that the symmetrically and attributably pure confidence measures are better than confidence and the attributably pure confidence, and the measure NSAP is found to be the best among the four.
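
For reference, here is a short sketch of the basic association criteria the abstract names (support, confidence, lift), computed from item-set counts; the counts are illustrative, and the paper's symmetric attributably pure confidence measures, whose exact formulas are not given in this listing, build on these quantities.

```python
# Hedged sketch: support, confidence, and lift for a rule A -> B, from
# illustrative transaction counts. Note confidence is directional, which is
# the drawback the abstract's symmetric measures address.
n = 1000          # total transactions
n_a = 400         # transactions containing item set A
n_b = 300         # transactions containing item set B
n_ab = 200        # transactions containing both A and B

support = n_ab / n
confidence = n_ab / n_a                  # P(B | A)
lift = confidence / (n_b / n)            # confidence relative to P(B)

print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
print(f"reverse confidence={n_ab / n_b:.3f}")   # confidence(B -> A) differs
```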

A Recognition and Application Plan of Placenta Chamber of King Sejong's Princes by Big Data Analytical Technique (빅데이터 분석기법을 통한 성주(星州) 세종대왕자태실(世宗大王子胎室)의 인식 및 활용방안)

  • Lim, Jin-Kang; Park, Ji-Hwan
    • Journal of the Korean Institute of Traditional Landscape Architecture, v.36 no.1, pp.78-88, 2018
  • The purpose of this study is to establish a utilization plan based on the cultural value of the Placenta Chamber of King Sejong's Princes. We used SNS to analyze various public perceptions and opinions, collecting and analyzing the data. The collection period was from June 1, 2007 to June 30, 2017 (about 10 years). We gathered data from blogs, cafes, and Knowledge iN posts containing keywords related to 'Placenta Chamber', 'Placenta Chamber of Seongju', and 'Placenta Chamber of King Sejong's Princes', and analyzed them using the text mining functions of a big data program. Based on the main results of the big data analysis, a utilization plan for the Placenta Chamber was derived. Major keywords such as King Sejong the Great, prince, Seongju, feng shui, culture, preservation, and blessing emerged. The association among 'world', 'heritage', and 'cultural heritage' was high, as was the connection among 'Placenta Chamber', 'Gyeongsangbuk-do', and 'cultural property', confirming the value of the Placenta Chamber as a world cultural heritage site. It is also necessary to encourage visitors to experience stimulation or a change of surroundings through facility refurbishment and environmental improvement around the Placenta Chamber.
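
As a minimal sketch of the keyword-frequency and keyword-association analysis the abstract reports, the example below counts word frequencies and pairwise co-occurrences; the posts are placeholders, and a real analysis of Korean SNS text would tokenize with a morphological analyzer.

```python
# Hedged sketch: keyword frequency and pairwise co-occurrence counts over
# SNS posts, the kind of text-mining summary the abstract describes.
from collections import Counter
from itertools import combinations

posts = [
    "placenta chamber seongju world heritage",
    "king sejong princes placenta chamber culture",
    "seongju cultural heritage preservation",
]

# Word frequencies across all posts.
freq = Counter(word for p in posts for word in p.split())

# Co-occurrence: count each unordered word pair appearing in the same post.
cooc = Counter(pair for p in posts
               for pair in combinations(sorted(set(p.split())), 2))

print(freq.most_common(5))
print(cooc.most_common(5))
```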