• Title/Summary/Keyword: Business Intelligence system

Search Result 757, Processing Time 0.031 seconds

A Study on Developing a VKOSPI Forecasting Model via GARCH Class Models for Intelligent Volatility Trading Systems (지능형 변동성트레이딩시스템개발을 위한 GARCH 모형을 통한 VKOSPI 예측모형 개발에 관한 연구)

  • Kim, Sun-Woong
    • Journal of Intelligence and Information Systems
    • /
    • v.16 no.2
    • /
    • pp.19-32
    • /
    • 2010
  • Volatility plays a central role in both academic and practical applications, especially in pricing financial derivative products and trading volatility strategies. This study presents a novel mechanism based on generalized autoregressive conditional heteroskedasticity (GARCH) models that is able to enhance the performance of intelligent volatility trading systems by predicting Korean stock market volatility more accurately. In particular, we embedded the concept of the volatility asymmetry documented widely in the literature into our model. The newly developed Korean stock market volatility index of KOSPI 200, VKOSPI, is used as a volatility proxy. It is the price of a linear portfolio of the KOSPI 200 index options and measures the effect of the expectations of dealers and option traders on stock market volatility for 30 calendar days. The KOSPI 200 index options market started in 1997 and has become the most actively traded market in the world. Its trading volume is more than 10 million contracts a day and records the highest of all the stock index option markets. Therefore, analyzing the VKOSPI has great importance in understanding volatility inherent in option prices and can afford some trading ideas for futures and option dealers. Use of the VKOSPI as volatility proxy avoids statistical estimation problems associated with other measures of volatility since the VKOSPI is model-free expected volatility of market participants calculated directly from the transacted option prices. This study estimates the symmetric and asymmetric GARCH models for the KOSPI 200 index from January 2003 to December 2006 by the maximum likelihood procedure. Asymmetric GARCH models include GJR-GARCH model of Glosten, Jagannathan and Runke, exponential GARCH model of Nelson and power autoregressive conditional heteroskedasticity (ARCH) of Ding, Granger and Engle. Symmetric GARCH model indicates basic GARCH (1, 1). Tomorrow's forecasted value and change direction of stock market volatility are obtained by recursive GARCH specifications from January 2007 to December 2009 and are compared with the VKOSPI. Empirical results indicate that negative unanticipated returns increase volatility more than positive return shocks of equal magnitude decrease volatility, indicating the existence of volatility asymmetry in the Korean stock market. The point value and change direction of tomorrow VKOSPI are estimated and forecasted by GARCH models. Volatility trading system is developed using the forecasted change direction of the VKOSPI, that is, if tomorrow VKOSPI is expected to rise, a long straddle or strangle position is established. A short straddle or strangle position is taken if VKOSPI is expected to fall tomorrow. Total profit is calculated as the cumulative sum of the VKOSPI percentage change. If forecasted direction is correct, the absolute value of the VKOSPI percentage changes is added to trading profit. It is subtracted from the trading profit if forecasted direction is not correct. For the in-sample period, the power ARCH model best fits in a statistical metric, Mean Squared Prediction Error (MSPE), and the exponential GARCH model shows the highest Mean Correct Prediction (MCP). The power ARCH model best fits also for the out-of-sample period and provides the highest probability for the VKOSPI change direction tomorrow. Generally, the power ARCH model shows the best fit for the VKOSPI. All the GARCH models provide trading profits for volatility trading system and the exponential GARCH model shows the best performance, annual profit of 197.56%, during the in-sample period. The GARCH models present trading profits during the out-of-sample period except for the exponential GARCH model. During the out-of-sample period, the power ARCH model shows the largest annual trading profit of 38%. The volatility clustering and asymmetry found in this research are the reflection of volatility non-linearity. This further suggests that combining the asymmetric GARCH models and artificial neural networks can significantly enhance the performance of the suggested volatility trading system, since artificial neural networks have been shown to effectively model nonlinear relationships.

Intelligent Brand Positioning Visualization System Based on Web Search Traffic Information : Focusing on Tablet PC (웹검색 트래픽 정보를 활용한 지능형 브랜드 포지셔닝 시스템 : 태블릿 PC 사례를 중심으로)

  • Jun, Seung-Pyo;Park, Do-Hyung
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.3
    • /
    • pp.93-111
    • /
    • 2013
  • As Internet and information technology (IT) continues to develop and evolve, the issue of big data has emerged at the foreground of scholarly and industrial attention. Big data is generally defined as data that exceed the range that can be collected, stored, managed and analyzed by existing conventional information systems and it also refers to the new technologies designed to effectively extract values from such data. With the widespread dissemination of IT systems, continual efforts have been made in various fields of industry such as R&D, manufacturing, and finance to collect and analyze immense quantities of data in order to extract meaningful information and to use this information to solve various problems. Since IT has converged with various industries in many aspects, digital data are now being generated at a remarkably accelerating rate while developments in state-of-the-art technology have led to continual enhancements in system performance. The types of big data that are currently receiving the most attention include information available within companies, such as information on consumer characteristics, information on purchase records, logistics information and log information indicating the usage of products and services by consumers, as well as information accumulated outside companies, such as information on the web search traffic of online users, social network information, and patent information. Among these various types of big data, web searches performed by online users constitute one of the most effective and important sources of information for marketing purposes because consumers search for information on the internet in order to make efficient and rational choices. Recently, Google has provided public access to its information on the web search traffic of online users through a service named Google Trends. Research that uses this web search traffic information to analyze the information search behavior of online users is now receiving much attention in academia and in fields of industry. Studies using web search traffic information can be broadly classified into two fields. The first field consists of empirical demonstrations that show how web search information can be used to forecast social phenomena, the purchasing power of consumers, the outcomes of political elections, etc. The other field focuses on using web search traffic information to observe consumer behavior, identifying the attributes of a product that consumers regard as important or tracking changes on consumers' expectations, for example, but relatively less research has been completed in this field. In particular, to the extent of our knowledge, hardly any studies related to brands have yet attempted to use web search traffic information to analyze the factors that influence consumers' purchasing activities. This study aims to demonstrate that consumers' web search traffic information can be used to derive the relations among brands and the relations between an individual brand and product attributes. When consumers input their search words on the web, they may use a single keyword for the search, but they also often input multiple keywords to seek related information (this is referred to as simultaneous searching). A consumer performs a simultaneous search either to simultaneously compare two product brands to obtain information on their similarities and differences, or to acquire more in-depth information about a specific attribute in a specific brand. Web search traffic information shows that the quantity of simultaneous searches using certain keywords increases when the relation is closer in the consumer's mind and it will be possible to derive the relations between each of the keywords by collecting this relational data and subjecting it to network analysis. Accordingly, this study proposes a method of analyzing how brands are positioned by consumers and what relationships exist between product attributes and an individual brand, using simultaneous search traffic information. It also presents case studies demonstrating the actual application of this method, with a focus on tablets, belonging to innovative product groups.

A Study of the Reactive Movement Synchronization for Analysis of Group Flow (그룹 몰입도 판단을 위한 움직임 동기화 연구)

  • Ryu, Joon Mo;Park, Seung-Bo;Kim, Jae Kyeong
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.1
    • /
    • pp.79-94
    • /
    • 2013
  • Recently, the high value added business is steadily growing in the culture and art area. To generated high value from a performance, the satisfaction of audience is necessary. The flow in a critical factor for satisfaction, and it should be induced from audience and measures. To evaluate interest and emotion of audience on contents, producers or investors need a kind of index for the measurement of the flow. But it is neither easy to define the flow quantitatively, nor to collect audience's reaction immediately. The previous studies of the group flow were evaluated by the sum of the average value of each person's reaction. The flow or "good feeling" from each audience was extracted from his face, especially, the change of his (or her) expression and body movement. But it was not easy to handle the large amount of real-time data from each sensor signals. And also it was difficult to set experimental devices, in terms of economic and environmental problems. Because, all participants should have their own personal sensor to check their physical signal. Also each camera should be located in front of their head to catch their looks. Therefore we need more simple system to analyze group flow. This study provides the method for measurement of audiences flow with group synchronization at same time and place. To measure the synchronization, we made real-time processing system using the Differential Image and Group Emotion Analysis (GEA) system. Differential Image was obtained from camera and by the previous frame was subtracted from present frame. So the movement variation on audience's reaction was obtained. And then we developed a program, GEX(Group Emotion Analysis), for flow judgment model. After the measurement of the audience's reaction, the synchronization is divided as Dynamic State Synchronization and Static State Synchronization. The Dynamic State Synchronization accompanies audience's active reaction, while the Static State Synchronization means to movement of audience. The Dynamic State Synchronization can be caused by the audience's surprise action such as scary, creepy or reversal scene. And the Static State Synchronization was triggered by impressed or sad scene. Therefore we showed them several short movies containing various scenes mentioned previously. And these kind of scenes made them sad, clap, and creepy, etc. To check the movement of audience, we defined the critical point, ${\alpha}$and ${\beta}$. Dynamic State Synchronization was meaningful when the movement value was over critical point ${\beta}$, while Static State Synchronization was effective under critical point ${\alpha}$. ${\beta}$ is made by audience' clapping movement of 10 teams in stead of using average number of movement. After checking the reactive movement of audience, the percentage(%) ratio was calculated from the division of "people having reaction" by "total people". Total 37 teams were made in "2012 Seoul DMC Culture Open" and they involved the experiments. First, they followed induction to clap by staff. Second, basic scene for neutralize emotion of audience. Third, flow scene was displayed to audience. Forth, the reversal scene was introduced. And then 24 teams of them were provided with amuse and creepy scenes. And the other 10 teams were exposed with the sad scene. There were clapping and laughing action of audience on the amuse scene with shaking their head or hid with closing eyes. And also the sad or touching scene made them silent. If the results were over about 80%, the group could be judged as the synchronization and the flow were achieved. As a result, the audience showed similar reactions about similar stimulation at same time and place. Once we get an additional normalization and experiment, we can obtain find the flow factor through the synchronization on a much bigger group and this should be useful for planning contents.

Attitude Confidence and User Resistance for Purchasing Wearable Devices on Virtual Reality: Based on Virtual Reality Headgears (가상현실 웨어러블 기기의 구매 촉진을 위한 태도 자신감과 사용자 저항 태도: 가상현실 헤드기어를 중심으로)

  • Sohn, Bong-Jin;Park, Da-Sul;Choi, Jaewon
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.3
    • /
    • pp.165-183
    • /
    • 2016
  • Over the past decade, there has been a rapid diffusion of technological devices and a rising number of various devices, resulting in an escalation of virtual reality technology. Technological market has rapidly been changed from smartphone to wearable devices based on virtual reality. Virtual reality can make users feel real situation through sensing interaction, voice, motion capture and so on. Facebook.com, Google, Samsung, LG, Sony and so on have investigated developing platform of virtual reality. the pricing of virtual reality devices also had decreased into 30% from their launched period. Thus market infrastructure in virtual reality have rapidly been developed to crease marketplace. However, most consumers recognize that virtual reality is not ease to purchase or use. That could not lead consumers to positive attitude for devices and purchase the related devices in the early market. Through previous studies related to virtual reality, there are few studies focusing on why the devices for virtual reality stayed in early stage in adoption & diffusion context in the market. Almost previous studies considered the reasons of hard adoption for innovative products in the viewpoints of Typology of Innovation Resistance, MIR(Management of Innovation Resistant), UTAUT & UTAUT2. However, product-based antecedents also important to increase user intention to purchase and use products in the technological market. In this study, we focus on user acceptance and resistance for increasing purchase and usage promotions of wearable devices related to virtual reality based on headgear products like Galaxy Gear. Especially, we added a variables like attitude confidence as a dimension for user resistance. The research questions of this study are follows. First, how attitude confidence and innovativeness resistance affect user intention to use? Second, What factors related to content and brand contexts can affect user intention to use? This research collected data from the participants who have experiences using virtual rality headgears aged between 20s to 50s located in South Korea. In order to collect data, this study used a pilot test and through making face-to-face interviews on three specialists, face validity and content validity were evaluated for the questionnaire validity. Cleansing the data, we dropped some outliers and data of irrelevant papers. Totally, 156 responses were used for testing the suggested hypotheses. Through collecting data, demographics and the relationships among variables were analyzed through conducting structural equation modeling by PLS. The data showed that the sex of respondents who have experience using social commerce sites (male=86(55.1%), female=70(44.9%). The ages of respondents are mostly from 20s (74.4%) to 30s (16.7%). 126 respondents (80.8%) have used virtual reality devices. The results of our model estimation are as follows. With the exception of Hypothesis 1 and 7, which deals with the two relationships between brand awareness to attitude confidence, and quality of content to perceived enjoyment, all of our hypotheses were supported. In compliance with our hypotheses, perceived ease of use (H2) and use innovativeness (H3) were supported with its positively influence for the attitude confidence. This finding indicates that the more ease of use and innovativeness for devices increased, the more users' attitude confidence increased. Perceived price (H4), enjoyment (H5), Quantity of contents (H6) significantly increase user resistance. However, perceived price positively affect user innovativeness resistance meanwhile perceived enjoyment and quantity of contents negatively affect user innovativeness resistance. In addition, aesthetic exterior (H6) was also positively associated with perceived price (p<0.01). Also projection quality (H8) can increase perceived enjoyment (p<0.05). Finally, attitude confidence (H10) increased user intention to use virtual reality devices. however user resistance (H11) negatively affect user intention to use virtual reality devices. The findings of this study show that attitude confidence and user innovativeness resistance differently influence customer intention for using virtual reality devices. There are two distinct characteristic of attitude confidence: perceived ease of use and user innovativeness. This study identified the antecedents of different roles of perceived price (aesthetic exterior) and perceived enjoyment (quality of contents & projection quality). The findings indicated that brand awareness and quality of contents for virtual reality is not formed within virtual reality market yet. Therefore, firms should developed brand awareness for their product in the virtual market to increase market share.

Methodology for Identifying Issues of User Reviews from the Perspective of Evaluation Criteria: Focus on a Hotel Information Site (사용자 리뷰의 평가기준 별 이슈 식별 방법론: 호텔 리뷰 사이트를 중심으로)

  • Byun, Sungho;Lee, Donghoon;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.3
    • /
    • pp.23-43
    • /
    • 2016
  • As a result of the growth of Internet data and the rapid development of Internet technology, "big data" analysis has gained prominence as a major approach for evaluating and mining enormous data for various purposes. Especially, in recent years, people tend to share their experiences related to their leisure activities while also reviewing others' inputs concerning their activities. Therefore, by referring to others' leisure activity-related experiences, they are able to gather information that might guarantee them better leisure activities in the future. This phenomenon has appeared throughout many aspects of leisure activities such as movies, traveling, accommodation, and dining. Apart from blogs and social networking sites, many other websites provide a wealth of information related to leisure activities. Most of these websites provide information of each product in various formats depending on different purposes and perspectives. Generally, most of the websites provide the average ratings and detailed reviews of users who actually used products/services, and these ratings and reviews can actually support the decision of potential customers in purchasing the same products/services. However, the existing websites offering information on leisure activities only provide the rating and review based on one stage of a set of evaluation criteria. Therefore, to identify the main issue for each evaluation criterion as well as the characteristics of specific elements comprising each criterion, users have to read a large number of reviews. In particular, as most of the users search for the characteristics of the detailed elements for one or more specific evaluation criteria based on their priorities, they must spend a great deal of time and effort to obtain the desired information by reading more reviews and understanding the contents of such reviews. Although some websites break down the evaluation criteria and direct the user to input their reviews according to different levels of criteria, there exist excessive amounts of input sections that make the whole process inconvenient for the users. Further, problems may arise if a user does not follow the instructions for the input sections or fill in the wrong input sections. Finally, treating the evaluation criteria breakdown as a realistic alternative is difficult, because identifying all the detailed criteria for each evaluation criterion is a challenging task. For example, if a review about a certain hotel has been written, people tend to only write one-stage reviews for various components such as accessibility, rooms, services, or food. These might be the reviews for most frequently asked questions, such as distance between the nearest subway station or condition of the bathroom, but they still lack detailed information for these questions. In addition, in case a breakdown of the evaluation criteria was provided along with various input sections, the user might only fill in the evaluation criterion for accessibility or fill in the wrong information such as information regarding rooms in the evaluation criteria for accessibility. Thus, the reliability of the segmented review will be greatly reduced. In this study, we propose an approach to overcome the limitations of the existing leisure activity information websites, namely, (1) the reliability of reviews for each evaluation criteria and (2) the difficulty of identifying the detailed contents that make up the evaluation criteria. In our proposed methodology, we first identify the review content and construct the lexicon for each evaluation criterion by using the terms that are frequently used for each criterion. Next, the sentences in the review documents containing the terms in the constructed lexicon are decomposed into review units, which are then reconstructed by using the evaluation criteria. Finally, the issues of the constructed review units by evaluation criteria are derived and the summary results are provided. Apart from the derived issues, the review units are also provided. Therefore, this approach aims to help users save on time and effort, because they will only be reading the relevant information they need for each evaluation criterion rather than go through the entire text of review. Our proposed methodology is based on the topic modeling, which is being actively used in text analysis. The review is decomposed into sentence units rather than considering the whole review as a document unit. After being decomposed into individual review units, the review units are reorganized according to each evaluation criterion and then used in the subsequent analysis. This work largely differs from the existing topic modeling-based studies. In this paper, we collected 423 reviews from hotel information websites and decomposed these reviews into 4,860 review units. We then reorganized the review units according to six different evaluation criteria. By applying these review units in our methodology, the analysis results can be introduced, and the utility of proposed methodology can be demonstrated.

The Pattern Analysis of Financial Distress for Non-audited Firms using Data Mining (데이터마이닝 기법을 활용한 비외감기업의 부실화 유형 분석)

  • Lee, Su Hyun;Park, Jung Min;Lee, Hyoung Yong
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.4
    • /
    • pp.111-131
    • /
    • 2015
  • There are only a handful number of research conducted on pattern analysis of corporate distress as compared with research for bankruptcy prediction. The few that exists mainly focus on audited firms because financial data collection is easier for these firms. But in reality, corporate financial distress is a far more common and critical phenomenon for non-audited firms which are mainly comprised of small and medium sized firms. The purpose of this paper is to classify non-audited firms under distress according to their financial ratio using data mining; Self-Organizing Map (SOM). SOM is a type of artificial neural network that is trained using unsupervised learning to produce a lower dimensional discretized representation of the input space of the training samples, called a map. SOM is different from other artificial neural networks as it applies competitive learning as opposed to error-correction learning such as backpropagation with gradient descent, and in the sense that it uses a neighborhood function to preserve the topological properties of the input space. It is one of the popular and successful clustering algorithm. In this study, we classify types of financial distress firms, specially, non-audited firms. In the empirical test, we collect 10 financial ratios of 100 non-audited firms under distress in 2004 for the previous two years (2002 and 2003). Using these financial ratios and the SOM algorithm, five distinct patterns were distinguished. In pattern 1, financial distress was very serious in almost all financial ratios. 12% of the firms are included in these patterns. In pattern 2, financial distress was weak in almost financial ratios. 14% of the firms are included in pattern 2. In pattern 3, growth ratio was the worst among all patterns. It is speculated that the firms of this pattern may be under distress due to severe competition in their industries. Approximately 30% of the firms fell into this group. In pattern 4, the growth ratio was higher than any other pattern but the cash ratio and profitability ratio were not at the level of the growth ratio. It is concluded that the firms of this pattern were under distress in pursuit of expanding their business. About 25% of the firms were in this pattern. Last, pattern 5 encompassed very solvent firms. Perhaps firms of this pattern were distressed due to a bad short-term strategic decision or due to problems with the enterpriser of the firms. Approximately 18% of the firms were under this pattern. This study has the academic and empirical contribution. In the perspectives of the academic contribution, non-audited companies that tend to be easily bankrupt and have the unstructured or easily manipulated financial data are classified by the data mining technology (Self-Organizing Map) rather than big sized audited firms that have the well prepared and reliable financial data. In the perspectives of the empirical one, even though the financial data of the non-audited firms are conducted to analyze, it is useful for find out the first order symptom of financial distress, which makes us to forecast the prediction of bankruptcy of the firms and to manage the early warning and alert signal. These are the academic and empirical contribution of this study. The limitation of this research is to analyze only 100 corporates due to the difficulty of collecting the financial data of the non-audited firms, which make us to be hard to proceed to the analysis by the category or size difference. Also, non-financial qualitative data is crucial for the analysis of bankruptcy. Thus, the non-financial qualitative factor is taken into account for the next study. This study sheds some light on the non-audited small and medium sized firms' distress prediction in the future.

Strategy for Store Management Using SOM Based on RFM (RFM 기반 SOM을 이용한 매장관리 전략 도출)

  • Jeong, Yoon Jeong;Choi, Il Young;Kim, Jae Kyeong;Choi, Ju Choel
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.2
    • /
    • pp.93-112
    • /
    • 2015
  • Depending on the change in consumer's consumption pattern, existing retail shop has evolved in hypermarket or convenience store offering grocery and daily products mostly. Therefore, it is important to maintain the inventory levels and proper product configuration for effectively utilize the limited space in the retail store and increasing sales. Accordingly, this study proposed proper product configuration and inventory level strategy based on RFM(Recency, Frequency, Monetary) model and SOM(self-organizing map) for manage the retail shop effectively. RFM model is analytic model to analyze customer behaviors based on the past customer's buying activities. And it can differentiates important customers from large data by three variables. R represents recency, which refers to the last purchase of commodities. The latest consuming customer has bigger R. F represents frequency, which refers to the number of transactions in a particular period and M represents monetary, which refers to consumption money amount in a particular period. Thus, RFM method has been known to be a very effective model for customer segmentation. In this study, using a normalized value of the RFM variables, SOM cluster analysis was performed. SOM is regarded as one of the most distinguished artificial neural network models in the unsupervised learning tool space. It is a popular tool for clustering and visualization of high dimensional data in such a way that similar items are grouped spatially close to one another. In particular, it has been successfully applied in various technical fields for finding patterns. In our research, the procedure tries to find sales patterns by analyzing product sales records with Recency, Frequency and Monetary values. And to suggest a business strategy, we conduct the decision tree based on SOM results. To validate the proposed procedure in this study, we adopted the M-mart data collected between 2014.01.01~2014.12.31. Each product get the value of R, F, M, and they are clustered by 9 using SOM. And we also performed three tests using the weekday data, weekend data, whole data in order to analyze the sales pattern change. In order to propose the strategy of each cluster, we examine the criteria of product clustering. The clusters through the SOM can be explained by the characteristics of these clusters of decision trees. As a result, we can suggest the inventory management strategy of each 9 clusters through the suggested procedures of the study. The highest of all three value(R, F, M) cluster's products need to have high level of the inventory as well as to be disposed in a place where it can be increasing customer's path. In contrast, the lowest of all three value(R, F, M) cluster's products need to have low level of inventory as well as to be disposed in a place where visibility is low. The highest R value cluster's products is usually new releases products, and need to be placed on the front of the store. And, manager should decrease inventory levels gradually in the highest F value cluster's products purchased in the past. Because, we assume that cluster has lower R value and the M value than the average value of good. And it can be deduced that product are sold poorly in recent days and total sales also will be lower than the frequency. The procedure presented in this study is expected to contribute to raising the profitability of the retail store. The paper is organized as follows. The second chapter briefly reviews the literature related to this study. The third chapter suggests procedures for research proposals, and the fourth chapter applied suggested procedure using the actual product sales data. Finally, the fifth chapter described the conclusion of the study and further research.

Development and application of prediction model of hyperlipidemia using SVM and meta-learning algorithm (SVM과 meta-learning algorithm을 이용한 고지혈증 유병 예측모형 개발과 활용)

  • Lee, Seulki;Shin, Taeksoo
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.2
    • /
    • pp.111-124
    • /
    • 2018
  • This study aims to develop a classification model for predicting the occurrence of hyperlipidemia, one of the chronic diseases. Prior studies applying data mining techniques for predicting disease can be classified into a model design study for predicting cardiovascular disease and a study comparing disease prediction research results. In the case of foreign literatures, studies predicting cardiovascular disease were predominant in predicting disease using data mining techniques. Although domestic studies were not much different from those of foreign countries, studies focusing on hypertension and diabetes were mainly conducted. Since hypertension and diabetes as well as chronic diseases, hyperlipidemia, are also of high importance, this study selected hyperlipidemia as the disease to be analyzed. We also developed a model for predicting hyperlipidemia using SVM and meta learning algorithms, which are already known to have excellent predictive power. In order to achieve the purpose of this study, we used data set from Korea Health Panel 2012. The Korean Health Panel produces basic data on the level of health expenditure, health level and health behavior, and has conducted an annual survey since 2008. In this study, 1,088 patients with hyperlipidemia were randomly selected from the hospitalized, outpatient, emergency, and chronic disease data of the Korean Health Panel in 2012, and 1,088 nonpatients were also randomly extracted. A total of 2,176 people were selected for the study. Three methods were used to select input variables for predicting hyperlipidemia. First, stepwise method was performed using logistic regression. Among the 17 variables, the categorical variables(except for length of smoking) are expressed as dummy variables, which are assumed to be separate variables on the basis of the reference group, and these variables were analyzed. Six variables (age, BMI, education level, marital status, smoking status, gender) excluding income level and smoking period were selected based on significance level 0.1. Second, C4.5 as a decision tree algorithm is used. The significant input variables were age, smoking status, and education level. Finally, C4.5 as a decision tree algorithm is used. In SVM, the input variables selected by genetic algorithms consisted of 6 variables such as age, marital status, education level, economic activity, smoking period, and physical activity status, and the input variables selected by genetic algorithms in artificial neural network consist of 3 variables such as age, marital status, and education level. Based on the selected parameters, we compared SVM, meta learning algorithm and other prediction models for hyperlipidemia patients, and compared the classification performances using TP rate and precision. The main results of the analysis are as follows. First, the accuracy of the SVM was 88.4% and the accuracy of the artificial neural network was 86.7%. Second, the accuracy of classification models using the selected input variables through stepwise method was slightly higher than that of classification models using the whole variables. Third, the precision of artificial neural network was higher than that of SVM when only three variables as input variables were selected by decision trees. As a result of classification models based on the input variables selected through the genetic algorithm, classification accuracy of SVM was 88.5% and that of artificial neural network was 87.9%. Finally, this study indicated that stacking as the meta learning algorithm proposed in this study, has the best performance when it uses the predicted outputs of SVM and MLP as input variables of SVM, which is a meta classifier. The purpose of this study was to predict hyperlipidemia, one of the representative chronic diseases. To do this, we used SVM and meta-learning algorithms, which is known to have high accuracy. As a result, the accuracy of classification of hyperlipidemia in the stacking as a meta learner was higher than other meta-learning algorithms. However, the predictive performance of the meta-learning algorithm proposed in this study is the same as that of SVM with the best performance (88.6%) among the single models. The limitations of this study are as follows. First, various variable selection methods were tried, but most variables used in the study were categorical dummy variables. In the case with a large number of categorical variables, the results may be different if continuous variables are used because the model can be better suited to categorical variables such as decision trees than general models such as neural networks. Despite these limitations, this study has significance in predicting hyperlipidemia with hybrid models such as met learning algorithms which have not been studied previously. It can be said that the result of improving the model accuracy by applying various variable selection techniques is meaningful. In addition, it is expected that our proposed model will be effective for the prevention and management of hyperlipidemia.

Automatic Quality Evaluation with Completeness and Succinctness for Text Summarization (완전성과 간결성을 고려한 텍스트 요약 품질의 자동 평가 기법)

  • Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.2
    • /
    • pp.125-148
    • /
    • 2018
  • Recently, as the demand for big data analysis increases, cases of analyzing unstructured data and using the results are also increasing. Among the various types of unstructured data, text is used as a means of communicating information in almost all fields. In addition, many analysts are interested in the amount of data is very large and relatively easy to collect compared to other unstructured and structured data. Among the various text analysis applications, document classification which classifies documents into predetermined categories, topic modeling which extracts major topics from a large number of documents, sentimental analysis or opinion mining that identifies emotions or opinions contained in texts, and Text Summarization which summarize the main contents from one document or several documents have been actively studied. Especially, the text summarization technique is actively applied in the business through the news summary service, the privacy policy summary service, ect. In addition, much research has been done in academia in accordance with the extraction approach which provides the main elements of the document selectively and the abstraction approach which extracts the elements of the document and composes new sentences by combining them. However, the technique of evaluating the quality of automatically summarized documents has not made much progress compared to the technique of automatic text summarization. Most of existing studies dealing with the quality evaluation of summarization were carried out manual summarization of document, using them as reference documents, and measuring the similarity between the automatic summary and reference document. Specifically, automatic summarization is performed through various techniques from full text, and comparison with reference document, which is an ideal summary document, is performed for measuring the quality of automatic summarization. Reference documents are provided in two major ways, the most common way is manual summarization, in which a person creates an ideal summary by hand. Since this method requires human intervention in the process of preparing the summary, it takes a lot of time and cost to write the summary, and there is a limitation that the evaluation result may be different depending on the subject of the summarizer. Therefore, in order to overcome these limitations, attempts have been made to measure the quality of summary documents without human intervention. On the other hand, as a representative attempt to overcome these limitations, a method has been recently devised to reduce the size of the full text and to measure the similarity of the reduced full text and the automatic summary. In this method, the more frequent term in the full text appears in the summary, the better the quality of the summary. However, since summarization essentially means minimizing a lot of content while minimizing content omissions, it is unreasonable to say that a "good summary" based on only frequency always means a "good summary" in its essential meaning. In order to overcome the limitations of this previous study of summarization evaluation, this study proposes an automatic quality evaluation for text summarization method based on the essential meaning of summarization. Specifically, the concept of succinctness is defined as an element indicating how few duplicated contents among the sentences of the summary, and completeness is defined as an element that indicating how few of the contents are not included in the summary. In this paper, we propose a method for automatic quality evaluation of text summarization based on the concepts of succinctness and completeness. In order to evaluate the practical applicability of the proposed methodology, 29,671 sentences were extracted from TripAdvisor 's hotel reviews, summarized the reviews by each hotel and presented the results of the experiments conducted on evaluation of the quality of summaries in accordance to the proposed methodology. It also provides a way to integrate the completeness and succinctness in the trade-off relationship into the F-Score, and propose a method to perform the optimal summarization by changing the threshold of the sentence similarity.

A Study on the Clustering Method of Row and Multiplex Housing in Seoul Using K-Means Clustering Algorithm and Hedonic Model (K-Means Clustering 알고리즘과 헤도닉 모형을 활용한 서울시 연립·다세대 군집분류 방법에 관한 연구)

  • Kwon, Soonjae;Kim, Seonghyeon;Tak, Onsik;Jeong, Hyeonhee
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.3
    • /
    • pp.95-118
    • /
    • 2017
  • Recent centrally the downtown area, the transaction between the row housing and multiplex housing is activated and platform services such as Zigbang and Dabang are growing. The row housing and multiplex housing is a blind spot for real estate information. Because there is a social problem, due to the change in market size and information asymmetry due to changes in demand. Also, the 5 or 25 districts used by the Seoul Metropolitan Government or the Korean Appraisal Board(hereafter, KAB) were established within the administrative boundaries and used in existing real estate studies. This is not a district classification for real estate researches because it is zoned urban planning. Based on the existing study, this study found that the city needs to reset the Seoul Metropolitan Government's spatial structure in estimating future housing prices. So, This study attempted to classify the area without spatial heterogeneity by the reflected the property price characteristics of row housing and Multiplex housing. In other words, There has been a problem that an inefficient side has arisen due to the simple division by the existing administrative district. Therefore, this study aims to cluster Seoul as a new area for more efficient real estate analysis. This study was applied to the hedonic model based on the real transactions price data of row housing and multiplex housing. And the K-Means Clustering algorithm was used to cluster the spatial structure of Seoul. In this study, data onto real transactions price of the Seoul Row housing and Multiplex Housing from January 2014 to December 2016, and the official land value of 2016 was used and it provided by Ministry of Land, Infrastructure and Transport(hereafter, MOLIT). Data preprocessing was followed by the following processing procedures: Removal of underground transaction, Price standardization per area, Removal of Real transaction case(above 5 and below -5). In this study, we analyzed data from 132,707 cases to 126,759 data through data preprocessing. The data analysis tool used the R program. After data preprocessing, data model was constructed. Priority, the K-means Clustering was performed. In addition, a regression analysis was conducted using Hedonic model and it was conducted a cosine similarity analysis. Based on the constructed data model, we clustered on the basis of the longitude and latitude of Seoul and conducted comparative analysis of existing area. The results of this study indicated that the goodness of fit of the model was above 75 % and the variables used for the Hedonic model were significant. In other words, 5 or 25 districts that is the area of the existing administrative area are divided into 16 districts. So, this study derived a clustering method of row housing and multiplex housing in Seoul using K-Means Clustering algorithm and hedonic model by the reflected the property price characteristics. Moreover, they presented academic and practical implications and presented the limitations of this study and the direction of future research. Academic implication has clustered by reflecting the property price characteristics in order to improve the problems of the areas used in the Seoul Metropolitan Government, KAB, and Existing Real Estate Research. Another academic implications are that apartments were the main study of existing real estate research, and has proposed a method of classifying area in Seoul using public information(i.e., real-data of MOLIT) of government 3.0. Practical implication is that it can be used as a basic data for real estate related research on row housing and multiplex housing. Another practical implications are that is expected the activation of row housing and multiplex housing research and, that is expected to increase the accuracy of the model of the actual transaction. The future research direction of this study involves conducting various analyses to overcome the limitations of the threshold and indicates the need for deeper research.