• Title/Summary/Keyword: Correlation clustering


Discovering Association Rules using Item Clustering on Frequent Pattern Network (빈발 패턴 네트워크에서 아이템 클러스터링을 통한 연관규칙 발견)

  • Oh, Kyeong-Jin; Jung, Jin-Guk; Ha, In-Ay; Jo, Geun-Sik
    • Journal of Intelligence and Information Systems, v.14 no.1, pp.1-17, 2008
  • Data mining is the process of discovering meaningful and useful patterns in large volumes of data. In particular, finding association rules between items in a database of customer transactions has become an important task. Since the Apriori algorithm, several data structures and algorithms have been proposed that store information compressed from the original database in order to find frequent itemsets. Although existing methods find all association rules, analyzing them takes considerable effort because there are too many rules. In this paper, we propose a new data structure, called a Frequent Pattern Network (FPN), which represents items as vertices and 2-itemsets as edges of the network. We construct the FPN from item frequencies and then use a clustering method to group the vertices of the network into clusters so that intracluster similarity is maximized and intercluster similarity is minimized. We generate association rules based on these clusters. Our experiments measured the accuracy of clustering items on the network using confidence, correlation, and edge-weight similarity methods. We also generated association rules from the clusters and compared the traditional method with ours. The results show that confidence similarity had a stronger influence than the other measures on the frequent pattern network, and that the FPN was flexible with respect to the minimum support value.

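
The network described in the abstract above (items as vertices, 2-itemsets as edges) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_fpn` and `confidence`, the toy transactions, and the use of raw co-occurrence counts as edge weights are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def build_fpn(transactions, min_support=2):
    """Return (vertices, edges) of a frequent pattern network.

    vertices: {item: support count} for items meeting min_support
    edges:    {(item_a, item_b): support count} for 2-itemsets meeting min_support
    """
    item_count = Counter()
    pair_count = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # de-duplicate, fix the pair order
        item_count.update(items)
        pair_count.update(combinations(items, 2))
    vertices = {i: c for i, c in item_count.items() if c >= min_support}
    edges = {p: c for p, c in pair_count.items() if c >= min_support}
    return vertices, edges


def confidence(vertices, edges, a, b):
    """Confidence of the rule a -> b: support(a, b) / support(a)."""
    pair = tuple(sorted((a, b)))
    return edges.get(pair, 0) / vertices[a]


# Hypothetical transactions, for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
]
vertices, edges = build_fpn(transactions, min_support=2)
print(edges)
print(confidence(vertices, edges, "butter", "bread"))
```

Confidence is asymmetric (`butter -> bread` differs from `bread -> butter`), so a clustering step built on it would need to choose a direction or combine both values into one similarity.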

Predicting stock movements based on financial news with systematic group identification (시스템적인 군집 확인과 뉴스를 이용한 주가 예측)

  • Seong, NohYoon; Nam, Kihwan
    • Journal of Intelligence and Information Systems, v.25 no.3, pp.1-17, 2019
  • Because stock price forecasting is important both academically and practically, research on stock price prediction has been actively conducted. This research is divided into studies using structured data and studies using unstructured data. With structured data such as historical stock prices and financial statements, past studies usually relied on technical analysis and fundamental analysis. In the big data era, the amount of information has increased rapidly, and artificial intelligence methods that extract meaning by quantifying text, an unstructured form of data that accounts for a large share of that information, have developed quickly. With these developments, many studies have attempted to predict stock prices from online news by applying text mining. The methodology adopted in many papers forecasts a stock price using news about the target company itself. However, according to previous research, a company's stock price is affected not only by its own news but also by news about related companies. Finding highly relevant companies is not easy because of market-wide effects and random signals. Existing studies have therefore identified relevant companies primarily through predetermined international industry classification standards. According to recent research, however, homogeneity within the sectors of the Global Industry Classification Standard varies, so forecasting stock prices from all companies in a sector, rather than from only the relevant ones, can hurt predictive performance. To overcome this limitation, we first used random matrix theory with text mining for stock prediction. When the dimension of the data is large, the classical limit theorems are no longer suitable because statistical efficiency is reduced. Therefore, a simple correlation analysis in the financial market does not reveal the true correlation. To address this, we adopt random matrix theory, which is mainly used in econophysics, to remove market-wide effects and random signals and find the true correlation between companies. With the true correlation, we perform cluster analysis to find relevant companies. Based on the clustering, we then use a multiple kernel learning algorithm, an ensemble of support vector machines, to incorporate the effects of the target firm and its relevant firms simultaneously. Each kernel is assigned to predict stock prices from features of the financial news of the target firm and its relevant firms. The results of this study are as follows. (1) Following the existing line of research, we confirmed that using news from relevant companies is an effective way to forecast stock prices. (2) Identifying relevant companies in the wrong way can lower AI prediction performance. (3) The proposed approach with random matrix theory performs better than previous studies when cluster analysis is based on the true correlation obtained by removing market-wide effects and random signals. The contributions of this study are as follows. First, it shows that random matrix theory, which is used mainly in econophysics, can be combined with artificial intelligence to produce good methodologies. This suggests that it is important not only to develop AI algorithms but also to adopt theory from physics, and it extends existing research that integrated artificial intelligence with complex systems theory through transfer entropy. Second, it stresses that finding the right companies in the stock market is an important issue. This suggests that it is important to study not only artificial intelligence algorithms but also how to adjust the input values on theoretical grounds. Third, we confirmed that firms grouped together under the Global Industry Classification Standard (GICS) may have low relevance to one another, and suggest that relevance should be defined theoretically rather than simply read off the GICS.
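
The correlation-cleaning step described above can be sketched with NumPy. The abstract does not give the exact procedure, so this example assumes the filter commonly used in econophysics: eigenvalues of the correlation matrix below the Marchenko-Pastur upper edge are treated as noise, and the largest eigenvalue is treated as the market-wide mode. The function name `rmt_filter` and the synthetic returns are hypothetical.

```python
import numpy as np


def rmt_filter(returns):
    """Filter a correlation matrix with a random-matrix-theory threshold.

    returns: array of shape (T, N), T observations of N assets.
    Yields (filtered_corr, lambda_max), where lambda_max is the
    Marchenko-Pastur upper edge for a pure-noise correlation matrix.
    """
    T, N = returns.shape
    corr = np.corrcoef(returns, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    lambda_max = (1.0 + np.sqrt(N / T)) ** 2
    keep = eigvals > lambda_max   # drop eigenvalues compatible with noise
    keep[-1] = False              # drop the largest one (assumed market mode)
    vecs = eigvecs[:, keep]
    filtered = (vecs * eigvals[keep]) @ vecs.T
    return filtered, lambda_max


# Synthetic returns: one market factor, two group factors, plus noise.
rng = np.random.default_rng(0)
T, N = 500, 10
market = rng.normal(size=(T, 1))
groups = np.repeat(rng.normal(size=(T, 2)), 5, axis=1)  # assets 0-4 and 5-9
returns = market + groups + rng.normal(size=(T, N))

filtered, lambda_max = rmt_filter(returns)
print(round(lambda_max, 4))
print(np.round(filtered[:2, :2], 2))
```

The filtered matrix is no longer a proper correlation matrix (its diagonal is not 1), so a clustering step would typically use it as a similarity score or rescale it first.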

Correlation Between Sasang Constitution and Heart Rate Variability in Won-ju Rural Population (원주 지역 주민들의 사상체질과 심박수변이도와의 상관성)

  • Kim, Soo-Yeon; Sun, Seung-Ho; Yoo, Jun-Sang; Koh, Sang-Baek; Park, Jong-Ku
    • The Journal of Internal Korean Medicine, v.30 no.3, pp.510-524, 2009
  • Objective: This study was designed to find the correlation between Sasang Constitution and heart rate variability (HRV). Method: There were 665 subjects (280 men and 385 women) between 39 and 72 years old in a rural community. Sasang Constitution was diagnosed by a Sasang constitutional specialist using PSSC (Phonetic System for Sasang Constitution), face and tongue photos, and a checkup list. A structured questionnaire was used to assess general characteristics. HRV was recorded using the SA-2000 (medi-core) and assessed by time-domain and frequency-domain analysis. Metabolic syndrome was defined on the basis of clustering of risk factors, that is, when three or more of the following cardiovascular risk factors were present: blood pressure, fasting blood sugar, triglyceride, HDL-cholesterol, and abdominal obesity (waist). Because the data were skewed, a logarithmic transformation was applied to the absolute units of the spectral components of HRV, and the resulting logarithmic values and normalized units were compared between groups by logistic regression. The 95% confidence interval (CI) of the odds ratio was calculated from the data laid out for a cross-sectional study. Results: 1. Odds ratios of Taeeumin and Soeumin in female adults below 60 years old were significantly lower than that of Soyangin for LF norm and LF/HF ratio, and significantly higher than that of Soyangin for HF norm. 2. There was no significant correlation between HRV and Sasang Constitution in female adults aged 60 years and over. 3. There was no significant correlation between HRV and Sasang Constitution in male adults. Conclusion: There is a statistically significant correlation between HRV and Sasang Constitution. Sympathetic activity tends to be increased in Soyangin, and parasympathetic activity tends to be decreased in Taeeumin and Soeumin.

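
As a simplified illustration of the odds ratio and 95% confidence interval mentioned in the abstract above, the sketch below computes both from a 2x2 table using the standard error of the log odds ratio. The study itself used logistic regression, so this is a stand-in, and the counts are hypothetical values chosen only to make the example run.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    z: normal quantile (1.96 for a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study.
odds_ratio, lower, upper = odds_ratio_ci(a=20, b=80, c=10, d=90)
print(odds_ratio, round(lower, 2), round(upper, 2))
```

An interval that excludes 1 corresponds to a statistically significant association at the chosen level.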

Analysis of Reading Domain of Men and Women Elderly Using Book Lending Data (도서 대출데이터를 활용한 남녀 노령자의 독서 주제 분석)

  • Cho, Jane
    • Journal of Korean Library and Information Science Society, v.50 no.1, pp.23-41, 2019
  • This study identifies the subject domains of books read by elderly men and women by analyzing a PFNET built from library big data, and confirms how they differ from those of adults aged 30-40. A co-occurrence matrix of book lending was extracted from the popular book lists in the library big data for four groups: elderly men, elderly women, adult men, and adult women. FP network analysis was performed on these matrices. Pearson correlation analysis based on the Triangle Betweenness Centrality calculated for the loaned books was then performed to understand the correlation among the four clusters created by the PNNC algorithm. The results reveal a reading trend focused on modern Korean novels among the elderly regardless of gender; elderly men in particular show an extreme concentration on modern Korean long-series novels. In the correlation analysis, elderly men showed a weak negative correlation with adult men (r = -0.222) and negative correlations with all the other groups, indicating that the book-lending tendency of elderly men runs opposite to that of the other groups.

Bioclimatic Classification and Characterization in South Korea (남한의 생물기후권역 구분과 특성 규명)

  • Choi, Yu-Young; Lim, Chul-Hee; Ryu, Ji-Eun; Piao, Dongfan; Kang, Jin-Young; Zhu, Weihong; Cui, Guishan; Lee, Woo-Kyun; Jeon, Seong-Woo
    • Journal of the Korean Society of Environmental Restoration Technology, v.20 no.3, pp.1-18, 2017
  • This study constructed a high-resolution bioclimatic classification map of South Korea that classifies land into homogeneous zones with similar environmental properties, using statistical techniques more advanced than those of existing ecological area classification studies. The climate data provided by WorldClim (1960-1990) were used to generate 27 bioclimatic variables affecting biological habitats, and key environmental variables were derived through correlation analysis and principal component analysis. Clustering analysis was performed using the ISODATA method to construct a bioclimatic classification map at 30-arc-second (~1 km) resolution. South Korea was divided into 21 regions, and the classification results were verified by correlation analysis with Gross Primary Production (GPP) and the Actual Vegetation map produced by the Ministry of Environment. Each zone was described and named according to its environmental characteristics and major vegetation distribution. This study could provide useful spatial frameworks to support ecosystem research, monitoring, and policy decisions.

Optimal Associative Neighborhood Mining using Representative Attribute (대표 속성을 이용한 최적 연관 이웃 마이닝)

  • Jung Kyung-Yong
    • Journal of the Institute of Electronics Engineers of Korea CI, v.43 no.4 s.310, pp.50-57, 2006
  • In electronic commerce, most recent personalized recommender systems apply the collaborative filtering technique. This method calculates similarity weights among users with similar preferences in order to predict and recommend items that match users' tastes. The Pearson correlation coefficient is commonly used for this purpose. However, a correlation can be calculated only when there are items that both users have rated in common, so prediction accuracy falls. The similarity weight affects not only the prediction of items that match users' tastes but also the performance of the personalized recommender system. In this study, we examine the similarity weight when vector similarity, entropy, inverse user frequency, and default voting from the information retrieval field are applied, and verify the improvement in prediction accuracy through an experiment. The results show that combining the entropy-based similarity weight with default voting gave the best performance.
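
The limitation described above, that Pearson correlation needs co-rated items, and the default-voting remedy can be sketched as follows. This is a simplified illustration rather than the paper's method: the function name `pearson_similarity`, the toy ratings, and the choice to fill every missing rating over the union of items with a single default value are assumptions.

```python
from math import sqrt


def pearson_similarity(ratings_a, ratings_b, default_vote=None):
    """Pearson correlation between two users' ratings.

    Without default_vote, only items rated by both users are compared.
    With default_vote, the union of items is used and missing ratings
    are filled with that value (a simple form of default voting).
    """
    if default_vote is None:
        items = sorted(ratings_a.keys() & ratings_b.keys())
    else:
        items = sorted(ratings_a.keys() | ratings_b.keys())
    if len(items) < 2:
        return 0.0  # too little overlap to compute a correlation
    xs = [ratings_a.get(i, default_vote) for i in items]
    ys = [ratings_b.get(i, default_vote) for i in items]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant ratings carry no correlation information
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings on a 1-5 scale.
alice = {"A": 5, "B": 3, "C": 4}
bob = {"A": 4, "B": 2, "D": 5}
co_rated_only = pearson_similarity(alice, bob)
with_default = pearson_similarity(alice, bob, default_vote=3)
print(co_rated_only, round(with_default, 4))
```

With only two co-rated items the plain coefficient is forced to an extreme value, which illustrates why sparse overlap makes the similarity weight unreliable; filling in default votes uses more of each profile at the cost of introducing assumed ratings.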

Evaluating Cross-correlation of GOSAT CO2 Concentration with MODIS NDVI Patterns in North-East Asia (동북아시아에서 GOSAT CO2와 MODIS 식생지수 분포의 상관성 분석)

  • Choi, Jin Ho; Joo, Seung Min; Um, Jung Sup
    • Spatial Information Research, v.21 no.5, pp.15-22, 2013
  • The purpose of this work is to investigate the correlation between $CO_2$ concentration and NDVI (Normalized Difference Vegetation Index) in North-East Asia. Geographically weighted regression techniques were used to evaluate the spatial relationships between GOSAT (Greenhouse gases Observing SATellite) $CO_2$ measurements and the MODIS (Moderate Resolution Imaging Spectroradiometer) vegetation index. The results reveal that $CO_2$ concentration is negatively associated with NDVI. The analysis of the Global Moran's I index and Anselin Local Moran's I showed spatial autocorrelation in the overall spatial patterns of $CO_2$ and NDVI; both data sets exhibited clustered patterns. The results show that carbon dioxide concentration has non-random distribution patterns in relation to NDVI clusters, which indicates that intense development activities such as deforestation are influencing carbon dioxide emission across the area of analysis. However, because the concentration of carbon dioxide varies with a variety of factors such as artificial sources, plant respiration, and absorption and discharge by the ocean, follow-up studies are required to evaluate the correlations among more related variables.
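
The Global Moran's I statistic mentioned above can be written out in a few lines. This is a minimal sketch of the textbook formula, I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, applied to hypothetical values on a four-cell chain; the function name `morans_i` and the binary adjacency weights are assumptions and are not taken from the study.

```python
def morans_i(values, weights):
    """Global Moran's I for values with a spatial weight matrix.

    values:  list of n observations
    weights: n x n nested list, weights[i][j] > 0 when i and j are neighbors
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / w_total) * (cross / variance_sum)


# Four cells in a row; each cell neighbors the adjacent cells.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 2, 3, 4], chain)    # similar values sit together
alternating = morans_i([1, 4, 1, 4], chain)  # neighbors are dissimilar
print(clustered, alternating)
```

Positive values indicate that similar values cluster in space, and negative values indicate that neighbors tend to differ. Testing significance would additionally require a permutation test or an analytical variance, which this sketch omits.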

Relationships of Colorectal Cancer with Dietary Factors and Public Health Indicators: an Ecological Study

  • Abbastabar, Hedayat; Roustazadeh, Abazar; Alizadeh, Ali; Hamidifard, Parvin; Valipour, Mehrdad; Valipour, Ali Asghar
    • Asian Pacific Journal of Cancer Prevention, v.16 no.9, pp.3991-3995, 2015
  • Background: Colorectal cancer (CRC) is the third most common cancer in Iranian women and the fifth in men. The aims of this study were to investigate the relation of dietary factors and public health indicators to its development. Materials and Methods: The required information (2001-2006) about risk factors was obtained from the Non-Communicable Disease Surveillance Centre (NCDSC) of Iran. Risk factor data (RFD) from 89,404 individuals (15-64 years old) were gathered by questionnaire and laboratory examinations through a cross-sectional study in all provinces using a systematic cluster sampling method. CRC incidence segregated by age and gender was obtained from the Cancer Registry Ministry of Health (CRMH) of Iran. First, correlation coefficients were used for data analysis, and then multiple regression analysis was performed to control for confounding factors. Results: Colorectal cancer incidence showed a positive relationship with diabetes mellitus, hypertension, lacking or low physical activity, high education, high intake of dairy products, and non-consumption of vegetables and fruits. Conclusions: We concluded that many dietary factors and public health indicators have positive relationships with CRC and might therefore be targets of preliminary prevention. However, since this is an ecological study limited by potential ecological fallacy, the results must be interpreted with caution.

Innovation of technology and social changes - quantitative analysis based on patent big data (기술의 진보와 혁신, 그리고 사회변화: 특허빅데이터를 이용한 정량적 분석)

  • Kim, Yongdai; Jong, Sang Jo; Jang, Woncheol; Lee, Jongsu
    • The Korean Journal of Applied Statistics, v.29 no.6, pp.1025-1039, 2016
  • We introduce various methods to investigate the relations between technological innovation and social change by analyzing more than 4 million patents registered at the United States Patent and Trademark Office (USPTO) from 1985 to 2015. First, we review the history of patent law and its relation to quantitative changes in registered patents. Second, we investigate differences in technical innovation among several countries using cluster analysis based on the numbers of patents registered in several technical sectors. Third, we introduce the PageRank algorithm for identifying important nodes in network-type data and apply it to find important technical sectors based on citation information between registered patents. Finally, we explain how to use canonical correlation analysis to study the relationship between technical innovation and social change.
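
The PageRank step described above can be sketched as a plain power iteration over a citation graph. This is a generic illustration, not the authors' code: the function name `pagerank`, the damping factor of 0.85, the uniform redistribution of rank from nodes without outgoing citations, and the three-sector toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    links: {node: [nodes it cites]}. Nodes that appear only as targets
    are included automatically. Rank from nodes with no outgoing links
    is spread uniformly over all nodes.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical sectors: A and B both cite C, and C cites nothing.
citations = {"A": ["C"], "B": ["C"], "C": []}
ranks = pagerank(citations)
print({k: round(v, 4) for k, v in ranks.items()})
```

A sector that is cited by many others, or by highly ranked ones, ends up with a larger score, which is the sense in which the abstract uses PageRank to find important technical sectors.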

Traffic Attributes Correlation Mechanism based on Self-Organizing Maps for Real-Time Intrusion Detection (실시간 침입탐지를 위한 자기 조직화 지도(SOM)기반 트래픽 속성 상관관계 메커니즘)

  • Hwang, Kyoung-Ae; Oh, Ha-Young; Lim, Ji-Young; Chae, Ki-Joon; Nah, Jung-Chan
    • The KIPS Transactions:PartC, v.12C no.5 s.101, pp.649-658, 2005
  • Because network-based attacks cause extensive damage, it is very important to detect intrusions quickly, at an early stage. However, intrusion detection based on supervised learning requires either preprocessing of enormous amounts of data or analysis by a manager. It also faces two difficulties in detecting abnormal traffic: the manager's analysis may be incorrect, and real-time detection may be missed. In this paper, we propose a traffic attributes correlation analysis mechanism based on self-organizing maps (SOM) for real-time intrusion detection. The proposed mechanism has three steps. First, unsupervised learning builds map clusters composed of similar traffic. Second, each map cluster is labeled to divide the map into normal and abnormal traffic; in this step, a rule is created through correlation analysis with the SOM. Finally, the mechanism performs real-time detection and updates itself gradually. In extensive experiments, the proposed mechanism, which combines unsupervised and supervised learning, performed better in real-time intrusion detection than supervised learning alone.