• Title/Summary/Keyword: cluster method

Online news-based stock price forecasting considering homogeneity in the industrial sector (산업군 내 동질성을 고려한 온라인 뉴스 기반 주가예측)

  • Seong, Nohyoon;Nam, Kihwan
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.1-19 / 2018
  • Because forecasting stock movements matters both academically and practically, stock price prediction has been studied actively. Stock price forecasting research is classified by whether it uses structured or unstructured data, and more specifically into technical analysis, fundamental analysis, and media effect analysis. In the big data era, research combining stock price prediction with big data is actively underway, and such research mainly relies on machine learning techniques. Methods that incorporate media effects have recently attracted particular attention, and studies that analyze online news to forecast stock prices are becoming mainstream. Most previous studies that predict stock prices from online news perform sentiment analysis of the news, build a separate corpus for each company, and construct a dictionary that predicts stock prices by recording responses according to past stock prices. Existing studies have therefore examined the impact of online news on individual companies. For example, stock movements of Samsung Electronics are predicted using only online news about Samsung Electronics. Methods that consider influences among highly related companies have also been studied recently. For example, stock movements of Samsung Electronics are predicted using news about Samsung Electronics and a highly related company such as LG Electronics. These previous studies examine the effects of news from a homogeneous industrial sector on an individual company, with homogeneous industries classified according to the Global Industry Classification Standard (GICS). In other words, the existing studies assume that industries as divided by GICS are homogeneous. However, existing studies are limited in that they do not take into account influential, highly relevant companies or reflect heterogeneity within the same GICS sector. Our examination of various sectors shows that some industrial sectors are not homogeneous groups. To overcome these limitations, our study proposes a methodology that applies k-means clustering to reflect the heterogeneous effects within an industrial sector that affect stock prices. Multiple Kernel Learning is mainly used to integrate data with various characteristics; it has several kernels, each of which receives and predicts from different data. We used Multiple Kernel Learning to incorporate the effects of the target firm and its relevant firms simultaneously. Each kernel was assigned to predict stock prices from financial news variables of the groups defined by the target firm, the industrial group, and k-means cluster analysis. To test the proposed methodology, experiments were conducted on three years of online news and stock prices. The results are as follows. (1) Information about the industrial sectors related to the target company is also meaningful for predicting the target company's stock movements, and the machine learning algorithm has better predictive power when news of the relevant companies and the target company's news are considered together. (2) It is important to vary the number of clusters according to the level of homogeneity in the industrial sector. When stock prices within an industrial sector are homogeneous, it is important to use the relational effect at the industry-group level without clustering, or to use a small number of clusters. When stock prices within an industry group are heterogeneous, it is important to cluster the firms into groups. This study contributes by showing that firms classified under GICS are heterogeneous and by suggesting that relevance should be defined through machine learning and statistical analysis rather than simply by the GICS classification. It also contributes by demonstrating the effectiveness of a prediction model that reflects heterogeneity.
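
The clustering step this abstract describes can be sketched roughly as follows. This is a minimal illustration of k-means under stated assumptions, not the authors' implementation: the two features per firm, their values, and the deterministic initialisation are all hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Lloyd's k-means on equal-length tuples; returns (labels, centroids)."""
    n = len(points)
    # Deterministic initialisation (an assumption made for reproducibility):
    # take k points spread evenly through the input order.
    centroids = [points[i * n // k] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical (mean return, volatility) features for six firms in one sector.
firms = [(0.010, 0.020), (0.020, 0.010), (0.015, 0.015),
         (0.300, 0.280), (0.310, 0.290), (0.290, 0.300)]
labels, centroids = kmeans(firms, k=2)
print(labels)
```

In practice the number of clusters would be varied with the sector's homogeneity, as result (2) suggests; a single cluster corresponds to treating the whole sector as one group.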

Evaluation of Ecological Niche for Major Tree Species in the Natural Deciduous Forest of Mt. Chumbong (점봉산(點鳳山) 일대(一帶) 천연활엽수림(天然闊葉樹林)의 주요(主要) 구성(構成) 수종(樹種)에 대한 생태지위(生態地位) 평가(評價))

  • Kim, Guang Ze;Kim, Ji Hong
    • Journal of Korean Society of Forest Science / v.90 no.3 / pp.380-387 / 2001
  • The characteristics of ecological niche, breadth and overlap, were evaluated for seventeen major tree species in the natural deciduous forest of the Mt. Chumbong area. Using the plot sampling method, the environmental gradient for vertical niche was based on the intensity of light within the forest, and that for horizontal niche was based on multi-dimensional resources in the distribution pattern. The results showed that Fraxinus rhynchophylla had the highest vertical niche breadth and Maackia amurensis the lowest, while Acer pseudo-sieboldianum had the highest horizontal niche breadth and Betula costata the lowest. There was no significant correlation between the two measures of niche breadth. However, the tolerance index for each species was positively correlated with the values of niche breadth. Spearman's rank correlation coefficients were used to test the correlation between the species ranks of the tolerance index and those of the two ecological niche breadths. The coefficient of $r_s=0.432$ ($P \leq 0.1$) was not enough to support a significant correlation between the tolerance index and vertical niche breadth at the 95% probability level. When Carpinus cordata, which rarely reaches the forest canopy because of its growth form, was excluded from the analysis, the coefficient was $r_s=0.650$ ($P \leq 0.01$), a highly significant correlation. The Spearman's rank correlation coefficient between tolerance indices and horizontal niche breadth was $r_s=0.797$ ($P \leq 0.01$), also highly significant. Four distinctive species groups, produced by cluster analysis based on ecological niche overlap for each pair of species, were in considerable accord with the positively associated species constellation pattern produced by the inter-species association analysis.
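
For readers unfamiliar with the statistic used in this abstract, the following is a small sketch of Spearman's rank correlation, computed as the Pearson correlation of average ranks so that ties are handled. The tolerance and niche-breadth values are made up for illustration and are not taken from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical tolerance indices and niche breadths for five species.
tolerance = [1, 2, 3, 4, 5]
breadth = [0.2, 0.1, 0.4, 0.3, 0.5]
r_s = spearman(tolerance, breadth)
print(round(r_s, 3))  # 0.8
```

Without ties this agrees with the textbook formula $r_s = 1 - 6\sum d_i^2 / (n(n^2-1))$: here $\sum d_i^2 = 4$ and $n = 5$, giving $1 - 24/120 = 0.8$.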

A Study on Spatial Pattern of Impact Area of Intersection Using Digital Tachograph Data and Traffic Assignment Model (차량 운행기록정보와 통행배정 모형을 이용한 교차로 영향권의 공간적 패턴에 관한 연구)

  • PARK, Seungjun;HONG, Kiman;KIM, Taegyun;SEO, Hyeon;CHO, Joong Rae;HONG, Young Suk
    • Journal of Korean Society of Transportation / v.36 no.2 / pp.155-168 / 2018
  • In this study, we examined the directional pattern of vehicles entering an intersection from its upstream links, as a step toward predicting short-term (e.g., 5 or 10 minutes ahead) directional traffic volume at intersections under interrupted flow, and examined the feasibility of predicting traffic volume with a traffic assignment model. The analysis investigates the similarity of patterns by performing cluster analysis on the ratio of traffic volume by intersection direction, divided into 2-hour periods, using one week of taxi DTG (Digital Tachograph) data. To link with the results of the traffic assignment model, the study also compares the impact area within 5 or 10 minutes of the center of the intersection with the results of the taxi DTG data analysis. To do this, we developed an algorithm that sets the impact area of an intersection using the taxi DTG data and the traffic assignment model. The analysis grouped the taxi intersection entry patterns into 12 clusters, with a Cubic Clustering Criterion, which indicates the confidence level of the clustering, of 6.92. In the correlation analysis with the impact area of the traffic assignment model, the correlation coefficient for the 5-minute impact area was 0.86, a significant result. The correlation coefficient for the 10-minute impact area was somewhat lower, at 0.69, which was attributed to insufficient accuracy of the O/D (Origin/Destination) trip and network data. If the accuracy of the traffic network and of time-specific O/D traffic improves, traffic volume data calculated from the traffic assignment model are expected to be usable for controlling traffic signals at intersections.

Evaluation Criteria and Preferred Image of Jeans Products based on Benefit Segmentation (진 제품 구매자의 추구혜택에 따른 평가기준 및 선호 이미지)

  • Park, Na-Ri;Park, Jae-Ok
    • Journal of the Korean Society of Clothing and Textiles / v.31 no.6 s.165 / pp.974-984 / 2007
  • The purpose of this study was to identify differences in evaluation criteria and in preferred images among consumer groups of jeans products segmented by the benefits they seek. Male and female Korean university students participated in the study. Quota sampling based on the gender and residential area of the respondents was used to collect the data. Data from 492 questionnaires were used in the analysis. Factor analysis, Cronbach's alpha coefficient, cluster analysis, one-way ANOVA, and post-hoc tests were conducted. As a result, respondents who seek multiple benefits rated aesthetic criteria (e.g., color, style, design, fit) and quality performance criteria (e.g., durability, ease of care, contractibility, flexibility) as more important when evaluating and purchasing jeans products. Respondents who seek brand name rated extrinsic criteria (e.g., brand reputation, status symbol, country of origin, fashionability) as more important than did respondents who seek economic efficiency. Respondents who seek multiple benefits such as attractiveness, fashion, individuality, and utility tend to prefer all the images when wearing jeans products: individual, active, sexual, sophisticated, and simple. Respondents who seek fashion are likely to prefer an individual image, and respondents who seek brand name prefer both an individual image and a polished image. Meanwhile, respondents who seek economic efficiency have a lower preference for sexual and polished images.

Analysis of Utilization Characteristics, Health Behaviors and Health Management Level of Participants in Private Health Examination in a General Hospital (일개 종합병원의 민간 건강검진 수검자의 검진이용 특성, 건강행태 및 건강관리 수준 분석)

  • Kim, Yoo-Mi;Park, Jong-Ho;Kim, Won-Joong
    • Journal of the Korea Academia-Industrial cooperation Society / v.14 no.1 / pp.301-311 / 2013
  • This study aims to analyze the utilization characteristics, health behaviors, and health management level of private health examination recipients in one general hospital. To this end, we analyzed 150,501 private health examination records covering the 11 years from 2001 to 2011 for the 20,696 participants examined in 2011 at the health examination center of a general hospital in Dae-Jeon. Cluster analysis to classify the private health examination groups used the K-means clustering method with z-score standardization. Logistic regression, decision tree, and neural network analyses were used to build a classification model for periodic versus non-periodic private health examination. From the new private health examination participants, 1,000 people with a high probability of becoming non-periodic participants were selected as a target group for customer management. According to the results, private health examination participants were categorized into new, periodic, and non-periodic groups. New participants were more often 30-39 years old than in other age groups and more often suspected of having renal disease. Periodic participants were more often male and more often suspected of having hyperlipidemia. Non-periodic participants were more often smokers with sedentary habits and more often suspected of having anemia and diabetes mellitus. In the decision tree analysis, the variables related to non-periodic participation were sex, age, residence, exercise, anemia, hyperlipidemia, diabetes mellitus, obesity, and liver disease. In particular, 71.4% of non-periodic participants were female, non-anemic, non-exercising, and suspected of obesity. Operating a customized customer management program for private health examinations will contribute to the efficiency of the health examination center.
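
The z-score standardization mentioned in this abstract is typically applied before K-means so that variables on different scales contribute comparably to the distance. A minimal sketch follows, assuming the population standard deviation is used and with hypothetical (age, number of visits) rows; the study's actual variables are not given in the abstract.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [
        # A constant column has sigma 0; map it to 0.0 instead of dividing by zero.
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, mus, sigmas))
        for row in rows
    ]


# Hypothetical (age, number of examination visits) rows.
participants = [(30, 1), (40, 3), (50, 5)]
standardized = zscore_columns(participants)
print(standardized)
```

After this step both columns have the same spread, so age (tens of years) no longer dominates visit counts (single digits) in a Euclidean distance.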

Development and Analysis of COMS AMV Target Tracking Algorithm using Gaussian Cluster Analysis (가우시안 군집분석을 이용한 천리안 위성의 대기운동벡터 표적추적 알고리듬 개발 및 분석)

  • Oh, Yurim;Kim, Jae Hwan;Park, Hyungmin;Baek, Kanghyun
    • Korean Journal of Remote Sensing / v.31 no.6 / pp.531-548 / 2015
  • Atmospheric Motion Vectors (AMVs) derived from satellite images have shown a Slow Speed Bias (SSB) in comparison with rawinsonde observations. The SSB originates from tracking, selection, and height assignment errors, of which height assignment is known to be the leading error. However, recent work has shown that height assignment error cannot fully explain the SSB. This paper attempts a new approach to examine whether the SSB of COMS AMVs can be reduced by using a new target tracking algorithm. Tracking error can be caused by the averaging of various wind patterns within a target and by changes in cloud shape over time during the search process. To overcome this problem, a Gaussian Mixture Model (GMM) was adopted to extract the coldest cluster as the target, since the shape of such a target is less subject to transformation. An image filtering scheme is then applied to give more weight to the selected coldest pixels than to the others, which makes the target easier to track. When AMVs derived from our algorithm with the sum of squared distance method and the current COMS AMVs are compared with rawinsonde observations, our products show a noticeable improvement over the COMS products, with mean wind speed increased by $2.7\,\mathrm{m\,s^{-1}}$ and SSB reduced by 29%. However, the bias statistics show a negative impact at mid/low levels with our algorithm, and the number of vectors is reduced by 40% relative to COMS. Therefore, further study is required to improve accuracy for mid/low-level winds and to increase the number of AMVs.

Genetic Relationship and Characteristics Using Microsatellite DNA Loci in Horse Breeds. (Microsatellite DNA를 이용한 말 집단의 유전적 특성 및 유연 관계)

  • Cho, Gil-Jae
    • Journal of Life Science / v.17 no.5 s.85 / pp.699-705 / 2007
  • The present study was conducted to investigate the genetic characteristics of the Korean native horse (KNH) and to establish a parentage verification system for it. A total of 192 horses from six horse breeds, including the KNH, were genotyped using 17 microsatellite loci with a multiplex PCR procedure. The number of alleles per locus varied from 5 to 10, with a mean of 7.35, in the KNH. The expected and observed heterozygosity ranged from 0.387 to 0.841 (mean 0.702) and from 0.429 to 0.905 (mean 0.703), respectively. The total exclusion probability of the 17 microsatellite loci was 0.9999. Of the 17 markers, AHT4, AHT5, CA425, HMS2, HMS3, HTG10, LEX3 and VHL20 had relatively high PIC values (>0.7). Compared with other horse populations, the KNH had specific alleles: the P allele at AHT5, the Q and R alleles at ASB23, the H allele at CA425, the S allele at HMS3, the J allele at HTG10, and the J allele at LEX3. The results also showed two distinct clusters: the Korean native horse cluster (Korean native horse, Mongolian horse) and the European cluster (Jeju racing horse, Thoroughbred horse). These results provide basic information for detecting genetic markers of the KNH and have high potential for parentage verification and individual identification of the KNH.

A Study on the Asia Container Ports Clustering Using Hierarchical Clustering(Single, Complete, Average, Centroid Linkages) Methods with Empirical Verification of Clustering Using the Silhouette Method and the Second Stage(Type II) Cross-Efficiency Matrix Clustering Model (계층적 군집분석(최단, 최장, 평균, 중앙연결)방법에 의한 아시아 컨테이너 항만의 클러스터링 측정 및 실루엣방법과 2단계(Type II) 교차효율성 메트릭스 군집모형을 이용한 실증적 검증에 관한 연구)

  • Park, Ro-Kyung
    • Journal of Korea Port Economic Association / v.37 no.1 / pp.31-70 / 2021
  • The purpose of this paper is to measure changes in clustering, analyze the empirical results, and choose clustering partner ports for Busan, Incheon, and Gwangyang ports by applying hierarchical clustering (single, complete, average, and centroid linkage), Silhouette, and 2SCE [Second Stage (Type II) cross-efficiency] matrix clustering models to Asian container ports over the period 2009-2018. The models use number of cranes, depth, berth length, and total area as inputs and container TEU as output. The main empirical results are as follows. First, ranked by the increase in efficiency over the 10-year analysis period, the order is Silhouette (up 0.4052), hierarchical clustering (up 0.3097), and 2SCE (up 0.1057). Second, according to empirical verification with the Silhouette and 2SCE models, the three Korean ports should be clustered as follows: Busan Port with Dubai, Hong Kong, and Tanjung Priok, and Incheon Port and Gwangyang Port with most ports. Third, with respect to ASEAN, suitable clusterings would be Busan Port (Singapore), Incheon Port (Tanjung Priok, Tanjung Perak, Manila, Tanjung Pelepas, Laem Chabang, and Bangkok), and Gwangyang Port (Tanjung Priok, Tanjung Perak, Port Klang, Tanjung Pelepas, Laem Chabang, and Bangkok). Fourth, Wilcoxon's signed-ranks test of the models shows that all P values are significant at an average level of 0.852, which means that the average efficiency figures and ranking orders of the models agree with each other. The policy implication is that port policy makers and port operation managers should select benchmarking ports by applying the models used in this study to port clustering, compare and analyze the development and operation plans of their own ports against them, and promptly introduce and implement the parts that require benchmarking.
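
The silhouette method named in this abstract scores how well each object fits its assigned cluster. The sketch below computes the mean silhouette coefficient for a given labelling; the four two-dimensional points are hypothetical stand-ins, not port data from the paper, and the function assumes at least two clusters.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical feature vectors forming two obvious groups.
points = [(1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters, while negative values indicate objects that sit closer to another cluster than to their own, which is why the score can be used to compare alternative clusterings.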

A study on solar radiation prediction using medium-range weather forecasts (중기예보를 이용한 태양광 일사량 예측 연구)

  • Sujin Park;Hyojeoung Kim;Sahm Kim
    • The Korean Journal of Applied Statistics / v.36 no.1 / pp.49-62 / 2023
  • Solar energy, whose share is rapidly increasing, continues to attract development and investment. As installations under the Green New Deal renewable energy policy and of home solar panels increase, the supply of solar energy in Korea is gradually expanding, and research on accurate demand prediction of power generation is actively underway. Solar radiation prediction is important because solar radiation is the factor that most influences power generation demand prediction. The main distinction of this study is that it attempts to predict solar radiation using medium-term forecast weather data, which previous studies did not use. In this paper, we combined multiple linear regression, KNN, random forest, and SVR models with the K-means clustering technique to predict hourly solar radiation by calculating the probability density function for each cluster. Before using the medium-term forecast data, mean absolute error (MAE) and root mean squared error (RMSE) were used as indicators to compare model prediction results. Data from March 1, 2017 to February 28, 2022 were converted into daily data to match the medium-term forecast data format. Comparing the predictive performance of the models, the best-performing method predicted daily solar radiation with random forest, grouped dates with similar climate factors, and calculated the probability density function of solar radiation for each cluster. When the prediction results were checked after fitting the model to the medium-term forecast data using this methodology, the prediction error was found to increase by date. This seems to be due to prediction error in the medium-term forecast weather data itself. Future studies should add exogenous variables such as precipitation from among the weather factors available in the medium-term forecast data, or apply time series clustering techniques.
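
The two error indicators named in this abstract, MAE and RMSE, can be computed as in the short sketch below. The observed and predicted radiation values are invented for illustration only.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large errors more than MAE does."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical daily solar radiation values (arbitrary units).
observed = [3.0, 5.0, 2.0, 4.0]
forecast = [2.0, 5.0, 4.0, 4.0]
print(mae(observed, forecast))             # 0.75
print(round(rmse(observed, forecast), 4))  # 1.118
```

Because RMSE squares each error before averaging, a model with a few large misses can rank worse on RMSE than on MAE, which is one reason to report both.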

Personalized Recommendation System for IPTV using Ontology and K-medoids (IPTV환경에서 온톨로지와 k-medoids기법을 이용한 개인화 시스템)

  • Yun, Byeong-Dae;Kim, Jong-Woo;Cho, Yong-Seok;Kang, Sang-Gil
    • Journal of Intelligence and Information Systems / v.16 no.3 / pp.147-161 / 2010
  • As broadcasting and communication have recently converged, communication has been joined to TV, and TV viewing has changed in many ways. IPTV (Internet Protocol Television) provides information services, movie content, broadcasts, and more over the internet, combining live programs with VOD (Video on Demand). Because it uses a communication network, it has become a new business issue. New technical issues have also arisen: imaging technology for the service, networking technology that avoids video interruptions, security technologies to protect copyright, and so on. Through the IPTV network, users can watch the programs they want when they want. However, IPTV makes it difficult to find programs, whether by search or by menu. The menu approach takes a long time to reach a desired program. The search approach fails when the title, genre, actors' names, etc. are not known. Entering text with a remote control is also cumbersome. The bigger problem, however, is that users are often not aware of the services available to them. Thus, to resolve the difficulty of selecting VOD services in IPTV, a personalized recommendation service is proposed, which enhances users' satisfaction and uses their time efficiently. This paper addresses IPTV's shortcomings with a filtering and recommendation system that provides programs suited to each individual and saves time. The proposed recommendation system collects TV program information, the user's preferred program genres and detailed genres, channels, watched programs, and viewing times, based on each individual's IPTV viewing records. To find similarities among these, an ontology for TV programs is used, because the distance between programs can be measured by similarity comparison. The TV program ontology we use is extracted from TV-Anytime metadata, which represents semantic properties. The ontology also expresses content and features numerically. Vocabulary similarity is determined through WordNet. All the words describing the programs are expanded into upper and lower classes to determine word similarity, and the average over the descriptive keywords is measured. Based on the calculated distance, similar programs are grouped by the K-medoids partitioning method, which divides objects into groups with similar characteristics. The K-medoids method sets K representative objects. Objects are then assigned temporarily to clusters according to their distance from the representative objects. To divide the initial n objects into K groups, the optimal representative objects must be found through repeated trials after selecting representative objects temporarily. Through this process, similar programs are clustered. When programs are selected through cluster analysis, weights should be given to the recommendation, as follows. When each group recommends programs, programs near the representative objects are recommended to users. The distance is calculated with the same formula used to measure similarity distance, and it serves as the basic figure that determines the rankings of recommended programs. A weight is calculated from the number of programs in the watching list: the more programs there are, the higher the weight. This is defined as the cluster weight. Through this, the TV programs that are representative of each group are selected, and the final TV program ranks are determined. However, the group-representative TV programs include errors. Therefore, weights for TV program viewing preference are added to determine the final ranks. Based on this, content that users are likely to prefer is recommended. An experiment on the proposed method was carried out in a controlled environment. The experiment shows the superiority of the proposed method over existing methods.
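
The K-medoids step described in this abstract (pick K representative objects, assign every object to its nearest representative, then re-pick representatives until nothing changes) can be sketched as below. This is an illustrative alternating variant, not the authors' implementation: the keyword sets, the Jaccard distance standing in for the ontology-based distance, and the deterministic initialisation are all assumptions.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two non-empty keyword sets."""
    return 1 - len(a & b) / len(a | b)


def assign(items, medoids, dist):
    """Map each medoid index to the indices of the items nearest to it."""
    clusters = {m: [] for m in medoids}
    for i, item in enumerate(items):
        nearest = min(medoids, key=lambda m: dist(item, items[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(items, k, dist, max_iter=100):
    n = len(items)
    # Deterministic initialisation (an assumption): indices spread evenly.
    medoids = [i * n // k for i in range(k)]
    for _ in range(max_iter):
        clusters = assign(items, medoids, dist)
        # Within each cluster, the new medoid is the member with the
        # smallest total distance to all other members.
        updated = [
            min(members, key=lambda c: sum(dist(items[c], items[j]) for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if set(updated) == set(medoids):
            break  # representatives stopped changing
        medoids = updated
    return medoids, assign(items, medoids, dist)


# Hypothetical keyword sets describing six TV programs.
programs = [
    {"news", "politics"}, {"news", "economy"}, {"news", "politics", "economy"},
    {"drama", "romance"}, {"drama", "comedy"}, {"drama", "romance", "comedy"},
]
medoids, clusters = k_medoids(programs, k=2, dist=jaccard_distance)
print(medoids, clusters)
```

Unlike k-means, the representatives are always actual programs, so programs near a medoid can be recommended directly, and the size of each cluster could serve as an input to a cluster weight of the kind the abstract describes.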