• Title/Summary/Keyword: Multiple clustering

SKU recommender system for retail stores that carry identical brands using collaborative filtering and hybrid filtering (협업 필터링 및 하이브리드 필터링을 이용한 동종 브랜드 판매 매장간(間) 취급 SKU 추천 시스템)

  • Joe, Denis Yongmin;Nam, Kihwan
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.4
    • /
    • pp.77-110
    • /
    • 2017
  • Recently, consumption patterns have rapidly diversified and individualized through Internet-based web and mobile devices. As a result, efficient operation of the offline store, a traditional distribution channel, has become more important. To raise both sales and profits, stores need to supply and sell the products most attractive to consumers in a timely manner. However, there is a lack of research on which SKUs, out of many products, can increase the probability of sale and reduce inventory costs. In particular, if a company sells products through multiple stores across multiple locations, recommending SKUs that appeal to customers would help increase store sales and profitability. In this study, recommender system techniques used for personalized recommendation (collaborative filtering and hybrid filtering) are proposed as an SKU recommendation method at the store level for a distribution company that handles a single brand through multiple sales stores across countries and regions. We calculated the similarity between stores using the purchase data of the items each store handles, applied collaborative filtering to each store's sales history by SKU, and finally recommended individual SKUs to each store. In addition, the stores were classified into four clusters through PCA (Principal Component Analysis) and cluster analysis using store profile data. The hybrid filtering method applies collaborative filtering within each cluster, and the performance of both methods was measured on actual sales data. Most existing recommender systems have been studied for recommending items such as movies and music to users. In practice, industrial applications have also become popular. 
In the meantime, there has been little research on applying these recommender systems, which have mainly been studied in the field of personalization services, to recommend SKUs to the individual stores of distributors handling similar brands. Whereas existing recommendation methodology targeted the individual, this study extends the scope to the store: it addresses the store units of a distribution company handling the same brand's SKUs through multiple sales stores across countries and regions, and proposes a recommendation method for them. In addition, whereas existing recommender systems have been limited to online use, this study extends their use offline and applies data mining techniques to develop an algorithm suited to store-level rather than individual-level analysis. The significance of the results is that a personalization recommendation algorithm is applied to multiple sales outlets handling the same brand, a meaningful result is derived, and a concrete methodology that actual companies can build and use as a system is proposed. It is also meaningful as the first attempt to extend academic research on recommender systems, which has focused on the personalization domain, to the sales stores of a company handling the same brand. From 05 to 03 in 2014, the number of stores' sales volume of the top 100 SKUs are limited to 52 SKUs by collaborative filtering and the hybrid filtering method SKU recommended. We compared the performance of the two recommendation methods by totaling the sales results. 
The two methods are compared because offline collaborative filtering is defined in this study as the reference model, against which the proposed method must demonstrate higher performance. The results of this reference model are compared with those of the hybrid filtering method, a model that reflects the characteristics of offline stores. The proposed method showed higher performance than the existing recommendation method, which was verified using actual sales data of large Korean apparel companies. In this study, we propose a method for extending recommender systems from the individual level to the group level and for approaching this extension efficiently. Beyond its theoretical framework, the approach is of great value.
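
The store-level collaborative filtering described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy sales figures, the store and SKU names, the use of cosine similarity, and the `recommend` helper are not taken from the paper. Passing `peers` (for example, the stores in the same cluster) loosely mimics the hybrid variant that applies collaborative filtering within each cluster.

```python
from math import sqrt

# Hypothetical toy data: units sold per SKU at each store (not from the paper).
sales = {
    "store_A": {"sku1": 10, "sku2": 5, "sku3": 0},
    "store_B": {"sku1": 8, "sku2": 4, "sku3": 6},
    "store_C": {"sku1": 0, "sku2": 1, "sku3": 9},
}


def cosine(u, v):
    """Cosine similarity between two sparse sales vectors stored as dicts."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(target, sales, peers=None, top_n=1):
    """Rank SKUs the target store does not sell, weighting each peer
    store's sales by its similarity to the target store."""
    if peers is None:
        peers = [s for s in sales if s != target]
    scores = {}
    for other in peers:
        if other == target:
            continue
        sim = cosine(sales[target], sales[other])
        for sku, qty in sales[other].items():
            if sales[target].get(sku, 0) == 0:
                scores[sku] = scores.get(sku, 0.0) + sim * qty
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Plain collaborative filtering over all stores.
print(recommend("store_A", sales))
# Hybrid-style call: restrict peers to an assumed cluster of similar stores.
print(recommend("store_A", sales, peers=["store_B"]))
```

Restricting the peer set is the only difference between the two calls, which is one simple way to read "collaborative filtering in each cluster"; the paper's actual weighting scheme may differ.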

Comparison of Association Rule Learning and Subgroup Discovery for Mining Traffic Accident Data (교통사고 데이터의 마이닝을 위한 연관규칙 학습기법과 서브그룹 발견기법의 비교)

  • Kim, Jeongmin;Ryu, Kwang Ryel
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.4
    • /
    • pp.1-16
    • /
    • 2015
  • Traffic accidents have been one of the major causes of death worldwide for the last several decades. According to World Health Organization statistics, approximately 1.24 million deaths occurred on the world's roads in 2010. To reduce future traffic accidents, multipronged approaches have been adopted, including traffic regulations, injury-reducing technologies, driver training programs and so on. Records on traffic accidents are generated and maintained for this purpose. To make these records meaningful and effective, it is necessary to analyze the relationship between traffic accidents and related factors including vehicle design, road design, weather, driver behavior, etc. Insight derived from such analysis can be used in accident prevention. Traffic accident data mining is the activity of finding useful knowledge about such relationships that is not well known and that users may be interested in. Many studies on mining accident data have been reported over the past two decades. Most of them focused on predicting the risk of accidents from accident-related factors. Supervised learning methods such as decision trees, logistic regression, k-nearest neighbors, and neural networks are used for this prediction. However, the prediction models derived by these algorithms are too complex for humans to understand, because the main purpose of these algorithms is prediction, not explanation of the data. Some studies use unsupervised clustering algorithms to divide the data into several groups, but the derived groups themselves are still not easy for humans to understand, so additional analytic work is necessary. Rule-based learning methods are adequate when we want to derive a comprehensible form of knowledge about the target domain. They derive a set of if-then rules that represent the relationship between the target feature and other features. 
Rules are fairly easy for humans to understand and can therefore provide insight and comprehensible results. Association rule learning and subgroup discovery are representative rule-based learning methods for descriptive tasks. These two algorithms have been used in a wide range of areas, including transaction analysis, accident data analysis, detection of statistically significant patient risk groups, and discovery of key persons in social communities. We use both the association rule learning method and the subgroup discovery method to discover useful patterns in a traffic accident dataset consisting of many features, including driver profile, accident location, accident type, vehicle information, regulation violations and so on. The association rule learning method, which is an unsupervised learning method, searches for frequent item sets in the data and translates them into rules. In contrast, the subgroup discovery method is a supervised learning method that discovers rules for user-specified concepts satisfying a certain degree of generality and unusualness. Depending on which aspect of the data we focus on, we may combine multiple relevant features of interest into a synthetic target feature and give it to the rule learning algorithms. After a set of rules is derived, postprocessing steps make the rule set more compact and easier to understand by removing uninteresting or redundant rules. We conducted a set of experiments mining our traffic accident data in both unsupervised and supervised modes to compare these rule-based learning algorithms. The experiments reveal that association rule learning, in its pure unsupervised mode, can discover some hidden relationships among the features. 
Under a supervised learning setting with a combinatorial target feature, however, the subgroup discovery method finds good rules much more easily than the association rule learning method, which requires a lot of effort to tune its parameters.
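
To make the two rule-quality notions concrete, here is a small sketch on hypothetical accident records. Support and confidence are the usual association rule measures. Weighted relative accuracy (WRAcc) is one common subgroup discovery measure that combines generality (coverage) with unusualness (deviation from the overall target rate). The abstract does not say which quality measure the authors used, so WRAcc, the records, and the field names are all assumptions.

```python
# Hypothetical accident records (not from the paper's dataset).
records = [
    {"weather": "rain", "speeding": True, "severe": True},
    {"weather": "rain", "speeding": False, "severe": True},
    {"weather": "clear", "speeding": True, "severe": False},
    {"weather": "clear", "speeding": False, "severe": False},
    {"weather": "rain", "speeding": True, "severe": True},
]


def matches(record, condition):
    """True if the record satisfies every attribute=value pair in condition."""
    return all(record.get(k) == v for k, v in condition.items())


def support(records, condition):
    """Fraction of records covered by the condition."""
    return sum(matches(r, condition) for r in records) / len(records)


def confidence(records, antecedent, consequent):
    """Fraction of records covered by the antecedent that also satisfy
    the consequent, i.e. the confidence of 'if antecedent then consequent'."""
    covered = [r for r in records if matches(r, antecedent)]
    if not covered:
        return 0.0
    return sum(matches(r, consequent) for r in covered) / len(covered)


def wracc(records, condition, target):
    """Weighted relative accuracy: coverage * (target rate in subgroup
    minus overall target rate)."""
    return support(records, condition) * (
        confidence(records, condition, target) - support(records, target)
    )


rule = {"weather": "rain"}
target = {"severe": True}
print(support(records, rule), confidence(records, rule, target), wracc(records, rule, target))
```

A rule that covers many records but matches the overall target rate scores near zero under WRAcc, which is why a subgroup measure can surface interesting rules without the threshold tuning that support and confidence require.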

Epidemiologic Investigation for the Etiology of an Epidemic Ocurred among Animals and Humans in an Isolated Island, Korea(I) (신안군(新安郡) 낙도(落島)에서 발생(發生)한 괴질(怪疾)의 원인(原因)에 관한 역학적(疫學的) 조사(調査)(I))

  • Kim, J.S.;Heo, Y.;Yoon, H.Y.;Lee, W.Y.
    • Journal of Preventive Medicine and Public Health
    • /
    • v.22 no.2 s.26
    • /
    • pp.290-301
    • /
    • 1989
  • This is a preliminary report on an anthrax epidemic that occurred on an island with about 100 residents. Since 1982 there had been sudden deaths among all kinds of domestic animals, including cattle, dogs, ducks, chickens and goats, but only a few among cats, on an isolated island about three hours by ferry from Mokpo city. From 1986 through 1988, nine human deaths and four patients occurred, which led the government to begin an investigation on June 25, 1988. The epidemiological investigation consisted of an interview survey and medical examination, medical record analysis, laboratory work to isolate the pathogens guided by the hypothesis derived from the study, and further confirmation of the pathogens by international institutes. The results are summarized as follows: 1. According to the interview survey, there were many deaths among domestic animals, usually in the cold and dry seasons of January through March and September through November: 36 head of cattle (leaving one head), more than 40 hogs (all), hundreds of chickens (leaving few alive), goats that had been brought home from the mountain, and two or three cats out of around 40 died suddenly from 1982 until 1985, when the residents stopped purchasing animals and bringing them to the island. In addition, eleven persons had experienced a syndrome complex similar to that of the admitted and deceased patients, and four of them showed typical chest X-ray findings; B. anthracis was isolated from one of these four patients (Rho). 2. The medical records of admitted patients showed common characteristics of the disease course. On admission they had either gastrointestinal or upper respiratory infection symptoms, which invariably progressed to septicemia with pulmonary interstitial infiltration and mediastinal widening/bulging, and then to fatal acute respiratory distress syndrome. At the end stage, chest X-ray revealed multiple bullous emphysema. 
Another characteristic was oral ulceration with bleeding, which occurred in about 50% of the patients. Common laboratory findings were leukocytosis with left shift and abnormal liver and kidney function, particularly at the later stage of the illness. 3. The epidemiological characteristics were striking in that both mortality and incidence rates were high: the mortality rate averaged 8.7%, with the rate for males three times that for females, but there was no distinctive clustering by age group. The incidence rate for both sexes was 28.2%, with no sex difference, although a tendency toward higher incidence at older ages was noticed. The highest mortality and incidence were observed in Won village, where the first animal death occurred and with the highest frequency among the three villages of the island. 4. Among twelve bacilli species isolated from various specimens, two strains, one from a patient and the other from soil where a recently died cow was buried, were confirmed as B. anthracis by the Pasteur Institute and the CDC of the USA (strain from soil). The CDC reported that the strain did not produce a capsule in bicarbonate media but reacted with the bacteriophage and with one of five sera taken from the patients. The mode of transmission and the incubation period of the agent have not yet been established and need further investigation in relation to the antigenic structure of the variant once it is confirmed.

Estimation of Long-term Water Demand by Principal Component and Cluster Analysis and Practical Application (주성분분석과 군집분석을 이용한 장기 물수요예측과 활용)

  • Koo, Ja-Yong;Yu, Myung-Jin;Kim, Shin-Geol;Shim, Mi-Hee;Akira, Koizumi
    • Journal of Korean Society of Environmental Engineers
    • /
    • v.27 no.8
    • /
    • pp.870-876
    • /
    • 2005
  • Multiple regression models with two factors (population and commercial area) have been used to forecast future water demand. However, the coefficient of population had a negative value because proper regional classification had not been performed, which is unreasonable because population must be a positive factor. To solve this problem, regional classification was performed by principal component and cluster analysis. Six regional characteristics were transformed into four principal components, and the areas were divided into two groups by cluster analysis on the four principal components. New regression models were built for each group, which solved the problem. Future water demand was then estimated under three scenarios (active, moderate, and passive). The increase in water demand was $89,034\;m^3/day$ in the active plan, $49,077\;m^3/day$ in the moderate plan, and $19,996\;m^3/day$ in the passive plan. Under all scenarios the water treatment plant has sufficient supply capacity; however, 2 of the 4 reservoirs do not have enough retention time in any scenario.
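
Why a pooled regression can give population a negative coefficient while per-group models do not can be shown with a deliberately simplified sketch. The numbers are invented, only one predictor (population) is used instead of the paper's two, and the two groups are assigned by hand rather than derived from principal component and cluster analysis.

```python
def slope(points):
    """Ordinary least squares slope for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var


# Hypothetical (population, water demand) pairs for two kinds of area.
group_1 = [(1, 10), (2, 12), (3, 14)]  # small population, high baseline demand
group_2 = [(6, 2), (7, 4), (8, 6)]     # large population, low baseline demand

pooled = group_1 + group_2

print(slope(pooled))   # negative: the group difference dominates the fit
print(slope(group_1))  # positive within the group
print(slope(group_2))  # positive within the group
```

Within each group demand rises with population, but pooling areas with very different baselines reverses the sign, which is the kind of artifact that regional classification before regression is meant to remove.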

Analysis of Grain Quality Properties in Korea-bred Japonica Rice Cultivars (우리나라 자포니카 벼 품종의 식미관련 미질특성 분석)

  • Choi, Yong-Hwan;Kim, Kwang-Ho;Choi, Hae-Chun;Hwang, Hung-Goo;Kim, Yeon-Gyu;Kim, Kee-Jong;Lee, Young-Tae
    • KOREAN JOURNAL OF CROP SCIENCE
    • /
    • v.51 no.7
    • /
    • pp.624-631
    • /
    • 2006
  • This study performed a cluster analysis based on the major physicochemical characteristics related to the palatability of cooked rice. The 89 Korea-bred japonica rice cultivars could be broadly classified into two groups, the Dongjinbyeo and Ilpumbyeo groups. The Ilpumbyeo group was divided into two subgroups, the Ilpumbyeo and Chucheongbyeo groups. The two major rice groups showed a significant difference in the viscogram properties of rice flour. The Ilpumbyeo group showed slightly higher estimates of viscogram traits than the Dongjinbyeo group on average. The early-maturing rice group showed slightly lower taste meter estimates and higher protein content than the medium or medium-late maturing groups. Also, the early and medium-maturing groups exhibited slightly higher estimates of peak, hot and breakdown viscosities but lower estimates of consistency and setback viscosities than the medium-late-maturing group. The rice cultivars developed in the 2000s showed slightly higher estimates of peak, hot, cool and consistency viscosities than those developed in the 1980s~1990s. The grain quality properties significantly associated with the estimates of the Toyo taste meter were protein and amylose contents and hot viscosity. The lower the protein content and hot viscosity and the higher the amylose content, the higher the taste meter estimates. Protein content was highly negatively correlated with the amylose content of milled rice. The important quality components contributing to the multiple regression formula for estimating Toyo taste meter values were protein content, alkali digestion value, and hot viscosity. The fitness of this formula was about 49%, based on the coefficient of determination.

Internal Structure and Movement History of the Keumwang Fault (금왕단층의 내부구조 및 단층발달사)

  • Kim, Man-Jae;Lee, Hee-Kwon
    • The Journal of the Petrological Society of Korea
    • /
    • v.25 no.3
    • /
    • pp.211-230
    • /
    • 2016
  • Detailed mapping along the Keumwang fault reveals a complex history of multiple brittle reactivations following late Jurassic and early Cretaceous ductile shearing. The fault core consists of a 10~50 m thick fault gouge layer bounded by a 30~100 m thick damage zone. The Precambrian gneiss and Jurassic granite underwent at least six distinct stages of fault movement, distinguished by deformation environment, time and mechanism. Each stage is characterized by its fault kinematics and dynamics in a different deformation environment. Stage 1 generated the mylonite series along the Keumwang shear zone by sinistral ductile shearing during the late Jurassic and early Cretaceous. Stage 2 was a mostly brittle event generating a cataclasite series superimposed on the mylonite series of the Keumwang shear zone. The roundness of porphyroclasts and the amount of matrix increase from host rocks to ultracataclasite, indicating stronger cataclastic flow toward the fault core. At stage 3, a fault gouge layer was superimposed on the cataclasite generated during stage 2, and the sedimentary basins (Umsung and Pungam) formed along the fault by sinistral strike-slip movement. Fragments of older cataclasite suspended in the fault gouge suggest extensive reworking of fault rocks in brittle deformation environments. At stage 4, systematic en-echelon folds, joints and faults were formed in the sedimentary basins by sinistral strike-slip reactivation of the Keumwang fault. Most of the shearing is accommodated by slip along foliations and on discrete shear surfaces, while shear deformation tends to be relatively uniformly distributed within the fault damage zone developed in the mudrocks of the sedimentary basins. Fine-grained andesitic rocks intruded during stage 4. Stage 5 dextral strike-slip activity produced shear planes and bands in the andesitic rocks. 
ESR (Electron Spin Resonance) dates of fault gouge show temporal clustering within active periods and migration along the strike of the Keumwang fault during stage 6 in the Quaternary period.

Performance of Investment Strategy using Investor-specific Transaction Information and Machine Learning (투자자별 거래정보와 머신러닝을 활용한 투자전략의 성과)

  • Kim, Kyung Mock;Kim, Sun Woong;Choi, Heung Sik
    • Journal of Intelligence and Information Systems
    • /
    • v.27 no.1
    • /
    • pp.65-82
    • /
    • 2021
  • Stock market investors are generally divided into foreign investors, institutional investors, and individual investors. Compared with individual investors, professional investor groups such as foreign investors have an advantage in information and financial power, and as a result foreign investors are known to show good investment performance among market participants. The purpose of this study is to propose an investment strategy that combines investor-specific transaction information with machine learning, and to analyze the portfolio investment performance of the proposed model using actual stock price and investor-specific transaction data. The Korea Exchange provides securities firms with daily information on each investor group's purchase and sale volumes. We developed a data collection program in the C# programming language using an API provided by Daishin Securities Cybosplus, and collected daily opening price, closing price and investor-specific net purchase data for 151 out of 200 KOSPI stocks from January 2, 2007 to July 31, 2017. The self-organizing map model is an artificial neural network that performs clustering by unsupervised learning, introduced by Teuvo Kohonen in 1984. It implements competition among the artificial neurons within a layer, and it is a non-recurrent network in which all connections run from bottom to top. It can be expanded to multiple layers, although a single layer is commonly used. Linear functions are used as the activation functions of the artificial neurons, and the learning rules include the Instar rule as well as general competitive learning. The backpropagation model is an artificial neural network that performs classification by supervised learning. We grouped and transformed the investor-specific transaction volume data with the self-organizing map model and used the result to train the backpropagation model. 
Based on the estimates for the verification data obtained after training, the portfolios were rebalanced monthly. For performance analysis, a passive portfolio was designated, and KOSPI 200 and KOSPI index returns were obtained as proxies for market returns. Performance analysis was conducted using the equally weighted portfolio return, compound interest rate, annual return, Maximum Drawdown (MDD), standard deviation, and Sharpe Ratio. Buy-and-hold returns of the top 10 market capitalization stocks were designated as the benchmark; buy and hold is the best strategy under the efficient market hypothesis. The prediction rate of the backpropagation model on the learning data was significantly high at 96.61%, and the prediction rate on the verification data was also relatively high at 57.1%. The performance of the self-organizing map grouping can be judged from the results of the backpropagation model, because if the grouping results of the self-organizing map model had been poor, the learning results of the backpropagation model would also have been poor. On this basis, the machine learning performance is judged to be better than in previous studies. Our portfolio doubled the return of the benchmark and performed better than the market returns of the KOSPI and KOSPI 200 indexes. The MDD and standard deviation, the portfolio risk indicators, were also better than those of the benchmark. The Sharpe Ratio was higher than those of the benchmark and the stock market indexes. Through this, we presented a direction for a portfolio composition program using machine learning and investor-specific transaction information and showed that it can be used to develop programs for real stock investment. The return is the result of monthly portfolio composition and rebalancing of assets to equal proportions. 
Better outcomes are expected if, when forming the monthly portfolio, stocks that continue to be suggested are rebalanced without being sold and re-bought. Therefore, the strategy appears applicable to real transactions.
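
Two of the performance measures named in this abstract, Maximum Drawdown and the Sharpe Ratio, can be computed from a series of monthly returns as sketched below. The return series is made up, and the annualization by the square root of 12 and the zero risk-free rate are conventional assumptions rather than details given in the abstract.

```python
from statistics import mean, stdev


def max_drawdown(returns):
    """Largest peak-to-trough loss of the cumulative wealth curve,
    expressed as a fraction of the peak."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, (peak - wealth) / peak)
    return worst


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=12):
    """Mean excess return divided by its sample standard deviation,
    annualized by sqrt(periods_per_year) (an assumed convention)."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * periods_per_year ** 0.5


# Hypothetical monthly portfolio returns.
monthly = [0.10, -0.50, 0.20]
print(max_drawdown(monthly))
print(sharpe_ratio([0.01, 0.03]))
```

A lower drawdown and standard deviation together with a higher Sharpe Ratio is the pattern the abstract reports for the proposed portfolio relative to its benchmark.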

A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.1
    • /
    • pp.1-21
    • /
    • 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, text mining has been employed to discover new market and/or technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been continuous demand in various fields for market information at the specific product level. However, such information has generally been provided at the industry level or for broad categories based on classification standards, making it difficult to obtain specific and appropriate information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than previously available. We applied the Word2Vec algorithm, a neural network based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows: First, data related to product information is collected, refined, and restructured into a form suitable for applying the Word2Vec model. Next, the preprocessed data is embedded into a vector space by Word2Vec, and product groups are derived by extracting similar product names based on cosine similarity. Finally, the sales data of the extracted products is summed to estimate the market size of each product group. As experimental data, product-name text from Statistics Korea's microdata (345,103 cases) was mapped into a multidimensional vector space by Word2Vec training. 
We optimized the training parameters and then applied a vector dimension of 300 and a window size of 15 as the optimized parameters for further experiments. We employed the index words of the Korean Standard Industry Classification (KSIC) as a product name dataset to cluster product groups more efficiently. Product names similar to the KSIC index words were extracted based on cosine similarity. The market size of the extracted products, treated as one product category, was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market sizes of some items; the Pearson correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied to market size estimation for the first time, overcoming the limitations of traditional methods that rely on sampling or require multiple assumptions. In addition, the level of the market category can be adjusted easily and efficiently, according to the purpose of information use, by changing the cosine similarity threshold. Furthermore, the approach has high potential for practical application since it can meet the unmet need for detailed market size information in the public and private sectors. Specifically, it can be used in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis reports published by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic word embedding module can be improved by imposing a proper order on the preprocessed dataset or by combining Word2Vec with another measure such as Jaccard similarity. 
Also, the product group clustering method can be replaced with other types of unsupervised machine learning algorithms. Our group is currently working on follow-up studies, and we expect them to further improve the performance of the basic model conceptually proposed in this study.
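
The bottom-up estimation step described in this abstract (pick product names whose vectors are close to an index word, then sum their sales) can be sketched as follows. The two-dimensional vectors stand in for trained Word2Vec embeddings, and the product names, sales figures, and `market_size` helper are all hypothetical. The abstract reports 300-dimensional vectors, which this toy example does not reproduce.

```python
from math import sqrt

# Hypothetical stand-ins for trained product-name embeddings.
embeddings = {
    "laptop": [0.9, 0.1],
    "notebook pc": [0.8, 0.2],
    "rice cooker": [0.1, 0.9],
}
# Hypothetical sales per product name.
sales = {"laptop": 120.0, "notebook pc": 80.0, "rice cooker": 50.0}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def market_size(index_vector, embeddings, sales, threshold):
    """Collect product names at least `threshold` similar to the index
    word's vector and sum their sales as the group's market size."""
    members = [
        name for name, vec in embeddings.items()
        if cosine(index_vector, vec) >= threshold
    ]
    return members, sum(sales[name] for name in members)


index_vector = [1.0, 0.0]  # assumed vector of an index word such as "computer"
print(market_size(index_vector, embeddings, sales, threshold=0.95))
print(market_size(index_vector, embeddings, sales, threshold=0.99))
```

Raising the threshold narrows the product group and lowering it broadens the group, which corresponds to the abstract's point that the market category level can be adjusted through the cosine similarity threshold.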