• Title/Summary/Keyword: K-means cluster analysis

Variable Selection and Outlier Detection for Automated K-means Clustering

  • Kim, Sung-Soo
    • Communications for Statistical Applications and Methods / v.22 no.1 / pp.55-67 / 2015
  • An important problem in cluster analysis is the selection of variables that define cluster structure while eliminating noisy variables that mask it; in addition, outlier detection is a fundamental task for cluster analysis. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The automated K-means clustering procedure consists of three steps: (i) automatically calculating the cluster number and initial cluster centers whenever a new variable is added, (ii) identifying outliers for each cluster depending on the variables used, and (iii) selecting the variables that define cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach combining a clustering-based and a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective at selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.
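
A greedy, forward, adjusted-Rand-based selection loop of the kind described in this abstract can be sketched in a few lines of R. This is only an illustrative sketch, not the authors' VS-KM implementation (their program is at the URL above); the data matrix X, the cluster number k, and the stopping threshold min_ari are placeholder assumptions.

```r
library(mclust)   # adjustedRandIndex() measures agreement between partitions

# Greedy forward selection: at each step, add the variable whose single-variable
# K-means partition agrees best (adjusted Rand index) with the K-means partition
# on the currently selected variables plus that candidate; stop when the best
# agreement falls below a threshold.
forward_select_kmeans <- function(X, k, min_ari = 0.5, nstart = 25) {
  selected  <- integer(0)
  remaining <- seq_len(ncol(X))
  while (length(remaining) > 0) {
    ari <- sapply(remaining, function(v) {
      joint  <- kmeans(scale(X[, c(selected, v), drop = FALSE]), k, nstart = nstart)
      single <- kmeans(scale(X[, v, drop = FALSE]), k, nstart = nstart)
      mclust::adjustedRandIndex(joint$cluster, single$cluster)
    })
    if (max(ari) < min_ari) break             # no remaining variable fits the structure
    best      <- remaining[which.max(ari)]
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected                                    # indices of the retained variables
}
```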

Probabilistic reduced K-means cluster analysis (확률적 reduced K-means 군집분석)

  • Lee, Seunghoon;Song, Juwon
    • The Korean Journal of Applied Statistics / v.34 no.6 / pp.905-922 / 2021
  • Cluster analysis is one of the unsupervised learning techniques used for discovering clusters when there is no prior knowledge of group membership. K-means, one of the commonly used clustering techniques, may fail when the number of variables becomes large. In such high-dimensional cases, it is common to perform tandem analysis: K-means cluster analysis after reducing the number of variables with a dimension reduction method. However, there is no guarantee that the reduced dimension reveals the cluster structure properly. Principal component analysis may mask the structure of clusters, especially when variables unrelated to the cluster structure have large variances. To overcome this, techniques that perform dimension reduction and cluster analysis simultaneously have been suggested. This study proposes probabilistic reduced K-means, a transition of reduced K-means (De Soete and Carroll, 1994) into a probabilistic framework. Simulations show that the proposed method performs better than tandem clustering or clustering without any dimension reduction. When the number of variables is larger than the number of samples in each cluster, probabilistic reduced K-means forms clusters better than non-probabilistic reduced K-means. In an application to a real data set, it revealed a similar or better cluster structure than the other methods.
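
The tandem baseline that this abstract contrasts with (dimension reduction first, K-means second) is easy to sketch in R; the paper's probabilistic reduced K-means itself is not reproduced here, and the data X, the number of components q, and the cluster number k are placeholders.

```r
# Tandem analysis: PCA for dimension reduction, then K-means on the leading scores
tandem_kmeans <- function(X, q = 2, k = 3) {
  pca    <- prcomp(scale(X))                    # principal component analysis
  scores <- pca$x[, 1:q, drop = FALSE]          # keep the first q components
  kmeans(scores, centers = k, nstart = 25)      # cluster in the reduced space
}

# Small illustration with a built-in data set (labels ignored while clustering)
fit <- tandem_kmeans(iris[, 1:4], q = 2, k = 3)
table(fit$cluster, iris$Species)
```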

A Study on Efficient Cluster Analysis of Bio-Data Using MapReduce Framework

  • Yoo, Sowol;Lee, Kwangok;Bae, Sanghyun
    • Journal of Integrative Natural Science / v.7 no.1 / pp.57-61 / 2014
  • This study measured stream data from several sensors, stored them in a database within a MapReduce framework environment, and aimed to design a system with a low cluster analysis error rate using the KM-SVM algorithm. The data clustered effectively by the KM-SVM algorithm were used for a U-health system. In experiments using 2003 data sets obtained from 52 test subjects, the k-NN algorithm showed 79.29% cluster analysis accuracy, the K-means algorithm 87.15%, the SVM algorithm 83.72%, and KM-SVM 90.72%. As a result, the processing speed and effective cluster analysis ratio of the KM-SVM algorithm were better.
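
One common way to combine the two methods named in this abstract is to let K-means condense each class to a small set of centroids and then train an SVM on the condensed data. The sketch below is an assumption about what such a KM-SVM-style pipeline looks like, not the paper's MapReduce implementation; X, y, and k_per_class are placeholders.

```r
library(e1071)   # svm(): support vector machine

# X: numeric matrix/data frame, y: factor of class labels
km_svm <- function(X, y, k_per_class = 20) {
  reduced <- do.call(rbind, lapply(levels(y), function(cls) {
    Xc <- X[y == cls, , drop = FALSE]
    km <- kmeans(Xc, centers = min(k_per_class, nrow(Xc)), nstart = 10)
    data.frame(km$centers, .class = cls)        # class centroids keep the class label
  }))
  reduced$.class <- factor(reduced$.class)
  svm(.class ~ ., data = reduced)               # SVM trained on the condensed set
}

# Tiny illustration with a built-in data set
model <- km_svm(iris[, 1:4], iris$Species, k_per_class = 10)
mean(predict(model, iris[, 1:4]) == iris$Species)   # resubstitution accuracy only
```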

Cluster analysis by month for meteorological stations using a gridded data of numerical model with temperatures and precipitation (기온과 강수량의 수치모델 격자자료를 이용한 기상관측지점의 월별 군집화)

  • Kim, Hee-Kyung;Kim, Kwang-Sub;Lee, Jae-Won;Lee, Yung-Seop
    • Journal of the Korean Data and Information Science Society / v.28 no.5 / pp.1133-1144 / 2017
  • Cluster analysis with meteorological data makes it possible to segment meteorological regions according to their meteorological characteristics. However, meteorological observations are not well suited to cluster analysis because the stations that collect them are not located uniformly, so clustering the observed data cannot properly reflect the climate characteristics of South Korea. Clustering 5 km × 5 km gridded data derived from a numerical model, on the other hand, reflects them evenly. In this study, we analyzed long-term gridded data for temperature and precipitation using cluster analysis. Because climate characteristics differ by month, clustering was performed by month. As the result of K-means cluster analysis is sensitive to initial values, we used initial values obtained with Ward's method, a hierarchical clustering method. Based on the clustering of the gridded data, clusters of meteorological stations were determined. As a result, a spatio-temporal segmentation of the meteorological stations in South Korea was obtained.
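
A minimal R sketch of the initialization strategy described in this abstract: take a Ward partition and use its cluster means as the starting centers for K-means. The gridded temperature/precipitation matrix grid_data and the number of clusters k are placeholders.

```r
ward_init_kmeans <- function(X, k) {
  X  <- scale(X)
  hc <- hclust(dist(X), method = "ward.D2")     # Ward's minimum-variance linkage
  g  <- cutree(hc, k = k)                       # provisional Ward partition
  centers <- rowsum(X, g) / tabulate(g)         # Ward cluster means as K-means seeds
  kmeans(X, centers = centers)                  # K-means started from those seeds
}

# Usage (grid_data is a placeholder for one month's gridded variables)
# fit <- ward_init_kmeans(grid_data, k = 6)
```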

A Variable Selection Procedure for K-Means Clustering

  • Kim, Sung-Soo
    • The Korean Journal of Applied Statistics / v.25 no.3 / pp.471-483 / 2012
  • One of the most important problems in cluster analysis is the selection of variables that truly define cluster structure, while eliminating noisy variables that mask such structure. Brusco and Cradit (2001) present the VS-KM (variable-selection heuristic for K-means clustering) procedure for selecting true variables for K-means clustering based on the adjusted Rand index. Their procedure starts with a fixed number of clusters and adds variables sequentially based on the adjusted Rand index. This paper presents an updated procedure that combines VS-KM with the automated K-means procedure provided by Kim (2009). This automated variable selection procedure for K-means clustering calculates the cluster number and initial cluster centers whenever a new variable is added, and adds variables based on the adjusted Rand index. Simulation results indicate that the proposed procedure is very effective at selecting true variables and at eliminating noisy variables. The implemented R programs can be obtained at http://faculty.knou.ac.kr/sskim/nvarkm.r and vnvarkm.r.
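
The adjusted Rand index that drives the selection in both this paper and the 2015 paper above is a standard quantity; a textbook version is written out below as a stand-alone R function (this is not the code posted at the URL above).

```r
# Adjusted Rand index between two partitions (Hubert and Arabie, 1985 form)
adjusted_rand <- function(labels1, labels2) {
  tab   <- table(labels1, labels2)
  comb2 <- function(x) x * (x - 1) / 2          # "n choose 2", element-wise
  sum_ij   <- sum(comb2(tab))
  sum_rows <- sum(comb2(rowSums(tab)))
  sum_cols <- sum(comb2(colSums(tab)))
  expected <- sum_rows * sum_cols / comb2(sum(tab))
  maximum  <- (sum_rows + sum_cols) / 2
  (sum_ij - expected) / (maximum - expected)
}

# Example: agreement between a 3-cluster K-means solution and known labels
cl <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)$cluster
adjusted_rand(cl, iris$Species)
```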

A Performance Comparison of Cluster Validity Indices based on K-means Algorithm (K-means 알고리즘 기반 클러스터링 인덱스 비교 연구)

  • Shim, Yo-Sung;Chung, Ji-Won;Choi, In-Chan
    • Asia pacific journal of information systems / v.16 no.1 / pp.127-144 / 2006
  • The K-means algorithm is widely used at the initial stage of data analysis in the data mining process, partly because of its low time complexity and the simplicity of practical implementation. Cluster validity indices are used along with the algorithm in order to determine the number of clusters as well as to assess the clustering results. In this paper, we present a performance comparison of sixteen indices, selected from forty indices in the literature for their applicability to nonhierarchical clustering algorithms. The data sets used in the experiment are generated from multivariate normal distributions. In particular, four error types are considered in the comparison: standardization, outlier generation, error perturbation, and noise dimension addition. Through the experiment, the effects of varying the number of points, attributes, and clusters on performance are analyzed. The simulation results show that the Calinski and Harabasz index performs best across all data sets and that the Davies and Bouldin index becomes a strong competitor as the number of points in the data set increases.
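
The index reported here as the overall winner, Calinski and Harabasz's variance ratio, can be computed directly from K-means output. The sketch below scans a range of cluster numbers and is a generic illustration, not the experimental setup of the paper; X and k_range are placeholders.

```r
# CH(k) = [B / (k - 1)] / [W / (n - k)], with B the between-cluster and
# W the total within-cluster sum of squares from a K-means fit with k clusters
calinski_harabasz <- function(X, k_range = 2:10, nstart = 25) {
  X <- scale(X)
  n <- nrow(X)
  setNames(sapply(k_range, function(k) {
    km <- kmeans(X, centers = k, nstart = nstart)
    (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
  }), k_range)
}

ch <- calinski_harabasz(iris[, 1:4])
which.max(ch)    # cluster number with the largest CH value
```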

Anthropometry for clothing construction and cluster analysis ( I ) (피복구성학적 인체계측과 집락구조분석 ( I ))

  • Kim Ku Ja
    • Journal of the Korean Society of Clothing and Textiles / v.10 no.3 / pp.37-48 / 1986
  • The purpose of this study was to analyze the 'natural groupings' of subjects in order to classify highly similar somatotypes for clothing construction. The sample was drawn randomly from senior high school boys in the Seoul urban area and consisted of 425 boys between ages 16 and 18. The cluster analysis was concerned with finding the hierarchical structure of the subjects from the three-dimensional distance of stature, bust girth and sleeve length. The groups forming a partition can be subdivided into 5 or 6 sets by the hierarchical tree of the given subjects. Ward's minimum variance method was applied after extracting a distance matrix based on the standardized Euclidean distance. All of the data were analyzed on the computer installed at the Korea Advanced Institute of Science and Technology. The major findings, taking the 16-year-old group as an example, can be summarized as follows. 1. Cluster 1 (32 persons, 18.29% of the total) has a smaller bust girth than cluster 5, but the largest stature and sleeve length. 2. Cluster 2 (18 persons, 10.29% of the total) has the smallest stature and sleeve length, but a bust girth larger than that of cluster 3. 3. Cluster 3 (35 persons, 20% of the total) is the smallest group in stature, bust girth and sleeve length alike. 4. Cluster 4 (60 persons, 34.29% of the total) has the same sleeve length as the mean of the 16-year-old group, but a smaller stature and bust girth than the mean of this age group. 5. Cluster 5 (30 persons, 17.14% of the total) has a smaller stature than cluster 1, a larger bust girth than cluster 1, and the same sleeve length as the mean of the 16-year-old group.
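
The clustering recipe described in this abstract (standardized Euclidean distances, Ward's minimum variance method, a cut into 5 groups) has a direct R equivalent; the data frame body and its column names are placeholders for the unavailable measurements.

```r
# Standardize the three measurements, cluster with Ward's method, cut into 5 groups
Z  <- scale(body[, c("stature", "bust_girth", "sleeve_length")])
hc <- hclust(dist(Z), method = "ward.D2")   # Ward's minimum variance method
groups <- cutree(hc, k = 5)                 # 5-cluster partition

# Cluster profiles: mean measurement per cluster
aggregate(body[, c("stature", "bust_girth", "sleeve_length")],
          by = list(cluster = groups), FUN = mean)
```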


Partial Discharge Distribution Analysis on Interlace Defects of Cable Joint using K-means Clustering (K-means 클러스터링을 이용한 케이블 접속재 계면결함의 부분방전 분포 해석)

  • Cho, Kyung-Soon;Hong, Jin-Woong
    • Journal of the Korean Institute of Electrical and Electronic Material Engineers / v.20 no.11 / pp.959-964 / 2007
  • To investigate the influence of various defects at the power cable joint interface on the partial discharge (PD) distribution characteristics, we used the K-means clustering method. Analysis of the PD number (n) distribution on the Φ-n graph showed that the phase angle (Φ) of the cluster centroids shifted toward 0° and 180° as the applied voltage increased. Analysis of the centroid distribution on the Φ-q plane confirmed that the PD quantity (q) and the Euclidean distance between centroids increased with the applied voltage. The degree of dispersion, measured as the standard deviation of the Φ-q cluster centroids, also increased. The PD number and the mean value on the Φ-q graph differed somewhat by defect type, owing to the concentration of the electric field.
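
Purely as an illustration of the kind of analysis described (not the authors' measurement pipeline), phase-resolved PD points can be clustered and summarized in R as below; the data frame pd with columns phi and q, and the choice of 3 clusters, are placeholders.

```r
# K-means on standardized phase angle (phi) and discharge quantity (q)
km <- kmeans(scale(pd[, c("phi", "q")]), centers = 3, nstart = 25)

km$centers                 # cluster centroids on the standardized phi-q plane
dist(km$centers)           # Euclidean distances between the centroids

# Per-cluster standard deviations as a simple dispersion measure
aggregate(pd[, c("phi", "q")], by = list(cluster = km$cluster), FUN = sd)
```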

Analysis of Partial Discharge Pattern in XLPE/EDPM Interface Defect using the Cluster (군집화에 의한 XLPE/EPDM 계면결함 부분방전 패턴 분석)

  • Cho, Kyung-Soon;Lee, Kang-Won;Shin, Jong-Yeol;Hong, Jin-Woong
    • Proceedings of the Korean Institute of Electrical and Electronic Material Engineers Conference / 2007.11a / pp.203-204 / 2007
  • This paper investigated the influence of various defects at the model power cable joint interface on the partial discharge distribution, using K-means clustering. Analysis of the discharge number distribution of the Φ-n clusters showed that the clusters shifted toward 0° and 180° with increasing applied voltage. Analysis of the centroid distribution of the Φ-q clusters confirmed that the discharge quantity and the Euclidean distance between centroids increased with the applied voltage. The degree of dispersion, calculated as the standard deviation of the Φ-q cluster centroids, also increased. Both the number of discharges and the mean value of the Φ-q cluster centroids differed somewhat with defect type.


Analysis of Brokerage Commission Policy based on the Potential Customer Value (고객의 잠재가치에 기반한 증권사 수수료 정책 연구)

  • Shin, Hyung-Won;Sohn, So-Young
    • IE interfaces / v.16 no.spc / pp.123-126 / 2003
  • In this paper, we use three clustering algorithms (K-means, Self-Organizing Map, and fuzzy K-means) to find appropriate graded stock market brokerage commission rates based on cumulative transactions on both the stock exchange market and HTS (Home Trading System). Investors in each trading mode are classified in terms of their total transactions and the corresponding mode of investment. Empirical analysis indicates that fuzzy K-means cluster analysis is the best fit for segmenting the customers of both transaction modes in terms of robustness. We then propose rules for three groupings of customers based on a decision tree, and apply brokerage commissions of 0.4%, 0.45%, and 0.5% for the exchange market and 0.06%, 0.1%, and 0.18% for HTS.
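
The fuzzy K-means segmentation and the decision-tree rule step mentioned in this abstract can be sketched with standard R packages; the customer transaction matrix cust, the choice of 3 segments, and the fuzzifier m = 2 are placeholder assumptions, and the paper's actual commission rules are not reproduced.

```r
library(e1071)   # cmeans(): fuzzy c-means (fuzzy K-means) clustering
library(rpart)   # rpart(): classification tree for rule extraction

seg <- cmeans(scale(cust), centers = 3, m = 2)   # fuzzy K-means segmentation
head(seg$membership)                             # soft membership degrees per customer

# Turn the hard assignments into interpretable splitting rules with a small tree
tree <- rpart(factor(seg$cluster) ~ ., data = as.data.frame(scale(cust)))
print(tree)
```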