• Title, Abstract, Keywords: Clustering Analysis

Performance evaluation of principal component analysis for clustering problems

  • Kim, Jae-Hwan;Yang, Tae-Min;Kim, Jung-Tae
    • Journal of the Korean Society of Marine Engineering / v.40 no.8 / pp.726-732 / 2016
  • Clustering analysis is widely used in data mining to classify data into categories on the basis of their similarity. Over the decades, many clustering techniques have been developed, including hierarchical and non-hierarchical algorithms. In gene profiling problems, because of the large number of genes and the complexity of biological networks, dimensionality reduction techniques are critical exploratory tools for clustering analysis of gene expression data. Recently, clustering analysis that applies dimensionality reduction techniques has also been proposed. PCA (principal component analysis) is a popular dimensionality reduction method for clustering problems. However, previous studies analyzed the performance of PCA only for full data sets. In this paper, to specifically and robustly evaluate the performance of PCA for clustering analysis, we exploit an improved FCBF (fast correlation-based filter), one of the feature selection methods, for supervised clustering data sets, and employ two well-known clustering algorithms: k-means and k-medoids. Computational results from supervised data sets show that the performance of PCA is very poor for large-scale features.

Verification of Improving a Clustering Algorithm for Microarray Data with Missing Values

  • Kim, Su-Young
    • The Korean Journal of Applied Statistics / v.24 no.2 / pp.315-321 / 2011
  • Gene expression microarray data often include multiple missing values. Most gene expression analyses (including gene clustering analysis), however, require a complete data matrix as input. In ordinary clustering methods, a single missing value forces one to discard all the data for a gene, even if the rest of the data for that gene is intact. The quality of the analysis may decrease seriously as the missing rate increases. Conversely, imputing missing values may introduce artifacts that reduce the reliability of the analysis. To clarify this contradiction in microarray clustering analysis, this paper compared the accuracy of clustering with and without imputation over several microarray data sets with different missing rates. This paper also tested the clustering efficiency of several imputation methods, including our proposed algorithm. The results showed that, for imperfect microarray data, it is worthwhile to check the clustering result in this alternative way, without any imputed data.
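
One way to cluster without imputing values is to compute distances only over the coordinates that both genes have observed. The sketch below assumes this partial-distance approach; it is not the algorithm proposed in the paper, which the abstract does not describe. The function name `partial_distance` and the use of `None` to mark a missing value are illustrative choices.

```python
import math


def partial_distance(a, b):
    """Euclidean distance over the coordinates observed in both vectors.

    Missing values are marked with None (an assumption of this sketch).
    The squared sum is rescaled by total/shared dimensions so that pairs
    with many missing values are not treated as artificially close.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.nan  # no coordinate is observed in both vectors
    scale = len(a) / len(shared)
    return math.sqrt(scale * sum((x - y) ** 2 for x, y in shared))


# Two hypothetical expression profiles with one missing value each.
gene_a = [1.0, None, 3.0, 4.0]
gene_b = [1.0, 2.0, None, 0.0]
print(partial_distance(gene_a, gene_b))
```

A distance matrix built this way can be passed to any clustering method that accepts precomputed distances, so no gene has to be discarded for a single missing value.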

Hot Spot Analysis of Tourist Attractions Based on Stay Point Spatial Clustering

  • Liao, Yifan
    • Journal of Information Processing Systems / v.16 no.4 / pp.750-759 / 2020
  • The wide application of various integrated location-based services (LBS social) and tourism applications (apps) has generated a large amount of trajectory space data. The trajectory data are used to identify popular tourist attractions with a high density of tourists, and they are of great significance to smart service and emergency management of scenic spots. A hot spot analysis method is proposed, based on spatial clustering of trajectory stay points. The DBSCAN algorithm is studied for its fast clustering speed, noise handling, and ability to cluster arbitrary shapes in space. Its shortcoming is that its parameters are selected manually, so an improved method is proposed to determine the parameters adaptively based on the statistical distribution characteristics of the data. DBSCAN clustering analysis and contrast experiments are carried out on three different datasets: an artificial synthetic two-dimensional dataset, the four-dimensional Iris real dataset, and a scenic track retention point dataset. The experimental results show that the method can automatically generate a reasonable clustering division, and that it is superior to traditional algorithms such as DBSCAN and k-means. Finally, based on the spatial clustering results of the trajectory stay points, Getis-Ord Gi* hotspot analysis and mapping are conducted in ArcGIS software. The hot spots of different tourist attractions are classified according to the analysis results, and the distribution of popular scenic spots is determined with the actual heat of the scenic spots.
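
To make the two manually selected DBSCAN parameters concrete, here is a minimal standard-library sketch of plain DBSCAN with `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a core point). The `k_distances` helper shows a commonly used heuristic for choosing `eps` from sorted k-th nearest-neighbor distances; it is an assumption of this sketch and not the adaptive method proposed in the paper, which the abstract does not specify.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)

    def neighbors(i):
        # Brute-force neighborhood query; includes the point itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Hypothetical stay points: two dense groups and one isolated point.
stay_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(stay_points, eps=1.5, min_pts=3))
print(k_distances(stay_points, k=2))
```

A sharp jump in the sorted k-distances is often read as a candidate value for `eps`, which is one simple way to reduce the manual tuning that the abstract identifies as a weakness.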

Application of Clustering Methods for Interpretation of Petroleum Spectra from Negative-Mode ESI FT-ICR MS

  • Yeo, In-Joon;Lee, Jae-Won;Kim, Sung-Hwan
    • Bulletin of the Korean Chemical Society / v.31 no.11 / pp.3151-3155 / 2010
  • This study was performed to develop analytical methods to better understand the properties and reactivity of petroleum, which is a highly complex organic mixture, using high-resolution mass spectrometry and statistical analysis. Ten crude oil samples were analyzed using negative-mode electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry (ESI FT-ICR MS). Clustering methods, including principal component analysis (PCA), hierarchical clustering analysis (HCA), and k-means clustering, were used to comparatively interpret the spectra. All the methods were consistent and showed that oxygen- and sulfur-containing heteroatom species played important roles in clustering samples or peaks. The oxygen-containing samples had higher acidity than the other samples, and the clustering results were linked to properties of the crude oils. This study demonstrated that clustering methods provide a simple and effective way to interpret complex petroleomic data.

Agglomerative Hierarchical Clustering Analysis with Deep Convolutional Autoencoders (합성곱 오토인코더 기반의 응집형 계층적 군집 분석)

  • Park, Nojin;Ko, Hanseok
    • Journal of Korea Multimedia Society / v.23 no.1 / pp.1-7 / 2020
  • Clustering methods essentially take a two-step approach: extracting feature vectors for dimensionality reduction and then applying a clustering algorithm to the extracted feature vectors. However, for clustering images, traditional clustering methods such as stacked auto-encoder based k-means are not effective, since they tend to ignore local information. In this paper, we propose a method that first effectively reduces data dimensionality using a convolutional auto-encoder, which captures and reflects local information, and then accurately clusters similar data samples using a hierarchical clustering approach. The experimental results confirm that the proposed model improves the clustering results in terms of clustering accuracy and normalized mutual information.
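
Normalized mutual information, one of the two metrics named in the abstract, can be computed directly from two label lists. The sketch below assumes arithmetic-mean normalization of the two entropies, which is one of several conventions; the abstract does not say which one the authors used.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI between two labelings, normalized by the mean of their entropies."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information from the joint and marginal label frequencies.
    mi = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    denom = (h_true + h_pred) / 2
    # Both labelings put everything in one cluster: treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0


# Cluster ids are arbitrary, so a relabeled but identical partition scores 1.
print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the score ignores how clusters are numbered, it can compare a clustering result against ground-truth classes without first matching cluster ids to class ids.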

Descriptive and Systematic Comparison of Clustering Methods in Microarray Data Analysis

  • Kim, Seo-Young
    • The Korean Journal of Applied Statistics / v.22 no.1 / pp.89-106 / 2009
  • There have been many new advances in the development of improved clustering methods for microarray data analysis, but traditional clustering methods are still often used in genomic data analysis, which may be due more to their conceptual simplicity and their broad availability in commercial software packages than to their intrinsic merits. Thus, it is crucial to assess the performance of each existing method through a comprehensive comparative analysis so as to provide informed guidelines on choosing clustering methods. In this study, we investigated existing clustering methods applied to microarray data in various real scenarios. To this end, we focused on how the various methods differ, and why a particular method does not perform well. We applied both internal and external validation methods to the following eight clustering methods using various simulated data sets and real microarray data sets.
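
As an illustration of external validation, the sketch below computes the Rand index, which compares a clustering against known reference labels by counting pairs of items on which the two partitions agree. The abstract does not list which validation indices the study used, so the choice of the Rand index here is an assumption.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    A pair agrees when both partitions put the two items together,
    or both put them apart.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical reference classes versus a clustering result.
reference = [0, 0, 1, 1]
clustering = [1, 1, 0, 0]
print(rand_index(reference, clustering))
```

Internal validation, by contrast, uses only the data and the clustering itself (for example, within-cluster compactness), with no reference labels.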

A Multi-Dimensional Issue Clustering from the Perspective Consumers' Interests and R&D (소비자 선호 이슈 및 R&D 관점에서의 다차원 이슈 클러스터링)

  • Hyun, Yoonjin;Kim, Namgyu;Cho, Yoonho
    • Journal of Information Technology Services / v.14 no.1 / pp.237-249 / 2015
  • The volume of unstructured text data generated by various social media has been increasing rapidly; therefore, the use of text mining to support decision making has also been increasing. In particular, issue clustering, which determines new relations among various issues through clustering, has gained attention from many researchers. However, traditional issue clustering methods can only be performed based on the co-occurrence frequency of issue keywords in many documents. Therefore, an association between issues that have a low co-occurrence frequency cannot be discovered using traditional issue clustering methods, even if those issues are strongly related from other perspectives. Issue clustering therefore needs to be performed according to criteria that fit the perspective of analysis and the purpose of use. In this study, multi-dimensional issue clustering is proposed to overcome the limitation of traditional issue clustering. We assert, specifically in this study, that issue clustering should be performed for a particular purpose. We analyze the results of applying our methodology to two specific perspectives on issue clustering: (i) consumers' interests and (ii) related R&D terms.

K-means Clustering for Environmental Indicator Survey Data

  • Park, Hee-Chang;Cho, Kwang-Hyun
    • Proceedings of the Korean Data and Information Science Society Conference / pp.185-192 / 2005
  • There are many data mining techniques, such as association rules, decision trees, neural network analysis, clustering, genetic algorithms, Bayesian networks, and memory-based reasoning. We analyze the 2003 Gyeongnam social indicator survey data using the k-means clustering technique for environmental information. Clustering is the process of grouping data into clusters so that objects within a cluster are highly similar to one another. In this paper, we use k-means clustering from among several clustering techniques. K-means clustering is classified as a partitional clustering method. The k-means clustering outputs can be applied to environmental preservation and environmental improvement.
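
The partitional idea behind k-means can be sketched in a few lines using Lloyd's algorithm: assign each point to its nearest centroid, move each centroid to the mean of its members, and repeat until the assignment stops changing. This is a minimal illustration, not the software used in the paper; the function name, the explicit `init_centroids` argument, and the toy data are assumptions of this sketch.

```python
def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm. Returns (centroids, labels)."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy two-dimensional data with two obvious groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, init_centroids=[(0, 0), (10, 10)])
print(labels)
```

The result depends on the initial centroids, which is why they are passed in explicitly here; in practice several initializations are usually tried and the partition with the lowest within-cluster sum of squares is kept.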

Clustering Algorithm using a Center Of Gravity for Grid-based Sample

  • Park, Hee-Chang;Ryu, Jee-Hyun
    • Proceedings of the Korean Data and Information Science Society Conference / pp.77-88 / 2003
  • Cluster analysis has been widely used in many applications, such as data analysis, pattern recognition, and image processing. However, clustering requires many hours to obtain the desired clusters, because it is primitive and exploratory, and because large amounts of data are subjected to cluster analysis. In this paper, we propose a new clustering method, 'Clustering algorithm using a center of gravity for grid-based sample'. It is faster than any traditional clustering method and maintains accuracy. It reduces running time by using a grid-based sample and keeps accuracy by using a representative point, the center of gravity.
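
The abstract gives only the outline of the method, so the following is one possible reading of it rather than the authors' algorithm: bin the points into grid cells, then replace each occupied cell with its center of gravity (the mean of its points). A clustering algorithm can then run on the much smaller set of representatives. The function name and the `cell_size` parameter are hypothetical.

```python
from collections import defaultdict


def grid_centers_of_gravity(points, cell_size):
    """Map each occupied grid cell to (center of gravity, point count)."""
    cells = defaultdict(list)
    for p in points:
        # Integer cell index along each axis; floor division handles negatives.
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(p)

    representatives = {}
    for key, members in cells.items():
        center = tuple(sum(col) / len(members) for col in zip(*members))
        representatives[key] = (center, len(members))
    return representatives


# Four points fall into two grid cells of side 1.0.
sample = [(0.1, 0.1), (0.3, 0.5), (2.2, 2.4), (2.6, 2.8)]
for cell, (center, count) in grid_centers_of_gravity(sample, 1.0).items():
    print(cell, center, count)
```

Keeping the point count alongside each center allows a later clustering step to weight the representatives, so dense cells are not treated the same as sparse ones.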

Application of Principal Component Analysis Prior to Cluster Analysis in the Concept of Informative Variables

  • Chae, Seong-San
    • Communications for Statistical Applications and Methods / v.10 no.3 / pp.1057-1068 / 2003
  • Results of using principal component analysis prior to cluster analysis are compared with results from applying an agglomerative clustering algorithm alone. In some situations, the retrieval ability of the agglomerative clustering algorithm is improved by using principal components prior to cluster analysis. On the other hand, the loss in retrieval ability for the agglomerative clustering algorithms decreases as the number of informative variables increases, where informative variables are variables that carry distinct (or necessary) information compared to other variables.
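
The preprocessing step described here, projecting data onto principal components before clustering, can be sketched with the standard library alone. The example below finds only the first principal component by power iteration on the sample covariance matrix; this is a simplification chosen for brevity, not the procedure used in the paper, and it can fail if the starting vector happens to be orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # degenerate case: start vector carries no variance
        v = [x / norm for x in w]

    # Scores: coordinates of each centered row along the component.
    scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
    return v, scores


# Toy data lying on the line y = x.
direction, scores = first_principal_component(
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
)
print(direction, scores)
```

The resulting scores would replace the original variables as input to the agglomerative clustering step; whether that helps depends, as the abstract notes, on how many of the original variables are informative.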