• Title/Abstract/Keywords: similarity based clustering

Search results: 322 items (processing time: 0.028 seconds)

Spectral clustering based on the local similarity measure of shared neighbors

  • Cao, Zongqi;Chen, Hongjia;Wang, Xiang
    • ETRI Journal
    • /
    • Vol. 44, No. 5
    • /
    • pp.769-779
    • /
    • 2022
  • Spectral clustering has become a typical and efficient clustering method used in a variety of applications. The critical step of spectral clustering is the similarity measurement, which largely determines the performance of the spectral clustering method. In this paper, we propose a novel spectral clustering algorithm based on the local similarity measure of shared neighbors. This similarity measurement exploits the local density information between data points based on the weight of the shared neighbors in a directed k-nearest neighbor graph with only one parameter k, that is, the number of nearest neighbors. Numerical experiments on synthetic and real-world datasets demonstrate that our proposed algorithm outperforms other existing spectral clustering algorithms in terms of the clustering performance measured via the normalized mutual information, clustering accuracy, and F-measure. As an example, the proposed method can provide an improvement of 15.82% in the clustering performance for the Soybean dataset.

Transactions Clustering based on Item Similarity

  • 이상욱;김재련
    • Korea Intelligent Information Systems Society: Conference Proceedings
    • /
    • Korea Intelligent Information Systems Society 2002 Fall Regular Conference
    • /
    • pp.250-257
    • /
    • 2002
  • Clustering is a data mining method that consists of discovering interesting data distributions in very large databases. In traditional data clustering, the similarity of a cluster of objects is measured by the pairwise similarity of the objects in that cluster. In view of the nature of clustering transactions, we devise in this paper a novel measurement called item similarity and use it to perform clustering. With this item similarity measurement, we develop an efficient clustering algorithm for target marketing in each group.

Robust Similarity Measure for Spectral Clustering Based on Shared Neighbors

  • Ye, Xiucai;Sakurai, Tetsuya
    • ETRI Journal
    • /
    • Vol. 38, No. 3
    • /
    • pp.540-550
    • /
    • 2016
  • Spectral clustering is a powerful tool for exploratory data analysis. Many existing spectral clustering algorithms typically measure the similarity by using a Gaussian kernel function or an undirected k-nearest neighbor (kNN) graph, which cannot reveal the real clusters when the data are not well separated. In this paper, to improve the spectral clustering, we consider a robust similarity measure based on the shared nearest neighbors in a directed kNN graph. We propose two novel algorithms for spectral clustering: one based on the number of shared nearest neighbors, and one based on their closeness. The proposed algorithms are able to explore the underlying similarity relationships between data points, and are robust to datasets that are not well separated. Moreover, the proposed algorithms have only one parameter, k. We evaluated the proposed algorithms using synthetic and real-world datasets. The experimental results demonstrate that the proposed algorithms not only achieve a good level of performance but also outperform the traditional spectral clustering algorithms.
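As a rough illustration of the first idea in the abstract above (similarity from the number of shared nearest neighbors in a directed kNN graph), the following is a minimal sketch. The helper names `knn_lists` and `shared_neighbor_similarity`, the Euclidean distance, and the zero diagonal are assumptions made for illustration; the abstract does not give the paper's exact definition or normalization.

```python
import math


def knn_lists(points, k):
    """For each point, return the set of indices of its k nearest neighbors.

    This is a directed kNN graph: j being in the list of i does not imply
    that i is in the list of j.
    """
    neighbors = []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def shared_neighbor_similarity(points, k):
    """Similarity S[i][j] = number of neighbors that i and j have in common."""
    nn = knn_lists(points, k)
    n = len(points)
    # Assumption: the diagonal is set to 0 (no self-similarity).
    return [
        [len(nn[i] & nn[j]) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]


# Two small, well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
S = shared_neighbor_similarity(points, k=2)
```

In this toy example, points in the same group share a neighbor while points in different groups share none, so `S` is block-structured and could serve as an affinity matrix for a spectral clustering step.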

A Max-Flow-Based Similarity Measure for Spectral Clustering

  • Cao, Jiangzhong;Chen, Pei;Zheng, Yun;Dai, Qingyun
    • ETRI Journal
    • /
    • Vol. 35, No. 2
    • /
    • pp.311-320
    • /
    • 2013
  • In most spectral clustering approaches, the Gaussian kernel-based similarity measure is used to construct the affinity matrix. However, such a similarity measure does not work well on a dataset with a nonlinear and elongated structure. In this paper, we present a new similarity measure to deal with the nonlinearity issue. The maximum flow between data points is computed as the new similarity, which can satisfy the requirement for similarity in the clustering method. Additionally, the new similarity carries the global and local relations between data. We apply it to spectral clustering and compare the proposed similarity measure with other state-of-the-art methods on both synthetic and real-world data. The experimental results show the superiority of the new similarity: 1) the max-flow-based similarity measure can significantly improve the performance of spectral clustering; 2) it is robust and not sensitive to the parameters.
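For context, the Gaussian kernel-based affinity matrix that the abstract above names as the usual baseline can be sketched as follows. This shows the baseline only, not the max-flow measure the paper proposes. The function name `gaussian_affinity`, the bandwidth parameter `sigma`, and the zero diagonal are assumptions chosen for illustration.

```python
import math


def gaussian_affinity(points, sigma):
    """Build W with W[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    Assumption: the diagonal is left at 0, a common convention when the
    matrix is passed on to spectral clustering.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Squared Euclidean distance between the two points.
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2 * sigma ** 2))
    return W


points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
W = gaussian_affinity(points, sigma=1.0)
```

Because the affinity depends only on straight-line distance, two points at opposite ends of an elongated cluster receive a very small weight, which is the weakness the abstract points to.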

Mining Clusters of Sequence Data using Sequence Element-based Similarity Measure

  • 오승준;김재련
    • Korea Intelligent Information Systems Society: Conference Proceedings
    • /
    • Korea Intelligent Information Systems Society 2004 Fall Conference
    • /
    • pp.221-229
    • /
    • 2004
  • Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, only a few of the existing clustering algorithms consider sequentiality. This study presents a method for clustering such sequence datasets. The similarity between sequences must be decided before clustering the sequences. This study proposes a new similarity measure to compute the similarity between two sequences using a sequence element. Two clustering algorithms using the proposed similarity measure are proposed: a hierarchical clustering algorithm and a scalable clustering algorithm that uses sampling and a k-nearest neighbor method. Using a splice dataset and synthetic datasets, we show that the quality of clusters generated by our proposed clustering algorithms is better than that of clusters produced by traditional clustering algorithms.

Development of Similarity-Based Document Clustering System

  • 우훈식;임동순
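    • Society of Korea Industrial and Systems Engineering: Conference Proceedings
    • /
    • Society of Korea Industrial and Systems Engineering 2002 Spring Conference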
    • /
    • pp.119-124
    • /
    • 2002
  • Clustering of data is of great interest in many data mining applications. In the field of document clustering, a document is represented as a data point in a high-dimensional space. Therefore, document clustering can be accomplished with general data clustering techniques. In this paper, we introduce a document clustering system based on similarity among documents. The developed system consists of three functions: 1) gathering documents using a search agent; 2) determining similarity coefficients between any two documents from term frequencies; 3) clustering documents with the similarity coefficients. In particular, the document clustering is accomplished by a hybrid algorithm that combines genetic and K-Means methods.
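Step 2 in the abstract above (similarity coefficients computed from term frequencies) might look roughly like the sketch below. Cosine similarity and whitespace tokenization are assumptions here, since the abstract does not say which coefficient or tokenizer the system uses; `term_frequencies` and `cosine_similarity` are hypothetical helper names.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)


docs = [
    "spectral clustering of graphs",
    "clustering of documents",
    "video shot detection",
]
tfs = [term_frequencies(d) for d in docs]
sim = [[cosine_similarity(a, b) for b in tfs] for a in tfs]
```

The resulting matrix `sim` holds one coefficient per document pair and is the kind of input a downstream clustering step, such as the hybrid method mentioned in the abstract, would consume.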

Shot Group and Representative Shot Frame Detection using Similarity-based Clustering

  • Lee, Gye-Sung
    • Journal of the Korea Society of Computer and Information
    • /
    • Vol. 21, No. 9
    • /
    • pp.37-43
    • /
    • 2016
  • This paper introduces a method for video shot group detection, which is needed for efficient management and summarization of video. The proposed method detects shots based on low-level visual properties and performs temporal and spatial clustering based on the visual similarity of neighboring shots. Shot groups created from temporal clustering are further clustered into small groups with respect to visual similarity. A set of representative shot frames is selected from each cluster of the smaller groups representing a scene. Shots excluded from temporal clustering are also clustered into groups from which representative shot frames are selected. A number of video clips were collected and used to evaluate the accuracy of shot group detection. The method achieved 91% accuracy in shot group detection. The number of representative shot frames is reduced to 1/3 of the total shot frames. The experiment also shows an inverse relationship between accuracy and compression rate.

Transactions Clustering based on Item Similarity

  • 이상욱;김재련
    • Journal of Intelligence and Information Systems
    • /
    • Vol. 9, No. 1
    • /
    • pp.179-193
    • /
    • 2003
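  • Clustering groups similar objects from a given set into several groups in order to understand the character of each group, so in practice it requires a tool that can measure whether objects are similar. In conventional clustering, similarity meant that the more attribute values the objects in a cluster share, the higher their similarity, so that highly similar objects form a cluster. In particular, clustering of categorical attributes measures similarity by scoring 1 when two attribute values are the same and 0 when they differ. The proposed algorithm points out the problem with representing attribute values only as 0 and 1, and introduces the concept that even different attributes can be closely related, thereby showing how similar they are. Even when objects do not share a single attribute with the same value, similar objects can form one cluster according to the computed similarity. After building such an algorithm, appropriate target marketing can be carried out according to the needs and purchase preferences of the customers in each cluster.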
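The conventional 0/1 matching for categorical attributes that this abstract criticizes can be sketched as below. The function name `simple_matching_similarity`, the averaging over attributes, and the example records are assumptions for illustration; the item similarity that the paper proposes is not reproduced here because the abstract does not define it.

```python
def simple_matching_similarity(a, b):
    """Fraction of attribute positions where two records hold the same value.

    Each position scores 1 if the values are identical and 0 otherwise.
    Averaging the scores is an assumption made for this sketch.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical customer records: (color, size, fabric).
customer_a = ("red", "small", "cotton")
customer_b = ("red", "large", "cotton")
customer_c = ("crimson", "medium", "linen")

sim_ab = simple_matching_similarity(customer_a, customer_b)
sim_ac = simple_matching_similarity(customer_a, customer_c)
```

Here `customer_a` and `customer_c` score 0 even though "red" and "crimson" are arguably related values, which is the kind of limitation the abstract describes.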

A Density Peak Clustering Algorithm Based on Information Bottleneck

  • Yongli Liu;Congcong Zhao;Hao Chao
    • Journal of Information Processing Systems
    • /
    • Vol. 19, No. 6
    • /
    • pp.778-790
    • /
    • 2023
  • Although density peak clustering can often easily yield excellent results, there is still room for improvement when dealing with complex, high-dimensional datasets. One of the main limitations of this algorithm is its reliance on geometric distance as the sole similarity measurement. To address this limitation, we draw inspiration from the information bottleneck theory, and propose a novel density peak clustering algorithm that incorporates this theory as a similarity measure. Specifically, our algorithm utilizes the joint probability distribution between data objects and feature information, and employs the loss of mutual information as the measurement standard. This approach not only eliminates the potential for subjective error in selecting a similarity method, but also enhances performance on datasets with multiple centers and high dimensionality. To evaluate the effectiveness of our algorithm, we conducted experiments using ten carefully selected datasets and compared the results with three other algorithms. The experimental results demonstrate that our information bottleneck-based density peaks clustering (IBDPC) algorithm consistently achieves high levels of accuracy, highlighting its potential as a valuable tool for data clustering tasks.
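The "loss of mutual information" mentioned in the abstract above can be illustrated with the merge cost used in agglomerative information bottleneck methods: the drop in I(X;Y) when two data objects are merged into one. This is a minimal sketch under that assumption. The function names and the toy joint distribution are hypothetical, and how IBDPC actually plugs this quantity into density peak clustering is not stated in the abstract.

```python
import math


def mutual_information(joint):
    """I(X;Y) in nats for a joint distribution given as joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal over objects
    py = [sum(col) for col in zip(*joint)]    # marginal over features
    mi = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (px[x] * py[y]))
    return mi


def merge_rows(joint, i, j):
    """Return a new joint distribution with objects i and j merged."""
    merged = [a + b for a, b in zip(joint[i], joint[j])]
    rest = [row for idx, row in enumerate(joint) if idx not in (i, j)]
    return rest + [merged]


def information_loss(joint, i, j):
    """Mutual information lost by merging objects i and j (smaller = more similar)."""
    return mutual_information(joint) - mutual_information(merge_rows(joint, i, j))


# Toy joint distribution p(object, feature); all entries sum to 1.
joint = [
    [0.20, 0.05],
    [0.20, 0.05],
    [0.05, 0.45],
]
loss_similar = information_loss(joint, 0, 1)
loss_different = information_loss(joint, 0, 2)
```

Objects 0 and 1 have the same conditional feature distribution, so merging them loses essentially no information, whereas merging objects 0 and 2 loses a noticeable amount. A small loss therefore plays the role of a high similarity.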

Regional Grouping of the interconnected network system through Sequential Clustering

  • 김현홍;송형용;김진호;박종배;신중린
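    • Korean Institute of Electrical Engineers: Conference Proceedings
    • /
    • Korean Institute of Electrical Engineers 2007 Fall Conference Proceedings, Power Technology Section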
    • /
    • pp.252-254
    • /
    • 2007
  • This paper introduces sequential clustering as a tool for the effective clustering of mass unit electrical systems. The interconnected network system retains information about the location of each line. With this information, this paper aims to carry out initial clustering through the transmission usage rate, compare the results of similarity measures for regional information with those of similarity measures for regional price, and describe the details of the clustering method. This transmission usage rate used power flow based on congestion costs and modified similarity measurements using the FCM algorithm. This paper also aims to demonstrate the validity of the proposed clustering method by comparing it with existing clustering methods that use the similarity measurement system. The proposed algorithm is demonstrated on the IEEE 39-bus RTS.
