• Title/Summary/Keyword: 데이터 유사도

Search Result 3,346, Processing Time 0.033 seconds

SVM based Clustering Technique for Processing High Dimensional Data (고차원 데이터 처리를 위한 SVM기반의 클러스터링 기법)

  • Kim, Man-Sun;Lee, Sang-Yong
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.14 no.7
    • /
    • pp.816-820
    • /
    • 2004
  • Clustering is a process of dividing similar data objects in data set into clusters and acquiring meaningful information in the data. The main issues related to clustering are the effective clustering of high dimensional data and optimization. This study proposed a method of measuring similarity based on SVM and a new method of calculating the number of clusters in an efficient way. The high dimensional data are mapped to Feature Space ones using kernel functions and then similarity between neighboring clusters is measured. As for created clusters, the desired number of clusters can be got using the value of similarity measured and the value of Δd. In order to verify the proposed methods, the author used data of six UCI Machine Learning Repositories and obtained the presented number of clusters as well as improved cohesiveness compared to the results of previous researches.

A Clustering Scheme Considering the Structural Similarity of Metadata in Smartphone Sensing System (스마트폰 센싱에서 메타데이터의 구조적 유사도를 고려한 클러스터링 기법)

  • Min, Hong;Heo, Junyoung
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.14 no.6
    • /
    • pp.229-234
    • /
    • 2014
  • As association between sensor networks that collect environmental information by using numberous sensor nodes and smartphones that are equipped with various sensors, many applications understanding users' context have been developed to interact users and their environments. Collected data should be stored with XML formatted metadata containing semantic information to share the collected data. In case of distance based clustering schemes, the efficiency of data collection decreases because metadata files are extended and changed as the purpose of each system developer. In this paper, we proposed a clustering scheme considering the structural similarity of metadata to reduce clustering construction time and improve the similarity of metadata among member nodes in a cluster.

Korean Symptom-Based Disease Prediction Model according to Input Data Format and Positive/Negative (입력 데이터 형식 및 Positive/Negative에 따른 한국어 증상 기반 질병 예측 모델)

  • Min-Jung Kim;In-Whee Joe
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.11a
    • /
    • pp.418-421
    • /
    • 2023
  • 본 논문은 Word2Vec를 이용하여 한국어 증상 기반 질병 예측 모델을 제시한다. 아산병원 질환 백과의 크롤링 데이터를 세 가지 형식으로 나누어, 모델에 알맞은 데이터 형식을 찾고 모델에 적용한다. 가장 모델에 맞는 데이터 형식은 증상별 질병과 질병별 증상을 합친 경우이다. 데이터의 양을 늘려 임베딩 스페이스를 넓혔고, 가장 중요한 증상과 질병의 유사도도 정확하게 출력되었다. 이는 유사도가 높은 질병과 증상들이 제대로 학습이 되었다는 것을 알 수 있다. 이렇게 만들어진 예측 모델에 positive 증상을 입력하면 유사도가 향상되고, negative에 입력하면 하락하는 결과를 확인했다. 따라서 환자의 증상을 positive에 넣으면, 그 증상을 가진 질병이 가까워지는 반면, 환자의 증상이 아닌 증상을 negative에 넣으면, 환자에게 맞지 않는 질병이 멀어진다. 그러므로 환자의 상태에 맞는 질병을 유추해, 의사나 환자가 증상에 대한 질병을 알고 싶을 때 또는 검색에 유용하게 사용할 수 있다. 더불어, 질병의 진료과 데이터를 추가하여, 환자에게 맞는 진료과를 찾는 데도 도움을 줄 수 있다.

A Region Based Similar Image Retrieval using Histogram Comparison (히스토그램 비교법을 이용한 영역기반 유사 이미지 검색)

  • 임동혁;김창룡;정진완
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2000.10a
    • /
    • pp.130-132
    • /
    • 2000
  • 주요 멀티미디어 자료인 이미지는 데이터 특성을 표현하기가 어렵고, 특성추출에서 얻은 데이터가 너무 고차원적이라 이를 저차원의 처리가능한 데이터로 변환하는 과정에서 많은 손실이 있다. 이미지의 특성값을 전체 이미지의 평균값으로 변경하여 저차원 데이터를 얻는 기존의 이미지 전체 특성추출기법이나 고정된 블록의 평균값으로 변경하여 저차원 데이터를 얻는 이미지 블록 특성추출기법은 유사 이미지의 검색이 부정확하다는 단점이 있다. 본 논문에서는 이미지를 가변적인 영역으로 나누어 특성값을 얻고, 히스토그램을 이용하여 효율적으로 유사 이미지를 찾는 영역기반 유사 이미지 검색기법을 제안하고 이를 구현하였다.

  • PDF

The segmentation of Korean word for the lip-synch application (Lip-synch application을 위한 한국어 단어의 음소분할)

  • 강용성;고한석
    • Proceedings of the IEEK Conference
    • /
    • 2001.09a
    • /
    • pp.509-512
    • /
    • 2001
  • 본 논문은 한국어 음성에 대한 한국어 단어의 음소단위 분할을 목적으로 하였다. 대상 단어는 원광대학교 phonetic balanced 452단어 데이터 베이스를 사용하였고 분할 단위는 음성 전문가에 의해 구성된 44개의 음소셋을 사용하였다. 음소를 분할하기 위해 음성을 각각 프레임으로 나눈 후 각 프레임간의 스펙트럼 성분의 유사도를 측정한 후 측정한 유사도를 기준으로 음소의 분할점을 찾았다. 두 프레임 간의 유사도를 결정하기 위해 두 벡터 상호간의 유사성을 결정하는 방법중의 하나인 Lukasiewicz implication을 사용하였다. 본 실험에서는 기존의 프레임간 스펙트럼 성분의 유사도 측정을 이용한 하나의 어절의 유/무성음 분할 방법을 본 실험의 목적인 한국어 단어의 음소 분할 실험에 맞도록 수정하였다. 성능평가를 위해 음성 전문가에 의해 손으로 분할된 데이터와 본 실험을 통해 얻은 데이터와의 비교를 하여 평가를 하였다. 실험결과 전문가가 직접 손으로 분할한 데이터와 비교하여 32ms이내로 분할된 비율이 최고 84.76%를 나타내었다.

  • PDF

A Method of Reducing the Processing Cost of Similarity Queries in Databases (데이터베이스에서 유사도 질의 처리 비용 감소 방법)

  • Kim, Sunkyung;Park, Ji Su;Shon, Jin Gon
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.4
    • /
    • pp.157-162
    • /
    • 2022
  • Today, most data is stored in a database (DB). In the DB environment, the users requests the DB to find the data they wants. Similarity Query has predicate that explained by a similarity. However, in the process of processing the similarity query, it is difficult to use an index that can reduce the range of processed records, so the cost of calculating the similarity for all records in the table is high each time. To solve this problem, this paper defines a lightweight similarity function. The lightweight similarity function has lower data filtering accuracy than the similarity function, but consumes less cost than the similarity function. We present a method for reducing similarity query processing cost by using the lightweight similarity function features. Then, Chebyshev distance is presented as a lightweight similarity function to the Euclidean distance function, and the processing cost of a query using the existing similarity function and a query using the lightweight similarity function is compared. And through experiments, it is confirmed that the similarity query processing cost is reduced when Chebyshev distance is applied as a lightweight similarity function for Euclidean similarity.

Practical Datasets for Similarity Measures and Their Threshold Values (유사도 측정 데이터 셋과 쓰레숄드)

  • Yang, Byoungju;Shim, Junho
    • The Journal of Society for e-Business Studies
    • /
    • v.18 no.1
    • /
    • pp.97-105
    • /
    • 2013
  • In the e-business domain where data objects are quantitatively large, measuring similarity to find the same or similar objects is important. It basically requires comparing and computing the features of objects in pairs, and therefore takes longer time as the amount of data becomes bigger. Recent studies have shown various algorithms to efficiently perform it. Most of them show their performance superiority by empirical tests over some sets of data. In this paper, we introduce those data sets, present their characteristics and the meaningful threshold values that each of data sets contain in nature. The analysis on practical data sets with respect to their threshold values may serve as a referential baseline to the future experiments of newly developed algorithms.

A study on the ordering of similarity measures with negative matches (음의 일치 빈도를 고려한 유사성 측도의 대소 관계 규명에 관한 연구)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.1
    • /
    • pp.89-99
    • /
    • 2015
  • The World Economic Forum and the Korean Ministry of Knowledge Economy have selected big data as one of the top 10 in core information technology. The key of big data is to analyze effectively the properties that do have data. Clustering analysis method of big data techniques is a method of assigning a set of objects into the clusters so that the objects in the same cluster are more similar to each other clusters. Similarity measures being used in the cluster analysis may be classified into various types depending on the nature of the data. In this paper, we studied upper and lower bounds for binary similarity measures with negative matches such as Russel and Rao measure, simple matching measure by Sokal and Michener, Rogers and Tanimoto measure, Sokal and Sneath measure, Hamann measure, and Baroni-Urbani and Buser mesures I, II. And the comparative studies with these measures were shown by real data and simulated experiment.

Video Classification System Based on Similarity Representation Among Sequential Data (순차 데이터간의 유사도 표현에 의한 동영상 분류)

  • Lee, Hosuk;Yang, Jihoon
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.7 no.1
    • /
    • pp.1-8
    • /
    • 2018
  • It is not easy to learn simple expressions of moving picture data since it contains noise and a lot of information in addition to time-based information. In this study, we propose a similarity representation method and a deep learning method between sequential data which can express such video data abstractly and simpler. This is to learn and obtain a function that allow them to have maximum information when interpreting the degree of similarity between image data vectors constituting a moving picture. Through the actual data, it is confirmed that the proposed method shows better classification performance than the existing moving image classification methods.

Hierarchical Clustering of Symbolic Objects based on Asymmetric Proximity (비대칭적 유사도 기반의 심볼릭 객체의 계층적 클러스터링)

  • Oh, Seung-Joon;Park, Chan-Woong
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.22 no.6
    • /
    • pp.729-734
    • /
    • 2012
  • Clustering analysis has been widely used in numerous applications like pattern recognition, data analysis, intrusion detection, image processing, bioinformatics and so on. Much of previous work has been based on the numeric data only. However, symbolic data analysis has emerged to deal with variables that can have intervals, histograms, and even functions as values. In this paper, we propose a non symmetric proximity based clustering approach for symbolic objects. A method for clustering symbolic patterns based on the average similarity value(ASV) is explored. The results of the proposed clustering method differ from those of the existing methods and the results are very encouraging.