• Title/Summary/Keyword: High Dimensionality Data

Search results: 122 (processing time: 0.033 s)

Comprehensive review on Clustering Techniques and its application on High Dimensional Data

  • Alam, Afroj;Muqeem, Mohd;Ahmad, Sultan
    • International Journal of Computer Science & Network Security
    • /
    • Vol. 21, No. 6
    • /
    • pp.237-244
    • /
    • 2021
  • Clustering is one of the most powerful unsupervised machine learning techniques for dividing instances into homogeneous groups, called clusters. It is mainly used to generate good-quality clusters through which hidden patterns and knowledge can be discovered in large datasets. It has many applications in fields such as medicine, healthcare, gene expression, image processing, agriculture, fraud detection, and profitability analysis. The goal of this paper is to explore both hierarchical and partitioning clustering and to understand their problems along with various approaches to solving them. Among the clustering methods, K-means is better than the others because of its linear time complexity. This paper also focuses on data mining for high-dimensional datasets, covering their problems and the relevance of existing approaches.

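Rediscovering the 'Life Satisfaction Scale for the Elderly' through an Exploration of Structural Dimensionality: Focusing on Choi Sung-Jae's 'Life Satisfaction Scale for the Elderly' (Life Satisfaction Scale for Elderly : Revisited)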

  • 최혜지;이영분
    • 한국사회복지학
    • /
    • Vol. 58, No. 3
    • /
    • pp.27-49
    • /
    • 2006
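  • This study aims to revalidate Choi Sung-Jae's 'Life Satisfaction Scale for the Elderly' by exploring its structural dimensionality. To this end, data from 275 people aged 65 or older living in the Chungju area were analyzed. The results showed that the 'Life Satisfaction Scale for the Elderly' has a multidimensional structure consisting of three theoretical constructs. The identified constructs were named 'positive affect and subjective satisfaction', 'negative self-image and negative affect', and 'self-worth'. All three constructs showed high reliability, and their 'validity based on internal structure' was also found to be high. 'Positive affect and subjective satisfaction' and 'negative self-image and negative affect' showed both high convergent validity and high discriminant validity. 'Self-worth' showed high convergent validity, whereas its discriminant validity was relatively low. Contrary to the view of the developer, who presented the scale as unidimensional, the results of this study verify the multidimensional structure of the scale and thereby support a number of previous studies. Finally, the reasons why the structural dimensionality of the scale differed between the developer's study and this study are discussed.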

Speaker Identification Using GMM Based on Local Fuzzy PCA

  • 이기용
    • 음성과학
    • /
    • Vol. 10, No. 4
    • /
    • pp.159-166
    • /
    • 2003
  • To reduce the high dimensionality of the feature vectors required for training in speaker identification, we propose an efficient GMM based on local PCA with fuzzy clustering. The proposed method first partitions the data space into several disjoint clusters by fuzzy clustering and then performs PCA using the fuzzy covariance matrix in each cluster. Finally, the GMM for each speaker is obtained from the transformed, reduced-dimension feature vectors in each cluster. Compared with the conventional GMM with a diagonal covariance matrix, the proposed method needs less storage and runs faster at the same performance.

Study on Failure Classification of Missile Seekers Using Inspection Data from Production and Manufacturing Phases

  • 정예은;김기현;김성목;이연호;김지원;용화영;정재우;박정원;김용수
    • 산업경영시스템학회지
    • /
    • Vol. 47, No. 2
    • /
    • pp.30-39
    • /
    • 2024
  • This study introduces a novel approach for identifying potential failure risks in missile manufacturing by leveraging Quality Inspection Management (QIM) data to address the challenges presented by a dataset comprising 666 variables and data imbalances. Using SMOTE for data augmentation and Lasso regression for dimensionality reduction, followed by a Random Forest model, yields 99.40% accuracy in classifying missiles with a high likelihood of failure. Such measures enable the preemptive identification of missiles at a heightened risk of failure, thereby mitigating the risk of field failures and enhancing missile life. The combination of Lasso regression and Random Forest is used to pinpoint critical variables and test items that significantly impact failure, with a particular emphasis on variables related to performance and connection resistance. Moreover, the research highlights the potential for broadening the scope of data-driven decision-making within quality control systems, including the refinement of maintenance strategies and the adjustment of control limits for essential test items.
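
The abstract names SMOTE but gives no configuration, so the following is only a minimal sketch of the interpolation idea behind it, not the authors' pipeline. The function name `smote`, the neighbour count `k`, the seed, and the toy two-dimensional minority samples are all assumptions made for illustration; the Lasso and Random Forest stages are not shown.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples (e.g. failed units), two features each
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote(minority, n_new=6)
```

Each synthetic point lies on a segment between two real minority samples, which is why the augmented class stays inside the region the original samples occupy.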

A Density Peak Clustering Algorithm Based on Information Bottleneck

  • Yongli Liu;Congcong Zhao;Hao Chao
    • Journal of Information Processing Systems
    • /
    • Vol. 19, No. 6
    • /
    • pp.778-790
    • /
    • 2023
  • Although density peak clustering can often easily yield excellent results, there is still room for improvement when dealing with complex, high-dimensional datasets. One of the main limitations of this algorithm is its reliance on geometric distance as the sole similarity measurement. To address this limitation, we draw inspiration from the information bottleneck theory and propose a novel density peak clustering algorithm that incorporates this theory as a similarity measure. Specifically, our algorithm utilizes the joint probability distribution between data objects and feature information, and employs the loss of mutual information as the measurement standard. This approach not only eliminates the potential for subjective error in selecting a similarity method, but also enhances performance on datasets with multiple centers and high dimensionality. To evaluate the effectiveness of our algorithm, we conducted experiments using ten carefully selected datasets and compared the results with three other algorithms. The experimental results demonstrate that our information bottleneck-based density peaks clustering (IBDPC) algorithm consistently achieves high levels of accuracy, highlighting its potential as a valuable tool for data clustering tasks.
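
For orientation, here is a rough sketch of the baseline density peak clustering that the abstract sets out to improve, using plain Euclidean distance. It is not the IBDPC algorithm: the mutual-information-loss measure is not reproduced, and the function name `density_peaks`, the cutoff `d_c`, and the toy points are assumptions. The `dist` parameter only marks where a different similarity measure could be substituted.

```python
import math


def density_peaks(points, d_c, n_centers, dist=math.dist):
    """Baseline density peak clustering; returns one cluster label per point."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Local density: number of other points closer than the cutoff d_c
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < d_c) for i in range(n)]
    # Visit points from densest to sparsest (ties broken by index)
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n   # distance to the nearest point that ranks higher in density
    parent = [-1] * n   # index of that nearest higher-ranked point
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(d[i])  # convention for the densest point
        else:
            j = min(order[:rank], key=lambda m: d[i][m])
            delta[i], parent[i] = d[i][j], j
    # Centers combine high density with a large distance to any denser point
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_centers]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Every other point inherits the label of its nearest higher-ranked neighbour
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


# Two hypothetical, well-separated groups of five points each
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5)]
labels = density_peaks(points, d_c=1.0, n_centers=2)
```

On this toy input the two middle points become the centers and each group receives its own label.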

An Efficient Content-Based High-Dimensional Index Structure for Image Data

  • Lee, Jang-Sun;Yoo, Jae-Soo;Lee, Seok-Hee;Kim, Myung-Joon
    • ETRI Journal
    • /
    • Vol. 22, No. 2
    • /
    • pp.32-42
    • /
    • 2000
  • The existing multi-dimensional index structures are not adequate for indexing higher-dimensional data sets. Although conceptually they can be extended to higher dimensionalities, they usually require time and space that grow exponentially with the dimensionality. In this paper, we analyze the existing index structures and derive some requirements of an index structure for content-based image retrieval. We also propose a new structure for indexing large amounts of point data in a high-dimensional space that satisfies the requirements. To justify the performance of the proposed structure, we compare it with the existing index structures in various environments. We show, through experiments, that our proposed structure outperforms the existing structures in terms of retrieval time and storage overhead.

Impact of Instance Selection on kNN-Based Text Categorization

  • Barigou, Fatiha
    • Journal of Information Processing Systems
    • /
    • Vol. 14, No. 2
    • /
    • pp.418-434
    • /
    • 2018
  • With the increasing use of the Internet and electronic documents, automatic text categorization has become imperative. Several machine learning algorithms have been proposed for text categorization. The k-nearest neighbor algorithm (kNN) is known to be one of the best state-of-the-art classifiers when used for text categorization. However, kNN suffers from limitations such as a high computational cost when classifying new instances. Instance selection techniques have emerged as highly competitive methods to improve kNN through data reduction. However, previous works have evaluated those approaches only on structured datasets. In addition, their performance has not been examined in the text categorization domain, where the dimensionality and size of the dataset are very high. Motivated by these observations, this paper investigates and analyzes the impact of instance selection on kNN-based text categorization in terms of various aspects such as classification accuracy, classification efficiency, and data reduction.
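
The abstract does not say which instance selection techniques were evaluated. As one classic example of the idea, the sketch below uses a condensed-nearest-neighbour style selection: it keeps only the training instances that a 1-NN classifier needs in order to label the rest correctly, then classifies with kNN on the reduced set. The function names, the toy two-dimensional vectors (standing in for high-dimensional term vectors), and the labels are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train, x, k=1):
    """Majority label among the k training instances nearest to x."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def condense(train):
    """Keep instances that 1-NN on the current store would misclassify."""
    store = [train[0]]
    changed = True
    while changed:
        changed = False
        for item in train:
            if item not in store and knn_predict(store, item[0]) != item[1]:
                store.append(item)
                changed = True
    return store


# Hypothetical (vector, category) pairs standing in for document vectors
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.9, 5.2), "b")]
reduced = condense(train)
prediction = knn_predict(reduced, (4.8, 5.0))
```

Classifying a new instance then compares it against `reduced` instead of the full training set, which is where the efficiency gain discussed in the abstract would come from.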

Efficient estimation and variable selection for partially linear single-index-coefficient regression models

  • Kim, Young-Ju
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 26, No. 1
    • /
    • pp.69-78
    • /
    • 2019
  • A structured model with both single-index and varying coefficients is a powerful tool in modeling high dimensional data. It has been widely used because the single-index can overcome the curse of dimensionality and varying coefficients can allow nonlinear interaction effects in the model. For high dimensional index vectors, variable selection becomes an important question in the model building process. In this paper, we propose an efficient estimation and a variable selection method based on a smoothing spline approach in a partially linear single-index-coefficient regression model. We also propose an efficient algorithm for simultaneously estimating the coefficient functions in a data-adaptive lower-dimensional approximation space and selecting significant variables in the index with the adaptive LASSO penalty. The empirical performance of the proposed method is illustrated with simulated and real data examples.

Multifactor Dimensionality Reduction (MDR) Analysis to Detect Single Nucleotide Polymorphisms Associated with a Carcass Trait in a Hanwoo Population

  • Lee, Jea-Young;Kwon, Jae-Chul;Kim, Jong-Joo
    • Asian-Australasian Journal of Animal Sciences
    • /
    • Vol. 21, No. 6
    • /
    • pp.784-788
    • /
    • 2008
  • Studies to detect genes responsible for economic traits in farm animals have been performed using parametric linear models. A non-parametric, model-free approach using the 'expanded multifactor-dimensionality reduction (MDR) method', which considers high dimensionalities of interaction effects between multiple single nucleotide polymorphisms (SNPs), was applied to identify interaction effects of SNPs responsible for carcass traits in a Hanwoo beef cattle population. Data were obtained from the Hanwoo Improvement Center, National Agricultural Cooperation Federation, Korea, and comprised 299 steers from 16 paternal half-sib proven sires that were delivered in Namwon or Daegwanryong livestock testing stations between spring of 2002 and fall of 2003. For each steer at approximately 722 days of age, the Longissimus dorsi muscle area (LMA) was measured after slaughter. Three functional SNPs (19_1, 18_4, 28_2) near the microsatellite marker ILSTS035 on BTA6, around which the QTL for meat quality were previously detected, were assessed. Application of the expanded MDR method revealed the best model with an interaction effect between the SNPs 19_1 and 28_2, while only one main effect of SNP19_1 was statistically significant for LMA (p<0.01) under a general linear mixed model. Our results suggest that the expanded MDR method better identifies interaction effects between multiple genes that are related to polygenic traits, and that the method is an alternative to the current model choices to find associations of multiple functional SNPs and/or their interaction effects with economic traits in livestock populations.

An Experimental Study on Smoothness Regularized LDA in Hyperspectral Data Classification

  • 박래정
    • 한국지능시스템학회논문지
    • /
    • Vol. 20, No. 4
    • /
    • pp.534-540
    • /
    • 2010
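  • High dimensionality and high correlation are the main characteristics of hyperspectral data. LDA and its variant linear projection methods have been used to extract low-dimensional features from high-dimensional spectral information. LDA suffers from poor generalization performance because of the overfitting that commonly occurs when training data are scarce, and LDA regularization methods have been proposed to mitigate this problem. Among them, LDA regularization based on a smoothness constraint is a feature extraction technique suited to the highly correlated nature of hyperspectral data. This paper introduces the smoothness-constrained LDA regularization method for hyperspectral data classification and experimentally analyzes its performance under different training data conditions. In addition, it presents a dual smoothness LDA regularization technique that exploits the correlation of both spectral and spatial information to improve classification performance.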