• Title/Summary/Keyword: High Dimensional Data

Search Result 1,553, Processing Time 0.027 seconds

A Cyclic Sliced Partitioning Method for Packing High-dimensional Data (고차원 데이타 패킹을 위한 주기적 편중 분할 방법)

  • 김태완;이기준
    • Journal of KIISE:Databases
    • /
    • v.31 no.2
    • /
    • pp.122-131
    • /
    • 2004
  • Traditional works on indexing have been suggested for low dimensional data under dynamic environments. But recent database applications require efficient processing of huge sire of high dimensional data under static environments. Thus many indexing strategies suggested especially in partitioning ones do not adapt to these new environments. In our study, we point out these facts and propose a new partitioning strategy, which complies with new applications' requirements and is derived from analysis. As a preliminary step to propose our method, we apply a packing technique on the one hand and exploit observations on the Minkowski-sum cost model on the other, under uniform data distribution. Observations predict that unbalanced partitioning strategy may be more query-efficient than balanced partitioning strategy for high dimensional data. Thus we propose our method, called CSP (Cyclic Spliced Partitioning method). Analysis on this method explicitly suggests metrics on how to partition high dimensional data. By the cost model, simulations, and experiments, we show excellent performance of our method over balanced strategy. By experimental studies on other indices and packing methods, we also show the superiority of our method.

An SVD-Based Approach for Generating High-Dimensional Data and Query Sets (SVD를 기반으로 한 고차원 데이터 및 질의 집합의 생성)

  • 김상욱
    • The Journal of Information Technology and Database
    • /
    • v.8 no.2
    • /
    • pp.91-101
    • /
    • 2001
  • Previous research efforts on performance evaluation of multidimensional indexes typically have used synthetic data sets distributed uniformly or normally over multidimensional space. However, recent research research result has shown that these hinds of data sets hardly reflect the characteristics of multimedia database applications. In this paper, we discuss issues on generating high dimensional data and query sets for resolving the problem. We first identify the features of the data and query sets that are appropriate for fairly evaluating performances of multidimensional indexes, and then propose HDDQ_Gen(High-Dimensional Data and Query Generator) that satisfies such features. HDDQ_Gen supports the following features : (1) clustered distributions, (2) various object distributions in each cluster, (3) various cluster distributions, (4) various correlations among different dimensions, (5) query distributions depending on data distributions. Using these features, users are able to control tile distribution characteristics of data and query sets. Our contribution is fairly important in that HDDQ_Gen provides the benchmark environment evaluating multidimensional indexes correctly.

  • PDF

A Movie Recommendation System processing High-Dimensional Data with Fuzzy-AHP and Fuzzy Association Rules (퍼지 AHP와 퍼지 연관규칙을 이용하여 고차원 데이터를 처리하는 영화 추천 시스템)

  • Oh, Jae-Taek;Lee, Sang-Yong
    • Journal of Digital Convergence
    • /
    • v.17 no.2
    • /
    • pp.347-353
    • /
    • 2019
  • Recent recommendation systems are developing toward the utilization of high-dimensional data. However, high-dimensional data can increase algorithm complexity by expanding dimensions and be lower the accuracy of recommended items. In addition, it can cause the problem of data sparsity and make it difficult to provide users with proper recommended items. This study proposed an algorithm that classify users' subjective data with objective criteria with fuzzy-AHP and make use of rules with repetitive patterns through fuzzy association rules. Trying to check how problems with high-dimensional data would be mitigated by the algorithm, we performed 5-fold cross validation according to the changing number of users. The results show that the algorithm-applied system recorded accuracy that was 12.5% higher than that of the fuzzy-AHP-applied system and mitigated the problem of data sparsity.

Comparison of tree-based ensemble models for regression

  • Park, Sangho;Kim, Chanmin
    • Communications for Statistical Applications and Methods
    • /
    • v.29 no.5
    • /
    • pp.561-589
    • /
    • 2022
  • When multiple classifications and regression trees are combined, tree-based ensemble models, such as random forest (RF) and Bayesian additive regression trees (BART), are produced. We compare the model structures and performances of various ensemble models for regression settings in this study. RF learns bootstrapped samples and selects a splitting variable from predictors gathered at each node. The BART model is specified as the sum of trees and is calculated using the Bayesian backfitting algorithm. Throughout the extensive simulation studies, the strengths and drawbacks of the two methods in the presence of missing data, high-dimensional data, or highly correlated data are investigated. In the presence of missing data, BART performs well in general, whereas RF provides adequate coverage. The BART outperforms in high dimensional, highly correlated data. However, in all of the scenarios considered, the RF has a shorter computation time. The performance of the two methods is also compared using two real data sets that represent the aforementioned situations, and the same conclusion is reached.

Comparative Analysis of Linear and Nonlinear Projection Techniques for the Best Visualization of Facial Expression Data (얼굴 표정 데이터의 최적의 가시화를 위한 선형 및 비선형 투영 기법의 비교 분석)

  • Kim, Sung-Ho
    • The Journal of the Korea Contents Association
    • /
    • v.9 no.9
    • /
    • pp.97-104
    • /
    • 2009
  • This paper describes comparison and analysis of methodology which enables us in order to search the projection technique of optimum for projection in the plane. For this methodology, we applies the high-dimensional facial motion capture data respectively in linear and nonlinear projection techniques. The one core element of the methodology is to applies the high-dimensional facial expression data of frame unit in PCA where is a linear projection technique and Isomap, MDS, CCA, Sammon's Mapping and LLE where are a nonlinear projection techniques. And another is to find out the methodology which distributes in this low-dimensional space, and analyze the result last. For this goal, we calculate the distance between the high-dimensional facial expression frame data of existing. And we distribute it in two-dimensional plane space to maintain the distance relationship between the high-dimensional facial expression frame data of existing like that from the condition which applies linear and nonlinear projection techniques. When comparing the facial expression data which distribute in two-dimensional space and the data of existing, we find out the projection technique to maintain the relationship of distance between the frame data like that in condition of optimum. Finally, this paper compare linear and nonlinear projection techniques to projection high-dimensional facial expression data in low-dimensional space and analyze it. And we find out the projection technique of optimum from it.

A Cell-based Clustering Method for Large High-dimensional Data in Data Mining (데이타마이닝에서 고차원 대용량 데이타를 위한 셀-기반 클러스터 링 방법)

  • Jin, Du-Seok;Chang, Jae-Woo
    • Journal of KIISE:Databases
    • /
    • v.28 no.4
    • /
    • pp.558-567
    • /
    • 2001
  • Recently, data mining applications require a large amount of high-dimensional data Most algorithms for data mining applications however, do not work efficiently of high-dimensional large data because of the so-called curse of dimensionality[1] and the limitation of available memory. To overcome these problems, this paper proposes a new cell-based clustering which is more efficient than the existing algorithms for high-dimensional large data, Our clustering method provides a cell construction algorithm for dealing with high-dimensional large data and a index structure based of filtering .We do performance comparison of our cell-based clustering method with the CLIQUE method in terms of clustering time, precision, and retrieval time. Finally, the results from our experiment show that our cell-based clustering method outperform the CLIQUE method.

  • PDF

Analyzing nuclear reactor simulation data and uncertainty with the group method of data handling

  • Radaideh, Majdi I.;Kozlowski, Tomasz
    • Nuclear Engineering and Technology
    • /
    • v.52 no.2
    • /
    • pp.287-295
    • /
    • 2020
  • Group method of data handling (GMDH) is considered one of the earliest deep learning methods. Deep learning gained additional interest in today's applications due to its capability to handle complex and high dimensional problems. In this study, multi-layer GMDH networks are used to perform uncertainty quantification (UQ) and sensitivity analysis (SA) of nuclear reactor simulations. GMDH is utilized as a surrogate/metamodel to replace high fidelity computer models with cheap-to-evaluate surrogate models, which facilitate UQ and SA tasks (e.g. variance decomposition, uncertainty propagation, etc.). GMDH performance is validated through two UQ applications in reactor simulations: (1) low dimensional input space (two-phase flow in a reactor channel), and (2) high dimensional space (8-group homogenized cross-sections). In both applications, GMDH networks show very good performance with small mean absolute and squared errors as well as high accuracy in capturing the target variance. GMDH is utilized afterward to perform UQ tasks such as variance decomposition through Sobol indices, and GMDH-based uncertainty propagation with large number of samples. GMDH performance is also compared to other surrogates including Gaussian processes and polynomial chaos expansions. The comparison shows that GMDH has competitive performance with the other methods for the low dimensional problem, and reliable performance for the high dimensional problem.

High-Dimensional Image Indexing based on Adaptive Partitioning ana Vector Approximation (적응 분할과 벡터 근사에 기반한 고차원 이미지 색인 기법)

  • Cha, Gwang-Ho;Jeong, Jin-Wan
    • Journal of KIISE:Databases
    • /
    • v.29 no.2
    • /
    • pp.128-137
    • /
    • 2002
  • In this paper, we propose the LPC+-file for efficient indexing of high-dimensional image data. With the proliferation of multimedia data, there Is an increasing need to support the indexing and retrieval of high-dimensional image data. Recently, the LPC-file (5) that based on vector approximation has been developed for indexing high-dimensional data. The LPC-file gives good performance especially when the dataset is uniformly distributed. However, compared with for the uniformly distributed dataset, its performance degrades when the dataset is clustered. We improve the performance of the LPC-file for the strongly clustered image dataset. The basic idea is to adaptively partition the data space to find subspaces with high-density clusters and to assign more bits to them than others to increase the discriminatory power of the approximation of vectors. The total number of bits used to represent vector approximations is rather less than that of the LPC-file since the partitioned cells in the LPC+-file share the bits. An empirical evaluation shows that the LPC+-file results in significant performance improvements for real image data sets which are strongly clustered.

Graphical method for evaluating the impact of influential observations in high-dimensional data (고차원 자료에서 영향점의 영향을 평가하기 위한 그래픽 방법)

  • Ahn, Sojin;Lee, Jae Eun;Jang, Dae-Heung
    • Journal of the Korean Data and Information Science Society
    • /
    • v.28 no.6
    • /
    • pp.1291-1300
    • /
    • 2017
  • In the high-dimensional data, the number of variables is very larger than the number of observations. In this case, the impact of influential observations on regression coefficient estimates can be very large. Jang and Anderson-Cook (2017) suggested the LASSO influence plot. In this paper, we propose the LASSO influence plot, LASSO variable selection ranking plot, and three-dimensional LASSO influence plot as graphical methods for evaluating the impact of influential observations in high-dimensional data. With real two high-dimensional data examples, we apply these graphical methods as the regression diagnostics tools for finding influential observations. It has been found that we can obtain influential observations with by these graphical methods.

Partial Dimensional Clustering based on Projection Filtering in High Dimensional Data Space (대용량의 고차원 데이터 공간에서 프로젝션 필터링 기반의 부분차원 클러스터링 기법)

  • 이혜명;정종진
    • The Journal of Society for e-Business Studies
    • /
    • v.8 no.4
    • /
    • pp.69-88
    • /
    • 2003
  • In high dimensional data, most of clustering algorithms tend to degrade the performance rapidly because of nature of sparsity and amount of noise. Recently, partial dimensional clustering algorithms have been studied, which have good performance in clustering. These algorithms select the dimensional data closely related to clustering but discard the dimensional data which are not directly related to clustering in entire dimensional data. However, the traditional algorithms have some problems. At first, the algorithms employ grid based techniques but the large amount of grids make worse the performance of algorithm in terms of computational time and memory space. Secondly, the algorithms explore dimensions related to clustering using k-medoid but it is very difficult to determine the best quality of k-medoids in large amount of high dimensional data. In this paper, we propose an efficient partial dimensional clustering algorithm which is called CLIP. CLIP explores dense regions for cluster on a certain dimension. Then, the algorithm probes dense regions on a next dimension. dependent on the dense regions of the explored dimension using incremental projection. CLIP repeats these probing work in all dimensions. Clustering by Incremental projection can prune the search space largely and reduce the computational time considerably. We evaluate the performance(efficiency, effectiveness and accuracy, etc.) of the proposed algorithm compared with other algorithms using common synthetic data.

  • PDF