• Title/Summary/Keyword: categorical values

A New Similarity Measure for Categorical Attribute-Based Clustering (범주형 속성 기반 군집화를 위한 새로운 유사 측도)

  • Kim, Min;Jeon, Joo-Hyuk;Woo, Kyung-Gu;Kim, Myoung-Ho
    • Journal of KIISE:Databases / v.37 no.2 / pp.71-81 / 2010
  • Clustering is widely used in numerous applications, such as pattern recognition, image analysis, and market analysis. The important factors that decide cluster quality are the similarity measure and the number of attributes. Similarity measures should be defined with respect to the data types. Existing similarity measures are well suited to numerical attribute values. However, those measures do not work well when the data is described by categorical attributes, that is, when there is no inherent similarity measure between values. In high-dimensional spaces, conventional clustering algorithms tend to break down because of the sparsity of data points. To overcome this difficulty, a subspace clustering approach has been proposed. It is based on the observation that different clusters may exist in different subspaces. In this paper, we propose a new similarity measure for clustering high-dimensional categorical data. The measure is defined based on the idea that a good clustering is one where each cluster has certain information that distinguishes it from other clusters. We also try to capture the attribute dependencies. This study is meaningful because no existing method uses both of them. Experimental results on real datasets show that the clusters obtained with our proposed similarity measure are good enough with respect to clustering accuracy.

Fluctuation of estimates in an EM procedure

  • Kim, Seong-Ho;Kim, Sung-Ho
    • Proceedings of the Korean Statistical Society Conference / 2003.05a / pp.157-162 / 2003
  • Estimates from an EM algorithm are somewhat sensitive to their initial values, and this sensitivity is more likely as the model becomes larger and more complicated. In this article, we examined how the estimates fluctuate during an EM procedure for a recursive model of categorical variables. We found that the fluctuation takes place mostly during the first half of the procedure and that it can be subdued by applying the Bayesian method of estimation. Both simulated data and real data are used for illustration.

An Association Discovery Algorithm Containing Quantitative Attributes with Item Constraints (수량적 속성을 포함하는 항목 제약을 고려한 연관규칙 마이닝 앨고리듬)

  • 한경록;김재련
    • Journal of Korean Society of Industrial and Systems Engineering / v.22 no.50 / pp.183-193 / 1999
  • The problem of discovering association rules has received considerable research attention, and several fast algorithms for mining association rules have been developed. In this paper, we propose an efficient algorithm for mining quantitative association rules with item constraints. For categorical attributes, we map the values of the attribute to a set of consecutive integers. For quantitative attributes, we can partition the attribute into values or ranges. While such constraints can be applied as a post-processing step, integrating them into the mining algorithm can reduce the execution time. We consider the problem of integrating constraints that are boolean expressions over the presence or absence of items containing quantitative attributes into the association discovery algorithm, using the Apriori concept.
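
The two preprocessing steps mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names, the sample data, and the cut points `[30, 50]` are assumptions made only for illustration.

```python
from bisect import bisect_right


def encode_categorical(values):
    """Map each distinct categorical value to a consecutive integer (0, 1, 2, ...)."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # next unused integer
    return [mapping[v] for v in values], mapping


def partition_quantitative(values, boundaries):
    """Assign each number to a range index, given sorted cut points (hypothetical)."""
    return [bisect_right(boundaries, x) for x in values]


colors = ["red", "blue", "red", "green"]
codes, mapping = encode_categorical(colors)

ages = [23, 35, 47, 61]
age_bins = partition_quantitative(ages, [30, 50])  # ranges: <30, 30-49, >=50

print(codes)     # [0, 1, 0, 2]
print(age_bins)  # [0, 1, 1, 2]
```

After this encoding, both kinds of attribute can be treated as integer-valued items by an Apriori-style miner.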

The Transform of Multidimensional Categorical Data and its Applications (다차원 범주형 자료의 변환과 그의 응용)

  • Ahn, Ju-Sun
    • The Korean Journal of Applied Statistics / v.20 no.3 / pp.585-595 / 2007
  • The squared Euclidean distance between values transformed by the P-matrix of Ahn et al. (2003) is proportional to the squared Euclidean distance between the cells' relative frequencies in two contingency tables. We propose a method that uses the PP-values for the analysis of modern poems and questionnaire data.
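
As a rough illustration of the quantity on the right-hand side of that proportionality, the sketch below computes the squared Euclidean distance between the cell relative frequencies of two contingency tables. The P-matrix transform itself is not defined in this abstract and is not reproduced here; the function names and the two 2x2 tables are assumptions.

```python
def relative_frequencies(table):
    """Flatten a contingency table into its cell relative frequencies."""
    total = sum(sum(row) for row in table)
    return [cell / total for row in table for cell in row]


def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Two hypothetical 2x2 contingency tables with the same layout.
table_a = [[10, 20], [30, 40]]
table_b = [[25, 25], [25, 25]]

d2 = squared_euclidean(relative_frequencies(table_a),
                       relative_frequencies(table_b))
print(round(d2, 6))  # 0.05
```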

Long-term Forecast of Seasonal Precipitation in Korea using the Large-scale Predictors (광역규모 예측인자를 이용한 한반도 계절 강수량의 장기 예측)

  • Kim, Hwa-Su;Kwak, Chong-Heum;So, Seon-Sup;Suh, Myoung-Seok;Park, Chung-Kyu;Kim, Maeng-Ki
    • Journal of the Korean earth science society / v.23 no.7 / pp.587-596 / 2002
  • A super ensemble model was developed for the seasonal prediction of regional precipitation in Korea using lag-correlated large-scale predictors, based on empirical orthogonal function (EOF) analysis and a multiple linear regression model. The predictability of this model was also evaluated by cross-validation. The correlation between the predicted and observed values from the super ensemble model was 0.73 in spring, 0.61 in summer, 0.69 in autumn, and 0.75 in winter. The predictability of categorical forecasting was also evaluated based on three classes (above normal, near normal, and below normal) that are clearly defined by threshold values specified a priori. Categorical forecasting by the super ensemble model has a hit rate ranging from 0.42 to 0.74 for seasonal precipitation.
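
The three-class hit rate described in this abstract can be sketched as follows. This is only an illustration of the general idea: the thresholds (90 and 110), the sample values, and the function names are assumptions, not figures from the paper.

```python
def to_category(value, lower, upper):
    """Classify a value as below, near, or above normal using fixed thresholds."""
    if value < lower:
        return "below"
    if value > upper:
        return "above"
    return "near"


def hit_rate(predicted, observed, lower, upper):
    """Fraction of cases where the predicted and observed categories agree."""
    hits = sum(
        to_category(p, lower, upper) == to_category(o, lower, upper)
        for p, o in zip(predicted, observed)
    )
    return hits / len(observed)


# Hypothetical seasonal precipitation values, as percent of normal.
predicted = [120.0, 80.0, 100.0, 140.0]
observed = [130.0, 95.0, 105.0, 90.0]

rate = hit_rate(predicted, observed, lower=90.0, upper=110.0)
print(rate)  # 0.5
```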

Discretization of Continuous-Valued Attributes considering Data Distribution (데이터 분포를 고려한 연속 값 속성의 이산화)

  • Lee, Sang-Hoon;Park, Jung-Eun;Oh, Kyung-Whan
    • Journal of the Korean Institute of Intelligent Systems / v.13 no.4 / pp.391-396 / 2003
  • This paper proposes a new approach that converts continuous-valued attributes to categorical-valued ones by considering the distribution of the target attributes (classes). In this approach, optimal interval boundaries can be obtained by considering the distribution of the data itself, without requiring any parameters. For each attribute, the distribution of the target attributes is projected onto a one-dimensional space. This space is then clustered according to criteria such as the density value of each target attribute and the amount of overlap among the density values of the target attributes. Clusters made in this way are based on the probabilities that can predict the target attribute of instances. Therefore, the method yields interval boundaries that minimize the loss of information from the original data. The improved performance of the proposed discretization method can be validated using the C4.5 algorithm and data sets from the UCI Machine Learning Repository.

Variable selection for latent class analysis using clustering efficiency (잠재변수 모형에서의 군집효율을 이용한 변수선택)

  • Kim, Seongkyung;Seo, Byungtae
    • The Korean Journal of Applied Statistics / v.31 no.6 / pp.721-732 / 2018
  • Latent class analysis (LCA) is an important tool for exploring unseen latent groups in multivariate categorical data. In practice, it is important to select a suitable set of variables, because including too many variables makes the model complicated and reduces the accuracy of the parameter estimates. Dean and Raftery (Annals of the Institute of Statistical Mathematics, 62, 11-35, 2010) proposed a headlong search algorithm based on Bayesian information criterion values to choose meaningful variables for LCA. In this paper, we propose a new variable selection procedure for LCA that utilizes posterior probabilities obtained from each fitted model. We propose a new statistic to measure the adequacy of LCA and develop a variable selection procedure. The effectiveness of the proposed method is also presented through some numerical studies.

A Geostatistical Block Simulation Approach for Generating Fine-scale Categorical Thematic Maps from Coarse-scale Fraction Data (저해상도 비율 자료로부터 고해상도 범주형 주제도 생성을 위한 지구통계학적 블록 시뮬레이션)

  • Park, No-Wook;Lee, Ki-Won
    • Journal of the Korean earth science society / v.32 no.6 / pp.525-536 / 2011
  • In any application using various types of spatial data, it is very important to account for the scale differences among available data sets and to change the scale to the target one as well. In this paper, we propose to use a geostatistical downscaling approach based on variogram deconvolution and block simulation to generate fine-scale categorical thematic maps from coarse-scale fraction data. First, an iterative variogram deconvolution method is applied to estimate a point-support variogram model from a block-support variogram model. Then, both a direct sequential simulation based on area-to-point kriging and the estimated point-support variogram are applied to produce alternative fine-scale fraction realizations. Finally, a maximum a posteriori decision rule is applied to generate the fine-scale categorical thematic maps. These analytical steps are illustrated through a case study of land-cover mapping using only the block fraction data of thematic classes, without point data. Alternative fine-scale fraction maps produced by the downscaling method presented in this study reproduce the coarse-scale block fraction values. The final fine-scale land-cover realizations can reflect the overall spatial patterns of the reference land-cover map, thus providing reasonable inputs for impact assessment in change-of-support problems.
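
The final step, turning fine-scale class fractions into a categorical map, can be sketched as a per-cell argmax. This is a simplified reading of a maximum a posteriori rule that treats the simulated fractions as class probabilities; the class names, sample fractions, and function name are assumptions, and the variogram deconvolution and simulation steps are not shown.

```python
def map_classify(fractions, classes):
    """Assign each cell the class with the largest fraction (treated as a probability)."""
    labels = []
    for cell in fractions:
        best = max(range(len(classes)), key=lambda k: cell[k])
        labels.append(classes[best])
    return labels


classes = ["forest", "urban", "water"]

# Hypothetical fine-scale fractions for three cells; each row sums to 1.
fine_scale_fractions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.3, 0.5, 0.2],
]

labels = map_classify(fine_scale_fractions, classes)
print(labels)  # ['forest', 'water', 'urban']
```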

On Combining MOS and Histogram in a Subjective Evaluation Method

  • Sehyug Kwon
    • Communications for Statistical Applications and Methods / v.2 no.2 / pp.176-183 / 1995
  • The mean opinion score (MOS) method has been used in many areas to quantify the opinions of respondents, not only in survey research but also in evaluating population parameters that are not measurable or are technically hard to measure. The histogram is an important graphical technique because of the role it plays in describing categorical as well as quantitative data. In the MOS method, the subjective opinions of respondents are quantified by opinion scores, and the arithmetic means of the opinion scores have been used to describe the population of interest. Since opinion scores are polytomous, the values of the arithmetic means have little meaning. In this paper, cumulative percentage curves as a function of the means of opinion scores are derived by combining the means of opinion scores and histograms. This is proposed for better interpretation of opinion scores in the MOS method, one of the subjective evaluation methods.
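
The two ingredients that this abstract combines, the mean of the opinion scores and the histogram of the scores, can be computed as in the sketch below. It does not reproduce the paper's cumulative percentage curves as a function of the mean; the 1 to 5 scale, the sample scores, and the function name are assumptions.

```python
from collections import Counter


def mos_and_cumulative(scores, scale=(1, 2, 3, 4, 5)):
    """Return the mean opinion score and the cumulative percentage at each scale point."""
    counts = Counter(scores)  # histogram of opinion scores
    n = len(scores)
    mean = sum(scores) / n
    cumulative = {}
    running = 0
    for s in scale:
        running += counts.get(s, 0)
        cumulative[s] = 100.0 * running / n
    return mean, cumulative


# Hypothetical opinion scores from ten respondents.
scores = [5, 4, 4, 3, 2, 4, 5, 1, 3, 4]
mos, cumulative = mos_and_cumulative(scores)

print(mos)         # 3.5
print(cumulative)  # {1: 10.0, 2: 20.0, 3: 40.0, 4: 80.0, 5: 100.0}
```

Here the mean of 3.5 alone hides the fact that 40% of the respondents gave a score of 3 or lower, which is the kind of detail the cumulative percentages expose.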

Input Variable Importance in Supervised Learning Models

  • Huh, Myung-Hoe;Lee, Yong Goo
    • Communications for Statistical Applications and Methods / v.10 no.1 / pp.239-246 / 2003
  • Statisticians, or data miners, are often asked to assess the importance of input variables in a given supervised learning model. For this purpose, one may rely on separate ad hoc measures depending on the model type, such as linear regression, neural networks, or trees. Consequently, conceptual consistency in input variable importance measures is lacking, so the measures cannot be directly used to compare different types of models, which is often done in data mining processes. In this short communication, we propose a unified approach to measuring the importance of input variables. Our method uses sensitivity analysis, which begins by perturbing the values of input variables and monitors the change in the output. The research scope is limited to models for continuous output, although it is not difficult to extend the method to supervised learning models for categorical outcomes.
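
A model-agnostic sensitivity analysis of this kind can be sketched as follows. This is one simple way to perturb inputs and monitor the output, not necessarily the scheme used in the paper: the additive shift `delta`, the mean absolute change as the importance score, and the toy model are all assumptions.

```python
def perturbation_importance(model, rows, delta=1.0):
    """Mean absolute change in model output when each input is shifted by delta."""
    n_features = len(rows[0])
    importances = []
    for j in range(n_features):
        total_change = 0.0
        for row in rows:
            perturbed = list(row)
            perturbed[j] += delta  # perturb only the j-th input variable
            total_change += abs(model(perturbed) - model(row))
        importances.append(total_change / len(rows))
    return importances


def toy_model(x):
    """Hypothetical fitted model with a continuous output."""
    return 3.0 * x[0] + 0.5 * x[1]


rows = [[1.0, 2.0], [2.0, 0.0], [0.0, 4.0]]
importances = perturbation_importance(toy_model, rows)
print(importances)  # [3.0, 0.5]
```

Because the function only calls `model` as a black box, the same score can be computed for a regression, a neural network, or a tree, which is what makes the comparison across model types possible.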