Categorical Data Clustering Analysis Using Association-based Dissimilarity (연관성 기반 비유사성을 활용한 범주형 자료 군집분석)

  • Lee, Changki;Jung, Uk
    • Journal of Korean Society for Quality Management / v.47 no.2 / pp.271-281 / 2019
  • Purpose: The purpose of this study is to suggest a more efficient distance measure that takes into account the relationships between categorical variables for categorical data cluster analysis. Methods: In this study, an association-based dissimilarity was employed to calculate the distance between two categorical data observations, and the resulting distance was applied to the PAM clustering algorithm to verify its effectiveness. The strength of association between two different categorical variables can be calculated using a mixture of dissimilarities between the conditional probability distributions of other categorical variables, given these two categorical values. In particular, this method is suitable for datasets whose categorical variables are highly correlated. Results: The simulation results using several real-life datasets showed that the proposed distance, which considers relationships among the categorical variables, generally yielded better clustering performance than the Hamming distance. In addition, as the number of correlated variables increased, the difference in performance between the two clustering methods based on different distance measures became more statistically significant. Conclusion: This study revealed that accounting for the relationships between categorical variables using our proposed method positively affected the results of cluster analysis.
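
For context, the Hamming distance used as the baseline above can be sketched as follows. This is a minimal illustration only: it shows the baseline, not the authors' association-based dissimilarity, and the function names and toy observations are assumptions introduced here. A pairwise matrix like this is the kind of input a PAM-style algorithm could consume.

```python
from typing import List, Sequence


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the variables on which two categorical observations differ."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    return sum(x != y for x, y in zip(a, b))


def dissimilarity_matrix(rows: Sequence[Sequence[str]]) -> List[List[int]]:
    """Pairwise Hamming distances between all observations."""
    return [[hamming(r, s) for s in rows] for r in rows]


# Hypothetical toy observations (not data from the paper).
rows = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
matrix = dissimilarity_matrix(rows)
```

Because the Hamming distance treats every mismatch equally, it ignores how strongly the variables are associated, which is the gap the abstract says the proposed measure addresses.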

Complex Segregation Analysis of Categorical Traits in Farm Animals: Comparison of Linear and Threshold Models

  • Kadarmideen, Haja N.;Ilahi, H.
    • Asian-Australasian Journal of Animal Sciences / v.18 no.8 / pp.1088-1097 / 2005
  • The main objectives of this study were to investigate the accuracy, bias and power of linear and threshold model segregation analysis methods for detection of major genes in categorical traits in farm animals. Maximum Likelihood Linear Model (MLLM), Bayesian Linear Model (BALM) and Bayesian Threshold Model (BATM) were applied to simulated data on normal, categorical and binary scales as well as to disease data in pigs. Simulated data on the underlying normally distributed liability (NDL) were used to create categorical and binary data. The MLLM method was applied to data on all scales (normal, categorical and binary), and the BATM method was developed and applied only to binary data. The MLLM analyses underestimated parameters for binary as well as categorical traits compared to normal traits, with the bias being very severe for binary traits. The accuracy of major gene and polygene parameter estimates was also very low for binary data compared with those for categorical data; the latter gave results similar to normal data. When disease incidence (on the binary scale) is close to 50%, segregation analysis has more accuracy and less bias than for diseases with rare incidence. NDL data were always better than categorical data. Under the MLLM method, the test statistics for categorical and binary data were consistently and unusually high (while the opposite is expected due to loss of information in categorical data), indicating high false discovery rates of major genes if linear models are applied to categorical traits. With Bayesian segregation analysis, 95% highest probability density regions of major gene variances were checked to see whether they included the value of zero (boundary parameter); by nature of this difference between likelihood and Bayesian approaches, the Bayesian methods are likely to be more reliable for categorical data. The BATM segregation analysis of binary data also showed a significant advantage over MLLM in terms of higher accuracy. Based on the results, threshold models are recommended when the trait distributions are discontinuous. Further, segregation analysis could be used in an initial scan of the data for evidence of major genes before embarking on molecular genome mapping.
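
The step of creating categorical and binary data from a normally distributed liability can be sketched as below. This is a hedged illustration of the general threshold idea, not the authors' simulation code; the function names and threshold values are assumptions chosen for the example.

```python
from bisect import bisect_right
from statistics import NormalDist
from typing import Sequence


def liability_to_category(liability: float, thresholds: Sequence[float]) -> int:
    """Map a continuous liability to an ordered category index.

    `thresholds` must be sorted in ascending order; k thresholds give k + 1 categories.
    """
    return bisect_right(thresholds, liability)


def binary_incidence(threshold: float) -> float:
    """Expected share of affected individuals for a standard normal liability."""
    return 1.0 - NormalDist().cdf(threshold)


# Hypothetical thresholds: one cut point for a binary trait, two for three categories.
binary_trait = [liability_to_category(x, [0.0]) for x in (-0.5, 0.3)]
three_level_trait = [liability_to_category(x, [-1.0, 1.0]) for x in (-2.0, 0.0, 2.0)]
incidence_at_zero = binary_incidence(0.0)
```

A threshold at the mean of the liability corresponds to a 50% incidence, the case in which the abstract reports the most accurate segregation analysis; moving the threshold into a tail makes the trait rare.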

Categorization and production in lexical pitch accent contrasts of North Kyungsang Korean

  • Kim, Jungsun
    • Phonetics and Speech Sciences / v.10 no.1 / pp.1-7 / 2018
  • Categorical production in language processing helps speakers to produce phonemic contrasts. This categorization and production is utilized for the production-based and imitation-based approach in the present study. Contrastive signals in speakers' speech reflect the shapes of boundaries with categorical characteristics. Signals that provide information about lexical pitch accent contrasts can introduce categorical distinctions for productive and cognitive selection. This experiment was conducted with nine North Kyungsang speakers for a production task and nine North Kyungsang speakers for an imitation task. The first finding of the present study is the rigidity of categorical production, which controls the boundaries of lexical pitch accent contrasts. The categorization of North Kyungsang speakers' production allows them to classify minimal pitch accent contrasts. The categorical production in imitation appeared in two clusters, representing two meaningful contrasts. The second finding of the present study is that there are individual differences in speakers' production and imitation responses. The distinctive performances of individual speakers showed a variety of curves. For the HL-LH patterns, the categorical production tended to be highly distinctive as compared to the other pitch accent patterns (HH-HL and HH-LH), showing that there are more continuous curves than categorical curves. Finally, the present study shows that, for North Kyungsang speakers, imitative production is the core type of categorical production for determining the existence of the lexical pitch accent system. However, several questions remain for defining that categorical production, which leads to ideas for future research.

Association-based Unsupervised Feature Selection for High-dimensional Categorical Data (고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법)

  • Lee, Changki;Jung, Uk
    • Journal of Korean Society for Quality Management / v.47 no.3 / pp.537-552 / 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method to select the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is relevant if it has relationships with other variables. According to this definition of relevant variables, the goodness-to-pick measure calculates the normalized conditional entropy with other variables. (2) The second step finds the relevant feature subset from the original variable set. This step decides whether a variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than no feature selection in high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. Moreover, as the number of irrelevant categorical variables with imbalanced categorical values increases, the difference in accuracy between the proposed method and the existing methods being compared increases. Conclusion: According to the experimental results, we confirmed that the proposed method consistently produces high classification accuracy rates in high-dimensional categorical data. Therefore, the proposed method is promising for effective use in high-dimensional settings.
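
The abstract does not give the exact form of the goodness-to-pick measure, so the following is only a sketch of one common way to normalize conditional entropy, H(X|Y) / H(X), using H(X|Y) = H(X,Y) - H(Y). The function names and toy columns are assumptions made for illustration.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def normalized_conditional_entropy(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """H(X|Y) / H(X): near 0 when Y determines X, near 1 when Y tells nothing about X."""
    h_x = entropy(x)
    if h_x == 0:
        return 0.0  # a constant variable carries no uncertainty to explain
    h_x_given_y = entropy(list(zip(x, y))) - entropy(y)
    return h_x_given_y / h_x


# Hypothetical columns: `related` mirrors `x`, `unrelated` is independent of it.
x = ["a", "a", "b", "b"]
related = ["u", "u", "v", "v"]
unrelated = ["u", "v", "u", "v"]
score_related = normalized_conditional_entropy(x, related)
score_unrelated = normalized_conditional_entropy(x, unrelated)
```

Under this assumed normalization, a variable whose score against other variables stays close to 1 would look irrelevant in the sense the abstract describes, because no other variable reduces its uncertainty.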

Integrated Partial Sufficient Dimension Reduction with Heavily Unbalanced Categorical Predictors

  • Yoo, Jae-Keun
    • The Korean Journal of Applied Statistics / v.23 no.5 / pp.977-985 / 2010
  • In this paper, we propose an approach to conduct partial sufficient dimension reduction with heavily unbalanced categorical predictors. For this, we consider integrated categorical predictors and investigate conditions under which the integrated categorical predictor is fully informative for partial sufficient dimension reduction. For illustration, the proposed approach is implemented on optimal partial sliced inverse regression in simulation and data analysis.

Clustering Algorithm for Sequences of Categorical Values (범주형 값들이 순서를 가지고 있는 데이터들의 클러스터링 기법)

  • 오승준;김재련
    • Journal of Korean Society of Industrial and Systems Engineering / v.26 no.1 / pp.17-21 / 2003
  • We study a clustering algorithm for sequences of categorical values. Clustering is a data mining problem that has received significant attention from the database community. Traditional clustering algorithms deal with numerical or categorical data points. However, there exist many important databases that store categorical data sequences. In this paper, we introduce a new similarity measure and develop a hierarchical clustering algorithm. An experimental section shows the performance of the proposed approach.

On the clustering of huge categorical data

  • Kim, Dae-Hak
    • Journal of the Korean Data and Information Science Society / v.21 no.6 / pp.1353-1359 / 2010
  • The basic objective of cluster analysis is to discover natural groupings of items. In general, clustering is conducted based on a similarity (or dissimilarity) matrix or on the original input data. Various measures of similarity between objects have been developed. In this paper, we consider the clustering of a huge real categorical data set that shows aspects of the time, location, and activity of Korean people. Some useful similarity measures for the data set are developed and adopted for the categorical variables. Hierarchical and nonhierarchical clustering methods are applied to the data set, which is huge and consists of many categorical variables.

Perceptual development in the categorization of pitch accent contrasts in children and adults

  • Kim, Jung-Sun
    • Phonetics and Speech Sciences / v.3 no.3 / pp.11-18 / 2011
  • This paper examines the categorical labeling of lexical pitch accent contrasts in North Kyungsang and South Cholla Korean listeners. It focuses specifically on investigating whether the pitch accent perception of adults and children has a dialect-specific effect. To evaluate the development of perceptual identification, slopes, intercepts, and positions at categorical boundaries were computed using a logistic regression function. The results showed that differences in slopes and intercepts were significant between North Kyungsang child and adult listeners, but the same was not the case for the positions at boundaries. As far as South Cholla child and adult listeners were concerned, there was a significant difference in slopes, but not intercepts and positions at boundaries. In the present study, the comparison of intercepts and slopes at the boundaries indicated developmental differences between North Kyungsang adult and child listeners. This improvement in categorical proportion seems to be a result of developmental changes in categorical perception. For South Cholla adult and child listeners, however, perception of the non-native contrast becomes less categorical.
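
How a slope, an intercept, and a boundary position relate in a logistic identification function can be sketched as below. This is an assumed two-parameter form, P(x) = 1 / (1 + exp(-(intercept + slope * x))), with hypothetical coefficient values; the abstract does not state the exact parameterization the author fitted.

```python
from math import exp


def logistic(x: float, intercept: float, slope: float) -> float:
    """Probability of one category label at stimulus step x."""
    return 1.0 / (1.0 + exp(-(intercept + slope * x)))


def boundary_position(intercept: float, slope: float) -> float:
    """Stimulus step at which both labels are equally likely (P = 0.5)."""
    if slope == 0:
        raise ValueError("a flat curve has no category boundary")
    return -intercept / slope


# Hypothetical fitted coefficients for one listener group.
intercept, slope = -6.0, 1.5
boundary = boundary_position(intercept, slope)
probability_at_boundary = logistic(boundary, intercept, slope)
```

In this form, two groups can share the same boundary position while differing in slope, which matches the pattern reported above for North Kyungsang children and adults: a steeper slope means a sharper, more categorical transition around the same crossover point.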

Two-stage imputation method to handle missing data for categorical response variable

  • Jong-Min Kim;Kee-Jae Lee;Seung-Joo Lee
    • Communications for Statistical Applications and Methods / v.30 no.6 / pp.577-587 / 2023
  • Conventional categorical data imputation techniques, such as mode imputation, often encounter issues related to overestimation. If the variable has too many categories, the multinomial logistic regression imputation method may be infeasible due to computational limitations. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify significant variables for the target categorical variable. In the second stage, we use the variables selected for the target categorical variable in logistic regression to impute missing data in binary variables, in polytomous regression to impute missing data in categorical variables, and in predictive mean matching to impute missing data in quantitative variables. Through analysis of both asymmetric and non-normal simulated and real data, we demonstrate that the two-stage imputation method outperforms imputation methods lacking variable selection, as evidenced by accuracy measures. In the analysis of real survey data, we also demonstrate that our suggested two-stage imputation method surpasses the current imputation approach in terms of accuracy.
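
The overestimation issue attributed to mode imputation can be illustrated with a small sketch. This shows only the conventional baseline, not the proposed two-stage method; the helper names and the toy column are assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional, Sequence


def mode_impute(values: Sequence[Optional[str]]) -> List[str]:
    """Replace every missing entry (None) with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def share(values: Sequence[Optional[str]], category: str) -> float:
    """Proportion of non-missing entries equal to `category`."""
    observed = [v for v in values if v is not None]
    return observed.count(category) / len(observed)


# Hypothetical column with two missing responses.
raw = ["A", "A", "B", None, None]
filled = mode_impute(raw)
share_before = share(raw, "A")
share_after = share(filled, "A")
```

Every missing value is assigned to the modal category, so its share rises from 2/3 among the observed responses to 4/5 after imputation, while the share of every other category shrinks.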

Probabilistic Forecasting of Seasonal Inflow to Reservoir (계절별 저수지 유입량의 확률예측)

  • Kang, Jaewon
    • Journal of Environmental Science International / v.22 no.8 / pp.965-977 / 2013
  • Reliable long-term streamflow forecasting is invaluable for water resource planning and management, which allocates water supply according to the demand of water users. Probabilistic forecasts are needed to establish risk-based reservoir operation policies, and they may be useful for users who assess and manage risks when making decisions based on forecast results. Probabilistic forecasting of seasonal inflow to Andong dam is performed and assessed using predictors selected from sea surface temperature and 500 hPa geopotential height data. Categorical probability forecasts by Piechota's method and logistic regression analysis, and probability forecasts by a conditional probability density function, are used to forecast seasonal inflow. A kernel density function is used in the categorical probability forecast by Piechota's method and in the probability forecast by the conditional probability density function. The results of the categorical probability forecasts are assessed by the Brier skill score. The assessment reveals that the categorical probability forecasts are better than the reference forecasts. The results of forecasts using the conditional probability density function are assessed by a qualitative approach and by transformation into categorical probability forecasts. The assessment of the forecasts transformed into categorical probability forecasts shows that the forecasts by the conditional probability density function are much better than those by Piechota's method and logistic regression analysis, except for winter season data.
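
The Brier skill score mentioned above compares a forecast's Brier score with that of a reference forecast: BSS = 1 - BS / BS_ref. The sketch below uses the multi-category form of the Brier score and an equal-probability reference; the three inflow categories, the probabilities, and the choice of reference are assumptions for illustration, not values from the paper.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and observed categories."""
    total = 0.0
    for probs, observed in zip(forecasts, outcomes):
        total += sum(
            (p - (1.0 if k == observed else 0.0)) ** 2 for k, p in enumerate(probs)
        )
    return total / len(forecasts)


def brier_skill_score(forecasts, reference, outcomes) -> float:
    """Positive values mean the forecast beats the reference; 1.0 is a perfect forecast."""
    return 1.0 - brier_score(forecasts, outcomes) / brier_score(reference, outcomes)


# Hypothetical categories: 0 = below normal, 1 = near normal, 2 = above normal.
forecasts = [(0.6, 0.3, 0.1), (0.2, 0.2, 0.6)]
outcomes = [0, 2]
reference = [(1 / 3, 1 / 3, 1 / 3)] * len(outcomes)  # assumed equal-odds reference
bss = brier_skill_score(forecasts, reference, outcomes)
```

Here the forecast Brier score is 0.25 against roughly 0.667 for the reference, giving a skill score of 0.625; a positive value is what "better than the reference forecasts" means in the assessment above.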