• Title/Summary/Keyword: Categorical Variables

Search Result 215, Processing Time 0.021 seconds

Computing Algorithm for Genetic Evaluations on Several Linear and Categorical Traits in A Multivariate Threshold Animal Model (범주형 자료를 포함한 다형질 임계개체모형에서 유전능력 추정 알고리즘)

  • Lee, D.H.
    • Journal of Animal Science and Technology
    • /
    • v.46 no.2
    • /
    • pp.137-144
    • /
    • 2004
  • Algorithms for estimating breeding values on several categorical data by using latent variables with threshold conception were developed and showed. Thresholds on each categorical trait were estimated by Newton’s method via gradients and Hessian matrix. This algorithm was developed by way of expansion of bivariate analysis provided by Quaas(2001). Breeding values on latent variables of categorical traits and observations on linear traits were estimated by preconditioned conjugate gradient(PCG) method, which was known having a property of fast convergence. Example was shown by simulated data with two linear traits and a categorical trait with four categories(CE=calving ease) and a dichotomous trait(SB=Still Birth) in threshold animal mixed model(TAMM). Breeding value estimates in TAMM were compared to those in linear animal mixed model (LAMM). As results, correlation estimates of breeding values to parameters were 0.91${\sim}$0.92 on CE and 0.87${\sim}$0.89 on SB in TAMM and 0.72~0.84 on CE and 0.59~0.70 on SB in LAMM. As conclusion, PCG method for estimating breeding values on several categorical traits with linear traits were feasible in TAMM.

Kernel Poisson regression for mixed input variables

  • Shim, Jooyong
    • Journal of the Korean Data and Information Science Society
    • /
    • v.23 no.6
    • /
    • pp.1231-1239
    • /
    • 2012
  • An estimating procedure is introduced for kernel Poisson regression when the input variables consist of numerical and categorical variables, which is based on the penalized negative log-likelihood and the component-wise product of two different types of kernel functions. The proposed procedure provides the estimates of the mean function of the response variables, where the canonical parameter is linearly and/or nonlinearly related to the input variables. Experimental results are then presented which indicate the performance of the proposed kernel Poisson regression.

Ordering Variables and Categories on the Mosaic Plot (모자이크 플롯에서 변수와 범주의 순서화)

  • Lee, Moon-Joo;Huh, Myung-Hoe
    • The Korean Journal of Applied Statistics
    • /
    • v.21 no.5
    • /
    • pp.875-888
    • /
    • 2008
  • Mosaic plots, proposed by Hartigan and Kleiner (1981, 1984), are very useful in visualizing categorical data. In mosaic plot, multi-way classified cell frequencies are represented by rectangles with proportional area. The plot is easy to understand while preserving the information contained in the data. Plot's appearance, however, does change substantially depending on the order of variables and the orders of categories with variable put into the plot. In this study, we propose the algorithms for ordering variables and categories of the categorical data to be explored via mosaic plots. We demonstrate our methods to three well-known datasets: Titanic, Housing and PreSex.

Multidimensional scaling of categorical data using the partition method (분할법을 활용한 범주형자료의 다차원척도법)

  • Shin, Sang Min;Chun, Sun-Kyung;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics
    • /
    • v.31 no.1
    • /
    • pp.67-75
    • /
    • 2018
  • Multidimensional scaling (MDS) is an exploratory analysis of multivariate data to represent the dissimilarity among objects in the geometric low-dimensional space. However, a general MDS map only shows the information of objects without any information about variables. In this study, we used MDS based on the algorithm of Torgerson (Theory and Methods of Scaling, Wiley, 1958) to visualize some clusters of objects in categorical data. For this, we convert given data into a multiple indicator matrix. Additionally, we added the information of levels for each categorical variable on the MDS map by applying the partition method of Shin et al. (Korean Journal of Applied Statistics, 28, 1171-1180, 2015). Therefore, we can find information on the similarity among objects as well as find associations among categorical variables using the proposed MDS map.

Prediction of the price for stock index futures using integrated artificial intelligence techniques with categorical preprocessing

  • Kim, Kyoung-jae;Han, Ingoo
    • Proceedings of the Korean Operations and Management Science Society Conference
    • /
    • 1997.10a
    • /
    • pp.105-108
    • /
    • 1997
  • Previous studies in stock market predictions using artificial intelligence techniques such as artificial neural networks and case-based reasoning, have focused mainly on spot market prediction. Korea launched trading in index futures market (KOSPI 200) on May 3, 1996, then more people became attracted to this market. Thus, this research intends to predict the daily up/down fluctuant direction of the price for KOSPI 200 index futures to meet this recent surge of interest. The forecasting methodologies employed in this research are the integration of genetic algorithm and artificial neural network (GAANN) and the integration of genetic algorithm and case-based reasoning (GACBR). Genetic algorithm was mainly used to select relevant input variables. This study adopts the categorical data preprocessing based on expert's knowledge as well as traditional data preprocessing. The experimental results of each forecasting method with each data preprocessing method are compared and statistically tested. Artificial neural network and case-based reasoning methods with best performance are integrated. Out-of-the Model Integration and In-Model Integration are presented as the integration methodology. The research outcomes are as follows; First, genetic algorithms are useful and effective method to select input variables for Al techniques. Second, the results of the experiment with categorical data preprocessing significantly outperform that with traditional data preprocessing in forecasting up/down fluctuant direction of index futures price. Third, the integration of genetic algorithm and case-based reasoning (GACBR) outperforms the integration of genetic algorithm and artificial neural network (GAANN). Forth, the integration of genetic algorithm, case-based reasoning and artificial neural network (GAANN-GACBR, GACBRNN and GANNCBR) provide worse results than GACBR.

  • PDF

Hypothesis Testing: Means and Proportions (평균과 비율 비교)

  • Pak, Son-Il;Lee, Young-Won
    • Journal of Veterinary Clinics
    • /
    • v.26 no.5
    • /
    • pp.401-407
    • /
    • 2009
  • In the previous article in this series we introduced the basic concepts for statistical analysis. The present review introduces hypothesis testing for continuous and categorical data for readers of the veterinary science literature. For the analysis of continuous data, we explained t-test to compare a single mean with a hypothesized value and the difference between two means from two independent samples or between two means arising from paired samples. When the data are categorical variables, the $x^2$ test for association and homogeneity, Fisher's exact test and Yates' continuity correction for small samples, and test for trend, in which at least one of the variables is ordinal is described, together with the worked examples. McNemar test for correlated proportions is also discussed. The topics covered may provide a basic understanding of different approaches for analyzing clinical data.

Initial Value Selection in Applying an EM Algorithm for Recursive Models of Categorical Variables

  • Jeong, Mi-Sook;Kim, Sung-Ho;Jeong, Kwang-Mo
    • Journal of the Korean Statistical Society
    • /
    • v.27 no.1
    • /
    • pp.25-55
    • /
    • 1998
  • Maximum likelihood estimates (MLEs) for recursive models of categorical variables are discussed under an EM framework. Since MLEs by EM often depend on the choice of the initial values for MLEs, we explore reasonable rules for selecting the initial values for EM. Simulation results strongly support the proposed rules.

  • PDF

Statistical micro matching using a multinomial logistic regression model for categorical data

  • Kim, Kangmin;Park, Mingue
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.5
    • /
    • pp.507-517
    • /
    • 2019
  • Statistical matching is a method of combining multiple sources of data that are extracted or surveyed from the same population. It can be used in situation when variables of interest are not jointly observed. It is a low-cost way to expect high-effects in terms of being able to create synthetic data using existing sources. In this paper, we propose the several statistical micro matching methods using a multinomial logistic regression model when all variables of interest are categorical or categorized ones, which is common in sample survey. Under conditional independence assumption (CIA), a mixed statistical matching method, which is useful when auxiliary information is not available, is proposed. We also propose a statistical matching method with auxiliary information that reduces the bias of the conventional matching methods suggested under CIA. Through a simulation study, proposed micro matching methods and conventional ones are compared. Simulation study shows that suggested matching methods outperform the existing ones especially when CIA does not hold.

Latent class model for mixed variables with applications to text data (혼합모드 잠재범주모형을 통한 텍스트 자료의 분석)

  • Shin, Hyun Soo;Seo, Byungtae
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.6
    • /
    • pp.837-849
    • /
    • 2019
  • Latent class models (LCM) are useful tools to draw hidden information from categorical data. This model can also be interpreted as a mixture model with multinomial component distributions. In some cases, however, an available dataset may contain both categorical and count or continuous data. For such cases, we can extend the LCM to a mixture model with both multinomial and other component distributions such as normal and Poisson distributions. In this paper, we consider a LCM for the data containing categorical and count data to analyze the Drug Review dataset which contains categorical responses and text review. From this data analysis, we show that we can obtain more specific hidden inforamtion than those from the LCM only with categorical responses.

TeGCN:Transformer-embedded Graph Neural Network for Thin-filer default prediction (TeGCN:씬파일러 신용평가를 위한 트랜스포머 임베딩 기반 그래프 신경망 구조 개발)

  • Seongsu Kim;Junho Bae;Juhyeon Lee;Heejoo Jung;Hee-Woong Kim
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.3
    • /
    • pp.419-437
    • /
    • 2023
  • As the number of thin filers in Korea surpasses 12 million, there is a growing interest in enhancing the accuracy of assessing their credit default risk to generate additional revenue. Specifically, researchers are actively pursuing the development of default prediction models using machine learning and deep learning algorithms, in contrast to traditional statistical default prediction methods, which struggle to capture nonlinearity. Among these efforts, Graph Neural Network (GNN) architecture is noteworthy for predicting default in situations with limited data on thin filers. This is due to their ability to incorporate network information between borrowers alongside conventional credit-related data. However, prior research employing graph neural networks has faced limitations in effectively handling diverse categorical variables present in credit information. In this study, we introduce the Transformer embedded Graph Convolutional Network (TeGCN), which aims to address these limitations and enable effective default prediction for thin filers. TeGCN combines the TabTransformer, capable of extracting contextual information from categorical variables, with the Graph Convolutional Network, which captures network information between borrowers. Our TeGCN model surpasses the baseline model's performance across both the general borrower dataset and the thin filer dataset. Specially, our model performs outstanding results in thin filer default prediction. This study achieves high default prediction accuracy by a model structure tailored to characteristics of credit information containing numerous categorical variables, especially in the context of thin filers with limited data. Our study can contribute to resolving the financial exclusion issues faced by thin filers and facilitate additional revenue within the financial industry.