• Title/Summary/Keyword: Variable Statistics

On Simultaneous Considerations of Variable Selection and Detection of Influential Cases

  • Ahn, Byoung-Jin;Park, Sung-Hyun
    • Journal of the Korean Statistical Society / v.16 no.1 / pp.10-20 / 1987
  • The value of statistics used for variable selection criteria can be reduced remarkably by excluding only a few influential cases. Furthermore, different subsets of regressors change leverage and influence patterns for the same response variable. Based on these motivations, this paper suggests a procedure which considers variable selection and detection of influential cases simultaneously.

Ensemble variable selection using genetic algorithm

  • Seogyoung, Lee;Martin Seunghwan, Yang;Jongkyeong, Kang;Seung Jun, Shin
    • Communications for Statistical Applications and Methods / v.29 no.6 / pp.629-640 / 2022
  • Variable selection is one of the most crucial tasks in supervised learning, such as regression and classification. Best subset selection is straightforward and optimal but not practically applicable unless the number of predictors is small. In this article, we propose directly solving the best subset selection problem via the genetic algorithm (GA), a popular stochastic optimization algorithm based on the principle of Darwinian evolution. To further improve variable selection performance, we propose running multiple GAs to solve the best subset selection problem and then synthesizing the results, which we call ensemble GA (EGA). The EGA significantly improves variable selection performance. In addition, the proposed method is essentially best subset selection and is hence applicable to a variety of models with different selection criteria. We compare the proposed EGA to existing variable selection methods under various models, including linear regression, Poisson regression, and Cox regression for survival data. Both simulation and real data analysis demonstrate the promising performance of the proposed method.

Robust Variable Selection in Classification Tree

  • Jang Jeong Yee;Jeong Kwang Mo
    • Proceedings of the Korean Statistical Society Conference / 2001.11a / pp.89-94 / 2001
  • In this study we focus on variable selection when growing decision trees. We discuss several splitting rules and variable selection algorithms, and propose a competitive variable selection method based on the Kruskal-Wallis test, a nonparametric version of the ANOVA F-test. A Monte Carlo study shows that CART is seriously biased toward selecting categorical variables with many values, and that QUEST, which uses the F-test, has low power for selecting informative variables under heavy-tailed distributions.
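
As a rough illustration of rank-based variable screening (not the authors' exact procedure), the sketch below computes the Kruskal-Wallis H statistic of each numeric predictor against the class label and picks the predictor with the largest H. The function names and the toy data are hypothetical, and no tie correction factor is applied.

```python
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(x, labels):
    """Kruskal-Wallis H statistic (without tie correction)."""
    ranks = average_ranks(x)
    n = len(x)
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for r, g in zip(ranks, labels):
        rank_sums[g] += r
        counts[g] += 1
    s = sum(rank_sums[g] ** 2 / counts[g] for g in counts)
    return 12.0 / (n * (n + 1)) * s - 3 * (n + 1)


def select_split_variable(columns, labels):
    """Pick the predictor whose values separate the classes most by H."""
    return max(columns, key=lambda name: kruskal_wallis_h(columns[name], labels))


# Toy data (hypothetical): x1 separates the classes, x2 does not.
labels = ["a", "a", "a", "b", "b", "b"]
columns = {"x1": [1, 2, 3, 4, 5, 6], "x2": [1, 6, 2, 5, 3, 4]}
best = select_split_variable(columns, labels)
```

Because H depends only on ranks, a few extreme observations cannot dominate the statistic, which is the usual motivation for preferring it to an F-test under heavy tails.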

A Study for Antecedent Association Rules

  • Park, Hee-Chang;Cho, Kwang-Hyun
    • Journal of the Korean Data and Information Science Society / v.17 no.4 / pp.1077-1083 / 2006
  • Association rule mining searches for interesting relationships among items in a given database. Association rules are frequently used by retail stores to assist in marketing, advertising, floor placement, and inventory control. There are three primary quality measures for an association rule: support, confidence, and lift. In this paper we present association rule mining based on antecedent variables. We call these rules antecedent association rules. An antecedent variable is a variable that occurs before the independent variable and the dependent variable.
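
The three quality measures named in the abstract can be sketched as follows. This is a minimal illustration of the standard definitions of support, confidence, and lift on a hypothetical toy basket data set; it does not implement the paper's antecedent association rules.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Hypothetical market-basket data.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

s = support(transactions, {"bread", "milk"})       # 2 of 4 baskets
c = confidence(transactions, {"bread"}, {"milk"})  # 0.5 / 0.75
l = lift(transactions, {"bread"}, {"milk"})        # (2/3) / 0.75
```

A lift below 1, as in this toy example, indicates that the antecedent makes the consequent slightly less likely than its baseline frequency.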

Functional Data Classification of Variable Stars

  • Park, Minjeong;Kim, Donghoh;Cho, Sinsup;Oh, Hee-Seok
    • Communications for Statistical Applications and Methods / v.20 no.4 / pp.271-281 / 2013
  • This paper considers the problem of classifying variable stars using functional data analysis. For a better understanding of galaxy structure and stellar evolution, various approaches to the classification of variable stars have been studied. Several features that explain the characteristics of variable stars (such as color index, amplitude, period, and Fourier coefficients) have usually been used to classify them. Excluding other factors and focusing only on the curve shapes of variable stars, Deb and Singh (2009) proposed a classification procedure using multivariate principal component analysis. However, this approach is limited in accommodating some features of light curve data, which are unequally spaced in the phase domain and have functional properties. In this paper, we propose a light curve estimation method that is suitable for functional data analysis, and provide a classification procedure for variable stars that combines the features of a light curve with existing functional data analysis methods. To evaluate its practical applicability, we apply the proposed classification procedure to the data sets of variable stars from the project STellar Astrophysics and Research on Exoplanets (STARE).

Multivariate Control Charts for Means and Variances with Variable Sampling Intervals

  • Kim, Jae-Joo;Cho, Gyo-Young;Chang, Duk-Joon
    • Journal of Korean Society for Quality Management / v.22 no.1 / pp.66-81 / 1994
  • Several sample statistics are proposed for simultaneously monitoring both means and variances of multivariate quality characteristics under a multivariate normal process. The performance of multivariate Shewhart schemes and cumulative sum (CUSUM) schemes is evaluated for matched fixed sampling interval (FSI) and variable sampling interval (VSI) features. Numerical results show that multivariate CUSUM charts are more efficient than Shewhart charts for small or moderate shifts, and that the VSI feature is more efficient than the FSI feature.

Pre-Adjustment of Incomplete Group Variable via K-Means Clustering

  • Hwang, S.Y.;Hahn, H.E.
    • Journal of the Korean Data and Information Science Society / v.15 no.3 / pp.555-563 / 2004
  • In classification and discrimination, we often face an incomplete group variable, typically arising from many missing values and/or unreliable cases. This paper suggests using K-means clustering to pre-adjust for this incompleteness, after which classification based on generalized statistical distance is performed. To illustrate the proposed procedure, a simulation study compares it with CART in data mining and with traditional techniques that ignore the incompleteness of the group variable. The simulation study shows that our methodology outperforms these alternatives.

Other approaches to bivariate ranked set sampling

  • Al-Saleh, Mohammad Fraiwan;Alshboul, Hadeel Mohammad
    • Communications for Statistical Applications and Methods / v.25 no.3 / pp.283-296 / 2018
  • Ranked set sampling, as introduced by McIntyre (Australian Journal of Agriculture Research, 3, 385-390, 1952), dealt with the estimation of the mean of one population. To deal with two or more variables, different forms of bivariate and multivariate ranked set sampling were suggested. For a technique to be useful, it should be easy to implement in practice. Bivariate ranked set sampling, as introduced by Al-Saleh and Zheng (Australian & New Zealand Journal of Statistics, 44, 221-232, 2002), is not easy to implement in practice, because it requires the judgment ranking of each combination of the order statistics of the two characteristics. This paper investigates two modifications that make the method easier to use. The first modification is based on ranking one variable and noting the rank of the other variable for one cycle, then doing the reverse for another cycle. The second approach is based on ranking one variable and giving the second variable the same rank (concomitant order statistic) for one cycle, then doing the reverse for the other cycle. The two procedures are investigated for the estimation of the means of some well-known distributions. It is shown that the suggested approaches can be used in practice and can be more efficient than simple random sampling (SRS). A real data set is used to illustrate the procedure.
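
For background, the basic univariate ranked set sampling scheme can be sketched as below. This is only an illustration of the classical one-variable procedure under the assumption of perfect ranking (units are ranked by their actual values); it is not the bivariate modification studied in the paper, and the function name and parameters are hypothetical.

```python
import random


def ranked_set_sample(draw, set_size, cycles, rng):
    """Collect set_size * cycles measurements by ranked set sampling.

    In each cycle, set_size independent sets of set_size units are drawn.
    The i-th set is ranked and only its i-th smallest unit is measured.
    """
    sample = []
    for _ in range(cycles):
        for i in range(set_size):
            ranked = sorted(draw(rng) for _ in range(set_size))
            sample.append(ranked[i])
    return sample


# Example: standard normal population, sets of size 3, 4 cycles.
rng = random.Random(0)
rss = ranked_set_sample(lambda r: r.gauss(0.0, 1.0), set_size=3, cycles=4, rng=rng)
rss_mean = sum(rss) / len(rss)  # estimator of the population mean
```

Each cycle inspects set_size squared units but measures only set_size of them, which is why the method suits settings where ranking is cheap and measurement is expensive.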

Two-stage imputation method to handle missing data for categorical response variable

  • Jong-Min Kim;Kee-Jae Lee;Seung-Joo Lee
    • Communications for Statistical Applications and Methods / v.30 no.6 / pp.577-587 / 2023
  • Conventional categorical data imputation techniques, such as mode imputation, often encounter issues related to overestimation. If the variable has too many categories, the multinomial logistic regression imputation method may be infeasible due to computational limitations. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify significant variables for the target categorical variable. In the second stage, we use these important variables in logistic regression to impute missing data in binary variables, in polytomous regression to impute missing data in categorical variables, and in predictive mean matching to impute missing data in quantitative variables. Through analysis of both asymmetric and non-normal simulated and real data, we demonstrate that the two-stage imputation method outperforms imputation methods lacking variable selection, as evidenced by accuracy measures. In the analysis of real survey data, we also demonstrate that our suggested two-stage imputation method surpasses the current imputation approach in terms of accuracy.

An importance sampling for a function of a multivariate random variable

  • Jae-Yeol Park;Hee-Geon Kang;Sunggon Kim
    • Communications for Statistical Applications and Methods / v.31 no.1 / pp.65-85 / 2024
  • The tail probability of a function of a multivariate random variable is not easy to estimate by crude Monte Carlo simulation. When the function value exceeds a threshold only rarely, accurate estimation of the corresponding probability requires a huge number of samples. When the explicit form of the cumulative distribution function of each component of the variable is known, the inverse transform likelihood ratio method is a directly applicable scheme for estimating the tail probability efficiently. The method is a type of importance sampling, and its efficiency depends on the selection of the importance sampling distribution. When the cumulative distribution of the multivariate random variable is represented by a copula and its marginal distributions, we develop an iterative algorithm to find the optimal importance sampling distribution, and show the convergence of the algorithm. The performance of the proposed scheme is compared numerically with crude Monte Carlo simulation.
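
The contrast between crude Monte Carlo and importance sampling can be illustrated with a deliberately simple univariate case. The sketch below is a generic likelihood-ratio estimator for P(X > t) with X exponential with rate 1, sampled from a heavier-tailed exponential proposal; it is not the paper's inverse transform likelihood ratio method or its iterative copula-based algorithm, and the threshold, sample size, and proposal rate are assumptions chosen for illustration.

```python
import math
import random


def crude_tail_estimate(threshold, n, rng):
    """Crude Monte Carlo: fraction of Exp(1) draws above the threshold."""
    hits = sum(1 for _ in range(n) if rng.expovariate(1.0) > threshold)
    return hits / n


def importance_tail_estimate(threshold, n, rng, proposal_rate):
    """Importance sampling with an Exp(proposal_rate) proposal.

    Each draw above the threshold is weighted by the likelihood ratio
    f(x) / g(x) = exp(-x) / (proposal_rate * exp(-proposal_rate * x)).
    """
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(proposal_rate)
        if x > threshold:
            total += math.exp(-(1.0 - proposal_rate) * x) / proposal_rate
    return total / n


threshold = 20.0
exact = math.exp(-threshold)  # true tail probability, about 2.06e-9
rng = random.Random(2024)
crude = crude_tail_estimate(threshold, 20_000, rng)
weighted = importance_tail_estimate(
    threshold, 20_000, rng, proposal_rate=1.0 / threshold
)
```

With this many samples the crude estimator almost never observes the event at all, while the weighted estimator sees it in roughly a third of its draws. How close the weighted estimate gets depends on the proposal, which is the selection problem the abstract addresses.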