• Title/Summary/Keyword: 확률과 통계, 군집분석

Search Result 17, Processing Time 0.027 seconds

Clustering Analysis of Science and Engineering College Students' understanding on Probability and Statistics (Robust PCA를 활용한 이공계 대학생의 확률 및 통계 개념 이해도 분석)

  • Yoo, Yongseok
    • Journal of Convergence for Information Technology
    • /
    • v.12 no.3
    • /
    • pp.252-258
    • /
    • 2022
  • In this study, we propose a method for analyzing students' understanding of probability and statistics in small lectures at universities. A computer-based test for probability and statistics was performed on 95 science and engineering college students. After dividing the students' responses into 7 clusters using the Robust PCA and the Gaussian mixture model, the achievement of each subject was analyzed for each cluster. High-ranking clusters generally showed high achievement on most topics except for statistical estimation, and low-achieving clusters showed strengths and weaknesses on different topics. Compared to the widely used PCA-based dimension reduction followed by clustering analysis, the proposed method showed each group's characteristics more clearly. The characteristics of each cluster can be used to develop an individualized learning strategy.

Nonparametric Bayesian Statistical Models in Biomedical Research (생물/보건/의학 연구를 위한 비모수 베이지안 통계모형)

  • Noh, Heesang;Park, Jinsu;Sim, Gyuseok;Yu, Jae-Eun;Chung, Yeonseung
    • The Korean Journal of Applied Statistics
    • /
    • v.27 no.6
    • /
    • pp.867-889
    • /
    • 2014
  • Nonparametric Bayesian (np Bayes) statistical models are popularly used in a variety of research areas because of their flexibility and computational convenience. This paper reviews the np Bayes models focusing on biomedical research applications. We review key probability models for np Bayes inference while illustrating how each of the models is used to answer different types of research questions using biomedical examples. The examples are chosen to highlight the problems that are challenging for standard parametric inference but can be solved using nonparametric inference. We discuss np Bayes inference in four topics: (1) density estimation, (2) clustering, (3) random effects distribution, and (4) regression.

Probabilistic reduced K-means cluster analysis (확률적 reduced K-means 군집분석)

  • Lee, Seunghoon;Song, Juwon
    • The Korean Journal of Applied Statistics
    • /
    • v.34 no.6
    • /
    • pp.905-922
    • /
    • 2021
  • Cluster analysis is one of unsupervised learning techniques used for discovering clusters when there is no prior knowledge of group membership. K-means, one of the commonly used cluster analysis techniques, may fail when the number of variables becomes large. In such high-dimensional cases, it is common to perform tandem analysis, K-means cluster analysis after reducing the number of variables using dimension reduction methods. However, there is no guarantee that the reduced dimension reveals the cluster structure properly. Principal component analysis may mask the structure of clusters, especially when there are large variances for variables that are not related to cluster structure. To overcome this, techniques that perform dimension reduction and cluster analysis simultaneously have been suggested. This study proposes probabilistic reduced K-means, the transition of reduced K-means (De Soete and Caroll, 1994) into a probabilistic framework. Simulation shows that the proposed method performs better than tandem clustering or clustering without any dimension reduction. When the number of the variables is larger than the number of samples in each cluster, probabilistic reduced K-means show better formation of clusters than non-probabilistic reduced K-means. In the application to a real data set, it revealed similar or better cluster structure compared to other methods.

Regionalization of Extreme Rainfall with Spatio-Temporal Pattern (극치강수량의 시공간적 특성을 이용한 지역빈도분석)

  • Lee, Jeong-Ju;Kwon, Hyun-Han;Kim, Byung-Sik;Yoon, Seok-Yeong
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2010.05a
    • /
    • pp.1429-1433
    • /
    • 2010
  • 수공구조물의 설계, 수자원 관리계획의 수립, 재해영향 검토 등을 수행할 때, 재현기간에 따른 확률개념의 강우량, 홍수량, 저수량 등을 산정하여 사용하게 되며, 보통 대상지역의 장기 수문관측 자료를 이용하여 수문사상의 확률분포를 산정한 후 재현기간을 연장하여 원하는 설계빈도에 해당하는 양을 추정하게 된다. 미계측지역 또는 관측자료의 보유기간이 짧은 지역의 경우는 지역빈도 분석 결과를 이용하게 된다. 지역빈도해석을 위해서는 강우자료들의 동질성을 파악하는 것이 가장 기본적인 과정이 되며 이를 위해 통계학적인 범주화분석이 선행되어야 한다. 지점 빈도분석의 수문학적 동질성 판별을 위해 L-moment 방법, K-means 방법에 의한 군집분석 등이 주로 사용되며 관측소 위치좌표를 이용한 공간보간법을 적용하여 시각화하고 있다. 강수량은 시공간적으로 변하는 수문변량으로서 강수량의 시간적인 특성 또한 강수량의 특성을 정의하는데 매우 중요한 요소이다. 이러한 점에서 본 연구를 통해 강수지점의 공간적인 좌표 및 강수량의 양적인 범주화에 초점을 맞춘 기존 지역빈도분석의 범주화 과정에 덧붙여 시간적인 영향을 고려할 수 있는 요소들을 결정하고 이를 활용할 수 있는 범주화 과정을 제시하고자 한다. 즉, 극치강수량의 발생 시기에 대한 정량적인 분석이 가능한 순환통계기법을 이용하여 관측 지점별 시간 통계량을 산정하고, 이를 극치강수량과 결합하여 시 공간적인 특성자료를 생성한 후 이를 이용한 군집화 해석 모형을 개발하는데 연구의 목적이 있다. 분석 과정에 있어서 시간속성의 정량화 및 일반화는 순환통계기법을 사용하였으며, 극치강수량과 발생시점의 속성자료는 각각의 평균과 표준편차를 이용하였다. K-means 알고리즘을 이용해 결합자료를 군집화 하고, L-moment 방법으로 지역화 결과에 대한 검증을 수행하였다. 속성 결합 자료의 군집화 효과는 모의데이터 실험을 통해 확인하였으며, 우리 나라의 58개 기상관측소 자료를 이용하여 분석을 수행하였다. 예비해석 단계에서 100회의 군집분석을 통해 평균적인 centroid를 산정하고, 해당 값을 본 해석의 초기 centroid로 지정하여, 변동적인 클러스터링 경향을 안정화시켜 해석이 반복됨에 따라 군집화 결과가 달라지는 오류를 방지하였다. 또한 K-means 방법으로 계산된 군집별 공간거리 합의 크기에 따라 군집번호를 부여함으로써 군집의 번호순서대로 물리적인 연관성이 인접하도록 설정하였으며, 군집간의 경계선을 추출할 때 발생할 수 있는 오류를 방지하였다. 지역빈도분석 결과는 3차원 Spline 기법으로 도시하였다.

  • PDF

Nonparametric clustering of functional time series electricity consumption data (전기 사용량 시계열 함수 데이터에 대한 비모수적 군집화)

  • Kim, Jaehee
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.1
    • /
    • pp.149-160
    • /
    • 2019
  • The electricity consumption time series data of 'A' University from July 2016 to June 2017 is analyzed via nonparametric functional data clustering since the time series data can be regarded as realization of continuous functions with dependency structure. We use a Bouveyron and Jacques (Advances in Data Analysis and Classification, 5, 4, 281-300, 2011) method based on model-based functional clustering with an FEM algorithm that assumes a Gaussian distribution on functional principal components. Clusterwise analysis is provided with cluster mean functions, densities and cluster profiles.

A Comparison of Cluster Analyses and Clustering of Sensory Data on Hanwoo Bulls (군집분석 비교 및 한우 관능평가데이터 군집화)

  • Kim, Jae-Hee;Ko, Yoon-Sil
    • The Korean Journal of Applied Statistics
    • /
    • v.22 no.4
    • /
    • pp.745-758
    • /
    • 2009
  • Cluster analysis is the automated search for groups of related observations in a data set. To group the observations into clusters many techniques has been proposed, and a variety measures aimed at validating the results of a cluster analysis have been suggested. In this paper, we compare complete linkage, Ward's method, K-means and model-based clustering and compute validity measures such as connectivity, Dunn Index and silhouette with simulated data from multivariate distributions. We also select a clustering algorithm and determine the number of clusters of Korean consumers based on Korean consumers' palatability scores for Hanwoo bull in BBQ cooking method.

Statistical methods for testing tumor heterogeneity (종양 이질성을 검정을 위한 통계적 방법론 연구)

  • Lee, Dong Neuck;Lim, Changwon
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.3
    • /
    • pp.331-348
    • /
    • 2019
  • Understanding the tumor heterogeneity due to differences in the growth pattern of metastatic tumors and rate of change is important for understanding the sensitivity of tumor cells to drugs and finding appropriate therapies. It is often possible to test for differences in population means using t-test or ANOVA when the group of N samples is distinct. However, these statistical methods can not be used unless the groups are distinguished as the data covered in this paper. Statistical methods have been studied to test heterogeneity between samples. The minimum combination t-test method is one of them. In this paper, we propose a maximum combinatorial t-test method that takes into account combinations that bisect data at different ratios. Also we propose a method based on the idea that examining the heterogeneity of a sample is equivalent to testing whether the number of optimal clusters is one in the cluster analysis. We verified that the proposed methods, maximum combination t-test method and gap statistic, have better type-I error and power than the previously proposed method based on simulation study and obtained the results through real data analysis.

Bayesian analysis of finite mixture model with cluster-specific random effects (군집 특정 변량효과를 포함한 유한 혼합 모형의 베이지안 분석)

  • Lee, Hyejin;Kyung, Minjung
    • The Korean Journal of Applied Statistics
    • /
    • v.30 no.1
    • /
    • pp.57-68
    • /
    • 2017
  • Clustering algorithms attempt to find a partition of a finite set of objects in to a potentially predetermined number of nonempty subsets. Gibbs sampling of a normal mixture of linear mixed regressions with a Dirichlet prior distribution calculates posterior probabilities when the number of clusters was known. Our approach provides simultaneous partitioning and parameter estimation with the computation of classification probabilities. A Monte Carlo study of curve estimation results showed that the model was useful for function estimation. Examples are given to show how these models perform on real data.

A Study on Spatial Statistical Perspective for Analyzing Spatial Phenomena in the Framework of GIS: an Empirical Example using Spatial Scan Statistic for Detecting Spatial Clusters of Breast Cancer Incidents (공간현상 분석을 위한 GIS 기반의 공간통계적 접근방법에 관한 고찰: 공간 군집지역 탐색을 위한 공간검색통계량의 실증적 사례분석)

  • Lee, Gyoung-Ju;Kweon, Ihl
    • Spatial Information Research
    • /
    • v.20 no.1
    • /
    • pp.81-90
    • /
    • 2012
  • When analyzing geographical phenomena, two properties need to be considered. One is the spatial dependence structure and the other is a variation or an uncertainty inhibited in a geographic space. Two problems are encountered due to the properties. Firstly, spatial dependence structure, which is conceptualized as spatial autocorrelation, generates heterogeneous geographic landscape in a spatial process. Secondly, generic statistics, although suitable for dealing with stochastic uncertainty, tacitly ignores location information im plicit in spatial data. GIS is a versatile tool for manipulating locational information, while spatial statistics are suitable for investigating spatial uncertainty. Therefore, integrating spatial statistics to GIS is considered as a plausible strategy for appropriately understanding geographic phenomena of interest. Geographic hot-spot analysis is a key tool for identifying abnormal locations in many domains (e.g., criminology, epidemiology, etc.) and is one of the most prominent applications by utilizing the integration strategy. The article aims at reviewing spatial statistical perspective for analyzing spatial processes in the framework of GIS by carrying out empirical analysis. Illustrated is the analysis procedure of using spatial scan statistic for detecting clusters in the framework of GIS. The empirical analysis targets for identifying spatial clusters of breast cancer incidents in Erie and Niagara counties, New York.

A study on the ordering of similarity measures with negative matches (음의 일치 빈도를 고려한 유사성 측도의 대소 관계 규명에 관한 연구)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.1
    • /
    • pp.89-99
    • /
    • 2015
  • The World Economic Forum and the Korean Ministry of Knowledge Economy have selected big data as one of the top 10 in core information technology. The key of big data is to analyze effectively the properties that do have data. Clustering analysis method of big data techniques is a method of assigning a set of objects into the clusters so that the objects in the same cluster are more similar to each other clusters. Similarity measures being used in the cluster analysis may be classified into various types depending on the nature of the data. In this paper, we studied upper and lower bounds for binary similarity measures with negative matches such as Russel and Rao measure, simple matching measure by Sokal and Michener, Rogers and Tanimoto measure, Sokal and Sneath measure, Hamann measure, and Baroni-Urbani and Buser mesures I, II. And the comparative studies with these measures were shown by real data and simulated experiment.