• Title/Summary/Keyword: Statistical data

The Study on Application of Data Gathering for the site and Statistical analysis process (초기 데이터 분석 로드맵을 적용한 사례 연구)

  • Choi, Eun-Hyang;Ree, Sang-Bok
    • Proceedings of the Korean Society for Quality Management Conference / 2010.04a / pp.226-234 / 2010
  • In this paper, we present a process for removing data errors before statistical analysis. If field data have not undergone even a simple validity check, the statistics derived from them cannot be trusted. Because the output of a statistical analysis is produced from the data fed into it, the input data should be free of errors. We study the application of a statistical analysis roadmap that improves on-site applicability by organizing the basic theory and addressing the initial exploratory data phase, an essential step before conducting statistical analysis. As a result, statistical analysis becomes more accessible, and the reliability of its results can be secured by conducting the analysis correctly.

A Study on a Statistical Matching Method Using Clustering for Data Enrichment

  • Kim Soon Y.;Lee Ki H.;Chung Sung S.
    • Communications for Statistical Applications and Methods / v.12 no.2 / pp.509-520 / 2005
  • Data fusion is defined as the process of combining data and information from different sources so that the useful information they contain can be used more effectively. In this paper, we propose a data fusion algorithm that uses k-means clustering for data enrichment, with the aim of improving data quality in the knowledge discovery in databases (KDD) process. An empirical study comparing the proposed technique with existing techniques shows that the proposed clustering-based data fusion technique has a low MSE for continuous fusion variables.
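
The idea of clustering-based fusion can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes a donor file that holds both the common variables and the fusion variable, a recipient file that holds only the common variables, and imputation by the donor cluster mean. The function names `kmeans` and `fuse_by_cluster` and the synthetic data are hypothetical.

```python
import numpy as np

def nearest(X, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per row."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = nearest(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(X, centroids)

def fuse_by_cluster(donor_x, donor_y, recipient_x, k=3):
    """Impute the fusion variable for recipients from donor cluster means."""
    centroids, donor_labels = kmeans(donor_x, k)
    overall = donor_y.mean()  # fallback for a cluster with no donors
    cluster_means = np.array([
        donor_y[donor_labels == j].mean() if np.any(donor_labels == j) else overall
        for j in range(k)
    ])
    return cluster_means[nearest(recipient_x, centroids)]

# Synthetic example (assumption): three groups, y depends on the group.
rng = np.random.default_rng(1)
groups = rng.integers(0, 3, size=300)
x_all = groups[:, None] * 10.0 + rng.normal(0.0, 1.0, size=(300, 2))
y_all = groups * 5.0 + rng.normal(0.0, 0.5, size=300)
donor_x, donor_y = x_all[:200], y_all[:200]
recipient_x, recipient_y = x_all[200:], y_all[200:]

fused = fuse_by_cluster(donor_x, donor_y, recipient_x, k=3)
mse_fused = float(np.mean((fused - recipient_y) ** 2))
mse_baseline = float(np.mean((donor_y.mean() - recipient_y) ** 2))
print("MSE, cluster-mean fusion:", mse_fused)
print("MSE, overall-mean baseline:", mse_baseline)
```

The MSE against the held-out recipient values mirrors the evaluation criterion named in the abstract; the overall-mean baseline is only a reference point chosen for this sketch.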

Release of Microdata and Statistical Disclosure Control Techniques (마이크로데이터 제공과 통계적 노출조절기법)

  • Kim, Kyu-Seong
    • Communications for Statistical Applications and Methods / v.16 no.1 / pp.1-11 / 2009
  • When microdata are released to users, the data are disclosed record by record, so some risk of disclosing respondents' information is inevitable. Statistical disclosure control techniques are statistical tools that reduce the risk of disclosure and increase data utility when data are released. In this paper, we review the concepts of disclosure and disclosure risk as well as statistical disclosure control techniques, and then investigate strategies for selecting a statistical disclosure control technique in relation to data utility. The risk-utility frontier map method is illustrated as an example. Finally, we list checkpoints for each step of a microdata release.

A guideline for the statistical analysis of compositional data in immunology

  • Yoo, Jinkyung;Sun, Zequn;Greenacre, Michael;Ma, Qin;Chung, Dongjun;Kim, Young Min
    • Communications for Statistical Applications and Methods / v.29 no.4 / pp.453-469 / 2022
  • The study of immune cellular composition has been of great scientific interest in immunology because of the generation of multiple large-scale datasets. From a statistical point of view, such immune cellular data should be treated as compositional. In compositional data, each element is positive, and all the elements sum to a constant, which can generally be set to one. Standard statistical methods are not directly applicable to the analysis of compositional data because they do not appropriately handle correlations between the compositional elements. In this paper, we review statistical methods for compositional data analysis and illustrate them in the context of immunology. Specifically, we focus on regression analyses using log-ratio transformations and the alternative approach using Dirichlet regression analysis, discuss their theoretical foundations, and illustrate their applications with immune cellular fraction data generated from colorectal cancer patients.
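
The log-ratio transformations mentioned above can be illustrated with a short sketch. It shows the closure operation (rescaling to unit sum) and two standard transforms, the centered log-ratio (CLR) and the additive log-ratio (ALR). The cell fractions below are made-up numbers, not data from the paper, and the function names are chosen for this example only.

```python
import numpy as np

def closure(x):
    """Rescale each row of strictly positive parts so that it sums to one."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("all parts must be strictly positive")
    return x / x.sum(axis=-1, keepdims=True)

def clr(x):
    """Centered log-ratio: log of each part over the row's geometric mean."""
    logx = np.log(closure(x))
    return logx - logx.mean(axis=-1, keepdims=True)

def alr(x):
    """Additive log-ratio, using the last part as the reference."""
    comp = closure(x)
    return np.log(comp[..., :-1] / comp[..., -1:])

# Hypothetical fractions of three cell types in two samples.
fractions = np.array([[0.60, 0.30, 0.10],
                      [0.20, 0.50, 0.30]])

z = clr(fractions)   # same shape as the input; each row sums to zero
a = alr(fractions)   # one column fewer than the input
print(z)
print(a)
```

After either transform the values are unconstrained real numbers, so they can be used as responses or covariates in ordinary regression; CLR keeps all parts but yields collinear columns, while ALR drops one part and depends on the choice of reference.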

The types and characteristics of statistical big-data graphics with emphasis on the cognitive discouragements (빅데이터 통계그래픽스의 유형 및 특정 - 인지적 방해요소를 중심으로 -)

  • Sim, Mihee;You, Sicheon
    • Smart Media Journal / v.3 no.3 / pp.26-35 / 2014
  • Statistical graphics is a design field that focuses on user perception, aiming at accurate delivery and effective understanding of information by presenting quantitative data through the processes of information analysis, extraction, and visualization. Statistical graphics that incorporate big data components are termed statistical big data graphics. In statistical graphics, visual factors are used to reduce perceptual errors and to deliver information successfully. In statistical big data graphics, however, the visual factors of the enormous data cause cognitive discouragements. The purpose of this study is to extract the cognitive discouragement factors from statistical big data graphics. We categorized statistical big data graphics into 'network type', 'segment type', and 'mixed type' based on their compositional shapes, and explored the characteristics of each type. In particular, based on the main visual factors in statistical big data graphics, we extracted the cognitive discouragement factors that appear under high visualization in four categories: 'multi-dimensional cases', 'various color', 'information overlap', and 'legibility of the writing'.

Training for Huge Data set with On Line Pruning Regression by LS-SVM

  • Kim, Dae-Hak;Shim, Joo-Yong;Oh, Kwang-Sik
    • Proceedings of the Korean Statistical Society Conference / 2003.10a / pp.137-141 / 2003
  • LS-SVM (least squares support vector machine) is a widely applicable and useful machine learning technique for classification and regression analysis. LS-SVM can be a good substitute for statistical methods, but inverting the matrix associated with a huge data set remains computationally difficult. In the modern information society, huge data sets are easily obtained in online or batch mode. For such huge data sets, we suggest an online pruning regression method based on LS-SVM. With a relatively small number of pruned support vectors, we can achieve almost the same performance as regression on the full data set.
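
To make the computational issue concrete, the sketch below fits an LS-SVM regression by solving its (n+1)-by-(n+1) linear system and then prunes support vectors. It is an illustration under stated assumptions, not the authors' online procedure: pruning here is a batch loop that drops the points with the smallest |alpha| and refits, and the RBF kernel, the parameters `gamma` and `sigma`, and the synthetic data are all choices made for this example.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def lssvm_fit(x, y, gamma=100.0, sigma=1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(x)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0
    system[1:, 0] = 1.0
    system[1:, 1:] = rbf_kernel(x, x, sigma) + np.eye(n) / gamma
    solution = np.linalg.solve(system, np.concatenate(([0.0], y)))
    return solution[0], solution[1:]  # bias b, coefficients alpha

def lssvm_predict(x_train, b, alpha, x_new, sigma=1.0):
    """Evaluate the fitted regression function at x_new."""
    return rbf_kernel(x_new, x_train, sigma) @ alpha + b

def prune_and_refit(x, y, keep, drop_frac=0.2, gamma=100.0, sigma=1.0):
    """Drop the points with the smallest |alpha| and refit until `keep` remain."""
    idx = np.arange(len(x))
    b, alpha = lssvm_fit(x, y, gamma, sigma)
    while len(idx) > keep:
        n_drop = min(max(1, int(drop_frac * len(idx))), len(idx) - keep)
        order = np.argsort(np.abs(alpha))     # least influential first
        idx = idx[np.sort(order[n_drop:])]    # keep the rest, original order
        b, alpha = lssvm_fit(x[idx], y[idx], gamma, sigma)
    return idx, b, alpha

# Synthetic example (assumption): noisy sine curve with 200 points.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)[:, None]
truth = np.sin(x[:, 0])
y = truth + rng.normal(0.0, 0.1, size=200)

b_full, alpha_full = lssvm_fit(x, y)
idx, b_small, alpha_small = prune_and_refit(x, y, keep=40)

pred_full = lssvm_predict(x, b_full, alpha_full, x)
pred_small = lssvm_predict(x[idx], b_small, alpha_small, x)
rmse_full = float(np.sqrt(np.mean((pred_full - truth) ** 2)))
rmse_small = float(np.sqrt(np.mean((pred_small - truth) ** 2)))
print("support vectors:", len(alpha_full), "->", len(alpha_small))
print("RMSE full:", rmse_full, "RMSE pruned:", rmse_small)
```

Each refit solves a smaller system, which is where the saving comes from; how closely the pruned model tracks the full one depends on the data and on the pruning rule.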

Comparing Data Access Methods in Statistical Packages (통계 패키지에서의 데이터 접근 방식 비교)

  • Kang, Gun-Seog
    • Communications for Statistical Applications and Methods / v.16 no.3 / pp.437-447 / 2009
  • Recently, in addition to analyzing data with appropriate statistical methods, statistical analysts in industry face the difficulty of composing datasets suited to their analysis objectives by extracting or generating them from diverse data storage devices. In this paper, we survey and compare state-of-the-art data access technologies adopted by several commonly used statistical packages. A better understanding of these technologies will help reduce the costs incurred when analyzing large datasets, especially in data mining work, and thus leave more time for applying statistical analysis methods.

A Data Mining Approach for a Dynamic Development of an Ontology-Based Statistical Information System

  • Mohamed Hachem Kermani;Zizette Boufaida;Amel Lina Bensabbane;Besma Bourezg
    • Journal of Information Science Theory and Practice / v.11 no.2 / pp.67-81 / 2023
  • This paper presents the dynamic development of an ontology-based statistical information system supporting the collection, storage, processing, analysis, and presentation of statistical knowledge at the national scale. To accomplish this, we propose a data mining technique to dynamically collect data relating to citizens from publicly available data sources; the collected data are then structured, classified, categorized, and integrated into an ontology. Moreover, an intelligent platform is proposed to generate quantitative and qualitative statistical information based on the knowledge stored in the ontology. The main aims of the proposed system are to digitize administrative tasks and to provide reliable statistical information to governmental, economic, and social actors. The authorities will use the ontology-based statistical information system for strategic decision-making, as it easily collects, produces, analyzes, and provides both quantitative and qualitative knowledge that will help to improve the administration and management of national political, social, and economic life.

A study on statistical data analysis by microcomputers (마이크로 컴퓨터에 의한 통계자료분석(統計資料分析)에 관한 연구(硏究))

  • Park, Seong-Hyeon
    • Journal of Korean Society for Quality Management / v.13 no.1 / pp.12-19 / 1985
  • First, this paper examines the necessity of statistical packages and the strengths and weaknesses of microcomputers for statistical data analysis. Second, it introduces some statistical packages available for microcomputers on the international market and presents the contents of two statistical packages developed by the author.

Methods and Sample Size Effect Evaluation for Wafer Level Statistical Bin Limits Determination with Poisson Distributions (포아송 분포를 가정한 Wafer 수준 Statistical Bin Limits 결정방법과 표본크기 효과에 대한 평가)

  • Park, Sung-Min;Kim, Young-Sig
    • IE interfaces / v.17 no.1 / pp.1-12 / 2004
  • In the modern semiconductor device manufacturing industry, statistical bin limits on wafer level test bin data are used to minimize the value added to defective product and to protect end customers from potential quality and reliability excursions. Most wafer level test bin data show skewed distributions. Using Monte Carlo simulation, this paper evaluates methods for determining statistical bin limits and the effect of sample size. In the simulation, wafer level test bin data are assumed to follow the Poisson distribution, so typical shapes of the data distribution can be specified in terms of the distribution's parameter. This study examines three methods: (1) a percentile based methodology; (2) data transformation; and (3) Poisson model fitting. The mean square error is adopted as the performance measure for each simulation scenario. A case study is then presented. Results show that the percentile and transformation based methods give more stable statistical bin limits for the real dataset. However, with highly skewed distributions, the transformation based method should be used with caution in determining statistical bin limits. When the data are well fitted by a certain probability distribution, the model fitting approach can be used. As for the sample size effect, the mean square error appears to decrease exponentially as the sample size increases.
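
A small Monte Carlo comparison along these lines might look like the sketch below. It is an illustration under assumptions, not the paper's simulation design: the upper bin limit is taken to be the 99th percentile, the transformation is a square-root transform followed by a normal-theory limit, and the Poisson mean, sample sizes, and replication count are arbitrary choices for this example.

```python
import math
from statistics import NormalDist

import numpy as np

def poisson_quantile(lam, p):
    """Smallest integer k with P(X <= k) >= p for X ~ Poisson(lam)."""
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < p:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k

def simulate(lam, n, p=0.99, reps=500, seed=0):
    """Mean square error of three bin-limit estimators against the true quantile."""
    rng = np.random.default_rng(seed)
    target = poisson_quantile(lam, p)
    z = NormalDist().inv_cdf(p)
    errors = {"percentile": [], "sqrt_transform": [], "poisson_fit": []}
    for _ in range(reps):
        sample = rng.poisson(lam, size=n)
        root = np.sqrt(sample)
        limits = {
            # (1) empirical percentile of the raw counts
            "percentile": np.quantile(sample, p),
            # (2) normal-theory limit on the square-root scale, back-transformed
            "sqrt_transform": (root.mean() + z * root.std(ddof=1)) ** 2,
            # (3) Poisson quantile at the estimated mean
            "poisson_fit": poisson_quantile(sample.mean(), p),
        }
        for name, value in limits.items():
            errors[name].append((value - target) ** 2)
    return {name: float(np.mean(values)) for name, values in errors.items()}

# Assumed scenario: Poisson mean 2 (a skewed count distribution).
results = {n: simulate(lam=2.0, n=n) for n in (25, 100, 400)}
for n, mse in results.items():
    print(n, mse)
```

Running the loop over several sample sizes gives one MSE per method and size, which is the kind of table from which a sample size effect can be read.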