Big Data Analysis Using Principal Component Analysis

  • Received : 2015.05.27
  • Reviewed : 2015.06.19
  • Published : 2015.12.25

Abstract

Big data environments call for new approaches to data analysis. The defining characteristics of big data, namely volume, variety, and velocity, make it possible to analyze the entire data set when drawing inferences about a population. Traditional statistical methods, however, focus on a random sample extracted from the population, so classical statistical approaches are not always suitable for big data analysis. To address this problem, we propose a new approach to big data analysis. In particular, we study a methodology for efficient big data analysis using principal component analysis, a representative technique in multivariate statistics. To evaluate the performance of the proposed method, we carry out statistical simulation studies.
