Correlation Measure for Big Data

  • 정해성 (Department of Multimedia, Seowon University)
  • Received : 2018.08.06
  • Accepted : 2018.08.22
  • Published : 2018.09.25

Abstract

Purpose: Big Data is commonly characterized by the three Vs: volume, velocity, and variety. Volume refers to the amount of data, variety to the number of data types, and velocity to the speed of data processing. As a consequence of these characteristics, the size of Big Data changes rapidly, some data buckets contain outliers, and buckets may differ in size. Correlation plays a major role in Big Data analysis, so a measure better suited to these conditions than the usual correlation measures is needed.

Methods: The correlation measures offered by traditional statistics are compared, and conditions that a correlation measure should meet to suit the characteristics of Big Data are suggested. Finally, a correlation measure that satisfies the suggested conditions is recommended.

Results: Mutual Information satisfies the suggested conditions.

Conclusion: This article builds on traditional correlation measures for analyzing the relationship between two variables. It suggests conditions that correlation measures should meet to suit the characteristics of Big Data and recommends Mutual Information as the measure that satisfies them.
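As an illustration of why Mutual Information can be attractive next to a traditional measure, the following minimal sketch compares the Pearson coefficient with a simple histogram (plug-in) estimate of Mutual Information on a deterministic but non-monotonic relationship, y = x². The function names, the equal-width binning, the choice of 8 bins, and the test data are assumptions made for this example only; the abstract does not specify an estimator, and this is not presented as the article's method.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def bin_index(v, lo, hi, bins):
    """Map v in [lo, hi] to an equal-width bin index in 0..bins-1."""
    if hi == lo:
        return 0  # a constant variable falls into a single bin
    return min(int((v - lo) / (hi - lo) * bins), bins - 1)


def mutual_information(xs, ys, bins=8):
    """Histogram (plug-in) estimate of I(X;Y) in bits.

    Assumption: equal-width bins; other estimators exist and may behave
    differently on small or heavily skewed samples.
    """
    n = len(xs)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    bx = [bin_index(x, x_lo, x_hi, bins) for x in xs]
    by = [bin_index(y, y_lo, y_hi, bins) for y in ys]
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i,j) * log2(p(i,j) / (p(i) * p(j))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[i] * py[j]))
    return mi


# Hypothetical data: x evenly spaced on [-1, 1], y fully determined by x.
xs = [i / 500 - 1 for i in range(1001)]
ys = [x * x for x in xs]

r = pearson(xs, ys)              # near zero: the linear measure misses the dependence
mi = mutual_information(xs, ys)  # clearly positive: the dependence is detected
print(f"Pearson r = {r:.3g}, MI estimate = {mi:.3g} bits")
```

Because the data are symmetric about zero, the Pearson coefficient is essentially zero even though y is completely determined by x, while the Mutual Information estimate is well above zero. The estimate depends on the number of bins, so its value should be read as indicative rather than exact.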
