• Title/Summary/Keyword: spurious correlations

Current trends in high dimensional massive data analysis (고차원 대용량 자료분석의 현재 동향)

  • Jang, Woncheol; Kim, Gwangsu; Kim, Joungyoun
    • The Korean Journal of Applied Statistics / v.29 no.6 / pp.999-1005 / 2016
  • The advent of big data brings the opportunity to answer many open scientific questions but also presents some interesting challenges. The main features of contemporary datasets are high dimensionality and massive sample size. In this paper, we give an overview of the major challenges caused by these two features: (i) noise accumulation and spurious correlations in high dimensional data; (ii) computational scalability for massive data. We also provide applications of big data in various fields, including disaster forecasting, digital humanities and sabermetrics.
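
    The spurious-correlation effect named in this abstract can be illustrated with a small simulation. The sketch below is not taken from the paper; the function names `pearson` and `max_abs_spurious_corr`, the sample size, and the feature counts are all assumptions chosen for illustration. It draws a response and many noise features that are independent of it, then reports the largest absolute sample correlation, which tends to grow as the number of features increases even though no true relationship exists.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_spurious_corr(n_samples, n_features, seed=0):
    """Largest |correlation| between a response and independent noise features.

    Hypothetical helper for illustration: every feature is generated
    independently of the response, so any large correlation is spurious.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        x = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(x, y)))
    return best


if __name__ == "__main__":
    # Fixed sample size, growing dimensionality.
    for p in (10, 100, 1000):
        print(p, round(max_abs_spurious_corr(50, p), 3))
```

    Because the seed is fixed, the features examined for a smaller count are a prefix of those examined for a larger one, so the reported maximum cannot decrease as the feature count grows.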

Probing Effects of Contextual Bias on Number Magnitude Estimation

  • Xuehao Du; Ping Ji; Wei Qin; Lei Wang; Yunshi Lan
    • KSII Transactions on Internet and Information Systems (TIIS) / v.18 no.9 / pp.2464-2482 / 2024
  • The semantic understanding of numbers requires association with context. However, when powerful neural networks overfit spurious correlations between context and numbers in the training corpus, contextual bias can arise, which may impair the network's estimation of number magnitude when making inferences on real-world data. To investigate the resilience of current methodologies against contextual bias, we introduce a novel out-of-distribution (OOD) numerical question-answering (QA) dataset that features specific correlations between context and numbers in the training data, which are not present in the OOD test data. We evaluate the robustness of different numerical encoding and decoding methods when confronted with contextual bias on this dataset. Our findings indicate that encoding methods incorporating more detailed digit information exhibit greater resilience against contextual bias. Inspired by this finding, we propose a digit-aware position embedding strategy, and the experimental results demonstrate that this strategy is highly effective in improving the robustness of neural networks against contextual bias.

Time Series Modelling of Air Quality in Korea: Long Range Dependence or Changes in Mean? (한국의 미세먼지 시계열 분석: 장기종속 시계열 혹은 비정상 평균변화모형?)

  • Baek, Changryong
    • The Korean Journal of Applied Statistics / v.26 no.6 / pp.987-998 / 2013
  • This paper considers the statistical characteristics of the air quality (PM10) of Korea, collected hourly in 2011. PM10 in Korea exhibits very strong correlations even at higher lags, namely, long range dependence. Its marginal distribution is power-law tailed, and the generalized Pareto distribution successfully captures a tail that is thicker than that of the log-normal distribution. However, slowly decaying autocorrelations may confuse practitioners, since a non-stationary model (such as changes in mean) can produce spurious long term correlations in finite samples. We conduct a statistical testing procedure to distinguish the two models and argue that the high persistence can be explained by a non-stationary changes-in-mean model rather than by long range dependent time series models.
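
    The claim that a change in mean can mimic long range dependence can be checked with a toy simulation. The sketch below is only an illustration and is not the paper's testing procedure; the names `sample_acf`, `white_noise`, and `noise_with_mean_shift`, the series length, and the shift size are assumptions. It compares the sample autocorrelation of independent noise with that of the same noise after a single mean jump halfway through the series.

```python
import random


def sample_acf(x, lag):
    """Sample autocorrelation of x at the given lag (global mean removed)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var


def white_noise(n, seed=0):
    """Independent standard normal draws: no true serial dependence."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(n)]


def noise_with_mean_shift(n, shift=2.0, seed=0):
    """The same independent noise, with the mean raised by `shift` in the second half."""
    base = white_noise(n, seed)
    return [v + (shift if t >= n // 2 else 0.0) for t, v in enumerate(base)]


if __name__ == "__main__":
    n = 2000
    plain = white_noise(n)
    shifted = noise_with_mean_shift(n)
    for lag in (1, 10, 100):
        print(lag, round(sample_acf(plain, lag), 3), round(sample_acf(shifted, lag), 3))
```

    In the shifted series the observations are still independent around their local mean, yet the sample autocorrelation stays large at long lags because the global mean is the wrong centre for both halves.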