• Title/Abstract/Keywords: methods of data analysis

Search results: 19,201 items (processing time: 0.058 seconds)

Social Media Data Analysis Trends and Methods

  • Rokaya, Mahmoud;Al Azwari, Sanaa
    • International Journal of Computer Science & Network Security
    • /
    • Vol. 22, No. 9
    • /
    • pp.358-368
    • /
    • 2022
  • Social media is a window through which individuals, communities, and companies spread ideas and promote trends and products. Alongside these opportunities, challenges related to security, privacy, and rights have arisen. The data accumulated from social media have also become a fertile source for analytics, inference, and experimentation with new technologies in data science. In this chapter, emphasis is given to methods of trend analysis, especially ensemble learning methods, which embrace cooperation between different learning methods rather than competition between them. We discuss the most important trends in ensemble learning and their applications in analysing social media data and in anticipating future trends.
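
As a companion to this abstract, here is a minimal sketch of the cooperative-voting idea using scikit-learn's VotingClassifier; the four-post corpus, labels, and choice of base learners are invented placeholders, not the paper's data or pipeline.

```python
# Minimal sketch: a voting ensemble for labelling social media posts.
# The tiny corpus and labels below are invented, not data from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

posts = [
    "new phone launch looks amazing",
    "privacy leak worries everyone",
    "great deal on summer products",
    "data breach at a major firm",
]
labels = ["promotion", "security", "promotion", "security"]

# Three heterogeneous learners cooperate via majority vote rather than
# competing, which is the idea the abstract emphasises.
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("dt", DecisionTreeClassifier(max_depth=3)),
        ],
        voting="hard",
    ),
)
ensemble.fit(posts, labels)
print(ensemble.predict(["big discount on new gadgets"]))
```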

An Empirical Study on Dimension Reduction

  • Suh, Changhee;Lee, Hakbae
    • Journal of the Korean Data Analysis Society
    • /
    • Vol. 20, No. 6
    • /
    • pp.2733-2746
    • /
    • 2018
  • The two inverse regression estimation methods for the central subspace, SIR and SAVE, are computationally easy and widely used. However, SIR and SAVE may perform poorly in finite samples and require strong assumptions (linearity and/or constant covariance conditions) on the predictors. The two non-parametric estimation methods, MAVE and dMAVE, perform much better in finite samples than SIR and SAVE and impose no strong requirements on the predictors or the response variable. MAVE estimates the central mean subspace, whereas dMAVE estimates the central subspace. This paper explores and compares these four dimension reduction methods, reviewing the algorithm of each. An empirical study on simulated data shows that MAVE and dMAVE perform better than SIR and SAVE across different models and different distributional assumptions on the predictors. However, a real-data example with a binary response demonstrates that SAVE is better than the other methods.
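
The SIR method summarized above is simple enough to sketch directly. Below is a minimal numpy implementation under SIR's linearity assumption; the simulated single-index model and the ten-slice scheme are illustrative choices, not the paper's setup.

```python
# Minimal sketch of sliced inverse regression (SIR) in plain numpy.
# The simulated single-index model below is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, p, n_slices = 500, 6, 10
X = rng.standard_normal((n, p))
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
y = np.exp(X @ beta) + 0.1 * rng.standard_normal(n)

# 1. Standardize the predictors.
Xc = X - X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
Z = Xc @ Sigma_inv_sqrt

# 2. Slice the sorted response and average Z within each slice.
order = np.argsort(y)
M = np.zeros((p, p))
for idx in np.array_split(order, n_slices):
    m = Z[idx].mean(axis=0)
    M += (len(idx) / n) * np.outer(m, m)

# 3. Leading eigenvectors of M span the (standardized) central subspace.
_, vecs = np.linalg.eigh(M)
direction = Sigma_inv_sqrt @ vecs[:, -1]   # back-transform to X scale
direction /= np.linalg.norm(direction)
print(np.round(direction, 2))              # close to +/- beta / ||beta||
```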

Data Visualization using Linear and Non-linear Dimensionality Reduction Methods

  • Kim, Junsuk;Youn, Joosang
    • Journal of the Korea Society of Computer and Information
    • /
    • Vol. 23, No. 12
    • /
    • pp.21-26
    • /
    • 2018
  • As large amounts of data can now be stored efficiently, methods for extracting meaningful features from big data have become important. In particular, techniques that convert high-dimensional data into low-dimensional representations are crucial for data visualization. In this study, principal component analysis (PCA, a linear dimensionality reduction technique) and Isomap (a non-linear dimensionality reduction technique) are introduced and applied to neural big data obtained by functional magnetic resonance imaging (fMRI). First, we investigate how well the physical properties of the stimuli are maintained after the dimensionality reduction. We then compare the residual variance of the two methods to quantify the amount of information left unexplained. As a result, the Isomap embedding retains more information than PCA. Our results demonstrate that it is necessary to consider non-linear as well as linear characteristics in big data analysis.
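
A rough version of this comparison can be reproduced with scikit-learn. The sketch below substitutes a swiss-roll manifold for the fMRI data (which is not available here) and uses a simplified residual-variance measure based on Euclidean rather than geodesic distances.

```python
# Sketch comparing PCA and Isomap on a nonlinear manifold; the swiss
# roll stands in for the paper's fMRI data.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=800, random_state=0)

def residual_variance(d_high, d_low):
    """1 - r^2 between high- and low-dimensional pairwise distances."""
    r = np.corrcoef(d_high, d_low)[0, 1]
    return 1.0 - r ** 2

d_orig = pdist(X)
for name, model in [("PCA", PCA(n_components=2)),
                    ("Isomap", Isomap(n_neighbors=10, n_components=2))]:
    emb = model.fit_transform(X)
    print(name, round(residual_variance(d_orig, pdist(emb)), 3))
```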

Robustness of model averaging methods for the violation of standard linear regression assumptions

  • Lee, Yongsu;Song, Juwon
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 28, No. 2
    • /
    • pp.189-204
    • /
    • 2021
  • In a regression analysis, a single best model is usually selected among several candidate models. However, it is often useful to combine several candidate models to achieve better performance, especially from the prediction viewpoint. Model combining methods such as stacking and Bayesian model averaging (BMA) have been suggested from the perspective of averaging candidate models. When the candidate models include the true model, BMA is expected to give better performance than stacking; when they do not, stacking is known to outperform BMA. Since stacking and BMA have different properties, it is difficult to determine which method is more appropriate in other situations. In particular, few papers compare stacking and BMA when regression model assumptions are violated. Therefore, in this paper, we compare the performance of model averaging methods and of a single best model in linear regression analysis when standard linear regression assumptions are violated. Simulations were conducted comparing the model averaging methods with linear regression when the data do and do not include outliers, and when the errors follow a non-normal distribution. The model averaging methods were also applied to water pollution data, which exhibit strong multicollinearity among variables. The simulation studies showed that stacking tends to give better performance than BMA or standard linear regression analysis (including stepwise selection) in the sense of risks (see (3.1)) or prediction error (see (3.2)) when typical linear regression assumptions are violated.
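
Stacking, one of the two averaging methods compared above, can be sketched with scikit-learn's StackingRegressor; BMA has no scikit-learn counterpart and is omitted. The heavy-tailed error model below is an invented stand-in for the paper's violated-assumption scenarios.

```python
# Sketch of stacking several candidate regressions versus a single
# best model; errors are drawn from a t(2) to mimic non-normality.
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.standard_normal((n, p))
coef = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ coef + rng.standard_t(df=2, size=n)   # heavy-tailed errors

stack = StackingRegressor(
    estimators=[("ols", LinearRegression()),
                ("ridge", Ridge(alpha=1.0)),
                ("lasso", Lasso(alpha=0.1))],
    final_estimator=LinearRegression(),
)
for name, model in [("single OLS", LinearRegression()),
                    ("stacking", stack)]:
    mse = -cross_val_score(model, X, y,
                           scoring="neg_mean_squared_error", cv=5).mean()
    print(name, round(mse, 2))
```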

Review on statistical methods for protecting privacy and measuring risk of disclosure when releasing information for public use

  • 이용희
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 24, No. 5
    • /
    • pp.1029-1041
    • /
    • 2013
  • With the recent emergence of big data and a rapidly increasing demand for releasing information, the need to protect personal information when data are released to the public is greater than ever. Focusing on microdata and statistical analysis servers, this paper gives an overview of the statistical methods proposed to date for limiting disclosure of personal information, the concept of information disclosure, and the criteria used to measure disclosure risk.
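
One simple disclosure-risk measure in this spirit is the share of records that are unique on a set of quasi-identifiers, with 1/(cell size) as a per-record risk proxy. The sketch below uses invented microdata columns; it is not one of the specific criteria the paper reviews.

```python
# Sketch of a simple disclosure-risk measure for microdata: the share
# of records unique on quasi-identifiers. Columns are invented.
import pandas as pd

micro = pd.DataFrame({
    "age_group": ["20s", "20s", "30s", "30s", "40s", "40s"],
    "region":    ["A",   "B",   "A",   "A",   "B",   "B"],
    "sex":       ["F",   "M",   "F",   "F",   "M",   "F"],
})
quasi_ids = ["age_group", "region", "sex"]

# Count how many records share each quasi-identifier combination.
sizes = micro.groupby(quasi_ids).size().rename("cell_size").reset_index()
risk = micro.merge(sizes, on=quasi_ids)

print("share of sample-unique records:",
      round((risk["cell_size"] == 1).mean(), 2))
# Per-record risk proxy: 1 / cell size (higher = riskier to release).
print(risk.assign(risk=1 / risk["cell_size"]))
```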

Multi-dimension Categorical Data with Bayesian Network

  • 김용철
    • Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • Vol. 11, No. 2
    • /
    • pp.169-174
    • /
    • 2018
  • In general, analysis of variance is used as the statistical method when the effect in the data is continuous, and the chi-square test for contingency tables when it is discrete. Multi-dimensional data call for analysis of hierarchical structure, and statistical linear models are adopted to represent causal relationships among the data. The linear model structure requires normality of the data, and for some data a non-linear model may be adopted instead. In particular, survey data, by the nature of questionnaire items, often take a discrete form and frequently fail to satisfy the model conditions. As the dimensionality of the data structure increases, multi-dimensional categorical data analysis methods are used for causal relationships, interactions, and association analysis. This paper shows that a Bayesian network model based on the computation of probability distributions can shorten the analysis procedure and analyse interactions and causal relationships in categorical data analysis.
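
A minimal Bayesian-network analysis of survey-style categorical data might look like the sketch below, which assumes the pgmpy library (class names as in recent releases; treat the exact API as an assumption) and invents both the toy responses and the network structure.

```python
# Sketch: fit a small Bayesian network to categorical survey data and
# query it, instead of running a separate test per contingency table.
# Toy data and the age -> usage -> satisfaction structure are invented.
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

df = pd.DataFrame({
    "age":          ["young", "young", "old", "old",
                     "young", "old", "young", "old"],
    "usage":        ["high", "high", "low", "low",
                     "low", "high", "high", "low"],
    "satisfaction": ["yes", "no", "no", "yes",
                     "no", "yes", "yes", "no"],
})

# Conditional probability tables are fitted by maximum likelihood
# from the observed contingency counts.
model = BayesianNetwork([("age", "usage"), ("usage", "satisfaction")])
model.fit(df, estimator=MaximumLikelihoodEstimator)

infer = VariableElimination(model)
print(infer.query(variables=["satisfaction"], evidence={"age": "young"}))
```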

Model-Ship Correlation Study on the Powering Performance for a Large Container Carrier

  • Hwangbo, S.M.;Go, S.C.
    • Journal of Ship and Ocean Technology
    • /
    • Vol. 5, No. 4
    • /
    • pp.44-50
    • /
    • 2001
  • Large container carriers suffer from a lack of knowledge on reliable correlation allowances between model tests and full-scale trials, especially at the fully loaded condition. A careful full-scale sea trial, with a full loading of containers both in holds and on deck, was carried out to clarify this. Model test results were analyzed by different methods, but with the same measured data, to figure out appropriate correlation factors for each analysis method. Although there is no doubt that model testing is one of the most reliable tools for predicting full-scale powering performance, the assumptions and simplifications applied in the course of data manipulation and analysis need feedback from sea trial data for fine tuning, via the so-called correlation factor. The best correlation allowances at the fully loaded condition, for both the 2-dimensional and 3-dimensional analysis methods, are found through the careful sea trial results and the related study on large container carriers.
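
The correlation factor described here can be illustrated as a simple ratio of trial-measured to model-predicted delivered power at matched speeds. The numbers below are invented for illustration and are not the paper's trial data.

```python
# Sketch: a power-based correlation factor from sea-trial feedback.
# Speeds (knots) and delivered powers (kW) are invented values.
import numpy as np

speeds = np.array([20.0, 22.0, 24.0, 26.0])
p_model = np.array([28000., 38000., 52000., 70000.])  # model prediction
p_trial = np.array([29500., 40200., 54500., 74000.])  # sea-trial result

cp = p_trial / p_model   # speed-wise correlation factor on power
for v, c in zip(speeds, cp):
    print(f"{v:.0f} kn: Cp = {c:.3f}")
print("mean correlation allowance on power:", round(cp.mean(), 3))
```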

Forecasting Symbolic Candle Chart-Valued Time Series

  • Park, Heewon;Sakaori, Fumitake
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 21, No. 6
    • /
    • pp.471-486
    • /
    • 2014
  • This study introduces a new type of symbolic data, the candle chart-valued time series. We aggregate four stock indices (open, close, highest, and lowest) into one data point to summarize a huge amount of data. In other words, we treat a candle chart, constructed from the open, close, highest, and lowest stock indices, as a type of symbolic data over a long period. The proposed candle chart-valued time series effectively summarizes and visualizes a huge set of stock indices, making changes in the indices easy to understand. We also propose novel approaches to candle chart-valued time series modeling based on a combination of two midpoints and two half-ranges: between the highest and lowest indices, and between the open and close indices. Furthermore, we propose three types of sums of squares for estimating the candle chart-valued time series model. The proposed methods take into account information not only from ordinary data but also from the interval of the object, and thus can perform effectively for time series modeling (e.g., forecasting a future stock index). To evaluate the proposed methods, we describe a real data analysis of the stock market indices of five major Asian countries. The results show that the proposed approaches outperform classical data analysis in forecasting future stock indices.
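
The midpoint/half-range decomposition the abstract proposes is easy to sketch. Below, a simulated OHLC series is converted into the four components and each is given a one-step AR(1) forecast fitted by least squares; this illustrates the representation only, not the paper's estimation method or its three sums of squares.

```python
# Sketch: decompose OHLC candles into two midpoints and two half-ranges
# and forecast each component with an AR(1). The OHLC data are simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
close = 100 + np.cumsum(rng.standard_normal(200))
open_ = np.roll(close, 1); open_[0] = 100.0
high = np.maximum(open_, close) + rng.uniform(0, 1, 200)
low = np.minimum(open_, close) - rng.uniform(0, 1, 200)
candles = pd.DataFrame({"open": open_, "high": high,
                        "low": low, "close": close})

comps = pd.DataFrame({
    "mid_hl":  (candles.high + candles.low) / 2,
    "half_hl": (candles.high - candles.low) / 2,
    "mid_oc":  (candles.open + candles.close) / 2,
    "half_oc": (candles.close - candles.open).abs() / 2,
})

def ar1_forecast(x):
    """One-step AR(1) forecast fitted by ordinary least squares."""
    x = np.asarray(x)
    a, b = np.polyfit(x[:-1], x[1:], deg=1)
    return a * x[-1] + b

print({name: round(ar1_forecast(col), 2) for name, col in comps.items()})
```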

Comparing Accuracy of Imputation Methods for Incomplete Categorical Data

  • Shin, Hyung-Won;Sohn, So-Young
    • Proceedings of the Korean Statistical Society Conference
    • /
    • Proceedings of the 2003 Spring Conference of the Korean Statistical Society
    • /
    • pp.237-242
    • /
    • 2003
  • Various estimation methods have been developed for the imputation of categorical missing data, including the modal category method, logistic regression, and association rules. In this study, we propose two imputation methods (neural network fusion and voting fusion) that combine the results of individual imputation methods. A Monte-Carlo simulation is used to compare the performance of these methods. Five factors are used to simulate the missing data: (1) the true model for the data, (2) data size, (3) noise size, (4) percentage of missing data, and (5) missing pattern. Overall, neural network fusion performed best, while voting fusion performed better than the individual imputation methods though not as well as neural network fusion. The results of an additional real data analysis confirm the simulation results.
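
Voting fusion can be sketched directly: impute each missing categorical value with the majority vote of several individual imputers. The sketch below follows the abstract's modal-category and logistic-regression methods but swaps in a decision tree for the association-rule method, and uses simulated data.

```python
# Sketch of "voting fusion" for categorical imputation: majority vote
# over three individual imputation methods. Data are simulated.
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # categorical target
missing = np.zeros(100, dtype=bool)
missing[:10] = True                            # first 10 values "missing"

X_obs, y_obs = X[~missing], y[~missing]
modal = Counter(y_obs).most_common(1)[0][0]    # modal-category imputer
lr = LogisticRegression().fit(X_obs, y_obs)
dt = DecisionTreeClassifier(max_depth=3).fit(X_obs, y_obs)

# Stack the three individual imputations and take a column-wise vote.
votes = np.vstack([
    np.full(missing.sum(), modal),
    lr.predict(X[missing]),
    dt.predict(X[missing]),
])
fused = np.apply_along_axis(
    lambda v: Counter(v).most_common(1)[0][0], 0, votes)
print("voting-fusion imputations:", fused)
print("true values:              ", y[missing])
```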

A Study on the Presentation of Idea in Information and Entropy Theory in Vegetation Data

  • Park, Seung Tai
    • The Korean Journal of Ecology
    • /
    • Vol. 10, No. 2
    • /
    • pp.91-107
    • /
    • 1987
  • This study is concerned with methods and applications used as a basis for information and entropy analysis of vegetation data. These methods are adopted for evaluating the effect of sampling intensity on information, which represents the departure of an observed variable from a standard component. Classes in the data matrix are calculated using a marginal dispersion array for the rank- and weighting-information program. Finally, the information and entropy are computed by applying seven options. For the application to vegetation studies, two models, for cluster analysis and for analysis of concentration, are explained in detail. The cluster analysis is based on the use of equivocation information and Rajski's metric. The analysis of concentration utilizes coherence coefficients as transformed values, adjusted from block and entropy values. The relationship between the three vegetation clusters and the four stands of the Naejangsan data is highly significant, accounting for 79% of the total variance. Cluster A tends to prefer the north side, and cluster C the south side.
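
The quantities named here can be computed from a contingency table in a few lines: Shannon entropy, mutual information, and Rajski's metric d(X, Y) = 1 - I(X; Y) / H(X, Y). The species-by-stand counts below are invented, not the Naejangsan data.

```python
# Sketch of the information quantities used in the abstract, computed
# from a small, invented cluster-by-stand contingency table.
import numpy as np

counts = np.array([[30.0, 5.0, 0.0],
                   [10.0, 20.0, 5.0],
                   [0.0, 5.0, 25.0]])  # rows: clusters, cols: stands
p = counts / counts.sum()

def entropy(q):
    """Shannon entropy in bits, ignoring zero cells."""
    q = q[q > 0]
    return -(q * np.log2(q)).sum()

h_xy = entropy(p.ravel())            # joint entropy H(X, Y)
h_x = entropy(p.sum(axis=1))         # marginal H(X)
h_y = entropy(p.sum(axis=0))         # marginal H(Y)
mi = h_x + h_y - h_xy                # mutual information I(X; Y)
rajski = 1 - mi / h_xy               # Rajski's metric (0 = identical)
print(f"H(X,Y)={h_xy:.3f}, I(X;Y)={mi:.3f}, Rajski d={rajski:.3f}")
```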
