• Title/Summary/Keyword: methods of data analysis

Fused inverse regression with multi-dimensional responses

  • Cho, Youyoung;Han, Hyoseon;Yoo, Jae Keun
    • Communications for Statistical Applications and Methods / v.28 no.3 / pp.267-279 / 2021
  • Regression with multi-dimensional responses is quite common in the so-called big data era. In such regression, dimension reduction of the predictors is essential to relieve the curse of dimensionality caused by the high dimension of the responses. Sufficient dimension reduction provides effective tools for this reduction, but few sufficient dimension reduction methodologies exist for multivariate regression. To fill this gap, we propose two fused slice-based inverse regression methods. The proposed approaches are robust to the number of clusters or slices and improve on the estimation results of existing methods by fusing many kernel matrices. Numerical studies compare the proposed approaches with existing methods, and a real data analysis confirms their practical usefulness.

Investigating the underlying structure of particulate matter concentrations: a functional exploratory data analysis study using California monitoring data

  • Montoya, Eduardo L.
    • Communications for Statistical Applications and Methods / v.25 no.6 / pp.619-631 / 2018
  • Functional data analysis continues to attract interest because advances in technology across many fields have increasingly permitted measurements to be made from continuous processes on a discretized scale. Particulate matter is among the most harmful air pollutants affecting public health and the environment, and levels of PM10 (particles less than 10 micrometers in diameter) for regions of California remain among the highest in the United States. The relatively high frequency of particulate matter sampling enables us to regard the data as functional data. In this work, we investigate the dominant modes of variation of PM10 using functional data analysis methodologies. Our analysis provides insight into the underlying data structure of PM10, and it captures the size and temporal variation of this underlying data structure. In addition, our study shows that certain aspects of size and temporal variation of the underlying PM10 structure are associated with changes in large-scale climate indices that quantify variations of sea surface temperature and atmospheric circulation patterns.

Nonparametric Bayesian methods: a gentle introduction and overview

  • MacEachern, Steven N.
    • Communications for Statistical Applications and Methods / v.23 no.6 / pp.445-466 / 2016
  • Nonparametric Bayesian methods have seen rapid and sustained growth over the past 25 years. We present a gentle introduction to the methods, motivating the methods through the twin perspectives of consistency and false consistency. We then step through the various constructions of the Dirichlet process, outline a number of the basic properties of this process and move on to the mixture of Dirichlet processes model, including a quick discussion of the computational methods used to fit the model. We touch on the main philosophies for nonparametric Bayesian data analysis and then reanalyze a famous data set. The reanalysis illustrates the concept of admissibility through a novel perturbation of the problem and data, showing the benefit of shrinkage estimation and the much greater benefit of nonparametric Bayesian modelling. We conclude with a too-brief survey of fancier nonparametric Bayesian methods.
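The abstract above mentions several constructions of the Dirichlet process without naming them. One well-known construction is stick-breaking; whether the paper presents it this way is an assumption. The minimal sketch below draws the first few mixture weights only. The function name `stick_breaking_weights`, the concentration value, and the truncation level are illustrative choices, not taken from the paper.

```python
import random


def stick_breaking_weights(alpha, n_atoms, rng):
    """Draw the first n_atoms stick-breaking weights for concentration alpha.

    Returns the weights and the length of the stick that is still unassigned.
    """
    weights = []
    remaining = 1.0  # portion of the unit stick not yet broken off
    for _ in range(n_atoms):
        fraction = rng.betavariate(1.0, alpha)  # V_k ~ Beta(1, alpha)
        weights.append(remaining * fraction)    # w_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - fraction
    return weights, remaining


# Illustrative values only: alpha = 2.0, truncation at 20 atoms, fixed seed.
rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, n_atoms=20, rng=rng)
print(sum(weights) + leftover)  # weights plus leftover account for the whole stick
```

A smaller `alpha` tends to concentrate mass on the first few weights, while a larger `alpha` spreads it over many atoms.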

Predicting Personal Credit Rating with Incomplete Data Sets Using Frequency Matrix technique (Frequency Matrix 기법을 이용한 결측치 자료로부터의 개인신용예측)

  • Bae, Jae-Kwon;Kim, Jin-Hwa;Hwang, Kook-Jae
    • Journal of Information Technology Applications and Management / v.13 no.4 / pp.273-290 / 2006
  • This study proposes a frequency matrix technique for predicting personal credit ratings more efficiently from incomplete data sets. First, the study tests multiple discriminant analysis and logistic regression analysis for predicting personal credit ratings with incomplete data sets, with missing values imputed by mean imputation and regression imputation. An artificial neural network and the frequency matrix technique are also tested for their performance in predicting personal credit ratings. A data set of personal credit information for 8,234 customers of Bank A in 2004 was collected for the test. The performance of the frequency matrix technique is compared with that of the other methods. The experimental results show that the frequency matrix technique is superior to all other models such as MDA-mean, Logit-mean, MDA-regression, Logit-regression, and artificial neural networks.
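Mean imputation, one of the baseline methods named in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are represented as `None`, the function name `mean_impute` is hypothetical, and the sample rows are made up rather than taken from the Bank A data.

```python
def mean_impute(rows):
    """Replace None entries in each column with the mean of that column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        means.append(sum(observed) / len(observed))
    # Build new rows so the input data is left unchanged.
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


# Made-up example: two numeric features per customer, None marks a missing value.
data = [[35, 4200.0], [None, 3100.0], [52, None]]
print(mean_impute(data))
```

Regression imputation, the other baseline in the abstract, would instead predict each missing value from the observed features of the same record.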

Data Mining for High Dimensional Data in Drug Discovery and Development

  • Lee, Kwan R.;Park, Daniel C.;Lin, Xiwu;Eslava, Sergio
    • Genomics & Informatics / v.1 no.2 / pp.65-74 / 2003
  • Data mining differs from traditional data analysis primarily on one important dimension, namely the scale of the data. For this reason, not only statistical but also computer science principles are needed to extract information from large data sets. In this paper we briefly review data mining, its characteristics, typical data mining algorithms, and potential and ongoing applications of data mining in the biopharmaceutical industry. The distinguishing characteristics of data mining lie in its understandability, its scalability, its problem-driven nature, and its analysis of retrospective or observational data, in contrast to experimentally designed data. At a high level one can identify three types of problems for which data mining is useful: description, prediction, and search. The brief review of data mining algorithms includes decision trees and rules, nonlinear classification methods, memory-based methods, model-based clustering, and graphical dependency models. Application areas covered are discovery compound libraries, clinical trial and disease management data, genomics and proteomics, structural databases for candidate drug compounds, and other applications of pharmaceutical relevance.

DR-LSTM: Dimension reduction based deep learning approach to predict stock price

  • Ah-ram Lee;Jae Youn Ahn;Ji Eun Choi;Kyongwon Kim
    • Communications for Statistical Applications and Methods / v.31 no.2 / pp.213-234 / 2024
  • In recent decades, increasing research attention has been directed toward predicting stock prices in financial markets using deep learning methods. For instance, the recurrent neural network (RNN) is known to be competitive for time-series data. Long short-term memory (LSTM) further improves on the RNN by providing an alternative approach to the gradient loss problem. LSTM has its own advantage in predictive accuracy because it retains memory for a longer time. In this paper, we combine both supervised and unsupervised dimension reduction methods with LSTM to enhance forecasting performance and refer to this as a dimension reduction based LSTM (DR-LSTM) approach. As supervised dimension reduction methods, we use sliced inverse regression (SIR), sparse SIR, and kernel SIR. As unsupervised dimension reduction methods, we use principal component analysis (PCA), sparse PCA, and kernel PCA. Using data sets of real stock market indices (S&P 500, STOXX Europe 600, and KOSPI), we present a comparative study of predictive accuracy between the six DR-LSTM methods and time series modeling.

A Study on Exploring the Main Factors and Methods to Improve Community Care : Focusing on the Case of GyeongSangNam-Do (커뮤니티케어 개선을 위한 주요 요인 탐색과 방안 연구 : 경상남도 사례 중심으로)

  • Jun Hoe Kim;Gun A Kim
    • Journal of agricultural medicine and community health / v.48 no.3 / pp.189-204 / 2023
  • Objectives: This study explores critical factors and methods for improving Korean Community Care through the case of GyeongsangNamdo. Methods: We conducted in-depth interviews with 90 people involved in Community Care services in 6 regions and analyzed the collected data using NVivo12. Finally, we reconfirmed the process through topic modeling analysis. Results: We conducted descriptive statistics and a qualitative analysis of the data collected through surveys and in-depth interviews. In the qualitative analysis, we extracted principal codes (Need, Lack, Absence) and sorted the contents into sub-categories. The response rate was highest for 'Need to strengthen capabilities', second highest for 'Need to communicate and share information', and third highest for 'Need for integrated operation and a control tower'. Conclusion: We identified the critical factors for improving Community Care. Based on them, follow-up research should propose concrete methods that can be applied to diverse regions.

Trend Analysis of Research Using Evaluation Tools of Languages Abilities for Young Children: Based on Early Children Education Journals registered with the Korea Research Foundation (유아 언어능력 평가연구의 동향 분석 -한국학술진흥재단 등재 학회지를 중심으로)

  • Youn, Jin-Ju
    • Korean Journal of Human Ecology / v.16 no.4 / pp.677-690 / 2007
  • This study aims to identify trends in language research by analyzing the evaluation tools and methods that researchers have used to assess young children's language abilities. The study examined 237 language ability evaluation methods drawn from 121 studies evaluating young children's language abilities. The papers were selected from 4 early childhood education journals registered with the Korea Research Foundation. The collected data were analyzed by frequency and percentage. The results were as follows. First, among single-age groups the most frequently selected subject group was five-year-olds, among mixed-age groups it was three- to five-year-olds, and most studies had fewer than fifty subjects. The studies were sorted into an 'experimental/investigational research' type, which has been frequently reused by others; an 'interview' type, defined by its data collection method; and a 'difference verification' type, defined by a data analysis method that has been used in the majority of studies. Second, the number of papers that required data analysis has increased since 1996. In conclusion, the analysis of evaluation tools for young children's language abilities shows that many studies concentrated on children's knowledge about language and on language functions such as speaking, reading, writing, and listening, while the evaluation content focused on speaking and writing.

Classification performance comparison of inductive learning methods (귀납적 학습방법들의 분류성능 비교)

  • 이상호;지원철
    • Proceedings of the Korean Operations and Management Science Society Conference / 1997.10a / pp.173-176 / 1997
  • In this paper, the classification performance of inductive learning methods is investigated using credit rating data. The classifiers adopted are Multiple Discriminant Analysis (MDA), Quinlan's C4.5, Multi-Layer Perceptron (MLP), and Cascade Correlation Network (CCN). The data used in this analysis were obtained from the publicly announced rating reports of the three Korean rating agencies. The performance of the 4 classifiers is analyzed in terms of prediction accuracy. The results show that no classifier is dominated by the other classifiers.

Binary classification on compositional data

  • Joo, Jae Yun;Lee, Seokho
    • Communications for Statistical Applications and Methods / v.28 no.1 / pp.89-97 / 2021
  • Because of their boundedness and sum constraint, compositional data are often transformed by a logratio transformation, and the transformed data are then passed to traditional binary classification or discriminant analysis. However, directly applying traditional multivariate approaches to the transformed data may be problematic, because the class distributions are not Gaussian and the Bayes decision boundary is not polynomial on the transformed space. In this study, we propose applying flexible classification approaches to the transformed data for compositional data classification. Empirical studies using synthetic and real examples demonstrate that flexible approaches outperform traditional multivariate classification or discriminant analysis.
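The abstract above does not say which logratio transformation is used. As a hedged illustration, the sketch below applies the centered log-ratio (clr) transform, one common choice, to a single composition. The function name `clr` and the sample values are illustrative assumptions, not taken from the paper.

```python
import math


def clr(composition):
    """Centered log-ratio transform of one composition with strictly positive parts."""
    if any(part <= 0 for part in composition):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean of the parts
    return [value - mean_log for value in logs]


# Made-up three-part composition that sums to 1.
sample = [0.2, 0.3, 0.5]
print(clr(sample))
```

The transformed coordinates sum to zero and are unchanged if the composition is rescaled, which removes the sum constraint before a classifier is applied.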