• Title/Summary/Keyword: Multivariate Dataset

Towards Texture-Based Visualization of Multivariate Dataset

  • Mehmood, Raja Majid;Lee, Hyo Jong
    • Proceedings of the Korea Information Processing Society Conference / 2014.04a / pp.582-585 / 2014
  • Visualization is a science that makes the invisible visible through the techniques of experimental visualization and computer-aided visualization. This paper presents the practical aspects of visualizing multivariate datasets. We briefly discuss previous research and introduce a new visualization technique that helps us design and develop a tool for experimental visualization of multivariate datasets. The newly developed tool can be used in various domains. We chose the software industry as the application domain and used a multivariate dataset of software components computed by VizzMaintenance, a software analysis tool that provides multiple software metrics for open-source Java-based programs. The main objective of this research is to develop a new visualization tool for large multivariate datasets that is more efficient and easier for viewers to perceive. Because perception is central to this work, we decided to test the perception level of the proposed visualization approach with researchers from our research lab.

Canonical Correlation Biplot

  • Park, Mi-Ra;Huh, Myung-Hoe
    • Communications for Statistical Applications and Methods / v.3 no.1 / pp.11-19 / 1996
  • Canonical correlation analysis is a multivariate technique for identifying and quantifying the statistical relationship between two sets of variables. Like most multivariate techniques, its main objective is to reduce the dimensionality of the dataset. It is particularly useful when high-dimensional data can be represented in a low-dimensional space. In this study, we construct statistical graphs for paired sets of multivariate data. Specifically, plots of the observations as well as the variables are proposed. We discuss the geometric interpretation and goodness-of-fit of the proposed plots. We also provide a numerical example.
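
For readers unfamiliar with the technique named in this abstract, the standard textbook formulation of canonical correlation analysis can be written as follows. The symbols are assumptions introduced here, not notation from the paper: $x$ and $y$ are the two sets of variables, $\Sigma_{xx}$ and $\Sigma_{yy}$ are their covariance matrices, $\Sigma_{xy} = \Sigma_{yx}^\top$ is their cross-covariance, and $a$, $b$ are weight vectors.

$$
\rho_1 = \max_{a,\,b} \frac{a^\top \Sigma_{xy}\, b}{\sqrt{a^\top \Sigma_{xx}\, a}\,\sqrt{b^\top \Sigma_{yy}\, b}}
$$

$$
\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}\, a = \rho^2 a, \qquad b \propto \Sigma_{yy}^{-1} \Sigma_{yx}\, a
$$

The eigenvalues $\rho^2$ of the second equation are the squared canonical correlations. Keeping only the leading few pairs $(a, b)$ is what gives the low-dimensional representation on which a plot of observations and variables can be drawn.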

Diagnosis of Observations after Fit of Multivariate Skew t-Distribution: Identification of Outliers and Edge Observations from Asymmetric Data

  • Kim, Seung-Gu
    • The Korean Journal of Applied Statistics / v.25 no.6 / pp.1019-1026 / 2012
  • This paper presents a method for identifying outliers, as well as "edge observations" located on a boundary area constructed by a truncation variable, after fitting a multivariate skew $t$-distribution (MST) to asymmetric data. The detection of edge observations is important in data analysis because it provides information on a certain critical area in the observation space. The proposed method is applied to the Australian Institute of Sport (AIS) dataset, which is well known for its asymmetry in data space.

Predicting depth value of the future depth-based multivariate record

  • Samaneh Tata;Mohammad Reza Faridrohani
    • Communications for Statistical Applications and Methods / v.30 no.5 / pp.453-465 / 2023
  • The prediction problem for univariate records has been discussed by many authors on the basis of record values, but it has not been addressed for multivariate records. There are various definitions of multivariate records, among which depth-based records have been selected for this paper. Using the maximum likelihood and conditional median methods, we consider point and interval predictions of the depth values of future depth-based multivariate records on the basis of the observed ones. Observations derived from some members of the elliptical distributions are the main reason for studying this problem. Finally, the satisfactory performance of the prediction methods is illustrated via simulation studies and a real dataset on drought in Kermanshah city.

Real-world multimodal lifelog dataset for human behavior study

  • Chung, Seungeun;Jeong, Chi Yoon;Lim, Jeong Mook;Lim, Jiyoun;Noh, Kyoung Ju;Kim, Gague;Jeong, Hyuntae
    • ETRI Journal / v.44 no.3 / pp.426-437 / 2022
  • To understand the multilateral characteristics of human behavior and physiological markers related to physical, emotional, and environmental states, extensive lifelog data collection in a real-world environment is essential. Here, we propose a data collection method using multimodal mobile sensing and present a long-term dataset from 22 subjects and 616 days of experimental sessions. The dataset contains over 10 000 hours of data, including physiological data such as photoplethysmography, electrodermal activity, and skin temperature, in addition to the multivariate behavioral data. Furthermore, it includes 10 372 user labels with emotional states and 590 days of sleep quality data. To demonstrate feasibility, human activity recognition was applied to the sensor data using a convolutional neural network-based deep learning model, which achieved 92.78% recognition accuracy. From the activity recognition result, we extracted the daily behavior pattern and discovered five representative models by applying spectral clustering. This demonstrates that the dataset contributes toward understanding human behavior using multimodal data accumulated throughout daily life under natural conditions.

A GEE approach for the semiparametric accelerated lifetime model with multivariate interval-censored data

  • Maru Kim;Sangbum Choi
    • Communications for Statistical Applications and Methods / v.30 no.4 / pp.389-402 / 2023
  • Multivariate or clustered failure time data often occur in medical, epidemiological, and socio-economic studies when survival data are collected from several research centers. If the data are periodically observed, as in a longitudinal study, survival times are often subject to various types of interval-censoring, creating multivariate interval-censored data. The event times of interest may then be correlated among individuals who come from the same cluster. In this article, we propose a unified linear regression method for analyzing multivariate interval-censored data. We consider a semiparametric multivariate accelerated failure time model as a statistical analysis tool and develop a generalized Buckley-James method to make inferences by imputing interval-censored observations with their conditional mean values. Since the study population consists of several heterogeneous clusters, where the subjects in the same cluster may be related, we propose a generalized estimating equations approach to accommodate potential dependence within clusters. Our simulation results confirm that the proposed estimator is robust to misspecification of the working covariance matrix and that statistical efficiency can increase when the working covariance structure is close to the truth. The proposed method is applied to a dataset from a diabetic retinopathy study.

Local Projective Display of Multivariate Numerical Data

  • Huh, Myung-Hoe;Lee, Yong-Goo
    • The Korean Journal of Applied Statistics / v.25 no.4 / pp.661-668 / 2012
  • For displaying multivariate numerical data on a 2D plane by projection, the principal components biplot and GGobi are the two main tools of data visualization. The biplot is very useful for capturing the global shape of the dataset, representing $n$ observations and $p$ variables simultaneously on a single graph. GGobi shows a dynamic movie of the images of $n$ observations projected onto a sequence of unit vectors floating on the $p$-dimensional sphere. Even though these two methods are certainly very valuable, they have drawbacks. The biplot is too condensed to describe the detailed parts of the data, and GGobi is too burdensome for ordinary data analyses. In this paper, the "local projective display" (LPD) is proposed for visualizing multivariate numerical data. The main steps of the LPD are 1) $k$-means clustering of the data into $k$ subsets, 2) drawing $k$ principal components biplots of the individual subsets, and 3) sequencing the $k$ plots by Hurley's (2004) endlink algorithm for cognitive continuity.
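
The first two steps listed in this abstract can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`kmeans`, `principal_axes`, `local_projective_display`), the farthest-point initialisation, the power-iteration PCA, and the toy data are all hypothetical choices made here, and step 3 (endlink sequencing) is omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=50):
    """Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [list(rows[0])]
    while len(centers) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))
        centers.append(list(far))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(r, centers[c])) for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def principal_axes(rows, n_comp=2, n_iter=200):
    """Return (mean, axes): leading principal axes found by power iteration."""
    n, p = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, mean)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / max(n - 1, 1) for j in range(p)]
           for i in range(p)]
    axes = []
    for _ in range(n_comp):
        v = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero start vector
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:  # no variance left to explain
                break
            v = [x / norm for x in w]
        length = sum(x * x for x in v) ** 0.5
        v = [x / length for x in v]
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
        axes.append(v)
        # Deflate so the next pass finds the next principal axis.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return mean, axes


def local_projective_display(rows, k):
    """Step 1: cluster. Step 2: 2D biplot coordinates for each cluster."""
    labels = kmeans(rows, k)
    panels = {}
    for c in range(k):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        if not members:
            continue
        mean, axes = principal_axes(members)
        scores = [[sum((x - m) * a for x, m, a in zip(r, mean, axis)) for axis in axes]
                  for r in members]
        panels[c] = {"scores": scores, "loadings": axes}
    return labels, panels


# Toy data (made up for illustration): two well-separated groups in 3D.
rows = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],
        [10, 10, 10], [11, 10, 10], [10, 11, 10], [10, 10, 11], [11, 11, 10]]
labels, panels = local_projective_display(rows, k=2)
```

Each entry of `panels` holds the observation coordinates (`scores`) and variable directions (`loadings`) that one local biplot would draw. Ordering the panels for display would still require the sequencing step described in the abstract.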

Bayesian Multiple Change-Point Estimation of Multivariate Mean Vectors for Small Data

  • Cheon, Sooyoung;Yu, Wenxing
    • The Korean Journal of Applied Statistics / v.25 no.6 / pp.999-1008 / 2012
  • A Bayesian multiple change-point model for small data is proposed for multivariate means, as an extension of the univariate case of Cheon and Yu (2012). The proposed model requires data from a multivariate noncentral $t$-distribution and conjugate priors for the distributional parameters. We apply the Metropolis-Hastings-within-Gibbs sampling algorithm to the proposed model to detect multiple change-points. The performance of the proposed algorithm is investigated on simulated data and on a real dataset of bivariate Hanwoo fat content data.

An Outlier Detection Algorithm and Data Integration Technique for Prediction of Hypertension

  • Khongorzul Dashdondov;Mi-Hye Kim;Mi-Hwa Song
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.417-419 / 2023
  • Hypertension is one of the leading causes of mortality worldwide. In recent years, the incidence of hypertension has increased dramatically, not only among the elderly but also among young people, and machine-learning methods are increasingly used to diagnose its causes. In this study, we improved hypertension prediction by applying Mahalanobis distance-based multivariate outlier removal to the KNHANES database of Korean national health data and the COVID-19 dataset from Kaggle. The study was divided into two modules. The first, a data preprocessing step, used merged datasets and decision-tree classifier-based feature selection. The second applies a predictive analysis step that removes multivariate outliers from the experimental dataset using the Mahalanobis distance and then predicts hypertension. We compared the accuracy of each classification model. The best results showed that the proposed MAH_RF algorithm had an accuracy of 82.66%. The proposed method can be used not only for hypertension but also for the detection of various diseases such as stroke and cardiovascular disease.
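
Mahalanobis distance-based outlier removal, as named in this abstract, can be sketched as follows. This is a minimal sketch under assumptions, not the paper's MAH_RF pipeline: the function names, the toy data, and the chi-square cutoff rule are hypothetical choices made here, since the abstract does not state which threshold was used.

```python
import math


def mean_vector(rows):
    """Column-wise mean of the data."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, mean):
    """Sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j]
    return [[v / (n - 1) for v in r] for r in cov]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def squared_mahalanobis(rows):
    """Squared Mahalanobis distance of every row from the sample mean."""
    mean = mean_vector(rows)
    cov = covariance_matrix(rows, mean)
    out = []
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        z = solve(cov, d)  # z = cov^{-1} d, without forming the inverse
        out.append(sum(di * zi for di, zi in zip(d, z)))
    return out


def remove_outliers(rows, cutoff):
    """Keep the rows whose squared distance does not exceed the cutoff."""
    d2 = squared_mahalanobis(rows)
    return [row for row, v in zip(rows, d2) if v <= cutoff]


# Toy bivariate data (made up for illustration) with one far-away point.
data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
        [2.0, 2.0], [0.0, 2.0], [2.0, 0.0], [1.0, 1.5], [1.5, 1.0], [10.0, 10.0]]

# Assumed rule: 97.5% chi-square quantile. With 2 degrees of freedom the
# quantile has the closed form -2 * ln(1 - q).
cutoff = -2.0 * math.log(1.0 - 0.975)
cleaned = remove_outliers(data, cutoff)
```

For data with more than two variables the chi-square quantile has no such closed form, so the cutoff would have to come from a statistics library or a table. A classifier would then be trained on `cleaned` rather than on the raw data.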

Selection probability of multivariate regularization to identify pleiotropic variants in genetic association studies

  • Kim, Kipoong;Sun, Hokeun
    • Communications for Statistical Applications and Methods / v.27 no.5 / pp.535-546 / 2020
  • In genetic association studies, pleiotropy is a phenomenon where a variant or a genetic region affects multiple traits or diseases. There have been many studies identifying cross-phenotype genetic associations. However, most statistical approaches for detecting pleiotropy are based on individual tests, where the association of a single variant with multiple traits is tested one variant at a time. These approaches fail to account for relations among correlated variants. Recently, multivariate regularization methods have been proposed to detect pleiotropy in the analysis of high-dimensional genomic data. However, they suffer from the problem of tuning parameter selection, which often results in either too many false positives or too few true positives. In this article, we applied selection probability to multivariate regularization methods in order to identify pleiotropic variants associated with multiple phenotypes. Selection probability was applied to the individual elastic-net, unified elastic-net, and multi-response elastic-net regularization methods. In simulation studies, the selection performance of the three multivariate regularization methods was evaluated under different total numbers of phenotypes, numbers of phenotypes associated with a variant, and correlations among phenotypes. We also applied the regularization methods to a wild bean dataset consisting of 169,028 variants and 17 phenotypes.
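
One common way to estimate a selection probability is to rerun a variable selector on many random half-samples and record how often each variable is chosen. The sketch below illustrates that idea only. It is not the paper's procedure: the elastic-net selectors are replaced by a simple correlation screen (`top_k_by_correlation`), a single response stands in for multiple phenotypes, and all names, defaults, and toy data are assumptions made here.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def top_k_by_correlation(X, y, idx, k):
    """Hypothetical stand-in selector: the k variables most correlated with y."""
    ys = [y[i] for i in idx]
    p = len(X[0])
    score = [abs(pearson([X[i][j] for i in idx], ys)) for j in range(p)]
    return sorted(range(p), key=lambda j: score[j], reverse=True)[:k]


def selection_probability(X, y, select, n_subsamples=100, seed=0):
    """Fraction of random half-samples in which each variable is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # half-sample without replacement
        for j in select(X, y, idx):
            counts[j] += 1
    return [c / n_subsamples for c in counts]


# Toy data (simulated for illustration): only variable 0 drives the response.
gen = random.Random(1)
X = [[gen.gauss(0, 1) for _ in range(5)] for _ in range(40)]
y = [3.0 * row[0] + 0.1 * gen.gauss(0, 1) for row in X]

probs = selection_probability(
    X, y, select=lambda X, y, idx: top_k_by_correlation(X, y, idx, 2))
```

Variables with a real signal are selected in nearly every half-sample, while noise variables are selected only sporadically. Ranking variables by this frequency avoids committing to a single tuning-parameter value, which is the difficulty the abstract describes.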