• Title/Summary/Keyword: mixed data set

Detecting differentially expressed genes from a mixed data set

  • Lee, Sun-Ho;Kim, In-Young;Kim, Sang-Cheol;Rha, Sun-Young;Chung, Hyun-Chel;Kim, Byung-Soo
    • Proceedings of the Korean Statistical Society Conference / 2003.10a / pp.173-177 / 2003
  • When we have both a paired data set and two independent data sets, neither a paired t-test nor a two-sample t-test can be used to detect differences between two samples. In order to identify differentially expressed genes in a mixed data set, a new test statistic is proposed.

Statistical Method of Ranking Candidate Genes for the Biomarker

  • Kim, Byung-Soo;Kim, In-Young;Lee, Sun-Ho;Rha, Sun-Young
    • Communications for Statistical Applications and Methods / v.14 no.1 / pp.169-182 / 2007
  • The receiver operating characteristic (ROC) approach can be employed to rank candidate genes from a microarray experiment, in particular for biomarker development aimed at population screening for a cancer. In a cancer microarray experiment based on n patients, the researcher often wants to compare the tumor tissue with the normal tissue within the same individual using a common reference RNA. Ideally, this experiment produces n pairs of microarray data. However, it is often the case that there are missing values in either the normal or the tumor tissue data. In practice, we have $n_1$ pairs of complete observations, $n_2$ "normal only" and $n_3$ "tumor only" observations for the microarray. We refer to this data set as a mixed data set. We develop a ROC approach for the mixed data set to rank candidate genes for biomarker development in colorectal cancer screening. It turns out that the correlation between the ranks given by the ROC and t statistics, based on the top 50 genes by ROC rank, is less than 0.6. This result indicates that choosing an appropriate approach to ranking candidate genes for biomarker development is important for the allocation of resources.

A General Mixed Linear Model with Left-Censored Data

  • Ha, Il-Do
    • Communications for Statistical Applications and Methods / v.15 no.6 / pp.969-976 / 2008
  • Mixed linear models have been widely used for various types of correlated data, including multivariate survival data. In this paper we extend the hierarchical-likelihood (h-likelihood) approach for mixed linear models with right-censored data to left-censored data. We also allow a general random-effect structure and propose an estimation procedure. The proposed method is illustrated using a numerical data set and is compared with the marginal likelihood method.

Ranking Candidate Genes for the Biomarker Development in a Cancer Diagnostics

  • Kim, In-Young;Lee, Sun-Ho;Rha, Sun-Young;Kim, Byung-Soo
    • Proceedings of the Korean Society for Bioinformatics Conference / 2004.11a / pp.272-278 / 2004
  • Recently, Pepe et al. (2003) employed the receiver operating characteristic (ROC) approach to rank candidate genes from a microarray experiment for use in biomarker development, with the ultimate purpose of population screening for a cancer. In a cancer microarray experiment based on n patients, the researcher often wants to compare the tumor tissue with the normal tissue within the same individual using a common reference RNA. This design is referred to as a reference design or an indirect design. Ideally, this experiment produces n pairs of microarray data, where each pair consists of two sets of microarray data resulting from reference versus normal tissue and reference versus tumor tissue hybridizations. However, for certain individuals either the normal tissue or the tumor tissue is not large enough for the experimenter to extract enough RNA for the microarray experiment, so there are missing values in either the normal or the tumor tissue data. In practice, we have $n_1$ pairs of complete observations, $n_2$ 'normal only' and $n_3$ 'tumor only' observations for the microarray experiment with n patients, where $n = n_1 + n_2 + n_3$. We refer to this data set as a mixed data set, as it contains a mix of fully observed and partially observed pair data. Such a mixed data set was actually observed in a microarray experiment based on human tissues obtained during surgical operations on cancer patients. Pepe et al. (2003) provide the rationale for using the ROC approach based on two independent samples to rank candidate genes instead of using t or Mann-Whitney statistics. We first modify the ROC approach of ranking genes for a paired data set and then extend it to a mixed data set by taking a weighted average of the two ROC values obtained from the paired data set and the two independent data sets.
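The weighted-average idea in this abstract can be sketched in Python. The abstract does not state the formulas, so everything below is an assumption made for illustration: the independent-sample ROC value is taken as the empirical AUC, the paired ROC value as the fraction of patients whose tumor value exceeds their normal value, and the default weight as $n_1 / (n_1 + n_2 + n_3)$. The function names `auc_independent`, `auc_paired`, and `auc_mixed` are hypothetical.

```python
from itertools import product


def _score(normal_value, tumor_value):
    """1 if tumor exceeds normal, 0.5 for a tie, 0 otherwise."""
    if tumor_value > normal_value:
        return 1.0
    if tumor_value == normal_value:
        return 0.5
    return 0.0


def auc_independent(normal_only, tumor_only):
    """Empirical AUC over all (normal, tumor) cross pairs of two independent samples."""
    pairs = list(product(normal_only, tumor_only))
    return sum(_score(n, t) for n, t in pairs) / len(pairs)


def auc_paired(pairs):
    """Assumed paired analogue: average score over within-patient (normal, tumor) pairs."""
    return sum(_score(n, t) for n, t in pairs) / len(pairs)


def auc_mixed(pairs, normal_only, tumor_only, weight=None):
    """Weighted average of the paired and independent ROC values.

    The default weight n1 / (n1 + n2 + n3) is an illustrative assumption,
    not the weighting used in the paper.
    """
    if weight is None:
        n1, n2, n3 = len(pairs), len(normal_only), len(tumor_only)
        weight = n1 / (n1 + n2 + n3)
    return (weight * auc_paired(pairs)
            + (1.0 - weight) * auc_independent(normal_only, tumor_only))


# Toy expression values for a single gene (made up for illustration).
pairs = [(1.0, 2.0), (2.0, 1.5), (0.5, 0.9), (1.2, 1.2)]
normal_only = [1.0, 2.0]
tumor_only = [1.5, 3.0]
print(auc_mixed(pairs, normal_only, tumor_only))
```

Ranking genes would then amount to computing this value per gene and sorting; a caller who knows the paper's actual weights can pass them through `weight`.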

Improving Classification Performance for Data with Numeric and Categorical Attributes Using Feature Wrapping (특징 래핑을 통한 숫자형 특징과 범주형 특징이 혼합된 데이터의 클래스 분류 성능 향상 기법)

  • Lee, Jae-Sung;Kim, Dae-Won
    • Journal of KIISE: Software and Applications / v.36 no.12 / pp.1024-1027 / 2009
  • In this letter, we evaluate classification performance on mixed numeric and categorical data in order to compare the efficiency of feature filtering and feature wrapping. Because the mixed data are composed of numeric and categorical features, the feature selection method was applied to the data set after discretizing its numeric features. In this study, we choose the feature subset that improves the classification performance of the data set after preprocessing. The experimental results show that the feature wrapping method is more reliable than the feature filtering method in terms of classification accuracy.

A Mixed Model for Ordered Response Categories

  • Choi, Jae-Sung
    • Journal of the Korean Data and Information Science Society / v.15 no.2 / pp.339-345 / 2004
  • This paper deals with a mixed logit model for ordered polytomous data. Two types of factors affect the response variable in this paper. One is a fixed factor with finite quantitative levels, and the other is a random factor arising from an experimental structure such as a randomized complete block design. We discuss how to set up the model for analyzing ordered polytomous data and illustrate how to estimate the parameters in the given model.

Cluster Analysis with Balancing Weight on Mixed-type Data

  • Chae, Seong-San;Kim, Jong-Min;Yang, Wan-Youn
    • Communications for Statistical Applications and Methods / v.13 no.3 / pp.719-732 / 2006
  • A set of clustering algorithms is presented in which the distance formulation, extended to mixed numeric and multiple binary values, carries a proper weight. The simple matching and Jaccard coefficients are used to measure similarity between objects for multiple binary attributes. Similarities are converted to dissimilarities between the $i$th and $j$th objects. The performance of clustering algorithms with a balancing weight on different similarity measures is demonstrated. Our experiments show that clustering algorithms with a proper weight give a competitive recovery level when a data set with mixed numeric and multiple binary attributes is clustered.
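The two similarity measures named in this abstract have standard definitions for binary vectors, sketched below in Python. The conversion to a dissimilarity as one minus the similarity is a common convention assumed here; the abstract does not say which conversion or balancing weight the authors use. The function names are hypothetical.

```python
def simple_matching(a, b):
    """Share of attributes on which two binary vectors agree (1-1 and 0-0 both count)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard(a, b):
    """Share of 1-1 agreements among attributes where at least one vector has a 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    # Convention assumed here: two all-zero vectors are treated as identical.
    return both / either if either else 1.0


def to_dissimilarity(similarity):
    """Assumed conversion: dissimilarity = 1 - similarity."""
    return 1.0 - similarity


obj_i = [1, 0, 1, 1, 0]
obj_j = [1, 1, 0, 1, 0]
print(simple_matching(obj_i, obj_j), jaccard(obj_i, obj_j))
```

The two coefficients differ only in how they treat shared zeros: simple matching counts them as agreement, while Jaccard ignores them, which matters when the binary attributes are sparse.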

Dam Sensor Outlier Detection using Mixed Prediction Model and Supervised Learning

  • Park, Chang-Mok
    • International Journal of Advanced Smart Convergence / v.7 no.1 / pp.24-32 / 2018
  • This paper describes an outlier detection method that uses a mixed prediction model. The mixed prediction model consists of a time-series model and a regression model. The parameters of the prediction model were estimated by supervised learning, with a genetic algorithm adopted as the learning method. The experiments were performed on artificial and real data sets. The prediction performance is compared with that of existing prediction methods using artificial data. Outlier detection is conducted using real sensor measurements from a dam. The experiments demonstrated the validity of the proposed method.

Cointegration Analysis with Mixed-Frequency Data of Quarterly GDP and Monthly Coincident Indicators

  • Seong, Byeongchan
    • The Korean Journal of Applied Statistics / v.25 no.6 / pp.925-932 / 2012
  • The article introduces a method to estimate a cointegrated vector autoregressive model from mixed-frequency data, using a state-space representation of its vector error correction model (VECM) form. The method directly estimates the parameters of the model, in a state-space form of its VECM representation, using the available data in their mixed-frequency form. It then allows one to compute in-sample smoothed estimates and out-of-sample forecasts at the high-frequency intervals using the estimated model. The method is applied to a mixed-frequency data set that consists of the quarterly real gross domestic product and three monthly coincident indicators. The results show that the method produces accurate smoothed estimates and forecasts in comparison with a method based on single-frequency data.

AN APPROACH TO THE TRAINING OF A SUPPORT VECTOR MACHINE (SVM) CLASSIFIER USING SMALL MIXED PIXELS

  • Yu, Byeong-Hyeok;Chi, Kwang-Hoon
    • Proceedings of the KSRS Conference / 2008.10a / pp.386-389 / 2008
  • It is important that the training stage of a supervised classification is designed to provide the spectral information. The design of the training stage of a classification typically calls for the use of a large sample of randomly selected pure pixels in order to characterize the classes. Such guidance is generally given without regard to the specific nature of the application in hand, including the classifier to be used. We suggest an approach to the training of a support vector machine (SVM) classifier that is the opposite of that generally promoted for training set design. This approach uses a small sample of mixed spectral responses drawn from purposefully selected locations (geographical boundaries) in training. A sample of such data should also be easier and cheaper to acquire than that suggested by traditional approaches. In this research, we evaluated this approach against traditional approaches with high-resolution satellite data. The results showed that a small sample of mixed pixels can be used to derive a classification with accuracy similar to that obtained using a large number of pure pixels. The approach can also substantially reduce the cost of training data acquisition because the sampling locations used are commonly easy to observe.