• Title/Summary/Keyword: data sets

Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data

  • Mehmood, Tahir;Rasheed, Zahid
    • Communications for Statistical Applications and Methods / v.22 no.6 / pp.575-587 / 2015
  • Advances in data collection techniques produce high dimensional data sets, in which discrimination is an important and commonly encountered problem that is crucial to resolve when the data are heterogeneous (classes do not share a common variance-covariance structure). An example is the classification of microbial habitat preferences based on codon/bi-codon usage. Habitat preference is important for studying evolutionary genetic relationships and may help industry produce specific enzymes. Most classification procedures assume homogeneity (a common variance-covariance structure for all classes), which is not guaranteed in most high dimensional data sets. We introduce regularized elimination in partial least squares coupled with QDA (rePLS-QDA) for parsimonious variable selection and classification of high dimensional heterogeneous data sets; it builds on the recently introduced regularized elimination for variable selection in partial least squares (rePLS) and on quadratic discriminant analysis (QDA), a classification procedure for heterogeneous classes. The proposed and existing methods are compared on a simulated data set; in addition, the proposed procedure is applied to classify microbial habitat preferences by their codon/bi-codon usage. Five bacterial habitats (Aquatic, Host Associated, Multiple, Specialized and Terrestrial) are modeled. The classification accuracy for each habitat is satisfactory and ranges from 89.1% to 100% on test data. Interesting codon/bi-codon usages and their mutual interactions that influence the respective habitat preferences are identified. The proposed method also produced results that concur with known biological characteristics, which will help researchers better understand the divergence of species.
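
For reference, the heterogeneous setting described in this abstract corresponds to the standard QDA decision rule, sketched here under the usual assumption of Gaussian classes. The symbols are notation introduced for this sketch, not taken from the abstract: mu_k, Sigma_k and pi_k denote the mean vector, covariance matrix and prior probability of class k.

```latex
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
              - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
              + \log\pi_k,
\qquad
\hat{y}(x) = \arg\max_k \delta_k(x)
```

When every class shares one covariance matrix, the terms that are quadratic in x cancel between classes and the rule becomes linear in x (linear discriminant analysis), which is the homogeneity assumption the abstract says most procedures make.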

A Biclustering Method for Time Series Analysis

  • Lee, Jeong-Hwa;Lee, Young-Rok;Jun, Chi-Hyuck
    • Industrial Engineering and Management Systems / v.9 no.2 / pp.131-140 / 2010
  • Biclustering is a method of finding meaningful subsets of objects and attributes simultaneously, which may not be detected by traditional clustering methods. It is popularly used for the analysis of microarray data representing the expression levels of genes by conditions. Usually, biclustering algorithms do not consider a sequential relation between attributes. For time series data, however, bicluster solutions should keep the time sequence. This paper proposes a new biclustering algorithm for time series data by modifying the plaid model. The proposed algorithm introduces a parameter controlling an interval between two selected time points. Also, the pruning step preventing an over-fitting problem is modified so as to eliminate only starting or ending points. Results from artificial data sets show that the proposed method is more suitable for the extraction of biclusters from time series data sets. Moreover, by using the proposed method, we find some interesting observations from real-world time-course microarray data sets and apartment price data sets in metropolitan areas.

A Hybrid Clustering Technique for Processing Large Data (대용량 데이터 처리를 위한 하이브리드형 클러스터링 기법)

  • Kim, Man-Sun;Lee, Sang-Yong
    • The KIPS Transactions:PartB / v.10B no.1 / pp.33-40 / 2003
  • Data mining plays an important role in the knowledge discovery process, and various data mining algorithms can be selected for a specific purpose. Most traditional hierarchical clustering methods are suited to small data sets, so they have difficulty handling large data sets because of limited resources and insufficient efficiency. In this study we propose a hybrid neural network clustering technique, called PPC (Pre-Post Clustering), that can be applied to large data sets and find unknown patterns. PPC combines an artificial intelligence method, SOM, with a statistical method, hierarchical clustering, and clusters data in two stages. In the pre-clustering stage, PPC digests large data sets using SOM. Then, in the post-clustering stage, PPC measures similarity values from cohesive distances, which capture the inner features of a cluster, and adjacent distances, which capture the external distances between clusters. Finally, PPC clusters the large data sets using the similarity values. Experiments with UCI repository data showed that PPC achieved better cohesion values than the other clustering techniques.

A Study on "Comparing Two Data Sets" as Effective Tasks for the Education of Pre-Service Elementary Teachers (예비초등교사교육을 위한 효과적인 과제로서 "두 자료집합 비교하기" 과제의 가능성 탐색)

  • Tak, Byungjoo;Ko, Eun-Sung;Jee, Young Myon
    • School Mathematics / v.19 no.4 / pp.691-712 / 2017
  • It is important to develop teachers' statistical reasoning and thinking through teacher education. This study focuses on "comparing two data sets" tasks as a way to develop pre-service elementary teachers' reasoning about core ideas of statistics such as distribution, variability, center, and spread. Six teams of four pre-service elementary teachers each participated in the tasks, and their presentations were analyzed based on Pfannkuch's (2006) model of teachers' inference in comparing two data sets. As a result, they paid attention to distribution and variability in the statistical problem solving prompted by the "comparing two data sets" tasks, and used their contextual knowledge to make statistical decisions. In addition, they used some statistics and graphs as references for statistical communication, which is expected to provide implications for improving statistical education. The findings imply that "comparing two data sets" tasks can be used to develop the statistical reasoning of pre-service elementary teachers. Some recommendations for teacher education using these tasks are suggested.

Reliability of microarray analysis for studying periodontitis: low consistency in 2 periodontitis cohort data sets from different platforms and an integrative meta-analysis

  • Jeon, Yoon-Seon;Shivakumar, Manu;Kim, Dokyoon;Kim, Chang-Sung;Lee, Jung-Seok
    • Journal of Periodontal and Implant Science / v.51 no.1 / pp.18-29 / 2021
  • Purpose: The aim of this study was to compare the characteristic expression patterns of advanced periodontitis in 2 cohort data sets analyzed using different microarray platforms, and to identify differentially expressed genes (DEGs) through a meta-analysis of both data sets. Methods: Twenty-two patients for cohort 1 and 40 patients for cohort 2 were recruited with the same inclusion criteria. The 2 cohort groups were analyzed using different platforms: Illumina and Agilent. A meta-analysis was performed to increase reliability by removing statistical differences between platforms. An integrative meta-analysis based on an empirical Bayesian methodology (ComBat) was conducted. DEGs for the integrated data sets were identified using the limma package to adjust for age, sex, and platform and compared with the results for cohorts 1 and 2. Clustering and pathway analyses were also performed. Results: This study detected 557 and 246 DEGs in cohorts 1 and 2, respectively, with 146 and 42 significantly enriched gene ontology (GO) terms. Overlapping between cohorts 1 and 2 was present in 59 DEGs and 18 GO terms. However, only 6 genes from the top 30 enriched DEGs overlapped, and there were no overlapping GO terms in the top 30 enriched pathways. The integrative meta-analysis detected 34 DEGs, of which 10 overlapped in all the integrated data sets of cohorts 1 and 2. Conclusions: The characteristic expression pattern differed between periodontitis and the healthy periodontium, but the consistency between the data sets from different cohorts and metadata was too low to suggest specific biomarkers for identifying periodontitis.

Initial Mode Decision Method for Clustering in Categorical Data

  • Yang, Soon-Cheol;Kang, Hyung-Chang;Kim, Chul-Soo
    • Journal of the Korean Data and Information Science Society / v.18 no.2 / pp.481-488 / 2007
  • The k-means algorithm is well known for its efficiency in clustering large data sets. However, because it works only on numeric values, it cannot be used to cluster real-world data containing categorical values. The k-modes algorithm extends the k-means paradigm to categorical domains. The algorithm requires the initial points (modes) of the clusters to be preset or randomly selected. This paper addresses this problem of the k-modes algorithm using the Max-Min method, one of the methods for deciding initial values in the k-means algorithm. We introduce new similarity measures for clustering categorical data. Tests on the mushroom and soybean data sets show that the proposed algorithm performs well in two respects (accuracy and run time).
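
The idea of choosing initial modes by a Max-Min rule can be sketched as follows. This is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not specify its new similarity measures, so the sketch substitutes the simple matching dissimilarity (count of differing attributes), seeds with the first record, and uses hypothetical function names and toy data.

```python
def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ.

    Assumption: stands in for the paper's unspecified similarity measures.
    """
    return sum(x != y for x, y in zip(a, b))


def max_min_initial_modes(records, k):
    """Pick k initial modes with a Max-Min rule.

    Start from the first record (an arbitrary choice made for this sketch),
    then repeatedly add the record whose distance to its nearest
    already-chosen mode is largest.
    """
    if not 1 <= k <= len(records):
        raise ValueError("k must be between 1 and the number of records")
    modes = [records[0]]
    while len(modes) < k:
        farthest = max(
            records,
            key=lambda r: min(matching_distance(r, m) for m in modes),
        )
        modes.append(farthest)
    return modes


# Hypothetical categorical records (colour, shape, size)
records = [
    ("red", "round", "small"),
    ("red", "round", "large"),
    ("blue", "flat", "large"),
    ("blue", "flat", "small"),
    ("green", "oval", "medium"),
]
modes = max_min_initial_modes(records, 3)
print(modes)
```

The selected modes would then be handed to an ordinary k-modes loop in place of randomly drawn starting points, so that the initial modes are spread across the data.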

Effect of Heterogeneous Variance by Sex and Genotypes by Sex Interaction on EBVs of Postweaning Daily Gain of Angus Calves

  • Oikawa, T.;Hammond, K.;Tier, B.
    • Asian-Australasian Journal of Animal Sciences / v.12 no.6 / pp.850-853 / 1999
  • Angus postweaning daily gain (PWDG) was analyzed to investigate the effects of heterogeneous variance and genotype by sex interaction on the prediction of EBVs, using data sets of various environmental levels. The whole data set (16,239 records) was divided into six data sets according to averages of the best linear unbiased estimator (BLUE) of herd environment. A comparison of prediction models showed that a single-trait model is adequate for most of the data sets, except for the poor-environment data set, for both bulls and heifers, where heterogeneity of variance and genotype by sex interaction exist. In the prediction with the data set of the low environmental level, the bulls' EBVs from single-trait models had high product-moment correlations with the male EBVs of the bulls from the multitrait model, whereas the heifers' EBVs had moderate correlations with the female EBVs from the multitrait model. This moderate correlation seems to result from the heterogeneity of variance and the low heritability of the heifers' PWDG. The prediction models with heterogeneity of variance had little effect on the prediction of EBVs for the data sets with moderate to high genetic correlations.

Microarray Data Retrieval Using Fuzzy Signature Sets (퍼지 시그너쳐 집합을 이용한 마이크로어레이 데이터 검색)

  • Lee, Sun-A;Lee, Keon-Myung;Ryu, Keun-Ho
    • Journal of the Korean Institute of Intelligent Systems / v.19 no.4 / pp.545-549 / 2009
  • Microarray data sets can contain thousands of gene expression levels and are considered an important source from which meaningful patterns can be extracted for further analysis in biological studies. It is sometimes necessary to retrieve specific genes or samples of interest to the analyst in an effective way. This paper is concerned with a method that uses fuzzy signature sets to filter out genes or samples that satisfy complicated constraints as well as simple ones. Fuzzy signatures are an extension of vector-valued fuzzy sets in which elements of the vector are themselves allowed to be vectors. Fuzzy signature sets are similar to fuzzy signatures except that their leaf elements are fuzzy sets defined on the interval [0,1]. This paper introduces an extension of fuzzy signature sets that specifies aggregation operators at each internal node and comparison operators for aggregation. It also shows how to use the extended fuzzy signature sets in microarray data retrieval and gives some examples of their usage.
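
The nested structure with an aggregation operator at each internal node can be illustrated with a small sketch. It is a simplification under stated assumptions: leaves here are plain membership degrees in [0,1] rather than fuzzy sets, the operator names (min, max, mean) and the tuple encoding are choices made for this example, and the comparison operators mentioned in the abstract are not modelled.

```python
from statistics import mean

# Assumed set of aggregation operators available at internal nodes
AGGREGATORS = {"min": min, "max": max, "mean": mean}


def evaluate(node):
    """Collapse a nested signature to one membership degree.

    A node is either a number in [0, 1] (leaf) or a pair
    (operator_name, [child nodes]) (internal node).
    """
    if isinstance(node, (int, float)):
        if not 0.0 <= node <= 1.0:
            raise ValueError("leaf membership degrees must lie in [0, 1]")
        return float(node)
    operator_name, children = node
    values = [evaluate(child) for child in children]
    return AGGREGATORS[operator_name](values)


# Hypothetical signature for one gene: an overall "min" over three criteria,
# two of which are themselves small vectors with their own operators.
signature = ("min", [0.9, ("max", [0.2, 0.7]), ("mean", [0.5, 1.0])])
score = evaluate(signature)
print(score)
```

A retrieval step could then keep only the genes or samples whose aggregated degree exceeds a chosen threshold.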

Band Feature Extraction of Normal Distributive Multispectral Image Data using Rough Sets

  • Chung, Hwan-mook;Won, Sung-Hyun
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 1998.06a / pp.314-319 / 1998
  • In this paper, a band feature extraction method using rough sets theory is proposed for efficient data classification in a multispectral band environment. First, we build a look-up table from training data and analyze the properties of the experimental multispectral image data; we then select the efficient bands using the indiscernibility relation of rough sets theory on the analysis results. The proposed method is applied to LANDSAT TM data acquired on June 2, 1992. The experiments focused mainly on normally distributed data. The results show clustering trends similar to those of traditional band selection based on wavelength properties. From this, we verify that the proposed method, which centers on data properties, can be used to select efficient bands even if the data sensing environment changes to a hyperspectral band environment.
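
The indiscernibility relation used for band selection can be sketched on a toy look-up table. The table contents, band names, class labels and function names below are hypothetical; the abstract does not give the paper's discretisation or selection criterion, so this only shows the general rough-set idea that a band subset is useful when pixels it cannot tell apart always share the same class.

```python
from collections import defaultdict


def indiscernibility_classes(table, attributes):
    """Group object ids whose values agree on every attribute in `attributes`."""
    groups = defaultdict(list)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attributes)
        groups[key].append(obj_id)
    return sorted(sorted(group) for group in groups.values())


def preserves_decision(table, attributes, decision):
    """True if objects indiscernible on `attributes` always share one decision value."""
    for group in indiscernibility_classes(table, attributes):
        if len({table[obj_id][decision] for obj_id in group}) > 1:
            return False
    return True


# Hypothetical look-up table: discretised band values per training pixel
table = {
    "p1": {"band1": "low", "band2": "high", "band3": "low", "cls": "water"},
    "p2": {"band1": "low", "band2": "high", "band3": "high", "cls": "water"},
    "p3": {"band1": "high", "band2": "low", "band3": "low", "cls": "forest"},
    "p4": {"band1": "high", "band2": "low", "band3": "high", "cls": "forest"},
    "p5": {"band1": "high", "band2": "high", "band3": "high", "cls": "urban"},
}

print(indiscernibility_classes(table, ["band3"]))
print(preserves_decision(table, ["band1", "band2"], "cls"))
print(preserves_decision(table, ["band3"], "cls"))
```

In this toy table, band1 and band2 together separate the three classes while band3 alone does not, so band3 would be a candidate for removal.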

A Study on Fusion and Visualization using Multibeam Sonar Data with Various Spatial Data Sets for Marine GIS

  • Kong, Seong-Kyu
    • Journal of Advanced Marine Engineering and Technology / v.34 no.3 / pp.407-412 / 2010
  • Owing to remarkable advances in sonar technology, positioning capabilities, and computer processing power, we can accurately image and explore the seafloor in hydrography. In particular, the Multibeam Echo Sounder can provide nearly complete coverage of the seafloor at high resolution. Since the mid-1990s, Multibeam Echo Sounders have been used for hydrographic surveying in Korea. In this study, a new marine data set is proposed as an effective decision-making tool in various fields, created by visualizing Multibeam sonar data and combining it with marine spatial data sets such as satellite images and digital nautical charts. The proposed method was tested around the port of PyeongTaek-DangJin on the west coast of Korea. The visualization and fusion methods are described, along with the processing of the various marine data sets. We demonstrate that the new data set in marine GIS is useful for safe navigation and port management as an efficient decision-making tool.