• Title/Summary/Keyword: Public dataset

Standard-based Integration of Heterogeneous Large-scale DNA Microarray Data for Improving Reusability

  • Jung, Yong;Seo, Hwa-Jeong;Park, Yu-Rang;Kim, Ji-Hun;Bien, Sang Jay;Kim, Ju-Han
    • Genomics & Informatics / v.9 no.1 / pp.19-27 / 2011
  • The Gene Expression Omnibus (GEO) holds the largest collection of gene-expression microarray data, which has grown exponentially. Microarray data in GEO have been generated in many different formats and often lack standardized annotation and documentation. It is hard to know whether preprocessing has been applied to a dataset and, if so, in what way. Standard-based integration of heterogeneous data formats and metadata is necessary for comprehensive data query, analysis and mining. We attempted to integrate the heterogeneous microarray data in GEO based on the Minimum Information About a Microarray Experiment (MIAME) standard. We unified the data fields of the GEO Data table and mapped the attributes of GEO metadata onto MIAME elements. We also distinguished non-preprocessed raw datasets from processed and other datasets by using a two-step classification method. Most of the procedures were developed as semi-automated algorithms with some degree of text mining. We localized 2,967 Platforms, 4,867 Series and 103,590 Samples covering 279 organisms, integrated them into a standard-based relational schema, and developed a comprehensive query interface to extract them. Our tool, GEOQuest, is available at http://www.snubi.org/software/GEOQuest/.

Imputation for Binary or Ordered Categorical Traits Based on the Bayesian Threshold Model

  • Lee Seung-Chun
    • The Korean Journal of Applied Statistics / v.18 no.3 / pp.597-606 / 2005
  • Nonresponse in sample surveys causes problems when analyzing datasets in public-use files, where the user has only complete-data methods available and limited information about the reasons for nonresponse. Recently, imputation has become a standard approach for handling nonresponse, and various imputation methods have been devised. However, most imputation methods are concerned with continuous traits, while many interesting features in sample surveys are measured on binary or ordered categorical scales. In this note, an imputation method for ignorable nonresponse in binary or ordered categorical traits is considered.

An Adaptive Workflow Scheduling Scheme Based on an Estimated Data Processing Rate for Next Generation Sequencing in Cloud Computing

  • Kim, Byungsang;Youn, Chan-Hyun;Park, Yong-Sung;Lee, Yonggyu;Choi, Wan
    • Journal of Information Processing Systems / v.8 no.4 / pp.555-566 / 2012
  • The cloud environment makes it possible to analyze large data sets on a scalable computing infrastructure. In the bioinformatics field, applications are composed of complex workflow tasks, which require huge data storage as well as computing-intensive parallel workloads. Many distributed solutions have been introduced. However, they focus on static resource provisioning with a batch-processing scheme in a local computing farm and data storage. For a large-scale workflow system, it is inevitable and valuable to outsource all or part of its tasks to public clouds to reduce resource costs. Problems arise, however, from the transfer time for huge datasets and from unbalanced completion times across different problem sizes. In this paper, we propose an adaptive resource-provisioning scheme that includes run-time data distribution and collection services for hiding the data transfer time. The proposed scheme optimizes the allocation ratio of computing elements to the different datasets in order to minimize the total makespan under resource constraints. We conducted experiments with a well-known sequence alignment algorithm, and the results showed that the proposed scheme is efficient for the cloud environment.
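
The idea of choosing an allocation ratio of computing elements per dataset to reduce the makespan can be illustrated with a small sketch. This is not the authors' scheme: it is a simple greedy heuristic under the assumption of linear speedup, and the function name `allocate_elements` and its parameters are hypothetical.

```python
import heapq


def allocate_elements(sizes, rates, total_elements):
    """Greedily split computing elements across datasets.

    sizes[i]: amount of data in dataset i (hypothetical units).
    rates[i]: estimated processing rate of ONE element on dataset i.
    Assumption: linear speedup, so k elements finish dataset i in
    sizes[i] / (rates[i] * k) time units.
    """
    n = len(sizes)
    if n == 0 or len(rates) != n:
        raise ValueError("sizes and rates must be non-empty and equal in length")
    if total_elements < n:
        raise ValueError("need at least one element per dataset")

    alloc = [1] * n  # every dataset starts with one element
    # Max-heap on estimated finish time (negated, since heapq is a min-heap).
    heap = [(-sizes[i] / rates[i], i) for i in range(n)]
    heapq.heapify(heap)

    # Give each spare element to the dataset that currently finishes last.
    for _ in range(total_elements - n):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        heapq.heappush(heap, (-sizes[i] / (rates[i] * alloc[i]), i))

    makespan = max(sizes[i] / (rates[i] * alloc[i]) for i in range(n))
    return alloc, makespan


if __name__ == "__main__":
    print(allocate_elements([100.0, 300.0], [1.0, 1.0], 4))
```

Re-running the allocation whenever the estimated rates change would give a crude adaptive loop; data transfer time is ignored here.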

Rice Crop Monitoring Using RADARSAT

  • Suchaichit, Waraporn
    • Proceedings of the KSRS Conference / 2003.11a / pp.37-37 / 2003
  • Rice is one of the most important crops in the world and is a major export of Thailand. Optical sensors are not useful for rice monitoring, because most cultivated areas are often obscured by cloud during the growing period, especially in South East Asia. Spaceborne Synthetic Aperture Radar (SAR), such as RADARSAT, can see through cloud regardless of weather conditions, which makes it possible to monitor rice growth and to retrieve rice acreage using the unique temporal signature of rice fields. This paper presents the results of a study examining the backscatter behavior of rice using a multi-temporal RADARSAT dataset. Ground measurements of paddy parameters and of water and soil conditions were collected. The ground truth information was also used to identify mature rice crops, orchards, roads, residences, and aquaculture ponds. Land use class distributions from the RADARSAT image were analyzed. Comparison of the mean dB of each land use class indicated significant differences. Schematic representations of the temporal backscatter of the rice crop were plotted. Based on the study carried out at the Pathum Thani Province test site, the results showed variation of sigma naught from the first tillering vegetative phase until the ripening phase. It is suggested that at least three radar data acquisitions, taken at three stages of the rice growth cycle, are required for rice monitoring: at the beginning of rice growth when the field is still covered with water, in the ear differentiation period, and at the beginning of the harvest season. This pilot project was an experimental one aimed at future operational rice monitoring and potential yield prediction.

Identification of Differentially Expressed Genes Using Tests Based on Multiple Imputations

  • Kim, Sang Cheol;Yu, Donghyeon
    • Quantitative Bio-Science / v.36 no.1 / pp.23-31 / 2017
  • Datasets from DNA microarray experiments, which take the form of large matrices of gene expression levels, often have missing values. However, existing statistical methods, including principal component analysis (PCA) and Hotelling's t-test, are not directly applicable to datasets with missing values because they generally assume that the observed dataset is complete. Many methods have been proposed in the literature to impute the missing values in the observed data. Troyanskaya et al. [1] study k-nearest neighbor (kNN) imputation, Kim et al. [2] propose the local least squares (LLS) method, and Rubin [3] proposes multiple imputation (MI) for missing values. To identify differentially expressed genes, we propose a new testing procedure for the case where the observed data contain missing values. The proposed procedure uses Stouffer's z-scores and combines the test results of the individual imputed samples, which are dependent on each other. We show numerically that the proposed test procedure based on MI performs better than existing test procedures based on single imputation (SI) by comparing their ROC curves. We apply the proposed method to the analysis of a public microarray dataset.
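
As a rough illustration of combining per-imputation test results with Stouffer's z-score, the sketch below sums the z-scores and rescales by the standard deviation of that sum. The abstract does not say how the dependence between imputed samples is handled, so the common pairwise correlation `rho` used here is purely an assumption for illustration, as are the function names.

```python
import math
from statistics import NormalDist


def stouffer_z(z_scores, rho=0.0):
    """Combine z-scores from m imputed datasets into one z-score.

    With rho = 0 this is the classical Stouffer statistic sum(z) / sqrt(m).
    For rho > 0 the variance of the sum is m + m*(m-1)*rho, which assumes
    every pair of z-scores shares the same correlation (an assumption).
    """
    m = len(z_scores)
    if m == 0:
        raise ValueError("need at least one z-score")
    variance = m + m * (m - 1) * rho
    if variance <= 0:
        raise ValueError("rho is too negative for this number of z-scores")
    return sum(z_scores) / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value of a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    zs = [1.8, 2.1, 1.9, 2.3, 2.0]  # made-up z-scores for one gene
    for rho in (0.0, 0.5, 1.0):
        z = stouffer_z(zs, rho)
        print(rho, round(z, 3), round(two_sided_p(z), 4))
```

Ignoring positive dependence (rho = 0) inflates the combined statistic, which is why a combination rule for imputed samples has to account for their correlation in some way.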

Interactions of Behavioral Changes in Smoking, High-risk Drinking, and Weight Gain in a Population of 7.2 Million in Korea

  • Kim, Yeon-Yong;Kang, Hee-Jin;Ha, Seongjun;Park, Jong Heon
    • Journal of Preventive Medicine and Public Health / v.52 no.4 / pp.234-241 / 2019
  • Objectives: To identify simultaneous behavioral changes in alcohol consumption, smoking, and weight using a fixed-effect model and to characterize their associations with disease status. Methods: This study included 7 000 529 individuals who participated in the national biennial health-screening program every 2 years from 2009 to 2016 and were aged 40 or more. We reconstructed the data into an individual-level panel dataset with 4 waves. We used a fixed-effect model for smoking, heavy alcohol drinking, and overweight. The independent variables were sex, age, lifestyle factors, insurance contribution, employment status, and disease status. Results: Becoming a high-risk drinker and losing weight were associated with initiation or resumption of smoking. Initiation or resumption of smoking and weight gain were associated with non-high-risk drinkers becoming high-risk drinkers. Smoking cessation and becoming a high-risk drinker were associated with normal-weight participants becoming overweight. Participants with newly acquired diabetes mellitus, ischemic heart disease, stroke, and cancer tended to stop smoking, discontinue high-risk drinking, and return to a normal weight. Conclusions: These results obtained using a large-scale population-based database documented interactions among lifestyle factors over time.
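
For readers unfamiliar with fixed-effect models on panel data, the following minimal sketch shows the standard within (demeaning) estimator for a single regressor in a linear model. The abstract does not specify the study's exact model form, so this linear version, the function name, and the toy data are assumptions for illustration only.

```python
from collections import defaultdict


def within_estimator(records):
    """Fixed-effect slope for y = a_i + beta * x + error.

    records: iterable of (person_id, x, y) rows from a panel dataset.
    Subtracting each person's own means removes the person-specific
    intercept a_i, so beta is identified from within-person changes only.
    """
    by_person = defaultdict(list)
    for person_id, x, y in records:
        by_person[person_id].append((x, y))

    numerator = 0.0
    denominator = 0.0
    for observations in by_person.values():
        x_mean = sum(x for x, _ in observations) / len(observations)
        y_mean = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            numerator += (x - x_mean) * (y - y_mean)
            denominator += (x - x_mean) ** 2

    if denominator == 0.0:
        raise ValueError("x never varies within any person")
    return numerator / denominator


if __name__ == "__main__":
    # Toy panel: two people with different baselines but the same slope.
    panel = [("A", 0, 5), ("A", 1, 7), ("A", 2, 9), ("B", 1, -8), ("B", 3, -4)]
    print(within_estimator(panel))
```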

Patch based Semi-supervised Linear Regression for Face Recognition

  • Ding, Yuhua;Liu, Fan;Rui, Ting;Tang, Zhenmin
    • KSII Transactions on Internet and Information Systems (TIIS) / v.13 no.8 / pp.3962-3980 / 2019
  • To deal with single-sample face recognition, this paper presents a patch-based semi-supervised linear regression (PSLR) algorithm, which draws facial variation information from unlabeled samples. Each facial image is divided into overlapping patches, and a regression model with a mapping matrix is constructed on each patch. Then, we adjust these matrices by mapping unlabeled patches to $[1,1,{\cdots},1]^T$. The solutions of all the mapping matrices are integrated into an overall objective function, which uses ${\ell}_{2,1}$-norm minimization constraints to improve the discrimination ability of the mapping matrices and reduce the impact of noise. After the mapping matrices are computed, we adopt a majority-voting strategy to classify the probe samples. To further learn the discrimination information between probe samples and obtain more robust mapping matrices, we also propose a multistage PSLR (MPSLR) algorithm, which iteratively updates the training dataset by adding reliably labeled probe samples to it. The effectiveness of our approaches is evaluated using three public facial databases. Experimental results show that our approaches are robust to illumination, expression and occlusion.
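
The final majority-voting step can be sketched as follows, assuming each patch-level regression model has already produced a predicted identity for the probe image. The function name and the example labels are hypothetical, and the tie-breaking rule (first label to reach the top count wins) is a choice made for this sketch, not something stated in the abstract.

```python
from collections import Counter


def majority_vote(patch_labels):
    """Return the identity predicted by the most patches.

    patch_labels: one predicted label per patch of a probe image.
    Ties go to the label that appears first in patch_labels, because
    Counter.most_common keeps insertion order among equal counts.
    """
    if not patch_labels:
        raise ValueError("need at least one patch prediction")
    return Counter(patch_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical per-patch predictions for one probe face.
    votes = ["person_3", "person_7", "person_3", "person_3", "person_1"]
    print(majority_vote(votes))
```

Voting over patches means that a few occluded or badly lit patches can be outvoted by the remaining ones.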

Improving the Quality of Response Surface Analysis of an Experiment for Coffee-Supplemented Milk Beverage: I. Data Screening at the Center Point and Maximum Possible R-Square

  • Rheem, Sungsue;Oh, Sejong
    • Food Science of Animal Resources / v.39 no.1 / pp.114-120 / 2019
  • Response surface methodology (RSM) is a useful set of statistical techniques for modeling and optimizing responses in food science research. As a design for a response surface experiment, a central composite design (CCD) with multiple runs at the center point is frequently used. However, sometimes some of the responses at the center point are outliers, and these outliers are overlooked. Since the responses from center runs come from the same experimental conditions, there should be no outliers at the center point. Outliers at the center point ruin statistical analysis. Thus, the responses at the center point need to be inspected, and if outliers are observed, they have to be examined. If the reasons for the outliers are not errors in measuring or typing, such outliers need to be deleted. If the outliers are due to such errors, they have to be corrected. Through a re-analysis of a dataset published in the Korean Journal for Food Science of Animal Resources, we have shown that outlier elimination increased the maximum possible R-square that modeling of the data can attain, which enables us to improve the quality of response surface analysis.
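
The effect of a center-point outlier on the maximum possible R-square can be illustrated numerically. The sketch assumes the usual lack-of-fit decomposition, in which no model can explain the pure-error sum of squares among replicated runs, so the ceiling is 1 - SS_pure_error / SS_total. The abstract does not give the formula, so this definition, the function name, and the data are assumptions.

```python
def max_possible_r2(groups):
    """Upper bound on R-square given replicated design points.

    groups: list of lists; each inner list holds the responses observed
    at one design point (the center point usually has several runs).
    Assumed definition: 1 - SS_pure_error / SS_total.
    """
    all_y = [y for group in groups for y in group]
    if len(all_y) < 2:
        raise ValueError("need at least two responses")
    grand_mean = sum(all_y) / len(all_y)
    ss_total = sum((y - grand_mean) ** 2 for y in all_y)
    if ss_total == 0.0:
        raise ValueError("responses do not vary")

    ss_pure_error = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ss_pure_error += sum((y - group_mean) ** 2 for y in group)
    return 1.0 - ss_pure_error / ss_total


if __name__ == "__main__":
    # Made-up responses: four non-replicated points plus center runs.
    with_outlier = [[1.0], [2.0], [3.0], [5.0], [4.0, 4.1, 6.0]]
    without_outlier = [[1.0], [2.0], [3.0], [5.0], [4.0, 4.1]]
    print(max_possible_r2(with_outlier))
    print(max_possible_r2(without_outlier))
```

In the made-up data, dropping the center run of 6.0 shrinks the pure error and raises the ceiling on R-square, which mirrors the effect described in the abstract.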

Data-Linking Infrastructure for the Health Technology Assessment

  • Park, Chong Yon
    • The Journal of Health Technology Assessment / v.6 no.2 / pp.81-87 / 2018
  • With recent changes in the healthcare environment, including rapid technological development, evidence is increasingly important and necessary to support health technology assessment policies that provide safe and effective health services while using medical resources efficiently. Despite the emphasis on the importance of real world data and real world evidence in health care research, the current infrastructure supporting clinical research is considerably weak due to the absence of a legal and institutional basis. However, under Article 26 of the Health and Medical Technology Promotion Act, there is a limited legal apparatus that permits linking public data with other datasets only for the purpose of health technology assessment at the National Evidence-based Collaborating Agency. Although linked data from various sources have often been required in clinical research, such linkage has not yet worked well because the supporting conditions are insufficient. To support decision-making in medical practice and health care policy, a data-linking platform for clinical research is needed. If a legal system is established that allows linkage to the data of private institutions without violating important values such as the protection of private information, it will be a decisive foundation for strengthening the research and policy-making processes that improve the national health care system.

The Impact of Electricity Infrastructure Quality on Firm Productivity: Empirical Evidence from Southeast Asian Countries

  • BUI, Lan Thi Hoang;NGUYEN, Phi-Hung
    • The Journal of Asian Finance, Economics and Business / v.8 no.9 / pp.261-272 / 2021
  • Rapid economic growth in recent years has caused a surge in energy consumption among Southeast Asian countries and placed a considerable burden on the already inadequate power infrastructure. As a result, frequent blackouts and prolonged outages have become common and have weakened firms' productive performance in those years. The main objective of this study is to examine the impact of power infrastructure quality on the performance of Southeast Asian manufacturing firms. In this study, data from the World Bank Enterprise Surveys were used as the training dataset, covering 4723 manufacturing firms in the period 2015-2016. The results reveal that industrial firms that suffered from power outages had consistently lower productivity. Measured by the length of such events, more severe outages tend to be more harmful to the firm. Furthermore, the findings indicate that most firms relied on self-generated electricity to reduce the negative impact of power outages, but this brings few benefits when operating at a small scale in some countries. Consequently, this study contributes to a growing literature that examines the economic impact of public infrastructure and how detrimental the poor state of such services is to a firm's downstream operations, productivity, and growth.