Genes Frequently Coexpressed with Hoxc8 Provide Insight into the Discovery of Target Genes

  • Kalyani, Ruthala;Lee, Ji-Yeon;Min, Hyehyun;Yoon, Heejei;Kim, Myoung Hee
    • Molecules and Cells / v.39 no.5 / pp.395-402 / 2016
  • Identifying Hoxc8 target genes is at the crux of understanding the Hoxc8-mediated regulatory networks underlying its roles during development. However, identification of these genes remains difficult due to intrinsic factors of Hoxc8, such as low DNA binding specificity, context-dependent regulation, and unknown cofactors. Therefore, as an alternative, the present study attempted to test whether the roles of Hoxc8 could be inferred by simply analyzing genes frequently coexpressed with Hoxc8, and whether these genes include putative target genes. Using archived gene expression datasets in which Hoxc8 was differentially expressed, we identified a total of 567 genes that were positively coexpressed with Hoxc8 in at least four out of eight datasets. Among these, 23 genes were coexpressed in six datasets. Gene sets associated with extracellular matrix and cell adhesion were most significantly enriched, followed by gene sets for skeletal system development, morphogenesis, cell motility, and transcriptional regulation. In particular, transcriptional regulators, including paralogs of Hoxc8, known Hox co-factors, and transcriptional remodeling factors, were enriched. We randomly selected Adam19, Ptpn13, Prkd1, Tgfbi, and Aldh1a3, and validated their coexpression in mouse embryonic tissues and cell lines following TGF-$\beta$2 treatment or ectopic Hoxc8 expression. Except for Aldh1a3, all genes showed concordant expression with that of Hoxc8, suggesting that the coexpressed genes might include direct or indirect target genes. Collectively, we suggest that the coexpressed genes provide a resource for constructing Hoxc8-mediated regulatory networks.
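
The "coexpressed in at least four out of eight datasets" criterion above amounts to a frequency threshold across datasets. The following is only a minimal sketch of that counting step, not the authors' pipeline: the function name `frequently_coexpressed`, the placeholder gene names, and the toy gene sets are all hypothetical, and it assumes each dataset has already been reduced to a set of genes positively coexpressed with the gene of interest.

```python
from collections import Counter


def frequently_coexpressed(gene_sets, min_datasets):
    """Return the genes that occur in at least `min_datasets` of the sets."""
    # Count each gene once per dataset, even if a dataset lists it twice.
    counts = Counter(gene for genes in gene_sets for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= min_datasets}


# Hypothetical toy input: four datasets with placeholder gene names.
toy_datasets = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneA", "geneD"},
    {"geneB", "geneD"},
]

hits = frequently_coexpressed(toy_datasets, min_datasets=3)
```

With the study's thresholds, the call would use eight gene sets and `min_datasets=4`.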

Statistical estimation of crop yields for the Midwestern United States using satellite images, climate datasets, and soil property maps

  • Kim, Nari;Cho, Jaeil;Hong, Sungwook;Ha, Kyung-Ja;Shibasaki, Ryosuke;Lee, Yang-Won
    • Korean Journal of Remote Sensing / v.32 no.4 / pp.383-401 / 2016
  • In this paper, we describe the statistical modeling of crop yields using satellite images, climatic datasets, soil property maps, and fertilizer data for the Midwestern United States during 2001-2012. Satellite images were obtained from the Moderate Resolution Imaging Spectroradiometer (MODIS), and climatic datasets were provided by the Parameter-elevation Regressions on Independent Slopes Model (PRISM) Climate Group. Soil property maps were derived from the Harmonized World Soil Database (HWSD). Our multivariate regression models predicted corn and soybean yields with differences of approximately 8-15% from the governmental statistics. According to the regression models, the unfavorable climate and vegetation conditions in 2012 were expected to decrease yields, but the actual yields were greater than predicted. This suggests that factors other than climate, vegetation, soil, and fertilizer may be involved in the negative biases. We also found that soybean yield was more affected by minimum temperature conditions, while corn yield was more associated with photosynthetic activity. The two crops may therefore respond differently to climate change, and it is important to quantify their sensitivity to climatic variations to support human adaptation. Considering the yield decreases during the drought event, we can assume that the climatic effect may be stronger than human adaptive capacity. Thus, further studies are needed, particularly ones that improve the data on human activities such as tillage, fertilization, irrigation, and comprehensive agricultural technologies.

Performance Comparison of Two Gene Set Analysis Methods for Genome-wide Association Study Results: GSA-SNP vs i-GSEA4GWAS

  • Kwon, Ji-Sun;Kim, Ji-Hye;Nam, Doug-U;Kim, Sang-Soo
    • Genomics & Informatics / v.10 no.2 / pp.123-127 / 2012
  • Gene set analysis (GSA) is useful in interpreting a genome-wide association study (GWAS) result in terms of biological mechanism. We compared the performance of two different GSA implementations that accept GWAS p-values of single nucleotide polymorphisms (SNPs) or gene-by-gene summaries thereof, GSA-SNP and i-GSEA4GWAS, under the same settings of inputs and parameters. GSA runs were made with two sets of p-values from a Korean type 2 diabetes mellitus GWAS: 259,188 and 1,152,947 SNPs of the original and imputed genotype datasets, respectively. When Gene Ontology terms were used as gene sets, i-GSEA4GWAS produced 283 and 1,070 hits for the unimputed and imputed datasets, respectively. On the other hand, GSA-SNP reported 94 and 38 hits for the unimputed and imputed datasets, respectively. Similar trends, though to a lesser degree, were observed with Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets as well. The huge number of hits by i-GSEA4GWAS for the imputed dataset was probably an artifact due to the scaling step in the algorithm. The decrease in hits by GSA-SNP for the imputed dataset may be due to the fact that it relies on Z-statistics, which are sensitive to variations in the background level of associations. Judicious evaluation of the GSA outcomes, perhaps based on multiple programs, is recommended.

Data Cleaning and Integration of Multi-year Dietary Survey in the Korea National Health and Nutrition Examination Survey (KNHANES) using Database Normalization Theory (데이터베이스 정규화 이론을 이용한 국민건강영양조사 중 다년도 식이조사 자료 정제 및 통합)

  • Kwon, Namji;Suh, Jihye;Lee, Hunjoo
    • Journal of Environmental Health Sciences / v.43 no.4 / pp.298-306 / 2017
  • Objectives: Since 1998, the Korea National Health and Nutrition Examination Survey (KNHANES) has been conducted in order to investigate the health and nutritional status of Koreans. The food intake data of individuals in the KNHANES have also been utilized as a source dataset for risk assessment of chemicals via food. To improve the reliability of intake estimation and prevent missing data for rarely reported foods, a well-structured, integrated multi-year dataset is important. However, it is difficult to merge multi-year survey datasets due to ineffective cleaning processes for handling the extensive numbers of codes for each food item, along with changes in dietary habits over time. Therefore, this study aims at 1) establishing a cleaning process for abnormal data, 2) generating integrated multi-year raw data, and 3) contributing to the production of consistent dietary exposure factors. Methods: Codebooks, the guideline book, and raw intake data from KNHANES V and VI were used for analysis. Violations of the primary key constraint and of the first through third normal forms in relational database theory were tested for the codebook and the structure of the raw data, respectively. Afterwards, the cleaning process was executed for the raw data by using the integrated codes. Results: Duplication of key records and abnormalities in table structures were observed. After adjustment according to the method described above, the codes were corrected and integrated codes were newly created. Finally, we were able to clean the raw data provided by respondents to the KNHANES. Conclusion: The results of this study will contribute to the integration of the multi-year datasets and help improve the data production system by clarifying, testing, and verifying the primary key, the integrity of the codes, and the primitive data structure in the national health data according to database normalization theory.
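
The primary key test mentioned in the Methods can be illustrated with a short check for duplicated key values. This is a minimal sketch under stated assumptions, not the study's procedure: the field names `food_code` and `name`, the sample rows, and the helper `duplicate_keys` are hypothetical and do not come from the KNHANES codebooks.

```python
from collections import Counter


def duplicate_keys(rows, key_fields):
    """Return the sorted key tuples that occur in more than one row.

    A primary key must identify each row uniquely, so every tuple
    returned here marks a violation of the primary key constraint.
    """
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical codebook rows: the same code is attached to two spellings.
codebook = [
    {"food_code": "01001", "name": "rice, white"},
    {"food_code": "01002", "name": "rice, brown"},
    {"food_code": "01001", "name": "white rice"},
]

violations = duplicate_keys(codebook, ["food_code"])
```

Passing several field names checks a composite key instead, which would be one way to test codes that are only unique within a survey year.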

The Impact of Satellite Observations on Large-Scale Atmospheric Circulation in the Reanalysis Data: A Comparison Between JRA-55 and JRA-55C (위성 자료가 재분석자료의 대규모 대기 순환장에 미치는 영향: JRA-55와 JRA-55C 비교 연구)

  • Park, Mingyu;Choi, Yooseong;Son, Seok-Woo
    • Atmosphere / v.26 no.4 / pp.523-540 / 2016
  • The effects of satellite observations on large-scale atmospheric circulations in the reanalysis data are investigated by comparing the latest reanalysis data of the Japan Meteorological Agency (JRA-55) with its companion dataset, JRA-55 Conventional (JRA-55C). The latter is identical to the former except that satellite observations are excluded during the data assimilation process. Only conventional datasets are assimilated in JRA-55C. A simple comparison revealed a considerable difference in temperature and zonal wind fields in both the stratosphere and troposphere. Such differences are particularly large in the Southern Hemisphere and throughout the stratosphere, where conventional ground-based measurements are limited. The effects of satellite observations on the zonal-mean tropospheric circulations are further examined in terms of the Hadley cell, eddy-driven jet, and mid-latitude storm tracks. In both hemispheres, JRA-55C exhibits a slightly weaker and narrower Hadley cell than JRA-55. This is consistent with weaker diabatic heating in JRA-55C. The eddy-driven jet shows a small difference in its latitudinal location only in the Southern Hemisphere. Likewise, while the Northern-Hemisphere storm tracks are quantitatively similar in the two datasets, Southern-Hemisphere storm tracks are weaker in JRA-55C than in JRA-55. Their difference is comparable to the uncertainty between reanalysis datasets, indicating that satellite data assimilation could yield significant corrections in the zonal-mean circulation in the Southern Hemisphere.

Mapping the water table at the Cheongju-Gadeok site of the Korea National Groundwater Monitoring Network using multiple geophysical methods

  • Ju, Hyeon-Tae;Sa, Jin-Hyeon;Kim, Ji-Soo
    • The Journal of Engineering Geology / v.27 no.3 / pp.305-312 / 2017
  • The most effective way to distinguish subsurface interfaces that produce various geophysical responses is through the integration of multiple geophysical methods, with each method detecting both a complementary and unique set of distinct physical properties relating to the subsurface. In this study, shallow seismic reflection (SSR) and ground penetrating radar (GPR) surveys were conducted at the Cheongju-Gadeok site of the Korea National Groundwater Monitoring Network to map the water table, which was measured at 12 m depth during the geophysical surveys. The water table proved to be a good target reflector in both datasets, as the abrupt transition from the overlying unsaturated weathered rock to the underlying saturated weathered rock yielded large acoustic impedance and dielectric constant contrasts. The two datasets were depth converted and integrated into a single section, with the SSR and GPR surveys conducted to ensure subsurface imaging at approximately the same wavelength. The GPR data provided detailed information on the upper ~15 m of the section, whereas the SSR data imaged structures at depths of 10-45 m. The integrated section thus captured the full depth coverage of the sandy clay, water table, weathered rock, soft rock, and hard rock structures, which correlated well with local drillcore and water table observations. Incorporation of these two geophysical datasets yielded a synthetic section that resembled a simplified aquifer model, with the best-fitting seismic velocity, dielectric constant, and porosity of the saturated weathered layer being $v_{\mathrm{seismic}} = 1000\ \mathrm{m/s}$, $\varepsilon_r = 16$, and $\phi = 0.32$, respectively.
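
For readers who want to relate the reported dielectric constant to a radar velocity, the standard approximation for a low-loss, non-magnetic medium is $v = c/\sqrt{\varepsilon_r}$. The sketch below applies it to the best-fitting value $\varepsilon_r = 16$. Treating the saturated weathered layer as low-loss is an assumption made here, not a statement from the abstract, and the function name `radar_velocity` is hypothetical.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # speed of light in vacuum


def radar_velocity(relative_permittivity):
    """Radar wave velocity in m/s for a low-loss, non-magnetic medium."""
    if relative_permittivity <= 0:
        raise ValueError("relative permittivity must be positive")
    return SPEED_OF_LIGHT_M_PER_S / math.sqrt(relative_permittivity)


# Best-fitting dielectric constant of the saturated weathered layer.
v_gpr = radar_velocity(16.0)

# The same velocity in metres per nanosecond, a common unit in GPR work.
v_gpr_m_per_ns = v_gpr * 1e-9
```

Under this assumption the radar velocity is one quarter of the speed of light in vacuum, because the square root of 16 is 4.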

A Scalable Clustering Method for Categorical Sequences (범주형 시퀀스들에 대한 확장성 있는 클러스터링 방법)

  • Oh, Seung-Joon;Kim, Jae-Yearn
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.2 / pp.136-141 / 2004
  • There has been enormous growth in the amount of commercial and scientific data, such as retail transactions, protein sequences, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. We also present an efficient method for determining the similarity measure and develop a clustering algorithm. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a real dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.

Experimental Analysis of Recent Works on the Overlap Phase of De Novo Sequence Assembly (De novo 시퀀스 어셈블리의 overlap 단계의 최근 연구 실험 분석)

  • Lim, Jihyuk;Kim, Sun;Park, Kunsoo
    • Journal of KIISE / v.45 no.3 / pp.200-210 / 2018
  • Given a set of DNA read sequences, de novo sequence assembly reconstructs a target sequence without a reference sequence. For reconstruction, the assembly needs the overlap phase, which computes the overlaps between every pair of reads. Since the overlap phase is the most time-consuming part of the whole assembly, the performance of the assembly depends on that of the overlap phase. There have been extensive studies on the overlap phase in various fields. Among them, three state-of-the-art results for the overlap phase are Readjoiner, SOF, and the Lim-Park algorithm. Recently, the rapid development of sequencing technology has made it possible to produce a large read dataset at a low cost, and many platforms for generating DNA read datasets have been developed. Since the platforms produce datasets with different statistical characteristics, a performance evaluation of the overlap phase should consider datasets with these characteristics. In this paper, we compare and analyze the performance of the three algorithms on various large datasets.
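
To make the overlap phase concrete, the sketch below computes suffix-prefix overlaps between every ordered pair of reads by brute force. It is a naive quadratic illustration only, and it is not Readjoiner, SOF, or the Lim-Park algorithm, which exist precisely to avoid this cost. The function names, the `min_len` threshold, and the three toy reads are hypothetical.

```python
def overlap_length(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    longest_possible = min(len(a), len(b))
    # Try the longest candidate first so the first match is the maximum.
    for k in range(longest_possible, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_overlaps(reads, min_len=3):
    """Map each ordered pair of read indices (i, j) to its overlap length."""
    result = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue  # a read is not compared with itself
            k = overlap_length(a, b, min_len)
            if k:
                result[(i, j)] = k
    return result


# Hypothetical toy reads.
reads = ["ACGTAC", "TACGGA", "GGATTT"]
overlaps = all_overlaps(reads)
```

Each ordered pair is checked separately because the overlap of read i into read j generally differs from the overlap of read j into read i.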

Optimal number of dimensions in linear discriminant analysis for sparse data (희박한 데이터에 대한 선형판별분석에서 최적의 차원 수 결정)

  • Shin, Ga In;Kim, Jaejik
    • The Korean Journal of Applied Statistics / v.30 no.6 / pp.867-876 / 2017
  • Datasets with small n and large p are often found in various fields, and the analysis of such datasets is still a challenge in statistics. Discriminant analysis models for such datasets were recently developed for classification problems. One approach among those models tries to detect dimensions that distinguish well between groups, and the number of detected dimensions is typically smaller than p. In such models, the number of dimensions is important for the prediction and visualization of data, and it can usually be determined by K-fold cross-validation (CV). However, in sparse data scenarios, CV is not reliable for determining the optimal number of dimensions since there can be only a few observations in each fold. Thus, we propose a method to determine the number of dimensions using a measure based on the standardized distance between the mean values of each group in the reduced dimensions. The proposed method is verified through simulations.

Photogrammetric Georeferencing Using LIDAR Linear and Areal Features

  • HABIB Ayman;GHANMA Mwafag;MITISHITA Edson
    • Korean Journal of Geomatics / v.5 no.1 / pp.7-19 / 2005
  • Photogrammetric mapping procedures have gone through major developments due to significant improvements in their underlying technologies. The availability of GPS/INS systems greatly assists in direct geo-referencing of the acquired imagery. Still, photogrammetric datasets taken without the aid of positioning and navigation systems need control information for the purpose of surface reconstruction. Point features were, and still are, the primary source of control for photogrammetric triangulation, although other higher-order features are available and can be used. LIDAR systems supply dense geometric surface information in the form of three-dimensional coordinates with respect to a certain reference system. Considering the accuracy improvement of LIDAR systems in recent years, LIDAR data is considered a viable source of photogrammetric control. To exploit LIDAR data, new challenges are posed concerning the representation and reference system by which both the photogrammetric and LIDAR datasets are described. In this paper, registration methodologies are devised for the purpose of integrating the LIDAR data into the photogrammetric triangulation. Such registration methodologies have to deal with three issues: registration primitives, transformation parameters, and similarity measures. Two methodologies are introduced that utilize straight-line and areal features derived from both datasets as the registration primitives. The first methodology directly incorporates the LIDAR lines as control information in the photogrammetric triangulation, while in the second methodology, LIDAR patches are used to produce and align the photogrammetric model. Also, camera self-calibration experiments were conducted on simulated and real data to test the feasibility of using LIDAR patches for this purpose.