• Title/Summary/Keyword: Bayes method

Identifying Copy Number Variants under Selection in Geographically Structured Populations Based on F-statistics

  • Song, Hae-Hiang;Hu, Hae-Jin;Seok, In-Hae;Chung, Yeun-Jun
    • Genomics & Informatics / v.10 no.2 / pp.81-87 / 2012
  • Large-scale copy number variants (CNVs) in the human genome provide the raw material for delineating population differences, as natural selection may have affected at least some of the CNVs discovered thus far. Although relatively large samples from specific ethnic groups have recently begun to be examined for inter-ethnic differences in CNVs, particular instances of natural selection have not yet been identified or understood. The traditional $F_{ST}$ measure, obtained from differences in allele frequencies between populations, has been used to identify CNV loci subject to geographically varying selection. Here, we review advances in, and the application of, multinomial-Dirichlet likelihood methods of inference for identifying genome regions that have been subject to natural selection using $F_{ST}$ estimates. The content presented is not new; however, this review clarifies how these methods can be applied to CNV data, an application that remains largely unexplored. A hierarchical Bayesian method, implemented via Markov chain Monte Carlo, estimates locus-specific $F_{ST}$ and can identify outlying CNV loci with large values of $F_{ST}$. By applying this Bayesian method to publicly available CNV data, we identified CNV loci that show signals of natural selection, which may elucidate the genetic basis of human disease and diversity.
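
As a rough illustration of the traditional $F_{ST}$ measure mentioned in this abstract, the sketch below computes a Wright-style $F_{ST}$ for a single biallelic locus as the variance of allele frequencies across populations divided by $\bar{p}(1-\bar{p})$. It is not the hierarchical Bayesian multinomial-Dirichlet method the review describes. The function name `fst_biallelic`, the equal weighting of populations, and the example frequencies are all assumptions made for illustration.

```python
def fst_biallelic(freqs):
    """Wright-style F_ST for one biallelic locus.

    freqs: frequency of the same allele in each population.
    Equal population weights are assumed for simplicity.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two populations")
    p_bar = sum(freqs) / len(freqs)
    if p_bar in (0.0, 1.0):
        return 0.0  # monomorphic locus: no differentiation to measure
    # Population variance of the allele frequency across populations
    var_p = sum((p - p_bar) ** 2 for p in freqs) / len(freqs)
    return var_p / (p_bar * (1.0 - p_bar))


# Hypothetical allele frequencies in three populations
similar = fst_biallelic([0.50, 0.52, 0.48])    # nearly identical populations
divergent = fst_biallelic([0.10, 0.50, 0.90])  # strongly differentiated locus
```

A locus whose value stands far above the genome-wide distribution would be a candidate outlier; the Bayesian approach in the abstract estimates locus-specific values with uncertainty rather than relying on a point estimate like this one.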

Effect of missing values in detecting differentially expressed genes in a cDNA microarray experiment

  • Kim, Byung-Soo;Rha, Sun-Young
    • Bioinformatics and Biosystems / v.1 no.1 / pp.67-72 / 2006
  • The aim of this paper is to discuss the effect of missing values in detecting differentially expressed genes in a cDNA microarray experiment in the context of a one-sample problem. We conducted a cDNA microarray experiment to detect differentially expressed genes for the metastasis of colorectal cancer based on twenty patients who underwent liver resection due to liver metastasis from colorectal cancer. Total RNAs from metastatic liver tumor and adjacent normal liver tissue from a single patient were labeled with Cy5 and Cy3, respectively, and competitively hybridized to a cDNA microarray with 7775 human genes. We used $M=\log_2(R/G)$ for the signal evaluation, where R and G denote the fluorescent intensities of the Cy5 and Cy3 dyes, respectively. The statistical problem comprises a one-sample test of E(M)=0 for each gene and involves multiple tests. The data from the twenty cDNA microarrays would form a matrix of dimension 7775 by 20 if there were no missing values. However, missing values occur for various reasons. For each gene, the non-missing proportion (NMP) was defined as the proportion of non-missing values out of twenty. In detecting differentially expressed (DE) genes, we used the genes whose NMP is greater than or equal to 0.4 and then sequentially increased the NMP threshold by 0.1 to investigate its effect on the detection of DE genes. For each fixed NMP, we imputed the missing values with the K-nearest neighbor method (K=10) and applied the nonparametric t-test of Dudoit et al. (2002), SAM by Tusher et al. (2001), and the empirical Bayes procedure by Lönnstedt and Speed (2002) to determine the effect of missing values on the final outcome. These three procedures yielded substantially concordant results in detecting DE genes. Of these three procedures, we used SAM to explore the acceptable NMP level. The optimum NMP for this data set turned out to be 80%. It is more desirable to find the optimum level of NMP for each data set by applying the method described in this note, when the plot of (NMP, number of overlapping genes) shows a turning point.
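
The two quantities defined in this abstract, the log ratio $M=\log_2(R/G)$ and the non-missing proportion (NMP), can be sketched as follows. This is only a minimal illustration under stated assumptions: missing intensities are represented as `None`, non-positive intensities are treated as missing, and the function names and toy data are hypothetical. The imputation and testing steps (K-nearest neighbor, SAM, empirical Bayes) are not reproduced.

```python
import math


def log_ratio(r, g):
    """M = log2(R/G); None when either intensity is missing or non-positive."""
    if r is None or g is None or r <= 0 or g <= 0:
        return None
    return math.log2(r / g)


def non_missing_proportion(m_values):
    """Fraction of arrays with an observed M value for one gene."""
    return sum(m is not None for m in m_values) / len(m_values)


def filter_genes(m_by_gene, min_nmp):
    """Keep only genes whose NMP is at least min_nmp."""
    return {gene: ms for gene, ms in m_by_gene.items()
            if non_missing_proportion(ms) >= min_nmp}


# Hypothetical toy data: (R, G) intensity pairs for two genes on five arrays
raw = {
    "geneA": [(200, 100), (400, 100), (None, 50), (80, 80), (160, 40)],
    "geneB": [(None, 10), (None, 20), (30, 30), (None, 5), (None, 8)],
}
m_by_gene = {gene: [log_ratio(r, g) for r, g in pairs]
             for gene, pairs in raw.items()}

# Start at the lowest threshold used in the abstract (NMP >= 0.4)
kept = filter_genes(m_by_gene, min_nmp=0.4)
```

Repeating the filter with thresholds 0.5, 0.6, and so on mirrors the sequential increase of NMP described above.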

Network based Anomaly Intrusion Detection using Bayesian Network Techniques (네트워크 서비스별 이상 탐지를 위한 베이지안 네트워크 기법의 정상 행위 프로파일링)

  • Cha ByungRae;Park KyoungWoo;Seo JaeHyun
    • Journal of Internet Computing and Services / v.6 no.1 / pp.27-38 / 2005
  • Recently, the rapid development of computing environments and the spread of the Internet have made it easy to obtain and use information. At the same time, as an adverse side effect, hackers' unlawful intrusions and threats to network environments have increased over time. In particular, the Internet, which is built on Unix and TCP/IP, has many vulnerabilities. Security techniques such as authentication and access control cannot adequately solve these security problems, so intrusion detection systems (IDS) were developed as a second line of defense. In this paper, an intrusion detection method using Bayesian networks estimates the probability values of behavior contexts based on Bayes' theorem. The contexts of behaviors or events are represented graphically as Bayesian networks. We concisely profiled normal behaviors using behavior contexts, and the method is able to detect new or modified intrusions. We ran simulations using the DARPA 2000 intrusion data.

Conditional Probability Based Early Termination of Recursive Coding Unit Structures in HEVC (HEVC의 재귀적 CU 구조에 대한 조건부 확률 기반 고속 탐색 알고리즘)

  • Han, Woo-Jin
    • Journal of Broadcast Engineering / v.17 no.2 / pp.354-362 / 2012
  • High Efficiency Video Coding (HEVC) is currently being developed jointly by MPEG and ITU-T as the next international video coding standard. Compared to previous standards, HEVC supports a variety of splitting units, such as the coding unit (CU), prediction unit (PU), and transform unit (TU). Among them, the recursive quadtree structure of the CU is known to improve coding efficiency while significantly increasing encoding complexity. In this paper, a simple conditional probability for predicting the early termination condition of the recursive unit structure is introduced. The proposed conditional probability is estimated with Bayes' formula from local statistics of rate-distortion costs in the encoder. Experimental results show that the proposed method can reduce the total encoding time by about 32%, depending on the test configuration, with a coding efficiency loss of 0.4%-0.5%. In addition, the encoding time can be reduced by 50% with a 0.9% coding efficiency loss when the proposed method is used jointly with the HM4.0 early CU termination algorithm.
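
To make the idea of a Bayes-formula-based early termination test concrete, the sketch below estimates P(no split | cost below threshold) from counts over previously coded CUs. The abstract does not give the exact event definitions, so everything specific here is an assumption: the "low cost" event, the threshold, the confidence level `min_confidence`, and the function names are hypothetical and are not the paper's algorithm.

```python
def posterior_no_split(history, threshold):
    """Estimate P(no split | cost < threshold) with Bayes' formula.

    history: list of (rd_cost, was_split) pairs from already-coded CUs.
    Returns None when the counts are too sparse to form an estimate.
    """
    n = len(history)
    no_split_costs = [cost for cost, was_split in history if not was_split]
    low_all = sum(1 for cost, _ in history if cost < threshold)
    if n == 0 or not no_split_costs or low_all == 0:
        return None
    prior = len(no_split_costs) / n                       # P(no split)
    low_no_split = sum(1 for cost in no_split_costs if cost < threshold)
    likelihood = low_no_split / len(no_split_costs)       # P(low cost | no split)
    evidence = low_all / n                                # P(low cost)
    return likelihood * prior / evidence


def terminate_early(history, rd_cost, threshold, min_confidence=0.9):
    """Skip further CU splitting only when the posterior is high enough."""
    if rd_cost >= threshold:
        return False
    p = posterior_no_split(history, threshold)
    return p is not None and p >= min_confidence


# Hypothetical local statistics: (rate-distortion cost, CU was split?)
history = [(10, False), (12, False), (50, True),
           (11, False), (60, True), (9, True)]
p = posterior_no_split(history, threshold=20)
```

Raising `min_confidence` trades encoding-time savings for a smaller coding efficiency loss, which is the same trade-off the reported results quantify.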

A Comparison of Bayesian and Maximum Likelihood Estimations in a SUR Tobit Regression Model (SUR 토빗회귀모형에서 베이지안 추정과 최대가능도 추정의 비교)

  • Lee, Seung-Chun;Choi, Byongsu
    • The Korean Journal of Applied Statistics / v.27 no.6 / pp.991-1002 / 2014
  • Both Bayesian and maximum likelihood methods are efficient for the estimation of regression coefficients of various Tobit regression models (see, e.g., Chib, 1992; Greene, 1990; Lee and Choi, 2013); however, some researchers have recognized that the maximum likelihood method tends to underestimate the disturbance variance, which has implications for the estimation of marginal effects and the asymptotic standard errors of estimates. The underestimation by the maximum likelihood estimate in a seemingly unrelated Tobit regression model is examined. A Bayesian method based on an objective noninformative prior is shown to provide proper estimates of the disturbance variance as well as the other regression parameters.

Method of Analyzing Important Variables using Machine Learning-based Golf Putting Direction Prediction Model (머신러닝 기반 골프 퍼팅 방향 예측 모델을 활용한 중요 변수 분석 방법론)

  • Kim, Yeon Ho;Cho, Seung Hyun;Jung, Hae Ryun;Lee, Ki Kwang
    • Korean Journal of Applied Biomechanics / v.32 no.1 / pp.1-8 / 2022
  • Objective: This study proposes a methodology for analyzing the variables that have a significant impact on putting direction prediction, using a machine learning-based putting direction prediction model trained with IMU sensor data. Method: Putting data were collected using an IMU sensor measuring 12 variables from 6 adult males in their 20s at K University who had no golf experience. The data were preprocessed for use in machine learning, and models were built using five machine learning algorithms. The performance of the resulting models was then compared, the best-performing model was selected as the proposed model, and the 12 IMU sensor variables were applied one by one to analyze which variables affect the learning performance. Results: In the comparison of the five machine learning algorithms (K-NN, Naive Bayes, Decision Tree, Random Forest, and LightGBM), the LightGBM-based model had higher prediction accuracy than the others. Using the LightGBM algorithm, an experiment was performed to rank the importance of the variables that affect the model's direction prediction. Conclusion: Among the five machine learning algorithms, LightGBM best predicted the putting direction. The variable with the greatest influence on the model's prediction of putting direction was the left-right inclination (Roll).

A Method for Prediction of Quality Defects in Manufacturing Using Natural Language Processing and Machine Learning (자연어 처리 및 기계학습을 활용한 제조업 현장의 품질 불량 예측 방법론)

  • Roh, Jeong-Min;Kim, Yongsung
    • Journal of Platform Technology / v.9 no.3 / pp.52-62 / 2021
  • Quality control is critical at manufacturing sites and is key to predicting the risk of quality defects before manufacturing. However, the reliability of manual quality control methods is affected by human and physical limitations because manufacturing processes vary across industries. These limitations become particularly obvious in domains with numerous manufacturing processes, such as the manufacture of major nuclear equipment. This study proposes a novel method for predicting the risk of quality defects using natural language processing and machine learning. Production data collected over 6 years at a factory that manufactures main equipment installed in nuclear power plants were used. In the text preprocessing stage, a mapping method was applied to the word dictionary so that domain knowledge could be appropriately reflected, and a hybrid algorithm combining n-grams, Term Frequency-Inverse Document Frequency, and Singular Value Decomposition was constructed for sentence vectorization. Next, in the experiment to classify the risky processes that result in poor quality, k-fold cross-validation was applied to cases ranging from unigrams to cumulative trigrams. Furthermore, to obtain objective experimental results, Naive Bayes and Support Vector Machine were used as classification algorithms, and a maximum accuracy of 0.7685 and a maximum F1-score of 0.8641 were achieved. Thus, the proposed method is effective. The performance of the proposed method was also compared with the votes of field engineers, and the results revealed that the proposed method outperformed the field engineers. Thus, the method can be implemented for quality control at manufacturing sites.
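
The vectorization step described in this abstract combines n-grams with TF-IDF weighting. The sketch below shows cumulative word n-grams (unigrams up to trigrams) and a basic TF-IDF weight in plain Python. It assumes tf = count / document length and idf = log(N / df); the abstract does not state which TF-IDF variant was used. The function names and sample sentences are hypothetical, and the dictionary mapping and Singular Value Decomposition steps are omitted.

```python
import math
from collections import Counter


def cumulative_ngrams(tokens, max_n):
    """All word n-grams for n = 1..max_n (max_n=3: unigrams, bigrams, trigrams)."""
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return grams


def tfidf(docs, max_n=3):
    """Return one {ngram: weight} dict per document.

    Assumed weighting: tf = count / number of n-grams in the document,
    idf = log(N / df). Other variants exist.
    """
    grams_per_doc = [cumulative_ngrams(d.lower().split(), max_n) for d in docs]
    # Document frequency: in how many documents each n-gram appears
    df = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(docs)
    vectors = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = len(grams)
        vectors.append({g: (c / total) * math.log(n_docs / df[g])
                        for g, c in counts.items()})
    return vectors


# Hypothetical process-record sentences
docs = ["weld crack found", "weld inspection passed"]
vectors = tfidf(docs)
```

Terms shared by every document (here "weld") receive zero weight, so the classifier sees only the n-grams that distinguish one record from another.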

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.23-45 / 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales, and social networking services (SNS), and dataset characteristics are correspondingly diverse. To secure competitiveness, companies need to improve their decision-making capacity using classification algorithms. However, most of them do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm suits the characteristics of a dataset has been a task requiring expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features that reflect the characteristics of multi-class data. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among these, we included the Herfindahl-Hirschman Index (HHI), originally a measure of market concentration, to replace the imbalance ratio (IR). We also developed a new index called the Reverse ReLU Silhouette Score and added it to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), and Contraceptive Method Choice) were selected. The classes of each dataset were predicted using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross-validation; 10% to 100% oversampling was applied to each fold, and the meta-features of the dataset were measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. F1-score was selected as the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The HHI meta-feature proposed in this study was significant for classification performance. (2) The number of variables has a significant effect on classification performance and, unlike the number of classes, its effect is positive. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at the 0.01 significance level. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by classification algorithm were consistent. In the regression analysis by classification algorithm, the number of variables did not have a significant effect for the Naïve Bayes algorithm, unlike for the other classification algorithms. This study makes two theoretical contributions: (1) two new meta-features (HHI and Reverse ReLU Silhouette Score) were shown to be significant, and (2) the effects of data characteristics on classification performance were investigated using meta-features. The practical contributions are as follows. (1) The results can be utilized in developing a system that recommends classification algorithms according to the characteristics of datasets. (2) Because data characteristics differ, many data scientists search for the optimal algorithm for a given situation by repeatedly adjusting algorithm parameters. In this process, resources such as hardware, cost, time, and manpower are wasted excessively. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. The study consists of an introduction, related research, research model, experiment, and conclusion and discussion.
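
For readers unfamiliar with the Herfindahl-Hirschman Index as a class-imbalance meta-feature, the sketch below computes it as the sum of squared class shares, which is the standard definition of HHI applied to class labels. The function name `hhi` and the example label lists are hypothetical; the study's other meta-features (such as the Reverse ReLU Silhouette Score) are not reproduced because the abstract does not define them.

```python
from collections import Counter


def hhi(labels):
    """Herfindahl-Hirschman Index of class shares: sum of squared proportions.

    Ranges from 1/k (k perfectly balanced classes) to 1.0 (a single class).
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) ** 2 for c in counts.values())


# Hypothetical four-class datasets with 100 samples each
balanced = hhi(["a", "b", "c", "d"] * 25)
imbalanced = hhi(["a"] * 85 + ["b"] * 5 + ["c"] * 5 + ["d"] * 5)
```

Unlike an imbalance ratio based only on the largest and smallest classes, this index reflects the shares of all classes at once, which is one plausible reason to prefer it for multi-class data.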

A Machine Learning Approach to Web Image Classification (기계학습 기반의 웹 이미지 분류)

  • Cho, Soo-Sun;Lee, Dong-Woo;Han, Dong-Won;Hwang, Chi-Jung
    • The KIPS Transactions:PartB / v.9B no.6 / pp.759-764 / 2002
  • Although images play an important role in Web documents, there has not been much research on analyzing and understanding them. Many Web images carry important information, but others do not. In this paper, we classify Web images from currently served Web sites into erasable and non-erasable classes based on machine learning methods. For this research, we identified 16 special and rich features of Web images and ran experiments using Bayesian and decision tree methods. As a result, F-measures of 87.09% and 82.72%, respectively, were achieved for the two methods. In particular, experiments comparing the effects of feature groups showed that the features added in this study are very useful for Web image classification.

Multiple Linkage Disequilibrium Mapping Methods to Validate Additive Quantitative Trait Loci in Korean Native Cattle (Hanwoo)

  • Li, Yi;Kim, Jong-Joo
    • Asian-Australasian Journal of Animal Sciences / v.28 no.7 / pp.926-935 / 2015
  • The efficiency of genome-wide association analysis (GWAS) depends on the power to detect quantitative trait loci (QTL) and the precision of QTL mapping. In this study, three different GWAS strategies were applied to detect QTL for carcass quality traits in the Korean cattle breed Hanwoo: a linkage disequilibrium single locus regression method (LDRM), a combined linkage and linkage disequilibrium analysis (LDLA), and a BayesC$\pi$ approach. The phenotypes of 486 steers were collected for weaning weight (WWT), yearling weight (YWT), carcass weight (CWT), backfat thickness (BFT), longissimus dorsi muscle area, and marbling score (Marb). The steers and their sires were also genotyped with Illumina bovine 50K single nucleotide polymorphism (SNP) chips. For the two former GWAS methods, threshold values were set at a false discovery rate <0.01 on a chromosome-wide level, while for the latter model a cut-off was set such that the top five windows, each comprising 10 adjacent SNPs, were chosen as having significant variation for the phenotype. Four major additive QTL showed high concordance across the three methods: at 64.1 to 64.9 Mb on Bos taurus autosome (BTA) 7 for WWT, 24.3 to 25.4 Mb on BTA14 for CWT, 0.5 to 1.5 Mb on BTA6 for BFT, and 26.3 to 33.4 Mb on BTA29 for BFT. Several candidate genes (i.e., glutamate receptor, ionotropic, AMPA 1 [GRIA1]; family with sequence similarity 110, member B [FAM110B]; and thymocyte selection-associated high mobility group box [TOX]) were identified close to these QTL. Our results suggest that the use of different linkage disequilibrium mapping approaches can provide more reliable chromosome regions in which to further pinpoint DNA markers or causative genes.