• Title/Summary/Keyword: conditional probability distribution

Generating and Validating Synthetic Training Data for Predicting Bankruptcy of Individual Businesses

  • Hong, Dong-Suk; Baik, Cheol
    • Journal of information and communication convergence engineering / v.19 no.4 / pp.228-233 / 2021
  • In this study, we analyze the credit information (loans, delinquency records, etc.) of individual business owners to generate voluminous training data for a bankruptcy prediction model through a partial synthetic training technique, and we evaluate the prediction performance of the generated data against the actual data. In the experiments (a logistic regression task), training on data generated by conditional tabular generative adversarial networks (CTGAN) improves recall by a factor of 1.75 over training on the actual data. The probability that the actual and generated data are sampled from an identical distribution is verified to be well above 80%. Providing artificial intelligence training data through data synthesis in credit rating and default-risk prediction for individual businesses, fields in which research has been relatively inactive, should promote further in-depth work on such methods.
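The abstract's distributional check (whether real and synthetic data appear to come from the same distribution) can be illustrated with a two-sample Kolmogorov-Smirnov statistic. The paper does not specify which test it used, so this stdlib-only sketch uses hypothetical Gaussian stand-ins for the real and synthetic samples:

```python
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # fraction of observations <= x, by binary search
        lo, hi = 0, len(sorted_sample)
        while lo < hi:
            mid = (lo + hi) // 2
            if sorted_sample[mid] <= x:
                lo = mid + 1
            else:
                hi = mid
        return lo / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

random.seed(0)
real = [random.gauss(0.0, 1.0) for _ in range(500)]
synthetic = [random.gauss(0.0, 1.0) for _ in range(500)]  # well-matched generator
shifted = [random.gauss(2.0, 1.0) for _ in range(500)]    # poorly-matched generator

print(ks_statistic(real, synthetic))  # small gap: distributions look alike
print(ks_statistic(real, shifted))    # large gap: distributions differ
```

A small statistic supports the claim that both samples were drawn from an identical distribution; a large one flags a mismatched generator.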

Learning Distribution Graphs Using a Neuro-Fuzzy Network for Naive Bayesian Classifier (퍼지신경망을 사용한 네이브 베이지안 분류기의 분산 그래프 학습)

  • Tian, Xue-Wei; Lim, Joon S.
    • Journal of Digital Convergence / v.11 no.11 / pp.409-414 / 2013
  • Naive Bayesian classifiers are a powerful and well-known type of classifier that can be easily induced from a dataset of sample cases. However, their strong conditional independence assumptions can sometimes lead to weak classification performance. Normally, naive Bayesian classifiers use Gaussian distributions to handle continuous attributes and to represent the likelihood of the features conditioned on the classes. The probability density of an attribute, however, is not always well fitted by a Gaussian distribution. Another prominent type of classifier is the neuro-fuzzy classifier, which can learn fuzzy rules and fuzzy sets by supervised learning. Since there are specific structural similarities between a neuro-fuzzy classifier and a naive Bayesian classifier, the purpose of this study is to apply distribution graphs learned by a neuro-fuzzy network to naive Bayesian classifiers. We compare Gaussian distribution graphs with fuzzy distribution graphs for the naive Bayesian classifier, applying both to classify the leukemia and colon DNA microarray data sets. The results demonstrate that a naive Bayesian classifier with fuzzy distribution graphs is more reliable than one with Gaussian distribution graphs.
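The Gaussian handling of continuous attributes described above (per-class Gaussian likelihoods multiplied under the independence assumption) can be sketched as a minimal naive Bayesian classifier; the two-feature data below are invented for illustration, not the authors' microarray experiment:

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian_nb(X, y):
    """Estimate a class prior and per-attribute (mean, variance) for each class."""
    params = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        stats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9  # avoid zero variance
            stats.append((mean, var))
        params[c] = (prior, stats)
    return params

def predict(params, x):
    # conditional independence: the likelihood is the product of
    # per-attribute Gaussian densities, i.e. a sum of log-densities
    def log_posterior(c):
        prior, stats = params[c]
        return math.log(prior) + sum(
            math.log(gaussian_pdf(v, m, s)) for v, (m, s) in zip(x, stats))
    return max(params, key=log_posterior)

X = [[1.0, 5.0], [1.2, 5.1], [0.9, 4.8], [3.0, 1.0], [3.2, 0.9], [2.8, 1.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 5.0]))  # 0
print(predict(model, [3.1, 1.1]))  # 1
```

The paper's contribution replaces the `gaussian_pdf` likelihood with a fuzzy distribution graph learned by a neuro-fuzzy network, leaving the rest of the classifier unchanged.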

Application of Indicator Geostatistics for Probabilistic Uncertainty and Risk Analyses of Geochemical Data (지화학 자료의 확률론적 불확실성 및 위험성 분석을 위한 지시자 지구통계학의 응용)

  • Park, No-Wook
    • Journal of the Korean earth science society / v.31 no.4 / pp.301-312 / 2010
  • Geochemical data are regarded as important environmental variables in environmental management. Since they are often sampled at sparse locations, it is important not only to predict attribute values at unsampled locations, but also to assess the uncertainty attached to those predictions for further analysis. The main objective of this paper is to exemplify how indicator geostatistics can be effectively applied to geochemical data processing, providing decision-supporting information as well as the spatial distribution of the data. A complete geostatistical analysis framework, including probabilistic uncertainty modeling, classification, and risk analysis, is illustrated through a case study of cadmium mapping. A conditional cumulative distribution function (ccdf) was first modeled by indicator kriging; e-type estimates and the conditional variance were then computed for the spatial distribution of cadmium and as quantitative uncertainty measures, respectively. Two classification criteria, probability thresholding and attribute thresholding, were applied to delineate contaminated and safe areas. Finally, additional sampling locations were extracted from the coefficient of variation, which accounts for both the conditional variance and the difference between attribute and threshold values. The indicator geostatistical framework illustrated in this study can serve as a useful tool for analyzing any environmental variable, including geochemical data, for decision-making in the presence of uncertainty.

Fatigue life prediction based on Bayesian approach to incorporate field data into probability model

  • An, Dawn; Choi, Joo-Ho; Kim, Nam H.; Pattabhiraman, Sriram
    • Structural Engineering and Mechanics / v.37 no.4 / pp.427-442 / 2011
  • In the fatigue life design of mechanical components, uncertainties arising from materials and manufacturing processes should be taken into account to ensure reliability. A common practice is to apply a safety factor in conjunction with a physics model for evaluating the lifecycle, which usually relies on the designer's experience. Due to such conservative design, predictions are often in disagreement with field observations, which makes it difficult to schedule maintenance. In this paper, a Bayesian technique, which incorporates field failure data into prior knowledge, is used to obtain a more dependable prediction of fatigue life. The effects of prior knowledge, noise in the data, and measurement bias on the distribution of fatigue life are discussed in detail. Assuming a distribution type for the fatigue life, its parameters are identified first, followed by estimation of the distribution of fatigue life, which represents the degree of belief in the fatigue life conditional on the observed data. As more data are provided, the estimates are updated and the credible interval narrows. The results can serve various needs such as risk analysis, reliability-based design optimization, maintenance scheduling, and validation of reliability analysis codes. To obtain the posterior distribution, the Markov chain Monte Carlo technique is employed, a modern computational statistical method that efficiently draws samples from a given distribution. Field data on turbine components, in the form of regular inspections counting the number of failed blades in a turbine disk, are used to illustrate the approach.
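The Bayesian updating step can be sketched with a minimal Metropolis sampler. For illustration only, a normal prior and normal likelihood are assumed for the mean fatigue life, and the field data below are made up; they are not the turbine inspection data or the distribution choices of the paper:

```python
import math
import random

def log_posterior(mu, data, prior_mean, prior_sd, noise_sd):
    """Unnormalized log posterior of the mean fatigue life mu:
    normal prior plus normal likelihood of the observed lives."""
    lp = -(mu - prior_mean) ** 2 / (2 * prior_sd ** 2)
    lp += sum(-(x - mu) ** 2 / (2 * noise_sd ** 2) for x in data)
    return lp

def metropolis(data, prior_mean=100.0, prior_sd=30.0, noise_sd=10.0,
               steps=20000, step_size=3.0, seed=1):
    random.seed(seed)
    mu, samples = prior_mean, []
    for _ in range(steps):
        cand = mu + random.gauss(0.0, step_size)
        delta = (log_posterior(cand, data, prior_mean, prior_sd, noise_sd)
                 - log_posterior(mu, data, prior_mean, prior_sd, noise_sd))
        if random.random() < math.exp(min(0.0, delta)):  # accept/reject
            mu = cand
        samples.append(mu)
    return samples[steps // 2:]  # discard the first half as burn-in

field_lives = [82.0, 90.0, 85.0, 88.0, 84.0]  # hypothetical inspection data
post = metropolis(field_lives)
post_mean = sum(post) / len(post)
print(post_mean)  # pulled from the prior mean 100 toward the data mean 85.8
```

With more observations the likelihood term dominates, the chain concentrates, and the credible interval narrows, exactly the updating behavior the abstract describes.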

A Study on Development of Median Encroachment Accident Model (중앙선침범사고 예측모델의 개발에 관한 연구)

  • 하태준; 박제진
    • Journal of Korean Society of Transportation / v.19 no.5 / pp.109-117 / 2001
  • The median encroachment accident model proposed in this paper is the first step toward developing cost-effective criteria for installing facilities that prevent traffic accidents caused by median encroachment. The model consists of the expected annual number of median encroachments on a roadway and the conditional probability of colliding with a vehicle in the opposite lane after encroachment. The expected number of encroachments is related to traffic volume and is quoted from a study by Hutchinson & Kennedy (1966). The probability of vehicle collision is composed of an assumed headway distribution of opposite-directional vehicles (negative exponential distribution), the driving time of the encroaching vehicle, and a gap and gap-acceptance model. Using the expected accident number yielded by the presented model, it is possible to calculate the benefit of reduced accidents and to analyze the cost of installing facilities. This will therefore help develop cost-effective criteria for what to install in the median.
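The collision-probability component follows directly from the negative exponential headway assumption: if opposing vehicles arrive as a Poisson stream of rate q, the chance that at least one arrives during the encroaching vehicle's exposure time t is 1 - exp(-qt). A small sketch with hypothetical volumes and exposure times (not the paper's calibrated values):

```python
import math

def collision_probability(opposing_volume_vph, encroachment_time_s):
    """P(at least one opposing vehicle arrives while the encroaching vehicle
    occupies the opposite lane), under negative exponential headways."""
    q = opposing_volume_vph / 3600.0  # arrivals per second
    return 1.0 - math.exp(-q * encroachment_time_s)

def expected_annual_accidents(encroachments_per_year, opposing_volume_vph,
                              encroachment_time_s):
    # expected encroachments times the conditional collision probability
    return encroachments_per_year * collision_probability(
        opposing_volume_vph, encroachment_time_s)

p = collision_probability(600, 3.0)  # 600 veh/h opposing, 3 s in the opposite lane
print(p)                              # about 0.39
print(expected_annual_accidents(50, 600, 3.0))
```

Multiplying by the accident cost per collision then gives the benefit side of the cost-effectiveness comparison for a candidate median barrier.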

Teaching Statistics through World Cup Soccer Examples (월드컵 축구 예제를 통한 통계교육)

  • Kim, Hyuk-Joo; Kim, Young-Il
    • The Korean Journal of Applied Statistics / v.23 no.6 / pp.1201-1208 / 2010
  • In teaching probability and statistics classes, we should increase efforts to develop examples that improve teaching methodology and deliver more meaningful knowledge to students. Sports is one field that provides a variety of such examples, and World Cup Soccer events are a treasure house of interesting problems. Teaching with examples from this field is an effective way to raise students' interest in probability and statistics, because the World Cup is a matter of national interest. In this paper, we suggest several examples pertaining to counting cases and computing probabilities. These examples concern issues such as possible scenarios in the preliminary round, the points each participant needs to advance to the second round, and the grouping of teams. Based on a simulation using a statistical model, we propose a logical method for computing each participant's probabilities of reaching the second round and of winning the championship in the 2010 South Africa World Cup.
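A statistical-model simulation of the kind the abstract mentions can be sketched as a Monte Carlo round-robin for one group. The win probabilities below are invented for illustration; they are not estimates for any actual 2010 World Cup group, and the draw model is a simplifying assumption:

```python
import random

def simulate_group(win_probs, trials=20000, seed=7):
    """Monte Carlo estimate of each team's probability of finishing in the
    top two of a 4-team round-robin group.  Only the upper triangle of
    win_probs is used: win_probs[i][j] (i < j) is the chance team i beats
    team j; the remainder is split evenly between draw and loss."""
    random.seed(seed)
    n = len(win_probs)
    advance = [0] * n
    for _ in range(trials):
        points = [0] * n
        for i in range(n):
            for j in range(i + 1, n):
                p_win = win_probs[i][j]
                p_draw = (1 - p_win) / 2
                u = random.random()
                if u < p_win:
                    points[i] += 3          # win: 3 points
                elif u < p_win + p_draw:
                    points[i] += 1          # draw: 1 point each
                    points[j] += 1
                else:
                    points[j] += 3
        # sort by points, breaking ties at random
        order = sorted(range(n), key=lambda t: (points[t], random.random()),
                       reverse=True)
        for t in order[:2]:
            advance[t] += 1
    return [a / trials for a in advance]

# Hypothetical strengths: team 0 strongest, team 3 weakest.
W = [[0.0, 0.6, 0.7, 0.8],
     [0.0, 0.0, 0.5, 0.6],
     [0.0, 0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0, 0.0]]
probs = simulate_group(W)
print(probs)  # team 0 advances most often
```

Chaining the same idea through the knockout bracket yields championship probabilities for each participant.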

A probabilistic information retrieval model by document ranking using term dependencies (용어간 종속성을 이용한 문서 순위 매기기에 의한 확률적 정보 검색)

  • You, Hyun-Jo; Lee, Jung-Jin
    • The Korean Journal of Applied Statistics / v.32 no.5 / pp.763-782 / 2019
  • This paper proposes a probabilistic document ranking model incorporating term dependencies. Document ranking is a fundamental information retrieval task: sorting the documents in a collection according to their relevance to the user query (Qin et al., Information Retrieval Journal, 13, 346-374, 2010). A probabilistic model computes the conditional probability of each document's relevance given the query. Most widely used models assume term independence, because computing the joint probabilities of multiple terms is challenging; yet words in natural language texts are highly correlated. In this paper, we assume a multinomial distribution model that calculates the relevance probability of a document by considering the dependency structure of words, and propose an information retrieval model that ranks documents by estimating this probability with the maximum entropy method. Ranking simulation experiments in various multinomial situations show better retrieval results than a model that assumes word independence, as do document ranking experiments on the real-world LETOR OHSUMED dataset.
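The term-independence baseline the paper improves on can be sketched as a query-likelihood multinomial model with add-one smoothing (a standard formulation, not the paper's dependency model; the toy documents below are not from LETOR OHSUMED):

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, vocab_size, alpha=1.0):
    """log P(query | document) under a multinomial unigram model with
    add-alpha smoothing -- each query term contributes independently."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return sum(math.log((counts[t] + alpha) / (total + alpha * vocab_size))
               for t in query_terms)

docs = {
    "d1": "heart attack risk factors heart".split(),
    "d2": "stroke recovery exercise".split(),
    "d3": "heart surgery recovery".split(),
}
vocab = {w for d in docs.values() for w in d}
query = "heart recovery".split()

# rank documents by descending query likelihood
ranking = sorted(docs,
                 key=lambda d: query_log_likelihood(query, docs[d], len(vocab)),
                 reverse=True)
print(ranking)  # d3 first: it matches both query terms
```

The paper replaces the independent per-term product with joint probabilities estimated by the maximum entropy method, so that correlated terms such as "heart" and "surgery" are no longer scored as if unrelated.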

Estimation of drought risk through the bivariate drought frequency analysis using copula functions (코플라 함수를 활용한 이변량 가뭄빈도해석을 통한 우리나라 가뭄 위험도 산정)

  • Yu, Ji Soo; Yoo, Ji Young; Lee, Joo-Heon; Kim, Tea-Woong
    • Journal of Korea Water Resources Association / v.49 no.3 / pp.217-225 / 2016
  • Drought is generally characterized by duration and severity, so bivariate frequency analysis that considers both simultaneously is required. However, since a bivariate joint probability distribution function (JPDF) occupies a 3-dimensional space, the results are difficult to interpret in practice. As a technical solution, this study employed copula functions to estimate a JPDF, then developed conditional JPDFs for various drought durations and estimated the critical severity corresponding to a given non-exceedance probability. Based on historical severe drought events, hydrologic risks were investigated for various extreme droughts at a 95% non-exceedance probability. For drought events of 10-month duration, the most hazardous areas were determined to be Gwangju, Inje, and Uljin, whose drought occurrence probabilities are 1.3-2.0 times the national average. In addition, southern regions were observed to be much more drought-prone than the northern and central areas.
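The abstract does not state which copula family was fitted, so as an illustration the sketch below uses a Clayton copula: it computes the conditional non-exceedance probability of severity given duration, then backs out the marginal quantile of the 95% critical severity by bisection. The parameter values are hypothetical:

```python
def clayton_joint(u, v, theta):
    """Clayton copula C(u, v): joint non-exceedance probability of the two
    marginal probabilities u and v (theta > 0 gives positive dependence)."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def clayton_conditional(v, u, theta):
    """P(V <= v | U = u): the partial derivative of C(u, v) with respect to u."""
    return u ** (-theta - 1.0) * (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta - 1.0)

theta = 2.0  # hypothetical dependence between drought duration and severity
u = 0.9      # marginal non-exceedance probability of the given drought duration

# find the severity quantile v whose conditional non-exceedance probability is 95%
lo, hi = 1e-6, 1.0 - 1e-6
for _ in range(60):  # bisection on the monotone conditional cdf
    mid = (lo + hi) / 2
    if clayton_conditional(mid, u, theta) < 0.95:
        lo = mid
    else:
        hi = mid
critical_v = (lo + hi) / 2
print(critical_v)  # marginal quantile of the 95% conditional critical severity
```

Mapping `critical_v` back through the fitted severity marginal gives the critical severity in physical units; the ratio of local to national exceedance probabilities then gives the risk multipliers quoted in the abstract.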

A Study of Anomaly Detection for ICT Infrastructure using Conditional Multimodal Autoencoder (ICT 인프라 이상탐지를 위한 조건부 멀티모달 오토인코더에 관한 연구)

  • Shin, Byungjin; Lee, Jonghoon; Han, Sangjin; Park, Choong-Shik
    • Journal of Intelligence and Information Systems / v.27 no.3 / pp.57-73 / 2021
  • Maintaining ICT infrastructure and preventing failures through anomaly detection is becoming important. System monitoring data are multidimensional time series, which makes it difficult to account simultaneously for the characteristics of multidimensional data and those of time series data. With multidimensional data, correlations between variables must be considered, and existing probabilistic, linear, and distance-based methods degrade under the curse of dimensionality. Time series data, in turn, are typically preprocessed with sliding windows and time series decomposition for autocorrelation analysis, techniques that further increase the dimensionality of the data and therefore need to be supplemented. Anomaly detection is a long-standing research field: statistical methods and regression analysis were used early on, and there are now active efforts to apply machine learning and artificial neural networks. Statistically based methods are difficult to apply to non-homogeneous data and do not detect local outliers well. Regression-based methods learn a regression formula from parametric statistics and flag anomalies by comparing predicted and actual values, but their performance drops when the model is weak or the data contain noise or outliers, and they require clean training data. The autoencoder, an artificial neural network trained to reproduce its input as closely as possible, has many advantages over probabilistic and linear models, cluster analysis, and supervised learning: it can be applied to data that satisfy neither distributional nor linearity assumptions, and it learns without labeled data. However, it still has limited ability to identify local outliers in multidimensional data, and the characteristics of time series data greatly increase the dimensionality. In this study, we propose a Conditional Multimodal Autoencoder (CMAE) that improves anomaly detection performance by considering local outliers and time series characteristics. First, a Multimodal Autoencoder (MAE) is applied to mitigate the local outlier limitation of multidimensional data; multimodal networks are commonly used to learn different types of inputs, such as voice and images, and the different modals share the autoencoder's bottleneck and thereby learn correlations. In addition, a Conditional Autoencoder (CAE) is used to learn the characteristics of time series data effectively without increasing the dimensionality; conditional inputs are usually categorical variables, but in this study time is used as the condition to learn periodicity. The proposed CMAE model was verified by comparison with a Unimodal Autoencoder (UAE) and an MAE. The reconstruction performance of all three autoencoders was checked over 41 variables; it varies by variable, but reconstruction generally works well, with small loss values for the memory, disk, and network modals in all three models. The process modal showed no significant difference across the models, while the CPU modal performed best in the CMAE. ROC curves were prepared to evaluate anomaly detection performance, and AUC, accuracy, precision, recall, and F1-score were compared; on all indicators the performance ranked CMAE, MAE, then UAE. In particular, the recall of the CMAE was 0.9828, confirming that it detects almost all anomalies; its accuracy improved to 87.12% and its F1-score to 0.8883, which is considered suitable for anomaly detection. In practical terms, the proposed model has advantages beyond raw performance: techniques such as time series decomposition and sliding windows add procedures that must be managed, and the dimensional increase they cause can slow inference, so the proposed model is easier to apply in practice with respect to inference speed and model management.
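The dimensionality blow-up from sliding windows that motivates the conditional design can be seen in a few lines (the monitoring values are hypothetical):

```python
def sliding_windows(series, width):
    """Flatten a multivariate time series into overlapping windows.
    Each window concatenates `width` consecutive observations, so a
    d-dimensional series yields width*d-dimensional training vectors --
    the dimensional increase the conditional (time) input is meant to avoid."""
    return [sum(series[i:i + width], []) for i in range(len(series) - width + 1)]

# 3 metrics (e.g. CPU, memory, disk) observed at 5 time steps
series = [[0.2, 0.5, 0.1],
          [0.3, 0.5, 0.1],
          [0.2, 0.6, 0.2],
          [0.9, 0.5, 0.1],  # a local spike in the first metric
          [0.2, 0.5, 0.1]]
windows = sliding_windows(series, width=3)
print(len(windows), len(windows[0]))  # 3 windows, each 9-dimensional
```

Feeding the timestamp as a conditional input instead lets the autoencoder learn periodicity while each training vector stays 3-dimensional.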

Analysis of CRLB Performances with CAF under Multiple Emitters (CAF 이용 다중 발기하에서의 CRLB 성능 분석)

  • Lee, Young-kyu; Yang, Sung-hoon; Lee, Chang-bok; Park, Young-Mi; Lee, Moon-Seok
    • Journal of Institute of Control, Robotics and Systems / v.21 no.6 / pp.589-594 / 2015
  • In this paper, we describe the Cramer-Rao Lower Bound (CRLB) performance of Time Difference of Arrival (TDOA) and Frequency Difference of Arrival (FDOA) methods when there are multiple emitters. The TDOA and FDOA values between two receivers can be estimated simultaneously using the so-called Complex Ambiguity Function (CAF). With multiple emitters, Inter-Symbol Interference (ISI) is present in the measurement data, so it is necessary to reduce its effect and to provide a method for evaluating the performance of TDOA and FDOA estimation. To eliminate the ISI, applying a filter bank before calculating the CAF is proposed when the emitters' carrier frequencies differ from one another; Angle of Arrival (AOA) or Received Signal Strength (RSS) methods before calculating the CAF were proposed for the case of identical carrier frequencies. To evaluate the CRLB of the TDOA and FDOA estimates, we employed the conditional probability distribution method and describe numerical comparison results.
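The joint TDOA/FDOA estimate via the CAF can be sketched as a grid search over discrete delay and frequency cells for a single emitter (a textbook discrete form, not the paper's multi-emitter setup or its CRLB derivation):

```python
import cmath
import random

def caf(s1, s2, delay, freq_bin):
    """Magnitude of the complex ambiguity function at one (delay, frequency)
    cell: correlate s1 against s2 circularly shifted by `delay` samples and
    frequency-shifted by `freq_bin` DFT bins."""
    n = len(s1)
    return abs(sum(s1[k] * s2[(k + delay) % n].conjugate()
                   * cmath.exp(2j * cmath.pi * freq_bin * k / n)
                   for k in range(n)))

random.seed(3)
N = 32
true_delay, true_freq = 5, 3
# a noise-like emitted waveform as seen at receiver 1
s1 = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
# receiver 2 sees the same waveform delayed (TDOA) and Doppler-shifted (FDOA)
s2 = [s1[(k - true_delay) % N] * cmath.exp(2j * cmath.pi * true_freq * k / N)
      for k in range(N)]

# exhaustive search over the delay-frequency grid for the CAF peak
best = max(((d, f) for d in range(N) for f in range(N)),
           key=lambda cell: caf(s1, s2, cell[0], cell[1]))
print(best)  # the CAF peaks at the true (delay, frequency) cell
```

With a second emitter present, the cross terms between signals produce the interference the paper suppresses with filter banks (distinct carriers) or AOA/RSS preprocessing (identical carriers) before this peak search.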