• Title/Summary/Keyword: data set

Search Results: 10,939

Development of Personal-Credit Evaluation System Using Real-Time Neural Learning Mechanism

  • Park, Jong U.;Park, Hong Y.;Yoon Chung
    • The Journal of Information Technology and Database
    • /
    • v.2 no.2
    • /
    • pp.71-85
    • /
    • 1995
  • Many research results published by neural network researchers have claimed that the classification accuracy of neural networks is superior to, or at least equal to, that of conventional methods. However, in a series of neural network classification experiments, it was found that classification accuracy strongly depends on the characteristics of the training data set. Although many reports note that classification accuracy varies with the composition and architecture of the network, the training algorithm, and the test data set, very little research has addressed classification accuracy when the basic assumption of data monotonicity is violated. This paper describes a development project for an automated credit evaluation system. The finding was that arranging the training data so as to maintain the monotonicity of the data set is critical to successful neural network training and to enhancing classification accuracy.
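
The monotonicity assumption mentioned in the abstract can be screened for before training: if one applicant is at least as good as another on every input feature yet carries a worse credit label, the pair violates monotonicity. A minimal sketch of such a screen (our own illustration, not the authors' system; the feature orientation and toy data are assumptions):

```python
# A minimal sketch of screening a credit-scoring training set for monotonicity
# violations before neural network training. Features are assumed to be oriented
# so that higher values mean better credit quality.
import numpy as np

def monotonicity_violations(X, y):
    """Return index pairs (i, j) where sample i dominates sample j on every
    feature but receives a strictly worse credit label."""
    violations = []
    n = len(y)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(X[i] >= X[j]) and y[i] < y[j]:
                violations.append((i, j))
    return violations

# Hypothetical toy data: columns = income, years employed; label 1 = good credit.
X = np.array([[50, 10], [60, 12], [40, 3]])
y = np.array([1, 0, 0])   # sample 1 dominates sample 0 yet is labeled worse
print(monotonicity_violations(X, y))   # -> [(1, 0)]
```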


Optimization of Fuzzy Set-based Fuzzy Inference Systems Based on Evolutionary Data Granulation (진화론적 데이터 입자에 기반한 퍼지 집합 기반 퍼지 추론 시스템의 최적화)

  • Park, Keon-Jun;Lee, Bong-Yoon;Oh, Sung-Kwun
    • Proceedings of the KIEE Conference
    • /
    • 2004.11c
    • /
    • pp.343-345
    • /
    • 2004
  • We propose a new category of fuzzy set-based fuzzy inference systems based on data granulation related to fuzzy space division for each variable. Data granules are viewed as linked collections of objects (data, in particular) drawn together by criteria of proximity, similarity, or functionality. Granulating the data with the Hard C-Means (HCM) clustering algorithm helps determine the initial parameters of the fuzzy model, such as the initial apexes of the membership functions and the initial values of the polynomial functions used in the premise and consequent parts of the fuzzy rules. The initial parameters are then tuned effectively with genetic algorithms (GAs) and the least squares method. A numerical example is included to evaluate the performance of the proposed model.
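
As a hedged illustration of the initialization step described above (not the paper's implementation), the centers found by hard c-means along one input variable can serve as initial apexes of triangular membership functions. Here sklearn's KMeans stands in for HCM and the data are synthetic:

```python
# A rough sketch of using hard c-means clustering to pick initial apexes of
# fuzzy membership functions for one input variable of a fuzzy model.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=(200, 1))      # synthetic samples of one input variable

n_rules = 3                                     # assumed number of fuzzy sets per variable
apexes = np.sort(KMeans(n_clusters=n_rules, n_init=10, random_state=0)
                 .fit(x).cluster_centers_.ravel())

def triangular_mf(v, left, center, right):
    """Triangular membership value of v for a fuzzy set with the given apexes."""
    if v <= left or v >= right:
        return 0.0
    return (v - left) / (center - left) if v < center else (right - v) / (right - center)

print("initial apexes:", apexes)
print("membership of 4.2 in the middle set:",
      triangular_mf(4.2, apexes[0], apexes[1], apexes[2]))
```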


Predictive Analysis of Financial Fraud Detection using Azure and Spark ML

  • Priyanka Purushu;Niklas Melcher;Bhagyashree Bhagwat;Jongwook Woo
    • Asia pacific journal of information systems
    • /
    • v.28 no.4
    • /
    • pp.308-319
    • /
    • 2018
  • This paper aims to provide insights into financial fraud detection in mobile money transaction activity. We predicted and classified transactions as normal or fraudulent on both a small sample and a massive data set, using Azure (a traditional system) and Spark ML (a Big Data platform) respectively. Experimenting with the sample data set in Azure, we found that the Decision Forest model is the most accurate in terms of recall. For the massive data set in Spark ML, the Random Forest classifier proved to be the best algorithm. We also show that the Spark cluster builds and evaluates models much faster as more servers are added, at the same accuracy, which demonstrates that large-scale data sets can be handled with a Big Data platform. Finally, we reached a recall score of 0.73, which implies satisfactory quality in predicting fraudulent transactions.
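
The Spark ML side of such a pipeline might look roughly like the sketch below. The file path, column names, and tree count are assumptions for illustration, not details taken from the paper:

```python
# A hedged Spark ML sketch: a Random Forest classifier over mobile-money
# transactions, scored by recall on the fraud class.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("fraud-detection-sketch").getOrCreate()
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)  # hypothetical file
df = df.withColumn("isFraud", df["isFraud"].cast("double"))             # hypothetical label column

# Assemble assumed numeric columns into a feature vector.
assembler = VectorAssembler(inputCols=["amount", "oldbalanceOrg", "newbalanceOrig"],
                            outputCol="features")
data = assembler.transform(df).select("features", "isFraud")

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = RandomForestClassifier(labelCol="isFraud", featuresCol="features",
                               numTrees=100).fit(train)

# Recall on the fraud class (label 1.0), the metric the paper reports.
recall = MulticlassClassificationEvaluator(labelCol="isFraud",
                                           metricName="recallByLabel",
                                           metricLabel=1.0).evaluate(model.transform(test))
print(f"recall on the fraud class: {recall:.2f}")
```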

Introduction of Japanese Ocean Flux data sets with Use of Remote sensing Observations (J-OFURO)

  • Kubota, Masahisa
    • Proceedings of the KSRS Conference
    • /
    • 1999.11a
    • /
    • pp.231-236
    • /
    • 1999
  • Accurate ocean surface fluxes with high resolution are critical for understanding the mechanisms of global climate. However, it is difficult to derive those fluxes from ocean observation data because the number of observations is extremely small and their distribution is inhomogeneous. Satellite data, on the other hand, are characterized by high density, high resolution, and homogeneity, so accurate ocean surface fluxes can be obtained by using them. We recently constructed ocean surface flux data sets mainly from satellite data, named the Japanese Ocean Flux data sets with Use of Remote sensing Observations (J-OFURO), which we introduce here. The data set includes shortwave radiation, longwave radiation, latent heat flux, sensible heat flux, and momentum flux, among others; sea surface dynamic topography data are also included. The radiation data sets cover the western Pacific and eastern Indian Ocean because a Japanese geostationary satellite (GMS) is used to estimate the radiation fluxes, whereas the turbulent heat fluxes are estimated globally. The constructed data sets have been used in many scientific studies and have shown their effectiveness.


Performance of Korean spontaneous speech recognizers based on an extended phone set derived from acoustic data (음향 데이터로부터 얻은 확장된 음소 단위를 이용한 한국어 자유발화 음성인식기의 성능)

  • Bang, Jeong-Uk;Kim, Sang-Hun;Kwon, Oh-Wook
    • Phonetics and Speech Sciences
    • /
    • v.11 no.3
    • /
    • pp.39-47
    • /
    • 2019
  • We propose a method to improve the performance of spontaneous speech recognizers by extending their phone set using speech data. In the proposed method, we first extract variable-length phoneme-level segments from broadcast speech signals, and convert them to fixed-length latent vectors using a long short-term memory (LSTM) classifier. We then cluster acoustically similar latent vectors and build a new phone set by choosing the number of clusters with the lowest Davies-Bouldin index. We also update the lexicon of the speech recognizer by choosing the pronunciation sequence of each word with the highest conditional probability. In order to analyze the acoustic characteristics of the new phone set, we visualize its spectral patterns and segment durations. Through speech recognition experiments using a larger training data set than in our own previous work, we confirm that the new phone set yields better performance than the conventional phoneme-based and grapheme-based units in both spontaneous speech recognition and read speech recognition.
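
The cluster-count selection described in the abstract, choosing the number of acoustic clusters with the lowest Davies-Bouldin index, can be sketched as follows. Random vectors stand in for the LSTM latent vectors; nothing below comes from the paper's corpus:

```python
# A small sketch of selecting the cluster count by minimizing the
# Davies-Bouldin index over k-means clusterings of fixed-length latent vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 32))        # stand-in for LSTM-derived segment embeddings

best_k, best_dbi = None, np.inf
for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(latents)
    dbi = davies_bouldin_score(latents, labels)
    if dbi < best_dbi:
        best_k, best_dbi = k, dbi

print(f"selected phone-set size: {best_k} (Davies-Bouldin index {best_dbi:.3f})")
```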

Multi-period DEA Models Using Spanning Set and A Case Example (생성집합을 이용한 다 기간 성과평가를 위한 DEA 모델 개발 및 공학교육혁신사업 사례적용)

  • Kim, Kiseong;Lee, Taehan
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.45 no.3
    • /
    • pp.57-65
    • /
    • 2022
  • DEA (data envelopment analysis) is a technique for evaluating the relative efficiency of decision making units (DMUs) that have multiple inputs and outputs. A DEA model measures the efficiency of a DMU by the relative position of the DMU's input and output in the production possibility set defined by the inputs and outputs of the DMUs being compared. In this paper, we propose several DEA models measuring the multi-period efficiency of a DMU. First, we define the input and output data that form a production possibility set as the spanning set, and we propose several spanning sets containing the inputs and outputs of all periods for measuring the multi-period efficiency of a DMU. We then define the production possibility sets with the proposed spanning sets and give DEA models under these production possibility sets. Some models measure the efficiency score of each period of a DMU, and others measure the integrated efficiency score of the DMU over the entire period. For the test, we applied the models to a sample data set from a long-term university student training project. The results show that the suggested models may have better discrimination power than CCR-based results, while the ranking of DMUs is not different.
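
For readers unfamiliar with the single-period building block that the paper extends, a compact sketch of the standard input-oriented CCR model in multiplier form is given below. The data are invented and the formulation is the textbook one, not the paper's multi-period spanning-set models:

```python
# A compact sketch of an input-oriented CCR DEA model (multiplier form),
# solved as a linear program with SciPy. Efficient DMUs score 1.0.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 3.0], [4.0, 3.0], [3.0, 2.0]])   # inputs,  rows = DMUs (made-up data)
Y = np.array([[1.0],      [1.0],      [1.2]])        # outputs, rows = DMUs

def ccr_efficiency(j):
    """CCR efficiency of DMU j: max u'y_j  s.t.  v'x_j = 1,  u'y_k - v'x_k <= 0 for all k."""
    m, s = X.shape[1], Y.shape[1]
    c = np.concatenate([np.zeros(m), -Y[j]])                   # decision variables: [v, u]
    A_ub = np.hstack([-X, Y])                                   # u'y_k - v'x_k <= 0
    b_ub = np.zeros(X.shape[0])
    A_eq = np.concatenate([X[j], np.zeros(s)]).reshape(1, -1)   # normalization v'x_j = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
    return -res.fun

for j in range(len(X)):
    print(f"DMU {j}: efficiency = {ccr_efficiency(j):.3f}")
```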

Classification and Regression Tree Analysis for Molecular Descriptor Selection and Binding Affinities Prediction of Imidazobenzodiazepines in Quantitative Structure-Activity Relationship Studies

  • Atabati, Morteza;Zarei, Kobra;Abdinasab, Esmaeil
    • Bulletin of the Korean Chemical Society
    • /
    • v.30 no.11
    • /
    • pp.2717-2722
    • /
    • 2009
  • The use of the classification and regression tree (CART) methodology was studied in a quantitative structure-activity relationship (QSAR) context on a data set consisting of the binding affinities of 39 imidazobenzodiazepines for the α1 benzodiazepine receptor. The 3-D structures of these compounds were optimized using HyperChem software with the semiempirical AM1 optimization method. After optimization, a set of 1481 zero- to three-dimensional descriptors was calculated for each molecule in the data set. The response (dependent variable) in the tree model consisted of the binding affinities of the drugs. Three descriptors (two topological descriptors and one 3D-Morse descriptor) were used in the final tree structure to describe the binding affinities. The mean relative error percent for the data set is 3.20%, compared with 6.63% for a previous model. To evaluate the predictive power of CART, cross-validation was also performed.
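
The modeling step reduces to fitting a regression tree to a descriptor matrix and noting which descriptors the tree splits on. A small illustrative sketch with synthetic descriptors (the shapes echo the paper, the values do not):

```python
# An illustrative sketch of fitting a CART regression tree to predict binding
# affinity from molecular descriptors and reading off the descriptors used.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n_compounds, n_descriptors = 39, 20                 # shapes only echo the paper; values are random
X = rng.normal(size=(n_compounds, n_descriptors))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(scale=0.1, size=n_compounds)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
used = np.flatnonzero(tree.feature_importances_ > 0)   # descriptors actually selected by the tree

print("descriptors selected by the tree:", used)
print(f"training R^2: {tree.score(X, y):.3f}")
```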

A Study of Optimal Ratio of Data Partition for Neuro-Fuzzy-Based Software Reliability Prediction (뉴로-퍼지 소프트웨어 신뢰성 예측에 대한 최적의 데이터 분할비율에 관한 연구)

  • Lee, Sang-Un
    • The KIPS Transactions:PartD
    • /
    • v.8D no.2
    • /
    • pp.175-180
    • /
    • 2001
  • This paper investigates the optimal fraction of the validation set for obtaining accurate predictions of future software failure counts or failure times with a neuro-fuzzy system. Given a fixed amount of training data, the most popular and effective approach to avoiding underfitting and overfitting, and hence achieving good generalization, is early stopping. But a practical issue remains unresolved: how much data should be assigned to the training set and how much to the validation set? Rules of thumb abound, and in practice the split is found by trial and error, which takes a long time. To find the optimal fraction, we train with a range of specific validation-set fractions. The results show that a minimal fraction of the validation data set is sufficient to achieve good next-step prediction. This result can serve as a practical guideline for predicting software reliability with a neuro-fuzzy system.
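
The shape of such an experiment, training the same network under several train/validation splits with early stopping and comparing next-step error, is sketched below. An sklearn MLP stands in for the paper's neuro-fuzzy system, and the failure-count series is synthetic:

```python
# A hedged sketch: vary the validation fraction used for early stopping and
# compare next-step prediction error on a held-out tail of the series.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)
failures = 50 * (1 - np.exp(-0.05 * t)) + rng.normal(scale=1.0, size=t.size)

X, y = t[:-1].reshape(-1, 1), failures[1:]          # predict the next-step failure count
X_train, y_train, X_test, y_test = X[:80], y[:80], X[80:], y[80:]

for frac in (0.1, 0.2, 0.3, 0.4):
    model = MLPRegressor(hidden_layer_sizes=(10,), early_stopping=True,
                         validation_fraction=frac, max_iter=2000, random_state=0)
    model.fit(X_train, y_train)
    mae = np.mean(np.abs(model.predict(X_test) - y_test))
    print(f"validation fraction {frac:.1f}: test MAE = {mae:.2f}")
```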


A Study on the Secure Coding for Security Improvement of Delphi XE2 DataSnap Server (델파이 XE2 DataSnap 서버의 보안성 개선을 위한 시큐어 코딩에 관한 연구)

  • Jung, Myoung-Gyu;Park, Man-Gon
    • Journal of Korea Multimedia Society
    • /
    • v.17 no.6
    • /
    • pp.706-715
    • /
    • 2014
  • Serious structural vulnerabilities in system security arise when security-critical software is developed quickly, under an urgent release schedule, without appropriate security planning, management, and assurance processes. The Data Set and Provider of DataSnap, a middleware of Delphi XE2 from Embarcadero Technologies, certainly help developers build systems easily and quickly, but when the connection structure Database-DataSnap server-SQL Connection-SQL Data Set-Provider is used, it is difficult to apply security measures and to control the security of the software system. This is because all of the Provider's information is exposed to malicious attackers the moment the DataSnap server port becomes known, and this exposure becomes a window through which SQL commands can be executed. Thus, considering all aspects of security management, Data Set and Provider should not be used in the DataSnap server. In this paper, we verify the security vulnerabilities of the DataSnap client and server in Delphi XE2, and we propose a secure coding method to improve security in the DataSnap server system.
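
Setting the Delphi specifics aside, the underlying secure-coding principle is to expose fixed, parameterized operations rather than a pass-through that executes whatever SQL a client sends. A minimal Python/sqlite3 illustration of that general principle (our own sketch, not code from the paper):

```python
# A generic illustration of the "window capable of running SQL commands" risk:
# a pass-through endpoint versus a fixed, parameterized one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

def unsafe_endpoint(sql_from_client):
    # Anti-pattern: the exposed provider runs arbitrary client-supplied SQL.
    return conn.execute(sql_from_client).fetchall()

def safe_endpoint(user_id):
    # Fixed query with a bound parameter; the client never supplies SQL text.
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()

print(safe_endpoint(1))                    # -> [('alice',)]
# unsafe_endpoint("DROP TABLE users")      # a single exposed command can destroy data
```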

Data Pattern Estimation with Movement of the Center of Gravity

  • Ahn Tae-Chon;Jang Kyung-Won;Shin Dong-Du;Kang Hak-Soo;Yoon Yang-Woong
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • v.6 no.3
    • /
    • pp.210-216
    • /
    • 2006
  • In rule-based modeling, data partitioning plays a crucial role because each partitioned sub data set carries particular information about the given data set or system. In this paper, we present an empirical study of data pattern estimation to find the underlying patterns of the given data. The presented method performs crisp clustering of the n given data samples by means of the sequential agglomerative hierarchical nested (SAHN) model. At each level of the hierarchy, the average of the distances between each cluster centroid and its data points is computed, and the derivative of this weighted average distance is examined to observe the pattern distribution. Finally, after the overall clustering process is completed, the weighted average distance is used to estimate the range of the number of clusters in the given data set. The proposed estimation method and its results are examined using the FCM demo data set in the MATLAB Fuzzy Logic Toolbox and the Box-Jenkins gas furnace data.
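
A rough sketch of that procedure with SciPy's agglomerative (SAHN-style) hierarchy: cut the tree at each candidate cluster count, compute the mean point-to-centroid distance, and look for where the curve flattens. The data below are synthetic, not the FCM demo set or the gas furnace series:

```python
# A sketch of estimating the number of clusters from the average
# point-to-centroid distance at each level of an agglomerative hierarchy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

Z = linkage(data, method="average")          # SAHN-style agglomerative clustering

def mean_centroid_distance(k):
    """Average distance from each point to its cluster centroid when the
    hierarchy is cut into k clusters."""
    labels = fcluster(Z, t=k, criterion="maxclust")
    dists = [np.linalg.norm(data[labels == c] - data[labels == c].mean(axis=0), axis=1)
             for c in np.unique(labels)]
    return float(np.mean(np.concatenate(dists)))

for k in range(1, 8):
    print(f"k = {k}: mean point-to-centroid distance = {mean_centroid_distance(k):.3f}")
# The level where this curve flattens (here around k = 3) suggests the range in
# which the underlying number of clusters lies.
```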