• Title/Abstract/Keywords: data sets


Missing Value Imputation based on Locally Linear Reconstruction for Improving Classification Performance

  • Kang, Pilsung
    • Journal of Korean Institute of Industrial Engineers / v.38 no.4 / pp.276-284 / 2012
  • Classification algorithms generally assume that the data are complete. However, missing values are common in real data sets for various reasons. In this paper, we propose using locally linear reconstruction (LLR) for missing value imputation to improve classification performance when missing values exist. We first investigate how much missing values degrade classification performance at various missing ratios. Then, we compare the proposed LLR imputation with three well-known single imputation methods over three different classifiers using eight data sets. The experimental results showed that (1) every imputation method, even the very simple ones, helped to improve classification accuracy; (2) among the imputation methods, the proposed LLR imputation was the most effective across all missing ratios; and (3) when the missing ratio was relatively high, LLR was outstanding, and its classification accuracy was as high as that obtained from the complete data set.
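
The abstract above does not spell out the LLR procedure, so the following is only a minimal sketch of one common reading of locally linear reconstruction: find the k nearest complete rows using the observed features, fit least-squares reconstruction weights on those features, and apply the same weights to the missing features. The function names, the `ridge` term, and the absence of any constraint on the weights are assumptions made for illustration, not details taken from the paper.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def llr_impute(row, complete_rows, k=3, ridge=1e-6):
    """Fill the None entries of `row` from its k nearest complete rows.

    Assumption: weights come from ridge-regularized least squares with
    no sum-to-one or non-negativity constraint.
    """
    observed = [i for i, v in enumerate(row) if v is not None]
    missing = [i for i, v in enumerate(row) if v is None]
    if not missing:
        return list(row)

    def distance(other):
        # Distance is measured on the observed features only.
        return math.sqrt(sum((row[i] - other[i]) ** 2 for i in observed))

    neighbors = sorted(complete_rows, key=distance)[:k]

    # Normal equations of: min_w ||x_obs - sum_j w_j * n_j_obs||^2 + ridge * ||w||^2
    gram = [
        [sum(p[i] * q[i] for i in observed) + (ridge if r == c else 0.0)
         for c, q in enumerate(neighbors)]
        for r, p in enumerate(neighbors)
    ]
    rhs = [sum(p[i] * row[i] for i in observed) for p in neighbors]
    weights = solve_linear(gram, rhs)

    filled = list(row)
    for i in missing:
        filled[i] = sum(w * n[i] for w, n in zip(weights, neighbors))
    return filled


# Toy data in which the third feature equals the sum of the first two.
complete = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0], [2.0, 1.0, 3.0]]
filled = llr_impute([2.0, 2.0, None], complete, k=2)
```

Because the toy data follow an exact linear relation, any weights that reconstruct the observed features also recover the missing one, which makes the behavior easy to check by hand. On real data the choice of k and of the regularization would need tuning.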

Validity Study of Kohonen Self-Organizing Maps

  • Huh, Myung-Hoe
    • Communications for Statistical Applications and Methods / v.10 no.2 / pp.507-517 / 2003
  • The self-organizing map (SOM) was developed mainly by T. Kohonen and his colleagues as an unsupervised learning neural network. Because of its topological ordering property, SOM is known to be very useful in pattern recognition and text information retrieval. Recently, data miners have frequently used Kohonen's mapping method in exploratory analyses of large data sets. One problem facing SOM builders is that no sensible criterion exists for evaluating the goodness-of-fit of the map at hand. In this short communication, we propose valid evaluation procedures for a Kohonen SOM of any size. The methods can be used to select the best map among several candidates.

Censored varying coefficient regression model using Buckley-James method

  • Shim, Jooyong;Seok, Kyungha
    • Journal of the Korean Data and Information Science Society / v.28 no.5 / pp.1167-1177 / 2017
  • Censored regression using the pseudo-response variable proposed by Buckley and James has been one of the most well-known models. Recently, the varying coefficient regression model has received a great deal of attention as an important modeling tool. In this paper, we propose a censored varying coefficient regression model using the Buckley-James method to handle situations where the regression coefficients of the model are not constant but change as the smoothing variables change. By using the formulation of the least squares support vector machine (LS-SVM), the coefficient estimators of the proposed model can be easily obtained from simple linear equations. Furthermore, a generalized cross validation function can be easily derived. We evaluate the proposed method and demonstrate its adequacy using simulated data sets and real data sets.

Analysis of Multivariate Financial Time Series Using Cointegration : Case Study

  • Choi, M.S.;Park, J.A.;Hwang, S.Y.
    • Journal of the Korean Data and Information Science Society / v.18 no.1 / pp.73-80 / 2007
  • Cointegration, together with the vector ARMA (VARMA) model, has proven useful for analyzing multivariate non-stationary data in the field of financial time series. It provides a linear combination of non-stationary component series that turns out to be a stationary series. This linear combination is referred to as the long-term equilibrium between the component series. We consider two sets of Korean bivariate financial time series and illustrate cointegration analysis with them. Specifically, estimated vector AR (VAR) and vector error correction models (VECM) are obtained, and the cointegrating vector (CV) is found for each data set.
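
To make the idea of a stationary linear combination of non-stationary series concrete, here is a small simulation sketch. It is not the VAR/VECM procedure used in the paper: it swaps in a simple ordinary least squares fit (in the spirit of the first step of the Engle-Granger approach) on synthetic data. The series, the true coefficient `beta=2.0`, and all function names are assumptions chosen for illustration.

```python
import random


def simulate_cointegrated(n=500, beta=2.0, seed=0):
    """Return two non-stationary series that share one stochastic trend."""
    rng = random.Random(seed)
    x, y, level = [], [], 0.0
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)                   # random walk
        x.append(level)
        y.append(beta * level + rng.gauss(0.0, 1.0))   # same trend plus noise
    return x, y


def ols_slope_intercept(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


x, y = simulate_cointegrated()
slope, intercept = ols_slope_intercept(x, y)

# The residual y - intercept - slope * x estimates the equilibrium error;
# (1, -slope) plays the role of the cointegrating vector in this toy setup.
residuals = [b - intercept - slope * a for a, b in zip(x, y)]
print("estimated slope:", round(slope, 3))
print("variance of y:", round(variance(y), 2))
print("variance of residuals:", round(variance(residuals), 2))
```

In this setup each series wanders without bound, while the residual stays close to zero. A real analysis would also test the residual for stationarity rather than only comparing variances.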

Improved Convective Heat Transfer Correlations for Two-Phase Two-Component Pipe Flow

  • Kim, Dongwoo
    • Journal of Mechanical Science and Technology / v.16 no.3 / pp.403-422 / 2002
  • In this study, six two-phase nonboiling heat transfer correlations obtained from the recommendations of our previous work were assessed. These correlations were modified using seven extensive sets of two-phase flow experimental data available from the literature, for vertical and horizontal tubes and different flow patterns and fluids. A total of 524 data points from five available experimental studies (which included the seven sets of data) were used for improvement of the six identified correlations. Based on the tabulated and graphical results of the comparisons between the predictions of the modified heat transfer correlations and the available experimental data, appropriate improved correlations for different flow patterns, tube orientations, and liquid-gas combinations were recommended.

Catalyzing social media scholarship with open tools and data

  • Smith, Marc A.
    • Journal of Contemporary Eastern Asia / v.14 no.2 / pp.87-96 / 2015
  • Social media comprises a vast and consequential landscape that has been poorly mapped and understood. Hundreds of millions of people have eagerly moved many of the conversations and discussions that compose civil society into these services and platforms. There is a need to document and analyze these social spaces for many academic and commercial purposes. The Social Media Research Foundation has pursued a strategy to cultivate better research into the structure and dynamics of social media. The foundation is dedicated to the creation of open tools, open data, and open scholarship related to social media. It has implemented a free and open network collection, analysis, and visualization tool called NodeXL to facilitate social media network research. Using NodeXL, a group of researchers has collectively authored a publicly available archive, called the NodeXL Graph Gallery, composed of network data sets and visualizations from users around the world. This site has enabled the aggregation of tens of thousands of network data sets and images. Use of the archive has led to scholarly research results that are based on the wide range and scope of social media data sets available.

Noisy Data Aggregation with Independent Sensors: Insights and Open Problems

  • Murayama, Tatsuto;Davis, Peter
    • Journal of Multimedia Information System / v.3 no.2 / pp.21-26 / 2016
  • Our networked world has been growing exponentially fast. The explosion in the volume of machine-to-machine (M2M) transactions threatens to exceed the transport capacity of the networks that link them. Therefore, it is essential to reconsider the tradeoff between using many data sets and using good data sets. We focus on this tradeoff in the context of the quality of information aggregated from many sensors in a noisy environment. We start with a basic theoretical model considered in the famous "CEO problem" in the field of information theory. From the viewpoint of large deviations, we find a simple statement of the optimal strategies under a limited network capacity condition. Moreover, we propose an open problem for a sensor network scenario and report a numerical result.

SUPPORT VECTOR MACHINE USING K-MEANS CLUSTERING

  • Lee, S.J.;Park, C.;Jhun, M.;Koo, J.Y.
    • Journal of the Korean Statistical Society / v.36 no.1 / pp.175-182 / 2007
  • The support vector machine has been successful in many applications because of its flexibility and high accuracy. However, when a training data set is large or imbalanced, the support vector machine may suffer from significant computational problems or a loss of accuracy in predicting minority classes. We propose a modified version of the support vector machine using K-means clustering that exploits the information in class labels during the clustering process. For large data sets, our method can save computation time by reducing the number of data points without significant loss of accuracy. Moreover, our method can deal with imbalanced data sets effectively by alleviating the influence of the dominant class.
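
The abstract does not say how class labels enter the clustering step, so the sketch below shows only one plausible reading: run K-means separately within each class and keep the labeled centroids as a smaller training set for whatever SVM implementation is then used. The helper names, the `k_per_class` parameter, and the plain Lloyd iteration are assumptions for illustration, and the SVM fit itself is left out.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iteration; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def reduce_by_class(points, labels, k_per_class):
    """Replace each class by at most k_per_class labeled centroids."""
    reduced_points, reduced_labels = [], []
    for label in sorted(set(labels)):
        members = [p for p, l in zip(points, labels) if l == label]
        k = min(k_per_class, len(members))  # small classes are kept whole
        for center in kmeans(members, k):
            reduced_points.append(center)
            reduced_labels.append(label)
    return reduced_points, reduced_labels


# Imbalanced toy set: six points in class 0, two points in class 1.
points = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [2.0, 2.0], [2.1, 1.9],
          [5.0, 5.0], [5.2, 4.8]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
reduced_points, reduced_labels = reduce_by_class(points, labels, k_per_class=3)
```

In this toy case the majority class shrinks from six points to three centroids while the minority class keeps both of its points, which reduces the training set size and narrows the class imbalance at the same time.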

Multiscale Implicit Functions for Unified Data Representation

  • Yun, Seong-Min;Park, Sang-Hun
    • KSII Transactions on Internet and Information Systems (TIIS) / v.5 no.12 / pp.2374-2391 / 2011
  • A variety of reconstruction methods have been developed to convert a set of scattered points generated from real models into explicit forms, such as polygonal meshes and parametric or implicit surfaces. In this paper, we present a method to construct multi-scale implicit surfaces from scattered points using multi-scale kernels based on kernel and multi-resolution analysis theories. Our approach differs from other methods in that multi-scale reconstruction can be done without additional manipulation of the input data, the calculated functions support level-of-detail representation, and the approach can be naturally extended to n-dimensional data. The method also works well with point sets that are noisy or not uniformly distributed. We show the features and performance of the proposed method via experimental results for various data sets.

A Study on an Isolated Word Recognition System Using a Back-Propagation Learning Neural Network

  • 김중태
    • The Journal of Korean Institute of Communications and Information Sciences / v.15 no.9 / pp.738-744 / 1990
  • This paper proposes a real-time memory storage method and an improved method for sampling the given speech signal data, and studies an isolated word recognition system that uses the back-propagation learning algorithm of a neural network. The recognition rate and the error rate are compared using new sample data sets generated from small sets of given sample data while varying the number of nodes in each layer. As a result, a recognition rate of 95.1% was achieved.
