• Title/Abstract/Keywords: data sets


Training for Huge Data set with On Line Pruning Regression by LS-SVM

  • Kim, Dae-Hak;Shim, Joo-Yong;Oh, Kwang-Sik
    • Proceedings of the Korean Statistical Society Conference / 2003.10a / pp.137-141 / 2003
  • LS-SVM (least squares support vector machine) is a widely applicable and useful machine learning technique for classification and regression analysis. LS-SVM can be a good substitute for statistical methods, but computational difficulties remain in inverting the matrix for a huge data set. In the modern information society, huge data sets are easily obtained in on-line or batch mode. For these kinds of huge data sets, we suggest an on-line pruning regression method based on LS-SVM. With a relatively small number of pruned support vectors, we can achieve almost the same performance as regression with the full data set.

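As a rough illustration of the idea in the abstract above, the sketch below fits an LS-SVM regressor by solving its linear system and then drops the sample with the smallest coefficient. It is a minimal sketch under assumptions, not the authors' on-line algorithm: the RBF kernel, the `gamma` value, the toy data, and the smallest-|alpha| pruning rule are all choices made here for illustration, and every function name is hypothetical.

```python
import math

def rbf(a, b, sigma=1.0):
    # Gaussian (RBF) kernel on scalars; the kernel choice is an assumption.
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

def solve(A, rhs):
    # Gaussian elimination with partial pivoting on a small dense system.
    n = len(rhs)
    M = [row[:] + [v] for row, v in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def lssvm_fit(xs, ys, gamma=100.0):
    # Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
    n = len(xs)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [rbf(xs[i], xs[j]) + (1.0 / gamma if i == j else 0.0)
               for j in range(n)]
        A.append([1.0] + row)
    sol = solve(A, [0.0] + list(ys))
    return sol[0], sol[1:]  # bias b, coefficients alpha

def lssvm_predict(x, xs, alpha, b):
    return b + sum(a * rbf(x, xi) for a, xi in zip(alpha, xs))

def prune_once(xs, ys, alpha):
    # Drop the sample with the smallest |alpha| (an assumed pruning rule).
    drop = min(range(len(xs)), key=lambda i: abs(alpha[i]))
    keep = [i for i in range(len(xs)) if i != drop]
    return [xs[i] for i in keep], [ys[i] for i in keep]

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [math.sin(x) for x in xs]
b, alpha = lssvm_fit(xs, ys)
xs_pruned, ys_pruned = prune_once(xs, ys, alpha)
b_pruned, alpha_pruned = lssvm_fit(xs_pruned, ys_pruned)
```

Each refit still inverts a matrix whose size equals the number of retained samples, which is why keeping only a small set of support vectors matters for huge data sets.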

Feature Extraction and Multisource Image Classification

  • Amarsaikhan, D.;Sato, M.
    • Proceedings of the KSRS Conference / 2003.11a / pp.1084-1086 / 2003
  • The aim of this study is to assess the integrated use of different features extracted from spaceborne interferometric synthetic aperture radar (InSAR) data and optical data for land cover classification. Special attention is given to the discriminatory characteristics of the features derived from the multisource data sets. To evaluate the features, the statistical maximum likelihood decision rule and neural network classification are used and the results are compared. The performance of each method was evaluated by measuring the overall accuracy. In all cases, the first method (maximum likelihood) performed better than the second (neural network). Overall, the research indicated that multisource data sets containing different information about the backscattering and reflecting properties of the selected classes of objects can significantly improve the classification of land cover types.


Research on Fault Diagnosis of Wind Power Generator Blade Based on SC-SMOTE and kNN

  • Peng, Cheng;Chen, Qing;Zhang, Longxin;Wan, Lanjun;Yuan, Xinpan
    • Journal of Information Processing Systems / v.16 no.4 / pp.870-881 / 2020
  • Because SCADA monitoring data of wind turbines are large and fast changing, the unbalanced proportion of data across working conditions makes it difficult to process fault feature data. The existing methods mainly introduce new, non-repeating instances by interpolating between adjacent minority samples. To overcome the shortcoming of these methods, which do not consider boundary conditions when balancing data, an improved over-sampling balancing algorithm, SC-SMOTE (safe circle synthetic minority oversampling technology), is proposed to optimize data sets. Then, for the balanced data sets, a fault diagnosis method based on improved k-nearest neighbors (kNN) classification is adopted for wind turbine blade icing. The experimental results show that, compared with the SMOTE algorithm, the method is effective in diagnosing fan blade icing faults and improves the accuracy of diagnosis.
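The interpolation step that the abstract attributes to existing methods can be sketched as follows. This is plain SMOTE-style oversampling, not the proposed SC-SMOTE (the safe-circle boundary handling is not reproduced here); the function name, the neighbour count `k`, and the toy points are assumptions for illustration.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        point = tuple(b + t * (n - b) for b, n in zip(base, neighbour))
        synthetic.append(point)
    return synthetic

# Hypothetical minority-class samples (corners of the unit square).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=5)
```

Because every synthetic point lies between two existing minority samples, points generated near the class boundary can fall into majority-class territory, which is the weakness the abstract says SC-SMOTE addresses.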

Application of Random Forests to Assessment of Importance of Variables in Multi-sensor Data Fusion for Land-cover Classification

  • Park No-Wook;Chi Kwang-Hoon
    • Korean Journal of Remote Sensing / v.22 no.3 / pp.211-219 / 2006
  • A random forests classifier is applied to multi-sensor data fusion for supervised land-cover classification in order to account for the importance of variables. The random forests approach is a non-parametric ensemble classifier based on CART-like trees. Its distinguishing feature is that the importance of a variable can be estimated by randomly permuting the variable of interest in all the out-of-bag samples for each classifier. Two different multi-sensor data sets for supervised classification were used to illustrate the applicability of random forests: one with optical and polarimetric SAR data and the other with multi-temporal Radarsat-1 and ENVISAT ASAR data sets. In the experiments, the random forests approach extracted important variables or bands for land-cover discrimination and showed reasonably good performance in terms of classification accuracy.
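The permutation idea described above can be sketched in a few lines. This is a simplified illustration under assumptions: a fixed threshold rule stands in for a trained random forest, all samples are treated as held out rather than as per-tree out-of-bag samples, and the data and names are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    # Fraction of rows for which the classifier matches the label.
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, seed=0):
    """Accuracy drop after randomly permuting one feature column."""
    rng = random.Random(seed)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [r[:feature] + (v,) + r[feature + 1:]
                for r, v in zip(rows, column)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)

# Hypothetical two-band data: only band 0 determines the class.
rows = [(x / 10, (x * 7 % 10) / 10) for x in range(10)]
labels = [int(r[0] > 0.45) for r in rows]

def predict(row):
    # Stand-in classifier (assumption): threshold on band 0 only.
    return int(row[0] > 0.45)

importance = [permutation_importance(predict, rows, labels, f) for f in (0, 1)]
```

A band the classifier never uses gets an importance of zero, while permuting an informative band can only lower the accuracy of this stand-in classifier.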

Restricted maximum likelihood estimation of a censored random effects panel regression model

  • Lee, Minah;Lee, Seung-Chun
    • Communications for Statistical Applications and Methods / v.26 no.4 / pp.371-383 / 2019
  • Panel data sets have been developed in various areas, and many recent studies have analyzed panel, or longitudinal, data sets. Maximum likelihood (ML) may be the most common statistical method for analyzing panel data models; however, inference based on the ML estimate will have an inflated Type I error because the ML method tends to give a downwardly biased estimate of variance components when the sample size is small. The underestimation can be severe when data are incomplete. This paper proposes the restricted maximum likelihood (REML) method for a random effects panel data model with a censored dependent variable. Note that the likelihood function of the model is complex in that it includes a multidimensional integral. Many authors have proposed integral approximation methods for computing the likelihood function; however, it is well known that integral approximation methods are inadequate for high-dimensional integrals in practice. This paper introduces the use of the moments of a truncated multivariate normal random vector for calculating the multidimensional integral. In addition, a proper asymptotic standard error of the REML estimate is given.

On Mathematical Representation and Integration Theory for GIS Application of Remote Sensing and Geological Data

  • Moon, Woo-Il M.
    • Korean Journal of Remote Sensing / v.10 no.2 / pp.37-48 / 1994
  • In spatial information processing, particularly in non-renewable resource exploration, the spatial data sets, including remote sensing, geophysical and geochemical data, have to be geocoded onto a reference map and integrated for the final analysis and interpretation. Application of a computer-based GIS (Geographical Information System or Geological Information System) at some point in the spatial data integration/fusion processing is now a logical and essential step. It should, however, be pointed out that the basic concepts of GIS-based spatial data fusion were developed with insufficient mathematical understanding of the spatial characteristics or quantitative modeling framework of the data. Furthermore, many remote sensing and geological data sets available for exploration projects are spatially incomplete in coverage and introduce a spatially uneven information distribution. In addition, the spectral information of many spatial data sets is often imprecise due to digital rescaling. Direct applications of GIS systems to spatial data fusion can therefore produce seriously erroneous final results. To resolve this problem, some of the important mathematical information representation techniques are briefly reviewed and discussed in this paper with consideration of the spatial and spectral characteristics of common remote sensing and exploration data. They include the basic probabilistic approach, the evidential belief function approach (Dempster-Shafer method) and the fuzzy logic approach. Even though the basic concepts of these three approaches are different, proper application of the techniques and careful interpretation of the final results are expected to yield acceptable conclusions in each case. Actual tests with real data (Moon, 1990a; An et al., 1991, 1992, 1993) have shown that implementation and application of the methods discussed in this paper consistently provide more accurate final results than most direct applications of GIS techniques.
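Of the three approaches named above, the evidential belief function approach rests on Dempster's rule of combination, which can be sketched as below. This is a minimal sketch assuming a two-hypothesis frame; the hypothesis labels, the two evidence layers, and their mass values are invented for illustration and do not come from the paper.

```python
def combine(m1, m2):
    """Dempster's rule: fuse two mass functions given as
    {frozenset_of_hypotheses: mass} dictionaries."""
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb  # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise by the non-conflicting mass.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

ORE = frozenset({"ore"})
BARREN = frozenset({"barren"})
EITHER = ORE | BARREN  # ignorance: mass not committed to either hypothesis

# Hypothetical evidence from two data layers at one map cell.
geophysics = {ORE: 0.6, EITHER: 0.4}
geochemistry = {ORE: 0.5, BARREN: 0.2, EITHER: 0.3}
fused = combine(geophysics, geochemistry)
```

Mass left on `EITHER` represents ignorance, which lets a layer with incomplete spatial coverage contribute little instead of forcing a probability onto each hypothesis.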

Automatic Detection of the Updating Object by Areal Feature Matching Based on Shape Similarity (형상유사도 기반의 면 객체 매칭을 통한 갱신 객체 탐지)

  • Kim, Ji-Young;Yu, Ki-Yun
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.30 no.1 / pp.59-65 / 2012
  • In this paper, we propose a method for automatically detecting updating objects from spatial data sets of different scales and updating cycles by using areal feature matching based on shape similarity. For this, we defined an updating object by analysing the matching relationships between two different spatial data sets. First, we eliminated systematic errors between the different scales by using an affine transformation. Second, if an object overlapped several areal features of the other data set, we merged those areal features into a single areal feature. Finally, we detected the updating objects by applying areal feature matching based on shape similarity to the modified spatial data sets. After applying the proposed method to a digital topographic map and a base map of the Korean Address Information System in South Korea, we confirmed that the F-measure is as high as 0.958 in a statistical evaluation and that significant updating objects are detected in a visual evaluation.
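For reference, the F-measure quoted above is the harmonic mean of precision and recall. The sketch below assumes the standard balanced form (F1); the counts are hypothetical and are not taken from the paper.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall for matched objects."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 correct matches, 2 false matches, 2 missed objects.
score = f_measure(true_pos=8, false_pos=2, false_neg=2)
```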

A Structural Analysis of Sanghanron by Network Model - Centered on Symptoms and Herbs of Taeyangbyung Compilation in Sanghanron - (네트워크 모델을 통한 상한론(傷寒論) 구조분석 연구 - 태양병(太陽病) 증상(症狀)-처방(處方)을 중심으로 -)

  • Hong, Dae-Ki;Yook, Soon-Hyung;Kim, Min-Yong;Park, Young-Jae;Oh, Hwan-Sup;Nam, Dong-Hyun;Park, Young-Bae
    • The Journal of Korean Medicine / v.32 no.1 / pp.56-66 / 2011
  • Background: This study analyzes Sanghanron through network theory, as a first attempt to construct network models for systems biomedicine in traditional Korean medicine. For this purpose, we investigated the network structure with priority given to two-node connections between symptoms and herbs of the Taeyangbyung compilation in Sanghanron. Purpose: We had three goals in carrying out this study. First, to establish the minimum clinical grouping data sets for symptoms and herbs of the Taeyangbyung compilation in Sanghanron. Second, to make index files for the obtained data sets. Third, to generate a network structure for systems biomedicine in this part and analyze its relationships. Methods: Using MS Office Excel and Netminer software, we constructed the minimum clinical grouping data sets and the network for systems biomedicine on symptoms and herbs of the Taeyangbyung compilation in Sanghanron, and analyzed its relationships. Results: We established the minimum clinical grouping data sets for symptoms and herbs of the Taeyangbyung compilation in Sanghanron, using MS Excel. We constructed a network to structure our database through two-node connections in the Netminer program, and analyzed its relationships. Conclusions: Further research is necessary on network models for systems biomedicine between symptoms and herbs for the three Yang and three Um (Taeyang, Soyang, Yangmyung, Taeum, Soum, Gualum) disease compilations.

Comparison of Machine Learning-Based Radioisotope Identifiers for Plastic Scintillation Detector

  • Jeon, Byoungil;Kim, Jongyul;Yu, Yonggyun;Moon, Myungkook
    • Journal of Radiation Protection and Research / v.46 no.4 / pp.204-212 / 2021
  • Background: Identification of radioisotopes for plastic scintillation detectors is challenging because their spectra have poor energy resolutions and lack photo peaks. To overcome this weakness, many researchers have conducted radioisotope identification studies using machine learning algorithms; however, the effect of data normalization on radioisotope identification has not been addressed yet. Furthermore, studies on machine learning-based radioisotope identifiers for plastic scintillation detectors are limited. Materials and Methods: In this study, machine learning-based radioisotope identifiers were implemented, and their performances according to data normalization methods were compared. Eight classes of radioisotopes consisting of combinations of 22Na, 60Co, and 137Cs, and the background, were defined. The training set was generated by the random sampling technique based on probabilistic density functions acquired by experiments and simulations, and the test set was acquired by experiments. Support vector machine (SVM), artificial neural network (ANN), and convolutional neural network (CNN) were implemented as radioisotope identifiers with six data normalization methods, and trained using the generated training set. Results and Discussion: The implemented identifiers were evaluated on test sets acquired by experiments with and without gain shifts to confirm the robustness of the identifiers against the gain shift effect. Among the three machine learning-based radioisotope identifiers, prediction accuracy followed the order SVM > ANN > CNN, while the training time followed the order SVM > ANN > CNN. Conclusion: The prediction accuracy for the combined test sets was highest with the SVM. The CNN exhibited a minimum variation in prediction accuracy for each class, even though it had the lowest prediction accuracy for the combined test sets among the three identifiers. The SVM exhibited the highest prediction accuracy for the combined test sets, and its training time was the shortest among the three identifiers.

Multi-period DEA Models Using Spanning Set and A Case Example (생성집합을 이용한 다 기간 성과평가를 위한 DEA 모델 개발 및 공학교육혁신사업 사례적용)

  • Kim, Kiseong;Lee, Taehan
    • Journal of Korean Society of Industrial and Systems Engineering / v.45 no.3 / pp.57-65 / 2022
  • DEA (data envelopment analysis) is a technique for evaluating the relative efficiency of decision making units (DMUs) that have multiple inputs and outputs. A DEA model measures the efficiency of a DMU by the relative position of the DMU's inputs and outputs in the production possibility set defined by the inputs and outputs of the DMUs being compared. In this paper, we propose several DEA models that measure the multi-period efficiency of a DMU. First, we define the input and output data that generate a production possibility set as the spanning set. We propose several spanning sets containing the inputs and outputs of all periods for measuring the multi-period efficiency of a DMU. We define the production possibility sets with the proposed spanning sets and give DEA models under these production possibility sets. Some models measure the efficiency score of each period of a DMU, and others measure the integrated efficiency score of the DMU over the entire period. For the test, we applied the models to a sample data set from a long-term university student training project. The results show that the suggested models may have better discrimination power than CCR-based results, while the ranking of DMUs does not differ.