• Title/Summary/Keyword: synthetic data sampling

Image-to-Image Translation with GAN for Synthetic Data Augmentation in Plant Disease Datasets

  • Nazki, Haseeb;Lee, Jaehwan;Yoon, Sook;Park, Dong Sun
    • Smart Media Journal / v.8 no.2 / pp.46-57 / 2019
  • In recent research, deep learning-based methods have achieved state-of-the-art performance in various computer vision tasks. However, these methods are commonly supervised and require huge amounts of annotated data to train. Acquiring data demands additional costly effort, particularly for tasks where it is challenging to obtain large amounts of data given time constraints and the need for expert human labor. In this paper, we present a data-level synthetic sampling solution for learning from small and imbalanced data sets using Generative Adversarial Networks (GANs). The motivation for using GANs is the challenge, faced in various fields, of coping with small datasets and varying numbers of samples per class. As a result, we present an approach that can improve learning with respect to data distributions, reducing the bias introduced by class imbalance and hence shifting the classification decision boundary towards more accurate results. Our novel method is demonstrated on a small dataset of 2789 tomato plant disease images, heavily affected by class imbalance across 9 disease categories. Moreover, we evaluate our results in terms of different metrics and compare the quality of these results for distinct classes.

Estimation of the Forest Stand Volumes from Forest Inventory Data Based on Synthetic Estimation Method: A Case of the Economic Forest in Gangwon-do, Republic of Korea

  • Seo, Hwan seok;Park, Jeong mook;Lee, Jung soo
    • Journal of Forest and Environmental Science / v.32 no.2 / pp.140-148 / 2016
  • This study aims to estimate the forest volumes of the economic forest in Gangwon Province, Republic of Korea (hereinafter referred to as Gangwon), through synthetic estimation. To estimate the forest volume, a stratified systematic sampling method was used along with the forest type maps and the 5th National Forest Inventory data. The synthetic estimation includes sample plots of the expanded areas as well as those of the target area, and the forest volume of economic forest in every city and county throughout Gangwon. Results show that the average forest volume calculated by synthetic estimation was $159.6\;m^3/ha$ in national economic forest and $129.6\;m^3/ha$ in private economic forest. The total forest volume of the national economic forest was approximately $59.45$ million $m^3$, which was $20.18$ million $m^3$ higher than that of the private economic forest. On the other hand, the standard error of the national economic forest was approximately ${\pm}2.21\;m^3/ha$, which was $0.30\;m^3/ha$ lower than that of the private economic forest. The lowest standard error was about ${\pm}3.12\;m^3/ha$ in broad-leaved forest, followed by ${\pm}4.33\;m^3/ha$ in mixed forest and ${\pm}5.78\;m^3/ha$ in coniferous forest.

Application of Random Over Sampling Examples(ROSE) for an Effective Bankruptcy Prediction Model (효과적인 기업부도 예측모형을 위한 ROSE 표본추출기법의 적용)

  • Ahn, Cheolhwi;Ahn, Hyunchul
    • The Journal of the Korea Contents Association / v.18 no.8 / pp.525-535 / 2018
  • If the frequency of a particular class is excessively higher than that of the other classes in a classification problem, data imbalance problems occur, which distort machine learning. Corporate bankruptcy prediction often suffers from data imbalance because the ratio of insolvent companies is generally very low, whereas the ratio of solvent companies is very high. To mitigate these problems, a proper sampling technique must be applied. Until now, oversampling techniques, which adjust the class distribution of a data set by sampling the minority class with replacement, have been widely used. However, they carry a risk of overfitting. Against this background, this study proposes applying the ROSE (Random Over Sampling Examples) technique, proposed by Menardi and Torelli in 2014, for effective corporate bankruptcy prediction. The ROSE technique creates new learning samples by synthesizing them from the existing training samples, so it leads to better prediction accuracy of the classifiers while avoiding the risk of overfitting. Specifically, our study proposes combining the ROSE method with SVM (support vector machine), which is known as the best binary classifier. We applied the proposed method to a real-world bankruptcy prediction case of a major Korean bank and compared its performance with other sampling techniques. Experimental results showed that ROSE improved the prediction accuracy of SVM in bankruptcy prediction compared to other techniques, with statistical significance. These results shed light on the fact that ROSE can be a good alternative for resolving data imbalance in prediction problems in social science areas other than bankruptcy prediction.
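
The contrast drawn in this abstract, between oversampling with replacement and synthesizing new samples, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the fixed per-feature `bandwidth` values are assumptions made for illustration, and the second function is only a simplified smoothed bootstrap loosely in the spirit of ROSE.

```python
import random

def oversample_with_replacement(minority, n_new, rng):
    """Plain random oversampling: every new row duplicates an existing one."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smoothed_bootstrap(minority, n_new, bandwidth, rng):
    """Simplified ROSE-like sketch: pick a minority row, then jitter each
    feature with Gaussian noise so the new row is not an exact duplicate.
    The fixed bandwidth per feature is a hypothetical choice."""
    samples = []
    for _ in range(n_new):
        seed_row = rng.choice(minority)
        samples.append([x + rng.gauss(0.0, h) for x, h in zip(seed_row, bandwidth)])
    return samples

rng = random.Random(0)
# Hypothetical minority-class rows (e.g. two features of insolvent firms).
minority = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5]]
duplicates = oversample_with_replacement(minority, 5, rng)
synthetic = smoothed_bootstrap(minority, 5, bandwidth=[0.05, 0.3], rng=rng)
```

Rows in `duplicates` repeat existing observations exactly, which is where the overfitting risk comes from, while rows in `synthetic` scatter around them.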

Research on Fault Diagnosis of Wind Power Generator Blade Based on SC-SMOTE and kNN

  • Peng, Cheng;Chen, Qing;Zhang, Longxin;Wan, Lanjun;Yuan, Xinpan
    • Journal of Information Processing Systems / v.16 no.4 / pp.870-881 / 2020
  • Because SCADA monitoring data of wind turbines are large and fast changing, the unbalanced proportion of data across working conditions makes it difficult to process fault feature data. The existing methods mainly introduce new, non-repeating instances by interpolating between adjacent minority samples. To overcome the shortcoming of these methods, which do not consider boundary conditions when balancing data, an improved over-sampling balancing algorithm, SC-SMOTE (safe circle synthetic minority oversampling technology), is proposed to optimize data sets. Then, for the balanced data sets, a fault diagnosis method based on improved k-nearest neighbors (kNN) classification for wind turbine blade icing is adopted. Experimental results show that, compared with the SMOTE algorithm, the method is effective in diagnosing fan blade icing faults and improves diagnostic accuracy.
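
The interpolation step that the abstract attributes to existing methods can be sketched as follows. This is plain SMOTE-style interpolation, not SC-SMOTE: the safe-circle boundary handling is not described in the abstract and is not reproduced here. The function name `smote`, its parameters, and the toy data are assumptions for illustration.

```python
import math
import random

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = sorted((p for p in minority if p is not base),
                        key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

rng = random.Random(42)
# Hypothetical minority-class samples on the corners of a unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2, rng=rng)
```

Each new point lies on a line segment between two existing minority samples, so it never leaves the region they span.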

Method for Assessing Landslide Susceptibility Using SMOTE and Classification Algorithms (SMOTE와 분류 기법을 활용한 산사태 위험 지역 결정 방법)

  • Yoon, Hyung-Koo
    • Journal of the Korean Geotechnical Society / v.39 no.6 / pp.5-12 / 2023
  • Proactive assessment of landslide susceptibility is necessary for minimizing casualties. This study proposes a methodology for classifying the landslide safety factor using machine learning classification algorithms. The high-risk area model is adopted to perform the classification, and eight geotechnical parameters are used as inputs. Four classification algorithms (decision tree, k-nearest neighbor, logistic regression, and random forest) are employed to compare classification accuracy for safety factors ranging between 1.2 and 2.0. Notably, high accuracy is demonstrated in the safety factor range of 1.2 to 1.7, but relatively low accuracy is obtained in the range of 1.8 to 2.0. To overcome this issue, the synthetic minority over-sampling technique (SMOTE) is adopted to generate additional data. The application of SMOTE improves the average accuracy by ~250% in the safety factor range of 1.8 to 2.0. The results demonstrate that the SMOTE algorithm improves the accuracy of classification algorithms when applied to geotechnical data.

ISAR IMAGING FROM TARGET CAD MODELS

  • Yoo, Ji-Hee;Kwon, Kyung-Il
    • Proceedings of the KSRS Conference / 2005.10a / pp.550-553 / 2005
  • To acquire radar target signatures, various kinds of targets are necessary. Measurement is one method of acquiring data, but obtaining target data from real targets requires much time and high cost. Even if we can afford that, the targets we can access are very limited. To obtain target signatures while avoiding these problems, we build target CAD (Computer Aided Design) models for the calculation of target signatures. To speed up RCS (radar cross section) calculation, we applied adaptive super-sampling and tested a quite complex tank CAD model with about 140,000 facets. We use the calculated RCS data for 1D range profile and 2D ISAR (Inverse Synthetic Aperture Radar) image formation. We adopted the IFFT (Inverse Fast Fourier Transform) algorithm combined with a polar formatting algorithm for the ISAR imaging. We could confirm the possibility of constructing a database from the images of CAD models for target classification applications.
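
How an inverse Fourier transform turns frequency-domain RCS samples into a 1D range profile can be sketched in Python. This is a toy example under stated assumptions, not the paper's pipeline: it uses a direct inverse DFT instead of a library IFFT, models one ideal point scatterer, omits polar formatting and 2D imaging entirely, and the radar parameters (`n_freq`, `f0`, `df`) are hypothetical.

```python
import cmath
import math

C = 299_792_458.0  # speed of light, m/s

def range_profile(freq_response):
    """Inverse DFT of stepped-frequency samples gives a 1D range profile
    (magnitude per range bin)."""
    n = len(freq_response)
    return [abs(sum(s * cmath.exp(2j * math.pi * k * m / n)
                    for k, s in enumerate(freq_response)) / n)
            for m in range(n)]

n_freq, f0, df = 64, 10e9, 5e6       # hypothetical: 64 steps of 5 MHz from 10 GHz
range_bin = C / (2 * n_freq * df)    # range resolution in metres
target_range = 20 * range_bin        # one point scatterer placed in bin 20

# Two-way phase delay of a unit point scatterer at each stepped frequency.
response = [cmath.exp(-4j * math.pi * (f0 + k * df) * target_range / C)
            for k in range(n_freq)]
profile = range_profile(response)
peak_bin = max(range(n_freq), key=profile.__getitem__)
```

The profile peaks in the bin where the scatterer was placed, which is the property ISAR imaging builds on when it repeats the transform across aspect angles.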

Mining Clusters of Sequence Data using Sequence Element-based Similarity Measure (시퀀스 요소 기반의 유사도를 이용한 시퀀스 데이터 클러스터링)

  • 오승준;김재련
    • Proceedings of the Korea Intelligent Information System Society Conference / 2004.11a / pp.221-229 / 2004
  • Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, only a few of the existing clustering algorithms consider sequentiality. This study presents a method for clustering such sequence datasets. The similarity between sequences must be decided before clustering the sequences. This study proposes a new similarity measure to compute the similarity between two sequences using a sequence element. Two clustering algorithms using the proposed similarity measure are proposed: a hierarchical clustering algorithm and a scalable clustering algorithm that uses sampling and a k-nearest neighbor method. Using a splice dataset and synthetic datasets, we show that the quality of clusters generated by our proposed clustering algorithms is better than that of clusters produced by traditional clustering algorithms.

Inversion of Geophysical Data Using Genetic Algorithms (유전적 기법에 의한 지구물리자료의 역산)

  • Kim, Hee Joon
    • Economic and Environmental Geology / v.28 no.4 / pp.425-431 / 1995
  • Genetic algorithms are so named because they are analogous to biological processes. The model parameters are coded in binary form. The algorithm then starts with a randomly chosen population of models called chromosomes. The second step is to evaluate the fitness values of these models, measured by a correlation between the data and the synthetics for a particular model. Then, the three genetic processes of selection, crossover, and mutation are performed upon the models in sequence. Genetic algorithms share the favorable characteristics of random Monte Carlo methods over local optimization methods in that they do not require linearizing assumptions or the calculation of partial derivatives, are independent of the misfit criterion, and avoid numerical instabilities associated with matrix inversion. An additional advantage over conventional methods such as iterative least squares is that the sampling is global, rather than local, thereby reducing the tendency to become entrapped in local minima and avoiding the dependency on an assumed starting model.
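
The steps listed in this abstract (binary coding, a random starting population, fitness evaluation, then selection, crossover, and mutation) can be sketched on a toy inversion problem. Everything specific here is an assumption for illustration: the one-parameter linear forward model, the 8-bit coding, tournament selection, and the use of negative squared misfit as fitness in place of the correlation measure the abstract mentions.

```python
import random

BITS = 8  # hypothetical chromosome length

def decode(chrom, lo=0.0, hi=5.0):
    """Map a binary chromosome to a model parameter in [lo, hi]."""
    value = int("".join(map(str, chrom)), 2)
    return lo + (hi - lo) * value / (2 ** BITS - 1)

def fitness(chrom, xs, data):
    """Negative squared misfit between data and synthetics (higher is better)."""
    slope = decode(chrom)
    return -sum((d - slope * x) ** 2 for x, d in zip(xs, data))

def evolve(xs, data, pop_size=30, generations=40, p_mut=0.02, seed=1):
    rng = random.Random(seed)
    # Start from a randomly chosen population of chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(BITS)] for _ in range(pop_size)]
    best = max(pop, key=lambda c: fitness(c, xs, data))

    def select(population):
        # Tournament selection: the fitter of two random chromosomes.
        a, b = rng.sample(population, 2)
        return a if fitness(a, xs, data) >= fitness(b, xs, data) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = select(pop), select(pop)
            cut = rng.randint(1, BITS - 1)      # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
        # Remember the best chromosome seen in any generation.
        best = max(pop + [best], key=lambda c: fitness(c, xs, data))
    return decode(best)

xs = [1.0, 2.0, 3.0, 4.0]
data = [2.0 * x for x in xs]   # synthetic observations with true slope 2.0
estimate = evolve(xs, data)
```

No derivatives or starting model are needed: the search only ever evaluates the forward model, which is the property the abstract highlights.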

Application of In-direct Estimation for Small Area Statistics (소지역 통계 생산을 위한 추정방법)

  • Kim, Young-Won;Sung, Na-Young
    • Journal of the Korean Data and Information Science Society / v.11 no.1 / pp.111-126 / 2000
  • Small area estimation is becoming important in survey sampling due to a growing demand for reliable small area statistics. In estimating means, totals, and other parameters for small areas of a finite population, sample sizes for small areas are typically small because the overall sample size is usually determined to provide specific accuracy at a much higher level of aggregation than that of the small area. The usual direct estimators, which use only the information obtained from the sample in a given small area, provide unreliable estimates. However, indirect estimators utilize information from areas related to a given small area, that is, they borrow strength from other related areas, and so give more accurate estimates than direct estimators. In this paper we investigate small area estimation methods such as the synthetic, composite, and empirical best linear unbiased prediction estimators, apply them to real domestic data from the 1996 Survey of Hotels and Restaurants in In-Chon, and then evaluate the performance of these methods by measuring average squared errors. This evaluation shows that indirect estimators, which are small area estimation methods, are more efficient than direct estimators.
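
The difference between direct, synthetic, and composite estimators can be shown with a small worked example in Python. The formulas are the textbook forms (a synthetic estimate weights large-area group means by the small area's group sizes; a composite estimate is a weighted average of the two). The function names, the fixed `weight`, and all numbers are hypothetical and are not taken from the survey data in the paper.

```python
def direct_estimate(area_sample):
    """Mean of the (few) units sampled inside the small area itself."""
    return sum(area_sample) / len(area_sample)

def synthetic_estimate(group_means, area_group_counts):
    """Weight large-area group means by the small area's group sizes,
    borrowing strength from the related areas behind those means."""
    total = sum(area_group_counts.values())
    return sum(group_means[g] * n / total for g, n in area_group_counts.items())

def composite_estimate(direct, synthetic, weight):
    """Weighted average of the direct and synthetic estimates."""
    return weight * direct + (1 - weight) * synthetic

# Hypothetical mean sales by establishment type across the whole region.
group_means = {"hotel": 120.0, "restaurant": 40.0}
area_counts = {"hotel": 10, "restaurant": 90}   # known counts in one small area
direct = direct_estimate([35.0, 80.0, 50.0])    # only three sampled units
synthetic = synthetic_estimate(group_means, area_counts)
composite = composite_estimate(direct, synthetic, weight=0.3)
```

With three sampled units the direct estimate is unstable, so a small `weight` leans the composite toward the synthetic value; in practice the weight would be chosen from the data instead of being fixed by hand.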

Data-Driven Kinematic Control for Robotic Spatial Augmented Reality System with Loose Kinematic Specifications

  • Lee, Ahyun;Lee, Joo-Haeng;Kim, Jaehong
    • ETRI Journal / v.38 no.2 / pp.337-346 / 2016
  • We propose a data-driven kinematic control method for a robotic spatial augmented reality (RSAR) system. We assume a scenario where a robotic device and a projector-camera unit (PCU) are assembled in an ad hoc manner with loose kinematic specifications, which hinders the application of a conventional kinematic control method based on exact link and joint specifications. In the proposed method, the kinematic relation between a PCU and the joints is represented as a set of B-spline surfaces based on sample data rather than analytic or differential equations. The sampling process, which automatically records the values of joint angles and the corresponding external parameters of a PCU, is performed as an off-line process when an RSAR system is installed. In an on-line process, an external parameter of a PCU at a certain joint configuration, which is directly readable from the motors, can be computed by evaluating the pre-built B-spline surfaces. We provide details of the proposed method and validate the model through a comparison with an analytic RSAR model with synthetic noise added to simulate assembly errors.