• Title/Summary/Keyword: oversampling technique (오버샘플링 기법)

Search Result 57, Processing Time 0.03 seconds

A Data Sampling Technique for Secure Dataset Using Weight VAE Oversampling(W-VAE) (가중치 VAE 오버샘플링(W-VAE)을 이용한 보안데이터셋 샘플링 기법 연구)

  • Kang, Hanbada;Lee, Jaewoo
    • Journal of the Korea Institute of Information and Communication Engineering, v.26 no.12, pp.1872-1879, 2022
  • With advances in artificial intelligence, research on using AI to detect hacking attacks is being actively conducted. However, security data is typically imbalanced, which is recognized as a major obstacle to composing the training data that AI model development depends on. This paper therefore proposes W-VAE, an oversampling technique that uses a VAE, a deep generative model, to produce the data used for oversampling, and sets the number of samples to generate for each class through a weight calculation based on K-NN. A total of five oversampling techniques, including ROS, SMOTE, and ADASYN, were applied to NSL-KDD, an open network security dataset. Measured by F1-score, the proposed method was the most effective of the oversampling methods compared.
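The abstract above does not give the weight formula, so the following is only a rough sketch of how K-NN based weights could set per-class oversampling counts. It assumes (this is not from the paper) that a class's weight is the average share of differently labeled points among each member's k nearest neighbours, and that the deficit relative to the largest class is scaled by (1 + weight). The function names `knn_class_weights` and `oversampling_counts` are hypothetical, and the VAE that would generate the samples is not shown.

```python
import math
from collections import Counter


def knn_class_weights(points, labels, k=3):
    """Assumed weighting rule (not the paper's): for each class, average the
    fraction of each member's k nearest neighbours that carry another label."""
    sums = Counter()
    counts = Counter(labels)
    for i, (p, y) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, nearest first.
        others = sorted(
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points) if j != i
        )
        neighbours = others[:k]
        sums[y] += sum(1 for _, lab in neighbours if lab != y) / len(neighbours)
    return {c: sums[c] / counts[c] for c in counts}


def oversampling_counts(labels, weights):
    """Assumed allocation rule: scale each class's deficit relative to the
    largest class by (1 + weight), so harder classes get more new samples."""
    counts = Counter(labels)
    target = max(counts.values())
    return {c: round((target - n) * (1 + weights[c])) for c, n in counts.items()}


# Toy data: four "normal" points and two "attack" points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (0.5, 0.5)]
labels = ["normal"] * 4 + ["attack"] * 2
weights = knn_class_weights(points, labels, k=3)
to_generate = oversampling_counts(labels, weights)
```

In this toy case the minority class sits partly inside the majority cluster, so it receives a higher weight and therefore a larger generation budget than a plain balance-to-majority rule would give it.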

Oversampling scheme using Conditional GAN (Conditional GAN을 활용한 오버샘플링 기법)

  • Son, Minjae;Jung, Seungwon;Hwang, Eenjun
    • Annual Conference of KIPS, 2018.10a, pp.609-612, 2018
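  • Various algorithms have been studied for solving classification problems in machine learning. However, most existing classification algorithms are trained under the assumption that each class contains roughly the same number of samples, so classification accuracy tends to drop when the classes are imbalanced. To address this problem, this paper proposes an oversampling technique that balances the number of samples using Conditional Generative Adversarial Networks (CGAN). The CGAN learns the characteristics of the data in the under-represented class and generates data similar to the real data. Equalizing the number of samples per class in this way raises the accuracy of the classification algorithm. Using real collected data, we show that CGAN-based oversampling is effective and, in comparison with existing oversampling techniques, that it outperforms them.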

Data Sampling Using Oversampling Technique for Estimating Two-Dimensional Dispersion Coefficients (2차원 분산계수 경험식 산정을 위한 오버샘플링 기법 활용 데이터 샘플링)

  • Lee, Sun Mi;Park, In Hwan
    • Proceedings of the Korea Water Resources Association Conference, 2021.06a, pp.449-449, 2021
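  • Pollutants usually enter rivers from point sources whose concentrations can be predicted, such as sewage treatment plants, but water pollution accidents in which a large quantity of hazardous material enters a river at once also occur. In particular, when the inflow point is close to a water intake, understanding pollutant mixing analysis is very important for accident response and water quality management. In natural rivers, the transport and dispersion of pollutants are highly complex because of factors such as the non-uniform velocity structure caused by meandering. Accurately simulating how these topographic and hydraulic characteristics affect the mixing behavior of pollutants requires a three-dimensional numerical model. However, because most rivers have a very large width-to-depth ratio, two-dimensional numerical models that adopt the two-dimensional advection-dispersion equation as the governing equation have been widely used. The solution of the two-dimensional advection-dispersion equation varies with the longitudinal and transverse dispersion coefficients supplied as input, so determining these coefficients is critical for accurate mixing analysis. Previous studies have calculated the transverse dispersion coefficient from empirical formulas based on basic hydraulic quantities. For the longitudinal dispersion coefficient, not enough experimental data have accumulated to derive an empirical formula, so Elder's theoretical formula (Elder, 1959), derived under the assumption of ideal flow conditions, has been used. Many studies, however, have suggested that Elder's formula may underestimate the longitudinal dispersion coefficient. Data that represent the shear-flow dispersion characteristics of rivers are therefore needed, both to derive an empirical formula for the longitudinal dispersion coefficient and to improve the accuracy of the transverse dispersion coefficient. In this study, an oversampling technique was applied to extend the two-dimensional tracer test data from earlier studies, with the aim of analyzing whether dispersion coefficients can be estimated by machine learning. Among oversampling techniques, SMOTE was used to extend the scarce tracer test data. The reliability of the data produced by oversampling was verified, and its potential use in estimating two-dimensional longitudinal and transverse dispersion coefficients with machine learning was analyzed.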
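As a minimal sketch of the interpolation idea behind SMOTE, which the abstract above names as its oversampling method: each synthetic point is placed at a random position on the segment between an existing sample and one of its k nearest neighbours. The function name `smote`, its parameters, and the sample values are illustrative assumptions, not the authors' implementation or data.

```python
import math
import random


def smote(samples, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen sample and one of its k nearest neighbours in `samples`."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        # k nearest neighbours of the base point, excluding the point itself.
        neighbours = sorted(
            (p for j, p in enumerate(samples) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Made-up two-feature samples standing in for scarce measurements.
scarce = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
extended = scarce + smote(scarce, n_new=5)
```

Because every synthetic point is a convex combination of two real samples, the extended set never leaves the range spanned by the original data, which is one reason the reliability of the generated data still has to be checked separately.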

Optimal Ratio of Data Oversampling Based on a Genetic Algorithm for Overcoming Data Imbalance (데이터 불균형 해소를 위한 유전알고리즘 기반 최적의 오버샘플링 비율)

  • Shin, Seung-Soo;Cho, Hwi-Yeon;Kim, Yong-Hyuk
    • Journal of the Korea Convergence Society, v.12 no.1, pp.49-55, 2021
  • With the development of database technology, it has become possible to store the large volumes of data generated in finance, security, and networks. These data are analyzed with machine-learning classifiers, and the main problem in doing so is data imbalance. Training on imbalanced data can degrade classification accuracy because the model overfits to the majority class. To overcome this, oversampling strategies that increase the amount of minority-class data are widely used, but they require tuning to find a method and parameters suited to the data distribution. To improve this process, we propose a strategy that uses a genetic algorithm to explore and optimize the combination and ratio of oversampling methods, including the synthetic minority oversampling technique and generative adversarial networks. We oversample credit card fraud detection data, a representative case of data imbalance, with the proposed strategy and with single oversampling strategies, and compare the performance of classifiers trained on each resulting dataset. The strategy optimized by searching for the ratio of each method with a genetic algorithm outperformed the previous strategies.
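The search over oversampling ratios described above could look roughly like the following toy genetic algorithm. This is a sketch under stated assumptions: the chromosome is a list of per-method ratios in [0, 1], selection is simple truncation with elitism, and `toy_fitness` is a made-up stand-in; in the study the fitness would come from the performance of a classifier trained on the oversampled data. The names `evolve_ratios` and `toy_fitness` are hypothetical.

```python
import random


def evolve_ratios(fitness, n_methods, pop_size=20, generations=30, seed=0):
    """Search for per-method oversampling ratios in [0, 1] that maximise
    `fitness`, using truncation selection, one-point crossover and mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_methods)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_methods) if n_methods > 1 else 0
            child = a[:cut] + b[cut:]  # one-point crossover
            j = rng.randrange(n_methods)  # mutate one gene, clamp to [0, 1]
            child[j] = min(1.0, max(0.0, child[j] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


def toy_fitness(ratios):
    """Hypothetical stand-in: pretends the best mix is 0.6 of one method and
    0.3 of another. A real run would train and score a classifier here."""
    return -((ratios[0] - 0.6) ** 2 + (ratios[1] - 0.3) ** 2)


best_ratios = evolve_ratios(toy_fitness, n_methods=2)
```

Keeping the better half of each generation unchanged means the best ratio found so far is never lost, at the cost of less diversity than tournament or roulette selection would give.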

A Deep Learning Based Over-Sampling Scheme for Imbalanced Data Classification (불균형 데이터 분류를 위한 딥러닝 기반 오버샘플링 기법)

  • Son, Min Jae;Jung, Seung Won;Hwang, Een Jun
    • KIPS Transactions on Software and Data Engineering, v.8 no.7, pp.311-316, 2019
  • A classification problem is to predict the class to which an input belongs. One of the most popular approaches is to train a machine learning algorithm on a given dataset, which should have a well-balanced class distribution for the best performance. When the class distribution is imbalanced, classification performance can be very poor. To overcome this problem, we propose an over-sampling scheme that balances the number of samples by using Conditional Generative Adversarial Networks (CGAN). CGAN is a generative model developed from Generative Adversarial Networks (GAN) that can learn data characteristics and generate data similar to real data. CGAN can therefore generate data for a class with few samples, which mitigates the problem caused by the imbalanced class distribution and improves classification performance. Experiments using actual collected data show that the over-sampling technique using CGAN is effective and that it is superior to existing over-sampling techniques.

Improving BMI Classification Accuracy with Oversampling and 3-D Gait Analysis on Imbalanced Class Data

  • Beom Kwon
    • Journal of the Korea Society of Computer and Information, v.29 no.9, pp.9-23, 2024
  • In this study, we propose a method to improve the classification accuracy of body mass index (BMI) estimation techniques based on three-dimensional gait data. In previous studies on BMI estimation techniques, the classification accuracy was only about 60%. In this study, we identify the reasons for the low BMI classification accuracy. According to our analysis, the reason is the use of the undersampling technique to address the class imbalance problem in the gait dataset. We propose applying oversampling instead of undersampling to solve the class imbalance issue. We also demonstrate the usefulness of anthropometric and spatiotemporal features in gait data-based BMI estimation techniques. Previous studies evaluated the usefulness of anthropometric and spatiotemporal features in the presence of undersampling techniques and reported that their combined use leads to lower BMI estimation performance than when using either feature alone. However, our results show that using both features together and applying an oversampling technique achieves state-of-the-art performance with 92.92% accuracy in the BMI estimation problem.

A Hybrid Oversampling Technique for Imbalanced Structured Data based on SMOTE and Adapted CycleGAN (불균형 정형 데이터를 위한 SMOTE와 변형 CycleGAN 기반 하이브리드 오버샘플링 기법)

  • Jung-Dam Noh;Byounggu Choi
    • Information Systems Review, v.24 no.4, pp.97-118, 2022
  • As generative adversarial network (GAN) based oversampling techniques have achieved impressive results on class imbalance in unstructured datasets such as images, many studies have begun to apply them to the imbalance problem in structured datasets. However, these studies fail to reflect the characteristics of structured data because they convert it into an unstructured data format. To overcome this limitation, this study adapted CycleGAN to reflect the characteristics of structured data and proposed a hybrid of the synthetic minority oversampling technique (SMOTE) and the adapted CycleGAN. In particular, it uses a one-dimensional convolutional neural network, unlike previous studies that used two-dimensional convolutional neural networks. The proposed method was tested on various datasets and its performance was compared with existing oversampling methods such as SMOTE and adaptive synthetic sampling (ADASYN). The results indicated that the proposed hybrid method outperformed the existing methods when the data had more dimensions or a higher degree of imbalance. This suggests that classification performance on oversampled structured data can be improved by a hybrid oversampling method that takes the characteristics of structured data into account.

Exploring the Performance of Synthetic Minority Over-sampling Technique (SMOTE) to Predict Good Borrowers in P2P Lending (P2P 대부 우수 대출자 예측을 위한 합성 소수집단 오버샘플링 기법 성과에 관한 탐색적 연구)

  • Costello, Francis Joseph;Lee, Kun Chang
    • Journal of Digital Convergence, v.17 no.9, pp.71-78, 2019
  • This study aims to identify good borrowers in the context of P2P lending, a growing platform that allows individuals to lend money to and borrow money from each other. Any loan carries the borrower's credit risk, which needs to be considered before lending. In P2P lending specifically, traditional models fall short, so this study aimed to rectify this and to explore the class imbalance seen in credit risk datasets. It implemented an over-sampling technique known as the Synthetic Minority Over-sampling Technique (SMOTE). To test the approach, five benchmark classifiers were implemented: support vector machines, logistic regression, k-nearest neighbor, random forest, and a deep neural network. The data sample was retrieved from the publicly available LendingClub dataset. Applying SMOTE yielded significantly improved results compared with the benchmark classifiers. These results should help actors engaged in P2P lending make better-informed decisions when selecting potential borrowers, eliminating the higher risks present in P2P lending.

Comparison of Anomaly Detection Performance Based on GRU Model Applying Various Data Preprocessing Techniques and Data Oversampling (다양한 데이터 전처리 기법과 데이터 오버샘플링을 적용한 GRU 모델 기반 이상 탐지 성능 비교)

  • Yoo, Seung-Tae;Kim, Kangseok
    • Journal of the Korea Institute of Information Security & Cryptology, v.32 no.2, pp.201-211, 2022
  • With the recent shift in the cybersecurity paradigm, research on anomaly detection using machine learning and deep learning, which are technologies for implementing AI, is increasing. This study compares data preprocessing techniques that can improve the anomaly detection performance of a GRU (Gated Recurrent Unit) neural network-based intrusion detection model on NGIDS-DS (Next Generation IDS Dataset), an open dataset. In addition, to address the class imbalance between normal data and attack data, detection performance was compared and analyzed across oversampling ratios, using an oversampling technique based on DCGAN (Deep Convolutional Generative Adversarial Networks). In the experiments, preprocessing the system call feature and the process execution path feature with the Doc2Vec algorithm showed good performance, and oversampling with DCGAN improved detection performance.

Optimization of Recall in Malicious Packet Classification Models Using Oversampling Techniques (오버샘플링 기법을 적용한 악성 패킷 분류 모델의 리콜 지표 최적화)

  • Seongil Kim;Heonchang Yu
    • Annual Conference of KIPS, 2024.10a, pp.427-430, 2024
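  • As cyberattacks have recently become more sophisticated and diverse, the importance of network security has grown. In particular, malicious packets, including those carrying malware, can cause serious damage such as system infection and information leakage, so technology that can detect and block them effectively is essential. Existing AI-based intrusion detection systems have been built on a single classification model in order to balance various performance metrics (accuracy, precision, recall, and so on). In this study, an oversampling technique was applied with the specific goal of maximizing recall, so that no malicious packet goes undetected. Through this, we aim to compensate for the limitations of existing systems and to show the need for a new performance evaluation criterion that can respond effectively to all cyberattacks.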
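For reference, recall is the share of truly malicious packets that the model flags, TP / (TP + FN), while precision is the share of flagged packets that are truly malicious, TP / (TP + FP). The small sketch below computes both from label lists; the function name `precision_recall` and the example labels are made up for illustration and are not results from the paper.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 1 = malicious packet, 0 = benign packet (illustrative labels only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
cautious = [1, 1, 0, 1, 0, 0, 0, 0]    # misses one malicious packet
aggressive = [1, 1, 1, 1, 1, 0, 0, 0]  # flags more packets overall

p_cautious, r_cautious = precision_recall(y_true, cautious)
p_aggressive, r_aggressive = precision_recall(y_true, aggressive)
```

Here the aggressive predictions catch every malicious packet (recall 1.0) but precision falls from 2/3 to 0.6, which is the trade-off a recall-first evaluation criterion accepts.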