• Title/Summary/Keyword: Data Imbalance


Severity-based Software Quality Prediction using Class Imbalanced Data

  • Hong, Euy-Seok;Park, Mi-Kyeong
    • Journal of the Korea Society of Computer and Information / v.21 no.4 / pp.73-80 / 2016
  • Most fault prediction models suffer from class imbalance because training data usually contain far more non-fault modules than fault modules. This imbalanced distribution makes it difficult for the models to learn the minority class. The imbalance is even greater in severity-based fault prediction, because high-severity fault modules are a smaller subset of the fault modules. In this paper, we propose severity-based models that address these problems using three sampling methods: Resample, SpreadSubSample, and SMOTE. Empirical results show that the Resample method suffers from typical overfitting and that the SpreadSubSample method does not improve prediction performance. Unlike these two methods, SMOTE performs well in terms of AUC and FNR. In particular, the J48 decision tree model using SMOTE outperforms the other prediction models.

Resolving data imbalance through differentiated anomaly data processing based on verification data (검증데이터 기반의 차별화된 이상데이터 처리를 통한 데이터 불균형 해소 방법)

  • Hwang, Chulhyun
    • Journal of Intelligence and Information Systems / v.28 no.4 / pp.179-190 / 2022
  • Data imbalance refers to a situation in which the number of samples in one category is much larger or much smaller than in another. It has been identified as a major factor that degrades the performance of machine learning classifiers. To solve the data imbalance problem, various oversampling methods that amplify minority-class data have been proposed; SMOTE is the most representative. To maximize the amplification effect, variants have emerged that remove noise from the data (SMOTE-IPF) or reinforce only samples near the class boundary (Borderline-SMOTE). This paper proposes a method that improves classification performance by changing how anomalous data are handled in the traditional SMOTE method. In experiments, the proposed method consistently achieved higher classification performance than the existing methods.
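
The core idea of SMOTE-style oversampling mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's method: it shows only the basic interpolation step (pick a minority sample, pick one of its nearest minority neighbours, and create a point on the segment between them). The function name `smote_sketch`, the value of `k`, and the sample points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-dimensional minority-class samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region the minority class already occupies; noise-filtering and borderline variants differ mainly in which base samples they allow.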

Research on the Amount of Empty Containers in Japanese Main Ports

  • Kubo, Masayoshi;Zhang, Wenhui
    • Proceedings of the Korean Institute of Navigation and Port Research Conference / 2004.08a / pp.87-95 / 2004
  • Economic development in Asia is remarkable, and the industrialization of the NIES, ASEAN, and China has increased international physical distribution in East Asia. However, the trade imbalance in these areas has become severe. The imbalance is especially large on the Asia-North America route and the Japan-China route. The imbalance on the Asia-North America liner route was 5.04 million TEU in 2002. The ratio of loaded containers on the China-Japan route was approximately 3:1 in 2000; in other words, for every 3 loaded containers transported from China to Japan, 1 was transported from Japan to China. The imbalance at a port is generally obtained by subtracting the export loaded container cargo volume from the import container cargo volume. However, the imbalance and the number of empty containers at the port are not always the same. Therefore, to evaluate the rationalization and efficiency of maritime container transportation, we introduce the amount of empty containers at a port as an evaluation index. Past data on empty container handling, however, have many gaps, so the past amount of empty containers must be estimated in order to understand its historical trend. We therefore construct a model that estimates the amount of empty containers using the imbalance in the statistics of the main ports in Japan.


Analysis and Compensation of I/Q Amplitude Imbalance In Coherent PON Systems (코히어런트 PON시스템의 I/Q 진폭불균형 분석 및 보상)

  • Kim, Nayeong;Lee, Seungwoo;Park, Youngil
    • The Journal of Korean Institute of Communications and Information Sciences / v.40 no.10 / pp.1940-1946 / 2015
  • Optical coherent systems are being considered for next-generation optical access networks to increase the data rate and transmission distance. In such a system, however, I/Q amplitude imbalance may occur at several points, leading to serious performance degradation. The asymmetric structure of the coherent receiver at the subscriber's location is one source of I/Q imbalance. This imbalance must therefore be removed or compensated to secure transmission performance. In this paper, the sources of I/Q amplitude imbalance are analyzed, and a method for compensating for the imbalance at the receiver side is proposed. Performance after compensation is estimated by simulation.
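
The abstract does not describe the compensation scheme, so the following is only a generic sketch of one common receiver-side approach: estimate the Q/I amplitude ratio from the RMS of each branch and rescale Q. It assumes the transmitted signal carries equal power in I and Q; the function names and the test signal (Q attenuated to 0.8) are hypothetical.

```python
import math


def rms(values):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def compensate_amplitude(i_samples, q_samples):
    """Rescale Q so that its RMS amplitude matches that of I."""
    gain = rms(q_samples) / rms(i_samples)  # estimated Q/I amplitude ratio
    return gain, [q / gain for q in q_samples]


# Hypothetical test signal: unit-amplitude I branch, Q branch attenuated to 0.8.
n = 1000
phases = [2 * math.pi * 5 * t / n for t in range(n)]
i_samples = [math.cos(p) for p in phases]
q_samples = [0.8 * math.sin(p) for p in phases]

gain, q_fixed = compensate_amplitude(i_samples, q_samples)
print(round(gain, 3), round(rms(q_fixed) / rms(i_samples), 3))
```

This blind estimate corrects amplitude only; phase imbalance between the branches would need a separate step such as orthogonalization.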

A Study on Gait Imbalance Evaluation System based on Two-axis Angle using Encoder (인코더를 이용한 2축 각도 기반 보행 불균형 평가 시스템 연구)

  • Shim, Hyeon-min;Kim, Yoohyun;Cho, Woo-Hyeong;Kwon, Jangwoo;Lee, Sangmin
    • Journal of Institute of Control, Robotics and Systems / v.21 no.5 / pp.401-406 / 2015
  • In this study, a gait imbalance evaluation algorithm based on two-axis angles measured with an encoder is proposed. The experiment was carried out with 10 healthy adult males. Encoders were attached to the hip and knee joints to measure joint angles during normal and imbalanced gait. To verify the reliability of asymmetrical gait estimation from hip and knee angles, the results were compared with asymmetrical gait estimation using foot pressure. The Symmetry Index (SI) was used as the index for determining gait imbalance. With the encoder-based SI, normal gait and gait with a 1.5 cm imbalance were evaluated as normal gait, and gait with imbalances of 3 cm, 4 cm, and 6 cm was judged to be imbalanced. With the pressure-based SI, by contrast, all gait conditions except normal gait were evaluated as imbalanced. Both normal and imbalanced gait could thus be identified by measuring angle and pressure.
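
The abstract does not state the SI formula or the decision threshold used. As a rough illustration, the sketch below assumes a widely used definition, the absolute left-right difference divided by the mean of the two sides and expressed as a percentage. The function name and the joint-angle values are hypothetical.

```python
def symmetry_index(left, right):
    """Percentage difference between the two sides; 0 means perfect symmetry."""
    return abs(right - left) / (0.5 * (right + left)) * 100.0


# Hypothetical peak joint angles in degrees for the left and right legs.
si_equal = symmetry_index(40.0, 40.0)
si_uneven = symmetry_index(36.0, 44.0)
print(si_equal, si_uneven)
```

A larger SI indicates a more asymmetric gait; the cut-off that separates normal from imbalanced gait depends on the study and is not given here.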

Development of Evaluation Metrics that Consider Data Imbalance between Classes in Facies Classification (지도학습 기반 암상 분류 시 클래스 간 자료 불균형을 고려한 평가지표 개발)

  • Kim, Dowan;Choi, Junhwan;Byun, Joongmoo
    • Geophysics and Geophysical Exploration / v.23 no.3 / pp.131-140 / 2020
  • In training a classification model using machine learning, the acquisition of training data is a very important stage, because the amount and quality of the training data greatly influence model performance. However, when the cost of obtaining data is so high that it is difficult to build ideal training data, the number of samples acquired may differ greatly between classes, and a serious data-imbalance problem can occur. If such a problem occurs in the training data, the classes are not trained equally, and classes containing relatively few samples will have significantly lower recall values. Additionally, the reliability of evaluation indices such as accuracy and precision will be reduced. Therefore, this study sought to overcome the problem of data imbalance in two stages. First, we introduced weighted accuracy and weighted precision as new evaluation indices that take the data-imbalance ratio into account by modifying the conventional measures of accuracy and precision. Next, oversampling was performed to balance weighted precision and recall among classes. We verified the algorithm by applying it to the problem of facies classification. As a result, the imbalance between majority and minority classes was greatly mitigated, and the boundaries between classes could be more clearly identified.
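
The point that plain accuracy becomes unreliable under class imbalance can be seen in a small sketch. The weighted accuracy and weighted precision proposed in the paper are not defined in the abstract, so the example uses the standard balanced accuracy (mean per-class recall) as a stand-in. The facies labels and counts are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / true samples of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical imbalanced data: 95 "sand" samples, 5 "shale" samples,
# and a degenerate model that always predicts the majority class.
y_true = ["sand"] * 95 + ["shale"] * 5
y_pred = ["sand"] * 100

acc = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
balanced_acc = sum(recall.values()) / len(recall)
print(acc, recall, balanced_acc)
```

Here accuracy is 0.95 even though the minority class is never detected, while the per-class recall and the balanced accuracy of 0.5 expose the failure.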

Mitigating Data Imbalance in Credit Prediction using the Diffusion Model (Diffusion Model을 활용한 신용 예측 데이터 불균형 해결 기법)

  • Sangmin Oh;Juhong Lee
    • Smart Media Journal / v.13 no.2 / pp.9-15 / 2024
  • In this paper, a Diffusion Multi-step Classifier (DMC) is proposed to address the imbalance issue in credit prediction. DMC utilizes a Diffusion Model to generate continuous numerical data from credit prediction data and creates categorical data through a Multi-step Classifier. Compared to other algorithms generating synthetic data, DMC produces data with a distribution more similar to real data. Using DMC, data that closely resemble actual data can be generated, outperforming other algorithms for data generation. When experiments were conducted using the generated data, the probability of predicting delinquencies increased by over 20%, and overall predictive accuracy improved by approximately 4%. These research findings are anticipated to significantly contribute to reducing delinquency rates and increasing profits when applied in actual financial institutions.

Credit Card Bad Debt Prediction Model based on Support Vector Machine (신용카드 대손회원 예측을 위한 SVM 모형)

  • Kim, Jin Woo;Jhee, Won Chul
    • Journal of Information Technology Services / v.11 no.4 / pp.233-250 / 2012
  • In this paper, credit card delinquency means the possibility that bad debt will occur in the near future on normal accounts that currently have no debt, and the problem is to predict, on a monthly basis, the occurrence of delinquency 3 months in advance. This is a typical binary classification problem, but it suffers from data imbalance: instances of the target class are very few. To predict bad debt occurrence effectively, a Support Vector Machine (SVM) with the kernel trick is adopted, using credit card usage and payment patterns as inputs. SVM is widely accepted in the data mining community because of its prediction accuracy and low risk of overfitting. However, SVM is known to be limited in its ability to process large-scale data. To resolve the difficulties of applying SVM to bad debt prediction, two-stage clustering is suggested as an effective data reduction method, and ensembles of SVM models are adopted to mitigate the data imbalance intrinsic to the target problem. In experiments with real-world data from one of the major domestic credit card companies, the suggested approach shows better prediction accuracy than traditional data mining approaches that use neural networks, decision trees, or logistic regression. The SVM ensemble model learned from the T2 training set shows the best prediction results among the alternatives considered, and it is noteworthy that neural networks trained with T2 perform better than SVM trained with T1. These results show that the suggested approach is very effective both for SVM training and for classification under data imbalance.

Simulated Annealing for Overcoming Data Imbalance in Mold Injection Process (사출성형공정에서 데이터의 불균형 해소를 위한 담금질모사)

  • Dongju Lee
    • Journal of Korean Society of Industrial and Systems Engineering / v.45 no.4 / pp.233-239 / 2022
  • Injection molding is a process in which thermoplastic resin is heated to a fluid state, injected under pressure into the cavity of a mold, and then cooled in the mold to produce a product with the shape of the cavity. It enables mass production of complex shapes, and various factors such as resin temperature, mold temperature, injection speed, and pressure affect product quality. Data collected at the manufacturing site contain many records of good products but few records of defective products, resulting in serious data imbalance. To address this imbalance, undersampling, oversampling, and composite sampling are usually applied. In this study, we apply oversampling techniques that amplify minority-class data relative to the majority class, such as random oversampling (ROS), the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN), as well as composite sampling, which combines undersampling and oversampling. For composite sampling, SMOTE+ENN and SMOTE+Tomek were used. Artificial neural networks, specifically MLP and RNN, are used to predict product quality, and their various parameters must be optimized. We propose a simulated annealing (SA) technique that optimizes the choice of sampling method, the minority-class ratio for the sampling method, the batch size, and the number of hidden-layer units of the MLP and RNN. The existing sampling methods and the proposed SA method were compared using accuracy, precision, recall, and F1 score to demonstrate the superiority of the proposed method.

The Impact of Information on Stock Message Boards on Stock Trading Behaviors of Individual Investors based on Order Imbalance Analysis (온라인 주식게시판 정보가 주식투자자의 거래행태에 미치는 영향)

  • Kim, Hyun Mo;Park, Jae Hong
    • Information Systems Review / v.18 no.2 / pp.23-38 / 2016
  • Previous studies in information systems (IS) and finance suggest that information on stock message boards influences the investment decisions of individual investors. However, how information on online stock message boards influences an individual investor's buy or sell decisions is unclear. To address this question, we investigate the relationship between the number of posts on stock message boards and order imbalance in stock markets. Order imbalance is defined as the difference between the daily sum of buy-side shares traded and the daily sum of sell-side shares traded. It therefore indicates both the direction of trades and, together with trading volume, the strength of that direction. In this regard, this study examines how the number of posts (information on stock message boards) influences order imbalance (stock trading behavior). We collected about 46,077 messages on 40 companies in the Korea Composite Stock Price Index from Paxnet, the most popular Korean online stock message board. The messages were divided into in-trading and after-trading hours to examine the relationship between the number of posts and trading volumes. We also collected order imbalance data on individual investors. We then integrated the balanced panel data sets and analyzed them through vector regression. We found that the number of posts on online stock message boards is positively related to prior order imbalance. We believe that our findings contribute to knowledge in IS and finance. Furthermore, this study suggests that investors should carefully monitor information on stock message boards to understand stock market sentiment.
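
The definition of order imbalance given in this abstract (daily buy-side shares minus daily sell-side shares) can be written as a short sketch. The record layout `(date, side, shares)`, the function name, and the trade values are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict


def daily_order_imbalance(trades):
    """Sum buy-side shares minus sell-side shares for each trading day.

    trades: iterable of (date, side, shares), where side is "buy" or "sell".
    """
    imbalance = defaultdict(int)
    for date, side, shares in trades:
        if side not in ("buy", "sell"):
            raise ValueError(f"unknown side: {side!r}")
        imbalance[date] += shares if side == "buy" else -shares
    return dict(imbalance)


# Hypothetical trades for two days.
trades = [
    ("2016-01-04", "buy", 500),
    ("2016-01-04", "sell", 200),
    ("2016-01-05", "buy", 100),
    ("2016-01-05", "sell", 450),
]
result = daily_order_imbalance(trades)
print(result)
```

A positive value indicates net buying pressure on that day and a negative value net selling pressure; the magnitude reflects how strong that direction is.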