• Title/Summary/Keyword: Imbalanced Dataset

Search Result 49, Processing Time 0.025 seconds

Naive Bayes Classifier based Anomalous Propagation Echo Identification using Class Imbalanced Data (클래스 불균형 데이터를 이용한 나이브 베이즈 분류기 기반의 이상전파에코 식별방법)

  • Lee, Hansoo;Kim, Sungshin
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.20 no.6
    • /
    • pp.1063-1068
    • /
    • 2016
  • Anomalous propagation echo is a kind of abnormal radar signal occurred by irregularly refracted radar beam caused by temperature or humidity. The echo frequently appears in ground-based weather radar due to its observation principle and disturb weather forecasting process. In order to improve accuracy of weather forecasting, it is important to analyze radar data precisely. Therefore, there are several ongoing researches about identifying the anomalous propagation echo with data mining techniques. This paper conducts researches about implementation of classification method which can separate the anomalous propagation echo in the raw radar data using naive Bayes classifier with various kinds of observation results. Considering that collected data has a class imbalanced problem, this paper includes SMOTE method. It is confirmed that the fine classification results are derived by the suggested classifier with balanced dataset using actual appearance cases of the echo.

An Analytical Study on Automatic Classification of Domestic Journal articles Using Random Forest (랜덤포레스트를 이용한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for information Management
    • /
    • v.36 no.2
    • /
    • pp.57-77
    • /
    • 2019
  • Random Forest (RF), a representative ensemble technique, was applied to automatic classification of journal articles in the field of library and information science. Especially, I performed various experiments on the main factors such as tree number, feature selection, and learning set size in terms of classification performance that automatically assigns class labels to domestic journals. Through this, I explored ways to optimize the performance of random forests (RF) for imbalanced datasets in real environments. Consequently, for the automatic classification of domestic journal articles, Random Forest (RF) can be expected to have the best classification performance when using tree number interval 100~1000(C), small feature set (10%) based on chi-square statistic (CHI), and most learning sets (9-10 years).

TANFIS Classifier Integrated Efficacious Aassistance System for Heart Disease Prediction using CNN-MDRP

  • Bhaskaru, O.;Sreedevi, M.
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.10
    • /
    • pp.171-176
    • /
    • 2022
  • A dramatic rise in the number of people dying from heart disease has prompted efforts to find a way to identify it sooner using efficient approaches. A variety of variables contribute to the condition and even hereditary factors. The current estimate approaches use an automated diagnostic system that fails to attain a high level of accuracy because it includes irrelevant dataset information. This paper presents an effective neural network with convolutional layers for classifying clinical data that is highly class-imbalanced. Traditional approaches rely on massive amounts of data rather than precise predictions. Data must be picked carefully in order to achieve an earlier prediction process. It's a setback for analysis if the data obtained is just partially complete. However, feature extraction is a major challenge in classification and prediction since increased data increases the training time of traditional machine learning classifiers. The work integrates the CNN-MDRP classifier (convolutional neural network (CNN)-based efficient multimodal disease risk prediction with TANFIS (tuned adaptive neuro-fuzzy inference system) for earlier accurate prediction. Perform data cleaning by transforming partial data to informative data from the dataset in this project. The recommended TANFIS tuning parameters are then improved using a Laplace Gaussian mutation-based grasshopper and moth flame optimization approach (LGM2G). The proposed approach yields a prediction accuracy of 98.40 percent when compared to current algorithms.

Experimental Analysis of Bankruptcy Prediction with SHAP framework on Polish Companies

  • Tuguldur Enkhtuya;Dae-Ki Kang
    • International journal of advanced smart convergence
    • /
    • v.12 no.1
    • /
    • pp.53-58
    • /
    • 2023
  • With the fast development of artificial intelligence day by day, users are demanding explanations about the results of algorithms and want to know what parameters influence the results. In this paper, we propose a model for bankruptcy prediction with interpretability using the SHAP framework. SHAP (SHAPley Additive exPlanations) is framework that gives a visualized result that can be used for explanation and interpretation of machine learning models. As a result, we can describe which features are important for the result of our deep learning model. SHAP framework Force plot result gives us top features which are mainly reflecting overall model score. Even though Fully Connected Neural Networks are a "black box" model, Shapley values help us to alleviate the "black box" problem. FCNNs perform well with complex dataset with more than 60 financial ratios. Combined with SHAP framework, we create an effective model with understandable interpretation. Bankruptcy is a rare event, then we avoid imbalanced dataset problem with the help of SMOTE. SMOTE is one of the oversampling technique that resulting synthetic samples are generated for the minority class. It uses K-nearest neighbors algorithm for line connecting method in order to producing examples. We expect our model results assist financial analysts who are interested in forecasting bankruptcy prediction of companies in detail.

Ensemble Knowledge Distillation for Classification of 14 Thorax Diseases using Chest X-ray Images (흉부 X-선 영상을 이용한 14 가지 흉부 질환 분류를 위한 Ensemble Knowledge Distillation)

  • Ho, Thi Kieu Khanh;Jeon, Younghoon;Gwak, Jeonghwan
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2021.07a
    • /
    • pp.313-315
    • /
    • 2021
  • Timely and accurate diagnosis of lung diseases using Chest X-ray images has been gained much attention from the computer vision and medical imaging communities. Although previous studies have presented the capability of deep convolutional neural networks by achieving competitive binary classification results, their models were seemingly unreliable to effectively distinguish multiple disease groups using a large number of x-ray images. In this paper, we aim to build an advanced approach, so-called Ensemble Knowledge Distillation (EKD), to significantly boost the classification accuracies, compared to traditional KD methods by distilling knowledge from a cumbersome teacher model into an ensemble of lightweight student models with parallel branches trained with ground truth labels. Therefore, learning features at different branches of the student models could enable the network to learn diverse patterns and improve the qualify of final predictions through an ensemble learning solution. Although we observed that experiments on the well-established ChestX-ray14 dataset showed the classification improvements of traditional KD compared to the base transfer learning approach, the EKD performance would be expected to potentially enhance classification accuracy and model generalization, especially in situations of the imbalanced dataset and the interdependency of 14 weakly annotated thorax diseases.

  • PDF

Fire Detection Based on Image Learning by Collaborating CNN-SVM with Enhanced Recall

  • Yongtae Do
    • Journal of Sensor Science and Technology
    • /
    • v.33 no.3
    • /
    • pp.119-124
    • /
    • 2024
  • Effective fire sensing is important to protect lives and property from the disaster. In this paper, we present an intelligent visual sensing method for detecting fires based on machine learning techniques. The proposed method involves a two-step process. In the first step, fire and non-fire images are used to train a convolutional neural network (CNN), and in the next step, feature vectors consisting of 256 values obtained from the CNN are used for the learning of a support vector machine (SVM). Linear and nonlinear SVMs with different parameters are intensively tested. We found that the proposed hybrid method using an SVM with a linear kernel effectively increased the recall rate of fire image detection without compromising detection accuracy when an imbalanced dataset was used for learning. This is a major contribution of this study because recall is important, particularly in the sensing of disaster situations such as fires. In our experiments, the proposed system exhibited an accuracy of 96.9% and a recall rate of 92.9% for test image data.

Study on Detection Technique for Cochlodinium polykrikoides Red tide using Logistic Regression Model under Imbalanced Data (불균형 데이터 환경에서 로지스틱 회귀모형을 이용한 Cochlodinium polykrikoides 적조 탐지 기법 연구)

  • Bak, Su-Ho;Kim, Heung-Min;Kim, Bum-Kyu;Hwang, Do-Hyun;Enkhjargal, Unuzaya;Yoon, Hong-Joo
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.13 no.6
    • /
    • pp.1353-1364
    • /
    • 2018
  • This study proposed a method to detect Cochlodinium polykrikoides red tide pixels in satellite images using a logistic regression model of machine learning technique under Imbalanced data. The spectral profiles extracted from red tide, clear water, and turbid water were used as training dataset. 70% of the entire data set was extracted and used for as model training, and the classification accuracy of the model was evaluated using the remaining 30%. At this time, the white noise was added to the spectral profile of the red tide, which has a relatively small number of data compared to the clear water and the turbid water, and over-sampling was performed to solve the unbalanced data problem. As a result of the accuracy evaluation, the proposed algorithm showed about 94% classification accuracy.

An Experimental Study on the Automatic Classification of Korean Journal Articles through Feature Selection (자질선정을 통한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for information Management
    • /
    • v.39 no.1
    • /
    • pp.69-90
    • /
    • 2022
  • As basic data that can systematically support and evaluate R&D activities as well as set current and future research directions by grasping specific trends in domestic academic research, I sought efficient ways to assign standardized subject categories (control keywords) to individual journal papers. To this end, I conducted various experiments on major factors affecting the performance of automatic classification, focusing on feature selection techniques, for the purpose of automatically allocating the classification categories on the National Research Foundation of Korea's Academic Research Classification Scheme to domestic journal papers. As a result, the automatic classification of domestic journal papers, which are imbalanced datasets of the real environment, showed that a fairly good level of performance can be expected using more simple classifiers, feature selection techniques, and relatively small training sets.

A Hybrid Oversampling Technique for Imbalanced Structured Data based on SMOTE and Adapted CycleGAN (불균형 정형 데이터를 위한 SMOTE와 변형 CycleGAN 기반 하이브리드 오버샘플링 기법)

  • Jung-Dam Noh;Byounggu Choi
    • Information Systems Review
    • /
    • v.24 no.4
    • /
    • pp.97-118
    • /
    • 2022
  • As generative adversarial network (GAN) based oversampling techniques have achieved impressive results in class imbalance of unstructured dataset such as image, many studies have begun to apply it to solving the problem of imbalance in structured dataset. However, these studies have failed to reflect the characteristics of structured data due to changing the data structure into an unstructured data format. In order to overcome the limitation, this study adapted CycleGAN to reflect the characteristics of structured data, and proposed hybridization of synthetic minority oversampling technique (SMOTE) and the adapted CycleGAN. In particular, this study tried to overcome the limitations of existing studies by using a one-dimensional convolutional neural network unlike previous studies that used two-dimensional convolutional neural network. Oversampling based on the method proposed have been experimented using various datasets and compared the performance of the method with existing oversampling methods such as SMOTE and adaptive synthetic sampling (ADASYN). The results indicated the proposed hybrid oversampling method showed superior performance compared to the existing methods when data have more dimensions or higher degree of imbalance. This study implied that the classification performance of oversampling structured data can be improved using the proposed hybrid oversampling method that considers the characteristic of structured data.

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.1
    • /
    • pp.23-45
    • /
    • 2020
  • Big data is creating in a wide variety of fields such as medical care, manufacturing, logistics, sales site, SNS, and the dataset characteristics are also diverse. In order to secure the competitiveness of companies, it is necessary to improve decision-making capacity using a classification algorithm. However, most of them do not have sufficient knowledge on what kind of classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm is appropriate depending on the characteristics of the dataset was has been a task that required expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features reflecting the characteristics of multi-class. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, meta-features of multi-class datasets were identified into two factors, (the data structure and the data complexity,) and seven representative meta-features were selected. Among those, we included the Herfindahl-Hirschman Index (HHI), originally a market concentration measurement index, in the meta-features to replace IR(Imbalanced Ratio). Also, we developed a new index called Reverse ReLU Silhouette Score into the meta-feature set. Among the UCI Machine Learning Repository data, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality(red), Contraceptive Method Choice) were selected. The class of each dataset was classified by using the classification algorithms (KNN, Logistic Regression, Nave Bayes, Random Forest, and SVM) selected in the study. For each dataset, we applied 10-fold cross validation method. 10% to 100% oversampling method is applied for each fold and meta-features of the dataset is measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, Hub Score. F1-score was selected as the dependent variable. As a result, the results of this study showed that the six meta-features including Reverse ReLU Silhouette Score and HHI proposed in this study have a significant effect on the classification performance. (1) The meta-features HHI proposed in this study was significant in the classification performance. (2) The number of variables has a significant effect on the classification performance, unlike the number of classes, but it has a positive effect. (3) The number of classes has a negative effect on the performance of classification. (4) Entropy has a significant effect on the performance of classification. (5) The Reverse ReLU Silhouette Score also significantly affects the classification performance at a significant level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by the classification algorithms were also consistent. In the regression analysis by classification algorithm, Naïve Bayes algorithm does not have a significant effect on the number of variables unlike other classification algorithms. This study has two theoretical contributions: (1) two new meta-features (HHI, Reverse ReLU Silhouette score) was proved to be significant. (2) The effects of data characteristics on the performance of classification were investigated using meta-features. The practical contribution points (1) can be utilized in the development of classification algorithm recommendation system according to the characteristics of datasets. (2) Many data scientists are often testing by adjusting the parameters of the algorithm to find the optimal algorithm for the situation because the characteristics of the data are different. In this process, excessive waste of resources occurs due to hardware, cost, time, and manpower. This study is expected to be useful for machine learning, data mining researchers, practitioners, and machine learning-based system developers. The composition of this study consists of introduction, related research, research model, experiment, conclusion and discussion.