• Title/Summary/Keyword: oversampling algorithm

Malaria Epidemic Prediction Model by Using Twitter Data and Precipitation Volume in Nigeria

  • Nduwayezu, Maurice;Satyabrata, Aicha;Han, Suk Young;Kim, Jung Eon;Kim, Hoon;Park, Junseok;Hwang, Won-Joo
    • Journal of Korea Multimedia Society / v.22 no.5 / pp.588-600 / 2019
  • Each year, Malaria affects over 200 million people worldwide. The African continent is hit particularly hard by this disease. According to many studies, the continent offers ideal conditions for the Anopheles mosquitoes that transmit Malaria parasites to thrive. Rainfall volume is one of the major factors favoring the development of Anopheles in tropical Sub-Saharan Africa (SSA). However, the surveillance, monitoring, and reporting of this epidemic remain poor and bureaucratic. In this paper, we propose a method to rapidly monitor and report Malaria instances by using Social Network Systems (SNS) and precipitation volume in Nigeria. We used the Twitter search Application Programming Interface (API) to live-stream Twitter messages mentioning Malaria, preprocessed those Tweets, classified them into Malaria cases in Nigeria by using a Support Vector Machine (SVM) classification algorithm, and compared those Malaria cases with the average precipitation volume. The comparison yielded a correlation of 0.75 between Malaria cases recorded by using Twitter and average precipitation in Nigeria. To improve the reliability of our classification algorithm, we used an oversampling technique to eliminate the imbalance in our training Tweets.

Experimental Analysis of Bankruptcy Prediction with SHAP framework on Polish Companies

  • Tuguldur Enkhtuya;Dae-Ki Kang
    • International journal of advanced smart convergence / v.12 no.1 / pp.53-58 / 2023
  • With the rapid development of artificial intelligence, users are demanding explanations of the results of algorithms and want to know which parameters influence those results. In this paper, we propose a model for bankruptcy prediction with interpretability using the SHAP framework. SHAP (SHapley Additive exPlanations) is a framework that gives a visualized result that can be used for explanation and interpretation of machine learning models. As a result, we can describe which features are important for the result of our deep learning model. The SHAP force plot gives us the top features that mainly drive the overall model score. Even though Fully Connected Neural Networks (FCNNs) are a "black box" model, Shapley values help us alleviate the "black box" problem. FCNNs perform well on a complex dataset with more than 60 financial ratios. Combined with the SHAP framework, we create an effective model with understandable interpretation. Because bankruptcy is a rare event, we address the imbalanced-dataset problem with the help of SMOTE. SMOTE is an oversampling technique that generates synthetic samples for the minority class. It uses the K-nearest neighbors algorithm to select neighbors and produces new examples along the line segments connecting a sample to those neighbors. We expect our model's results to assist financial analysts who are interested in forecasting the bankruptcy of companies in detail.
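
The SMOTE step described in this abstract (pick a minority sample, pick one of its K nearest minority neighbors, and place a new point on the line segment between them) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and only NumPy is used.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    X_min: array-like of shape (n_samples, n_features), minority class only.
    Hypothetical helper for illustration; requires at least 2 samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # k nearest per sample

    # Pick a random base sample and one of its k neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position on the segment base -> neighbor.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy usage: four minority points on the unit square.
X_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sketch(X_minority, n_new=10, k=2)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region already occupied by the minority class, unlike plain duplication of existing rows.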

Development of Evaluation Metrics that Consider Data Imbalance between Classes in Facies Classification (지도학습 기반 암상 분류 시 클래스 간 자료 불균형을 고려한 평가지표 개발)

  • Kim, Dowan;Choi, Junhwan;Byun, Joongmoo
    • Geophysics and Geophysical Exploration / v.23 no.3 / pp.131-140 / 2020
  • In training a classification model using machine learning, the acquisition of training data is a very important stage, because the amount and quality of the training data greatly influence model performance. However, when the cost of obtaining data is so high that it is difficult to build ideal training data, the number of samples acquired for each class may differ greatly, and a serious data-imbalance problem can occur. If such a problem occurs in the training data, not all classes are trained equally, and classes containing relatively few samples will have significantly lower recall values. Additionally, the reliability of evaluation indices such as accuracy and precision will be reduced. Therefore, this study sought to overcome the problem of data imbalance in two stages. First, we introduced weighted accuracy and weighted precision as new evaluation indices that take the data-imbalance ratio into account by modifying the conventional measures of accuracy and precision. Next, oversampling was performed to balance weighted precision and recall among classes. We verified the algorithm by applying it to the problem of facies classification. As a result, the imbalance between majority and minority classes was greatly mitigated, and the boundaries between classes could be more clearly identified.
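
The claim that plain accuracy becomes unreliable under class imbalance, while minority-class recall collapses, can be illustrated with a small sketch. The abstract does not define its weighted accuracy and weighted precision, so the example below substitutes the standard balanced accuracy (the mean of per-class recalls) as a stand-in; the facies labels and helper names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / true samples of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls (standard metric, NOT the paper's index)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Hypothetical facies labels: 90 majority samples, 10 minority samples.
y_true = ["sand"] * 90 + ["shale"] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = ["sand"] * 100

print(accuracy(y_true, y_pred))           # 0.9
print(per_class_recall(y_true, y_pred))   # {'sand': 1.0, 'shale': 0.0}
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The classifier never detects the minority class, yet plain accuracy still reports 0.9; a class-aware index exposes the failure.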

Development of an Anomaly Detection Algorithm for Verification of Radionuclide Analysis Based on Artificial Intelligence in Radioactive Wastes (방사성폐기물 핵종분석 검증용 이상 탐지를 위한 인공지능 기반 알고리즘 개발)

  • Seungsoo Jang;Jang Hee Lee;Young-su Kim;Jiseok Kim;Jeen-hyeng Kwon;Song Hyun Kim
    • Journal of Radiation Industry / v.17 no.1 / pp.19-32 / 2023
  • The amount of radioactive waste is expected to increase dramatically with the decommissioning of nuclear power plants such as Kori-1, the first nuclear power plant in South Korea. Accurate nuclide analysis is necessary to manage radioactive waste safely, but research on the verification of radionuclide analysis has yet to be well established. This study aimed to develop a technology that can verify the results of radionuclide analysis based on artificial intelligence. We propose an anomaly detection algorithm for inspecting errors in radionuclide analysis. We used the data from 'Updated Scaling Factors in Low-Level Radwaste' (NP-5077), published by EPRI (Electric Power Research Institute), and performed resampling with the SMOTE (Synthetic Minority Oversampling Technique) algorithm to augment the data. A total of 149,676 samples augmented with the SMOTE algorithm were used to train the artificial neural networks (classification and anomaly detection networks), and the 324 data points from the NP-5077 report were used to verify the performance of the networks. The anomaly detection algorithm for radionuclide analysis was divided into two modules: one detects cases where radioactive waste was incorrectly classified, and the other identifies abnormal data such as missing or incorrectly written data. The classification network was constructed using fully connected layers, and the anomaly detection network was composed of an encoder and a decoder. The latter was operated by loading the latent vector from the end layer of the classification network. This study conducted exploratory data analysis (i.e., statistics, histogram, correlation, covariance, PCA, k-means clustering, DBSCAN). The analysis showed that it is difficult to distinguish the type of radioactive waste because the data distributions overlap. In spite of these complexities, our algorithm based on deep learning can distinguish abnormal data from normal data. Radionuclide analysis was verified using our anomaly detection algorithm, and meaningful results were obtained.

Design of Optimal FIR Filters for Data Transmission (데이터 전송을 위한 최적 FIR 필터 설계)

  • 이상욱;이용환
    • The Journal of Korean Institute of Communications and Information Sciences / v.18 no.8 / pp.1226-1237 / 1993
  • For data transmission over strictly band-limited non-ideal channels, different types of filters with arbitrary responses are needed. In this paper, we propose two efficient techniques for the design of such FIR filters, whose response is specified in either the time or the frequency domain. In particular, when a fractionally-spaced structure is used for the transceiver, these filters can be efficiently designed by making use of the characteristics of oversampling. By using a minimum mean-squared error criterion, we design a fractionally-spaced FIR filter whose frequency response can be controlled without affecting the output error. With proper specification of the shape of the additive noise signals, for example, the design results in a receiver filter that can perform compromise equalization as well as phase-splitting filtering for QAM demodulation. The second method addresses the design of an FIR filter whose desired response can be arbitrarily specified in the frequency domain. For optimum design, we use an iterative optimization technique based on a weighted least mean square algorithm. A new adaptation algorithm for updating the weighting function is proposed for fast and stable convergence. It is shown that these two independent methods can be efficiently combined for more complex applications.

Coherent X-ray Diffraction Imaging with Single-pulse Table-top Soft X-ray Laser

  • Kang, Hyon-Chol;Kim, H.T.;Lee, S.K.;Kim, C.M.;Choi, I.W.;Yu, T.J.;Sung, J.H.;Hafz, N.;Jeong, T.M.;Kang, S.W.;Jin, Y.Y.;Noh, Y.C.;Ko, D.K.;Kim, S.S.;Marathe, S.;Kim, S.N.;Kim, C.;Noh, D.Y.;Lee, J.
    • Proceedings of the Optical Society of Korea Conference / 2008.02a / pp.429-430 / 2008
  • We demonstrate coherent x-ray diffraction imaging using a table-top x-ray laser at a wavelength of 13.9 nm driven by a 10-Hz Ti:sapphire laser system at the Advanced Photonics Research Institute in Korea. Since the flux of x-ray photons reaches as high as $10^9$ photons/pulse in a $20 \times 20\ \mu\mathrm{m}^2$ field of view, we measured a single-pulse diffraction pattern of a micrometer-scale object with a high dynamic range of diffraction intensities and successfully reconstructed the image using a phase retrieval algorithm with an oversampling ratio of 1:6. The imaging resolution is $\sim 150$ nm, which is much improved by stacking many diffraction patterns. This demonstration can be extended to biological samples with diffraction-limited resolution.

Boosting the Performance of the Predictive Model on the Imbalanced Dataset Using SVM Based Bagging and Out-of-Distribution Detection (SVM 기반 Bagging과 OoD 탐색을 활용한 제조공정의 불균형 Dataset에 대한 예측모델의 성능향상)

  • Kim, Jong Hoon;Oh, Hayoung
    • KIPS Transactions on Software and Data Engineering / v.11 no.11 / pp.455-464 / 2022
  • Datasets from a manufacturing process have two unique characteristics: severe class imbalance and a large number of Out-of-Distribution samples. Strategies such as oversampling the minority class and down-sampling the majority class are well known for handling class imbalance. In addition, SMOTE has recently been chosen to address the issue. However, Out-of-Distribution samples have been studied only with neural networks. Out-of-Distribution detection has rarely been applied to predictive models that use conventional machine learning algorithms such as SVM, Random Forest, and KNN. It is known that conventional machine learning algorithms perform much better than neural networks in prediction performance, because neural networks are vulnerable to over-fitting and require much bigger datasets than conventional machine learning algorithms do. We therefore suggest a new approach that utilizes Out-of-Distribution detection based on the SVM algorithm. In addition, a bagging technique is adopted to improve the precision of the model.

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.23-45 / 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales, and SNS, and dataset characteristics are correspondingly diverse. To secure competitiveness, companies need to improve their decision-making capacity using classification algorithms. However, most of them do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm is appropriate for the characteristics of a dataset has been a task that requires expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features that reflect the characteristics of multi-class data. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, the meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among these, we included the Herfindahl-Hirschman Index (HHI), originally a market concentration index, to replace the imbalance ratio (IR). We also added a newly developed index, the Reverse ReLU Silhouette Score, to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), and Contraceptive Method Choice) were selected. Each dataset was classified using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross-validation. Oversampling of 10% to 100% was applied to each fold, and the meta-features of the dataset were measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. F1-score was selected as the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The HHI meta-feature proposed in this study was significant for classification performance. (2) The number of variables has a significant effect on classification performance and, unlike the number of classes, the effect is positive. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at a significance level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by classification algorithm were consistent. In the regression analysis by classification algorithm, the number of variables does not have a significant effect for the Naïve Bayes algorithm, unlike for the other classification algorithms. This study makes two theoretical contributions: (1) two new meta-features (HHI and Reverse ReLU Silhouette Score) were shown to be significant, and (2) the effects of data characteristics on classification performance were investigated using meta-features. There are also two practical contributions. (1) The results can be utilized in developing a system that recommends classification algorithms according to the characteristics of datasets. (2) Because data characteristics differ, many data scientists search for the optimal algorithm for a situation by repeatedly adjusting algorithm parameters, a process that wastes hardware, cost, time, and manpower. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. This paper consists of an introduction, related research, research model, experiment, conclusion, and discussion.
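
The use of the Herfindahl-Hirschman Index as a class-imbalance meta-feature can be sketched as follows. The sketch assumes the common definition of HHI as the sum of squared shares, applied here to class proportions on a 0-to-1 scale; the abstract does not state the exact scaling the authors used, and the function name `hhi` and the toy labels are hypothetical.

```python
from collections import Counter

def hhi(labels):
    """Herfindahl-Hirschman Index of a label distribution.

    Computed as the sum of squared class proportions (assumed definition).
    Ranges from 1 / n_classes (perfectly balanced) to 1.0 (single class).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

# Toy usage with hypothetical labels.
balanced = ["a", "b", "c", "d"] * 25          # four classes, 25 samples each
skewed = ["a"] * 90 + ["b"] * 10              # two classes, 90/10 split

print(hhi(balanced))  # 0.25
print(hhi(skewed))
```

Unlike a two-class imbalance ratio, this single number is defined for any number of classes, which is one plausible reason for preferring it in a multi-class setting.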

Automatic scoring of mathematics descriptive assessment using random forest algorithm (랜덤 포레스트 알고리즘을 활용한 수학 서술형 자동 채점)

  • Inyong Choi;Hwa Kyung Kim;In Woo Chung;Min Ho Song
    • The Mathematical Education / v.63 no.2 / pp.165-186 / 2024
  • Despite the growing attention on artificial intelligence-based automated scoring technology as a support method for the introduction of descriptive items in school environments and large-scale assessments, there is a noticeable lack of foundational research in mathematics compared to other subjects. This study developed an automated scoring model for two descriptive items in first-year middle school mathematics using the Random Forest algorithm, evaluated its performance, and explored ways to enhance this performance. The accuracy of the final models for the two items ranged from 0.95 to 1.00 and from 0.73 to 0.89, respectively, which is relatively high compared to automated scoring models in other subjects. We found that the strategic selection of the number of evaluation categories, taking into account the amount of data, is crucial for the effective development and performance of automated scoring models. Additionally, text preprocessing by mathematics education experts proved effective in improving both the performance and interpretability of the automated scoring model. Selecting a vectorization method that matches the characteristics of the items and data was identified as one way to enhance model performance. Furthermore, we confirmed that oversampling is a useful method to supplement performance in situations where practical limitations hinder balanced data collection. To enhance educational utility, further research is needed on how to utilize the feature importance derived from the Random Forest-based automated scoring model to generate useful information for teaching and learning, such as feedback. This study is significant as foundational research in the field of automatic scoring of descriptive mathematics items, and various subsequent studies are needed through close collaboration between AI experts and mathematics education experts.

A Study about Learning Graph Representation on Farmhouse Apple Quality Images with Graph Transformer (그래프 트랜스포머 기반 농가 사과 품질 이미지의 그래프 표현 학습 연구)

  • Ji Hun Bae;Ju Hwan Lee;Gwang Hyun Yu;Gyeong Ju Kwon;Jin Young Kim
    • Smart Media Journal / v.12 no.1 / pp.9-16 / 2023
  • Recently, convolutional neural network (CNN) based systems have been developed to overcome the limitations of human resources in apple quality classification at farmhouses. However, since convolutional neural networks accept only images of the same size, preprocessing such as sampling may be required, and in the case of oversampling, information from the original image is lost through image quality degradation and blurring. In this paper, to minimize this problem, we generate an image-patch-based graph of the original image and propose a random walk-based positional encoding method for applying the graph transformer model. The method continuously learns position embedding information for patches that have no positional information, based on the random walk algorithm, and finds the optimal graph structure by aggregating useful node information through the self-attention technique of the graph transformer model. Therefore, it is robust and performs well even on a new graph structure with random node order and on an arbitrary graph structure that depends on the location of an object in an image. In experiments on 5 apple quality datasets, the learning accuracy was higher than that of other GNN models by a minimum of 1.3% and a maximum of 4.7%, and the number of parameters was 3.59M, about 15% of the 23.52M of the ResNet18 model. The reduced amount of computation therefore yields fast inference speed and demonstrates the effectiveness of the method.