• Title/Summary/Keyword: imbalance class

Search Result 128

Improving minority prediction performance of support vector machine for imbalanced text data via feature selection and SMOTE (단어선택과 SMOTE 알고리즘을 이용한 불균형 텍스트 데이터의 소수 범주 예측성능 향상 기법)

  • Jongchan Kim;Seong Jun Chang;Won Son
    • The Korean Journal of Applied Statistics
    • /
    • v.37 no.4
    • /
    • pp.395-410
    • /
    • 2024
  • Text data is usually made up of a wide variety of unique words. Even in standard text data, it is common to find tens of thousands of different words. In text data analysis, usually, each unique word is treated as a variable. Thus, text data can be regarded as a dataset with a large number of variables. On the other hand, in text data classification, we often encounter class label imbalance problems. In the cases of substantial imbalances, the performance of conventional classification models can be severely degraded. To improve the classification performance of support vector machines (SVM) for imbalanced data, algorithms such as the Synthetic Minority Over-sampling Technique (SMOTE) can be used. The SMOTE algorithm synthetically generates new observations for the minority class based on the k-Nearest Neighbors (kNN) algorithm. However, in datasets with a large number of variables, such as text data, errors may accumulate. This can potentially impact the performance of the kNN algorithm. In this study, we propose a method for enhancing prediction performance for the minority class of imbalanced text data. Our approach involves employing variable selection to generate new synthetic observations in a reduced space, thereby improving the overall classification performance of SVM.
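
The idea in the abstract above (generate synthetic minority observations in a reduced feature space) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: the abstract does not say how words are selected, so the `keep` indices, the toy rows, and the helper names `smote` and `select_columns` are all hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def select_columns(rows, keep):
    """Project rows onto the selected feature indices (the reduced space)."""
    return [[row[j] for j in keep] for row in rows]


# Hypothetical minority-class rows with 4 features; only features 0 and 2
# are assumed to survive variable selection.
minority_rows = [
    [1.0, 9.0, 2.0, 0.0],
    [1.5, 0.0, 2.5, 7.0],
    [2.0, 4.0, 3.0, 1.0],
    [1.2, 8.0, 2.2, 3.0],
]
reduced = select_columns(minority_rows, keep=[0, 2])
new_points = smote(reduced, n_new=5, k=2)
```

Because neighbours are searched only among the retained columns, noisy dimensions no longer contribute to the distance computation, which is the effect the abstract attributes to working in a reduced space.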

Total Dietary Fiber and Mineral Absorption

  • Gordon, Dennis-T.
    • Journal of Nutrition and Health
    • /
    • v.25 no.6
    • /
    • pp.429-449
    • /
    • 1992
  • The consumption of foods rich in TDF should not be associated with impaired mineral absorption and long-term mineral status. In surveys of populations consuming high amounts of TDF, e.g., Third World populations and vegetarians, gross deficiencies in mineral nutrition have not been noted. If mineral status is low among these groups, it is most likely caused by the inadequacy or imbalance of the diet and not by the TDF. The key word is interaction, which should be interpreted in terms of dietary imbalances that produce nutrient deficiencies. There are no strong data to support the concept that TDF inhibits mineral absorption through a binding chelation mechanism. Limited data suggest that positively charged groups on polymers such as chitosan and cholestyramine will decrease iron absorption in humans and animals. Because TDF does not contain positively charged groups, future research should be directed at the possible role of protein consumed along with TDF and the combination of effects on mineral nutrition. Phytic acid is acknowledged as a potent chelator of zinc. However, its association with zinc and its propensity to lower Zn bioavailability may enhance the absorption of other elements, notably copper and iron. The importance of interactions among nutrients, including TDF, will gain additional attention in the scientific community. Soluble and insoluble dietary fiber function differently in the intestine. Insoluble fibers accelerate movement through the intestine. Soluble dietary fibers appear to regulate blood concentrations of glucose and cholesterol, albeit by some unknown mechanism. Increased viscosity produced by the SDF in the intestine may provide an explanation of how this class of polymers affects plasma glucose, cholesterol, and other nutrients. Employing a double-perfusion technique in the rat, we demonstrated that viscosity produced by SDF will delay transfer of zinc into the circulatory system. This delayed absorption should not be interpreted as decreased utilization. A great deal of additional research is required to prove the importance of luminal viscosity produced by SDF in slowing nutrient absorption or regulating blood nutrient homeostasis. Increased intake of TDF in the total human diet appears desirable. A dietary intake of 35g/day should not be considered to have a negative effect on mineral absorption. It is important to educate people that an intake of more than 35g TDF/day may cause an imbalance in the diet that can adversely affect mineral utilization. Acknowledgments. Appreciation is given to Dr. George V. Vahouny (deceased), who was intense, a great competitor in and out of science, and who gave the author inspiration. Portions of this work were supported by the University of Missouri Agricultural Station and by a grant from the Univer-rch Support Grant RR 07053 from the National Institutes of Health. Contribution of the Missouri Agricultural Experiment Station Journal Series No. 10747.

A Method of Machine Learning-based Defective Health Functional Food Detection System for Efficient Inspection of Imported Food (효율적 수입식품 검사를 위한 머신러닝 기반 부적합 건강기능식품 탐지 방법)

  • Lee, Kyoungsu;Bak, Yerin;Shin, Yoonjong;Sohn, Kwonsang;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.3
    • /
    • pp.139-159
    • /
    • 2022
  • As interest in health functional foods has increased since COVID-19, the importance of imported food safety inspections is growing. However, while imports of health functional foods increase every year, the budget and manpower available for import and export inspections are reaching their limit. Hence, the purpose of this study is to propose a machine learning model that efficiently detects nonconforming food and is suited to the characteristics of the imported food data held by government offices. First, the components of food import/export inspection data that affect the judgment of nonconformity were examined, and derived variables were newly created. Second, in order to select features for machine learning, class imbalance and nonlinearity were considered when performing exploratory analysis on imported food-related data. Third, we compared the performance and interpretability of each model by applying various machine learning techniques. In particular, the ensemble model performed best, and it was confirmed that the derived variables and models proposed in this study can be helpful to the system used in import/export inspections.

Feature Selection for Anomaly Detection Based on Genetic Algorithm (유전 알고리즘 기반의 비정상 행위 탐지를 위한 특징선택)

  • Seo, Jae-Hyun
    • Journal of the Korea Convergence Society
    • /
    • v.9 no.7
    • /
    • pp.1-7
    • /
    • 2018
  • Feature selection, a data preprocessing technique, is a major research area in many applications dealing with large datasets. It has been used in pattern recognition, machine learning and data mining, and is now widely applied in a variety of fields such as text classification, image retrieval, intrusion detection and genome analysis. The proposed method is based on a genetic algorithm, which is a meta-heuristic algorithm. There are two methods of finding feature subsets: a filter method and a wrapper method. In this study, we use a wrapper method, which evaluates feature subsets using a real classifier, to find an optimal feature subset. The training dataset used in the experiment has a severe class imbalance, which makes it difficult to improve classification performance for rare classes. After preprocessing the training dataset with SMOTE, we select features and evaluate them with various machine learning algorithms.
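
A wrapper-style genetic algorithm of the kind described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the classifier, the genetic operators, or their parameters, so the nearest-centroid classifier, truncation selection, one-point crossover, bit-flip mutation, and the toy data are all hypothetical choices. The SMOTE preprocessing step is omitted.

```python
import random


def centroid_accuracy(X, y, mask):
    """Wrapper fitness: leave-one-out accuracy of a nearest-centroid
    classifier restricted to the features switched on in `mask`."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == label]
            centroids[label] = [sum(row[j] for row in rows) / len(rows) for j in cols]
        pred = min(centroids, key=lambda c: sum(
            (X[i][j] - m) ** 2 for j, m in zip(cols, centroids[c])))
        correct += pred == y[i]
    return correct / len(X)


def ga_select(X, y, pop_size=12, generations=15, p_mut=0.1, seed=0):
    """Search for a feature mask (list of 0/1) with high wrapper fitness."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        return centroid_accuracy(X, y, mask)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]      # truncation selection, keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)       # one-point crossover
            child = a[:cut] + b[cut:]
            # bit-flip mutation with probability p_mut per gene
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


# Toy data: feature 0 separates the classes, the others are noise.
X = [[0.1, 5.0, 3.0, 1.0], [0.2, 1.0, 9.0, 4.0], [0.0, 7.0, 2.0, 8.0],
     [0.9, 6.0, 3.5, 2.0], [1.0, 2.0, 8.0, 5.0], [0.8, 7.5, 1.0, 9.0]]
y = [0, 0, 0, 1, 1, 1]
best_mask = ga_select(X, y)
```

The fitness function is what makes this a wrapper method: every candidate subset is scored by actually training and evaluating a classifier on it, rather than by a classifier-independent statistic as in a filter method.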

Pattern Analysis of Traffic Accident data and Prediction of Victim Injury Severity Using Hybrid Model (교통사고 데이터의 패턴 분석과 Hybrid Model을 이용한 피해자 상해 심각도 예측)

  • Ju, Yeong Ji;Hong, Taek Eun;Shin, Ju Hyun
    • Smart Media Journal
    • /
    • v.5 no.4
    • /
    • pp.75-82
    • /
    • 2016
  • Although Korea's economy and domestic automobile market have grown along with changes in the road environment, the traffic accident rate has also increased, and casualties are at a serious level. For this reason, the government is establishing and promoting policies to open traffic accident data and solve these problems. In this paper, we describe a method of predicting traffic accidents by eliminating the class imbalance in traffic accident data and constructing a hybrid model. Using the original traffic accident data and the sampled data as training data, the FP-Growth algorithm learns patterns associated with traffic accident injury severity. Accordingly, this paper proposes a method for predicting the injury severity of a traffic accident victim: by analyzing the association patterns of the two training datasets, we extract the related patterns they share, and when decision tree and multinomial logistic regression analyses are performed, a hybrid model is constructed by assigning weights to the related attributes.

Problems and Development of Police Officials' Physical Fitness Tests (경찰공무원 체력검정의 문제점 및 발전방안)

  • Kim, Sang-Woon
    • The Journal of the Korea Contents Association
    • /
    • v.19 no.8
    • /
    • pp.609-619
    • /
    • 2019
  • The study aims to present solutions to the problems of police physical fitness tests, prompted by a video clip titled "Darim-dong female police officer Assault" from May in Seoul, in which police officers trying to subdue a drunk person appeared rather powerless, for example being pushed back in a physical struggle with the suspect. The police physical fitness test is subject to criticism because it consists of items that are difficult to apply in real life, even though it should be linked to job performance. The problems are that the physical fitness test events are not realistic, that the physical fitness test standards are set too low, and that the results lack credibility due to the imbalance of standards between men and women and the vision culture of physical fitness testing methods. We hope that the Republic of Korea will become a world-class security powerhouse by upgrading its physical fitness standards and establishing a scientific fitness test system for preventing injuries and effectively measuring physical strength.

CAB: Classifying Arrhythmias based on Imbalanced Sensor Data

  • Wang, Yilin;Sun, Le;Subramani, Sudha
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.7
    • /
    • pp.2304-2320
    • /
    • 2021
  • Intelligently detecting anomalies in health sensor data streams (e.g., Electrocardiogram, ECG) can improve the development of the E-health industry. The physiological signals of patients are collected through sensors. Timely diagnosis and treatment save medical resources, promote physical health, and reduce complications. However, it is difficult to automatically classify ECG data, as the features of ECGs are difficult to extract. In addition, the volume of labeled ECG data is limited, which affects classification performance. In this paper, we propose a Generative Adversarial Network (GAN)-based deep learning framework (called CAB) for heart arrhythmia classification. CAB focuses on improving the detection accuracy based on a small number of labeled samples. It is trained on class-imbalanced ECG data. Augmenting ECG data with a GAN model eliminates the impact of data scarcity. After data augmentation, CAB classifies the ECG data by using a Bidirectional Long Short Term Memory Recurrent Neural Network (Bi-LSTM). Experiment results show a better performance of CAB compared with state-of-the-art methods. The overall classification accuracy of CAB is 99.71%. The F1-scores of classifying Normal beats (N), Supraventricular ectopic beats (S), Ventricular ectopic beats (V), Fusion beats (F) and Unclassifiable beats (Q) are 99.86%, 97.66%, 99.05%, 98.57% and 99.88%, respectively.
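
The abstract above reports per-class F1-scores alongside overall accuracy. The short sketch below shows how the two are computed and why they can diverge on imbalanced data. The labels and predictions are made-up toy values, not results from the paper, and `per_class_f1` is a hypothetical helper name.

```python
def per_class_f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy example: 8 normal beats (N) and 2 supraventricular beats (S);
# one of the two minority beats is misclassified as N.
y_true = ["N"] * 8 + ["S"] * 2
y_pred = ["N"] * 8 + ["N", "S"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = per_class_f1(y_true, y_pred)
```

In this toy case accuracy is 0.9, while the F1-score of the minority class S is only 2/3, which is why per-class scores are informative when classes are imbalanced.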

Automatic Augmentation Technique of an Autoencoder-based Numerical Training Data (오토인코더 기반 수치형 학습데이터의 자동 증강 기법)

  • Jeong, Ju-Eun;Kim, Han-Joon;Chun, Jong-Hoon
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.22 no.5
    • /
    • pp.75-86
    • /
    • 2022
  • This study aims to solve the problem of class imbalance in numerical data by using a deep learning-based Variational AutoEncoder and to improve the performance of the learning model by augmenting the training data. We propose 'D-VAE' to artificially increase the number of records in given tabular data. The proposed technique applies discretization and feature selection in the preprocessing stage to optimize the data. In the discretization process, K-means is applied to group the values, which are then converted into one-hot vectors by one-hot encoding. Subsequently, for memory efficiency, sample data are generated with the Variational AutoEncoder using only the features that RFECV, a feature selection technique, identifies as helpful for prediction. To verify the performance of the proposed model, we demonstrate its validity by conducting experiments at different data augmentation ratios.
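
The discretization step described above (K-means grouping followed by one-hot encoding) can be sketched for a single numeric column as follows. This is a minimal pure-Python illustration, not the D-VAE implementation: the quantile-based initialisation, the number of clusters, and the `ages` column are assumptions, and the Variational AutoEncoder and RFECV steps are not shown.

```python
def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's algorithm on a single numeric column."""
    ordered = sorted(values)
    # initialise centres at evenly spaced quantiles of the data
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            groups[nearest].append(v)
        # keep the old centre if a cluster happens to be empty
        centres = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
    return centres


def discretize_one_hot(values, centres):
    """Replace each value by a one-hot vector marking its nearest centre."""
    encoded = []
    for v in values:
        idx = min(range(len(centres)), key=lambda c: abs(v - centres[c]))
        encoded.append([1 if j == idx else 0 for j in range(len(centres))])
    return encoded


# Hypothetical numeric column with three visible groups.
ages = [21, 23, 22, 45, 47, 44, 70, 72]
centres = kmeans_1d(ages, k=3)
encoded = discretize_one_hot(ages, centres)
```

Each numeric value becomes a vector with exactly one active position, so the downstream generative model works with categorical bins instead of raw magnitudes.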

Structural health monitoring data anomaly detection by transformer enhanced densely connected neural networks

  • Jun, Li;Wupeng, Chen;Gao, Fan
    • Smart Structures and Systems
    • /
    • v.30 no.6
    • /
    • pp.613-626
    • /
    • 2022
  • Guaranteeing the quality and integrity of structural health monitoring (SHM) data is very important for an effective assessment of structural condition. However, the sensor system may malfunction due to sensor faults or a harsh operational environment, resulting in multiple types of data anomalies in the measured data. Efficiently and automatically identifying anomalies in the vast amounts of measured data is significant for assessing structural conditions and for early warning of structural failure in SHM. A major challenge for current automated data anomaly detection methods is the imbalance of dataset categories. Considering the features of actual anomalous data, this paper proposes a data anomaly detection method based on data-level and deep learning techniques for SHM of civil engineering structures. The proposed method consists of a data balancing phase, which prepares a comprehensive training dataset using a data-level technique, and an anomaly detection phase based on a sophisticatedly designed network. The advanced densely connected convolutional network (DenseNet) and Transformer encoder are embedded in the network to facilitate extraction of both detail and global features of response data, and to establish the mapping between the highest level of abstract features and the data anomaly class. Numerical studies on a steel frame model are conducted to evaluate the performance and noise immunity of the proposed network for data anomaly detection. The applicability of the proposed method for data anomaly classification is validated with the measured data of a practical supertall structure. The proposed method shows remarkable performance on data anomaly detection, reaching a 95.7% overall accuracy with practical engineering structural monitoring data, which demonstrates the effectiveness of data balancing and the robust classification capability of the proposed network.

Development of a Deep Learning Algorithm for Anomaly Detection of Manufacturing Facility (설비 이상탐지를 위한 딥러닝 알고리즘 개발)

  • Kim, Min-Hee;Jin, Kyo-Hong
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.26 no.2
    • /
    • pp.199-206
    • /
    • 2022
  • A malfunction or breakdown of a manufacturing facility leads to product defects and the suspension of production lines, resulting in huge financial losses for manufacturers. Due to the spread of smart factory services, a large amount of data is being collected in factories, and AI-based research is being conducted to predict and diagnose manufacturing facility breakdowns or manufacturing site efficiency. However, because of the characteristics of manufacturing data, such as a severe class imbalance with respect to abnormalities and ambiguous label information for distinguishing abnormalities, developing classification or anomaly detection models is highly difficult. In this paper, we present a deep learning algorithm for anomaly detection in a manufacturing facility using the reconstruction loss of a CNN-based model and analyze its performance. The algorithm detects anomalies by relying solely on normal data from the facility's manufacturing data, excluding abnormal data.
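
The general principle of reconstruction-loss anomaly detection trained on normal data only can be sketched as follows. This is not the paper's CNN-based model: as a loudly labelled stand-in, the "reconstruction" here is simply the per-position mean of the normal windows, and the threshold rule (twice the largest error seen on normal data) is an assumption, since the abstract does not state how the threshold is chosen. All data values are made up.

```python
import statistics


def fit_mean_profile(normal_windows):
    """Stand-in for a trained model: the per-position mean of normal windows."""
    return [statistics.fmean(col) for col in zip(*normal_windows)]


def reconstruction_error(window, profile):
    """Mean squared error between a window and its stand-in reconstruction."""
    return statistics.fmean((x - p) ** 2 for x, p in zip(window, profile))


# Hypothetical sensor windows recorded while the facility ran normally.
normal = [
    [1.0, 1.1, 0.9, 1.0],
    [1.1, 1.0, 1.0, 0.9],
    [0.9, 1.0, 1.1, 1.0],
    [1.0, 0.9, 1.0, 1.1],
]
profile = fit_mean_profile(normal)
errors = [reconstruction_error(w, profile) for w in normal]

# Assumed rule: flag anything whose error exceeds twice the worst normal error.
threshold = 2 * max(errors)


def is_anomaly(window):
    return reconstruction_error(window, profile) > threshold
```

Because only normal data is needed to fit the model and set the threshold, this approach sidesteps the severe class imbalance and ambiguous labels mentioned in the abstract.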