• Title/Summary/Keyword: Imbalanced Large-scale Data

Search Result 8, Processing Time 0.025 seconds

Parameter Tuning in Support Vector Regression for Large Scale Problems (대용량 자료에 대한 서포트 벡터 회귀에서 모수조절)

  • Ryu, Jee-Youl;Kwak, Minjung;Yoon, Min
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.25 no.1
    • /
    • pp.15-21
    • /
    • 2015
  • In support vector machine, the values of parameters included in kernels affect strongly generalization ability. It is often difficult to determine appropriate values of those parameters in advance. It has been observed through our studies that the burden for deciding the values of those parameters in support vector regression can be reduced by utilizing ensemble learning. However, the straightforward application of the method to large scale problems is too time consuming. In this paper, we propose a method in which the original data set is decomposed into a certain number of sub data set in order to reduce the burden for parameter tuning in support vector regression with large scale data sets and imbalanced data set, particularly.

Comparative Study of Dimension Reduction Methods for Highly Imbalanced Overlapping Churn Data

  • Lee, Sujee;Koo, Bonhyo;Jung, Kyu-Hwan
    • Industrial Engineering and Management Systems
    • /
    • v.13 no.4
    • /
    • pp.454-462
    • /
    • 2014
  • Retention of possible churning customer is one of the most important issues in customer relationship management, so companies try to predict churn customers using their large-scale high-dimensional data. This study focuses on dealing with large data sets by reducing the dimensionality. By using six different dimension reduction methods-Principal Component Analysis (PCA), factor analysis (FA), locally linear embedding (LLE), local tangent space alignment (LTSA), locally preserving projections (LPP), and deep auto-encoder-our experiments apply each dimension reduction method to the training data, build a classification model using the mapped data and then measure the performance using hit rate to compare the dimension reduction methods. In the result, PCA shows good performance despite its simplicity, and the deep auto-encoder gives the best overall performance. These results can be explained by the characteristics of the churn prediction data that is highly correlated and overlapped over the classes. We also proposed a simple out-of-sample extension method for the nonlinear dimension reduction methods, LLE and LTSA, utilizing the characteristic of the data.

Data anomaly detection for structural health monitoring using a combination network of GANomaly and CNN

  • Liu, Gaoyang;Niu, Yanbo;Zhao, Weijian;Duan, Yuanfeng;Shu, Jiangpeng
    • Smart Structures and Systems
    • /
    • v.29 no.1
    • /
    • pp.53-62
    • /
    • 2022
  • The deployment of advanced structural health monitoring (SHM) systems in large-scale civil structures collects large amounts of data. Note that these data may contain multiple types of anomalies (e.g., missing, minor, outlier, etc.) caused by harsh environment, sensor faults, transfer omission and other factors. These anomalies seriously affect the evaluation of structural performance. Therefore, the effective analysis and mining of SHM data is an extremely important task. Inspired by the deep learning paradigm, this study develops a novel generative adversarial network (GAN) and convolutional neural network (CNN)-based data anomaly detection approach for SHM. The framework of the proposed approach includes three modules : (a) A three-channel input is established based on fast Fourier transform (FFT) and Gramian angular field (GAF) method; (b) A GANomaly is introduced and trained to extract features from normal samples alone for class-imbalanced problems; (c) Based on the output of GANomaly, a CNN is employed to distinguish the types of anomalies. In addition, a dataset-oriented method (i.e., multistage sampling) is adopted to obtain the optimal sampling ratios between all different samples. The proposed approach is tested with acceleration data from an SHM system of a long-span bridge. The results show that the proposed approach has a higher accuracy in detecting the multi-pattern anomalies of SHM data.

Transfer Learning-Based Feature Fusion Model for Classification of Maneuver Weapon Systems

  • Jinyong Hwang;You-Rak Choi;Tae-Jin Park;Ji-Hoon Bae
    • Journal of Information Processing Systems
    • /
    • v.19 no.5
    • /
    • pp.673-687
    • /
    • 2023
  • Convolutional neural network-based deep learning technology is the most commonly used in image identification, but it requires large-scale data for training. Therefore, application in specific fields in which data acquisition is limited, such as in the military, may be challenging. In particular, the identification of ground weapon systems is a very important mission, and high identification accuracy is required. Accordingly, various studies have been conducted to achieve high performance using small-scale data. Among them, the ensemble method, which achieves excellent performance through the prediction average of the pre-trained models, is the most representative method; however, it requires considerable time and effort to find the optimal combination of ensemble models. In addition, there is a performance limitation in the prediction results obtained by using an ensemble method. Furthermore, it is difficult to obtain the ensemble effect using models with imbalanced classification accuracies. In this paper, we propose a transfer learning-based feature fusion technique for heterogeneous models that extracts and fuses features of pre-trained heterogeneous models and finally, fine-tunes hyperparameters of the fully connected layer to improve the classification accuracy. The experimental results of this study indicate that it is possible to overcome the limitations of the existing ensemble methods by improving the classification accuracy through feature fusion between heterogeneous models based on transfer learning.

Churn Prediction Model using Logistic Regression (Logistic Regression을 이용한 이탈고객예측모형)

  • Jeong, Han-Na;Park, Hye-Jin;Kim, Nam-Hyeong;Jeon, Chi-Hyeok;Lee, Jae-Uk
    • Proceedings of the Korean Operations and Management Science Society Conference
    • /
    • 2008.10a
    • /
    • pp.324-328
    • /
    • 2008
  • 금융산업에서 고객의 이탈비율은 기대수익에 영향을 미친다는 점에서 예측이 필요한 부분이며 최근 들어 정확한 예측을 통한 비용관리가 이루어지면서 고객 이탈을 예측하는 것이 중요한 문제로 떠오르고 있다. 그러나 보험 고객 데이터가 대용량이고 불균형한 출력 값을 갖는 특성으로 인해 기존의 방법으로 예측 모델을 만드는 것이 적합하지 않다. 본 연구에서는 대용량 데이터를 처리하는 데 효과적으로 알려져 있는 Trust-region Newton method를 적용한 로지스틱 회귀분석을 통해 이탈고객을 예측하는 것을 주된 연구로 하며, 불균형한 데이터에서의 예측정확도를 높이기 위해 Oversampling, Clustering, Boosting 등을 이용하여 고객 데이터에 적합한 이탈 고객 예측 모형을 제시하고자 한다.

  • PDF

F_MixBERT: Sentiment Analysis Model using Focal Loss for Imbalanced E-commerce Reviews

  • Fengqian Pang;Xi Chen;Letong Li;Xin Xu;Zhiqiang Xing
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.2
    • /
    • pp.263-283
    • /
    • 2024
  • Users' comments after online shopping are critical to product reputation and business improvement. These comments, sometimes known as e-commerce reviews, influence other customers' purchasing decisions. To confront large amounts of e-commerce reviews, automatic analysis based on machine learning and deep learning draws more and more attention. A core task therein is sentiment analysis. However, the e-commerce reviews exhibit the following characteristics: (1) inconsistency between comment content and the star rating; (2) a large number of unlabeled data, i.e., comments without a star rating, and (3) the data imbalance caused by the sparse negative comments. This paper employs Bidirectional Encoder Representation from Transformers (BERT), one of the best natural language processing models, as the base model. According to the above data characteristics, we propose the F_MixBERT framework, to more effectively use inconsistently low-quality and unlabeled data and resolve the problem of data imbalance. In the framework, the proposed MixBERT incorporates the MixMatch approach into BERT's high-dimensional vectors to train the unlabeled and low-quality data with generated pseudo labels. Meanwhile, data imbalance is resolved by Focal loss, which penalizes the contribution of large-scale data and easily-identifiable data to total loss. Comparative experiments demonstrate that the proposed framework outperforms BERT and MixBERT for sentiment analysis of e-commerce comments.

Consensus-Based Distributed Algorithm for Optimal Resource Allocation of Power Network under Supply-Demand Imbalance (수급 불균형을 고려한 전력망의 최적 자원 할당을 위한 일치 기반의 분산 알고리즘)

  • Young-Hun, Lim
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.15 no.6
    • /
    • pp.440-448
    • /
    • 2022
  • Recently, due to the introduction of distributed energy resources, the optimal resource allocation problem of the power network is more and more important, and the distributed resource allocation method is required to process huge amount of data in large-scale power networks. In the optimal resource allocation problem, many studies have been conducted on the case when the supply-demand balance is satisfied due to the limitation of the generation capacity of each generator, but the studies considering the supply-demand imbalance, that total demand exceeds the maximum generation capacity, have rarely been considered. In this paper, we propose the consensus-based distributed algorithm for the optimal resource allocation of power network considering the supply-demand imbalance condition as well as the supply-demand balance condition. The proposed distributed algorithm is designed to allocate the optimal resources when the supply-demand balance condition is satisfied, and to measure the amount of required resources when the supply-demand is imbalanced. Finally, we conduct the simulations to verify the performance of the proposed algorithm.

Task Balancing Scheme of MPI Gridding for Large-scale LiDAR Data Interpolation (대용량 LiDAR 데이터 보간을 위한 MPI 격자처리 과정의 작업량 발란싱 기법)

  • Kim, Seon-Young;Lee, Hee-Zin;Park, Seung-Kyu;Oh, Sang-Yoon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.19 no.9
    • /
    • pp.1-10
    • /
    • 2014
  • In this paper, we propose MPI gridding algorithm of LiDAR data that minimizes the communication between the cores. The LiDAR data collected from aircraft is a 3D spatial information which is used in various applications. Since there are many cases where the LiDAR data has too high resolution than actually required or non-surface information is included in the data, filtering the raw LiDAR data is required. In order to use the filtered data, the interpolation using the data structure to search adjacent locations is conducted to reconstruct the data. Since the processing time of LiDAR data is directly proportional to the size of it, there have been many studies on the high performance parallel processing system using MPI. However, previously proposed methods in parallel approach possess possible performance degradations such as imbalanced data size among cores or communication overhead for resolving boundary condition inconsistency. We conduct empirical experiments to verify the effectiveness of our proposed algorithm. The results show that the total execution time of the proposed method decreased up to 4.2 times than that of the conventional method on heterogeneous clusters.