• Title/Summary/Keyword: SAMPLE dataset

Default Prediction for Real Estate Companies with Imbalanced Dataset

  • Dong, Yuan-Xiang;Xiao, Zhi;Xiao, Xue
    • Journal of Information Processing Systems / v.10 no.2 / pp.314-333 / 2014
  • When analyzing default predictions for real estate companies, the number of non-defaulted cases always greatly exceeds the number of defaulted ones, which creates a two-class imbalance problem. This lowers the ability of prediction models to distinguish the default samples. To avoid this sample selection bias and improve the prediction model, this paper applies a minority sample generation approach to create new minority samples. Logistic regression, support vector machine (SVM) classification, and neural network (NN) classification trained on the imbalanced dataset serve as benchmarks for single prediction models trained on a balanced dataset corrected by the minority sample generation approach. Instead of prediction-oriented tests and overall accuracy, the true positive rate (TPR), the true negative rate (TNR), G-mean, and F-score are used to measure the performance of default prediction models on imbalanced datasets. We describe an empirical experiment on a sample of 14 default and 315 non-default listed real estate companies in China and report that most single prediction models performed better with the balanced dataset than with the imbalanced one.
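The four measures named in this abstract can be computed directly from confusion-matrix counts. The following is a minimal sketch, assuming the default class is treated as the positive class and that "F-score" means the balanced F1 score (the abstract does not state the weighting). The function name `imbalance_metrics` and the counts are hypothetical and are not results from the paper.

```python
import math

def imbalance_metrics(tp, fn, tn, fp):
    """Return TPR, TNR, G-mean and F1 from confusion-matrix counts.

    Assumption: the positive class is the minority (default) class.
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall on defaults
    tnr = tn / (tn + fp) if (tn + fp) else 0.0        # recall on non-defaults
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(tpr * tnr)                     # geometric mean of both recalls
    denom = precision + tpr
    f_score = 2 * precision * tpr / denom if denom else 0.0
    return {"TPR": tpr, "TNR": tnr, "G-mean": g_mean, "F-score": f_score}

# Hypothetical classifier that labels every company as non-default
# on a 14-default / 315-non-default sample.
always_negative = imbalance_metrics(tp=0, fn=14, tn=315, fp=0)
accuracy = 315 / (14 + 315)

print(always_negative)
print(round(accuracy, 4))
```

Such a classifier reaches high overall accuracy while its TPR and G-mean are zero, which illustrates why overall accuracy alone is a poor yardstick on a sample this skewed.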

Development of Korean Medicine Data Center(KDC) Teaching Dataset to Enhance Utilization of KDC (한의임상정보은행 활용도 제고를 위한 교육용 데이터 개발)

  • Baek, Younghwa;Lee, Siwoo
    • Journal of Sasang Constitutional Medicine / v.29 no.3 / pp.242-247 / 2017
  • Objective: The Korean medicine Data Center (KDC) has established large-scale biological and clinical data based on Korean medicine to demonstrate and validate its theory. The aim of this study was to develop a KDC teaching dataset and user guideline to improve utilization of the KDC. Methods: The KDC teaching dataset was selected using stratified random sampling according to the Sasang constitution (SC). This dataset included 72 variables for 500 sample subjects. The user guideline described how to conduct eight statistical analysis methods using the teaching dataset. Results: The KDC teaching dataset was sampled from 200 (40%) Taeeumin, 125 (25%) Soeumin, and 175 (35%) Soyangin. It consisted of a questionnaire (basic, habit, disease, symptom), a physical exam (body measurement, blood pressure), a blood exam, and an expert SC diagnosis. The user guideline provided step-by-step instructions for performing several statistical analyses with the KDC teaching dataset. Conclusion: We hope that our results will contribute to enhancing KDC utilization and understanding.
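Stratified random sampling with fixed per-stratum sizes, as described in the Methods, can be sketched as follows. This is only an illustration: the population is synthetic, the record layout and the helper name `stratified_sample` are hypothetical, and only the three stratum sizes (200, 125, 175) come from the abstract.

```python
import random
from collections import Counter

def stratified_sample(records, stratum_of, sizes, seed=0):
    """Draw sizes[s] records without replacement from each stratum s."""
    rng = random.Random(seed)
    by_stratum = {}
    for rec in records:
        by_stratum.setdefault(stratum_of(rec), []).append(rec)
    sample = []
    for stratum, k in sizes.items():
        # random.sample raises ValueError if a stratum holds fewer than k records
        sample.extend(rng.sample(by_stratum[stratum], k))
    return sample

# Synthetic stand-in population (not KDC data); stratum counts are made up.
labels = ["Taeeumin"] * 2000 + ["Soeumin"] * 1500 + ["Soyangin"] * 1800
population = [{"id": i, "sc": sc} for i, sc in enumerate(labels)]

sizes = {"Taeeumin": 200, "Soeumin": 125, "Soyangin": 175}
teaching = stratified_sample(population, lambda r: r["sc"], sizes)
counts = Counter(r["sc"] for r in teaching)
print(len(teaching), dict(counts))
```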

A Naive Multiple Imputation Method for Ignorable Nonresponse

  • Lee, Seung-Chun
    • Communications for Statistical Applications and Methods / v.11 no.2 / pp.399-411 / 2004
  • A common method of handling nonresponse in sample surveys is to delete the affected cases, which may result in a substantial loss of cases. Thus, in certain situations, it is of interest to create a complete set of sample values. In this case, a popular approach is to impute the missing values in the sample by the mean or the median of responders. The difficulty with this method, which replaces each missing value with a single imputed value, is that inferences based on the completed dataset underestimate the precision of the inferential procedure. Various suggestions have been made to overcome the difficulty, but they might not be appropriate for public-use files where the user has only limited information about the reasons for nonresponse. In this note, a multiple imputation method is considered to create a complete dataset which might be used for all possible inferential procedures without misleading or underestimating the precision.

Classification for Imbalanced Breast Cancer Dataset Using Resampling Methods

  • Hana Babiker, Nassar
    • International Journal of Computer Science & Network Security / v.23 no.1 / pp.89-95 / 2023
  • Analyzing breast cancer patient files is becoming an exciting area of medical information analysis, especially with the increasing number of patient files. In this paper, breast cancer data is collected from Khartoum state hospital, and the dataset is classified into recurrence and no recurrence. The data is imbalanced, meaning that one of the two classes has more samples than the other. Several pre-processing techniques (resampling, attribute selection, and handling missing values) are applied to this imbalanced data, and then different classifier models are built. In the first experiment, five classifiers (ANN, REP TREE, SVM, and J48) are used, and in the second experiment, meta-learning algorithms (Bagging, Boosting, and Random subspace) are used. Finally, the ensemble model is used. The best result was obtained from the ensemble model (Boosting with J48), with the highest accuracy (95.2797%) among all the algorithms, followed by Bagging with J48 (90.559%) and Random subspace with J48 (84.2657%). The imbalanced breast cancer dataset was classified into recurrence and no recurrence with different classification algorithms, and the best result was obtained from the ensemble model.

Building Dataset of Sensor-only Facilities for Autonomous Cooperative Driving

  • Hyung Lee;Chulwoo Park;Handong Lee;Junhyuk Lee
    • Journal of the Korea Society of Computer and Information / v.29 no.1 / pp.21-30 / 2024
  • In this paper, we propose a method to build a sample dataset of the features of eight sensor-only facilities built as infrastructure for autonomous cooperative driving. The features are extracted from point cloud data acquired by LiDAR and built into the sample dataset for recognizing the facilities. To build the dataset, eight sensor-only facilities with high-brightness reflector sheets and a sensor acquisition system were developed. To extract the features of facilities located within a certain measurement distance from the acquired point cloud data, the DBSCAN method was applied to the points and a modified OTSU method to the reflected intensity, after which a cylindrical projection method was applied to the extracted points. The 3D point coordinates, the projected 2D coordinates, and the reflection intensity were set as the features of the facility, and the dataset was built along with labels. To check the effectiveness of the facility dataset built from LiDAR data, a common CNN model was selected, trained, and tested, showing an accuracy of about 90% or more and confirming the possibility of facility recognition. Through continuous experiments, we will improve the feature extraction algorithm for building the proposed dataset and improve its performance, and develop a dedicated model for recognizing sensor-only facilities for autonomous cooperative driving.
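The intensity step can be illustrated with the standard Otsu threshold, which picks the cut that maximizes the between-class variance of a histogram. The abstract refers to a modified OTSU method whose details are not given, so the sketch below is the textbook version only. The function name `otsu_threshold`, the 0-255 intensity range, and the sample values are assumptions.

```python
def otsu_threshold(values, levels=256):
    """Standard Otsu: return t maximizing between-class variance.

    Values <= t form the low-intensity class, values > t the high one.
    Assumes integer intensities in range(levels).
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w_low = 0          # number of values in the low class so far
    sum_low = 0        # sum of intensities in the low class so far
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue
        w_high = total - w_low
        if w_high == 0:
            break
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between = w_low * w_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

# Made-up reflection intensities: mostly dull returns plus a few bright
# returns such as a reflector sheet might produce.
intensities = [10] * 50 + [12] * 50 + [200] * 10 + [210] * 10
t = otsu_threshold(intensities)
bright = [v for v in intensities if v > t]
print(t, len(bright))
```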

Progress Report on Optical Spectroscopy of X-ray selected Intermediate-mass Black Holes

  • Kim, Minjin;Ho, Luis C.
    • The Bulletin of The Korean Astronomical Society / v.39 no.1 / pp.42.2-42.2 / 2014
  • We present high-resolution optical spectra of newly selected candidates of intermediate-mass black holes. The sample was selected based on the variability and spectral shape in X-ray. The spectra were taken with the Magellan 6.5 m Clay Telescope and cover the rest-frame region 3500-10000 Å. The high spectral resolution (R~4000) of the spectra allows us to estimate BH masses of the sources. Interestingly, the majority of the sample appears to have broad emission lines. Using this dataset, we will estimate the BH masses and Eddington ratios in order to understand their physical properties.

Re-SSS: Rebalancing Imbalanced Data Using Safe Sample Screening

  • Shi, Hongbo;Chen, Xin;Guo, Min
    • Journal of Information Processing Systems / v.17 no.1 / pp.89-106 / 2021
  • Different samples can have different effects on learning support vector machine (SVM) classifiers. To rebalance an imbalanced dataset, it is reasonable to reduce non-informative samples and add informative samples for learning classifiers. Safe sample screening can identify some of the non-informative samples and retain informative samples. This study developed a resampling algorithm for Rebalancing imbalanced data using Safe Sample Screening (Re-SSS), which is composed of selecting Informative Samples (Re-SSS-IS) and rebalancing via a Weighted SMOTE (Re-SSS-WSMOTE). Re-SSS-IS selects informative samples from the majority class and determines a suitable regularization parameter for the SVM, while Re-SSS-WSMOTE generates informative minority samples. Both Re-SSS-IS and Re-SSS-WSMOTE are based on safe sample screening. The experimental results show that Re-SSS can effectively improve the classification performance on imbalanced classification problems.
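For orientation, the interpolation step of plain SMOTE, on which the weighted variant builds, can be sketched as below. This is not Re-SSS-WSMOTE: the safe-sample-screening weights are not described in the abstract and are omitted. The function name `smote`, the neighbour count `k`, and the toy points are assumptions.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Plain SMOTE: interpolate between a minority point and one of its
    k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class on the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=10, k=2)
print(new_points[:3])
```

Because every synthetic point is a convex combination of two existing minority points, it always lies on the segment between them.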

Evaluation of Recent Data Processing Strategies on Q-TOF LC/MS Based Untargeted Metabolomics

  • Kaplan, Ozan;Celebier, Mustafa
    • Mass Spectrometry Letters / v.11 no.1 / pp.1-5 / 2020
  • In this study, some of the recently reported data processing strategies were evaluated and modified based on their capabilities, and a brief workflow for data mining was redefined for Q-TOF LC/MS based untargeted metabolomics. Commercial pooled human plasma samples were used for this purpose. An ultrafiltration procedure was applied for sample preparation. The sample set was analyzed through Q-TOF LC/MS. A C18 column (Agilent Zorbax 1.8 µm, 50 × 2.1 mm) was used for chromatographic separation. Raw chromatograms were processed using XCMS (R programming language edition), and Isotopologue Parameter Optimization (IPO) was used to optimize XCMS parameters. The raw XCMS table was processed using MS Excel to find reliable and reproducible peaks. In total, 1650 reliable and reproducible potential metabolite peaks were found based on the data processing procedures given in this paper. The redefined dataset was uploaded to the MetaboAnalyst platform, and the identified metabolites were matched with 86 metabolic pathways. Thus, two lists were obtained and are presented in this study as supplementary files. The first list presents the retention times and m/z values of the detected metabolite peaks. The second list contains the metabolic pathways related to the identified metabolites. The briefly described data processing strategies and the dataset presented in this study could help researchers working on untargeted metabolomics to process their data and validate their results.

Novel estimation based on a minimum distance under the progressive Type-II censoring scheme

  • Young Eun Jeon;Suk-Bok Kang;Jung-In Seo
    • Communications for Statistical Applications and Methods / v.30 no.4 / pp.411-421 / 2023
  • This paper provides a new estimation equation based on the concept of a minimum distance between the empirical and theoretical distribution functions under the most widely used progressive Type-II censoring scheme. For illustrative purposes, simulated and real datasets from a three-parameter Weibull distribution are analyzed. For comparison, the most popular estimation methods, the maximum likelihood and maximum product of spacings estimation methods, are developed together. In the analysis of simulated datasets, the excellence of the provided estimation method is demonstrated through the degree of the estimation failure of the likelihood-based method, and its validity is demonstrated through the mean squared errors and biases of the estimators obtained from the provided estimation equation. In the analysis of the real dataset, two types of goodness-of-fit tests are performed on whether the observed dataset has the three-parameter Weibull distribution under the progressive Type-II censoring scheme, through which the performance of the new estimation equation provided is examined.
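To make the censoring scheme concrete, the sketch below simulates a progressively Type-II censored sample: n units go on test, and at the i-th observed failure R_i surviving units are withdrawn at random, so that m failures are observed with m + R_1 + ... + R_m = n. It does not implement the paper's minimum-distance estimating equation. The function name, the removal scheme, and the Weibull parameters are hypothetical.

```python
import random

def progressive_type2_sample(lifetimes, removals, seed=0):
    """Return the m observed failure times under removal scheme R_1..R_m.

    Requires len(removals) + sum(removals) == len(lifetimes).
    """
    if len(removals) + sum(removals) != len(lifetimes):
        raise ValueError("m + sum(R_i) must equal n")
    rng = random.Random(seed)
    alive = sorted(lifetimes)              # units still on test, ascending
    observed = []
    for r in removals:
        observed.append(alive.pop(0))      # next failure among units on test
        for _ in range(r):                 # withdraw r survivors at random
            alive.pop(rng.randrange(len(alive)))
    return observed

# Three-parameter Weibull lifetimes: location + Weibull(scale, shape).
# The parameter values are made up for illustration.
rng = random.Random(1)
location, scale, shape = 5.0, 2.0, 1.5
lifetimes = [location + rng.weibullvariate(scale, shape) for _ in range(20)]

removals = [2, 0, 0, 3, 0, 0, 0, 7]        # m = 8 failures, 12 withdrawals
observed = progressive_type2_sample(lifetimes, removals)
print([round(x, 3) for x in observed])
```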

Selection of features and hidden Markov model parameters for English word recognition from Leap Motion air-writing trajectories

  • Deval Verma;Himanshu Agarwal;Amrish Kumar Aggarwal
    • ETRI Journal / v.46 no.2 / pp.250-262 / 2024
  • Air-writing recognition is relevant in areas such as natural human-computer interaction, augmented reality, and virtual reality. A trajectory is the most natural way to represent air writing. We analyze the recognition accuracy of words written in air considering five features, namely, writing direction, curvature, trajectory, orthocenter, and ellipsoid, as well as different parameters of a hidden Markov model classifier. Experiments were performed on two representative datasets, whose sample trajectories were collected using a Leap Motion Controller from a fingertip performing air writing. Dataset D1 contains 840 English words from 21 classes, and dataset D2 contains 1600 English words from 40 classes. A genetic algorithm was combined with a hidden Markov model classifier to obtain the best subset of features. The combination {trajectory, orthocenter, writing direction, curvature} provided the best feature set, achieving recognition accuracies on datasets D1 and D2 of 98.81% and 83.58%, respectively.
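One common way to define a writing-direction feature is the unit vector between consecutive trajectory samples. The sketch below uses that definition as an assumption; the abstract does not give the paper's exact formulation, and the function name `writing_direction` and the toy trajectory are hypothetical.

```python
import math

def writing_direction(trajectory):
    """Unit direction vector between each pair of consecutive points.

    Repeated points (zero displacement) yield a zero vector.
    """
    directions = []
    for p, q in zip(trajectory, trajectory[1:]):
        delta = [b - a for a, b in zip(p, q)]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm == 0:
            directions.append(tuple(0.0 for _ in delta))
        else:
            directions.append(tuple(d / norm for d in delta))
    return directions

# Toy 3D fingertip trajectory: move along x, then along y.
trajectory = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
features = writing_direction(trajectory)
print(features)
```

A sequence of such vectors, one per time step, is the kind of observation sequence a hidden Markov model classifier can consume.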