• Title/Summary/Keyword: datasets

A comparison of imputation methods using machine learning models

  • Heajung Suh;Jongwoo Song
    • Communications for Statistical Applications and Methods / v.30 no.3 / pp.331-341 / 2023
  • Handling missing values in data analysis is essential in constructing a good prediction model. The easiest way to handle missing values is to use complete case data, but this can lead to information loss within the data and invalid conclusions in data analysis. Imputation is a technique that replaces missing data with alternative values obtained from information in a dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputation. Recent methods include missForest, missRanger, and mixgb, all of which use machine learning algorithms. This paper compares the imputation techniques for datasets with mixed datatypes in various situations, such as data size, missing ratios, and missing mechanisms. To evaluate the performance of each method in mixed datasets, we propose a new imputation performance measure (IPM) that is a unified measurement applicable to numerical and categorical variables. We believe this metric can help find the best imputation method. Finally, we summarize the comparison results with imputation performances and computational times.
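
The K-nearest-neighbor imputation mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `knn_impute`, the choice of Euclidean distance over the observed columns, and the toy data are all assumptions made for illustration, and only numerical columns are handled (the paper itself targets mixed datatypes).

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest complete rows.

    Hypothetical helper: distance is Euclidean, computed only on the
    columns that are observed in the incomplete row.
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("at least one complete row is required")

    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(other, row=row, observed=observed):
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        # Keep the k complete rows closest to the incomplete one.
        neighbours = sorted(complete, key=dist)[:k]
        filled = [
            v if v is not None else sum(n[j] for n in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ]
        imputed.append(filled)
    return imputed

# Toy data (assumed): the last row is missing its second value.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 8.0, 7.0],
    [1.05, None, 3.1],
]
result = knn_impute(data, k=2)
print(result[3])
```

Here the two nearest complete rows are the first two, so the missing entry is filled with the mean of their second-column values. Complete-case analysis would instead discard the last row entirely.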

Model Independent Statistics in Cosmology

  • Keeley, Ryan E.;Shafieloo, Arman
    • The Bulletin of The Korean Astronomical Society / v.45 no.1 / pp.49.1-49.1 / 2020
  • In this talk, I will discuss a few different techniques to reconstruct different cosmological functions, such as the primordial power spectrum and the expansion history. These model-independent techniques are useful because they can discover surprising results in a way that nested modeling cannot. For instance, we can use the modified Richardson-Lucy algorithm to reconstruct a novel primordial power spectrum from the Planck data that can resolve the "Hubble tension". This novel primordial power spectrum has regular oscillatory features that would be difficult to find using parametric methods. Further, we can use Gaussian process regression to reconstruct the expansion history of the Universe from low-redshift distance datasets. We can also use this technique to test whether these datasets are consistent with one another, which essentially allows it to serve as a systematics finder.

Image Scene Classification of Multiclass (다중 클래스의 이미지 장면 분류)

  • Shin, Seong-Yoon;Lee, Hyun-Chang;Shin, Kwang-Seong;Kim, Hyung-Jin;Lee, Jae-Wan
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.10a / pp.551-552 / 2021
  • In this paper, we present a multi-class image scene classification method based on transfer learning. The method classifies natural scene images into multiple classes by relying on network models pre-trained on a large image dataset (ImageNet). In the experiment, we obtained excellent results by classifying Kaggle's Intel Image Classification dataset with an optimized ResNet model.

New Text Sentiment Classification Method (새로운 텍스트 감정 분류 방법)

  • Shin, Seong-Yoon;Lee, Hyun-Chang;Shin, Kwang-Seong;Kim, Hyung-Jin;Lee, Jae-Wan
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.10a / pp.553-554 / 2021
  • This paper proposes a fusion model based on LSTM and CNN deep learning techniques and obtains good results by applying it to multi-category news datasets. According to the experiment, the deep learning-based fusion model significantly improved the precision and accuracy of text sentiment classification.

Novel estimation based on a minimum distance under the progressive Type-II censoring scheme

  • Young Eun Jeon;Suk-Bok Kang;Jung-In Seo
    • Communications for Statistical Applications and Methods / v.30 no.4 / pp.411-421 / 2023
  • This paper provides a new estimation equation based on the concept of a minimum distance between the empirical and theoretical distribution functions under the most widely used progressive Type-II censoring scheme. For illustrative purposes, simulated and real datasets from a three-parameter Weibull distribution are analyzed. For comparison, the most popular estimation methods, the maximum likelihood and maximum product of spacings estimation methods, are developed together. In the analysis of simulated datasets, the excellence of the provided estimation method is demonstrated through the degree of the estimation failure of the likelihood-based method, and its validity is demonstrated through the mean squared errors and biases of the estimators obtained from the provided estimation equation. In the analysis of the real dataset, two types of goodness-of-fit tests are performed on whether the observed dataset has the three-parameter Weibull distribution under the progressive Type-II censoring scheme, through which the performance of the new estimation equation provided is examined.

Text Classification Method Using Deep Learning Model Fusion and Its Application

  • Shin, Seong-Yoon;Cho, Gwang-Hyun;Cho, Seung-Pyo;Lee, Hyun-Chang
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.10a / pp.409-410 / 2022
  • This paper proposes a fusion model based on long short-term memory (LSTM) networks and CNN deep learning methods. Applied to multi-category news datasets, the model achieved good results. Experiments show that the fusion model based on deep learning greatly improved the precision and accuracy of text sentiment classification. This method offers an important way to optimize the model and improve its performance.

Testing LCDM with eBOSS / SDSS

  • Keeley, Ryan E.;Shafieloo, Arman;Zhao, Gong-bo;Koo, Hanwool
    • The Bulletin of The Korean Astronomical Society / v.46 no.1 / pp.47.3-47.3 / 2021
  • In this talk, I will review recent progress that the SDSS-IV / eBOSS collaboration has made in constraining cosmology from the clustering of galaxies, quasars and the Lyman-alpha forest. The SDSS-IV / eBOSS collaboration has measured the baryon acoustic oscillation (BAO) and redshift space distortion (RSD) features in the correlation function in redshift bins from z~0.15 to z~2.33. These features constitute measurements of angular diameter distances, Hubble distances, and growth rates. A number of consistency tests have been performed between the BAO and RSD datasets and additional cosmological datasets such as the Planck cosmic microwave background constraints, the Pantheon Type Ia supernova compilation, and the weak lensing results from the Dark Energy Survey. Taken together, these joint constraints all point to a broad consistency with the standard model of cosmology, LCDM + GR, though they remain in tension with local measurements of the Hubble parameter.

Ensemble Gene Selection Method Based on Multiple Tree Models

  • Mingzhu Lou
    • Journal of Information Processing Systems / v.19 no.5 / pp.652-662 / 2023
  • Identifying highly discriminating genes is a critical step in tumor recognition tasks based on microarray gene expression profile data and machine learning. Gene selection based on tree models has been the subject of several studies. However, these methods are based on a single-tree model, which is often not robust to ultra-high-dimensional microarray datasets, resulting in the loss of useful information and unsatisfactory classification accuracy. Motivated by the limitations of single-tree-based gene selection, in this study, ensemble gene selection methods based on multiple-tree models were studied to improve the classification performance of tumor identification. Specifically, we selected the three most representative tree models: ID3, random forest, and gradient boosting decision tree. Each tree model selects top-n genes from the microarray dataset based on its intrinsic mechanism. Subsequently, three ensemble gene selection methods were investigated, namely multiple-tree model intersection, multiple-tree module union, and multiple-tree module cross-union. Experimental results on five benchmark public microarray gene expression datasets proved that the multiple-tree module union is significantly superior to gene selection based on a single tree model and other competitive gene selection methods in classification accuracy.
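
The intersection and union ensembles described in the abstract above can be sketched with plain set operations. This is only an illustration under stated assumptions: the per-gene importance scores, the gene names, and the helper `top_n` are hypothetical, and the cross-union variant is omitted because the abstract does not define it.

```python
def top_n(scores, n):
    """Return the set of the n genes with the highest importance score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])

# Hypothetical per-gene importance scores, one dict per tree model.
id3_scores = {"g1": 0.9, "g2": 0.7, "g3": 0.1, "g4": 0.4}
rf_scores = {"g1": 0.8, "g2": 0.2, "g3": 0.6, "g4": 0.5}
gbdt_scores = {"g1": 0.7, "g2": 0.6, "g3": 0.3, "g4": 0.1}

# Each model contributes its own top-n selection (n = 2 here).
selections = [top_n(s, 2) for s in (id3_scores, rf_scores, gbdt_scores)]

# Intersection keeps genes that every model selected;
# union keeps genes that at least one model selected.
intersection = set.intersection(*selections)
union = set.union(*selections)

print(sorted(intersection))
print(sorted(union))
```

The union retains more candidate genes than any single model's selection, while the intersection is the most conservative choice.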

Reliable Fault Diagnosis Method Based on An Optimized Deep Belief Network for Gearbox

  • Oybek Eraliev;Ozodbek Xakimov;Chul-Hee Lee
    • Journal of Drive and Control / v.20 no.4 / pp.54-63 / 2023
  • High and intermittent loading cycles induce fatigue damage to transmission components, resulting in premature gearbox failure. To identify gearbox defects, numerous vibration-based diagnostics techniques, using several artificial intelligence (AI) algorithms, have recently been presented. In this paper, an optimized deep belief network (DBN) model for gearbox problem diagnosis was designed based on time-frequency visual pattern identification. To optimize the hyperparameters of the model, a particle swarm optimization (PSO) approach was integrated into the DBN. The proposed model was tested on two gearbox datasets: a wind turbine gearbox and an experimental gearbox. The optimized DBN model demonstrated strong and robust performance in classification accuracy. In addition, the accuracy of the generated datasets was compared using traditional ML and DL algorithms. Furthermore, the proposed model was evaluated on different partitions of the dataset. The results showed that, even with a small amount of sample data, the optimized DBN model achieved high accuracy in diagnosis.

Selection of features and hidden Markov model parameters for English word recognition from Leap Motion air-writing trajectories

  • Deval Verma;Himanshu Agarwal;Amrish Kumar Aggarwal
    • ETRI Journal / v.46 no.2 / pp.250-262 / 2024
  • Air-writing recognition is relevant in areas such as natural human-computer interaction, augmented reality, and virtual reality. A trajectory is the most natural way to represent air writing. We analyze the recognition accuracy of words written in air considering five features, namely, writing direction, curvature, trajectory, orthocenter, and ellipsoid, as well as different parameters of a hidden Markov model classifier. Experiments were performed on two representative datasets, whose sample trajectories were collected using a Leap Motion Controller from a fingertip performing air writing. Dataset D1 contains 840 English words from 21 classes, and dataset D2 contains 1600 English words from 40 classes. A genetic algorithm was combined with a hidden Markov model classifier to obtain the best subset of features. The combination {trajectory, orthocenter, writing direction, curvature} provided the best feature set, achieving recognition accuracies on datasets D1 and D2 of 98.81% and 83.58%, respectively.