• Title/Summary/Keyword: Datasets

Cross-Project Pooling of Defects for Handling Class Imbalance

  • Catherine, J.M.;Djodilatchoumy, S
    • International Journal of Computer Science & Network Security / v.22 no.10 / pp.11-16 / 2022
  • Applying predictive analytics to software defect prediction has improved overall software quality and decreased maintenance costs. Many supervised and unsupervised learning algorithms have been used for defect prediction on publicly available datasets. Most of these datasets suffer from an imbalance in the output classes. We study the impact of class imbalance in defect datasets on the efficiency of the defect prediction model and propose a cross-project pooling (CPP) method for handling the imbalance. Performance is evaluated using the Matthews Correlation Coefficient (MCC), Recall, and Accuracy. The proposed sampling technique shows significant improvement in the efficiency of the classifier in predicting defects.
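
The three evaluation measures named in this abstract can be computed from a binary confusion matrix. The following is a minimal sketch, not the authors' code; the function names and the toy labels (1 = defective, 0 = clean) are assumptions made for illustration. It shows why accuracy alone can mislead on an imbalanced defect dataset.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = defective, 0 = clean)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, tn, fp, fn):
    # Fraction of truly defective modules that were flagged.
    return tp / (tp + fn) if (tp + fn) else 0.0

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical imbalanced sample: 5 defective modules, 95 clean ones,
# and a trivial classifier that labels everything as clean.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts), recall(*counts), mcc(*counts))
```

In this toy case the trivial classifier scores high on accuracy while finding no defects at all, which is why MCC and recall are reported alongside accuracy for imbalanced data.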

A Brief Survey into the Field of Automatic Image Dataset Generation through Web Scraping and Query Expansion

  • Bart Dikmans;Dongwann Kang
    • Journal of Information Processing Systems / v.19 no.5 / pp.602-613 / 2023
  • High-quality image datasets are in high demand for various applications. Although many online sources provide manually collected datasets, fully automating the dataset collection process remains a persistent challenge. In this study, we survey the field of automatic image dataset generation by analyzing a collection of existing studies. We also examine fields closely related to automated dataset generation, such as query expansion, web scraping, and dataset quality. We assess how both noise and regional search engine differences can be addressed using automated search query expansion focused on hypernyms, while still allowing for user-specific manual query expansion. Combining these aspects provides an outline of how a modern web scraping application can produce large-scale image datasets.

A Density Peak Clustering Algorithm Based on Information Bottleneck

  • Yongli Liu;Congcong Zhao;Hao Chao
    • Journal of Information Processing Systems / v.19 no.6 / pp.778-790 / 2023
  • Although density peak clustering can often easily yield excellent results, there is still room for improvement when dealing with complex, high-dimensional datasets. One of the main limitations of this algorithm is its reliance on geometric distance as the sole similarity measurement. To address this limitation, we draw inspiration from information bottleneck theory and propose a novel density peak clustering algorithm that incorporates this theory as a similarity measure. Specifically, our algorithm utilizes the joint probability distribution between data objects and feature information, and employs the loss of mutual information as the measurement standard. This approach not only eliminates the potential for subjective error in selecting a similarity method, but also enhances performance on datasets with multiple centers and high dimensionality. To evaluate the effectiveness of our algorithm, we conducted experiments using ten carefully selected datasets and compared the results with three other algorithms. The experimental results demonstrate that our information bottleneck-based density peak clustering (IBDPC) algorithm consistently achieves high levels of accuracy, highlighting its potential as a valuable tool for data clustering tasks.
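
For context, the baseline that this abstract sets out to improve can be sketched as follows. This is a minimal illustration of standard density peak clustering scores using Euclidean distance, which is the "geometric distance" limitation the abstract mentions; it is not the IBDPC algorithm, and the function name, cutoff value, and sample points are assumptions.

```python
import math

def density_peak_scores(points, cutoff):
    """Return (rho, delta) for each point.

    rho[i]   = number of other points closer than `cutoff` (local density).
    delta[i] = distance to the nearest point with strictly higher density,
               or the largest distance from i if no such point exists.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical 2D sample with two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = density_peak_scores(points, cutoff=1.5)
print(rho, delta)
```

Points that combine a high rho with a high delta are candidate cluster centers. A variant such as the one described above would replace the Euclidean `dist` matrix with a different similarity measure.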

A Review of Public Datasets for Keystroke-based Behavior Analysis

  • Kolmogortseva Karina;Soo-Hyung Kim;Aera Kim
    • Smart Media Journal / v.13 no.7 / pp.18-26 / 2024
  • One of the newest trends in AI is emotion recognition utilizing keystroke dynamics, which leverages biometric data to identify users and assess emotional states. This work offers a comparison of four datasets that are frequently used to research keystroke dynamics: BB-MAS, Buffalo, Clarkson II, and CMU. The datasets contain different types of data, both behavioral and physiological biometric data, gathered in a range of environments, from controlled labs to real work environments. We consider the benefits and drawbacks of each dataset, paying particular attention to how well it can be used for tasks like emotion recognition and behavioral analysis. Our findings demonstrate how user attributes, task circumstances, and ambient elements affect typing behavior. This comparative analysis aims to guide future research and development of applications for emotion detection and biometrics, emphasizing the importance of collecting diverse data and the possibility of integrating keystroke dynamics with other biometric measurements.

Soft Computing Optimized Models for Plant Leaf Classification Using Small Datasets

  • Priya;Jasmeen Gill
    • International Journal of Computer Science & Network Security / v.24 no.8 / pp.72-84 / 2024
  • Plant leaf classification is an important task given the use of plants in the real world, whether for medicinal purposes or in the agricultural sector. Accurate identification of plants matters because numerous poisonous plants can prove fatal if mistakenly consumed or used by humans. Furthermore, in agriculture, detecting certain kinds of weeds can be significant for protecting crops against such unwanted plants. In general, Artificial Neural Networks (ANNs) are a suitable candidate for image classification when only small datasets are available. However, they suffer from local minima problems, which can be effectively resolved using global optimization techniques. To address this issue, this paper presents an automated plant leaf classification system using optimized soft computing models in which ANNs are optimized using the Grasshopper Optimization Algorithm (GOA). The proposed model outperformed the state-of-the-art techniques when compared with a simple ANN and a particle swarm optimization-based ANN. Results show that the proposed GOA-ANN based plant leaf classification system is a promising technique for small image datasets.

Training of a Siamese Network to Build a Tracker without Using Tracking Labels (샴 네트워크를 사용하여 추적 레이블을 사용하지 않는 다중 객체 검출 및 추적기 학습에 관한 연구)

  • Kang, Jungyu;Song, Yoo-Seung;Min, Kyoung-Wook;Choi, Jeong Dan
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.21 no.5 / pp.274-286 / 2022
  • Multi-object tracking has long been studied in computer vision and plays a critical role in applications such as autonomous driving and driving assistance. Multi-object tracking techniques generally consist of a detector that detects objects and a tracker that tracks the detected objects. Various publicly available datasets allow us to train a detector model without much effort. However, there are relatively few publicly available datasets for training a tracker model, and building one's own tracker dataset takes a long time compared to building a detector dataset. Hence, the detector is often developed separately from the tracker module. However, a separate tracker must be adjusted whenever the detector model it depends on is changed. This study proposes a system that can train a model that performs detection and tracking simultaneously using only the detector training datasets. In particular, a Siamese network with augmentation is used to compose the detector and tracker. Experiments are conducted on public datasets to verify that the proposed algorithm can yield a real-time multi-object tracker comparable to state-of-the-art tracker models.

FinBERT Fine-Tuning for Sentiment Analysis: Exploring the Effectiveness of Datasets and Hyperparameters (감성 분석을 위한 FinBERT 미세 조정: 데이터 세트와 하이퍼파라미터의 효과성 탐구)

  • Jae Heon Kim;Hui Do Jung;Beakcheol Jang
    • Journal of Internet Computing and Services / v.24 no.4 / pp.127-135 / 2023
  • This research paper explores the application of FinBERT, a BERT variant pre-trained on financial-domain text, to sentiment analysis in the financial domain, focusing on the process of identifying suitable training data and hyperparameters. Our goal is to offer a comprehensive guide to using the FinBERT model effectively for accurate sentiment analysis by employing various datasets and fine-tuning hyperparameters. We outline the architecture and workflow of the proposed approach for fine-tuning the FinBERT model, emphasizing how various datasets and hyperparameters affect performance on sentiment analysis tasks. Additionally, we verify the reliability of GPT-3 as an annotator by using it for sentiment labeling tasks. Our results show that the fine-tuned FinBERT model excels across a range of datasets and that the optimal combination is a learning rate of 5e-5 and a batch size of 64, which performs consistently well across all datasets. Furthermore, because the FinBERT model improved significantly more with our general-domain Twitter data than with our general-domain news data, we question whether the model should be further pre-trained only on financial news data. We simplify the complex process of determining the optimal approach to the FinBERT model and provide guidelines for selecting additional training datasets and hyperparameters within the fine-tuning process of financial sentiment analysis models.

Vegetation Height and Age Estimation using Shuttle Radar Topography Mission and National Elevation Datasets (SRTM과 NED를 이용한 식생수고 및 수령 추정)

  • Kim Jin-Woo;Heo Joon;Sohn Hong-Gyoo
    • Proceedings of the KSRS Conference / 2006.03a / pp.127-130 / 2006
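  • We used SRTM data and the USGS NED (National Elevation Datasets) data, and obtained a vegetation height map by differencing the two. We also sought to determine whether a correlation exists by comparing the difference values with the planting years contained in the shape file. Regression analysis confirmed a strong correlation between the differenced data and the planting year, showing that stand age estimation and the mapping of age information are feasible. In addition, we examined whether the linearity is affected by regional terrain characteristics, forest uniformity, and similar factors.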
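
The two steps described in this abstract, differencing two elevation grids and regressing the difference against planting year, can be sketched as below. This is a minimal illustration under stated assumptions: the grids are already co-registered, the function names are hypothetical, and all numeric values are invented for the example rather than taken from the paper.

```python
def vegetation_height(srtm, ned):
    """Cell-by-cell difference of two co-registered elevation grids."""
    return [[s - n for s, n in zip(row_s, row_n)]
            for row_s, row_n in zip(srtm, ned)]

def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented elevations in meters for a 2x2 grid.
srtm = [[112.0, 118.0], [109.0, 121.0]]
ned = [[100.0, 103.0], [101.0, 104.0]]
heights = vegetation_height(srtm, ned)

# Invented planting years for the same four cells, flattened row by row.
planting_year = [1985, 1978, 1990, 1975]
flat_heights = [h for row in heights for h in row]
slope, intercept = linear_fit(planting_year, flat_heights)
print(heights, slope, intercept)
```

A negative slope in such a fit would mean that stands planted earlier are taller, which is the kind of relationship that would make stand age estimation from the height difference possible.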

Map Integration Method using Relative Location (상대적 위치를 이용한 지도통합 방법 : 랜드마크 선정을 중심으로)

  • Kim, Jung-Ok;Park, Jae-June;Yu, Ki-Yun
    • Proceedings of the Korean Society of Surveying, Geodesy, Photogrammetry, and Cartography Conference / 2010.04a / pp.3-4 / 2010
  • Map integration usually involves matching the common spatial objects in different datasets. Recent studies have matched objects using relative location, defined by the spatial relationship between an object and its neighboring landmark. The landmark selection process is therefore an important part of map integration using relative location. In this research, we describe an approach for determining landmarks automatically in different geospatial datasets.

Reconstructing the cosmic density field based on the generative adversarial network.

  • Shi, Feng
    • The Bulletin of The Korean Astronomical Society / v.45 no.1 / pp.50.1-50.1 / 2020
  • In this talk, I will introduce recent work on reconstructing the cosmic density field based on a generative adversarial network (GAN). I will show the performance of the GAN compared to the traditional U-Net architecture. I would also like to discuss a three-channel 2D dataset used in training to recover the 3D density field. Finally, I will present some performance tests based on the test datasets.
