• Title/Abstract/Keywords: datasets

2,091 results (processing time: 0.022 s)

A Study on the Complementary Method of Aerial Image Learning Dataset Using Cycle Generative Adversarial Network (CycleGAN을 활용한 항공영상 학습 데이터 셋 보완 기법에 관한 연구)

  • Choi, Hyeoung Wook;Lee, Seung Hyeon;Kim, Hyeong Hun;Suh, Yong Cheol
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / Vol. 38, No. 6 / pp. 499-509 / 2020
  • This study explores how to build training data for artificial-intelligence-based object classification. Such data have recently been investigated in image classification and have great potential for use. Recognizing and extracting objects with reasonable accuracy using artificial intelligence requires a large amount of training data. However, there are currently not enough shared datasets available for object recognition training. In addition, generating data requires long hours of work, high expense, and labor. Therefore, in the present study, a small amount of initial aerial image training data was fed to a GAN (Generative Adversarial Network)-based generator network in order to build image training data. The experiment also evaluated the quality of the generated data to determine whether they can be used as additional training datasets. Oversampling training data with a GAN can supplement the amount of training data, which has a crucial influence on deep learning. As a result, this method is expected to be effective particularly when the initial dataset is insufficient.

Experimental Study on Application of an Anomaly Detection Algorithm in Electric Current Datasets Generated from Marine Air Compressor with Time-series Features (시계열 특징을 갖는 선박용 공기 압축기 전류 데이터의 이상 탐지 알고리즘 적용 실험)

  • Lee, Jung-Hyung
    • Journal of the Korean Society of Marine Environment & Safety / Vol. 27, No. 1 / pp. 127-134 / 2021
  • In this study, an anomaly detection (AD) algorithm was implemented to detect the failure of a marine air compressor. A lab-scale experiment was designed to produce fault datasets (time-series electric current measurements) for 10 failure modes of the air compressor. The results demonstrated that the temporal pattern of the datasets showed periodicity with a different period, depending on the failure mode. An AD model with a convolutional autoencoder was developed and trained based on a normal operation dataset. The reconstruction error was used as the threshold for AD. The reconstruction error was noted to be dependent on the AD model and hyperparameter tuning. The AD model was applied to the synthetic dataset, which comprised both normal and abnormal conditions of the air compressor for validation. The AD model exhibited good detection performance on anomalies showing periodicity but poor performance on anomalies resulting from subtle load changes in the motor.
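
The thresholding step described in this abstract can be sketched as follows. This is a minimal illustration, not the author's implementation: the autoencoder itself is omitted, and the function names, the mean-squared-error measure, and the mean-plus-three-standard-deviations rule for choosing the threshold are all assumptions, since the abstract does not specify them.

```python
import statistics


def reconstruction_error(window, reconstruction):
    """Mean squared error between an input window and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(window, reconstruction)) / len(window)


def fit_threshold(normal_errors, margin=3.0):
    """Assumed rule: mean + margin * std of errors seen on normal operation data."""
    mean = statistics.mean(normal_errors)
    std = statistics.pstdev(normal_errors)
    return mean + margin * std


def flag_anomalies(errors, threshold):
    """Mark every window whose reconstruction error exceeds the threshold."""
    return [err > threshold for err in errors]


# Hypothetical error values, for illustration only.
normal_errors = [0.1, 0.2, 0.1, 0.2]
threshold = fit_threshold(normal_errors)
flags = flag_anomalies([0.25, 0.5], threshold)
```

Because the threshold is fitted on normal data only, an anomaly whose error stays close to the normal range would not be flagged, which is in line with the poor performance the abstract reports for subtle load changes.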

Developing and Pre-Processing a Dataset using a Rhetorical Relation to Build a Question-Answering System based on an Unsupervised Learning Approach

  • Dutta, Ashit Kumar;Wahab sait, Abdul Rahaman;Keshta, Ismail Mohamed;Elhalles, Abheer
    • International Journal of Computer Science & Network Security / Vol. 21, No. 11 / pp. 199-206 / 2021
  • Rhetorical relations between two text fragments are essential information that supports natural language processing applications, such as question-answering (QA) systems and automatic text summarization, in producing effective output. A QA system helps users retrieve a meaningful response. Developing such a system to interpret and respond to user requests calls for rhetorical-relation-based datasets. Only a limited number of datasets exist for developing an Arabic QA system; thus, effective QA systems for the Arabic language are lacking. Recent research reveals that unsupervised learning can help a QA system reply to users' queries. In this study, the researchers develop a rhetorical-relation-based dataset for implementing unsupervised learning applications. A web crawler is developed to crawl Arabic content from the web. A discourse-annotated corpus is generated using Rhetorical Structure Theory. A Naïve Bayes-based QA system is developed to evaluate the performance of the datasets. The outcome shows that the QA system performs better with the proposed dataset and is able to answer user queries with an appropriate response. In addition, the results on fine-grained and coarse-grained relations reveal that the dataset is highly reliable.

A Study on METS Design Using DDI Metadata (DDI 메타데이터를 활용한 METS 설계에 관한 연구)

  • Park, Jin Ho
    • Journal of the Korean Society for Information Management / Vol. 38, No. 4 / pp. 153-171 / 2021
  • This study suggests a method of using METS based on DDI metadata to manage, preserve, and serve datasets. DDI is a standard for statistical data processing, and there are currently two versions: DDI Codebook (DDI-C) and DDI Lifecycle (DDI-L). This study mainly used the main elements of DDI-C. First, the structures and elements of METS and DDI-C were analyzed. Next, the major elements of METS and DDI-C were mapped. Finally, METS was adopted as the base standard, that is, the format in which the result is expressed. Since METS and DDI-C do not show a perfect 1:1 mapping, the DDI-C element that best matches each element of the standard METS was selected. As a result, a new dataset management and transmission standard, METS using DDI-C metadata elements, was designed and presented.

Model selection via Bayesian information criterion for divide-and-conquer penalized quantile regression (베이즈 정보 기준을 활용한 분할-정복 벌점화 분위수 회귀)

  • Kang, Jongkyeong;Han, Seokwon;Bang, Sungwan
    • The Korean Journal of Applied Statistics / Vol. 35, No. 2 / pp. 217-227 / 2022
  • Quantile regression is widely used in many fields because it provides an efficient tool for examining complex information latent in variables. However, modern large-scale, high-dimensional data make it very difficult to estimate a quantile regression model because of limits on computation time and storage space. Divide-and-conquer is a technique that divides the entire data into several sub-datasets that are easy to compute on and then reconstructs the estimates for the entire data using only the summary statistics from each sub-dataset. In this paper, we study a variable selection method that uses the Bayesian information criterion by applying the divide-and-conquer technique to penalized quantile regression. When the number of sub-datasets is properly selected, the proposed method is computationally efficient and gives variable selection results as consistent as those of classical quantile regression estimates calculated with the entire data. The advantages of the proposed method were confirmed through simulations and real data analysis.
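
The divide-and-conquer idea in this abstract can be illustrated with a much simpler estimator than the paper's. The sketch below is an assumption-laden toy: it estimates a plain sample median (the 0.5 quantile) per block and averages the block estimates, rather than fitting a penalized quantile regression or applying the Bayesian information criterion. The function names are hypothetical; only the check loss is the standard quantile regression loss.

```python
import statistics


def check_loss(residual, tau):
    """Quantile (pinball) loss: residual * (tau - 1 if residual < 0 else tau)."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def split_blocks(data, k):
    """Split data into k contiguous blocks of near-equal size."""
    size, rem = divmod(len(data), k)
    blocks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < rem else 0)
        blocks.append(data[start:end])
        start = end
    return blocks


def dc_median(data, k):
    """Average the per-block medians; only one summary number per block is kept."""
    block_medians = [statistics.median(block) for block in split_blocks(data, k)]
    return statistics.mean(block_medians)


# Hypothetical data: the block-wise estimate matches the full-data median here.
data = list(range(1, 13))
combined = dc_median(data, 3)
full = statistics.median(data)
```

Each block is processed independently, so only one block needs to be held in memory at a time, which is the computational benefit the abstract refers to.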

Dataset Search System Using Metadata-Based Ranking Algorithm (메타데이터 기반 순위 알고리즘을 활용한 데이터셋 검색 시스템)

  • Choi, Wooyoung;Chun, Jonghoon
    • Journal of Broadcast Engineering / Vol. 27, No. 4 / pp. 581-592 / 2022
  • Recently, as the demand for using big data has increased, interest in the dataset search technology needed for data analysis has also grown. Unlike conventional text search, dataset search needs to make proactive use of metadata, yet research on such dataset search systems has not been actively carried out. In this paper, we propose a new dataset-tailored search system that indexes the metadata of datasets and performs dataset search based on the metadata indices. Search results are ranked by a newly devised algorithm that reflects the unique characteristics of datasets. The system can also search for additional datasets that correlate with the dataset retrieved by the user-submitted query, so that multiple datasets needed for analysis can be found at once.

A study of Battery User Pattern Change tracking method using Linear Regression and ARIMA Model (선형회귀 및 ARIMA 모델을 이용한 배터리 사용자 패턴 변화 추적 연구)

  • Park, Jong-Yong;Yoo, Min-Hyeok;Nho, Tae-Min;Shin, Dae-Kyeon;Kim, Seong-Kweon
    • The Journal of the Korea Institute of Electronic Communication Sciences / Vol. 17, No. 3 / pp. 423-432 / 2022
  • This paper addresses the safety concern that the state of health (SOH) of batteries in electric vehicles decreases sharply when drivers change or their driving patterns change. Such a change can overload the battery, reduce the battery life, and induce safety issues. This paper aims to display changes in SOH on an electric vehicle's dashboard in real time in response to user pattern changes. For training, we used battery data from the datasets provided by NASA, built models based on linear regression and ARIMA, and used the trained models to predict new battery data that contained user changes. The predictions show that linear regression is better at predicting some changes in SOH caused by a user's pattern change when more battery datasets covering a wide range of independent variable values are available. The ARIMA model can be used when the battery datasets contain only SOH data.

Detects depression-related emotions in user input sentences (사용자 입력 문장에서 우울 관련 감정 탐지)

  • Oh, Jaedong;Oh, Hayoung
    • Journal of the Korea Institute of Information and Communication Engineering / Vol. 26, No. 12 / pp. 1759-1768 / 2022
  • This paper proposes a model to detect depression-related emotions in a user's speech using wellness dialogue scripts provided by AI Hub, topic-specific daily conversation datasets, and chatbot datasets published on GitHub. The depression-related emotions comprise 18 emotions, including depression and lethargy, and emotion classification is performed using the KoBERT and KoELECTRA models, which show high performance among language models. To compare performance across models, we build diverse datasets and compare classification results while adjusting batch sizes and learning rates for the models that perform well. Furthermore, to reflect that a person can feel multiple emotions at the same time, a multi-label classification task is performed by selecting all labels whose output values are higher than a specific threshold as the correct answer. The best-performing model derived through this process is called the Depression model, and it is then used to classify depression-related emotions in user utterances.
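
The threshold-based label selection mentioned in this abstract can be sketched as below. This is only an illustration under assumptions: the function name, the 0.5 threshold, the score values, and the "anger" label are hypothetical ("depression" and "lethargy" are the two labels named in the abstract), and the classifier that produces the scores is omitted.

```python
def select_labels(scores, threshold=0.5):
    """Return every label whose output value is higher than the threshold."""
    return [label for label, score in scores.items() if score > threshold]


# Hypothetical per-label output values for one user utterance.
scores = {"depression": 0.9, "lethargy": 0.6, "anger": 0.2}
predicted = select_labels(scores, threshold=0.5)
```

Unlike picking the single highest-scoring label, this rule can return several labels or none, which is what lets one utterance carry multiple emotions.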

Research on data augmentation algorithm for time series based on deep learning

  • Shiyu Liu;Hongyan Qiao;Lianhong Yuan;Yuan Yuan;Jun Liu
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 17, No. 6 / pp. 1530-1544 / 2023
  • Data monitoring is an important foundation of modern science. In most cases, monitoring data are time-series data, which have high application value. Deep learning algorithms have a strong nonlinear fitting capability, which enables time series recognition by capturing anomalous information in the series. At present, research on deep-learning-based time series recognition is especially important for data monitoring. Deep learning algorithms require a large amount of data for training. However, abnormal samples make up only a small share of a time series, so the resulting class imbalance can seriously affect the accuracy of a recognition algorithm. To increase the number of abnormal samples, a data augmentation method called GANBATS (GAN-based Bi-LSTM and Attention for Time Series) is proposed. In GANBATS, a Bi-LSTM is introduced to extract timing features, which are then passed to the generator network. GANBATS also modifies the discriminator network by adding an attention mechanism to achieve global attention over the time series. At the end of the discriminator, GANBATS adds an average pooling layer, which merges temporal features to improve operational efficiency. In this paper, four time series datasets and five data augmentation algorithms are used for comparison experiments. The generated data are measured by PRD (Percent Root Mean Square Difference) and DTW (Dynamic Time Warping). The experimental results show that GANBATS reduces the PRD metric by up to 26.22 and the DTW metric by up to 9.45. In addition, this paper uses different algorithms to reconstruct the datasets and compares them by classification accuracy. The classification accuracy is improved by 6.44%-12.96% on the four time series datasets.
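
The two evaluation metrics named in this abstract can be computed as in the sketch below. It uses the commonly cited textbook forms: PRD as 100 times the square root of the squared error divided by the energy of the original signal, and DTW as the classic dynamic-programming alignment cost with absolute difference as the local distance. The abstract does not say which variants or normalizations the authors used, so these choices and the function names are assumptions.

```python
import math


def prd(original, generated):
    """Percent Root Mean Square Difference between two equal-length series."""
    num = sum((x - y) ** 2 for x, y in zip(original, generated))
    den = sum(x ** 2 for x in original)
    return 100.0 * math.sqrt(num / den)


def dtw(a, b):
    """Dynamic Time Warping cost with |x - y| as the local distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical series: a stretched copy aligns with zero DTW cost.
stretched_cost = dtw([1, 2, 3], [1, 2, 2, 3])
```

Lower values of both metrics mean the generated series is closer to the original, which is why the abstract reports the results as reductions.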

Enhancing Recommender Systems by Fusing Diverse Information Sources through Data Transformation and Feature Selection

  • Thi-Linh Ho;Anh-Cuong Le;Dinh-Hong Vu
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 17, No. 5 / pp. 1413-1432 / 2023
  • Recommender systems aim to recommend items to users by taking into account their probable interests. This study focuses on creating a model that uses multiple sources of information about users and items through a multimodal approach. The study addresses how to gather information from different sources (modalities) and transform it into a uniform format, resulting in a multi-modal feature description for users and items. This work also aims to transform and represent the features extracted from different modalities so that the information is in a compatible format for integration and contains important, useful information for the prediction model. To achieve this goal, we propose a novel multi-modal recommendation model, which extracts latent features of users and items from a utility matrix using matrix factorization techniques. Various transformation techniques are used to extract features from other sources of information such as user reviews, item descriptions, and item categories. We also propose the use of Principal Component Analysis (PCA) and feature selection techniques to reduce the data dimension, extract important features, and remove noisy features in order to increase the accuracy of the model. We ran experiments with several models built on different subsets of modalities on the MovieLens and Amazon sub-category datasets. According to the experimental results, the proposed model significantly enhances the accuracy of recommendations compared to SVD, which is acknowledged as one of the most effective models for recommender systems. Specifically, the proposed model reduces the RMSE by 4.8% to 21.43% and increases the Precision by 2.07% to 26.49% on the Amazon datasets. Similarly, on the MovieLens dataset, the proposed model reduces the RMSE by 45.61% and increases the Precision by 14.06%. Additionally, the experimental results on both datasets demonstrate that combining information from multiple modalities in the proposed model leads to superior outcomes compared to relying on a single type of information.