• Title/Abstract/Keywords: datasets

Search results: 2,012 items; processing time: 0.028 s

Research on Diversification of Transfer Specifications and Reproduction Methods for Administrative Information Datasets

  • 양동민;최광훈;김지혜;유남희
    • Journal of the Korean Society for Information Management / Vol. 40, No. 4 / pp. 167-200 / 2023
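  • In Korea, records management practice for administrative information datasets recommends SIARD as the transfer specification when such datasets are transferred. However, SIARD is often unsuitable because of the unit of records management used for administrative information datasets, the technical limitations of the tools that support SIARD, and the practical circumstances of public institutions. This study proposes ways to diversify the transfer specifications for administrative information datasets beyond SIARD. In addition, although the need to reproduce the user interfaces linked to datasets continues to be discussed in dataset records management, no concrete method has been presented. From the perspective of significant properties, this study confirms that the user interface is also a property that should be preserved, presents a method for reproducing the user interface effectively, and provides a case in which the method was verified in practice.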

A Study on the Improvement of the Management Reference Tables for Datasets in Administrative Information Systems

  • 이정은;김지혜;왕호성;양동민
    • Journal of Korean Society of Archives and Records Management / Vol. 22, No. 1 / pp. 177-200 / 2022
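  • Administrative information datasets are records produced in the course of an organization's business activities. They are not only evidence of recorded acts but also contain a great deal of information that can be used in business. Datasets, long in the shadows of records management practice, gained a legal basis for being managed as records through the 2020 revision of the law. Accordingly, dataset records management has already begun gradually, centered on the institutions that need it. The core of dataset records management lies in preparing the management reference table. However, practitioners report difficulties caused by confusion with the concept of the records management reference table and by the appearance of unfamiliar concepts. Against this background, this study revisits the problems that have emerged in the early stage of dataset records management and proposes ways to establish dataset records management more effectively. To that end, it takes the management reference table as its subject, summarizes the problems discussed so far, and analyzes the items of the current management reference table. As a result, the study proposes simplifying the items of the management reference table, reorganizing its areas, introducing the concept of a holding period, and defining a process for preparing the management reference table.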

Comprehensive analysis of deep learning-based target classifiers in small and imbalanced active sonar datasets

  • 김근환;황용상;신성진;김주호;황수복;추영민
    • The Journal of the Acoustical Society of Korea / Vol. 42, No. 4 / pp. 329-344 / 2023
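  • This paper comprehensively analyzes the generalization performance of various deep learning-based target classifiers applied to small and imbalanced active sonar datasets. Two active sonar datasets were generated from active sonar experimental data collected at different times and in different sea areas. Each sample in the datasets is a time-frequency domain image extracted from the audio signal detected after detection processing. As the neural network models of the target classifier, 22 Convolutional Neural Network (CNN) models with various architectures were used. In the experiments, the two datasets were used alternately as the training/validation dataset and the test dataset, and, to compute the variability of the classifier output, training/validation/testing was repeated 10 times and the target classification performance was analyzed. The training hyperparameters were optimized using Bayesian optimization. The results show that the shallow CNN model designed in this paper is more robust and generalizes better than most of the deep CNN models. This paper can be useful in setting the direction of future research on deep learning-based active sonar target classification.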

Non-negligible Occurrence of Errors in Gender Description in Public Data Sets

  • Kim, Jong Hwan;Park, Jong-Luyl;Kim, Seon-Young
    • Genomics & Informatics / Vol. 14, No. 1 / pp. 34-40 / 2016
  • Due to advances in omics technologies, numerous genome-wide studies on human samples have been published, and most of the omics data with the associated clinical information are available in public repositories, such as Gene Expression Omnibus and ArrayExpress. While analyzing several public datasets, we observed that errors in gender information occur quite often in public datasets. When we analyzed the gender description and the methylation patterns of gender-specific probes (glucose-6-phosphate dehydrogenase [G6PD], ephrin-B1 [EFNB1], and testis specific protein, Y-linked 2 [TSPY2]) in 5,611 samples produced using Infinium 450K HumanMethylation arrays, we found that 19 samples from 7 datasets were erroneously described. We also analyzed 1,819 samples produced using the Affymetrix U133Plus2 array using several gender-specific genes (X (inactive)-specific transcript [XIST], eukaryotic translation initiation factor 1A, Y-linked [EIF1AY], and DEAD [Asp-Glu-Ala-Asp] box polypeptide 3, Y-linked [DDX3Y]) and found that 40 samples from 3 datasets were erroneously described. We suggest that the users of public datasets should not expect that the data are error-free and, whenever possible, that they should check the consistency of the data.
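
A consistency check of the kind this abstract recommends might look like the minimal sketch below. It is not the authors' procedure: the expression cutoffs, the record layout (`id`, `annotated_sex`, `expr`), and the sample values are all hypothetical, and only the idea of comparing the annotated sex against XIST and Y-linked gene expression comes from the abstract.

```python
from typing import Dict, List

# Hypothetical log2-expression cutoffs; the abstract does not state any thresholds.
XIST_HIGH = 8.0
Y_GENE_HIGH = 8.0


def infer_sex(expr: Dict[str, float]) -> str:
    """Infer sex from XIST (expressed in females) and Y-linked genes (males)."""
    xist_on = expr["XIST"] >= XIST_HIGH
    y_on = max(expr["EIF1AY"], expr["DDX3Y"]) >= Y_GENE_HIGH
    if xist_on and not y_on:
        return "female"
    if y_on and not xist_on:
        return "male"
    # Both or neither signal present: do not guess.
    return "ambiguous"


def find_mismatches(samples: List[dict]) -> List[str]:
    """Return IDs of samples whose annotated sex contradicts the inferred sex."""
    flagged = []
    for sample in samples:
        inferred = infer_sex(sample["expr"])
        if inferred != "ambiguous" and inferred != sample["annotated_sex"]:
            flagged.append(sample["id"])
    return flagged


# Made-up example records (hypothetical values, not data from the paper).
samples = [
    {"id": "S1", "annotated_sex": "female",
     "expr": {"XIST": 10.2, "EIF1AY": 3.1, "DDX3Y": 2.9}},
    {"id": "S2", "annotated_sex": "female",
     "expr": {"XIST": 3.0, "EIF1AY": 9.8, "DDX3Y": 10.1}},
    {"id": "S3", "annotated_sex": "male",
     "expr": {"XIST": 2.7, "EIF1AY": 9.5, "DDX3Y": 9.9}},
]

mismatched_ids = find_mismatches(samples)
```

Samples that come back as "ambiguous" are deliberately not flagged here; in practice they would probably deserve a manual look.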

Establishing the Process of Spatial Informatization Using Data from Social Network Services

  • Eo, Seung-Won;Lee, Youngmin;Yu, Kiyun;Park, Woojin
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / Vol. 34, No. 2 / pp. 111-120 / 2016
  • Prior knowledge about SNS (Social Network Services) datasets is often required to conduct valuable analysis using social media data. Understanding of the characteristics of the information extracted from SNS datasets leaves much to be desired in many ways. This paper analyzes in detail the target social network services (Twitter, Instagram, and YouTube) to establish a spatial informatization process that integrates social media information with existing spatial datasets. In this study, valuable information in the SNS datasets was selected, and a total of 12,938 records were collected in Seoul via Open API. The dataset was geo-coded and converted into point form. We also removed overlapping values from the dataset to conduct spatial integration with the existing building layers. The result of this spatial integration process will be utilized in various industries and become a fundamental resource for further studies related to geospatial integration using social media datasets.

High Utility Itemset Mining over Uncertain Datasets Based on a Quantum Genetic Algorithm

  • Wang, Ju;Liu, Fuxian;Jin, Chunjie
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 12, No. 8 / pp. 3606-3629 / 2018
  • The discovered high potential utility itemsets (HPUIs) have significant influence on a variety of areas, such as retail marketing, web click analysis, and biological gene analysis. Thus, in this paper, we propose an algorithm called HPUIM-QGA (Mining high potential utility itemsets based on a quantum genetic algorithm) to mine HPUIs over uncertain datasets based on a quantum genetic algorithm (QGA). The proposed algorithm not only can handle the problem of the non-downward closure property by developing an upper bound of the potential utility (UBPU) (which prunes the unpromising itemsets in the early stage) but can also handle the problem of combinatorial explosion by introducing a QGA, which finds optimal solutions quickly and needs to set only very few parameters. Furthermore, a pruning strategy has been designed to avoid the meaningless and redundant itemsets that are generated in the evolution process of the QGA. As proof of the HPUIM-QGA, a substantial number of experiments are performed on the runtime, memory usage, analysis of the discovered itemsets and the convergence on real-life and synthetic datasets. The results show that our proposed algorithm is reasonable and acceptable for mining meaningful HPUIs from uncertain datasets.

Land Cover Classification Map of Northeast Asia Using GOCI Data

  • Son, Sanghun;Kim, Jinsoo
    • Korean Journal of Remote Sensing / Vol. 35, No. 1 / pp. 83-92 / 2019
  • Land cover (LC) is an important factor in socioeconomic and environmental studies. According to various studies, a number of LC maps, including global land cover (GLC) datasets, are made using polar orbit satellite data. Because reference datasets for Northeast Asia are insufficient, several LC maps display discrepancies in that region. In this paper, we performed a feasibility assessment of LC mapping using Geostationary Ocean Color Imager (GOCI) data over Northeast Asia. To produce the LC map, the GOCI normalized difference vegetation index (NDVI) was used as an input dataset and a level-2 LC map of South Korea was used as a reference dataset to evaluate the LC map. Seven LC types (urban, croplands, forest, grasslands, wetlands, barren, and water) were defined to reflect Northeast Asian LC. The LC map was produced via principal component analysis (PCA) with K-means clustering, and a sensitivity analysis was performed. The overall accuracy was calculated to be 77.94%. Furthermore, to assess the accuracy of the LC map not only in South Korea but also in Northeast Asia, 6 GLC datasets (IGBP, UMD, GLC2000, GlobCover2009, MCD12Q1, GlobeLand30) were used as comparison datasets. The accuracy scores for the 6 GLC datasets were calculated to be 59.41%, 56.82%, 60.97%, 51.71%, 70.24%, and 72.80%, respectively. Therefore, the first attempt to produce the LC map using geostationary satellite data is considered to be acceptable.
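
Overall accuracy figures like those quoted in this abstract are conventionally the share of correctly classified reference samples, that is, the diagonal of a confusion matrix divided by its total. The sketch below illustrates that calculation only; the three-class matrix is made up, and the abstract does not describe how the authors tabulated their results.

```python
from typing import List


def overall_accuracy(confusion: List[List[int]]) -> float:
    """Overall accuracy = correctly classified samples / all samples.

    confusion[i][j] counts reference samples of class i mapped to class j.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical 3-class example (rows: reference class, columns: mapped class).
toy_confusion = [
    [50, 5, 5],
    [4, 40, 6],
    [1, 9, 30],
]

toy_accuracy = overall_accuracy(toy_confusion)  # 120 correct out of 150 samples
```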

Development of Tourism Information Named Entity Recognition Datasets for the Fine-tune KoBERT-CRF Model

  • Jwa, Myeong-Cheol;Jwa, Jeong-Woo
    • International Journal of Internet, Broadcasting and Communication / Vol. 14, No. 2 / pp. 55-62 / 2022
  • A smart tourism chatbot is needed as a user interface to efficiently provide smart tourism services such as recommended travel products, tourist information, my travel itinerary, and tour guide service to tourists. We have developed a smart tourism app and a smart tourism information system that provide smart tourism services to tourists. We also developed a smart tourism chatbot service consisting of the khaiii morpheme analyzer, rule-based intention classification, and a tourism information knowledge base using the Neo4j graph database. In this paper, we develop the Korean and English smart tourism Named Entity (NE) datasets required for the development of the NER model using pre-trained language models (PLMs) for the smart tourism chatbot system. We create the tourism information NER datasets by collecting source data through the smart tourism app, the visitJeju web site of the Jeju Tourism Organization (JTO), and web search, and preprocessing it using Korean and English tourism information Named Entity dictionaries. We train the KoBERT-CRF NER model using the developed Korean and English tourism information NER datasets. The weighted-average precision, recall, and F1 scores are 0.94, 0.92, and 0.94 on the Korean and English tourism information NER datasets.
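
For readers unfamiliar with weighted averaging, the sketch below shows one common definition: per-class precision, recall, and F1 are computed first and then averaged with weights proportional to each class's support in the reference labels. It assumes token-level labels and uses made-up tags (`LOC`, `ORG`, `O`); the abstract does not say whether the authors scored at token or entity level, so this illustrates the metric rather than their evaluation code.

```python
from collections import Counter
from typing import List, Tuple


def weighted_prf(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float]:
    """Support-weighted average of per-class precision, recall, and F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)       # number of reference items per class
    predicted = Counter(y_pred)     # number of predictions per class
    n = len(y_true)
    p_avg = r_avg = f_avg = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = count / n
        p_avg += weight * precision
        r_avg += weight * recall
        f_avg += weight * f1
    return p_avg, r_avg, f_avg


# Hypothetical token-level labels, for illustration only.
y_true = ["LOC", "LOC", "LOC", "ORG", "O", "O"]
y_pred = ["LOC", "LOC", "ORG", "ORG", "O", "LOC"]

precision_w, recall_w, f1_w = weighted_prf(y_true, y_pred)
```

Because F1 is averaged per class, the weighted F1 is not in general the harmonic mean of the weighted precision and weighted recall.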

A Deep Learning Approach for Classification of Cloud Image Patches on Small Datasets

  • Phung, Van Hiep;Rhee, Eun Joo
    • Journal of information and communication convergence engineering / Vol. 16, No. 3 / pp. 173-178 / 2018
  • Accurate classification of cloud images is a challenging task. Almost all the existing methods rely on hand-crafted feature extraction, which limits their discriminative power. In recent years, deep learning with convolutional neural networks (CNNs), which can extract features automatically, has achieved promising results in many computer vision and image understanding fields. However, deep learning approaches usually need large datasets. This paper proposes a deep learning approach for classification of cloud image patches on small datasets. First, we design a suitable deep learning model for small datasets using a CNN, and then we apply data augmentation and dropout regularization techniques to increase the generalization of the model. The experiments for the proposed approach were performed on the small SWIMCAT dataset with k-fold cross-validation. The experimental results demonstrated perfect classification accuracy for most classes on every fold, and confirmed both the high accuracy and the robustness of the proposed model.

Synergic Effect of using the Optical and Radar Image Data for the Land Cover Classification in Coastal Region

  • Kim, Sun-Hwa;Lee, Kyu-Sung
    • Korean Society of Remote Sensing: Conference Proceedings / Korean Society of Remote Sensing, 2003, Proceedings of ACRS 2003 ISRS / pp. 1030-1032 / 2003
  • This study aimed to analyze the effect of combining optical and radar images for land cover classification in a coastal region. The study area, the Gyeonggi Bay area, has one of the largest tidal ranges and undergoes frequent land cover changes due to several reclamation projects and rather intensive land use. Ten land cover types were classified using several datasets combining Landsat ETM+ and RADARSAT imagery. The synergic effects of the merged datasets were analyzed by both visual interpretation and an ordinary supervised classification. The merged optical and SAR datasets provided better discrimination among the land cover classes in the coastal area. The overall classification accuracy of the merged datasets improved to 86.5%, compared with 78% when using ETM+ only.
