• Title/Summary/Keyword: data model

Finding Unexpected Test Accuracy by Cross Validation in Machine Learning

  • Yoon, Hoijin
    • International Journal of Computer Science & Network Security
    • /
    • v.21 no.12spc
    • /
    • pp.549-555
    • /
    • 2021
  • Machine learning (ML) typically splits data into three parts: 60% for training, 20% for validation, and 20% for testing. The split is purely quantitative rather than selecting each set by a criterion, although such a criterion is very important for the adequacy of the test data. ML measures a model's accuracy on the validation set and revises the model until the validation accuracy reaches a certain level. After the validation process, the finished model is tested on the test set, which the model has not yet seen. If the test set covers the model's attributes well, the test accuracy will be close to the model's validation accuracy. To check whether ML's test sets work adequately, we design an experiment to see whether a model's test accuracy is always close to its validation accuracy, as expected. The experiment builds 100 different SVM models for each of six data sets published in the UCI ML repository. Among the 600 resulting pairs of test and validation accuracy, we find unexpected cases in which the test accuracy differs greatly from the validation accuracy. Consequently, it is not always true that ML's test set is adequate to assure a model's quality.
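
The split-and-compare procedure this abstract describes can be sketched as follows (a minimal sketch assuming scikit-learn, a synthetic dataset in place of the UCI sets, and a single SVM rather than 100 models):

```python
# Sketch of the 60/20/20 split and the validation-vs-test accuracy gap
# the paper inspects. The dataset and single model are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# 60/20/20: peel off 20% for test, then 25% of the remainder for validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

model = SVC().fit(X_train, y_train)
val_acc = model.score(X_val, y_val)
test_acc = model.score(X_test, y_test)
gap = abs(val_acc - test_acc)  # the quantity compared across the 600 cases
```

The paper's "unexpected cases" are exactly those where `gap` is large even though validation accuracy looked acceptable.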

Development of a Data Reference Model for Joint Utilization of Biological Resource Research Data (생물자원 연구데이터의 공동 활용을 위한 데이터 참조모델 개발)

  • Kwon, Soon-chul;Jeong, Seung-ryul
    • Journal of Internet Computing and Services
    • /
    • v.19 no.4
    • /
    • pp.135-150
    • /
    • 2018
  • Biological resources research data around the world are not only critical in themselves but should also be shared and utilized. Until now, biological resource data have been compiled and managed individually, depending on the purpose and characteristics of each study, without any clear standard. This study therefore proposes a data reference model that can be applied from the very start of information system construction and used in common. To this end, the data models of related information systems are extended based on domestic and foreign standards and data control policies, a data reference model commonly applicable to individual information systems is developed, and its application procedure is suggested. In addition, to demonstrate the merits of the proposed model, its quality is verified by applying Krogstie's data model evaluation framework, and its level of data sharing is compared with domestic and foreign standards. The results show that the proposed model improves on conventional data models by classifying data into four levels (resources, targets, activities, and performances) and achieves higher quality and data-sharing levels by defining the derivation of and relations among entities.

EJB-based Workflow Model Data Management Mechanism (EJB 기반의 워크플로우 모델 데이터 관리 기술)

  • Kim, Min-Hong
    • Journal of the Korea Computer Industry Society
    • /
    • v.5 no.1
    • /
    • pp.19-28
    • /
    • 2004
  • The major problems in a workflow system, which controls business processes, arise from the difficulty of managing a vast volume of data. In this paper, a more reasonable method for managing workflow data is proposed after analyzing the data used in workflow systems. Such data can be classified into model data, control data, workitem data, and relevant data. The main emphasis is placed on workflow model data: since model data are normally consistent and referenced frequently, using them efficiently is expected to improve the performance of the workflow system. Based on this analysis, this paper designs and implements a model data system. The system is memory-based and manages versions, consistency, dynamic modification, and so on.

Scientific Visualization of Oceanic Data (GIS정보를 이용한 해양자료의 과학적 가시화)

  • Im, Hyo-Hyuc;Kim, Hyeon-Seong;Han, Sang-Cheon;Seong, Ha-Keun;Kim, Kye-Yeong
    • Proceedings of the Korean Society of Marine Engineers Conference
    • /
    • 2006.06a
    • /
    • pp.195-196
    • /
    • 2006
  • Recently, there has been an increasing need to make a synthetic assessment of oceanic data collected across various scientific fields, rather than merely gathering the data. In this study, we built a base map using satellite images, aerial photos, multi-beam data, geological stratum data, and so on, and we are developing a comprehensive SVT (Scientific Visualization Toolkit) that can visualize various kinds of oceanic data. These data include both survey data, such as tidal height, tide, current, wave, water temperature, salinity, and oceanic weather data, and numerical modelling results, such as ocean hydrodynamic, wave, erosion/sediment, thermally discharged coastal water, and ocean water quality models. In this process, we introduce GIS (Geographic Information System) concepts to reflect the temporal and spatial characteristics of oceanic data.

Performance Comparison of LSTM-Based Groundwater Level Prediction Model Using Savitzky-Golay Filter and Differential Method (Savitzky-Golay 필터와 미분을 활용한 LSTM 기반 지하수 수위 예측 모델의 성능 비교)

  • Keun-San Song;Young-Jin Song
    • Journal of the Semiconductor & Display Technology
    • /
    • v.22 no.3
    • /
    • pp.84-89
    • /
    • 2023
  • In water resource management, data prediction is performed using artificial intelligence, and companies, governments, and institutions continue to attempt to manage resources efficiently through it. LSTM is a model specialized for processing time series data: it can identify data patterns that change over time and has been applied to predicting groundwater level data. However, groundwater level data may contain sensor errors, missing values, or outliers, and these problems can degrade the performance of an LSTM model, so data quality needs to be improved in a preprocessing stage. Therefore, in predicting groundwater data, we compare the LSTM model with the model trained after normalization through the distribution in terms of MSE, and discuss the important role of analysis and data preprocessing in light of the comparison results and the changes they reveal.
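
The Savitzky-Golay filtering named in the title is a standard smoothing step for noisy series like these; a minimal sketch (assuming SciPy/NumPy, a synthetic noisy series standing in for real groundwater levels, and illustrative window/order values) might look like:

```python
# Savitzky-Golay smoothing and differentiation as a preprocessing step.
# The series below is synthetic; window_length and polyorder are illustrative.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
level = np.sin(t) + rng.normal(scale=0.2, size=t.size)  # noisy "water level"

smoothed = savgol_filter(level, window_length=21, polyorder=3)
# deriv=1 yields the first derivative, i.e. the "differential method" input
slope = savgol_filter(level, window_length=21, polyorder=3, deriv=1)
```

Either the smoothed series or its derivative can then be fed to the LSTM in place of the raw, noisy measurements.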

A Unifying Model for Hypothesis Testing Using Legislative Voting Data: A Multilevel Item-Response-Theory Model

  • Jeong, Gyung-Ho
    • Analyses & Alternatives
    • /
    • v.5 no.1
    • /
    • pp.3-24
    • /
    • 2021
  • This paper introduces a multilevel item-response-theory (IRT) model as a unifying model for hypothesis testing using legislative voting data. This paper shows that a probit or logit model is a special type of multilevel IRT model. In particular, it is demonstrated that, when a probit or logit model is applied to multiple votes, it makes unrealistic assumptions and produces incorrect coefficient estimates. The advantages of a multilevel IRT model over a probit or logit model are illustrated with a Monte Carlo experiment and an example from the U.S. House. Finally, this paper provides a practical guide to fitting this model to legislative voting data.
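
The core of the IRT setup can be sketched in a few lines (assumptions: SciPy for the normal CDF; the ideal points and bill parameters below are toy values, not estimates from the paper): legislator i votes "yea" on bill j with probability Φ(βⱼxᵢ − αⱼ), where xᵢ is the legislator's ideal point and (αⱼ, βⱼ) are bill parameters.

```python
# Two-parameter probit IRT vote probability: P(yea) = Phi(beta * x - alpha).
# All numbers are toy values for illustration.
import numpy as np
from scipy.stats import norm

x = np.array([-1.0, 0.0, 1.0])  # ideal points: left, centre, right
alpha, beta = 0.2, 1.5          # bill difficulty and discrimination

p_yea = norm.cdf(beta * x - alpha)
```

A plain probit regression corresponds to fixing one bill's (α, β); the multilevel model instead estimates them jointly across many votes, which is why single-vote probit/logit assumptions break down on multiple votes.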

Revisiting the Bradley-Terry model and its application to information retrieval

  • Jeon, Jong-June;Kim, Yongdai
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.5
    • /
    • pp.1089-1099
    • /
    • 2013
  • The Bradley-Terry model is widely used for the analysis of pairwise preference data. We explain that its popularity stems not only from easy computation but also from some nice asymptotic properties when the model is misspecified. For information retrieval, which requires analyzing big ranking data, we propose using a pseudo-likelihood based on the Bradley-Terry model even when the true model differs from it. We justify this by proving that the estimated ranking based on the proposed pseudo-likelihood is consistent whenever the true model belongs to the class of Thurstone models, which is much larger than the Bradley-Terry model.
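
The Bradley-Terry model specifies P(i beats j) = sᵢ/(sᵢ + sⱼ). A minimal fit via the classical MM (Zermelo) iteration, one of the easy computations the abstract alludes to, might look like this (assumptions: NumPy and a small synthetic win matrix, not the paper's pseudo-likelihood):

```python
# Bradley-Terry strengths by the MM iteration:
#   s_i <- W_i / sum_j n_ij / (s_i + s_j),
# where W_i is item i's total wins and n_ij the games between i and j.
import numpy as np

wins = np.array([[0, 7, 8],
                 [3, 0, 6],
                 [2, 4, 0]], dtype=float)  # wins[i, j] = times i beat j

s = np.ones(3)
for _ in range(200):
    games = wins + wins.T                  # n_ij: total games per pair
    W = wins.sum(axis=1)                   # total wins per item
    denom = (games / (s[:, None] + s[None, :])).sum(axis=1)
    s = W / denom
    s /= s.sum()                           # normalize for identifiability
```

The estimated ranking is just the ordering of `s`, which is the quantity whose consistency the paper studies under the larger Thurstone class.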

A Bayesian model for two-way contingency tables with nonignorable nonresponse from small areas

  • Woo, Namkyo;Kim, Dal Ho
    • Journal of the Korean Data and Information Science Society
    • /
    • v.27 no.1
    • /
    • pp.245-254
    • /
    • 2016
  • Many surveys provide categorical data, and there may be one or more missing categories. We describe a nonignorable nonresponse model for the analysis of two-way contingency tables from small areas, with both item and unit nonresponse. One approach to analyzing such data is to construct several tables corresponding to the missing categories. We describe a hierarchical Bayesian model to analyze two-way categorical data from different areas. This allows a "borrowing of strength" from the data of larger areas to improve the reliability of the parameter estimates for the small areas. We also handle nonignorable nonresponse through Bayesian uncertainty analysis, placing priors on the nonidentifiable parameters instead of performing a sensitivity analysis over them. We use the griddy Gibbs sampler to fit our models and compute DIC and BPP for model diagnostics. We illustrate our method using NHANES III data on thirteen states to obtain the finite population proportions.

Sequence Anomaly Detection based on Diffusion Model (확산 모델 기반 시퀀스 이상 탐지)

  • Zhiyuan Zhang;Inwhee, Joe
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.05a
    • /
    • pp.2-4
    • /
    • 2023
  • Sequence data play an important role in intelligent systems, especially in industrial control, traffic control, and other areas. Finding abnormal parts of sequence data has long been an application area for AI technology. In this paper, we propose an anomaly detection method for sequence data using a diffusion model. The diffusion model has two major advantages: interpretability derived from rigorous mathematical derivation, and an unrestricted choice of backbone model. The method uses the diffusion model to predict and reconstruct the sequence data, and then detects abnormal parts by comparing the reconstruction with the real data. This paper verifies the feasibility of diffusion models in the field of anomaly detection. We combine an MLP with the diffusion model to generate data and compare the generated data with the real data to detect anomalous points.
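
The detection step described in the abstract (compare a reconstruction with the real data and flag large deviations) can be illustrated independently of the diffusion model itself. In this sketch a simple moving average stands in for the diffusion model's reconstruction, and NumPy is assumed; the real method would substitute the model's output for `recon`.

```python
# Reconstruction-error anomaly detection on a synthetic sequence.
# The moving average is only a stand-in for a learned reconstruction.
import numpy as np

rng = np.random.default_rng(1)
seq = np.sin(np.linspace(0, 6, 300)) + rng.normal(scale=0.05, size=300)
seq[150] += 2.0  # inject one anomaly

kernel = np.ones(11) / 11
recon = np.convolve(seq, kernel, mode="same")  # stand-in reconstruction

err = np.abs(seq - recon)
threshold = err.mean() + 3 * err.std()
anomalies = np.flatnonzero(err > threshold)    # indices flagged as abnormal
```

The injected point at index 150 produces a reconstruction error far above the threshold, which is exactly the comparison the proposed method performs with diffusion-generated data.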

An Assessment System for Evaluating Big Data Capability Based on a Reference Model (빅데이터 역량 평가를 위한 참조모델 및 수준진단시스템 개발)

  • Cheon, Min-Kyeong;Baek, Dong-Hyun
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.39 no.2
    • /
    • pp.54-63
    • /
    • 2016
  • As technology has developed and the cost of data processing has fallen, the big data market has grown. Developed countries such as the United States have constantly invested in the big data industry and achieved remarkable results, such as improving advertisement effectiveness and obtaining patents for customer service. Every company aims at long-term survival and profit maximization, but it needs a good strategy that considers current industrial conditions to accomplish its goals in the big data industry. However, since the domestic big data industry is at an early stage, local companies lack a systematic method for establishing a competitive strategy. This research therefore aims to help local companies diagnose their big data capabilities through a reference model and a big data capability assessment system. The reference model consists of five maturity levels (Ad hoc, Repeatable, Defined, Managed, and Optimizing) and five key dimensions (Organization, Resources, Infrastructure, People, and Analytics). The assessment system is designed around the reference model's key factors. The Organization dimension has four diagnosis factors: big data leadership, big data strategy, analytical culture, and data governance. The Resources dimension has three: data management, data integrity, and data security/privacy. The Infrastructure dimension has two: big data platform and data management technology. The People dimension has three: training, big data skills, and business-IT alignment. The Analytics dimension has two: data analysis and data visualization. Together, the reference model and assessment system provide a useful guideline for local companies.