• Title/Summary/Keyword: Data validation


An Evaluation Study on Artificial Intelligence Data Validation Methods and Open-source Frameworks (인공지능 데이터 품질검증 기술 및 오픈소스 프레임워크 분석 연구)

  • Yun, Changhee;Shin, Hokyung;Choo, Seung-Yeon;Kim, Jaeil
    • Journal of Korea Multimedia Society, v.24 no.10, pp.1403-1413, 2021
  • In this paper, we investigate automated data validation techniques for artificial intelligence training data and introduce open-source frameworks, such as Google's TensorFlow Data Validation (TFDV), that support automated data validation in the AI model development process. We also present an experimental study using public datasets to demonstrate the effectiveness of an open-source data validation framework. In particular, we present experimental results for the schema-testing validation functions and discuss the limitations of current open-source frameworks with respect to semantic data. Lastly, we introduce recent studies on semantic data validation using machine learning techniques.
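
The schema-testing workflow the abstract describes (infer a schema from reference data, then flag deviations in new batches) can be sketched in plain Python. The helper names below are hypothetical; TFDV's own API (`tfdv.infer_schema`, `tfdv.validate_statistics`) is far richer.

```python
# Minimal sketch of schema-based data validation, in the spirit of
# TFDV's infer_schema / validate_statistics (helper names hypothetical).

def infer_schema(rows):
    """Infer a column -> type mapping from a reference batch of dicts."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            schema.setdefault(col, type(val))
    return schema

def validate_batch(rows, schema):
    """Return anomalies: missing columns and type mismatches per row."""
    anomalies = []
    for i, row in enumerate(rows):
        for col, expected in schema.items():
            if col not in row:
                anomalies.append((i, col, "missing column"))
            elif not isinstance(row[col], expected):
                anomalies.append((i, col, "type mismatch"))
    return anomalies

reference = [{"age": 34, "label": "cat"}, {"age": 21, "label": "dog"}]
schema = infer_schema(reference)
new_batch = [{"age": "thirty", "label": "cat"}, {"label": "dog"}]
print(validate_batch(new_batch, schema))
# flags a type mismatch on row 0 ("age") and a missing column on row 1
```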

Finding Unexpected Test Accuracy by Cross Validation in Machine Learning

  • Yoon, Hoijin
    • International Journal of Computer Science & Network Security, v.21 no.12spc, pp.549-555, 2021
  • Machine Learning (ML) typically splits data into three parts: 60% for training, 20% for validation, and 20% for testing. The split is purely quantitative rather than selecting each set by a criterion, which is an important consideration for the adequacy of the test data. ML measures a model's accuracy on the validation set and revises the model until the validation accuracy reaches a certain level. After the validation process, the completed model is tested on the test set, which the model has not yet seen. If the test set covers the model's attributes well, the test accuracy will be close to the model's validation accuracy. To check whether ML's test sets work adequately, we design an experiment to see if a model's test accuracy is always close to its validation accuracy, as expected. The experiment builds 100 different SVM models for each of six data sets published in the UCI ML repository. Among the 600 resulting pairs of test and validation accuracy, we find some unexpected cases where the test accuracy differs greatly from the validation accuracy. Consequently, it is not always true that ML's test set is adequate to assure a model's quality.
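
The 60/20/20 split and the validation-vs-test accuracy comparison can be illustrated with a toy experiment; a nearest-centroid classifier on synthetic Gaussian data stands in here for the paper's SVM models.

```python
# Sketch of the 60/20/20 split and the validation-vs-test accuracy gap;
# a nearest-centroid classifier stands in for SVM on synthetic data.
import random

random.seed(0)

def make_point(label):
    mu = 0.0 if label == 0 else 2.0
    return ([random.gauss(mu, 1.0), random.gauss(mu, 1.0)], label)

data = [make_point(i % 2) for i in range(500)]
random.shuffle(data)
n = len(data)
train = data[: int(0.6 * n)]             # 60% training
val = data[int(0.6 * n): int(0.8 * n)]   # 20% validation
test = data[int(0.8 * n):]               # 20% test

def centroid(points):
    xs = [p for p, _ in points]
    return [sum(c) / len(c) for c in zip(*xs)]

c0 = centroid([d for d in train if d[1] == 0])
c1 = centroid([d for d in train if d[1] == 1])

def predict(x):
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 0 if d0 <= d1 else 1

def accuracy(points):
    return sum(predict(x) == y for x, y in points) / len(points)

val_acc, test_acc = accuracy(val), accuracy(test)
print(f"validation={val_acc:.3f} test={test_acc:.3f} gap={abs(val_acc - test_acc):.3f}")
```

The paper's point is that this gap is usually small but occasionally large, because the split is quantitative and nothing guarantees the test set covers the model's attributes.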

Validation Data Augmentation for Improving the Grading Accuracy of Diabetic Macular Edema using Deep Learning (딥러닝을 이용한 당뇨성황반부종 등급 분류의 정확도 개선을 위한 검증 데이터 증강 기법)

  • Lee, Tae Soo
    • Journal of Biomedical Engineering Research, v.40 no.2, pp.48-54, 2019
  • This paper proposes a validation data augmentation method for improving the grading accuracy of diabetic macular edema (DME) using deep learning. Data augmentation is normally applied to the input data of a deep neural network (DNN) to secure data diversity, transforming one image into several through random translation, rotation, scaling, and reflection. In this paper, we apply the technique in the validation stage of the trained DNN and improve grading accuracy by combining the classification results of the augmented images. To verify its effectiveness, the 1,200 retinal images of the Messidor dataset were divided into training and validation data at a 7:3 ratio. By applying random augmentation to the 359 validation images, an accuracy improvement of 1.61 ± 0.55% was achieved with six-fold augmentation (N = 6). This simple method improved accuracy over the range N = 2 to 6, with a correlation coefficient of 0.5667. It is therefore expected to help improve the diagnostic accuracy of DME through the grading information provided by the proposed DNN.
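
The combining step (classify N augmented copies of an image and merge their class scores) can be sketched as follows; the noise augmentation and the three-grade scoring model below are simplistic stand-ins for the paper's image transforms and trained DNN.

```python
# Sketch of validation-time augmentation: classify N augmented copies and
# combine class scores by averaging. All components here are stand-ins.
import random

random.seed(1)

def augment(image):
    """Stand-in augmentation: small random noise (the paper uses random
    translation, rotation, scaling and reflection)."""
    return [px + random.gauss(0, 0.05) for px in image]

def model_scores(image):
    """Stand-in for a DNN's class scores over 3 DME grades."""
    s = sum(image)
    return [1.0 / (1.0 + abs(s - t)) for t in (0.0, 1.0, 2.0)]

def predict_with_augmentation(image, n=6):
    """Sum class scores over n augmented copies, then take the argmax."""
    totals = [0.0, 0.0, 0.0]
    for _ in range(n):
        for k, sc in enumerate(model_scores(augment(image))):
            totals[k] += sc
    return max(range(3), key=lambda k: totals[k])

print(predict_with_augmentation([0.5, 0.5], n=6))  # predicts grade 1
```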

Rubber O-ring defect detection system using K-fold cross validation and support vector machine (K-겹 교차 검증과 서포트 벡터 머신을 이용한 고무 오링결함 검출 시스템)

  • Lee, Yong Eun;Choi, Nak Joon;Byun, Young Hoo;Kim, Dae Won;Kim, Kyung Chun
    • Journal of the Korean Society of Visualization, v.19 no.1, pp.68-73, 2021
  • In this study, rubber O-ring defect detection was carried out using k-fold cross validation and the Support Vector Machine (SVM) algorithm. Data processing was performed in three steps. First, frames were aligned to eliminate regions unnecessary for learning; second, images were converted to gray scale to reduce computation; finally, image augmentation was applied to prevent overfitting. After processing the data, the SVM algorithm was used to measure normal and defect detection accuracy. In addition, we applied the SVM algorithm with the k-fold cross validation method to compare classification accuracies. The results show better performance when k-fold cross validation is applied.
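
The k-fold procedure the study applies can be sketched generically: partition the sample indices into k folds, hold out each fold in turn, and average the k accuracies. `train_and_score` below is a hypothetical stand-in for fitting and scoring the SVM.

```python
# Sketch of k-fold cross validation: train on k-1 folds, score on the
# held-out fold, and average the k scores.

def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cross_validate(n, k, train_and_score):
    """train_and_score(train_idx, val_idx) -> accuracy; returns the mean."""
    scores = [train_and_score(tr, va) for tr, va in kfold_indices(n, k)]
    return sum(scores) / len(scores)

# Example: the 5 folds partition all 10 samples exactly once.
folds = list(kfold_indices(10, 5))
print([va for _, va in folds])  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Because every sample is used for validation exactly once, the averaged score is less sensitive to one lucky or unlucky split than a single hold-out accuracy.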

A Visual Approach for Data-Intensive Workflow Validation

  • Park, Minjae;Ahn, Hyun;Kim, Kwanghoon Pio
    • Journal of Internet Computing and Services, v.17 no.5, pp.43-49, 2016
  • This paper presents a validation method for data-intensive graphical workflow models that uses a real-time workflow tracing mode in a data-intensive workflow designer. To support both modeling and validation, the designer is divided into two modes: an edit mode and a tracing mode. In edit mode, a data-intensive workflow can be designed by drag and drop; in tracing mode, the workflow model cannot be edited but can be viewed and traced. We focus on the tracing mode for workflow validation and describe how workflow tracing is used in the data-intensive workflow model designer. In particular, the tracing mode supports data-centered operations over control logic and exchange variables at workflow runtime.

OVERVIEW OF KOMPSAT APPLICATION PRODUCT VALIDATION SITE AND THE RELATED ACTIVITIES

  • Lee, Kwang-Jae;Youn, Bo-Yeol;Kim, Duk-Jin;Kim, Youn-Soo
    • Proceedings of the KSRS Conference, 2007.10a, pp.122-125, 2007
  • In recent years, there has been an increasing demand for improved accuracy and reliability of Earth Observation Satellite (EOS) data. Most data users in the field of remote sensing need to understand product accuracy and uncertainty. In particular, EOS application products should be validated for practical application in the field. To evaluate the availability and applicability of application products, it is necessary to establish a systematic validation system covering techniques, equipment, ground truth data, and so on. The Product Validation Site (PVS) for generating and validating KOMPSAT application products was designed and established with various in-situ equipment and datasets. This paper presents the status of the PVS and summarizes results from experimental studies conducted there.


Comparison of the Cluster Validation Techniques using Gene Expression Data (유전자 발현 자료를 이용한 군집 타당성분석 기법 비교)

  • Jeong, Yun-Kyoung;Baek, Jang-Sun
    • Proceedings of the Korean Data and Information Science Society Conference, 2006.04a, pp.63-76, 2006
  • Several clustering algorithms for analyzing gene expression data, and cluster validation techniques that assess the quality of their outcomes, have been suggested, but evaluations of these cluster validation techniques have seldom been carried out. In this paper we compared various cluster validity indices on simulated data and real genomic data, and found that Dunn's index was the most effective and robust in small simulations and with real gene expression data.
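
Dunn's index, which the comparison favours, is the minimum inter-cluster distance divided by the maximum intra-cluster diameter (higher is better). A minimal sketch:

```python
# Dunn's index: min inter-cluster distance / max intra-cluster diameter.
from itertools import combinations
from math import dist  # Python 3.8+

def dunn_index(clusters):
    """clusters: list of clusters, each a list of point tuples."""
    inter = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a for q in b
    )
    diam = max(
        dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    return inter / diam

tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # compact, far apart
loose = [[(0, 0), (0, 5)], [(6, 0), (6, 5)]]     # spread out, close together
print(dunn_index(tight) > dunn_index(loose))  # prints True
```

Note the double minimum over all point pairs makes this O(n²); practical implementations often use cluster centroids or precomputed distance matrices instead.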


Design of an Algorithm for the Validation of SCL in Digital Substations

  • Jang, B.T.;Alidu, A.;Kim, N.D.
    • KEPCO Journal on Electric Power and Energy, v.3 no.2, pp.89-97, 2017
  • The substation is a critical node of the power network, where power is transformed across the generation, transmission, and distribution system. IEC 61850 is a global standard that enables efficient substation automation by defining interoperable communication and data modelling techniques. To achieve this level of interoperability and automation, IEC 61850 (Part 6) defines the System Configuration description Language (SCL), an XML-based file format for describing the abstract model of primary and secondary substation equipment, communication systems, and the relationships between them. It enables the interoperable exchange of data during substation engineering by standardizing the description of applications at different stages of the engineering process. To achieve seamless interoperability, multi-vendor devices must adhere completely to IEC 61850. This paper proposes an efficient algorithm for verifying the interoperability of multi-vendor devices by checking the adherence of SCL files to the specifications of the standard. The proposed SCL validation algorithm consists of schema validation and further checks, including information model validation using a UML data model, Vendor Defined Extension model validation, User Defined Rule validation, and IED Engineering Table (IET) consistency validation. It also integrates the standard UCAIUG (Utility Communication Architecture International Users Group) Procedure validation for quality assurance testing. The proposed algorithm is not only flexible and efficient in ensuring the interoperable functionality of tested devices, but also convenient for system integrators and test engineers.
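
The rule-checking stage of such an algorithm can be sketched with the Python standard library: after XSD schema validation (omitted here, since the stdlib has no XSD validator), walk the parsed tree and enforce consistency rules. The SCL snippet and the single rule below are illustrative assumptions, not the IEC 61850-6 rule set.

```python
# Sketch of a consistency rule over a parsed SCL-like file: every
# ConnectedAP must reference an IED that is actually declared.
import xml.etree.ElementTree as ET

SCL_SNIPPET = """
<SCL>
  <IED name="IED1"/>
  <IED name="IED2"/>
  <Communication>
    <ConnectedAP iedName="IED1" apName="AP1"/>
    <ConnectedAP iedName="IED3" apName="AP1"/>
  </Communication>
</SCL>
"""

def validate_scl(xml_text):
    """Return a list of rule violations found in the SCL text."""
    root = ET.fromstring(xml_text)
    declared = {ied.get("name") for ied in root.findall("IED")}
    errors = []
    for ap in root.iter("ConnectedAP"):
        if ap.get("iedName") not in declared:
            errors.append(
                f"ConnectedAP references undeclared IED {ap.get('iedName')!r}"
            )
    return errors

print(validate_scl(SCL_SNIPPET))  # flags the reference to IED3
```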

Comparison of the Cluster Validation Methods for High-dimensional (Gene Expression) Data (고차원 (유전자 발현) 자료에 대한 군집 타당성분석 기법의 성능 비교)

  • Jeong, Yun-Kyoung;Baek, Jang-Sun
    • The Korean Journal of Applied Statistics, v.20 no.1, pp.167-181, 2007
  • Many clustering algorithms and cluster validation techniques for high-dimensional gene expression data have been suggested, but evaluations of these cluster validation techniques have seldom been carried out. In this paper we compared various cluster validity indices on low-dimensional simulated data and real gene expression data. Among the internal measures, Dunn's index was the most effective and robust, the Silhouette index was next, and the Davies-Bouldin index ranked lowest. Among the external measures, the Jaccard index was much more effective than the Goodman-Kruskal index and the adjusted Rand index.
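
As an example of the external measures compared, the pair-counting Jaccard index scores a clustering against known labels by counting pairs of points grouped together in both partitions versus in only one. A minimal sketch:

```python
# Pair-counting Jaccard index between two clusterings of the same points:
# a = pairs together in both, b = together in A only, c = together in B only.
from itertools import combinations

def jaccard_index(labels_a, labels_b):
    a = b = c = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            a += 1
        elif same_a:
            b += 1
        elif same_b:
            c += 1
    return a / (a + b + c)

truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]   # same partition, labels swapped
poor = [0, 1, 0, 1]      # every true pair split apart
print(jaccard_index(truth, perfect))  # 1.0
print(jaccard_index(truth, poor))     # 0.0
```

Because only pair co-membership matters, the index is invariant to relabelling the clusters, which is exactly what an external validity measure needs.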

Basic Principles of the Validation for Good Laboratory Practice Institutes

  • Cho, Kyu-Hyuk;Kim, Jin-Sung;Jeon, Man-Soo;Lee, Kyu-Hong;Chung, Moon-Koo;Song, Chang-Woo
    • Toxicological Research, v.25 no.1, pp.1-8, 2009
  • Validation specifies and coordinates all relevant activities to ensure compliance with good laboratory practice (GLP) according to suitable international standards. This includes past, present, and future validation activities for the best possible assurance of the integrity of non-clinical laboratory data. Validation has recently become increasingly important, not only in good manufacturing practice (GMP) institutions but also in GLP facilities. In accordance with the GLP regulatory guidelines, all equipment used to generate, measure, or assess data should undergo validation to ensure that it is of appropriate design and capacity and will consistently function as intended. The implementation of validation processes is therefore considered an essential step for a global institution. This review describes the procedures and documentation required for GLP validation. It introduces basic elements such as the validation master plan, risk assessment, gap analysis, design qualification, installation qualification, operational qualification, performance qualification, calibration, traceability, and revalidation.