• Title/Summary/Keyword: Data set

Study on the Improvement of Machine Learning Ability through Data Augmentation (데이터 증강을 통한 기계학습 능력 개선 방법 연구)

  • Kim, Tae-woo;Shin, Kwang-seong
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.05a / pp.346-347 / 2021
  • In pattern recognition with machine learning, performance generally improves as the amount of training data grows. However, it is not always possible to collect a large amount of training data covering the types of patterns that must be detected in everyday settings. A small data set therefore has to be enlarged substantially before general machine learning can be applied. This study examines techniques for augmenting data so that machine learning can be performed. A representative method for machine learning with a small data set is transfer learning, in which a model is first trained on a general-purpose data set and the target data set is then substituted in at the final stage. In this study, a model trained on a general-purpose data set such as ImageNet is used as a feature extractor, together with augmented data, to detect a desired pattern.

  • PDF
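
As a minimal sketch of the augmentation idea in the abstract above, the following enlarges a tiny image data set with flips and rotations. The abstract does not say which transformations the authors used, so the choice of transformations and the names `flip_horizontal`, `rotate_90` and `augment` are assumptions made for illustration only; images are plain nested lists so that the example needs no external library.

```python
from typing import List, Tuple

Image = List[List[int]]          # a grayscale image as rows of pixel values
Sample = Tuple[Image, int]       # (image, class label)


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment(dataset: List[Sample]) -> List[Sample]:
    """Return the original samples plus label-preserving variants.

    Assumption: flips and rotations do not change the class label,
    which holds for many, but not all, pattern recognition tasks.
    """
    augmented: List[Sample] = []
    for img, label in dataset:
        variants = [img, flip_horizontal(img)]
        rotated = img
        for _ in range(3):       # 90, 180 and 270 degrees
            rotated = rotate_90(rotated)
            variants.append(rotated)
        augmented.extend((v, label) for v in variants)
    return augmented


small_set = [([[1, 2], [3, 4]], 0)]
print(len(small_set), "->", len(augment(small_set)))
```

Each original sample yields five samples here; in a transfer learning setup, such augmented samples would be passed through the pre-trained feature extractor rather than used to train a model from scratch.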

AVHRR MOSAIC IMAGE DATA SET FOR ASIAN REGION

  • Yokoyama, Ryuzo;Lei, Liping;Purevdorj, Ts.;Tanba, Sumio
    • Proceedings of the KSRS Conference / 1999.11a / pp.285-289 / 1999
  • A processing system for producing a cloud-free composite image data set was developed. The process applies a fine geometric correction based on orbit parameters and ground control points, and a radiometric correction based on the 6S code. At present, AVHRR image data received at Tokyo, Okinawa, Ulaanbaatar and Bangkok are used to produce a data set of 10-day composite images covering almost the whole Asian region.

  • PDF

Parameter Tuning in Support Vector Regression for Large Scale Problems (대용량 자료에 대한 서포트 벡터 회귀에서 모수조절)

  • Ryu, Jee-Youl;Kwak, Minjung;Yoon, Min
    • Journal of the Korean Institute of Intelligent Systems / v.25 no.1 / pp.15-21 / 2015
  • In support vector machines, the values of the kernel parameters strongly affect generalization ability, and it is often difficult to determine appropriate values for them in advance. Our earlier studies showed that the burden of choosing these parameter values in support vector regression can be reduced by using ensemble learning. However, applying that method directly to large scale problems is too time consuming. In this paper, we propose a method in which the original data set is decomposed into a number of sub data sets in order to reduce the burden of parameter tuning in support vector regression, particularly for large scale and imbalanced data sets.
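
The decomposition idea in the abstract above can be sketched as follows: split the data into disjoint sub data sets, fit one model per subset, and combine the models. This is only an illustration under stated assumptions. The abstract does not say how the sub-models are combined, so averaging is an assumption, and an ordinary least-squares line (`fit_line`) stands in for support vector regression to keep the example dependency-free. All function names are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]            # (x, y)
Model = Callable[[float], float]


def split_into_subsets(data: Sequence[Sample], k: int, seed: int = 0) -> List[List[Sample]]:
    """Shuffle the data and deal it into k disjoint sub data sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def fit_line(samples: Sequence[Sample]) -> Model:
    """Stand-in base learner (NOT SVR): least-squares line through the samples."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in samples) / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def fit_ensemble(data: Sequence[Sample], k: int,
                 fit: Callable[[Sequence[Sample]], Model] = fit_line) -> Model:
    """Fit one model per sub data set and average their predictions (assumed rule)."""
    models = [fit(subset) for subset in split_into_subsets(data, k)]
    return lambda x: mean(m(x) for m in models)


data = [(float(x), 2.0 * x + 1.0) for x in range(30)]
model = fit_ensemble(data, k=3)
print(model(10.0))
```

Because each sub-model sees only a fraction of the data, any per-model tuning runs on much smaller problems, which is the cost reduction the abstract describes.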

Detecting differentially expressed genes from a mixed data set

  • Lee, Sun-Ho;Kim, In-Young;Kim, Sang-Cheol;Rha, Sun-Young;Chung, Hyun-Chel;Kim, Byung-Soo
    • Proceedings of the Korean Statistical Society Conference / 2003.10a / pp.173-177 / 2003
  • When we have both a paired data set and two independent data sets, neither a paired t-test nor a two-sample t-test can be used to detect differences between two samples. In order to identify differentially expressed genes in a mixed data set, a new test statistic is proposed.

  • PDF
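
To make the problem in the abstract above concrete, the sketch below computes the two standard statistics it mentions: the paired t statistic and a two-sample t statistic (Welch's unequal-variance form is used here as an assumption). Each one consumes only one part of a mixed data set. The new statistic proposed by the authors is not given in the abstract and is not reproduced here; the sample values are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def paired_t(x: Sequence[float], y: Sequence[float]) -> float:
    """Paired t statistic: based on within-pair differences x_i - y_i."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / sqrt(variance(diffs) / len(diffs))


def welch_t(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sample t statistic (Welch form) for independent samples."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


# A mixed data set: some subjects measured under both conditions (paired),
# others measured under only one condition (independent). Values are made up.
paired_tumor, paired_normal = [2.1, 3.4, 2.9, 4.0], [1.8, 2.9, 2.7, 3.1]
only_tumor, only_normal = [3.3, 2.8, 3.9], [2.0, 2.6, 2.2, 2.4]

print("paired part only:     ", paired_t(paired_tumor, paired_normal))
print("independent part only:", welch_t(only_tumor, only_normal))
```

Neither call uses all of the observations, which is why a dedicated statistic for mixed data is of interest.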

Extraction of Fuzzy Rules from Data using Rough Set (Rough Set을 이용한 퍼지 규칙의 생성)

  • 조영완;노흥식;위성윤;이희진;박민용
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 1996.10a / pp.327-332 / 1996
  • Rough set theory, proposed by Pawlak, can describe the degree of relation between the condition and decision attributes of data that have no linguistic information. In this paper, using this property of rough set theory, we define an occupancy degree, a measure that represents the degree of relation between the condition and decision attributes of a data table. We also propose a method that finds an optimal fuzzy rule table and the membership functions of the input and output variables from data without linguistic information, and we examine the validity of the method by modeling data generated by fuzzy rules.

  • PDF

TMY2 Weather data for Korea (TMY2 방식에 의한 국내 기상자료 작성 연구)

  • Shin, Kee-Shik;Yoon, Chang-Ryuel;Park, Sang-Dong
    • Proceedings of the Korean Society for New and Renewable Energy Conference (한국신재생에너지학회:학술대회논문집) / 2009.06a / pp.243-246 / 2009
  • Many building simulation programs are used to evaluate building energy performance, and their capabilities continue to grow. Despite these increased capabilities, building energy performance evaluations still rely on the same limited set of weather data. This often forces users to find or calculate weather data such as illuminance, solar radiation, and ground temperature from other sources. In addition, selecting an appropriate weather data set is considered one of the important factors in a successful building energy simulation. In this paper, we describe TMY2, a generalized weather data format, apply it to the Seoul region, and examine the differences from existing weather data. A database of 23 years of raw weather data has been developed to provide the weather data file for building energy analysis in Seoul.

  • PDF

Utilizing the GOA-RF hybrid model, predicting the CPT-based pile set-up parameters

  • Zhao, Zhilong;Chen, Simin;Zhang, Dengke;Peng, Bin;Li, Xuyang;Zheng, Qian
    • Geomechanics and Engineering / v.31 no.1 / pp.113-127 / 2022
  • The undrained shear strength of soil is considered one of the most significant engineering parameters in geotechnical design methods. In-situ experiments such as cone penetration tests (CPT) have been used in recent years to estimate the undrained shear strength from the characteristics of the soil. However, most of these techniques rely on correlation assumptions, which may lead to uneven accuracy. The general aim of this research is to develop a hybrid soft computing model that combines random forest (RF) with the grasshopper optimization algorithm (GOA) to better approximate pile set-up parameters from CPT, based on two different types of input data. Data type 1 contains pile parameters, and data type 2 consists of soil properties. The contribution of this article is that a hybrid GOA-RF model is proposed for the first time to forecast the pile set-up parameter from CPT. To do this, CPT data and related bore log data were gathered from 70 locations across Louisiana. With R² greater than 0.9098, which indicates acceptable agreement between measured and predicted values, the results showed that both models perform well in forecasting the set-up parameter. In both the training and testing steps, the model with data type 2 performs better than the model with data type 1, with R² and RMSE of 0.9272 and 0.0305 for training and 0.9182 and 0.0415 for testing. Overall, the results show that the A parameter can be forecast with adequate precision from CPT data using hybrid GOA-RF models, and that the model with soil features as input parameters gives the better estimate of the pile set-up parameters.
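
For readers unfamiliar with the two scores quoted in the abstract above, the sketch below shows how R² and RMSE are commonly computed from measured and predicted values. The abstract does not state which definition of R² the authors used; the coefficient of determination (1 minus the ratio of residual to total sum of squares) is assumed here, and the numbers in the example are invented, not taken from the paper.

```python
from math import sqrt
from statistics import mean
from typing import Sequence


def rmse(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between measured and predicted values."""
    return sqrt(mean((m - p) ** 2 for m, p in zip(measured, predicted)))


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    m_bar = mean(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - m_bar) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Hypothetical set-up parameter values, for illustration only.
measured = [0.20, 0.35, 0.50, 0.65]
predicted = [0.22, 0.33, 0.48, 0.68]
print("R2  =", r_squared(measured, predicted))
print("RMSE =", rmse(measured, predicted))
```

A higher R² and a lower RMSE on the testing step both indicate better generalization, which is the basis for the comparison between the two input data types.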

The Effect of Bias in Data Set for Conceptual Clustering Algorithms

  • Lee, Gye Sung
    • International journal of advanced smart convergence / v.8 no.3 / pp.46-53 / 2019
  • When a partitioned structure is derived from a data set using a clustering algorithm, it is not unusual to obtain a different set of outcomes when the algorithm is run on the data in a different order. This is known as the order bias problem. Many machine learning algorithms try to achieve an optimized result from the available training and test data. Optimization is determined by an evaluation function, which itself has a tendency toward a certain goal. Such a tendency is unavoidable, both for efficiency and for consistency in the result, but the function's preference for a specific goal may sometimes lead to unfavorable consequences in the final clustering. To overcome these bias problems, a first clustering pass constructs an initial partition, which is expected to indicate the possible range of the number of final clusters. We then apply data-centric sorting to the data objects in the clusters of the partition to rearrange them in a new order, and reapply the same clustering procedure to the rearranged data set to build a new partition. The resulting algorithm reduces the bias that arises from how data are fed into it. Experimental results show that the algorithm helps minimize order bias effects. We also show that the current evaluation measure used for the clustering algorithm is biased toward fewer and larger clusters.
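
The order bias problem described in the abstract above is easy to reproduce with any incremental clustering method. The sketch below uses a simple leader algorithm on one-dimensional points as a stand-in; it is not the conceptual clustering algorithm studied in the paper, and the data-centric sorting step is not implemented because the abstract does not define it. The function name and the threshold value are assumptions for illustration.

```python
from typing import FrozenSet, List, Sequence, Set


def leader_clustering(points: Sequence[float], threshold: float) -> Set[FrozenSet[float]]:
    """Incremental clustering: each point joins the first cluster whose
    leader (its first member) lies within `threshold`, else starts a new one."""
    leaders: List[float] = []
    clusters: List[List[float]] = []
    for p in points:
        for i, leader in enumerate(leaders):
            if abs(p - leader) <= threshold:
                clusters[i].append(p)
                break
        else:
            # No existing leader is close enough: p becomes a new leader.
            leaders.append(p)
            clusters.append([p])
    return {frozenset(c) for c in clusters}


# Same four points, two presentation orders.
order_a = [0.0, 1.0, 2.0, 3.0]
order_b = [1.0, 0.0, 2.0, 3.0]

partition_a = leader_clustering(order_a, threshold=1.5)
partition_b = leader_clustering(order_b, threshold=1.5)
print(sorted(sorted(c) for c in partition_a))
print(sorted(sorted(c) for c in partition_b))
print("same partition:", partition_a == partition_b)
```

The two runs see identical data yet produce different partitions, because the first point presented fixes where the first cluster is centered.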

Preliminary Study on Utilization of Big Data from CCTV at Child Care Centers (어린이집 CCTV 빅데이터의 활용을 위한 기초 연구)

  • Shin, Nary;Yu, Aehyung
    • Korean Journal of Childcare and Education / v.13 no.6 / pp.43-67 / 2017
  • Objective: The purpose of this study was to explore the feasibility of utilizing image data recorded and accumulated by CCTV at child care centers. Methods: The study comprised literature reviews; consultations and workshops with scholars of child development, legal professionals, and engineers; focus group interviews with professionals working with young children; and surveys of parents, directors and teachers. Results: It was found that big data from CCTV at child care centers can be used for policy making and research as a secondary data set after anonymization. Extracting implicit and useful data from images stored by CCTV is technically feasible. Analyzing the data can also be legally permitted on the condition that informed consent is obtained. Conclusion/Implications: Image data from CCTV at child care centers could likely be utilized as a secondary data set for policy development and scholarly purposes, once obstacles such as the budget for additional infrastructure and the consent of information holders are overcome.

Study on Rainfall Characteristics for the Millimeter-wave Communication Systems-Comparisons of Rainfall rate data from Several observation methods.

  • Chung, H.S.;Song, B.H.;Lee, J.H.;Park, K.M.;Lee, K.A.
    • Proceedings of the KSRS Conference / 1999.11a / pp.132-134 / 1999
  • Rainfall characteristics relevant to designing optimum millimeter-wave communication systems were analyzed from two rainfall data sets: one-minute rainfall rate data and one-hour synoptic observation data. The two data sets differ in observation method and sampling frequency. We looked for tendencies and consistency in quality between them. We present several results obtained by applying a millimeter-wave attenuation model to the one-minute rainfall data. A climatological one-minute rainfall rate data set over the Korean Peninsula will be produced after a data quality control procedure.

  • PDF