• Title/Summary/Keyword: data selection (데이터 선별)

Search Result 583, Processing Time 0.025 seconds

Load Shedding via Predicting the Frequency of Tuple for Efficient Analysis over Data Streams (효율적 데이터 스트림 분석을 위한 발생빈도 예측 기법을 이용한 과부하 처리)

  • Chang, Joong-Hyuk
    • The KIPS Transactions:PartD / v.13D no.6 s.109 / pp.755-764 / 2006
  • Data streams are now generated in many application fields, such as ubiquitous computing and sensor networks, and various algorithms have been proposed to process them efficiently. These algorithms mainly focus on restricting memory usage and minimizing the processing time per data element. However, when the data elements of a stream arrive at a high rate within a time unit, some of them cannot be processed in real time, so an efficient load shedding technique is required. This paper proposes a load shedding technique for data streams that predicts the future frequency of a data element from its current frequency. The threshold that decides whether a tuple is kept alive is adjusted adaptively as the data stream changes, which helps prevent unnecessary load shedding.

Performance Improvement of Nearest-neighbor Classification Learning through Prototype Selections (프로토타입 선택을 이용한 최근접 분류 학습의 성능 개선)

  • Hwang, Doo-Sung
    • Journal of the Institute of Electronics Engineers of Korea CI / v.49 no.2 / pp.53-60 / 2012
  • Nearest-neighbor classification predicts the class of an input as the most frequent class among the training data nearest to it. Although nearest-neighbor classification has no training stage, all of the training data are needed at prediction time, and the generalization performance depends on the quality of the training data. As the training set grows, nearest-neighbor classification therefore requires a large amount of memory and a long computation time for prediction. In this paper, we propose a prototype selection algorithm that predicts the class of test data with a new set of prototypes consisting of near-boundary training data. Based on Tomek links and a distance metric, the proposed algorithm selects boundary data and decides whether each selected datum is added to the set of prototypes by considering class and distance relationships. In the experiments, the number of prototypes is much smaller than the size of the original training data, which yields reduced storage and faster prediction in nearest-neighbor classification.
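
As an illustration of the Tomek links mentioned in this abstract, the following is a minimal sketch, not the authors' algorithm. It only finds Tomek links under the usual definition (two points of different classes that are each other's nearest neighbor); the paper's additional rules for admitting prototypes are not given in the abstract and are not reproduced. The function names and the toy data are hypothetical.

```python
import math


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    best_j, best_d = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)  # Euclidean distance (assumed metric)
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def tomek_links(points, labels):
    """Return pairs (i, j), i < j, that are mutual nearest neighbors
    and carry different class labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Hypothetical one-dimensional toy data laid out along the x-axis.
points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.3, 0.0), (3.0, 0.0)]
labels = ["a", "a", "a", "b", "b"]
links = tomek_links(points, labels)
print(links)
```

Points 2 and 3 sit on opposite sides of the class boundary and are each other's nearest neighbor, so they are the kind of near-boundary data a prototype set would be drawn from.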

Compare to Factorization Machines Learning and High-order Factorization Machines Learning for Recommend system (추천시스템에 활용되는 Matrix Factorization 중 FM과 HOFM의 비교)

  • Cho, Seong-Eun
    • Journal of Digital Contents Society / v.19 no.4 / pp.731-737 / 2018
  • Recommendation systems are actively researched in many fields, such as content, online commerce, social networks, and advertising, with the aim of suggesting information that users may be interested in. However, many recommendation systems make suggestions based on past preference data, which makes it difficult to serve users who have little or no past data. Interest in higher-order data analysis is therefore increasing, and Matrix Factorization is attracting attention. In this paper, we compare and reproduce the Factorization Machines Learning (FM) model, which is attracting attention in recommendation systems, and High-Order Factorization Machines Learning (HOFM), which performs high-dimensional data analysis.
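
For readers unfamiliar with the FM model named in this abstract, here is a minimal sketch of a second-order factorization machine prediction in its standard textbook form: a bias, a linear term, and pairwise interactions weighted by inner products of latent factor vectors. It is not taken from the paper; the parameter values are hypothetical and no training is shown. HOFM extends the same idea to interactions of order three and higher, which this sketch does not cover.

```python
def fm_predict(x, w0, w, V):
    """Second-order FM prediction.

    x  : list of n feature values
    w0 : global bias
    w  : list of n linear weights
    V  : n x k list of latent factor vectors (one row per feature)
    """
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    # Pairwise term computed in O(n * k) with the identity
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Hypothetical parameters for three features and two latent factors.
x = [1.0, 0.0, 2.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
prediction = fm_predict(x, w0, w, V)
print(prediction)
```

Because the interaction weight for a feature pair is an inner product of learned vectors rather than a free parameter, the model can score pairs that never co-occurred in the training data, which is what makes it attractive when past data are sparse.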

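An Exploratory Study on Data-Based Measures to Improve the Efficiency of Similar Research Areas and on Deriving Project Priorities: Focusing on Government-Funded Research Institutes and AHP Analysis (데이터 기반 유사연구영역 효율성 제고 방안 및 과제 우선순위 도출에 대한 탐색적 연구 -출연연 사례 및 AHP분석을 중심으로)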

  • Jeong, Jae-Yeon;Choe, San;Gang, In-Je;Jeong, Jae-Ung;Han, Yu-Ri;Jeon, Seung-Pyo
    • Proceedings of the Korea Technology Innovation Society Conference / 2017.05a / pp.537-547 / 2017
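  • Korea's R&D investment as a share of GDP has reached the highest level in the world. Alongside this quantitative expansion and growth of the R&D budget, its efficient use has emerged as an important science and technology policy issue. This study presents objective, data-based indicators that support decision-making when establishing and implementing policies and strategies to improve the efficiency of similar areas of government R&D programs. The proposed efficiency indicators were then linked with 2015 R&D program data of government-funded research institutes extracted from NTIS, and only the quantitative indicators that can actually be measured and used were selected. In addition, the analytic hierarchy process (AHP) was carried out to measure the weights of the government R&D efficiency indicators, and the resulting weights were applied to the indicators to derive project priorities. By setting the priority of the indicators to consider when establishing, implementing, and adjusting policy, the study strengthens the link from policy formulation to implementation for government R&D in similar research areas and suggests a way to use nationally limited resources efficiently.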
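
The analytic hierarchy process step in this abstract can be sketched as follows. This is a generic illustration, assuming the common AHP practice of taking indicator weights from the principal eigenvector of a pairwise comparison matrix (approximated here by power iteration) and ranking projects by a weighted sum. The comparison matrix, indicator values, and project names are hypothetical and are not data from the study.

```python
def ahp_weights(matrix, iterations=100):
    """Approximate the normalized principal eigenvector of a positive
    pairwise comparison matrix by power iteration."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        w = [v / total for v in nxt]  # keep weights summing to 1
    return w


# Hypothetical judgments: indicator 0 is twice as important as indicator 1
# and four times as important as indicator 2 (a fully consistent matrix).
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(comparisons)

# Hypothetical normalized indicator values per project.
projects = {"P1": [0.2, 0.9, 0.5], "P2": [0.8, 0.1, 0.3]}
scores = {
    name: sum(wt * v for wt, v in zip(weights, values))
    for name, values in projects.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(weights, ranking)
```

In practice the comparison matrix comes from expert surveys and is rarely perfectly consistent, so a consistency check is normally run before the weights are used; that step is omitted here.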

Visualization of Local Eating-Out Trend Using AR Graph (증강현실 그래프를 이용한 지역별 외식 성향 시각화)

  • Kim, Sang-Joon;Ko, Yu-Jin;Park, Goo-Man;Choi, Yoo-Joo
    • Annual Conference of KIPS / 2019.05a / pp.700-701 / 2019
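  • This paper proposes an augmented reality (AR) graph suited to visualizing regional data and presents a case in which it is applied to card-usage big data as a tool for visualizing regional eating-out trends. The AR graph selects the target region for analysis from the big data based on the GPS information of the user's location, finds region-specific data, and visualizes the big data analysis results for that region together with the camera image. For the eating-out visualization case, we collected the location of each card merchant, the merchant's business category, the time of card use (month), the cardholder's gender and age group, the monthly card spending, and the monthly number of transactions. The eating-out preferences by age group in the target region are then visualized as a ranking graph of the business categories with the most card transactions, so that users can check them at their own location. The proposed AR graph is expected to be effectively applicable to regional commercial district status, apartment prices, and similar data.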

A Combined Multiple Regression Trees Predictor for Screening Large Chemical Databases (대용량 화학 데이터 베이스를 선별하기위한 결합다중회귀나무 예측치)

  • 임용빈;이소영;정종희
    • The Korean Journal of Applied Statistics / v.14 no.1 / pp.91-101 / 2001
  • Multiple-tree predictors have been shown to reduce test set error more than a single-tree predictor. There are two ways of generating multiple trees. One is to generate modified training sets by resampling the original training set and then construct a tree on each; the arcing algorithm is known to be efficient for this. The other is to randomly perturb the working split at each node, choosing from a list of the best splits, which is expected to generate reasonably good trees for the original training set. We propose a new combined multiple regression trees predictor that uses the latter multiple regression tree predictor as the predictor built on the modified training set at each stage of arcing. The efficiency of these prediction methods is compared by applying them to high-throughput screening of chemical compounds for biological effects.

Automatic Selection of Similar Sentences for Teaching Writing in Elementary School (초등 글쓰기 교육을 위한 유사 문장 자동 선별)

  • Park, Youngki
    • Journal of The Korean Association of Information Education / v.20 no.4 / pp.333-340 / 2016
  • When elementary students write their own sentences, it is often educationally beneficial to compare them with similar sentences written by other people. However, this is impractical in most classrooms, because it is burdensome for teachers to look up all of the sentences written by students. To cope with this problem, we propose a novel approach for automatic selection of similar sentences based on a three-step process: 1) extracting subword units from the word-level sentences, 2) training a model with the encoder-decoder architecture, and 3) using an approximate k-nearest neighbor search algorithm to find similar sentences. Experimental results show that the proposed approach achieves an accuracy of 75% on our test data.

Detection of major genotypes combination by genotype matrix mapping (유전자 행렬 맵핑을 활용한 우수 유전자형 조합 선별)

  • Lee, Jea-Young;Lee, Jong-Hyeong;Lee, Yong-Won
    • Journal of the Korean Data and Information Science Society / v.21 no.3 / pp.387-395 / 2010
  • It is important to identify gene interactions that affect human diseases and trait values. Many approaches, such as logistic analysis, have been pursued, but previous methods did not consider subgroups of genotypes. QTL interaction analysis and genotype matrix mapping (GMM) were developed to address this. In this study, we use the GMM method to detect superior genotype combinations that affect economic traits of Korean cattle. We identify interaction effects of single nucleotide polymorphisms (SNPs) responsible for average daily gain (ADG), marbling score (MS), carcass cold weight (CWT), and longissimus dorsi muscle area (LMA). In addition, we examine the significance of the selected major genotype combinations with a permutation test of the F-measure, which was not obtained by Sachiko et al.

Forecasting Daily Demand of Domestic City Gas with Selective Sampling (선별적 샘플링을 이용한 국내 도시가스 일별 수요예측 절차 개발)

  • Lee, Geun-Cheol;Han, Jung-Hee
    • Journal of the Korea Academia-Industrial cooperation Society / v.16 no.10 / pp.6860-6868 / 2015
  • In this study, we consider the problem of forecasting daily city gas demand in Korea. Forecasting daily gas demand is a routine task for gas providers, and demand must be forecast accurately to guarantee a secure gas supply. We analyze the time series of city gas demand in several ways. The analysis shows that the primary factors affecting city gas demand include the demand of the previous day, temperature, and day of the week. Incorporating these factors, we developed a multiple linear regression model. We also devised a sampling procedure that selectively collects past data according to the characteristics of city gas demand. Test results on real data show that the MAPE (Mean Absolute Percentage Error) obtained by the proposed method is about 2.22%, a relative improvement of about 7% over the existing method in the literature.
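
The MAPE reported in this abstract follows the standard definition: the mean of the absolute forecast errors, each divided by the actual value, expressed as a percentage. A minimal sketch of that metric is below; the demand figures are hypothetical and are not the paper's data.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Assumes both sequences have the same length and that no actual
    value is zero.
    """
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical daily demand values and forecasts.
actual_demand = [100.0, 200.0, 400.0]
forecast_demand = [110.0, 190.0, 400.0]
print(mape(actual_demand, forecast_demand))
```

Because each error is scaled by the actual value, MAPE lets forecasts for high-demand winter days and low-demand summer days be compared on the same footing.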

A study on the Computational Efficiency Improvement for the Conjunction Screening Algorithm (접근물체 선별 알고리즘 계산 효율성 향상 연구)

  • Kim, Hyoung-Jin;Kim, Hae-Dong;Seong, Jae-Dong
    • Journal of the Korean Society for Aeronautical & Space Sciences / v.40 no.9 / pp.818-826 / 2012
  • This paper presents methods for improving the computational efficiency of the conjunction screening algorithm, which calculates the closest distance between a primary satellite and other space objects. The first method uses a GPU (Graphics Processing Unit), which has high computing power and handles large amounts of data quickly. The second method uses an apogee/perigee filter, which excludes non-threatening objects that have a low probability of collision or a minimum distance greater than the threshold. The third method combines the first two. The computational efficiency improved 34 times with the first method alone and 3 times with the second method alone. When the two methods were combined, it improved dramatically, by 163 times.
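
The apogee/perigee filter in this abstract can be illustrated with the geometric test commonly described for it: if the larger of the two perigee radii exceeds the smaller of the two apogee radii by more than the screening threshold, the two orbits can never come within that distance and the object is dropped before any costly closest-approach computation. The sketch below assumes that form; the paper's exact filter, the field names, the orbit values, and the threshold are all hypothetical.

```python
def passes_apogee_perigee_filter(primary, secondary, threshold_km):
    """Return True if the pair cannot be ruled out and still needs a
    detailed closest-approach calculation."""
    larger_perigee = max(primary["perigee_km"], secondary["perigee_km"])
    smaller_apogee = min(primary["apogee_km"], secondary["apogee_km"])
    # A gap above the threshold means the radial ranges never get close.
    return larger_perigee - smaller_apogee <= threshold_km


# Hypothetical orbit radii in kilometers.
primary = {"perigee_km": 690.0, "apogee_km": 710.0}
catalog = {
    "A": {"perigee_km": 300.0, "apogee_km": 400.0},
    "B": {"perigee_km": 700.0, "apogee_km": 1200.0},
    "C": {"perigee_km": 715.0, "apogee_km": 800.0},
    "D": {"perigee_km": 800.0, "apogee_km": 900.0},
}
kept = [
    name
    for name, obj in catalog.items()
    if passes_apogee_perigee_filter(primary, obj, threshold_km=10.0)
]
print(kept)
```

Objects A and D are discarded with two comparisons each, which is why such a filter pairs well with a GPU stage: the expensive distance computation runs only on the survivors.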