• Title/Summary/Keyword: outlier elimination

Search results: 15

Analyzing Influence of Outlier Elimination on Accuracy of Software Effort Estimation (소프트웨어 공수 예측의 정확성에 대한 이상치 제거의 영향 분석)

  • Seo, Yeong-Seok;Yoon, Kyung-A;Bae, Doo-Hwan
    • Journal of KIISE:Software and Applications / v.35 no.10 / pp.589-599 / 2008
  • Accurate software effort estimation has always been a challenge for the industrial and academic software engineering communities. Many studies have focused on estimation methods to improve the accuracy of software effort estimation. Although data quality is one of the important factors for accurate effort estimation, most previous work has not considered it. In this paper, we investigate the influence of outlier elimination on the accuracy of software effort estimation through empirical studies that apply two outlier elimination methods (least trimmed squares regression and K-means clustering) and three effort estimation methods (least squares regression, neural networks and Bayesian networks) in combination. The empirical studies are performed on two industry data sets (the ISBSG Release 9 data set and the Bank data set, which consists of project data collected from a bank in Korea), with and without outlier elimination.
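
The first of the paper's two elimination methods, least trimmed squares (LTS) regression, fits a model only to the h best-fitting points and so resists gross outliers. The following is a minimal numpy sketch using concentration steps (fit, keep the h points with the smallest squared residuals, refit), not the authors' implementation; the subset size h and the toy effort data are invented for illustration.

```python
import numpy as np

def lts_fit(x, y, h, iters=20):
    """Least trimmed squares via concentration steps: repeatedly fit
    ordinary least squares to the h points with the smallest squared
    residuals from the previous fit."""
    idx = np.arange(len(x))                  # start from all points
    for _ in range(iters):
        A = np.c_[np.ones(len(idx)), x[idx]]
        beta, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        resid = y - (beta[0] + beta[1] * x)
        idx = np.argsort(resid ** 2)[:h]     # keep the h best-fitting points
    return beta, idx

# y = 2x + 1 with one gross outlier at index 5.
x = np.arange(1.0, 11.0)
y = 2.0 * x + 1.0
y[5] += 100.0
beta, kept = lts_fit(x, y, h=8)
```

After a couple of concentration steps the trimmed subset excludes the outlier and the fit recovers the clean slope and intercept.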

Elimination of Outlier from Technology Growth Curve using M-estimator for Defense Science and Technology Survey (M-추정을 사용한 국방과학기술 수준조사 기술성장모형의 이상치 제거)

  • Kim, Jangheon
    • Journal of the Korea Institute of Military Science and Technology / v.23 no.1 / pp.76-86 / 2020
  • The technology growth curve methodology is commonly used in technology forecasting. A technology growth curve represents the path of product performance in relation to time or investment in R&D. It is a useful tool for comparing technological performance between Korea and advanced nations and for describing inflection points, the limits of improvement of a technology, technology innovation strategies, etc. However, fitting the curve to a set of survey data often leads to model mis-specification, biased parameter estimation and incorrect results, since data surveyed from experts frequently contain outliers owing to subjective response characteristics. This paper proposes a method to eliminate outliers from a technology growth curve using an M-estimator. The experimental results show overall improvement in the technology growth curves in several pilot tests using real data from Defense Science and Technology Survey reports.
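
An M-estimator down-weights observations with large residuals instead of letting them dominate a least-squares fit. As a hedged sketch of the idea only (illustrated on a linear trend rather than the paper's growth-curve model, with the tuning constant, scale estimate and data all standard textbook choices rather than the authors'), Huber-weighted iteratively reweighted least squares looks like this:

```python
import numpy as np

def huber_irls(x, y, delta=1.345, iters=50):
    """M-estimation via iteratively reweighted least squares with
    Huber weights: large residuals are down-weighted instead of
    dominating the fit."""
    A = np.c_[np.ones_like(x), x]
    w = np.ones_like(y)
    for _ in range(iters):
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        r = y - A @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12  # robust scale (MAD)
        u = np.abs(r) / (delta * s)
        w = 1.0 / np.maximum(u, 1.0)   # Huber weight: 1 inside delta*s, 1/u outside
    return beta

# Linear trend with one gross outlier standing in for a subjective
# survey response.
x = np.arange(10.0)
y = 0.5 * x + 1.0
y[3] += 20.0
beta = huber_irls(x, y)
```

The outlier's weight shrinks with each reweighting pass, so the fitted coefficients converge to the clean trend.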

Improving the Quality of Response Surface Analysis of an Experiment for Coffee-Supplemented Milk Beverage: I. Data Screening at the Center Point and Maximum Possible R-Square

  • Rheem, Sungsue;Oh, Sejong
    • Food Science of Animal Resources / v.39 no.1 / pp.114-120 / 2019
  • Response surface methodology (RSM) is a useful set of statistical techniques for modeling and optimizing responses in food science research. As a design for a response surface experiment, a central composite design (CCD) with multiple runs at the center point is frequently used. However, sometimes some of the responses at the center point are outliers, and these outliers are overlooked. Since the responses from center runs come from the same experimental conditions, there should be no outliers at the center point. Outliers at the center point ruin the statistical analysis. Thus, the responses at the center point need to be inspected, and if outliers are observed, they have to be examined. If the reasons for the outliers are not errors in measuring or typing, such outliers need to be deleted; if the outliers are due to such errors, they have to be corrected. Through a re-analysis of a dataset published in the Korean Journal for Food Science of Animal Resources, we show that outlier elimination increased the maximum possible R-square that modeling of the data can attain, which enables us to improve the quality of the response surface analysis.
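
Because center runs repeat identical conditions, their responses should agree closely, so a simple robust screen can flag suspect replicates. The sketch below uses a MAD-based modified z-score; this is a generic screening device, not the paper's procedure, and the threshold and responses are hypothetical.

```python
import numpy as np

def screen_center_runs(responses, z=3.0):
    """Flag center-point replicates whose absolute deviation from the
    median exceeds z robust standard deviations (MAD / 0.6745)."""
    r = np.asarray(responses, dtype=float)
    med = np.median(r)
    mad = np.median(np.abs(r - med))
    scale = mad / 0.6745 if mad > 0 else 1e-12
    return np.abs(r - med) / scale > z

# Six hypothetical center-run responses; the last looks like a typo.
runs = [7.1, 7.2, 7.0, 7.15, 7.05, 9.8]
flags = screen_center_runs(runs)
```

A flagged replicate would then be checked against the lab records and either corrected (if a typing error) or deleted.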

Development of Integrated Outlier Analysis System for Construction Monitoring Data (건설 계측 데이터에 대한 통합 이상치 분석 시스템 개발)

  • Jeon, Jesung
    • Journal of the Korean GEO-environmental Society / v.21 no.5 / pp.5-11 / 2020
  • Detection and elimination of outliers in field monitoring data are essential for effectively identifying unusual movement and for short- and long-term forecasting of the stability and future behavior of various structures. In this study, an integrated outlier analysis system for assessing long-term time-series data was developed. Outlier analysis can be conducted in two steps: a primary analysis targeting a single dataset, and a secondary multi-dataset analysis using a synthesis value. The integrated outlier analysis system, combined with a real-time safety management platform, provides basic information for evaluating stability and predicting the movement of structures. Field application results showed increased correlation between each single dataset and a synthesis value composed of similar sensors showing a consistent trend. Monitoring data showing different trends can also be analysed for outliers through correlation-weighted values.
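
The abstract does not define the synthesis value precisely; one plausible reading, sketched here purely under that assumption, is a correlation-weighted average of the sensor series, so that a sensor that tracks its neighbours contributes more than one that does not.

```python
import numpy as np

def synthesis_value(series):
    """Weight each sensor series by its correlation with the mean of
    the remaining series, then form a weighted average (weights
    clipped at zero and normalized)."""
    S = np.asarray(series, dtype=float)   # shape: (n_sensors, n_samples)
    w = np.empty(len(S))
    for i in range(len(S)):
        others = np.delete(S, i, axis=0).mean(axis=0)
        w[i] = np.corrcoef(S[i], others)[0, 1]
    w = np.clip(w, 0.0, None)
    w = w / w.sum()
    return w @ S

# Two sensors following the same trend plus one uncorrelated sensor.
t = np.linspace(0, 1, 50)
trend = 5 * t
sensors = [trend + 0.05, trend - 0.03,
           np.random.default_rng(1).normal(0, 1, 50)]
syn = synthesis_value(sensors)
```

The uncorrelated sensor receives little or no weight, so the synthesis value closely follows the shared trend.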

Combined Filtering Model Using Voting Rule and Median Absolute Deviation for Travel Time Estimation (통행시간 추정을 위한 Voting Rule과 중위절대편차법 기반의 복합 필터링 모형)

  • Jeong, Youngje;Park, Hyun Suk;Kim, Byung Hwa;Kim, Youngchan
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.12 no.6 / pp.10-21 / 2013
  • This study suggests a combined filtering model, based on the median absolute deviation (MAD) and a voting rule, to eliminate outlier travel-time data in transportation information systems. As the first filtering step, the model applies the MAD method, assuming travel times follow a normal distribution. A voting rule is then applied to eliminate the outlier travel-time data remaining after the MAD step. Under the voting rule, travel-time samples are judged as outliers according to the difference between each sample and the mean, and whether an outlier is eliminated is determined by a majority rule. In a case study on national highway No. 3, the combined filtering model selectively eliminated only the outliers and improved the accuracy of the estimated travel times.
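
The two-step pipeline can be sketched roughly as follows. The voting step here uses pairwise distances and a majority count, which is only an approximation of the rule described in the abstract, and the gap and z-score threshold are invented values.

```python
import numpy as np

def mad_filter(times, z=3.5):
    """Step 1: drop samples whose modified z-score based on the
    median absolute deviation exceeds the threshold."""
    t = np.asarray(times, dtype=float)
    med = np.median(t)
    mad = np.median(np.abs(t - med))
    if mad == 0:
        return t
    return t[np.abs(0.6745 * (t - med) / mad) <= z]

def voting_filter(times, gap=10.0):
    """Step 2: each sample 'votes' for samples within `gap` of it
    (including itself); a sample is kept only if a majority of
    samples considers it close."""
    t = np.asarray(times, dtype=float)
    close = np.abs(t[:, None] - t[None, :]) <= gap
    return t[close.sum(axis=1) > len(t) / 2]

# Travel-time samples in seconds; 120 is an obvious outlier.
samples = np.array([62, 60, 61, 63, 59, 64, 120, 58, 61], dtype=float)
filtered = voting_filter(mad_filter(samples))
```

Chaining the two filters removes the gross outlier while leaving the plausible samples for travel-time estimation.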

Image registration using outlier removal and triangulation-based local transformation (이상치 제거와 삼각망 기반의 지역 변환을 이용한 영상 등록)

  • Ye, Chul-Soo
    • Korean Journal of Remote Sensing / v.30 no.6 / pp.787-795 / 2014
  • This paper presents an image registration method that applies a Triangulation-based Local Transformation (TLT) to the matched points remaining after elimination of matched points with gross errors. Corners extracted using a geometric mean-based corner detector are matched using Pearson's correlation coefficient and accepted as initial matched points only when they satisfy the Left-Right Consistency (LRC) check. We finally accept the matched points whose RANdom SAmple Consensus (RANSAC)-based global transformation (RGT) errors are smaller than a predefined outlier threshold. After Delaunay triangulated irregular networks (TINs) are created from the final matched points on the reference and sensed images, respectively, an affine transformation is applied to every corresponding triangle, and all inner pixels of the triangles on the sensed image are transformed into the reference image coordinates. The proposed algorithm was tested on KOMPSAT-2 images and showed higher registration accuracy than the RANSAC-based global transformation.
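
The RANSAC-based step that screens matched points can be sketched as a generic affine RANSAC on synthetic correspondences (this is not the authors' code; the iteration count, inlier threshold and toy transform are assumptions):

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2-D affine transform mapping src -> dst."""
    A = np.c_[src, np.ones(len(src))]
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return M  # shape (3, 2): [x y 1] @ M gives the transformed point

def ransac_affine(src, dst, n_iter=200, thresh=2.0, seed=0):
    """Repeatedly fit an affine transform to 3 random correspondences
    and keep the largest set of inliers (reprojection error < thresh)."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(src), 3, replace=False)
        M = fit_affine(src[idx], dst[idx])
        err = np.linalg.norm(np.c_[src, np.ones(len(src))] @ M - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return best

# 20 correspondences under a known affine transform; one gross mismatch.
rng = np.random.default_rng(42)
src = rng.uniform(0, 100, (20, 2))
dst = src @ np.array([[1.01, 0.02], [-0.02, 0.99]]) + np.array([5.0, -3.0])
dst[4] += 40.0
inliers = ransac_affine(src, dst)
```

Matched points failing the error threshold are discarded before the triangulation-based local transformation is built.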

A Development of Preprocessing Models of Toll Collection System Data for Travel Time Estimation (통행시간 추정을 위한 TCS 데이터의 전처리 모형 개발)

  • Lee, Hyun-Seok;NamKoong, Seong J.
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.8 no.5 / pp.1-11 / 2009
  • TCS data reflect the characteristics of traffic conditions. However, TCS data contain outliers that do not represent the travel time of the pertinent section; if these outliers are not eliminated, the estimated travel time may be distorted. Diverse travel times can be observed for the same section and departure time because the variation in travel time increases with section distance, which makes it difficult to calculate a representative travel time. Accordingly, it is important to understand travel-time characteristics in order to compute representative travel times from TCS data. In this study, after analyzing the variation of travel time according to link distance and level of congestion, an outlier elimination model and a smoothing model for TCS data are proposed. The results show that the proposed models can be used to estimate reliable travel times for long-distance paths in which travel times from the same departure time vary widely, the intervals are large, and the representative travel time changes irregularly over short periods.
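
A minimal sketch of the two proposed components, under the (purely illustrative) assumptions of a percentage band around the interval median for outlier elimination and simple exponential smoothing for the interval representatives; the paper's actual models are not specified here.

```python
import numpy as np

def interval_representative(times, spread=0.3):
    """Within one departure-time interval, drop samples more than
    `spread` (30 %) away from the median, then average the rest."""
    t = np.asarray(times, dtype=float)
    med = np.median(t)
    kept = t[np.abs(t - med) <= spread * med]
    return kept.mean()

def smooth(series, alpha=0.3):
    """Exponential smoothing of the interval representatives to damp
    irregular short-period changes."""
    out = [series[0]]
    for v in series[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

# Three departure-time intervals of toll-to-toll travel times (minutes);
# the 80 in the first interval is an outlier.
reps = [interval_representative(v) for v in
        ([31, 29, 30, 80], [32, 30, 31], [33, 35, 31])]
smoothed = smooth(reps)
```

Outliers are removed per interval first, and the smoothed sequence of representatives then changes gradually between adjacent intervals.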


An Application of Support Vector Machines to Customer Loyalty Classification of Korean Retailing Company Using R Language

  • Nguyen, Phu-Thien;Lee, Young-Chan
    • The Journal of Information Systems / v.26 no.4 / pp.17-37 / 2017
  • Purpose Customer loyalty is the most important factor in customer relationship management (CRM), especially in the retailing industry, where customers have many options for where to spend their money. Classifying loyal customers from customer data can help retailing companies build more efficient marketing strategies and gain competitive advantages. This study aims to construct classification models that distinguish the loyal customers of a Korean retailing company, using data mining techniques in the R language. Design/methodology/approach To classify retailing customers, we used a combination of support vector machines (SVMs) and other machine learning (ML) classification algorithms, supported by recursive feature elimination (RFE). In particular, we first cleaned the dataset to remove outliers and impute missing values. We then used an RFE framework to select the most significant predictors. Finally, we constructed models with the classification algorithms, tuned the best parameters and compared their performances. Findings The results reveal that ML classification techniques can work well with CRM data in the Korean retailing industry. Moreover, customer loyalty is affected not only by a unique factor such as the net promoter score but also by purchase habits such as a preference for expensive goods or multi-branch visiting. We also show that, on the retailing customer dataset, the model constructed with the SVM algorithm performed better than the others. We expect that the models in this study can be used by other retailing companies to classify their customers, so that they can focus services on these potential VIP groups. We also hope that the results of this ML analysis using the R language will be useful to other researchers in selecting appropriate ML algorithms.
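
The study pairs SVMs with RFE in R; as a self-contained illustration of the RFE idea alone, the numpy sketch below ranks features with a plain least-squares fit instead of an SVM, on synthetic data (feature count, data and the stopping size are all assumptions).

```python
import numpy as np

def rfe_rank(X, y, keep=2):
    """Recursive feature elimination sketch: repeatedly fit a linear
    least-squares model on standardized features and drop the feature
    with the smallest absolute weight until `keep` remain."""
    active = list(range(X.shape[1]))
    dropped = []
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize so weights compare
    while len(active) > keep:
        A = np.c_[np.ones(len(y)), Xs[:, active]]
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = int(np.argmin(np.abs(w[1:])))  # skip the intercept
        dropped.append(active.pop(weakest))
    return active, dropped

# Synthetic loyalty labels driven by features 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 2 * X[:, 2] > 0).astype(float)
selected, eliminated = rfe_rank(X, y, keep=2)
```

RFE discards the two uninformative features, leaving the predictors that actually drive the labels; in the study's pipeline, the surviving predictors would then feed the SVM classifier.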

Relevancy contemplation in medical data analytics and ranking of feature selection algorithms

  • P. Antony Seba;J. V. Bibal Benifa
    • ETRI Journal / v.45 no.3 / pp.448-461 / 2023
  • This article performs detailed data scrutiny on a chronic kidney disease (CKD) dataset to select efficient instances and relevant features. Data relevancy is investigated using feature extraction, hybrid outlier detection, and handling of missing values. Data instances that do not influence the target are removed using data envelopment analysis to enable reduction of rows. Column reduction is achieved by ranking the attributes through feature selection methodologies, namely, the extra-trees classifier, recursive feature elimination, the chi-squared test, analysis of variance, and mutual information. These methodologies are ranked via the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) using weight optimization to identify the optimal features for model building from the CKD dataset, facilitating better prediction when diagnosing the severity of the disease. An efficient hybrid ensemble and novel similarity-based classifiers are built using the pruned dataset, and the results are thereafter compared with random forest, AdaBoost, naive Bayes, k-nearest neighbors, and support vector machines. The hybrid ensemble classifier yields a better prediction accuracy of 98.31% for the features selected by the extra-trees classifier (ETC), which is ranked as the best by TOPSIS.
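
TOPSIS ranks alternatives by their relative closeness to an ideal solution across weighted criteria. The following is a compact generic sketch with hypothetical scores and weights, not the article's actual evaluation matrix.

```python
import numpy as np

def topsis(scores, weights, benefit=None):
    """Rank alternatives by closeness to the ideal solution.
    scores: (n_alternatives, n_criteria); higher is better for
    benefit criteria, lower is better otherwise."""
    S = np.asarray(scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    if benefit is None:
        benefit = np.ones(S.shape[1], dtype=bool)
    V = w * S / np.linalg.norm(S, axis=0)        # weighted, vector-normalized
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    worst = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - worst, axis=1)
    return d_neg / (d_pos + d_neg + 1e-12)       # closeness in [0, 1]

# Hypothetical accuracy / stability scores for three feature selectors.
scores = [[0.98, 0.90], [0.95, 0.95], [0.90, 0.80]]
closeness = topsis(scores, weights=[0.6, 0.4])
```

The selector with the highest closeness coefficient is ranked first; with these hypothetical numbers the balanced second alternative wins.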