• Title/Summary/Keyword: Outlier Analysis


Lowess and outlier analysis of biological oxygen demand on Nakdong main stream river (낙동강 본류 측정소들의 생물학적 산소요구량 수치에 대한 비모수적 회귀분석과 특이점분석)

  • Kim, Jong Tae
    • Journal of the Korean Data and Information Science Society
    • /
    • v.25 no.1
    • /
    • pp.119-130
    • /
    • 2014
  • This paper is based on the water information system of NIER, the National Institute of Environmental Research. We used monthly water quality data from January 2013 to August 2013 at measuring points A (nbA) through N (nbN) along the main stream of the Nakdong River. Statistical analysis of BOD (biological oxygen demand) was carried out in R by month, year, and measuring point. Using the BOD values measured at these points, we applied exploratory data analysis and locally weighted scatter plot smoother (Lowess) trend analysis, a non-parametric regression method, to analyze the long-term trend and the distribution of water quality by measuring point. We also identified the periods and measuring points with the most outliers. As a result, BOD measured at nbG in Daegu and nbI in Changwon, along the midstream, indicated more severe water pollution than BOD measured at nbM in Busan, downstream.
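
The Lowess smoothing mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' R code: it assumes distinct x values, uses tricube weights with a local linear fit, and omits the robustness iterations of the full procedure. The function name `lowess`, the `frac` parameter, and the BOD values are made up for the example.

```python
def lowess(x, y, frac=0.5):
    """Locally weighted linear smoother with tricube weights (no robustness passes)."""
    n = len(x)
    k = max(3, round(frac * n))  # number of neighbours in each local window
    smoothed = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1]  # bandwidth: distance to the k-th nearest point
        # Tricube weights: points at distance >= h get zero weight.
        w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dist]
        sw = sum(w)
        xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(ym + slope * (x0 - xm))  # local line evaluated at x0
    return smoothed

# Hypothetical monthly BOD readings (mg/L), invented for illustration only.
months = list(range(1, 13))
bod = [2.1, 2.4, 2.9, 3.3, 2.8, 2.5, 2.2, 2.0, 1.9, 1.7, 1.8, 2.0]
trend = lowess(months, bod, frac=0.5)
```

A larger `frac` widens each local window and gives a smoother trend; a smaller one follows the data more closely.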

Outlier Identification in Regression Analysis using Projection Pursuit

  • Kim, Hyojung;Park, Chongsun
    • Communications for Statistical Applications and Methods
    • /
    • v.7 no.3
    • /
    • pp.633-641
    • /
    • 2000
  • In this paper, we propose a method to identify multiple outliers in regression analysis that assumes only smoothness of the regression function. Our method uses a single-linkage clustering algorithm and Projection Pursuit Regression (PPR). It was compared with existing methods on several simulated and real examples and proved very useful in regression problems where the regression function is far from linear.


Sequence-based 5-mers highly correlated to epigenetic modifications in genes interactions

  • Salimi, Dariush;Moeini, Ali;Masoudi-Nejad, Ali
    • Genes and Genomics
    • /
    • v.40 no.12
    • /
    • pp.1363-1371
    • /
    • 2018
  • One of the main concerns in biology is extracting sophisticated features from DNA sequences to determine gene interactions, a problem that has received a great deal of attention from researchers. Epigenetic modifications and their patterns are widely recognized as dominant features affecting gene expression. However, the study of sequence-based features highly correlated to this key element has remained limited. The main objective of this research was to propose a new feature, highly correlated to epigenetic modifications, that is capable of classifying genes. In this paper, 34 genes in the PPAR signaling pathway associated with muscle fat tissue in humans were classified. Using different statistical outlier detection methods, we propose that 5-mers highly correlated to epigenetic modifications can correctly categorize genes involved in the same biological pathway or process. The thirty-four genes were classified by applying the proposed feature, 5-mers strongly associated with 17 different epigenetic modifications. For this, diverse statistical outlier detection methods were applied to identify the group of thoroughly correlated genes. The results indicated that these 5-mers can appropriately identify correlated genes. In addition, our results corresponded to GeneMania interaction information, which supports the suggested method. These findings imply that not only epigenetic modifications but also their highly correlated 5-mers can be used as supplementary data for reconstructing gene regulatory networks, as well as in other applications such as physical interaction and gene prioritization, indicating a form of data fusion in this analysis.

A Study on the Application of Outlier Analysis for Fraud Detection: Focused on Transactions of Auction Exception Agricultural Products (부정 탐지를 위한 이상치 분석 활용방안 연구 : 농수산 상장예외품목 거래를 대상으로)

  • Kim, Dongsung;Kim, Kitae;Kim, Jongwoo;Park, Steve
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.3
    • /
    • pp.93-108
    • /
    • 2014
  • To support business decision making, interest in and efforts to analyze and use transaction data from different perspectives are increasing. Such efforts are not limited to customer management or marketing; they are also used for monitoring and detecting fraudulent transactions. Fraudulent transactions are evolving into various patterns by taking advantage of information technology. To keep up with this evolution, many efforts have been made on fraud detection methods and advanced application systems to improve the accuracy and ease of fraud detection. As a case of fraud detection, this study aims to provide effective fraud detection methods for auction exception agricultural products in the largest Korean agricultural wholesale market. The auction exception products policy exists to complement auction-based trades in the agricultural wholesale market. That is, most trades in agricultural products are performed by auction; however, specific products are designated as auction exception products when the total volume of the products is relatively small, the number of wholesalers is small, or wholesalers have difficulty purchasing the products. However, the auction exception products policy causes several problems with the fairness and transparency of transactions, which calls for fraud detection. In this study, to generate fraud detection rules, a large set of real agricultural product transaction data from the market, covering 2008 to 2010, is analyzed; it comprises more than 1 million transactions and 1 billion US dollars in transaction volume. Agricultural transaction data has unique characteristics such as frequent changes in supply volume and turbulent time-dependent changes in price. Since this was the first attempt to identify fraudulent transactions in this domain, there was no training data set for supervised learning. Therefore, fraud detection rules are generated using an outlier detection approach.
We assume that outlier transactions are more likely to be fraudulent than normal transactions. Outlier transactions are identified by comparison with the daily, weekly, and quarterly average unit prices of product items. The quarterly average unit prices of product items for specific wholesalers are also used to identify outlier transactions. The reliability of the generated fraud detection rules was confirmed by domain experts. To determine whether a transaction is fraudulent, the normal distribution and the normalized Z-value concept are applied. That is, the unit price of a transaction is transformed into a Z-value to calculate its occurrence probability, approximating the distribution of unit prices by a normal distribution. A modified Z-value of the unit price is used rather than the original Z-value, because for auction exception agricultural products the Z-values are influenced by the outlier fraud transactions themselves, since the number of wholesalers is small. The modified Z-values are called Self-Eliminated Z-scores because they are calculated excluding the unit price of the specific transaction being checked for fraud. To show the usefulness of the proposed approach, a prototype fraud transaction detection system was developed using Delphi. The system consists of five main menus and related submenus. The first function of the system is to import transaction databases. The next important function is to set up fraud detection parameters. By changing these parameters, system users can control the number of potential fraud transactions. Execution functions provide the fraud detection results found with the current parameters. The potential fraud transactions can be viewed on screen or exported as files. The study is an initial attempt to identify fraudulent transactions in auction exception agricultural products.
Many research topics remain. First, the scope of the analyzed data was limited by data availability. More data on transactions, wholesalers, and producers is needed to detect fraudulent transactions more accurately. Next, the scope of fraud detection needs to be extended to fishery products. There are also many opportunities to apply other data mining techniques to fraud detection; for example, a time series approach is a potential technique for this problem. Finally, although outlier transactions are detected here based on unit prices, fraud detection rules could also be derived from transaction volumes.
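
The Self-Eliminated Z-score described in this abstract can be illustrated with a short sketch. It is one plausible reading of the idea (each unit price is scored against the mean and sample standard deviation of all the other prices), not the authors' Delphi implementation. The function name, the prices, and the threshold of 3.0 are assumptions made for the example.

```python
from math import inf
from statistics import mean, stdev

def self_eliminated_z(prices):
    """Z-score of each price against the mean and stdev of all *other* prices."""
    scores = []
    for i, p in enumerate(prices):
        others = prices[:i] + prices[i + 1:]  # leave the checked price out
        mu = mean(others)
        sigma = stdev(others)  # sample standard deviation; needs >= 2 other values
        if sigma == 0:
            scores.append(0.0 if p == mu else inf if p > mu else -inf)
        else:
            scores.append((p - mu) / sigma)
    return scores

# Hypothetical unit prices for one product item; 30.0 is the suspicious one.
prices = [10.0, 11.0, 9.0, 10.0, 30.0]
scores = self_eliminated_z(prices)
threshold = 3.0  # assumed cut-off, not taken from the paper
flagged = [p for p, z in zip(prices, scores) if abs(z) > threshold]
```

With only five prices, the ordinary Z-value of 30.0 is about 1.8, because the suspicious price inflates both the mean and the standard deviation. Leaving it out of its own reference statistics raises its score to about 24.5, which matches the motivation given in the abstract for markets with few wholesalers.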

A Reference Value for Cook's Measure

  • Lee, Jae-Jun
    • Communications for Statistical Applications and Methods
    • /
    • v.6 no.1
    • /
    • pp.25-32
    • /
    • 1999
  • A single outlier can influence the least squares estimators and can invalidate analysis based on them. Cook's statistic was introduced to measure the influence of an individual data point on parameter estimation, and a quantile of the F distribution is recommended as a reference value, but in practice subjective judgement is applied in choosing the appropriate quantile. This paper introduces a simple reference value developed by approximating conditional quantities of Cook's measure. The performance of the proposed criterion is evaluated through the analysis of a real data set.
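
As background for this abstract, Cook's measure itself can be computed as below for simple linear regression. The sketch uses the standard formula D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2 with p = 2 parameters. It does not implement the reference value proposed in the paper, which the abstract does not specify. The data are invented.

```python
def cooks_distance(x, y):
    """Cook's distance for each point of a simple linear regression y = a + b*x."""
    n, p = len(x), 2  # p = number of fitted parameters (intercept and slope)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [(e * e / (p * s2)) * h / (1 - h) ** 2 for e, h in zip(resid, leverage)]

# Made-up data: y = 2x + 1 except for the last point, which is shifted upward.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 5, 7, 9, 11, 13, 15, 30]
d = cooks_distance(x, y)
most_influential = d.index(max(d))  # index of the point with the largest D_i
```

The shifted point combines a large residual with high leverage, so it dominates the measure. Deciding how large D_i must be before a point counts as influential is exactly the reference-value question the paper addresses.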


Image Feature Extraction Using Energy field Analysis (에너지장 해석을 통한 영상 특징량 추출 방법 개발)

  • 김면희;이태영;이상룡
    • Proceedings of the Korean Society of Precision Engineering Conference
    • /
    • 2002.10a
    • /
    • pp.404-406
    • /
    • 2002
  • In this paper, a method of image feature extraction is proposed. The method employs energy field analysis, an outlier removal algorithm, and ring projection. Using this algorithm, we achieve rotation-, translation-, and scale-invariant feature extraction. The force field is exploited to automatically locate the extrema of a small number of potential energy wells and their associated potential channels. The image feature is obtained from the relationship among local extrema using the ring projection method.


Long-Term Trend Analysis and Exploratory Data Analysis of Geumho River based on Seasonal Mann-Kendall Test (계절 맨-켄달 기법을 이용한 금호강 본류 BOD의 장기 경향 분석 및 탐색적 자료 분석)

  • Jung, Kang-Young;Lee, In Jung;Lee, Kyung-Lak;Cheon, Se-Uk;Hong, Jun Young;Ahn, Jung-Min
    • Journal of Environmental Science International
    • /
    • v.25 no.2
    • /
    • pp.217-229
    • /
    • 2016
  • The government has implemented total maximum daily loads (TMDL), a plan applied per unit watershed, to manage stable water quality targets by setting the permitted total amount of each pollutant. In this study, BOD concentration trends in the Geumho River over the 10 years from 2005 to 2014 were analyzed. The improvement in water quality over the TMDL implementation period was evaluated using the seasonal Mann-Kendall test and a LOWESS (locally weighted scatter plot smoother) smooth. According to both, the BOD concentration in the Geumho River has decreased or held constant. In a quantitative analysis of BOD concentration using exploratory data analysis (EDA), the mean and median BOD concentrations ranked GH8 > GH7 > GH6 > GH5 > GH4 > GH3 > GH2 > GH1. The monthly average BOD concentration ranked Apr > Mar > Feb > May > Jun > Jul > Jan > Aug > Sep > Dec > Nov > Oct. Outliers were most frequent in February, an estimated 1.5 times more frequent than in July, when they were least frequent. From a water quality management perspective, outliers need to be considered when establishing a management plan for contaminants in the watershed.
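
The seasonal Mann-Kendall test named in this abstract can be sketched as follows. The sketch sums the Mann-Kendall S statistic over seasons and uses the variance formula without tie correction, treating seasons as independent; both simplifications are assumptions of this example. The seasonal series are invented, not the Geumho River data.

```python
from math import sqrt

def sign(v):
    return (v > 0) - (v < 0)

def seasonal_mann_kendall(series_by_season):
    """Return (S, Z) for yearly values grouped by season; no tie correction."""
    s_total, var_total = 0, 0.0
    for values in series_by_season.values():
        n = len(values)
        # Mann-Kendall S within one season: later minus earlier, sign only.
        s_total += sum(sign(values[j] - values[i])
                       for i in range(n) for j in range(i + 1, n))
        var_total += n * (n - 1) * (2 * n + 5) / 18
    # Continuity correction: shrink S by one unit toward zero.
    z = 0.0 if s_total == 0 else (s_total - sign(s_total)) / sqrt(var_total)
    return s_total, z

# Hypothetical BOD values (mg/L) for five consecutive years per season.
bod = {
    "spring": [5.0, 4.6, 4.1, 3.9, 3.2],
    "summer": [3.8, 3.5, 3.1, 3.0, 2.6],
    "autumn": [2.9, 2.7, 2.4, 2.2, 2.1],
    "winter": [4.4, 4.0, 3.7, 3.3, 3.0],
}
s, z = seasonal_mann_kendall(bod)
decreasing = z < -1.96  # two-sided test at the 0.05 level
```

Comparing values only within the same season keeps the regular seasonal cycle from masking or mimicking a long-term trend.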

Study of estimated model of drift through real ship (실선에 의한 표류 예측모델에 관한 연구)

  • Chang-Heon LEE;Kwang-Il KIM;Sang-Lok YOO;Min-Son KIM;Seung-Hun HAN
    • Journal of the Korean Society of Fisheries and Ocean Technology
    • /
    • v.60 no.1
    • /
    • pp.57-70
    • /
    • 2024
  • To present a predictive drift model, Jeju National University's training ship was tested for about 11 hours and 40 minutes, and 81 samples, selected from all samples at ten-minute intervals, were subjected to regression analysis after checking for outliers and influence points. In the outlier and influence point analysis, although the DFBETAS (difference in betas) value for wind direction exceeds 1 in part of the data, the CV (cumulative variable) value is 6%, close to 1. It was therefore judged that multiple regression analysis could be conducted on the samples without problems. The standardized regression coefficients showed how much current and wind affect the dependent variables: current speed and direction were the most important variables for drift speed and direction, with values of 47.1% and 58.1%, respectively. The analysis showed that the model fit was statistically significant at the 0.05 level for multiple regression analysis. The multiple correlation coefficients, indicating the degree of influence on the dependent variables, were 83.2% and 89.0%, respectively. The coefficients of determination were 69.3% and 79.3%, and the adjusted coefficients of determination were 67.6% and 78.3%, respectively. Because outliers and influence points in the sample data are identified before the multiple regression analysis, this study presents a more quantitative prediction model. Many further studies combining these approaches are therefore expected.

Study on Enhancement of TRANSGUIDE Outlier Filter Method under Unstable Traffic Flow for Reliable Travel Time Estimation -Focus on Dedicated Short Range Communications Probes- (불안정한 교통류상태에서 TRANSGUIDE 이상치 제거 기법 개선을 통한 교통 통행시간 예측 향상 연구 -DSRC 수집정보를 중심으로-)

  • Khedher, Moataz Bellah Ben;Yun, Duk Geun
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.18 no.3
    • /
    • pp.249-257
    • /
    • 2017
  • Filtering travel time records obtained from DSRC probes is essential for better estimation of link travel time. This study addresses the major deficiency of TRANSGUIDE in removing anomalous data: the algorithm cannot handle unstable traffic flow conditions in time intervals where fluctuations are observed. This study therefore proposes an algorithm that overcomes this weakness. If TRANSGUIDE fails to validate a sufficient number of observations within one time interval, another process specifies a new validity range based on the median absolute deviation (MAD), a common statistical approach. The proposed algorithm introduces the parameters $\alpha$ and $\beta$ to account for the maximum allowed outlier within one time interval in response to particular traffic flow conditions. The parameter estimation relies on historical data because the parameters need to be updated frequently. To test the proposed algorithm, DSRC probe travel time data were collected from a multilane highway section. The model was calibrated by statistical data analysis using cumulative relative frequency. The qualitative evaluation shows satisfactory performance. The proposed model overcomes the deficiency associated with rapid changes in travel time.
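
A MAD-based validity range of the kind this abstract mentions can be sketched as below. This is a generic median ± k·MAD filter. The multiplier `k` is a hypothetical stand-in and is not the paper's $\alpha$ or $\beta$, whose definitions the abstract does not give. The travel times are invented.

```python
from statistics import median

def mad_validity_range(times, k=3.0):
    """Return (low, high) = median -/+ k * MAD for one time interval."""
    med = median(times)
    mad = median([abs(t - med) for t in times])  # median absolute deviation
    return med - k * mad, med + k * mad

# Hypothetical probe travel times (seconds) within one interval.
times = [62, 65, 61, 64, 63, 180, 66]
low, high = mad_validity_range(times, k=3.0)
valid = [t for t in times if low <= t <= high]
```

Because both the center and the spread are medians, the single 180-second record barely moves the range, whereas a mean and standard deviation computed from the same records would be stretched by it.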

A Performance Analysis of the SIFT Matching on Simulated Geospatial Image Differences (공간 영상 처리를 위한 SIFT 매칭 기법의 성능 분석)

  • Oh, Jae-Hong;Lee, Hyo-Seong
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.29 no.5
    • /
    • pp.449-457
    • /
    • 2011
  • As automated image processing techniques are increasingly required in multi-temporal/multi-sensor geospatial image applications, automated but highly invariant image matching has become a critical ingredient. Geometric and spectral differences between multi-temporal/multi-sensor geospatial images are highly likely because of differences in sensor, acquisition geometry, season, weather, and so on. Among the many image matching techniques, SIFT (Scale Invariant Feature Transform) is popular because it is recognized as very robust to diverse imaging conditions, and it therefore has high potential for geospatial image processing. This paper presents performance test results for SIFT on geospatial imagery, simulating various image differences such as shear, scale, rotation, intensity, noise, and spectral differences. Since a geospatial image application often requires a number of good matching points across the images, the number of matching points was analyzed together with their positional accuracy. The test results show that SIFT is highly invariant but cannot overcome significant image differences. In addition, it does not guarantee outlier-free matching, so the use of outlier removal techniques such as RANSAC (RANdom SAmple Consensus) is highly recommended.
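
The RANSAC idea recommended at the end of this abstract can be shown on a deliberately simplified problem. The sketch fits a straight line to 2D points rather than a geometric transformation between matched image points, so it shows only the consensus principle and is not the paper's setup. The tolerance, iteration count, seed, and data are assumptions.

```python
import random

def ransac_line(points, tol=0.5, iterations=200, seed=0):
    """Return the largest set of points consistent with one line y = a*x + b."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    best = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)  # minimal sample for a line
        if x1 == x2:
            continue  # vertical line: skipped in this simplified model
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tol]
        if len(inliers) > len(best):
            best = inliers  # keep the model with the largest consensus set
    return best

# Ten points on y = 2x + 1 plus three gross outliers (all values invented).
good = [(float(x), 2.0 * x + 1.0) for x in range(10)]
bad = [(2.0, 20.0), (5.0, -7.0), (8.0, 40.0)]
consensus = ransac_line(good + bad)
```

For image matching, the same loop would sample the minimal number of point correspondences needed for the chosen transformation model and count the matches whose reprojection error falls within the tolerance.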