• Title/Abstract/Keywords: outlier detecting


Procedures for Detecting Multiple Outliers in Linear Regression Using R

  • Kwon, Soon-Sun;Lee, Gwi-Hyun;Park, Sung-Hyun
    • Proceedings of the Korean Statistical Society Conference / 2005.11a / pp.13-17 / 2005
  • In recent years, many people have used R as a statistics system. R is frequently updated by many R project teams. We are interested in methods of multiple outlier detection and note that R does not supply such a method. In this talk, we review procedures for detecting multiple outliers and provide more efficient procedures that combine direct and indirect methods using R.


Influence in Testing the Equality of Two Covariance Matrices (두개의 공분산 행렬의 동질성 검정에서의 영향치 분석)

  • Myung Geun Kim
    • The Korean Journal of Applied Statistics / v.7 no.2 / pp.213-224 / 1994
  • A diagnostic method useful for detecting outliers in testing the equality of two covariance matrices is developed using the influence curve approach. This method is easily generalized to more than two covariance matrices. A sample version of the influence measure for detecting outliers is considered based on the empirical distribution functions. The sample version includes as its component terms the well-known test statistic for detecting one outlier at a time introduced by Wilks and its generalization to the two-group case.


A sequential outlier detecting method using a clustering algorithm (군집 알고리즘을 이용한 순차적 이상치 탐지법)

  • Seo, Han Son;Yoon, Min
    • The Korean Journal of Applied Statistics / v.29 no.4 / pp.699-706 / 2016
  • Outlier detection methods that do not perform a test often fail to detect multiple outliers because they are structurally vulnerable to a masking effect or a swamping effect. This paper considers testing procedures that supplement a clustering-based method of identifying the group with a minority of the observations as outliers. One of the general steps is performing a variety of t-tests on individual outlier candidates. This paper proposes a sequential procedure that searches for outliers by changing cutoff values on a cluster tree and performing a test on a set of outlier candidates. The proposed method is illustrated and compared to existing methods through an example and Monte Carlo studies.

Outlier Detection in Growth Curve Model

  • Shim, Kyu-Bark
    • Journal of the Korean Data and Information Science Society / v.14 no.2 / pp.313-323 / 2003
  • This paper discusses the problem of detecting outliers in the growth curve model with an arbitrary covariance structure, known as an unstructured covariance matrix. To detect outliers in the growth curve model, a test statistic using the U-distribution is established. After detecting outliers in the growth curve model, we test for homogeneous and/or heterogeneous covariance matrices using the PSR Quasi-Bayes Criterion. For illustration, one numerical example is discussed, which compares the results before and after outlier deletion.


A Binary Prediction Method for Outlier Detection using One-class SVM and Spectral Clustering in High Dimensional Data (고차원 데이터에서 One-class SVM과 Spectral Clustering을 이용한 이진 예측 이상치 탐지 방법)

  • Park, Cheong Hee
    • Journal of Korea Multimedia Society / v.25 no.6 / pp.886-893 / 2022
  • Outlier detection refers to the task of detecting data that deviate significantly from the normal data distribution. Most outlier detection methods compute an outlier score that indicates the degree to which a data sample deviates from normal. However, setting a threshold on an outlier score to determine whether a data sample is an outlier or normal is not trivial. In this paper, we propose a binary prediction method for outlier detection based on spectral clustering and a one-class SVM ensemble. Given training data consisting of normal data samples, a clustering method is applied to find clusters in the training data, and the ensemble of one-class SVM models trained on each cluster finds the boundaries of the normal data. We show how to obtain a threshold for transforming outlier scores computed from the ensemble of one-class SVM models into binary predictive values. Experimental results with high dimensional text data show that the proposed method can be effectively applied to high dimensional data, especially when the normal training data consists of clusters of different shapes and densities.

Outlier Detection Based on Discrete Wavelet Transform with Application to Saudi Stock Market Closed Price Series

  • RASHEDI, Khudhayr A.;ISMAIL, Mohd T.;WADI, S. Al;SERROUKH, Abdeslam
    • The Journal of Asian Finance, Economics and Business / v.7 no.12 / pp.1-10 / 2020
  • This study investigates the problem of outlier detection based on the discrete wavelet transform in the context of time series data, where the identification and treatment of outliers constitute an important component. An outlier is defined as a data point that deviates markedly from the rest of the observations within a data sample. In this work we focus on the application of the traditional method suggested by Tukey (1977) for detecting outliers in the closing price series of the Saudi Arabia stock market (Tadawul) between Oct. 2011 and Dec. 2019. The method is applied to the details obtained from the MODWT (Maximal-Overlap Discrete Wavelet Transform) of the original series. The results show that the suggested methodology was successful in detecting all of the outliers in the series. The findings of this study suggest that we can model and forecast the volatility of returns from the reconstructed series without outliers using GARCH models. The estimated GARCH volatility model was compared to other asymmetric GARCH models using standard forecast error metrics. It is found that, among the GARCH specifications considered, the performance of the standard GARCH model was as good as that of the gjrGARCH model over the out-of-sample forecasts for returns.
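
The Tukey (1977) rule mentioned in this abstract is commonly implemented as the box-plot "fences" test. The following is a minimal sketch under stated assumptions, not the authors' code: it assumes the MODWT detail coefficients have already been computed elsewhere and are passed in as a plain list, and the function name `tukey_outliers`, the multiplier `k=1.5`, and the sample values are all illustrative choices.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return indices of values lying outside Tukey's fences.

    The fences are Q1 - k*IQR and Q3 + k*IQR. The quartile convention
    ("inclusive" interpolation) is an assumption; other conventions
    give slightly different fences.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Made-up numbers standing in for wavelet detail coefficients.
details = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, 3.0, -0.05]
print(tukey_outliers(details))  # → [6]
```

Only the spike at index 6 falls outside the fences here; in practice the rule would presumably be applied to each detail level separately.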

Development of the Financial Account Pre-screening System for Corporate Credit Evaluation (분식 적발을 위한 재무이상치 분석시스템 개발)

  • Roh, Tae-Hyup
    • The Journal of Information Systems / v.18 no.4 / pp.41-57 / 2009
  • Although financial information greatly influences the decisions of the groups that use it, detecting management fraud and earnings manipulation is difficult using normal audit procedures and corporate credit evaluation processes, owing to the shortage of knowledge about the characteristics of management fraud and to limits on time and cost. These limitations suggest the need for a systematic process for effectively handling the risk of earnings manipulation for credit evaluators, external auditors, financial analysts, and regulators. Most research on management fraud has examined how various characteristics of a company's management affect the occurrence of corporate fraud. This study examines the financial characteristics of companies engaged in fraudulent financial reporting and suggests a model and system for detecting GAAP violations to improve the reliability of accounting information and the transparency of management. Since the detection of management fraud has limited proven theory, this study used a method of detecting outlier (upper and lower bound) financial ratios as a real-field application. The strength of the outlier detecting method is its ease of use and understandability. In the suggested model, 14 variables of the 7 useful variable categories among the 76 financial ratio variables are examined through distribution analysis as possible indicators of fraudulent financial statement accounts. The model developed from these variables shows a hit ratio of 80.82% for the holdout sample. This model was developed as a financial outlier detecting system for a financial institution. External auditors, financial analysts, regulators, and other users of financial statements might use this model to pre-screen potential earnings manipulators in the credit evaluation system. In particular, this model will help the loan evaluators of financial institutions to decide on more objective and effective credit ratings and to improve the quality of financial statements.

Variable Selection and Outlier Detection for Automated K-means Clustering

  • Kim, Sung-Soo
    • Communications for Statistical Applications and Methods / v.22 no.1 / pp.55-67 / 2015
  • An important problem in cluster analysis is the selection of variables that define cluster structure while eliminating noisy variables that mask cluster structure; in addition, outlier detection is a fundamental task for cluster analysis. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The automated K-means clustering procedure consists of three processes: (i) automatically calculating the cluster number and initial cluster center whenever a new variable is added, (ii) identifying outliers for each cluster depending on the variables used, and (iii) selecting variables defining cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach combining a clustering-based approach and a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective at selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.

Plagiarism Detection among Source Codes using Adaptive Methods

  • Lee, Yun-Jung;Lim, Jin-Su;Ji, Jeong-Hoon;Cho, Hwaun-Gue;Woo, Gyun
    • KSII Transactions on Internet and Information Systems (TIIS) / v.6 no.6 / pp.1627-1648 / 2012
  • We propose an adaptive method for detecting plagiarized pairs in a large set of source code. This method is adaptive in that it uses an adaptive algorithm and provides an adaptive threshold for determining plagiarism. Conventional algorithms are based on greedy string tiling or on local alignments of two code strings. However, most of them are not adaptive; they do not consider the characteristics of the program set, which causes a problem for a program set in which all the programs are inherently similar. We propose adaptive local alignment, a variant of local alignment that uses an adaptive similarity matrix. Each entry of this matrix is the logarithm of the probabilities of the keywords based on their frequency in a given program set. We also propose an adaptive threshold based on the local outlier factor (LOF), which represents the likelihood of an entity being an outlier. Experimental results indicate that our method is more sensitive than JPlag, which uses greedy string tiling for detecting plagiarism-suspected code pairs. Further, the adaptive threshold based on the LOF is shown to be effective, and the detection performance shows high sensitivity with negligible loss of specificity, compared with that using a fixed threshold.
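
To make the local outlier factor (LOF) mentioned in this abstract concrete, here is a minimal sketch of the standard LOF computation. It is not the authors' thresholding procedure, which the abstract does not spell out. Assumptions: each entity is a tuple of numbers (here, made-up one-dimensional similarity scores), the neighbourhood is exactly the `k` nearest points with ties broken by input order, and no `k + 1` points coincide (coincident points would give a zero reachability sum). The function name `local_outlier_factor` is illustrative.

```python
import math


def local_outlier_factor(points, k):
    """Return the LOF score of each point (values near 1 mean 'inlier')."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_dist = [], []
    for i in range(n):
        # Indices of the k nearest other points, closest first.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        nn = order[:k]
        neighbors.append(nn)
        k_dist.append(dist[i][nn[-1]])  # distance to the k-th neighbour

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), d(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Made-up similarity scores for six code pairs; the last one stands apart.
scores = [(0.10,), (0.12,), (0.11,), (0.13,), (0.09,), (0.95,)]
for s, lof in zip(scores, local_outlier_factor(scores, k=2)):
    print(f"{s[0]:.2f}  LOF={lof:.2f}")
```

Scores in the dense group come out close to 1, while the isolated score gets a much larger LOF; any cutoff applied to these values would be a separate design decision.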

Graphical Methods for Evaluating the Effect of Outliers in Univariate and Bivariate Data (일변량 및 이변량 자료에 대하여 특이값의 영향을 평가하기 위한 그래픽 방법)

  • Jang, Dae-Heung
    • Proceedings of the Korean Society for Quality Management Conference / 2006.11a / pp.221-226 / 2006
  • We usually use two techniques (influence function and local influence) for detecting outliers. However, these techniques are too difficult to use in an elementary industrial statistics course for college students. Instead, we can use some simple graphical methods (box plot, dandelion seed plot, influence graph, and cumulative deletion plot) for univariate and bivariate outlier detection and for evaluating the effect of outliers in such a course.
