• Title/Summary/Keyword: Outlier Analysis

A survey on unsupervised subspace outlier detection methods for high dimensional data (고차원 자료의 비지도 부분공간 이상치 탐지기법에 대한 요약 연구)

  • Ahn, Jaehyeong;Kwon, Sunghoon
    • The Korean Journal of Applied Statistics
    • /
    • v.34 no.3
    • /
    • pp.507-521
    • /
    • 2021
  • Detecting outliers in high-dimensional data requires screening the variables, since the relevant information is often contained in only a few of them. When many irrelevant variables are included, the distances between all pairs of observations tend to become similar, which makes the degree of outlierness of all observations alike. Subspace outlier detection methods overcome this problem by measuring the degree of outlierness of each observation on relevant subsets of the variables. In this paper, we survey recent subspace outlier detection techniques, classifying them into three major types according to the subspace selection method. We summarize the techniques of each type in terms of how they select the relevant subspaces and how they measure the degree of outlierness. In addition, we introduce some computing tools for implementing the subspace outlier detection techniques and present results from a simulation study and a real data analysis.
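
The distance-concentration effect described in this abstract can be illustrated with a small synthetic experiment. The sketch below is not taken from the surveyed paper: the data, the planted outlier at (8, 8), the number of irrelevant variables, and the helper name `mean_distance_ratio` are all assumptions chosen for illustration. It compares how strongly the planted outlier stands out when distances use only the two relevant variables versus all variables.

```python
import math
import random


def mean_distance_ratio(points, idx):
    """Mean distance from points[idx] to all other points, divided by the
    (upper) median of the same quantity over the remaining points."""
    def mean_dist(i):
        total = sum(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
        return total / (len(points) - 1)

    target = mean_dist(idx)
    others = sorted(mean_dist(i) for i in range(len(points)) if i != idx)
    return target / others[len(others) // 2]


rng = random.Random(0)
n_inliers, n_irrelevant = 100, 300  # assumed sizes, for illustration only

# Two relevant variables: standard normal inliers plus one planted outlier.
relevant = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_inliers)]
relevant.append([8.0, 8.0])

# Full data: the same rows with many irrelevant pure-noise variables appended.
full = [row + [rng.gauss(0, 1) for _ in range(n_irrelevant)] for row in relevant]

ratio_subspace = mean_distance_ratio(relevant, n_inliers)
ratio_full = mean_distance_ratio(full, n_inliers)
print(f"relevant subspace: {ratio_subspace:.2f}, all variables: {ratio_full:.2f}")
```

In the relevant subspace the planted point is several times farther from the data than a typical observation, while in the full space the ratio falls close to 1, which is the motivation for scoring outlierness on selected subspaces.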

Outlier detection of main engine data of a ship using ensemble method (앙상블 기법을 이용한 선박 메인엔진 빅데이터의 이상치 탐지)

  • KIM, Dong-Hyun;LEE, Ji-Hwan;LEE, Sang-Bong;JUNG, Bong-Kyu
    • Journal of the Korean Society of Fisheries and Ocean Technology
    • /
    • v.56 no.4
    • /
    • pp.384-394
    • /
    • 2020
  • This paper proposes a machine-learning-based outlier detection model that can diagnose the presence or absence of abnormalities in major engine parts through unsupervised learning analysis of a ship's main engine big data. Engine data were collected from the ship for more than seven months, and expert knowledge and correlation analysis were used to select features closely related to the operation of the main engine. For the unsupervised learning analysis, an ensemble model, in which many predictive models are strategically combined to increase performance, is used for anomaly detection. As a result, the proposed model successfully separated the anomalous engine status from the normal status. To validate our approach, clustering analysis was conducted to identify the different patterns among the anomalous points. By examining the distribution of each cluster, we could successfully find the patterns of anomalies.

A study on the Flood Frequency Analyzed in Consideration of Low Outliers. (Low Outliers를 고려한 홍수빈도분석에 관한 연구)

  • 이순혁;홍성표;박명근
    • Magazine of the Korean Society of Agricultural Engineers
    • /
    • v.30 no.4
    • /
    • pp.62-70
    • /
    • 1988
  • This study addresses the unsuitable parameters and the uncertainty in the design flood that can arise when low outliers fall below the trend of the rest of the data. A reasonable design flood was derived by adjusting for low outliers in a flood frequency analysis based on the Log Pearson Type III distribution. Three subwatersheds along the Geum River basin whose annual maximum series include low outliers were selected as study basins. The results are summarized as follows. 1. The Log Pearson Type III distribution was confirmed as reasonable by the $\chi^2$ goodness-of-fit test at the Gong Ju, Gyu Am, and Og Cheon watersheds along the Geum River basin. 2. Probable flood flows for each watershed were derived from the flood frequency curve with outliers. 3. The weighted skew coefficient for each watershed was calculated to evaluate the frequency factor needed for the modification of low outliers. 4. It was confirmed that, at all watersheds, the adjusted frequency curve lies lower than the curve obtained by deleting the low outliers. 5. Final probable flood flows for the three watersheds were derived from the modified basic statistics. 6. Comparing the frequency curve with modification against the one with outliers, the former gives a higher probable flood flow for return periods within three years, and the latter gives a higher one for return periods over three years.

The Use of Local Outlier Factor(LOF) for Improving Performance of Independent Component Analysis(ICA) based Statistical Process Control(SPC) (LOF를 이용한 ICA 기반 통계적 공정관리의 성능 개선 방법론)

  • Lee, Jae-Shin;Kang, Bok-Young;Kang, Suk-Ho
    • Journal of the Korean Operations Research and Management Science Society
    • /
    • v.36 no.1
    • /
    • pp.39-55
    • /
    • 2011
  • Process monitoring has been emphasized for complex systems such as chemical processing plants as a means of achieving efficiency enhancement, quality management, and safety improvement. Recently, ICA (Independent Component Analysis) based MSPC (Multivariate Statistical Process Control) has been widely used in process monitoring. Moreover, DICA (Dynamic ICA) has been introduced to account for system dynamics. However, the existing approaches are limited in that their performance depends strongly on the statistical distributions of the control variables. To overcome this limitation, we propose a novel process monitoring approach that integrates DICA and LOF (Local Outlier Factor), with the aim of improving the fault detection rate. LOF detects local outliers using the density of the surrounding space, so its performance does not depend on the data distribution. Therefore, the proposed method not only accounts for system dynamics but also assures robust performance regardless of the statistical distributions of the control variables. Comparison experiments on a widely used benchmark dataset, the Tennessee Eastman process (TE process), showed improved performance over existing approaches.
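
To make the density-based idea behind LOF concrete, here is a minimal sketch of the standard Local Outlier Factor computation (k-distance, reachability distance, local reachability density). It is not the DICA-LOF pipeline of the paper: the toy grid data, the choice `k=3`, and the function name `local_outlier_factor` are assumptions, and ties among equally distant neighbours are broken by index order for simplicity.

```python
import math


def local_outlier_factor(points, k):
    """Return the LOF score of every point, using exactly k nearest neighbours."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Indices of the k nearest neighbours of each point (the point itself excluded).
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [dist[i][neighbors[i][-1]] for i in range(n)]

    def local_reachability_density(i):
        # Reachability distance of i from j is max(k-distance(j), d(i, j)).
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbour density relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Toy data (assumed): a dense 5x5 unit grid plus one isolated point.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
data = grid + [(10.0, 10.0)]
scores = local_outlier_factor(data, k=3)
print(f"isolated point LOF: {scores[-1]:.2f}, max grid LOF: {max(scores[:-1]):.2f}")
```

Scores near 1 indicate a point whose local density matches that of its neighbours; the isolated point scores far above 1 without any assumption about the overall distribution of the data.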

Speaker Recognition Based on Robust PCA (강인한 주성분 분석법을 갖는 화자인식)

  • Lee Youn Jeong;Lee Ki Yong
    • Proceedings of the Acoustical Society of Korea Conference
    • /
    • spring
    • /
    • pp.225-228
    • /
    • 2002
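  • This paper proposes a speaker recognition method based on robust principal component analysis (Robust PCA). Robust PCA is used to build a robust speaker model while reducing the feature vectors to k dimensions when outliers are present among them. In conventional PCA, the pure speaker information can be corrupted by outliers such as noise, so robust PCA is used to reduce the influence of outliers. When training a k-dimensional diagonal GMM for each speaker, the number of mixtures is adapted to minimize data storage. In experiments on isolated digit utterances from 200 speakers comparing the conventional diagonal GMM method with the proposed method, the proposed method achieved a verification rate about $1.5\%$ higher.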

A Note on Bayesian Prediction Analysis for the Rayleigh Model in the presence of Outliers

  • Ko, Jeong-Hwan;Kim, Yeung-Hoon
    • 한국데이터정보과학회:학술대회논문집
    • /
    • 2003.05a
    • /
    • pp.171-176
    • /
    • 2003
  • This paper deals with the problem of predicting order statistics in samples from a Rayleigh population when an outlier is present. The Bayesian predictive distribution and prediction bounds of the p-th order statistic are obtained when an outlier of type $\theta\delta$ is present. In this connection, some identities are derived.

A Suggestion to Establish Statistical Treatment Guideline for Aircraft Manufacturer (국산 복합재료 시험데이터 처리지침 수립을 위한 제언)

  • Suh, Jangwon
    • Journal of Aerospace System Engineering
    • /
    • v.8 no.4
    • /
    • pp.39-43
    • /
    • 2014
  • This paper examines the statistical procedures that should be performed with caution in the composite material qualification and equivalency process, and describes statistically significant considerations for outlier detection and handling, data pooling through normalization, review of data distributions, and determination of design allowables for structural analysis. Based on these considerations, the need for guidance on statistical procedures for aircraft manufacturers who use composite material property databases is proposed.

Outlier detection for multivariate long memory processes (다변량 장기 종속 시계열에서의 이상점 탐지)

  • Kim, Kyunghee;Yu, Seungyeon;Baek, Changryong
    • The Korean Journal of Applied Statistics
    • /
    • v.35 no.3
    • /
    • pp.395-406
    • /
    • 2022
  • This paper studies outlier detection for multivariate long memory time series. Existing outlier detection methods are based on a short memory VARMA model, so they are not suitable for multivariate long memory time series. This is because a higher-order autoregressive model is necessary to account for long memory, but it can also induce estimation instability as the number of parameters increases. To resolve this issue, we propose outlier detection methods based on the VHAR structure. We also adapt a robust estimation method to estimate the VHAR coefficients more efficiently. Our simulation results show that the proposed method performs well in detecting outliers in multivariate long memory time series. An empirical analysis of stock index data shows that the RVHAR model finds additional outliers that the existing model does not detect.

An Outlier Detection Method in Penalized Spline Regression Models (벌점 스플라인 회귀모형에서의 이상치 탐지방법)

  • Seo, Han Son;Song, Ji Eun;Yoon, Min
    • The Korean Journal of Applied Statistics
    • /
    • v.26 no.4
    • /
    • pp.687-696
    • /
    • 2013
  • The detection and examination of outliers are important parts of data analysis because some outliers in the data may have a detrimental effect on statistical analysis. Outlier detection methods have been discussed by many authors. In this article, we propose applying Hadi and Simonoff's (1993) method to a penalized spline regression model to detect multiple outliers. Simulated and real data sets are used to illustrate the proposed procedure and to compare it with penalized spline regression and robust penalized spline regression.

Outlier Detection and Treatment for the Conversion of Chemical Oxygen Demand to Total Organic Carbon (화학적산소요구량의 총유기탄소 변환을 위한 이상자료의 탐지와 처리)

  • Cho, Beom Jun;Cho, Hong Yeon;Kim, Sung
    • Journal of Korean Society of Coastal and Ocean Engineers
    • /
    • v.26 no.4
    • /
    • pp.207-216
    • /
    • 2014
  • Total organic carbon (TOC) is an important indicator used as a direct biological index in marine carbon cycle research. Because available TOC data are scarce relative to Chemical Oxygen Demand (COD) data, sufficient TOC estimates can be produced from the COD data. Outlier detection and treatment (removal) should be carried out reasonably and objectively because the COD-TOC conversion equation directly affects the TOC estimates. This study aims to suggest the optimal regression model using the available salinity, COD, and TOC data observed in the Korean coastal zone. The optimal regression model is selected by comparing, across diverse methods for detecting outliers and influential observations, the changes in the number of data points before and after removal, the coefficients of variation, and the root mean square (RMS) error. The results show that a diagnostic case combining the SIQR (Semi-Inter-Quartile Range) boxplot and Cook's distance is most suitable for outlier detection. The optimal regression function is estimated as $\mathrm{TOC\,(mg/L)} = 0.44 \cdot \mathrm{COD\,(mg/L)} + 1.53$, with a coefficient of determination of 0.47 and an RMS error of 0.85 mg/L. The RMS error and the coefficients of variation of the leverage values are greatly reduced, to 31% and 80% of their values before outlier removal. The method suggested in this study can provide a more appropriate regression curve because it removes the excessive impact of the outliers frequently included in COD and TOC monitoring data.
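
As a rough illustration of two ingredients named in this abstract, the sketch below applies the reported conversion equation and computes Cook's distance for a simple linear regression. Only the coefficients 0.44 and 1.53 come from the abstract. The data are synthetic, the function names are hypothetical, and the cutoff of 4/n is a common rule of thumb assumed here, not necessarily the criterion used in the study. The SIQR boxplot step is not reproduced.

```python
def toc_from_cod(cod_mg_per_l):
    """COD-to-TOC conversion using the coefficients quoted in the abstract."""
    return 0.44 * cod_mg_per_l + 1.53


def cooks_distances(x, y):
    """Cook's distance of each observation in the least-squares fit y = a + b*x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e * e / (p * s2) * h / (1 - h) ** 2 for e, h in zip(residuals, leverage)]


# Synthetic COD/TOC pairs (assumed): points near the reported line, small
# alternating noise, and one planted outlier at the last observation.
cod = [float(v) for v in range(1, 11)]
toc = [toc_from_cod(c) + (0.05 if i % 2 == 0 else -0.05) for i, c in enumerate(cod)]
toc[-1] += 5.0

distances = cooks_distances(cod, toc)
threshold = 4 / len(cod)  # assumed rule-of-thumb cutoff
flagged = [i for i, d in enumerate(distances) if d > threshold]
print("flagged observation indices:", flagged)
```

Because Cook's distance combines residual size with leverage, an extreme COD value with a poorly fitting TOC value is flagged, while the points near the line are left alone.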