• Title/Summary/Keyword: Skewed Data

Statistically Optimized Asynchronous Barrel Shifters for Variable Length Codecs (통계적으로 최적화된 비동기식 가변길이코덱용 배럴 쉬프트)

  • Peter A. Beerel;Kim, Kyeoun-Soo
    • The Journal of Korean Institute of Communications and Information Sciences / v.28 no.11A / pp.891-901 / 2003
  • This paper presents low-power asynchronous barrel shifters for variable length encoders and decoders useful in portable applications using multimedia standards. Our approach is to create multi-level asynchronous barrel shifters optimized for the skewed shift control statistics often found in these codecs. For common shifts, data passes through one level, whereas for rare shifts, data passes through multiple levels. We compare our optimized designs with the straightforward asynchronous and synchronous designs. Both pre- and post-layout HSPICE simulation results indicate that, compared to their synchronous counterparts, our designs provide over 40% savings in average energy consumption for a given average performance.
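
The benefit of matching shifter levels to skewed shift statistics can be illustrated with a small expected-value calculation. This is only a sketch: the shift probabilities and the level assignments below are hypothetical, not figures from the paper, and the number of levels traversed is used as a rough stand-in for energy.

```python
def expected_levels(shift_probs, levels_for_shift):
    """Average number of shifter levels a data word passes through."""
    return sum(p * levels_for_shift[s] for s, p in shift_probs.items())

# Hypothetical skewed shift statistics: small shifts dominate.
shift_probs = {1: 0.5, 2: 0.3, 3: 0.1, 4: 0.05, 5: 0.05}

# Hypothetical two-level design: common shifts exit after level 1,
# rare shifts continue to level 2.
two_level = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2}

# Baseline in which every shift pays for both levels.
uniform = {s: 2 for s in shift_probs}

print(expected_levels(shift_probs, two_level))
print(expected_levels(shift_probs, uniform))
```

Under these assumed numbers, 80% of the words leave after the first level, so the average path is noticeably shorter than in the baseline where every word crosses both levels.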

Statistical Methods to Control Response Bias in Nursing Activity Surveys (간호활동시간 조사 시 응답편이 통제를 위한 통계적 접근 방안)

  • Lim, Ji-Young;Park, Chang-Gi
    • Journal of Korean Academy of Nursing / v.42 no.1 / pp.48-55 / 2012
  • Purpose: The aim of this study was to compare statistical methods to control response bias in nursing activity surveys. Methods: Data were collected at a medical unit of a general hospital. The number of nursing activities and consumed activity time were measured using self-report questionnaires. Descriptive statistics were used to identify general characteristics of the units. Average, Z-standardization, gamma regression, finite mixture model, and stochastic frontier model were adopted to estimate true activity time controlling for response bias. Results: The nursing activity time data were highly skewed and had non-normal distributions. Among the 4 different methods, only gamma regression and the stochastic frontier model controlled response bias effectively, and the estimated total nursing activity time did not exceed total work time. However, in gamma regression, the estimated total nursing activity time was too small to use in real clinical settings. Thus, the stochastic frontier model was the most appropriate method to control response bias when compared with the other methods. Conclusion: According to these results, we recommend the use of a stochastic frontier model to estimate true nursing activity time when using self-report surveys.

A Study on a Measure for Non-Normal Process Capability (비정규 공정능력 측도에 관한 연구)

  • 김홍준;김진수;조남호
    • Proceedings of the Korean Reliability Society Conference / 2001.06a / pp.311-319 / 2001
  • All indices that are now in use assume normally distributed data, and any use of the indices on non-normal data results in inaccurate capability measurements. Therefore, $C_{s}$ is proposed, which extends the most useful index to date, the Pearn-Kotz-Johnson $C_{pmk}$, by not only taking into account that the process mean may not lie midway between the specification limits and incorporating a penalty when the mean deviates from its target, but also incorporating a penalty for skewness. We therefore propose a new process capability index, $C_{psk}$(WV), that applies the weighted variance control charting method to non-normally distributed data. The main idea of the weighted variance method (WVM) is to divide a skewed or asymmetric distribution at its mean into two normal distributions, creating two new distributions that have the same mean but different standard deviations. In this paper we present an example, a distribution generated from the Johnson family of distributions, to demonstrate how the weighted variance-based process capability indices perform in comparison with two other non-normal methods, namely the Clements and the Wright methods. This example shows that the weighted variance-based indices are more consistent than the other two methods in terms of sensitivity to departure of the process mean/median from the target value for non-normal processes.
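
The split described above can be sketched as follows. The abstract gives no formulas, so this assumes one commonly cited form of the weighted variance method, in which the overall standard deviation is scaled by `sqrt(2 * P)` on the upper side and `sqrt(2 * (1 - P))` on the lower side, with `P` the proportion of observations at or below the mean. The function name and the sample data are hypothetical.

```python
import math
import statistics

def weighted_variance_split(sample):
    """Return (mean, sigma_lower, sigma_upper) for a possibly skewed sample.

    Assumed form of the weighted variance method: both halves share the
    sample mean, and the overall standard deviation is reweighted by the
    share of observations on each side of the mean.
    """
    mean = statistics.fmean(sample)
    sigma = statistics.stdev(sample)
    p_below = sum(1 for x in sample if x <= mean) / len(sample)
    sigma_upper = sigma * math.sqrt(2 * p_below)
    sigma_lower = sigma * math.sqrt(2 * (1 - p_below))
    return mean, sigma_lower, sigma_upper

# Hypothetical right-skewed sample: most values are small, one is large.
mean, lower, upper = weighted_variance_split([1, 2, 3, 4, 20])
print(mean, lower, upper)
```

For a symmetric sample `P` is 0.5 and both deviations reduce to the ordinary standard deviation; for a right-skewed sample the upper deviation is the larger of the two.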

A Comparison of Ensemble Methods Combining Resampling Techniques for Class Imbalanced Data (데이터 전처리와 앙상블 기법을 통한 불균형 데이터의 분류모형 비교 연구)

  • Lee, Hee-Jae;Lee, Sungim
    • The Korean Journal of Applied Statistics / v.27 no.3 / pp.357-371 / 2014
  • There are many studies related to imbalanced data in which the class distribution is highly skewed. To address the problem of imbalanced data, previous studies have used resampling techniques that correct the skewness of the class distribution in each sampled subset through under-sampling, over-sampling, or hybrid sampling such as SMOTE. Ensemble methods have also alleviated the problem of class imbalanced data. In this paper, we compare around a dozen algorithms that combine ensemble methods and resampling techniques, based on simulated data sets generated by the Backbone model, which can handle the imbalance rate. Results on various real imbalanced data sets are also presented to compare the effectiveness of the algorithms. As a result, we highly recommend combining resampling techniques with ensemble methods for imbalanced data in which the proportion of the minority class is less than 10%. We also find that each ensemble method has a well-matched sampling technique. The algorithms that combine bagging or random forest ensembles with random under-sampling tend to perform well; however, the boosting ensemble appears to perform better with over-sampling. All ensemble methods combined with SMOTE perform well in most situations.
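
Random under-sampling, the simplest of the resampling techniques mentioned above, can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name and the toy data are hypothetical, and it is not the procedure evaluated in the paper.

```python
import random
from collections import defaultdict

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of every class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    n_min = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample without replacement so no row is duplicated.
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: 5% minority class (label 1), 95% majority class (label 0).
rows = list(range(100))
labels = [1] * 5 + [0] * 95
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(len(balanced_rows), balanced_labels.count(0), balanced_labels.count(1))
```

In a bagging-style combination, one would typically call such a function with a different seed for each base learner, so that every learner sees a different balanced subset of the majority class.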

DGR-Tree : An Efficient Index Structure for POI Search in Ubiquitous Location Based Services (DGR-Tree : u-LBS에서 POI의 검색을 위한 효율적인 인덱스 구조)

  • Lee, Deuk-Woo;Kang, Hong-Koo;Lee, Ki-Young;Han, Ki-Joon
    • Journal of Korea Spatial Information System Society / v.11 no.3 / pp.55-62 / 2009
  • Location Based Services in the ubiquitous computing environment, namely u-LBS, use very large and skewed sets of spatial objects that are closely related to locational information. It is especially essential to achieve fast search for POI (Points of Interest) related to the location of users. This paper examines how to search large and skewed POI data efficiently in the u-LBS environment. We propose the Dynamic-level Grid based R-Tree (DGR-Tree), an index for point data that can reduce the cost of stationary POI search. DGR-Tree uses an R-Tree as a primary index and a Dynamic-level Grid as a secondary index. DGR-Tree is optimized for point data and solves the overlapping problem among leaf nodes. The Dynamic-level Grid of DGR-Tree is created dynamically according to the density of POI. Each cell in the Dynamic-level Grid has a leaf node pointer for direct access to the corresponding leaf node of the primary index. Therefore, index access performance is improved greatly by accessing the leaf node directly through the Dynamic-level Grid. We also propose a K-Nearest Neighbor (KNN) algorithm for DGR-Tree, which utilizes the Dynamic-level Grid for fast access to candidate cells. The KNN algorithm for DGR-Tree provides a mechanism that directly accesses the cell enclosing a given query point and its adjacent cells without tree traversal. The KNN algorithm minimizes the cost of sorting candidate lists by minimum distance and provides an NEB (Non Extensible Boundary), which removes the need to consider the extension of candidate nodes for KNN search.
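
The idea of reaching candidate cells directly, without tree traversal, can be sketched with a plain uniform grid. This is a deliberate simplification and not the DGR-Tree itself: there is no R-Tree, the grid has a single fixed level, and the class and method names are hypothetical.

```python
from collections import defaultdict

class GridIndex:
    """Fixed-size grid over 2D points; a stand-in for a grid secondary index."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Map a coordinate to the integer id of its enclosing cell.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y, poi):
        self.cells[self._cell(x, y)].append((x, y, poi))

    def candidates(self, x, y):
        """POIs in the cell enclosing (x, y) and in its 8 adjacent cells."""
        cx, cy = self._cell(x, y)
        found = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                found.extend(self.cells.get((cx + dx, cy + dy), []))
        return found

    def nearest(self, x, y, k):
        """Up to k closest POIs among the candidate cells only."""
        cands = self.candidates(x, y)
        cands.sort(key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
        return [p[2] for p in cands[:k]]

index = GridIndex(cell_size=10)
index.insert(1, 1, "cafe")
index.insert(2, 2, "bank")
index.insert(55, 55, "museum")
print(index.nearest(0, 0, k=2))
```

Because this sketch looks only at the enclosing cell and its neighbours, it is approximate: it can return fewer than `k` results or miss a closer point just outside the neighbourhood. A complete KNN search would widen the set of cells until the result is guaranteed.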

Bivariate skewness, kurtosis and surface plot (이변량 왜도, 첨도 그리고 표면그림)

  • Hong, Chong Sun;Sung, Jae Hyun
    • Journal of the Korean Data and Information Science Society / v.28 no.5 / pp.959-970 / 2017
  • In this study, we propose bivariate skewness and kurtosis statistics and suggest a surface plot that can visually represent bivariate data, including the correlation coefficient. The skewness statistic is expressed as a pair of real values because it represents the skewed directions and degrees of the bivariate random sample. The kurtosis has a positive value that indicates how thick the tails of the data are compared to the bivariate normal distribution. Moreover, the surface plot represents bivariate data based on the quantile vectors. Skewness and kurtosis are obtained and surface plots are explored for various types of bivariate data. These results show that the values of the skewness and kurtosis reflect the characteristics of the bivariate data represented by the surface plots. Therefore, the skewness, kurtosis, and surface plot proposed in this paper could be used as valuable descriptive statistical methods for analyzing bivariate distributions.

Ensemble Learning for Solving Data Imbalance in Bankruptcy Prediction (기업부실 예측 데이터의 불균형 문제 해결을 위한 앙상블 학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems / v.15 no.3 / pp.1-15 / 2009
  • In a classification problem, data imbalance occurs when the number of instances in one class greatly outnumbers the number of instances in the other class. Such data sets often cause a default classifier to be built because of the skewed boundary, which reduces the classification accuracy of the classifier. This paper proposes Geometric Mean-based Boosting (GM-Boost) to resolve the problem of data imbalance. Since GM-Boost introduces the notion of the geometric mean, it can carry out the learning process considering both the majority and minority sides, and reinforce the learning on misclassified data. An empirical study of bankruptcy prediction for Korean companies shows that GM-Boost has higher classification accuracy than previous methods used for imbalanced data, including under-sampling, over-sampling, and AdaBoost, and that its learning performance is robust regardless of the degree of data imbalance.
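
Why a geometric mean helps on imbalanced data can be seen from the common G-mean metric, the square root of the product of the accuracies on each class. The abstract does not say exactly how GM-Boost uses the geometric mean, so the sketch below only illustrates the metric; the function name and the toy labels are hypothetical.

```python
import math

def geometric_mean_accuracy(y_true, y_pred, positive=1):
    """Square root of (accuracy on positives) * (accuracy on negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Toy case: 5 bankrupt firms (1) among 100; the classifier predicts
# "not bankrupt" (0) for every firm.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, geometric_mean_accuracy(y_true, y_pred))
```

The default classifier scores 0.95 on plain accuracy but 0 on the geometric mean, because it never identifies a minority instance.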

An Exploratory Analysis on the User Response Pattern and Quality Characteristics of Marketing Contents in the SNS of Regional Government (지역마케팅 콘텐츠의 사용자 반응패턴과 품질특성에 관한 탐색적 분석: 지방자치단체가 운영하는 SNS를 중심으로)

  • Jeong, Yeon-Su;Jeong, Dae-Yul
    • The Journal of Information Systems / v.26 no.4 / pp.419-442 / 2017
  • Purpose: The purpose of this study is to explore the pattern of user response and its duration through social media content response analysis. We also analyze the characteristics of content quality factors that are associated with the user response pattern. The analysis results provide implications for developing strategies and schematic plans for operators of regional marketing on SNS. Design/methodology/approach: This study used mixed methods to verify the effects of social media content on, and the responses of, users who are interested in regional events such as local festivals, cultural events, and city tours. Big data analysis was conducted with quantitative data from regional government SNSs. The data were collected through web crawling in order to analyze the social media content. We especially analyzed content duration and peak-level time. This study also analyzed the characteristics of content quality factors using expert evaluation data on the social media content. Finally, we verify the relationship between the content quality factors and user response types by cross-correlation analysis. Findings: According to the big data analysis, we found a content life cycle that can be explained through an empirical distribution with a peak-time pattern and a left-skewed long tail. The user response patterns depend on time and content quality. In addition, this study confirms that the quality level of social media content is closely related to user interaction and response pattern. The content response pattern analysis indicates that it is necessary to develop a high-quality content design strategy as well as content posting and propagation tactics. SNS operators need to develop high-quality content using rich-media technology, and active-response content that attracts opinion leaders on the SNS.

Load Balancing for Distributed Processing of Real-time Spatial Big Data Stream (실시간 공간 빅데이터 스트림 분산 처리를 위한 부하 균형화 방법)

  • Yoon, Susik;Lee, Jae-Gil
    • Journal of KIISE / v.44 no.11 / pp.1209-1218 / 2017
  • A variety of sensors is widely used these days, and it has become much easier to acquire spatial big data streams from various sources. Since spatial data streams have inherently skewed and dynamically changing distributions, the system must effectively distribute the load among workers. Previous studies to solve this load imbalance problem are not directly applicable to processing spatial data. In this research, we propose Adaptive Spatial Key Grouping (ASKG). The main idea of ASKG is, by utilizing the previous distribution of the data streams, to adaptively suggest a new grouping scheme that evenly distributes the future load among workers. We evaluate the validity of the proposed algorithm in various environments, by conducting an experiment with real datasets while varying the number of workers, input rate, and processing overhead. Compared to two other alternative algorithms, ASKG improves the system performance in terms of load imbalance, throughput, and latency.
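
The general idea of using the previous distribution to decide a grouping for future load can be sketched with a greedy assignment. This is not the ASKG algorithm, whose details the abstract does not give; it is a standard largest-load-first heuristic, and the key names and load figures are hypothetical.

```python
import heapq

def assign_keys(key_loads, n_workers):
    """Assign each key to the currently least-loaded worker, heaviest key first."""
    heap = [(0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    assignment = {}
    for key, load in sorted(key_loads.items(), key=lambda kv: kv[1], reverse=True):
        total, worker = heapq.heappop(heap)
        assignment[key] = worker
        heapq.heappush(heap, (total + load, worker))
    return assignment

def worker_loads(key_loads, assignment, n_workers):
    """Total load per worker under a given assignment."""
    loads = [0] * n_workers
    for key, worker in assignment.items():
        loads[worker] += key_loads[key]
    return loads

# Hypothetical tuple counts per spatial cell observed in the previous window.
observed = {"cell_a": 50, "cell_b": 30, "cell_c": 10, "cell_d": 10}
plan = assign_keys(observed, n_workers=2)
print(plan, worker_loads(observed, plan, n_workers=2))
```

An adaptive scheme would recompute such a plan as the observed distribution drifts, weighing the balance gained against the cost of moving keys between workers.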

The Marshall-Olkin generalized gamma distribution

  • Barriga, Gladys D.C.;Cordeiro, Gauss M.;Dey, Dipak K.;Cancho, Vicente G.;Louzada, Francisco;Suzuki, Adriano K.
    • Communications for Statistical Applications and Methods / v.25 no.3 / pp.245-261 / 2018
  • Attempts have been made to define new classes of distributions that provide more flexibility for modelling skewed data in practice. In this work we define a new extension of the generalized gamma distribution (Stacy, The Annals of Mathematical Statistics, 33, 1187-1192, 1962), called the Marshall-Olkin generalized gamma (MOGG) distribution, based on the generator pioneered by Marshall and Olkin (Biometrika, 84, 641-652, 1997). This new lifetime model is very flexible and includes twenty-one special models. The main advantage of the new family is that practitioners will have a quite flexible distribution to fit real data from several fields, such as engineering, hydrology, and survival analysis. Further, we also define a MOGG mixture model, a modification of the MOGG distribution for analyzing lifetime data in the presence of a cure fraction. This proposed model can be seen as a model of competing causes, where the parameter associated with the Marshall-Olkin distribution controls the activation mechanism of the latent risks (Cooner et al., Statistical Methods in Medical Research, 15, 307-324, 2006). The asymptotic properties of the maximum likelihood estimators of the model parameters are evaluated by means of simulation studies. The proposed distribution is fitted to two real data sets, one arising from measuring the strength of fibers and the other from melanoma data.
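
For reference, the Marshall-Olkin generator is usually stated in terms of survival functions. The notation here is an assumption, since the abstract fixes none: $\bar{F}$ denotes the baseline survival function (the generalized gamma in this case), $\bar{G}$ the survival function of the extended family, and $\alpha > 0$ the additional parameter.

```latex
\bar{G}(x) = \frac{\alpha\,\bar{F}(x)}{1 - (1 - \alpha)\,\bar{F}(x)}, \qquad \alpha > 0 .
```

Setting $\alpha = 1$ gives $\bar{G}(x) = \bar{F}(x)$, so the baseline distribution is recovered as a special case.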