• Title/Summary/Keyword: Skewed Data


Load Balancing for Distributed Processing of Real-time Spatial Big Data Stream (실시간 공간 빅데이터 스트림 분산 처리를 위한 부하 균형화 방법)

  • Yoon, Susik;Lee, Jae-Gil
    • Journal of KIISE / v.44 no.11 / pp.1209-1218 / 2017
  • A variety of sensors is widely used these days, and it has become much easier to acquire spatial big data streams from various sources. Since spatial data streams have inherently skewed and dynamically changing distributions, the system must effectively distribute the load among workers. Previous studies to solve this load imbalance problem are not directly applicable to processing spatial data. In this research, we propose Adaptive Spatial Key Grouping (ASKG). The main idea of ASKG is, by utilizing the previous distribution of the data streams, to adaptively suggest a new grouping scheme that evenly distributes the future load among workers. We evaluate the validity of the proposed algorithm in various environments, by conducting an experiment with real datasets while varying the number of workers, input rate, and processing overhead. Compared to two other alternative algorithms, ASKG improves the system performance in terms of load imbalance, throughput, and latency.
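The idea of using the previous distribution to plan the next grouping can be illustrated with a minimal sketch. This is not the ASKG algorithm itself, whose details are not given in the abstract. It is a generic greedy heuristic, and the grid-cell key, the function names, and the `cell_size` parameter are all assumptions made for illustration.

```python
from collections import Counter
import heapq


def cell_of(x, y, cell_size=1.0):
    """Map a point to a grid cell, used here as a hypothetical spatial key."""
    return (int(x // cell_size), int(y // cell_size))


def plan_grouping(prev_points, n_workers, cell_size=1.0):
    """Assign cells to workers based on the previous window's counts.

    Greedy heuristic: take the heaviest cell first and always give it
    to the currently least-loaded worker.
    """
    counts = Counter(cell_of(x, y, cell_size) for x, y in prev_points)
    heap = [(0, w) for w in range(n_workers)]  # (planned load, worker id)
    heapq.heapify(heap)
    assignment = {}
    for cell, load in sorted(counts.items(), key=lambda kv: -kv[1]):
        total, worker = heapq.heappop(heap)
        assignment[cell] = worker
        heapq.heappush(heap, (total + load, worker))
    return assignment


def route(point, assignment, n_workers, cell_size=1.0):
    """Route a new point; cells unseen in the last window fall back to hashing."""
    cell = cell_of(point[0], point[1], cell_size)
    return assignment.get(cell, hash(cell) % n_workers)
```

Re-running `plan_grouping` at the end of each window lets the assignment follow a distribution that changes over time, at the cost of moving state for any cell whose worker changes.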

The Marshall-Olkin generalized gamma distribution

  • Barriga, Gladys D.C.;Cordeiro, Gauss M.;Dey, Dipak K.;Cancho, Vicente G.;Louzada, Francisco;Suzuki, Adriano K.
    • Communications for Statistical Applications and Methods / v.25 no.3 / pp.245-261 / 2018
  • Attempts have been made to define new classes of distributions that provide more flexibility for modelling skewed data in practice. In this work we define a new extension of the generalized gamma distribution (Stacy, The Annals of Mathematical Statistics, 33, 1187-1192, 1962), called the Marshall-Olkin generalized gamma (MOGG) distribution, based on the generator pioneered by Marshall and Olkin (Biometrika, 84, 641-652, 1997). This new lifetime model is very flexible and includes twenty-one special models. The main advantage of the new family is that practitioners will have a quite flexible distribution to fit real data from several fields, such as engineering, hydrology and survival analysis. Further, we also define a MOGG mixture model, a modification of the MOGG distribution for analyzing lifetime data in the presence of a cure fraction. This proposed model can be seen as a model of competing causes, where the parameter associated with the Marshall-Olkin distribution controls the activation mechanism of the latent risks (Cooner et al., Statistical Methods in Medical Research, 15, 307-324, 2006). The asymptotic properties of the maximum likelihood estimators of the model parameters are evaluated by means of simulation studies. The proposed distribution is fitted to two real data sets, one arising from measuring the strength of fibers and the other from melanoma data.
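For reference, the Marshall-Olkin generator cited in this abstract turns a baseline survival function $\bar{F}(x)$ into a new one with an extra parameter $\alpha > 0$. The notation below is generic and is not taken from the paper, whose MOGG parameterization does not appear in the abstract:

$$\bar{G}(x;\alpha) = \frac{\alpha\,\bar{F}(x)}{1-(1-\alpha)\,\bar{F}(x)}, \qquad \alpha > 0.$$

Setting $\alpha = 1$ recovers the baseline, $\bar{G} = \bar{F}$. Taking $\bar{F}$ to be the survival function of a generalized gamma distribution gives a family of the MOGG type.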

Spatial Index based on Main Memory for Web GIS (Web GIS를 위한 주기억 장치 기반 공간 색인)

  • 김진덕;진교홍
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2001.10a / pp.191-194 / 2001
  • The availability of inexpensive, large main memories, coupled with the demand for faster response times, is bringing a new perspective to database technology. Web GIS, which is used by an unspecified number of members of the general public on the Internet, needs fast response times and frequent data retrieval for spatial analysis rather than data updates. Therefore, it is appropriate to use main memory as the underlying storage structure for Web GIS data. In this paper, we propose a data representation method based on relative coordinates and the size of the MBR. The method can compress the spatial data widely used in Web GIS into a smaller volume of memory. We also propose a memory-resident spatial index with a simple mechanism for processing point and region queries. The performance test shows that the index is suitable for managing skewed data in terms of the size of the index and the number of MBR intersection check operations.


Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment (분산 컴퓨팅 환경에서 효율적인 유사 조인 질의 처리를 위한 행렬 기반 필터링 및 부하 분산 알고리즘)

  • Yang, Hyeon-Sik;Jang, Miyoung;Chang, Jae-Woo
    • The Journal of the Korea Contents Association / v.16 no.7 / pp.667-680 / 2016
  • With the development of distributed computing platforms such as Hadoop MapReduce, conventional query processing techniques that were designed for a single machine must now run efficiently in distributed computing environments. In particular, similarity join query processing in distributed computing environments has been studied, where a similarity join retrieves all data pairs with high similarity between two given data sets. However, the existing similarity join query processing schemes for distributed computing environments suffer from a skewed computing load across clusters because they consider only the data transmission cost. In this paper, we propose a matrix-based load-balancing algorithm for efficient similarity join query processing in distributed computing environments. To balance the load uniformly across clusters, the proposed algorithm estimates the expected computing cost using a matrix and generates partitions based on the estimated cost. In addition, it reduces the computing load by filtering out data that are not used in query processing. Finally, our performance evaluation shows that the proposed algorithm outperforms the existing one in query processing performance.

The effect of parameter estimation on $\bar{X}$ charts based on the median run length ($\bar{X}$ 관리도에서 런길이의 중위수에 기초한 모수 추정의 영향)

  • Lee, Yoojin;Lee, Jaeheon
    • Journal of the Korean Data and Information Science Society / v.27 no.6 / pp.1487-1498 / 2016
  • In monitoring a process, in-control process parameters must be estimated from the Phase I data. When we design the control chart based on the estimated process parameters, the control limits are usually chosen to satisfy a specific in-control average run length (ARL). However, as the run length distribution is skewed whether the process is in-control or out-of-control, the median run length (MRL) can be used as an alternative measure to the ARL. In this paper, we evaluate the performance of the Shewhart $\bar{X}$ chart with estimated parameters in terms of the average of the median run length (AMRL) and the standard deviation of the MRL (SDMRL). In a simulation study, the grand sample mean is used as the process mean estimator, and several competing process standard deviation estimators are used to evaluate the in-control performance for various amounts of Phase I data.
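To see why the skewness of the run length matters, consider the simpler known-parameter case, which is not the estimated-parameter setting studied in the paper. With known parameters the in-control run length of a Shewhart $\bar{X}$ chart is geometric with signal probability $p$, so the ARL and MRL have closed forms. The sketch below assumes a normal process and symmetric $k$-sigma limits, and the function names are illustrative.

```python
import math


def false_alarm_prob(k=3.0):
    """P(sample mean outside +/- k sigma limits) for an in-control normal process."""
    # Two-sided tail: 2 * (1 - Phi(k)) = erfc(k / sqrt(2))
    return math.erfc(k / math.sqrt(2.0))


def arl(p):
    """Mean of a geometric run length with signal probability p."""
    return 1.0 / p


def mrl(p):
    """Median of a geometric run length: smallest n with 1 - (1 - p)**n >= 0.5."""
    return math.ceil(math.log(0.5) / math.log(1.0 - p))


p = false_alarm_prob(3.0)
print(f"p = {p:.6f}, ARL = {arl(p):.1f}, MRL = {mrl(p)}")
```

Because the geometric distribution is right-skewed, the median sits well below the mean, which is the motivation for reporting MRL-based metrics alongside the ARL.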

Analysis of health-related quality of life using Beta regression (베타회귀분석 방법을 이용한 건강 관련 삶의 질 자료 분석)

  • Jang, Eun Jin
    • Journal of the Korean Data and Information Science Society / v.28 no.3 / pp.547-557 / 2017
  • Health-related quality of life data are commonly skewed and bounded, with a spike at the perfect health status, and the variance tends to be heteroscedastic. In this study, we developed prediction models for EQ-5D using a linear regression model, a beta regression model, and an extended beta regression model with mean and precision submodels, and compared their predictive accuracy. The extended beta regression model makes it possible to model skewness and differences in dispersion related to covariates. Although the extended beta regression model has higher prediction accuracy than the linear regression model, the overlapping confidence intervals suggested that the extended beta regression model was not clearly superior to the linear regression model. However, the extended beta regression model could explain the heteroscedasticity and predict within the bounded range. Therefore, the extended beta regression model is appropriate for fitting health-related quality of life data such as EQ-5D.

A Study of Library Grouping using Cluster Analysis Methods (군집분석 기법을 이용한 공공도서관 그룹화에 대한 연구)

  • Kwak, Chul Wan
    • Journal of the Korean BIBLIA Society for Library and Information Science / v.31 no.3 / pp.79-99 / 2020
  • The purpose of this study is to investigate cluster analysis models for grouping public libraries and to analyze their characteristics. Statistical data on public libraries from the National Library Statistics System were used, and three cluster analysis models were applied. Cluster analysis was conducted based on the size of public libraries, and the libraries were largely divided into two clusters. The cluster sizes were heavily skewed to one side. For grouping based on size, Ward's method of hierarchical cluster analysis and the k-means cluster analysis model were suitable. Three suggestions were presented as implications for the grouping of public libraries. First, it is necessary to collect library service-related data in addition to statistical data. Second, an analysis model suitable for the data set to be analyzed must be applied. Third, it is necessary to study the possibility of using cluster analysis techniques in various fields other than library grouping.

Uniform Load Distribution Using Sampling-Based Cost Estimation in Parallel Join (병렬 조인에서 샘플링 기반 비용 예측 기법을 이용한 균등 부하 분산)

  • Park, Ung-Gyu
    • The Transactions of the Korea Information Processing Society / v.6 no.6 / pp.1468-1480 / 1999
  • In database systems, join operations are the most complex and time-consuming operations and limit the performance of such systems. Many parallel join algorithms have been proposed for these systems. However, they did not consider data skew, such as attribute value skew (AVS) and join product skew (JPS), and their performance suffers in skewed environments. We propose a framework for uniform load distribution and an efficient parallel join algorithm that uses the framework to handle AVS and JPS. In our algorithm, we estimate the data distributions of the input and output relations of join operations using sampling and evaluate the join cost for the estimated data distributions. Finally, using the histogram equalization method, we distribute data among nodes to achieve good load balancing in the local joining phase. For performance comparison, we present a simulation model of our algorithm and other join algorithms, along with the results of some simulation experiments. The results indicate that our algorithm outperforms the other algorithms in the skewed case.
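The combination of sampling and equalized partitioning can be sketched as follows. This is a generic equal-depth range partitioner offered as an analogy, not the paper's algorithm or cost model. All function names are illustrative, and balancing tuple counts (rather than estimated join cost) is a simplifying assumption.

```python
import bisect
import random


def equal_depth_boundaries(sample, n_nodes):
    """Pick n_nodes - 1 split points so each range holds a similar share of the sample."""
    s = sorted(sample)
    return [s[(i * len(s)) // n_nodes] for i in range(1, n_nodes)]


def node_for(key, boundaries):
    """Return the index of the node whose key range contains `key`."""
    return bisect.bisect_right(boundaries, key)


def loads(keys, boundaries, n_nodes):
    """Count how many keys each node would receive under the given boundaries."""
    counts = [0] * n_nodes
    for k in keys:
        counts[node_for(k, boundaries)] += 1
    return counts


if __name__ == "__main__":
    random.seed(0)
    keys = [random.expovariate(1.0) for _ in range(20000)]  # right-skewed join keys
    bounds = equal_depth_boundaries(random.sample(keys, 2000), 4)
    print(loads(keys, bounds, 4))
```

A range partitioner like this cannot split a single very frequent key value across nodes, so heavy attribute value skew on one value would need additional handling beyond this sketch.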


Estimation of underwater acoustic uncertainty based on the ocean experimental data measured in the East Sea and its application to predict sonar detection probability (동해 해역에서 측정된 해상실험 데이터 기반의 수중음향 불확정성 추정 및 소나 탐지확률 예측)

  • Dae Hyeok Lee;Wonjun Yang;Ji Seop Kim;Hoseok Sul;Jee Woong Choi;Su-Uk Son
    • The Journal of the Acoustical Society of Korea / v.43 no.3 / pp.285-292 / 2024
  • When calculating sonar detection probability, underwater acoustic uncertainty is usually assumed to be normally distributed with a standard deviation of 8 dB to 9 dB. However, due to the variability in experimental areas and ocean environmental conditions, predicting detection performance requires accounting for underwater acoustic uncertainty based on ocean experimental data. In this study, underwater acoustic uncertainty was determined using measured mid-frequency (2.3 kHz, 3 kHz) noise level and transmission loss data collected in the shallow water of the East Sea. After calculating the predictable probability of detection reflecting underwater acoustic uncertainty based on ocean experimental data, we compared it with the conventional detection probability results, as well as with the predictable probability of detection results obtained when the uncertainty follows a Rayleigh distribution and a negatively skewed distribution. As a result, we confirmed that the detection area differs depending on the underwater acoustic uncertainty used.

A Cross-sectional Study of Biochemical Analysis and Assessment of Iron Deficiency by Gestational Age(II) (임신 시기별 생화학적 철분 분석 및 철분 결핍상태에 대한 횡적 조사 연구(II))

  • 유경희
    • Journal of Nutrition and Health / v.32 no.8 / pp.887-896 / 1999
  • The purpose of this research is to assess the hematological and biochemical status and the prevalence of iron deficiency of pregnant women by gestational age, to provide primary data on the iron nutritional status of pregnant women. Pregnant women visiting public health centers in Ulsan participated in the study and were divided into three trimesters by last menstrual period (LMP). Among the iron status indices, hemoglobin (Hgb), hematocrit (Hct) and mean corpuscular volume (MCV) were not statistically different from a normal distribution; however, total iron binding capacity (TIBC) and serum ferritin were skewed to the left, and serum iron and transferrin saturation (TS) were skewed to the right. Hgb was positively correlated with Hct (r=0.93, p<0.001), but TIBC was negatively correlated with all indices. Serum ferritin was also correlated with all indices, especially in the 3rd trimester, but did not reach the 1st trimester level. Mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (RDW), serum iron and TS were not significantly different by trimester; however, when serum iron was adjusted with hematocrit to correct for hemodilution, it decreased significantly in the 2nd trimester. MCV increased in the 2nd trimester and was maintained until late pregnancy, and TIBC continued to increase throughout the trimesters. The prevalence of anemia by CDC (Centers for Disease Control) Hgb criteria (Hgb < 11.0 g/dl in the 1st and 3rd trimesters, Hgb < 10.5 g/dl in the 2nd trimester) was 2.8% in the 1st trimester, 22.5% in the 2nd trimester and 27.1% in the 3rd trimester, and was similar to the prevalence by CDC Hct criteria (Hct < 33% in the 1st and 3rd, Hct < 32% in the 2nd). The prevalence of anemia among all subjects was 32.7% by WHO criteria (Hgb < 11.0 g/dl). Although almost all iron status indices increased in the 3rd trimester, the prevalence of anemia by the different criteria increased throughout the trimesters, so iron nutritional status was considered serious during late pregnancy. However, factors other than iron deficiency, such as infection, inflammation and other nutrient deficiencies, may also play a significant role. To differentiate anemia due mainly to iron deficiency from anemia due to other factors, serum ferritin is among the more useful indices because it is depressed only in iron deficiency. Hgb < 11.0 g/dl and serum ferritin < 12.0 µg/L were suggested by the CDC as the criteria of iron deficiency. Of all subjects, 17.8% were classified as having iron deficiency anemia, 14.9% as anemic from other causes and 21.2% as iron deficient, and only 46.2% had normal iron status.
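The four-way classification reported at the end of this abstract can be written out as a small sketch. The cutoffs are the ones quoted in the abstract (Hgb < 11.0 g/dl, serum ferritin < 12.0 µg/L). The mapping of each combination to a category label is an assumption inferred from the wording, and the function name is illustrative. It is not a clinical tool.

```python
def classify_iron_status(hgb_g_dl, ferritin_ug_l,
                         hgb_cutoff=11.0, ferritin_cutoff=12.0):
    """Classify iron status from hemoglobin (g/dl) and serum ferritin (ug/L).

    Assumed mapping: low Hgb and low ferritin -> iron deficiency anemia,
    low Hgb only -> anemia from other causes, low ferritin only -> iron
    deficiency, neither -> normal.
    """
    anemic = hgb_g_dl < hgb_cutoff
    depleted = ferritin_ug_l < ferritin_cutoff
    if anemic and depleted:
        return "iron deficiency anemia"
    if anemic:
        return "anemia from other causes"
    if depleted:
        return "iron deficiency"
    return "normal"


if __name__ == "__main__":
    print(classify_iron_status(10.2, 8.5))
```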
