• Title/Summary/Keyword: Skewed Data

Outlier detection and treatment in industrial sampling survey (경제조사에서의 이상치 탐지와 처리방법)

  • Joo, Young Sun;Cho, Gyo-Young
    • Journal of the Korean Data and Information Science Society
    • /
    • v.27 no.1
    • /
    • pp.131-142
    • /
    • 2016
  • Outliers in surveys can have a large effect on estimates of totals. This is especially true in business surveys, where the populations from which samples are drawn are typically skewed. In this paper, we discuss the practical development and implementation of methods to identify and deal with outliers. The detection method is based on the quartile method, and detected outliers are processed in various ways. The study examines two versions of winsorised estimators, each with three different cut-off thresholds. For the simulation study, four types of weight transformation functions were considered.
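
As a rough illustration of the quartile-based detection and winsorisation mentioned in this abstract, the sketch below flags values outside fences built from the interquartile range and pulls them back to the fence. The multiplier `k = 1.5`, the function names and the sample `turnover` values are assumptions made for illustration; the abstract does not state the paper's cut-off thresholds, and its winsorised estimators for survey totals (which involve weights) are not reproduced here.

```python
import statistics


def quartile_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    k = 1.5 is a conventional choice, assumed here for illustration.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorise(values, k=1.5):
    """Clip every value to the quartile fences instead of dropping it."""
    lower, upper = quartile_fences(values, k)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical right-skewed sample with one extreme unit.
turnover = [12, 15, 14, 13, 16, 18, 17, 250]
print(quartile_fences(turnover))
print(winsorise(turnover))
```

Winsorising keeps the extreme unit in the sample but limits its influence, which is why it is a common treatment when a few large businesses dominate a total.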

Development and validation of a non-linear k-ε model for flow over a full-scale building

  • Wright, N.G.;Easom, G.J.;Hoxey, R.J.
    • Wind and Structures
    • /
    • v.4 no.3
    • /
    • pp.177-196
    • /
    • 2001
  • At present the most popular turbulence models used for engineering solutions to flow problems are the $k-{\varepsilon}$ and Reynolds stress models. The shortcomings of these models, which are based on the isotropic eddy viscosity concept and Reynolds averaging, in flow fields of the type found in wind engineering are well documented. In view of these shortcomings this paper presents the implementation of a non-linear model and its evaluation for flow around a building. Tests were undertaken using the classical bluff body shape, a surface mounted cube, with orientations both normal and skewed at $45^{\circ}$ to the incident wind. Full-scale investigations have been undertaken at the Silsoe Research Institute with a 6 m surface mounted cube and a fetch of roughness height equal to 0.01 m. All tests were originally undertaken for a number of turbulence models including the standard, RNG and MMK $k-{\varepsilon}$ models and the differential stress model. The sensitivity of the CFD results to a number of solver parameters was tested. The accuracy of the turbulence model used was deduced by comparison to the full-scale predicted roof and wake recirculation zone lengths. Mean values of the predicted pressure coefficients were used to further validate the turbulence models. Preliminary comparisons have also been made with available published experimental and large eddy simulation data. Initial investigations suggested that a suitable turbulence model should be able to model the anisotropy of turbulent flow, as the Reynolds stress model does, whilst maintaining the ease of use and computational stability of the two-equation models. Therefore development work concentrated on non-linear quadratic and cubic expansions of the Boussinesq eddy viscosity assumption. Comparisons of these with models based on an isotropic assumption are presented along with comparisons with measured data.

A Research on the Impacts of Technology Transfer in Government-sponsored Research to the Growth of Technology Licensees (공공 R&D의 기술이전이 기업의 성장에 미치는 효과 연구)

  • Kim, Junhuck
    • Journal of Korea Technology Innovation Society
    • /
    • v.20 no.4
    • /
    • pp.1159-1191
    • /
    • 2017
  • This study considered technology commercialization as a kind of external R&D for the licensee firm. It then analyzed the industrial characteristics of technology commercialization and the interactions between internal R&D and technology commercialization from the licensee's viewpoint. Data from NTIS (National Science and Technology Information Service) and KED (Korea Enterprise Database) were matched, and 7,645 technology commercializations from 1,980 firms were extracted. OLS and quantile regression were then applied to the extracted data. The impact of technology commercialization on firm growth was concentrated in a few high-tech and medium-high-tech firms. Technology commercialization was effective for growth in one year, while internal R&D was effective for growth in two years. Firm size was an insignificant variable. In an analysis of four selected industries (automobile, electronics, semiconductor, chemistry), the impact was skewed among industries. Although the importance of technology commercialization is widely acknowledged, quantitative analyses like this study are uncommon. Therefore, this study can be useful for tailoring industry-specific solutions for technology commercialization.

Prediction Interval Estimation in Transformed ARMA Models (변환된 자기회귀이동평균 모형에서의 예측구간추정)

  • Cho, Hye-Min;Oh, Sung-Un;Yeo, In-Kwon
    • The Korean Journal of Applied Statistics
    • /
    • v.20 no.3
    • /
    • pp.541-550
    • /
    • 2007
  • One of the main aspects of time series analysis is to forecast future values of a series based on values up to a given time. The prediction interval for future values is usually obtained under the normality assumption. When the assumption is seriously violated, a transformation of the data may permit the valid use of normal theory. We investigate the prediction problem for future values in the original scale when transformations are applied in ARMA models. In this paper, we introduce a methodology based on the Yeo-Johnson transformation to address the problem of skewed data, which are relatively difficult to model in time series analysis. Simulation studies show that the coverage probabilities of the proposed intervals are closer to the nominal level than those of the usual intervals.
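
For readers unfamiliar with the Yeo-Johnson transformation named in this abstract, the following minimal sketch implements the standard transformation and its inverse. Because the transformation is monotone, interval endpoints computed on the transformed scale can be mapped back to the original scale with the inverse. The function names, the value `lam = 0.5` and the example endpoints are assumptions for illustration only; the abstract does not describe how the paper estimates the parameter or fits the ARMA model.

```python
import math


def yeo_johnson(y, lam):
    """Standard Yeo-Johnson transformation of a single value y."""
    if y >= 0:
        return math.log1p(y) if lam == 0 else ((y + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-y)
    return -(((1 - y) ** (2 - lam) - 1) / (2 - lam))


def yeo_johnson_inverse(z, lam):
    """Inverse transformation; maps a transformed value back to the original scale.

    No guard is included for z outside the range of the transformation.
    """
    if z >= 0:
        return math.expm1(z) if lam == 0 else (lam * z + 1) ** (1 / lam) - 1
    if lam == 2:
        return -math.expm1(-z)
    return 1 - (1 - (2 - lam) * z) ** (1 / (2 - lam))


# Hypothetical interval endpoints on the transformed scale, with an assumed lambda.
lam = 0.5
lower_t, upper_t = 1.0, 3.0
interval = (yeo_johnson_inverse(lower_t, lam), yeo_johnson_inverse(upper_t, lam))
print(interval)
```

Unlike the Box-Cox transformation, this family is defined for negative values as well, which is one reason it suits series that are not strictly positive.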

The f0 distribution of Korean speakers in a spontaneous speech corpus

  • Yang, Byunggon
    • Phonetics and Speech Sciences
    • /
    • v.13 no.3
    • /
    • pp.31-37
    • /
    • 2021
  • The fundamental frequency, or f0, is an important acoustic measure in the prosody of human speech. The current study examined the f0 distribution of a corpus of spontaneous speech in order to provide normative data for Korean speakers. The corpus consists of 40 speakers talking freely about their daily activities and their personal views. Praat scripts were created to collect f0 values, and a majority of obvious errors were corrected manually by watching and listening to the f0 contour on a narrow-band spectrogram. Statistical analyses of the f0 distribution were conducted using R. The results showed that the f0 values of all the Korean speakers were right-skewed, with a pointy distribution. The speakers produced spontaneous speech within a frequency range of 274 Hz (from 65 Hz to 339 Hz), excluding statistical outliers. The mode of the total f0 data was 102 Hz. The female f0 range, with a bimodal distribution, appeared wider than that of the male group. Regression analyses based on age and f0 values yielded negligible R-squared values. As the mode of an individual speaker could be predicted from the median, either the median or mode could serve as a good reference for the individual f0 range. Finally, an analysis of the continuous f0 points of intonational phrases revealed that the initial and final segments of the phrases yielded several f0 measurement errors. From these results, we conclude that an examination of a spontaneous speech corpus can provide linguists with useful measures to generalize acoustic properties of f0 variability in a language by an individual or groups. Further studies on the use of statistical measures to secure reliable f0 values of individual speakers would be desirable.

Numerical analysis of unsteady hydrodynamic performance of pump-jet propulsor in oblique flow

  • Qiu, Chengcheng;Pan, Guang;Huang, Qiaogao;Shi, Yao
    • International Journal of Naval Architecture and Ocean Engineering
    • /
    • v.12 no.1
    • /
    • pp.102-115
    • /
    • 2020
  • In this study, the SST k-ω turbulence model and sliding mesh technology based on the RANS method were adopted to simulate the exciting force and hydrodynamic performance of a pump-jet propulsor at different oblique inflow angles (0°, 10°, 20°, 30°) and different advance ratios (J = 0.95, J = 1.18, J = 1.58). A fully structured grid and a full channel model were adopted to improve computational accuracy. The classical skewed marine propeller E779A was simulated at different advance ratios to verify the accuracy of the numerical simulation method, and grid independence was verified. The time-domain data of the pump-jet propulsor exciting force, including bearing force and fluctuating pressure under different working conditions, were monitored and then converted to frequency-domain data by fast Fourier transform (FFT). The variation of bearing force and fluctuating pressure with advance ratio and oblique flow angle is presented, along with the influence of oblique flow angle and advance ratio on the peak pulsating pressure. The results show that the exciting force increases with the advance ratio, and that the closer a location is to the rotor domain and to the blade tips, the greater the variation of the pulsating pressure. At the same time, the exciting force decreases as the oblique flow angle increases. The vertical and transverse forces change more noticeably, which is the main cause of the exciting force. In addition, the pressure and velocity distributions at the rotor blade tips at different oblique flow angles were investigated.

Geometric and Semantic Improvement for Unbiased Scene Graph Generation

  • Ruhui Zhang;Pengcheng Xu;Kang Kang;You Yang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.10
    • /
    • pp.2643-2657
    • /
    • 2023
  • Scene graphs are structured representations that can clearly convey objects and the relationships between them, but are often heavily biased due to the highly skewed, long-tailed relational labeling in the dataset. Indeed, the visual world itself and its descriptions are biased. Therefore, Unbiased Scene Graph Generation (USGG) prefers to train models to eliminate long-tail effects as much as possible, rather than altering the dataset directly. To this end, we propose Geometric and Semantic Improvement (GSI) for USGG to mitigate this issue. First, to fully exploit the feature information in the images, geometric dimension and semantic dimension enhancement modules are designed. The geometric module is designed from the perspective that the position information between neighboring object pairs will affect each other, which can improve the recall rate of the overall relationship in the dataset. The semantic module further processes the embedded word vector, which can enhance the acquisition of semantic information. Then, to improve the recall rate of the tail data, the Class Balanced Seesaw Loss (CBSLoss) is designed for the tail data. The recall rate of the prediction is improved by penalizing the body or tail relations that are judged incorrectly in the dataset. The experimental findings demonstrate that the GSI method performs better than mainstream models in terms of the mean Recall@K (mR@K) metric in three tasks. The long-tailed imbalance in the Visual Genome 150 (VG150) dataset is addressed better using the GSI method than by most of the existing methods.

Study on Imputation Methods of Missing Real-Time Traffic Data (실시간 누락 교통자료의 대체기법에 관한 연구)

  • Jang Jin-hwan;Ryu Seung-ki;Moon Hak-yong;Byun Sang-cheal
    • The Journal of The Korea Institute of Intelligent Transport Systems
    • /
    • v.3 no.1 s.4
    • /
    • pp.45-52
    • /
    • 2004
  • Many cities are installing ITS (Intelligent Transportation Systems) and running a TMC (Traffic Management Center) to improve the mobility and safety of roadway transportation by providing roadway information to drivers. ITS includes many devices that collect real-time traffic data, from which much valuable traffic data can be obtained. However, missing traffic data cannot be avoided, for reasons such as roadway conditions, adverse weather, communication shutdowns and problems with the devices themselves. Missing data prevents secondary processing such as travel time forecasting and other transportation-related research. If such traffic data are used to produce AADT and DHV, which are essential in roadway planning and design, the results may be skewed and could cause large losses. Therefore, this study explored imputation techniques for missing traffic volume data, namely heuristic methods, a regression model, the EM algorithm and time-series analysis, and evaluated them using indices such as MAPE, RMSE and the inequality coefficient. The best result came from the time-series model, which yielded 5.0%, 0.03 and 110 for MAPE, inequality coefficient and RMSE, respectively. The other techniques produced slightly different results, but these were also very encouraging.
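
The evaluation indices named in this abstract can be sketched as follows. MAPE and RMSE follow their usual definitions; the "inequality coefficient" is assumed here to be Theil's inequality coefficient, which the abstract does not confirm. The function names and the small `observed` and `imputed` volume lists are hypothetical and are not data from the paper.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def theil_inequality(actual, predicted):
    """Theil's inequality coefficient (assumed meaning of 'inequality coefficient').

    Ranges from 0 (perfect agreement) to 1.
    """
    n = len(actual)
    scale = math.sqrt(sum(a * a for a in actual) / n) + math.sqrt(sum(p * p for p in predicted) / n)
    return rmse(actual, predicted) / scale


# Hypothetical hourly traffic volumes: held-out observations vs. imputed values.
observed = [100, 200, 400]
imputed = [110, 190, 400]
print(mape(observed, imputed), rmse(observed, imputed), theil_inequality(observed, imputed))
```

MAPE and the inequality coefficient are scale-free, while RMSE is expressed in vehicles, so the three are usually read together rather than compared directly.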

A Study on the Optimization of Suwon City Bus Route using GWR Model (GWR모델 이용한 수원시 일반버스노선 최적화에 관한 연구)

  • Park, Cheol Gyu;Cho, Seong Kil
    • Journal of Korean Society for Geospatial Information Science
    • /
    • v.22 no.1
    • /
    • pp.41-46
    • /
    • 2014
  • Bus service is easily adjusted to accommodate changes in demand. Despite this flexibility, route relocation must overcome the following problems. First, bus line rearrangement should consider the balance between demand and supply in order to enhance transit equity among users scattered across areas where supply and demand are imbalanced. Second, existing demand analysis is too crude, because demand has been analysed by TAZ, mainly based on the Dong unit. Using GWR and GIS-T data can resolve these problems. In this paper, the limitation of the conventional transit demand analysis model is overcome by deploying the GWR model, which identifies transit demand based on the geographic relation between the service locations and those of the users. The GWR model considers the spatial effect of bus demand according to the distance to each bus stop, using SCD (Smart Card Data) and BIS (Bus Information System) data. The resulting demand map was then superimposed on the existing bus routes, which identified the areas where the balance between demand and supply is severely skewed. Since the analysis was computed with SCD and BIS data at every bus stop, the shortage and surplus of bus service across the entire study area could be computed. Further, based on this computational result and the bus service capacity data for the entire area, the optimization of bus routes from oversupplied areas to undersupplied areas was illustrated. This study thus clearly shows the benefits of GIS.

Predicting Harvest Maturity of the 'Fuji' Apple using a Beta Distribution Phenology Model based on Temperature (온도기반의 Beta Distribution Model 을 이용한 후지 사과의 성숙기 예측)

  • Choi, In-Tae;Shim, Kyo-Moon;Kim, Yong-Seok;Jung, Myung-Pyo
    • Journal of Environmental Science International
    • /
    • v.26 no.11
    • /
    • pp.1247-1253
    • /
    • 2017
  • The Fuji variety of apple, introduced in Japan, has excellent storage quality and good taste, such that it is the most commonly cultivated apple variety in Gunwi County, North Gyeongsang Province, Korean Peninsula. Accurate prediction of harvest maturity allows farmers to manage their farms more efficiently in important aspects such as working time, fruit storage, market shipment, and labor distribution. Temperature is one of the most important factors that determine plant growth, development, and yield. This paper reports on the beta distribution (function) model that can be used to simulate the phenological response of plants to temperature. The beta function, commonly used as a skewed probability density in statistics, was introduced to estimate apple harvest maturity as a function of temperature in this study. The model parameters were daily maximum temperature, daily optimum temperature, and maximum growth rate. They were estimated from the input data of daily maximum and minimum temperature and apple harvest maturity. The difference between observed and predicted maturity day from 2009 to 2012, with optimal parameters, ranged from two days earlier to one day later.