• Title/Summary/Keyword: Data imputation

Estimation of Freeway Accident Likelihood using Real-time Traffic Data (실시간 교통자료 기반 고속도로 교통사고 발생 가능성 추정 모형)

  • Park, Joon-Hyung;Oh, Cheol;NamKoong, Seong
    • Journal of Korean Society of Transportation
    • /
    • v.26 no.2
    • /
    • pp.157-166
    • /
    • 2008
  • This study proposes a model that estimates traffic accident likelihood from real-time traffic data obtained from freeway traffic surveillance systems. Traffic variables representing spatio-temporal variations in traffic conditions were used as independent variables. Binary logistic regression models were fitted to relate these traffic variables to accident data collected on the Seohaean freeway over three years, from 2004 to 2006. To obtain more reliable traffic variables, outlier filtering and data imputation were also performed. The model outputs, which are probabilistic measures of accident occurrence, could be used not only in designing warning information systems but also in evaluating the effectiveness of traffic operations strategies in terms of traffic safety.
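
A minimal Python sketch of how a fitted binary logistic regression turns traffic variables into an accident probability. The feature meanings and all coefficient values below are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
import math

def accident_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical inputs: speed variance and upstream/downstream speed difference.
p = accident_probability(features=[4.0, 12.5],
                         coefficients=[0.15, 0.04],
                         intercept=-3.0)
```

The returned value lies strictly between 0 and 1, which is what allows it to be read as a likelihood measure and compared against a warning threshold.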

Implementation of Quality Evaluation, Error Filtering, Imputation for Traffic Missing Data (교통 데이터에 대한 품질 평가 및 자료 처리 기법의 구현)

  • Cheong, Su-Jeong;Song, Soo-Kyung;Lee, Min-Soo;NamGung, Sung
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2007.10c
    • /
    • pp.185-190
    • /
    • 2007
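  • As large volumes of data are produced, data warehouses that can store, manage, and use data efficiently have become important, and developing data processing techniques has accordingly become an essential task. The data processing steps of quality evaluation, error detection, and missing-value imputation are very important because they determine the reliability of the data and increase its usability. This paper proposes data processing techniques for efficient quality evaluation, error detection, and missing-value imputation that reflect actual traffic conditions in Korea and that introduce simpler and clearer evaluation formulas while reducing the error of the evaluation criteria. In addition, a new parameter is introduced into the error-detection criteria so that the requirements of traffic researchers can be reflected. For the imputation step, several techniques were studied, and input variables were added to existing imputation techniques, which were then applied to actual large-scale traffic data. The imputation process was implemented in PL/SQL so that it runs directly against the database in which the traffic data are stored. This gives traffic researchers an environment in which they can perform imputation easily and in various ways and use the results to derive various kinds of traffic information.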

Forecasting the Demand Areas of a Factory Site: Based on a Statistical Model and Sampling Survey (공장용지 수요 추정 모형 개발 및 수요예측)

  • Jeong, Hyeong-Chul;Han, Geun-Shik;Kim, Seong-Yong
    • The Korean Journal of Applied Statistics
    • /
    • v.24 no.3
    • /
    • pp.465-475
    • /
    • 2011
  • This paper considers the problem of estimating the gross area of factory sites in relation to the area of industrial complex land, based on a statistical forecasting model and the results of a sampling survey. Data on the gross area of factory sites are available only for 1981-2003. In 2009, the Korea Industrial Complex Corp. conducted a sampling survey to estimate the total area and to investigate demand for the next five years. In this study, we adopt the sampling survey results and build a statistical growth model for the gross area of factory sites to improve prediction. Three sources of data were used as the basis for the analysis: the factory site areas reported by the Korea National Statistical Office, values imputed by the statistical forecasting model, and the sampling survey results. Combining the three sources through the spline smoothing method yields a new forecast of the area of factory sites.

A Study on the Optimal Cut-off Point in the Cut-off Sampling Method (절사표본에서 최적 절사점에 관한 연구)

  • Lee, Sang Eun;Cho, Min Ji;Shin, Key-Il
    • The Korean Journal of Applied Statistics
    • /
    • v.27 no.3
    • /
    • pp.501-512
    • /
    • 2014
  • Modified cut-off sampling is widely used for highly skewed data. A serious drawback of modified cut-off sampling is the difficulty of adjusting for non-response in the take-all stratum. Solutions to this problem have therefore been studied in various ways, such as sample substitution, imputation, and re-weighting. In this paper, a new cut-off point based on minimizing the MSE under exponential and power functions is suggested; it can reduce the size of the take-all stratum. We also investigate another cut-off point determination method that assumes underlying distributions such as the truncated log-normal and truncated gamma distributions. Finally, we suggest the optimal cut-off point, which gives the smallest take-all stratum size among the suggested methods. Simulation studies are performed, and Labor Survey data and simulated data are used for the case study.

A Study for Traffic Forecasting Using Traffic Statistic Information (교통 통계 정보를 이용한 속도 패턴 예측에 관한 연구)

  • Choi, Bo-Seung;Kang, Hyun-Cheol;Lee, Seong-Keon;Han, Sang-Tae
    • The Korean Journal of Applied Statistics
    • /
    • v.22 no.6
    • /
    • pp.1177-1190
    • /
    • 2009
  • Traffic operating speed is an important piece of information for measuring road capacity. When a navigation system provides information on heavily trafficked roads, offering both current traffic information and forecasts is a key function for giving more accurate expected times and intervals. In this study, we propose a traffic speed forecasting model that uses accumulated speed data for roads and highways, and we forecast the average speed for each road and highway section and each time interval using the Fourier transform and a time series regression model with trigonometric functions. We also propose a method for missing data imputation and outlier treatment to raise the accuracy of the speed forecasts, and a speed grouping method that groups data with similar traffic speed patterns to increase the efficiency of the analysis.
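
A minimal Python sketch of regression on trigonometric terms, the general kind of model the abstract names. It assumes equally spaced observations covering exactly one full period (for example, 24 hourly average speeds); under that assumption the least-squares coefficients reduce to discrete Fourier sums. The function names and the synthetic speed profile are hypothetical and are not the authors' model.

```python
import math

def fit_harmonics(y, n_harmonics=1):
    """Least-squares fit of mean + sum_k (a_k cos + b_k sin) over one full period."""
    n = len(y)
    if n < 2 * n_harmonics + 1:
        raise ValueError("too few observations for the requested harmonics")
    mean_level = sum(y) / n
    coefs = []
    for k in range(1, n_harmonics + 1):
        # Sines and cosines are orthogonal on an equally spaced full period,
        # so each coefficient is a simple projection of the data.
        a = 2.0 / n * sum(v * math.cos(2 * math.pi * k * t / n) for t, v in enumerate(y))
        b = 2.0 / n * sum(v * math.sin(2 * math.pi * k * t / n) for t, v in enumerate(y))
        coefs.append((a, b))
    return mean_level, coefs

def predict(mean_level, coefs, t, n):
    """Evaluate the fitted curve at time index t for period length n."""
    return mean_level + sum(
        a * math.cos(2 * math.pi * k * t / n) + b * math.sin(2 * math.pi * k * t / n)
        for k, (a, b) in enumerate(coefs, start=1)
    )

# Synthetic hourly speed profile (km/h), used only to exercise the functions.
hours = 24
speeds = [60 + 10 * math.cos(2 * math.pi * t / hours) - 5 * math.sin(2 * math.pi * t / hours)
          for t in range(hours)]
level, harmonics = fit_harmonics(speeds, n_harmonics=1)
speed_at_6 = predict(level, harmonics, t=6, n=hours)
```

A fitted curve of this kind can also supply a replacement value for a missing time slot, although with gaps in the data the coefficients would have to come from a general least-squares solve instead of these sums.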

Survival Prognostic Factors of Male Breast Cancer in Southern Iran: a LASSO-Cox Regression Approach

  • Shahraki, Hadi Raeisi;Salehi, Alireza;Zare, Najaf
    • Asian Pacific Journal of Cancer Prevention
    • /
    • v.16 no.15
    • /
    • pp.6773-6777
    • /
    • 2015
  • We used the LASSO-Cox method to determine prognostic factors of male breast cancer survival and showed its superiority over the Cox proportional hazards model in a small-sample setting. The LASSO-Cox method was used to identify the most important factors affecting the survival duration of male breast cancer patients and to estimate their relative hazards accurately. Our data include information on male breast cancer patients in Fars province, southern Iran, from 1989 to 2008. Cox proportional hazards and LASSO-Cox models were fitted for 20 classified variables. To reduce the impact of missing data, multiple imputation was performed 20 times using the Markov chain Monte Carlo method, and the results were combined with Rubin's rules. Among the 50 patients, the mean age at diagnosis was 59.6 (SD=12.8) years, with a minimum of 34 and a maximum of 84 years, and the mean survival time was 62 months. Three-, 5- and 10-year survival rates were 92%, 77% and 26%, respectively. Using the LASSO-Cox method eliminated 8 low-effect variables and decreased the standard error by a factor of 2.5 to 7. The relative efficiency of the LASSO-Cox method compared with the Cox proportional hazards method was calculated as 22.39. The 19-year follow-up of male breast cancer patients shows that age, a history of alcohol use, nipple discharge, laterality, histological grade and duration of symptoms were the most important variables affecting patient survival. In such situations, estimating the coefficients by the LASSO-Cox method is more efficient than using the Cox proportional hazards method.
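
A minimal Python sketch of combining results across imputed data sets with Rubin's rules, as mentioned in the abstract. It pools one coefficient: the point estimate is the mean across imputations, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The numbers are hypothetical and are not results from the study.

```python
from statistics import mean, variance

def pool_rubin(estimates, within_variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)           # pooled point estimate
    u_bar = mean(within_variances)    # average within-imputation variance
    b = variance(estimates)           # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b   # total variance of the pooled estimate
    return q_bar, total

# Hypothetical log hazard ratios and standard errors from m = 4 imputed data sets.
coefs = [0.50, 0.54, 0.46, 0.50]
std_errors = [0.20, 0.20, 0.20, 0.20]
pooled, total_var = pool_rubin(coefs, [s ** 2 for s in std_errors])
```

The square root of the total variance is the pooled standard error; it exceeds the average per-imputation standard error whenever the estimates differ across imputations, which is how the uncertainty due to missing data enters the result.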

Development of Machine Learning Based Precipitation Imputation Method (머신러닝 기반의 강우추정 방법 개발)

  • Heechan Han;Changju Kim;Donghyun Kim
    • Journal of Wetlands Research
    • /
    • v.25 no.3
    • /
    • pp.167-175
    • /
    • 2023
  • Precipitation data are among the essential inputs in fields such as wetland management, hydrological simulation, and water resource management. To manage water resources efficiently using precipitation data, it is essential to secure as much data as possible by minimizing the missing rate. In addition, more efficient hydrological simulation is possible if precipitation data for ungauged areas are secured. However, missing precipitation data have mainly been estimated with statistical equations. The purpose of this study is to propose a new method for restoring missing precipitation data using machine learning algorithms that can predict new data based on correlations between data. Moreover, the applicability of machine learning techniques for restoring missing precipitation data is evaluated against existing statistical methods. Two representative machine learning algorithms, Artificial Neural Network (ANN) and Random Forest (RF), were applied. In classifying the occurrence of precipitation, the RF algorithm achieved higher accuracy than the ANN algorithm. The F1-score and accuracy, which are evaluation indicators of the classification model, were 0.80 and 0.77 for RF and 0.76 and 0.71 for the ANN. RF also estimated precipitation amounts more accurately than the ANN. The RMSE of the RF and ANN algorithms was 2.8 mm/day and 2.9 mm/day, and the values were calculated as 0.68 and 0.73.
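
A minimal Python sketch of the evaluation indicators named in the abstract: accuracy and F1-score for the wet/dry classification, and RMSE for the estimated amounts. The labels and values are made up to exercise the functions and are not data from the study.

```python
import math

def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1-score for binary labels (1 = precipitation, 0 = none)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

def rmse(observed, estimated):
    """Root mean squared error, in the same units as the inputs."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / len(observed))

# Made-up wet/dry labels and daily amounts (mm/day).
acc, f1 = accuracy_and_f1([1, 1, 1, 0, 0, 1, 0, 0], [1, 1, 0, 0, 1, 1, 0, 0])
err = rmse([0.0, 2.0], [0.0, 4.0])
```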

Development of Sample Survey Design for the Industrial Research and Development Statistics (표본조사에 의한 기업 연구개발활동 통계 작성방안)

  • Cho, Seong-Pyo;Park, Sun-Young;Han, Ki-In;Noh, Min-Sun
    • Journal of Technology Innovation
    • /
    • v.17 no.2
    • /
    • pp.1-23
    • /
    • 2009
  • The Survey on Industrial Research and Development (R&D) is the primary source of information on R&D performed by the Korean industrial sector. The results of the survey are used to assess trends in R&D expenditures. Government agencies, corporations, and research organizations use the data to investigate productivity determinants, formulate tax policy, and compare individual company performance with industry averages. To date, the Korea Industrial Technology Association (KOITA) has collected the data by complete enumeration. KOITA is now considering a sample survey because the number of R&D institutions in industry has increased dramatically. This study develops a survey design for industrial R&D statistics based on a sample survey. Companies are divided into 8 groups according to the amount of R&D expenditures and firm size or type. We draw samples from 24 or 8 sampling strata and compare the results with those of the complete enumeration survey. The estimates from 24 sampling strata are not significantly different from the results of the complete enumeration survey. We propose the following survey design: companies are divided into 11 groups, including companies whose R&D expenditures are unknown. All large companies are included in the survey, and medium and small companies are sampled at rates of 70% and 3%, respectively. Simple random sampling (SRS) is applied to the small-company partition because these companies show a uniform distribution of R&D expenditures. An independent probability proportionate to size (PPS) sampling procedure may be applied to companies identified as 'not R&D performers'. When respondents do not provide the requested information, estimates for the missing data are made using imputation algorithms. In future studies, new key variables should be developed for the survey questionnaires.

Comparing Survival Functions with Doubly Interval-Censored Data: An Application to Diabetes Surveyed by Korean Cancer Prevention Study (이중구간중도절단된 생존자료의 생존함수 비교를 위한 검정: 한국인 암 예방연구 중 당뇨병에의 응용)

  • Jee, Sun-Ha;Nam, Chung-Mo;Kim, Jin-Heum
    • The Korean Journal of Applied Statistics
    • /
    • v.22 no.3
    • /
    • pp.595-606
    • /
    • 2009
  • Two tests are introduced for comparing several survival functions with doubly interval-censored data and are illustrated with data from the Korean Cancer Prevention Study (Jee et al., 2005). The proposed test, which extends the test of Kim et al. (2006) to doubly interval-censored data, has an advantage over the test of Sun (2006) in computation time because it depends only on the size of the risk set; unlike Sun's test, it is also applicable to continuous as well as discrete failure time data. The incubation time of diabetes differed markedly between the male and female groups, and survival was longer in the female group than in the male group. Regardless of gender, the difference in survival functions among four age groups was highly significant, with a p-value of less than 0.001. This trend was more pronounced in the female group than in the male group. Simulation results showed that the significance level of both tests was well controlled and that the proposed test was more powerful than Sun's test.

Sensitivity analysis of missing mechanisms for the 19th Korean presidential election poll survey (19대 대선 여론조사에서 무응답 메카니즘의 민감도 분석)

  • Kim, Seongyong;Kwak, Dongho
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.1
    • /
    • pp.29-40
    • /
    • 2019
  • Categorical data with non-responses are frequently observed in election poll surveys and can be represented by incomplete contingency tables. To estimate the support rates of candidates, the missing mechanism should be identified in advance because the estimates for non-respondents change depending on the assumed missing mechanism. However, it has been shown that the missing mechanism cannot be identified from the observed data alone. To overcome this problem, sensitivity analysis has been suggested. The previously proposed sensitivity analysis is applicable only to two-way incomplete contingency tables with binary variables. It is therefore inappropriate for election poll surveys, which usually consider more than two factors, such as region, gender, and age. In this paper, a sensitivity analysis suitable for a multi-dimensional incomplete contingency table is devised and applied to the 19th Korean presidential election poll survey data. As a result, the intervals of estimates from the sensitivity analysis include the actual results as well as the estimates from various missing mechanisms. In addition, the properties of the missing mechanisms that produce estimates nearest to the actual election results are investigated.