• Title/Abstract/Keywords: Model over-fitting

Search results: 151 items (processing time: 0.024 s)

Predictive Optimization Adjusted With Pseudo Data From A Missing Data Imputation Technique

  • 김정우
    • 한국산학기술학회논문지 / Vol. 20, No. 2 / pp. 200-209 / 2019
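  • When predicting future values, a model estimated by minimizing the training error often incurs a large test error. This is the overfitting problem: the estimated model becomes overly complex because it concentrates only on the given data set. Some regularization and resampling methods have been introduced to mitigate this problem and reduce the test error, but these methods are also designed to work only within the given data set. This paper proposes a new optimization method that reduces the test error by converting the test-error minimization problem into a training-error minimization problem. To carry out this conversion, new data, called pseudo data, are added to the given data set, and three types of missing-data imputation techniques are used to generate suitable pseudo data. Linear regression, autoregressive, and ridge regression models are used as the predictive models, and the pseudo-data method is applied to each of them. The pseudo-data-adjusted optimization method is also applied to environmental and financial data as case studies. The results show that the proposed method reduces the test error relative to the original predictive models.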

Determination of Critical Nitrogen Concentration and Dilution Curve for Rice Growth

  • Lee, Byun-Woo;Cui, Ri-Xian;Kim, Min-Ho;Kim, Jun-Hwan;Nam, Hong-Shik
    • 한국작물학회지 / Vol. 47, No. 2 / pp. 127-131 / 2002
  • Critical nitrogen concentration (Nc), defined as the minimum % N in shoots required to maintain the maximum growth rate of top dry weight (W) at any time, was determined for the rice plant. Using two rice varietal groups, japonica varieties and an indica $\times$ japonica variety, "Dasanbyeo", 18 data points fulfilling the statistical criteria for determining Nc were obtained through eight N-fertilization experiments over two years at Suwon (37$^{\circ}$16'N), Korea. The Nc dilution curve for each variety group was obtained by fitting the Nc-W relationship to a power function. However, the critical nitrogen curves for the two variety groups were not statistically different. Thus, an Nc dilution curve was fitted to the Nc data points pooled over the two variety groups and proposed for rice as: Nc = 4.08 where W < 1.73 t ha$^{-1}$, and Nc = 5.197 $W^{-0.425}$ ($R^2$ = 0.964) where 1.73 t ha$^{-1}$ < W < 12 t ha$^{-1}$. The Nc for W < 1.73 t ha$^{-1}$ was estimated as a constant value of 4.08%, the mean of the maximum N concentration under N-limiting conditions and the minimum N concentration under N non-limiting conditions. The model for Nc is applicable to diagnosing nitrogen nutrition status during the rice growth period from emergence to the heading stage. The Nc curve discriminated well between the N-limiting and N non-limiting groups among the 144 data points, regardless of variety, cultural method, and year.
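
The piecewise Nc curve in this abstract can be written as a small function. The sketch below is only an illustration: it assumes the power-law exponent is negative (about -0.425), as a dilution curve requires, and the function names `critical_n` and `nitrogen_status` are hypothetical, not taken from the paper.

```python
def critical_n(w):
    """Critical N concentration (%) for shoot dry weight w in t/ha.

    Coefficients are read from the abstract; the negative sign of the
    exponent is an assumption (Nc must decline as W grows).
    """
    if w <= 0 or w >= 12:
        raise ValueError("the curve is only stated for 0 < W < 12 t/ha")
    if w < 1.73:
        return 4.08  # constant plateau for small plants
    return 5.197 * w ** -0.425  # power-law dilution


def nitrogen_status(w, n_actual):
    """Classify a measurement against the critical curve."""
    if n_actual < critical_n(w):
        return "N-limiting"
    return "N non-limiting"


if __name__ == "__main__":
    for w in (1.0, 2.0, 5.0, 10.0):
        print(w, round(critical_n(w), 3))
```

A measured shoot N concentration below `critical_n(w)` would be read as N-limiting under this sketch; how the paper treats points exactly on the curve is not stated in the abstract.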

Dynamic Nonlinear Prediction Model of Univariate Hydrologic Time Series Using the Support Vector Machine and State-Space Model

  • 권현한;문영일
    • 대한토목학회논문집 / Vol. 26, No. 3B / pp. 279-289 / 2006
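  • Recently, there has been active research on reconstructing low-dimensional nonlinear behavior from hydrologic time series. From this perspective, this study builds a nonlinear prediction model with strong state-space reconstruction capability using the Support Vector Machine (SVM) and applies it to the Great Salt Lake (GSL) volume. SVM is a machine learning method that uses a hypothesis space of linear functions in a high-dimensional feature space induced by a kernel function. In addition, because SVM minimizes a generalized error rather than the mean squared error obtained from the training data, it can optimize nonlinear functions with relatively fewer parameters than existing methods while avoiding overfitting. The applicability of SVM regression was examined using the biweekly volume series of the GSL in the United States. The SVM-based nonlinear prediction model was applied to 2-week (1-step) and 8-week (4-step) forecasts and to iterated prediction (121-step) of the GSL volume. To assess the reliability of the model in predicting extreme events, that is, periods of rapid decrease and increase, the training and prediction periods were separated. The results show, through statistical measures, that SVM could extract the dynamic behavior from a small number of observations in the training data and produce predictions very close to the actual observations. The model is therefore considered applicable to short-term prediction of nonlinear hydrologic time series.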
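
State-space reconstruction and iterated prediction, as mentioned in this abstract, can be sketched independently of the regressor. The code below is a minimal illustration under assumptions: the embedding dimension `dim`, the delay `lag`, and both function names are hypothetical, and the one-step predictor is passed in as a plain callable (an SVM regressor such as scikit-learn's `SVR` could be wrapped to fill that role).

```python
def delay_embed(series, dim, lag=1):
    """Build (state, next value) pairs from a univariate series.

    Each state is [x(t), x(t - lag), ..., x(t - (dim - 1) * lag)].
    """
    span = (dim - 1) * lag
    states, targets = [], []
    for t in range(span, len(series) - 1):
        states.append([series[t - k * lag] for k in range(dim)])
        targets.append(series[t + 1])
    return states, targets


def iterate_forecast(predict_one_step, history, dim, steps, lag=1):
    """Feed each prediction back as input to forecast several steps ahead."""
    buf = list(history)
    forecasts = []
    for _ in range(steps):
        state = [buf[-1 - k * lag] for k in range(dim)]
        nxt = predict_one_step(state)
        forecasts.append(nxt)
        buf.append(nxt)  # the prediction becomes part of the next state
    return forecasts


if __name__ == "__main__":
    # Stand-in predictor (linear extrapolation), used only for illustration.
    def extrapolate(state):
        return 2 * state[0] - state[1]

    print(iterate_forecast(extrapolate, [0, 1, 2], dim=2, steps=3))
```

Because iterated prediction reuses its own outputs, errors can accumulate over many steps, which is one reason to evaluate the training and prediction periods separately.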

Removal of Benzene in Solution by using the Bio-carrier with Dead Bacillus drentensis sp. and Polysulfone

  • 박상희;이민희
    • 한국지하수토양환경학회지:지하수토양환경 / Vol. 18, No. 1 / pp. 46-56 / 2013
  • Laboratory-scale experiments were performed to remove benzene from solution using a bio-carrier composed of dead biomass. An immobilized bio-carrier made of dead Bacillus drentensis sp. and polysulfone was manufactured as the biosorbent. Batch sorption experiments were performed with bio-carriers containing various quantities of biomass, and their removal efficiencies and uptake capacities were calculated. In the batch experiments, 98.0% of the initial benzene (1 mg/L) in 1 liter of solution was removed within 1 hour using 40 g of immobilized bio-carrier containing 5% biomass, and the biosorption reaction reached equilibrium within 2 hours. Benzene removal efficiency increased only slightly (from 99.0 to 99.4 $\pm$ 0.05%) as the temperature increased from 15 to $35^{\circ}C$, suggesting that temperature has little effect on the removal efficiency of the bio-carrier. The removal efficiency varied with the initial benzene concentration in solution, increasing as the initial concentration increased (0.001 to 10 mg/L). More than 99.0% of the benzene was removed from solution when the initial benzene concentration ranged from 1 to 10 mg/L. When the batch experimental data were fitted to Langmuir and Freundlich isotherms, the removal isotherms of benzene were fitted better by the Freundlich model ($r^2$ = 0.9242) than by the Langmuir model ($r^2$ = 0.7453). In the column experiment, the benzene removal efficiency remained above 99.0% until 420 pore volumes of benzene solution (initial benzene concentration: 1 mg/L) had been injected into the column packed with bio-carriers, indicating that the immobilized carrier containing Bacillus drentensis sp. and polysulfone is an outstanding biosorbent for removing benzene from solution.
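
The removal efficiency and uptake capacity named in this abstract follow from a simple mass balance. The sketch below uses the standard textbook definitions and the figures quoted above (1 mg/L, 1 L, 40 g, 98.0% removal); whether the paper normalizes uptake by carrier mass or by biomass mass is not stated, so the per-gram-of-carrier choice here is an assumption. The two isotherm functions show the usual Langmuir and Freundlich forms with hypothetical parameter names.

```python
def removal_efficiency(c0, ce):
    """Percent of solute removed, from initial (c0) and final (ce) mg/L."""
    return 100.0 * (c0 - ce) / c0


def uptake_capacity(c0, ce, volume_l, sorbent_g):
    """Mass of solute sorbed per gram of sorbent (mg/g)."""
    return (c0 - ce) * volume_l / sorbent_g


def langmuir(ce, q_max, k_l):
    """Langmuir isotherm: monolayer sorption with a saturation limit q_max."""
    return q_max * k_l * ce / (1.0 + k_l * ce)


def freundlich(ce, k_f, n):
    """Freundlich isotherm: empirical power law, q = k_f * ce**(1/n)."""
    return k_f * ce ** (1.0 / n)


if __name__ == "__main__":
    c0, ce = 1.0, 0.02  # mg/L; ce corresponds to 98.0% removal
    print(removal_efficiency(c0, ce))
    print(uptake_capacity(c0, ce, volume_l=1.0, sorbent_g=40.0))
```

Under these assumptions, the quoted batch result corresponds to roughly 0.0245 mg of benzene per gram of carrier.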

Estimation of a Nationwide Statistics of Hernia Operation Applying Data Mining Technique to the National Health Insurance Database

  • 강성홍;서숙경;양영자;이애경;배종면
    • Journal of Preventive Medicine and Public Health / Vol. 39, No. 5 / pp. 433-437 / 2006
  • Objectives: The aim of this study was to develop a methodology for estimating a nationwide statistic for hernia operations using the claim database of the Korea Health Insurance Cooperation (KHIC). Methods: According to the insurance claim procedures, the claim database was divided into the electronic data interchange database (EDI_DB) and the sheet database (Paper_DB). Although the EDI_DB has operation and management codes showing the facts and kinds of operations, the Paper_DB does not. Cases of hernia surgery were extracted using the hernia-matched management code in the EDI_DB. To draw potential cases from the Paper_DB, which lacks the code, a predictive model was developed using the data mining technique called SEMMA. The claim sheets of cases whose predicted probability of an operation exceeded the threshold determined from the ROC curve were reviewed to obtain the positive predictive value as an index of the usefulness of the predictive model. Results: In the 2004 claim databases, 14,386 cases submitted through the EDI system had hernia-related management codes. In fitting models with the data mining technique, logistic regression was chosen over the neural network and decision tree methods. From the Paper_DB, 1,019 cases were extracted as potential cases. Direct review of the sheets of the extracted cases showed that the positive predictive value was 95.3%. Conclusions: The results suggest that applying the data mining technique to the KHIC claim database to estimate nationwide surgical statistics would be useful in terms of execution and cost-effectiveness.

A Safety Score Prediction Model in Urban Environment Using Convolutional Neural Network

  • 강현우;강행봉
    • 정보처리학회논문지:소프트웨어 및 데이터공학 / Vol. 5, No. 8 / pp. 393-400 / 2016
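  • Recently, research has been conducted on developing efficient, automatic methods for analyzing urban environments with the help of computer vision and machine learning. Among the many kinds of analysis, urban safety analysis has attracted considerable interest from local communities. To predict safety scores more accurately and to reflect human visual perception, it is necessary to consider global and local information, which are the most important factors in human visual perception. For this purpose, we use a Double-column Convolutional Neural Network (DCNN) composed of a global column and a local column. The global column takes the resized original image as input, and the local column takes a random crop of the original image. We also propose a new training method to address the problem of overfitting to a particular column during training. To compare the performance of our DCNN model, we measured the root mean squared error and correlation for two SVR models and three CNN models. In the comparison experiments, our model showed the best performance, with a root mean squared error of 0.7432 and Pearson/Spearman correlation coefficients of 0.853/0.840.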
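
The three evaluation measures reported in this abstract (root mean squared error, Pearson correlation, Spearman correlation) can be computed as follows. This is a minimal standard-library sketch using the usual definitions; the function names are illustrative, and it is not the evaluation code of the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson measures linear agreement between predicted and true scores, while Spearman only checks that the ordering agrees, which is why the two are often reported together.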

A comparison of deep-learning models to the forecast of the daily solar flare occurrence using various solar images

  • Shin, Seulki;Moon, Yong-Jae;Chu, Hyoungseok
    • 천문학회보 / Vol. 42, No. 2 / pp. 61.1-61.1 / 2017
  • As deep-learning methods have been applied successfully in various fields, they have high potential for space weather forecasting. The convolutional neural network, one of the deep-learning methods, is specialized in image recognition. In this study, we apply the AlexNet architecture, the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, to the forecast of daily solar flare occurrence using the MatConvNet software for MATLAB. Our input images are SOHO/MDI, EIT $195{\AA}$, and $304{\AA}$ images from January 1996 to December 2010, and the outputs are yes or no for flare occurrence. We also consider other inputs that consist of the last two images and their difference image. We select the training dataset from Jan 1996 to Dec 2000 and from Jan 2003 to Dec 2008. The testing dataset is chosen from Jan 2001 to Dec 2002 and from Jan 2009 to Dec 2010 in order to account for the solar cycle effect. Within the training dataset, we randomly select one fifth of the data as a validation dataset to avoid the over-fitting problem. Our model successfully forecasts flare occurrence with a probability of detection (POD) of about 0.90 for common flares (C-, M-, and X-class). While the POD for major flares (M- and X-class) is 0.96, the false alarm rate (FAR) is also relatively high (0.60). We also present several statistical parameters such as the critical success index (CSI) and true skill statistics (TSS). None of the statistical parameters depends strongly on the number of input data sets. Our model can be applied immediately to an automatic forecasting service when image data are available.
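
The verification scores named in this abstract are all derived from a 2x2 contingency table of yes/no forecasts. The sketch below uses the common definitions; the abstract does not say which FAR definition was used, so the false alarm ratio F / (H + F) is an assumption, and the counts in the usage example are hypothetical rather than taken from the study.

```python
def forecast_scores(hits, misses, false_alarms, correct_negatives):
    """Skill scores for a binary (yes/no) event forecast."""
    pod = hits / (hits + misses)                   # probability of detection
    far = false_alarms / (hits + false_alarms)     # false alarm ratio (assumed definition)
    csi = hits / (hits + misses + false_alarms)    # critical success index
    pofd = false_alarms / (false_alarms + correct_negatives)  # probability of false detection
    tss = pod - pofd                               # true skill statistic
    return {"POD": pod, "FAR": far, "CSI": csi, "TSS": tss}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(forecast_scores(hits=9, misses=1, false_alarms=6, correct_negatives=84))
```

A high POD together with a high FAR, as reported for major flares, means the model catches most events but also issues many warnings that do not verify; CSI and TSS summarize that trade-off in a single number.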

Modelling of dissolved oxygen (DO) in a reservoir using artificial neural networks: Amir Kabir Reservoir, Iran

  • Asadollahfardi, Gholamreza;Aria, Shiva Homayoun;Abaei, Mehrdad
    • Advances in environmental research / Vol. 5, No. 3 / pp. 153-167 / 2016
  • We applied multilayer perceptron (MLP) and radial basis function (RBF) neural networks to the upstream and downstream water quality stations of the Karaj Reservoir in Iran. For both neural networks, the inputs were pH, turbidity, temperature, chlorophyll-a, biochemical oxygen demand (BOD), and nitrate, and the output was dissolved oxygen (DO). We used an MLP neural network with two hidden layers: 15 and 33 neurons in the first and second hidden layers, respectively, for the upstream station, and 16 and 21 neurons for the downstream station, which gave the minimum error. In the learning process, 6-fold cross-validation was applied to avoid over-fitting. The best results were obtained from the RBF model, for which the mean bias error (MBE) and root mean squared error (RMSE) were 0.063 and 0.10 for the upstream station and 0.0126 and 0.099 for the downstream station. The coefficient of determination ($R^2$) between the observed and predicted data for the upstream and downstream stations was 0.801 and 0.904, respectively, for the MLP, and 0.962 and 0.97, respectively, for the RBF network. The MLP neural network gave acceptable results; however, the results of the RBF network were more accurate. A sensitivity analysis of the MLP neural network indicated that temperature was the most influential parameter, pH the second, and nitrate the least influential factor in predicting DO concentrations. The results demonstrated the workability and accuracy of the RBF model in predicting DO.
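
The 6-fold cross-validation mentioned in this abstract partitions the samples into six validation folds, each used once while the remaining data train the model. The sketch below shows one plain way to generate such folds and to compute the mean bias error; the function names are hypothetical, the folds are contiguous without shuffling, and the sign convention for MBE (predicted minus observed) is an assumption, since the abstract does not state it.

```python
def k_fold_indices(n_samples, k=6):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


def mean_bias_error(observed, predicted):
    """Average signed error; positive values mean over-prediction."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


if __name__ == "__main__":
    for train, validation in k_fold_indices(13, k=6):
        print(len(train), validation)
```

Averaging the validation error over all folds gives an estimate of out-of-sample error, which is what makes the procedure useful for detecting over-fitting.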

Speech Recognition Performance Improvement using a convergence of GMM Phoneme Unit Parameter and Vocabulary Clustering

  • 오상엽
    • 융합정보논문지 / Vol. 10, No. 8 / pp. 35-39 / 2020
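  • Although DNNs produce fewer errors than conventional speech recognition systems, they are difficult to train in parallel, require a large amount of computation, and need large amounts of data. To address these problems efficiently, this paper estimates phoneme-level GMM parameters from the GMM model parameters and generates phoneme units. To apply these units efficiently, it proposes a method that improves performance by clustering specific vocabulary. For this purpose, vocabulary models were built from three types of word speech databases, and features extracted after noise processing with a Wiener filter were used in the speech recognition experiments. The proposed method achieved a speech recognition rate of 97.9%. Further research is needed to better address the overfitting problem that this study improved upon.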

Rate Capability of LiFePO4 Cathodes and the Shape Engineering of Their Anisotropic Crystallites

  • Alexander, Bobyl;Sang-Cheol, Nam;Jung-Hoon, Song;Alexander, Ivanishchev;Arseni, Ushakov
    • Journal of Electrochemical Science and Technology / Vol. 13, No. 4 / pp. 438-452 / 2022
  • For cuboid and ellipsoid crystallites of LiFePO4 powders, X-ray diffraction (XRD) and microscopic (TEM) studies make it possible to determine the anisotropic parameters of the crystallite size distribution functions. These parameters were used to describe the cathode rate capability within a model that averages the diffusion coefficient D over the length of the crystallite columns along the [010] direction. To test the developed model, a LiFePO4 powder was chosen whose crystallites are close to big cuboids and small ellipsoids. When analyzing the parts of big and small rate capabilities, fitted values of D = 2.1 and 0.3 nm$^2$/s were obtained for cuboids and ellipsoids, respectively. When analyzing the results of cyclic voltammetry using the Randles-Sevcik equation and the total area of the projections of electrode crystallites on their (010) plane, slightly different values were obtained: D = 0.9 ± 0.15 and 0.5 ± 0.15 nm$^2$/s, respectively. We believe that these inconsistencies are quite acceptable, since both methods of determining D have obvious sources of error. However, the developed method has a clearly lower systematic error because it takes into account the actual shape and statistics of the crystallites, and it is also useful for improving the accuracy of the Randles-Sevcik equation. It has also been demonstrated that shape engineering of crystallites, among other tasks, can increase the cathode capacity by 15% by increasing their size correlation coefficients.
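
The Randles-Sevcik equation cited in this abstract relates the cyclic voltammetry peak current to the diffusion coefficient D. The sketch below uses the general textbook form for a reversible process, i_p = 0.4463 n F A C sqrt(n F v D / (R T)), and inverts it for D. It is an illustration only: the function names are hypothetical, every numerical input in the usage example is made up, and the paper's own treatment (including how the electrode area is obtained from crystallite projections) is not reproduced here.

```python
import math

F = 96485.33212   # Faraday constant, C/mol
R = 8.314462618   # gas constant, J/(mol K)


def peak_current(n, area_cm2, conc_mol_cm3, scan_rate_v_s, d_cm2_s, temp_k=298.15):
    """Peak current (A) from the Randles-Sevcik equation."""
    return (0.4463 * n * F * area_cm2 * conc_mol_cm3
            * math.sqrt(n * F * scan_rate_v_s * d_cm2_s / (R * temp_k)))


def diffusion_coefficient(i_peak_a, n, area_cm2, conc_mol_cm3, scan_rate_v_s, temp_k=298.15):
    """Diffusion coefficient (cm^2/s) obtained by inverting the equation."""
    root = i_peak_a / (0.4463 * n * F * area_cm2 * conc_mol_cm3)
    return root ** 2 * R * temp_k / (n * F * scan_rate_v_s)


if __name__ == "__main__":
    d_nm2 = 0.9                  # nm^2/s, the order of magnitude quoted above
    d_cm2 = d_nm2 * 1e-14        # 1 nm^2 = 1e-14 cm^2
    # Hypothetical electrode area, Li concentration, and scan rate.
    ip = peak_current(1, 100.0, 0.0228, 1e-4, d_cm2)
    print(ip, diffusion_coefficient(ip, 1, 100.0, 0.0228, 1e-4) / 1e-14)
```

Because D enters under a square root and the area enters linearly, an error in the assumed electrode area propagates quadratically into D, which is consistent with the abstract's remark that the crystallite projection area matters for accuracy.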