• Title/Summary/Keyword: K-means algorithm (K 평균 알고리즘)

Ensemble Learning with Support Vector Machines for Bond Rating (회사채 신용등급 예측을 위한 SVM 앙상블학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems
    • /
    • v.18 no.2
    • /
    • pp.29-45
    • /
    • 2012
  • Bond rating is regarded as an important event for measuring the financial risk of companies and for determining investors' returns. As a result, predicting companies' credit ratings with statistical and machine learning techniques has been a popular research topic. Statistical techniques, including multiple regression, multiple discriminant analysis (MDA), logistic models (LOGIT), and probit analysis, have traditionally been used in bond rating. However, one major drawback is that they rest on strict assumptions: linearity, normality, independence among predictor variables, and pre-existing functional forms relating the criterion variables and the predictor variables. These strict assumptions have limited the application of traditional statistics to the real world. Machine learning techniques used in bond rating prediction models include decision trees (DT), neural networks (NN), and the Support Vector Machine (SVM). In particular, SVM is recognized as a new and promising classification and regression analysis method. SVM learns a separating hyperplane that maximizes the margin between two categories. It is simple enough to be analyzed mathematically and achieves high performance in practical applications. SVM implements the structural risk minimization principle and seeks to minimize an upper bound of the generalization error. In addition, the solution of SVM may be a global optimum, and thus overfitting is unlikely to occur with SVM. Moreover, SVM does not require many training samples, since it builds prediction models using only the representative samples near the boundaries, called support vectors. A number of experimental studies have indicated that SVM has been successfully applied in a variety of pattern recognition fields. However, there are three major drawbacks that can degrade SVM's performance.
First, SVM was originally proposed for solving binary classification problems. Methods for combining SVMs for multi-class classification, such as One-Against-One and One-Against-All, have been proposed, but they do not improve performance in multi-class problems as much as SVM does in binary classification. Second, approximation algorithms (e.g., decomposition methods and the sequential minimal optimization algorithm) can be used to reduce computation time in multi-class problems, but they can deteriorate classification performance. Third, multi-class prediction is made difficult by the data imbalance problem, which occurs when the number of instances in one class greatly outnumbers the number of instances in another class. Such data sets often cause a default classifier to be built due to a skewed boundary, which reduces classification accuracy. SVM ensemble learning is one machine learning approach for coping with these drawbacks. Ensemble learning is a method for improving the performance of classification and prediction algorithms. AdaBoost is one of the most widely used ensemble learning techniques. It constructs a composite classifier by sequentially training classifiers while increasing the weight on misclassified observations through iterations. Observations that are incorrectly predicted by previous classifiers are chosen more often than those that are correctly predicted. Boosting thus attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. In this way, it can reinforce the training of the misclassified observations of the minority class. This paper proposes a multiclass Geometric Mean-based Boosting (MGM-Boost) to resolve the multiclass prediction problem.
Since MGM-Boost introduces the notion of geometric mean into AdaBoost, its learning process can take into account the geometric mean-based accuracy and errors across multiple classes. This study applies MGM-Boost to a real-world bond rating case for Korean companies to examine its feasibility. 10-fold cross validation is performed three times with different random seeds to ensure that the comparison among the three classifiers does not happen by chance. For each 10-fold cross validation, the entire data set is first partitioned into ten equal-sized sets, and then each set is in turn used as the test set while the classifier trains on the other nine sets. That is, the cross-validated folds are tested independently for each algorithm. Through these steps, we obtained results for the classifiers on each of the 30 experiments. In terms of arithmetic mean-based prediction accuracy, MGM-Boost (52.95%) shows higher prediction accuracy than both AdaBoost (51.69%) and SVM (49.47%). MGM-Boost (28.12%) also shows higher prediction accuracy than AdaBoost (24.65%) and SVM (15.42%) in terms of geometric mean-based prediction accuracy. A t-test is used to examine whether the performance of each classifier over the 30 folds is significantly different. The results indicate that the performance of MGM-Boost is significantly different from that of the AdaBoost and SVM classifiers at the 1% level. These results mean that MGM-Boost can provide robust and stable solutions to multi-class problems such as bond rating.
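
The gap between arithmetic and geometric mean-based accuracy reported in the abstract above can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formula, so this assumes the common definition in which geometric mean accuracy is the geometric mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import prod


def per_class_recall(y_true, y_pred):
    """Return {class label: fraction of that class predicted correctly}."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def arithmetic_accuracy(y_true, y_pred):
    """Overall fraction of correct predictions."""
    hits = sum(truth == guess for truth, guess in zip(y_true, y_pred))
    return hits / len(y_true)


def geometric_mean_accuracy(y_true, y_pred):
    """Geometric mean of the per-class recalls (assumed definition)."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return prod(recalls) ** (1.0 / len(recalls))


# Hypothetical, imbalanced ratings: eight "A" bonds and two "B" bonds.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 9 + ["B"]  # one of the two "B" bonds is misclassified
print(arithmetic_accuracy(y_true, y_pred))      # 9 of 10 correct
print(geometric_mean_accuracy(y_true, y_pred))  # pulled down by class "B"
```

Under this definition, a classifier that ignores a minority class entirely scores zero on the geometric mean regardless of its overall accuracy, which is one reason the measure is used for imbalanced multi-class problems.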

Development of an Offline Based Internal Organ Motion Verification System during Treatment Using Sequential Cine EPID Images (연속촬영 전자조사 문 영상을 이용한 오프라인 기반 치료 중 내부 장기 움직임 확인 시스템의 개발)

  • Ju, Sang-Gyu;Hong, Chae-Seon;Huh, Woong;Kim, Min-Kyu;Han, Young-Yih;Shin, Eun-Hyuk;Shin, Jung-Suk;Kim, Jing-Sung;Park, Hee-Chul;Ahn, Sung-Hwan;Lim, Do-Hoon;Choi, Doo-Ho
    • Progress in Medical Physics
    • /
    • v.23 no.2
    • /
    • pp.91-98
    • /
    • 2012
  • Verification of internal organ motion during treatment, and feedback based on it, is essential for accurate dose delivery to a moving target. We developed an offline internal organ motion verification system (IMVS) using cine EPID images and evaluated its accuracy and availability in a phantom study. To verify organ motion from live cine EPID images, the self-developed analysis software uses a pattern matching algorithm with an internal surrogate, such as the diaphragm, that is clearly distinguishable and represents organ motion in the treatment field. For the system performance test, we developed a linear motion phantom consisting of a human-body-shaped phantom with a simulated tumor in the lung, a linear motion cart, and control software. The phantom was operated with a motion of 2 cm at 4 sec per cycle, and cine EPID images were obtained at rates of 3.3 and 6.6 frames per sec (2 MU/frame) with $1,024{\times}768$ pixels on a linear accelerator (10 MVX). Motion of the target was tracked using the self-developed analysis software. To evaluate accuracy, the results were compared with the planned data of the motion phantom and with data from a video-image-based tracking system (RPM, Varian, USA) that uses an external surrogate. For quantitative analysis, we analyzed the correlation between the two data sets in terms of average cycle (peak to peak), amplitude, and pattern (RMS, root mean square) of motion. The average cycles of motion were $3.98{\pm}0.11$ (IMVS 3.3 fps), $4.005{\pm}0.001$ (IMVS 6.6 fps), and $3.95{\pm}0.02$ (RPM), in good agreement with the true value (4 sec/cycle). The average amplitudes of motion tracked by our system were $1.85{\pm}0.02$ cm (3.3 fps) and $1.94{\pm}0.02$ cm (6.6 fps), differing from the actual value (2 cm) by 0.15 cm (7.5% error) and 0.06 cm (3% error), respectively, due to the time resolution of image acquisition.
In the analysis of the motion pattern, the RMS value from the cine EPID images at 3.3 fps (0.1044) was slightly larger than that at 6.6 fps (0.0480). In this preliminary phantom study, the organ motion verification system using sequential cine EPID images with an internal surrogate represented the motion well, within 3% error. The system can be implemented for clinical purposes, including organ motion verification during treatment, comparison with 4D treatment planning data, and feedback for accurate dose delivery to the moving target.
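
The effect of frame rate on the measured amplitude can be sketched with an idealized model. The Python example below assumes a purely sinusoidal displacement with the phantom's stated 2 cm peak-to-peak motion and 4 sec cycle, sampled instantaneously at each frame. It is not the authors' analysis software, and the function names are hypothetical. Because the frame nearest to a peak can miss it by at most half a frame interval, the per-cycle peak-to-peak value is bounded below by the true value times cos(pi / (fps x period)).

```python
import math


def sample_motion(amplitude_cm, period_s, fps, duration_s):
    """Sample an ideal sinusoidal displacement once per frame."""
    n_frames = int(duration_s * fps)
    return [
        amplitude_cm * math.sin(2 * math.pi * (i / fps) / period_s)
        for i in range(n_frames)
    ]


def peak_to_peak(samples):
    """Peak-to-peak excursion seen in the sampled positions."""
    return max(samples) - min(samples)


def worst_case_peak_to_peak(true_pp_cm, period_s, fps):
    """Lower bound on the per-cycle peak-to-peak value under ideal sampling."""
    return true_pp_cm * math.cos(math.pi / (fps * period_s))


for fps in (3.3, 6.6):
    # One 4 sec cycle of 2 cm peak-to-peak motion (amplitude 1 cm).
    measured = peak_to_peak(sample_motion(1.0, 4.0, fps, 4.0))
    bound = worst_case_peak_to_peak(2.0, 4.0, fps)
    print(fps, round(measured, 3), round(bound, 3))
```

In this idealized model the sampling loss stays under about 3% at 3.3 fps and under about 1% at 6.6 fps, which is smaller than the errors reported in the abstract, so the sketch illustrates only the direction of the frame-rate effect.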

A Definition of Korean Heat Waves and Their Spatio-temporal Patterns (우리나라에 적합한 열파의 정의와 그 시.공간적 발생패턴)

  • Choi, Gwang-Yong
    • Journal of the Korean Geographical Society
    • /
    • v.41 no.5 s.116
    • /
    • pp.527-544
    • /
    • 2006
  • This study provides a definition of heat waves, which indicate conditions of strong sultriness in summer, appropriate to Korea and clarifies the long-term (1973-2006) average spatial and temporal patterns of the annual frequency of heat waves with respect to their intensity. Based on an examination of changes in the Korean mortality rate as apparent temperature increases under hot and humid summer conditions, three consecutive days with a daily maximum Heat Index of at least $32.5^{\circ}C$, $35.5^{\circ}C$, $38.5^{\circ}C$, and $41.5^{\circ}C$ are defined as the Hot Spell (HS), the Heat Wave (HW), the Strong Heat Wave (SHW), and the Extreme Heat Wave (EHW), respectively. The annual frequency of all categories of heat waves is relatively low in high-elevation regions and on islands adjacent to seas. In contrast, both the maximum annual frequency of heat waves during the study period and the annual average frequency are highest in interior, low-elevation regions along major rivers in South Korea, particularly during the Changma Break period (between late July and mid-August). There is no obvious increasing or decreasing trend in the annual total frequency of any category of heat waves over the study period. However, the maximum annual frequencies of HS days at each weather station were recorded mainly in the 1970s, while most of the maximum frequency records of both the HW and the SHW at individual weather stations were observed in the 1990s. The study also reveals that when heat waves occur in South Korea, high humidity as well as high temperature contributes to increasing the heat wave intensity, by $4.3-9.5^{\circ}C$. These results provide a useful basis for developing a heat wave warning system appropriate to Korea.
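
The threshold definition above lends itself to a short sketch. The Python example below counts, for each category, the runs of at least three consecutive days whose daily maximum Heat Index meets the threshold given in the abstract. Whether the paper counts events or individual days is not stated, so counting runs is an assumption here, and the function name and sample series are hypothetical.

```python
# Thresholds in degrees C of daily maximum Heat Index, taken from the abstract.
THRESHOLDS = [("HS", 32.5), ("HW", 35.5), ("SHW", 38.5), ("EHW", 41.5)]
MIN_RUN_DAYS = 3  # "three consecutive days" in the definition


def count_heat_wave_runs(daily_max_heat_index):
    """Count runs of >= MIN_RUN_DAYS consecutive days per category.

    Categories are nested: a run that qualifies as HW also counts as HS.
    """
    counts = {}
    for name, threshold in THRESHOLDS:
        run_length = 0
        runs = 0
        for value in daily_max_heat_index:
            if value >= threshold:
                run_length += 1
                if run_length == MIN_RUN_DAYS:
                    runs += 1  # count each run once, when it first qualifies
            else:
                run_length = 0
        counts[name] = runs
    return counts


# Hypothetical daily maximum Heat Index values for nine days.
series = [31.0, 33.0, 34.0, 36.0, 36.5, 35.6, 30.0, 33.0, 33.0]
print(count_heat_wave_runs(series))
```

For this series, the five-day stretch at or above 32.5 forms one HS run, and its last three days form one HW run; the final two warm days are too short to count.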

A Simulation-Based Investigation of an Advanced Traveler Information System with V2V in Urban Network (시뮬레이션기법을 통한 차량 간 통신을 이용한 첨단교통정보시스템의 효과 분석 (도시 도로망을 중심으로))

  • Kim, Hoe-Kyoung
    • Journal of Korean Society of Transportation
    • /
    • v.29 no.5
    • /
    • pp.121-138
    • /
    • 2011
  • Increasingly affordable and available cutting-edge technologies (e.g., wireless vehicle communication) are regarded as a possible alternative to fixed infrastructure-based traffic information systems, which require expensive infrastructure investments and are mostly implemented on uninterrupted freeway networks with limited spatial expansion. This paper develops a decentralized advanced traveler information system (ATIS) using vehicle-to-vehicle (V2V) communication, whose performance (drivers' travel time savings) is enhanced by three complementary functions (an autonomous automatic incident detection algorithm, a reliable sample size function, and a driver behavior model). The system is evaluated on a typical $6{\times}6$ urban grid network under a non-recurrent traffic state (a traffic incident) with varying key parameters (traffic flow, communication radio range, and penetration ratio), using an off-the-shelf microscopic simulation model (VISSIM) under an ideal vehicle communication environment. Simulation outputs indicate that as the three key parameters increase, more participating vehicles are involved in traffic data propagation, in fewer communication groups and at a faster data dissemination speed. Participating vehicles also saved travel time by dynamically updating traffic states and searching for new routes. In terms of the travel time difference of (instant) re-routing vehicles, lower traffic flow cases saved more time than higher traffic flow cases. This is because a relatively small number of vehicles in the 300 vph case re-route during the most system-efficient time period (early in the traffic incident), whereas more vehicles in the 514 vph case re-route during less system-efficient periods, even after the incident is resolved.
In addition, re-routing on the network-entering links normally saved more travel time than re-routing anywhere else inside the network, except where the direct effect of the traffic incident triggers vehicle re-routing during the effective incident period, in which case the location and direction of the incident link determine the spatial distribution of re-routing vehicles.

An accuracy analysis of Cyberknife tumor tracking radiotherapy according to unpredictable change of respiration (예측 불가능한 호흡 변화에 따른 사이버나이프 종양 추적 방사선 치료의 정확도 분석)

  • Seo, jung min;Lee, chang yeol;Huh, hyun do;Kim, wan sun
    • The Journal of Korean Society for Radiation Therapy
    • /
    • v.27 no.2
    • /
    • pp.157-166
    • /
    • 2015
  • Purpose : The Cyber-Knife tumor tracking system builds a correlation model between the position of a tumor, which moves with respiration, and a real-time respiratory signal obtained from LED markers attached to the outside of the patient. It predicts the tumor location in advance and moves the treatment device in synchrony with the tumor so that the tumor is tracked and treated in real time. This study evaluated the accuracy of Cyber-Knife tumor tracking radiotherapy under sudden, unpredictable changes in breathing pattern caused by coughing or sleep. Materials and Methods : The breathing log files used in the study were reconstructed, from the log files of patients who had received respiratory gating radiotherapy and Cyber-Knife tracking radiosurgery, into a sinusoidal pattern and a sudden-change pattern. To drive the Cyber-Knife dynamic chest phantom with the reconstructed respiratory log files, an additional driving apparatus was built for the existing phantom, and a program was developed to apply the breathing patterns to the phantom. The target inside the phantom (ball cube target) was moved in the superior-inferior direction with three displacement sizes: 5 mm, 10 mm, and 20 mm. Two EBT3 films were inserted crosswise into the target inside the phantom, and the End-to-End (E2E) test provided by the Cyber-Knife manufacturer was carried out five times for each breathing pattern and target motion size. The accuracy of the tumor tracking system was expressed as the target error obtained by analyzing the inserted films, and the correlation error was additionally measured while the E2E test was in progress.
Results : For the sinusoidal breathing pattern, the mean target error was $1.14{\pm}0.13mm$, $1.05{\pm}0.20mm$, and $2.37{\pm}0.17mm$ for target motion sizes of 5 mm, 10 mm, and 20 mm, respectively; for the sudden-change breathing pattern, it was $1.87{\pm}0.19mm$, $2.15{\pm}0.21mm$, and $2.44{\pm}0.26mm$. The correlation error, defined as the length of the displacement vector in target tracking, averaged $0.84{\pm}0.01mm$, $0.70{\pm}0.13mm$, and $1.63{\pm}0.10mm$ for the sinusoidal breathing pattern at 5 mm, 10 mm, and 20 mm, respectively, and $0.97{\pm}0.06mm$, $1.44{\pm}0.11mm$, and $1.98{\pm}0.10mm$ for the sudden-change breathing pattern. In both breathing patterns, a larger correlation error was accompanied by a larger target error. For the sinusoidal breathing pattern with a target motion size of 20 mm or more, both error values were at or above 1.5 mm, the value recommended by the Cyber-Knife manufacturer. Conclusion : Both the correlation error and the target error tended to increase as the target motion became larger, and the errors were larger for sudden breathing changes than for sinusoidal breathing. The accuracy of the tumor tracking system can therefore be judged to decrease as the motion becomes larger and as the breathing departs from a regular sinusoidal shape. With the algorithm of the Cyber-Knife tumor tracking system, when a sudden, unpredictable respiratory change occurs during treatment, for example because the patient coughs, the treatment should be stopped, the internal target validation process should be carried out again, and the breathing pattern should be readjusted.
Treatment accuracy is expected to improve if patients under treatment wear a goggle monitor that displays their own breathing pattern, which would help induce regular breathing.

Overview of Research Trends in Estimation of Forest Carbon Stocks Based on Remote Sensing and GIS (원격탐사와 GIS 기반의 산림탄소저장량 추정에 관한 주요국 연구동향 개관)

  • Kim, Kyoung-Min;Lee, Jung-Bin;Kim, Eun-Sook;Park, Hyun-Ju;Roh, Young-Hee;Lee, Seung-Ho;Park, Key-Ho;Shin, Hyu-Seok
    • Journal of the Korean Association of Geographic Information Studies
    • /
    • v.14 no.3
    • /
    • pp.236-256
    • /
    • 2011
  • Change in forest carbon stocks due to land use change is important data required by the UNFCCC (United Nations Framework Convention on Climate Change). Spatially explicit estimation of forest carbon stocks based on IPCC GPG (Intergovernmental Panel on Climate Change Good Practice Guidance) tier 3 gives high reliability. However, the current estimate, which is aggregated from NFI data, does not provide detailed forest carbon stocks by polygon or cell. To improve estimation, remote sensing and GIS have been used, especially in Europe and North America. We divided research trends in the main countries into four categories: remote sensing, GIS, geostatistics, and environmental modeling considering spatial heterogeneity. The easiest approach to apply is combining NFI data with a GIS-based forest type map. Considering the especially complicated forest structure of Korea, geostatistics is useful for estimating local variation in forest carbon. In addition, fine-scale imagery is well suited to verifying forest carbon stocks and determining CDM sites. Related domestic research is still at an initial stage, and forest carbon stocks are mainly estimated using k-nearest neighbor (k-NN). To select a suitable method for forests in Korea, the applicability of diverse spatial data and algorithms must be considered, and a comparison between methods is also required.
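
The k-NN approach mentioned above can be sketched in a few lines of Python. The abstract does not describe the feature set, distance measure, or value of k used in the reviewed studies, so this is only a generic illustration that assumes Euclidean distance in feature space and an unweighted mean of the k nearest reference plots. The function name, features, and carbon values are all hypothetical.

```python
import math


def knn_estimate(query_features, reference_plots, k=3):
    """Estimate a value as the mean over the k nearest reference plots.

    reference_plots is a list of (feature_tuple, measured_value) pairs.
    """
    ranked = sorted(
        reference_plots,
        key=lambda plot: math.dist(plot[0], query_features),
    )
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Hypothetical field plots: (two image-derived features, carbon stock).
plots = [
    ((0.1, 0.2), 40.0),
    ((0.2, 0.1), 50.0),
    ((0.5, 0.5), 80.0),
    ((0.8, 0.9), 110.0),
    ((0.9, 0.8), 120.0),
]

# Estimate for a pixel whose features sit between the first two plots.
print(knn_estimate((0.15, 0.15), plots, k=2))
```

In practice the reference values would come from field inventory plots and the features from imagery or map layers, and distance-weighted averaging is a common variant.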

Quantitative Conductivity Estimation Error due to Statistical Noise in Complex $B_1{^+}$ Map (정량적 도전율측정의 오차와 $B_1{^+}$ map의 노이즈에 관한 분석)

  • Shin, Jaewook;Lee, Joonsung;Kim, Min-Oh;Choi, Narae;Seo, Jin Keun;Kim, Dong-Hyun
    • Investigative Magnetic Resonance Imaging
    • /
    • v.18 no.4
    • /
    • pp.303-313
    • /
    • 2014
  • Purpose: In-vivo conductivity reconstruction using the transmit field ($B_1{^+}$) information of MRI has been proposed. We assessed the accuracy of conductivity reconstruction in the presence of statistical noise in the complex $B_1{^+}$ map and provided a parametric model of the conductivity-to-noise ratio. Materials and Methods: The $B_1{^+}$ distribution was simulated for a cylindrical phantom model. By adding complex Gaussian noise to the simulated $B_1{^+}$ map, the quantitative conductivity estimation error was evaluated. The quantitative evaluation was repeated over several parameters, such as Larmor frequency, object radius, and SNR of the $B_1{^+}$ map. A parametric model for the conductivity-to-noise ratio was developed in terms of these parameters. Results: According to the simulation results, conductivity estimation is more sensitive to statistical noise in the $B_1{^+}$ phase than to noise in the $B_1{^+}$ magnitude. The conductivity estimate of the object of interest does not depend on the external object surrounding it. The conductivity-to-noise ratio is proportional to the signal-to-noise ratio of the $B_1{^+}$ map, the Larmor frequency, the conductivity value itself, and the number of averaged pixels. To estimate the conductivity of the targeted tissue accurately, the SNR of the $B_1{^+}$ map and an adequate filter size have to be taken into account in the conductivity reconstruction process. In addition, the simulation result was verified on a conventional 3T MRI scanner. Conclusion: Through all these relationships, the quantitative conductivity estimation error due to statistical noise in the $B_1{^+}$ map is modeled. Using this model, further issues regarding filtering and reconstruction algorithms can be investigated for MREPT.

The Evaluation of Meteorological Inputs retrieved from MODIS for Estimation of Gross Primary Productivity in the US Corn Belt Region (MODIS 위성 영상 기반의 일차생산성 알고리즘 입력 기상 자료의 신뢰도 평가: 미국 Corn Belt 지역을 중심으로)

  • Lee, Ji-Hye;Kang, Sin-Kyu;Jang, Keun-Chang;Ko, Jong-Han;Hong, Suk-Young
    • Korean Journal of Remote Sensing
    • /
    • v.27 no.4
    • /
    • pp.481-494
    • /
    • 2011
  • Investigation of the $CO_2$ exchange between the biosphere and the atmosphere at regional, continental, and global scales can be approached by combining remote sensing with carbon cycle processes to estimate vegetation productivity. The NASA Earth Observing System (EOS) currently produces a regular global estimate of gross primary productivity (GPP) and annual net primary productivity (NPP) of the entire terrestrial earth surface at 1 km spatial resolution. While the MODIS GPP algorithm uses meteorological data provided by the NASA Data Assimilation Office (DAO), sub-pixel heterogeneity and complex terrain are generally not reflected because of the coarse spatial resolution of the DAO data (a resolution of $1^{\circ}\;{\times}\;1.25^{\circ}$). In this study, we estimated inputs retrieved from MODIS products of the AQUA and TERRA satellites at 5 km spatial resolution for the purpose of finer GPP and/or NPP determination. The derived variables included temperature, VPD, and solar radiation. Data from seven AmeriFlux sites located in the Corn Belt region were used to evaluate the input data from MODIS. MODIS-derived air temperature values showed good agreement with ground-based observations. The mean error (ME) and coefficient of correlation (R) ranged from $-0.9^{\circ}C$ to $+5.2^{\circ}C$ and from 0.83 to 0.98, respectively. VPD agreed somewhat coarsely with tower observations (ME = -183.8 Pa ~ +382.1 Pa; R = 0.51 ~ 0.92). While MODIS-derived shortwave radiation showed a good correlation with observations, it was slightly overestimated (ME = -0.4 MJ $day^{-1}$ ~ +7.9 MJ $day^{-1}$; R = 0.67 ~ 0.97). Our results indicate that inputs derived from MODIS atmosphere and land products can provide a useful tool for estimating crop GPP.
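
The two agreement measures quoted above, mean error (ME) and the correlation coefficient (R), can be computed as in the following Python sketch. The abstract does not define the sign convention for ME; the sketch assumes estimate minus observation, which is consistent with a positive ME being described as overestimation. The function names and sample values are hypothetical.

```python
import math


def mean_error(estimated, observed):
    """Mean of (estimate - observation); positive means overestimation."""
    return sum(e - o for e, o in zip(estimated, observed)) / len(observed)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical daily air temperatures (degrees C): satellite-derived vs tower.
satellite = [11.0, 13.0, 15.0, 17.0]
tower = [10.0, 12.0, 14.0, 16.0]
print(mean_error(satellite, tower))  # constant +1 degree bias
print(pearson_r(satellite, tower))   # perfectly correlated despite the bias
```

The example shows why both statistics are reported together: a series can correlate perfectly with observations while still carrying a systematic bias.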

Estimation of forest Site Productivity by Regional Environment and Forest Soil Factors (권역별 입지$\cdot$토양 환경 요인에 의한 임지생산력 추정)

  • Won Hyong-kyu;Jeong Jin-Hyun;Koo Kyo-Sang;Song Myung Hee;Shin Man Yong
    • Korean Journal of Agricultural and Forest Meteorology
    • /
    • v.7 no.2
    • /
    • pp.132-140
    • /
    • 2005
  • This study was conducted to develop regional site index equations for the main tree species in the Gangwon, Gyunggi-Chungcheong, Gyungsang, and Jeolla areas of Korea, using environmental and soil factors obtained from a digital forest site map. Using the large data set obtained from the digital forest map, site index was regressed on a total of 28 environmental and soil factors by tree species to develop the best site index equations for each region. The selected main tree species were Larix leptolepis, Pinus koraiensis, Pinus densiflora, Pinus thunbergii, and Quercus acutissima. Finally, four to five environmental and soil factors per species were chosen as independent variables in the best regional site index equations, those with the highest coefficients of determination $(R^2)$. For these site index equations, three evaluation statistics (mean difference, standard deviation of difference, and standard error of difference) were applied to data sets collected independently from fields within each region. According to the evaluation statistics, the regional site index equations by species developed in this study conformed well to the independent data sets, with relatively low bias and variation. It was concluded that the regional site index equations by species are sufficiently capable of estimating site productivity.
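
The three evaluation statistics named above can be sketched in Python as follows. The abstract does not give their formulas, so the sketch assumes the usual definitions: the mean of the predicted-minus-observed differences, their sample standard deviation (n - 1 in the denominator), and that standard deviation divided by the square root of n. The function name and the site index values are hypothetical.

```python
import math


def evaluation_statistics(predicted, observed):
    """Return (mean difference, SD of difference, SE of difference)."""
    differences = [p - o for p, o in zip(predicted, observed)]
    n = len(differences)
    mean_diff = sum(differences) / n
    # Sample standard deviation of the differences (assumed definition).
    variance = sum((d - mean_diff) ** 2 for d in differences) / (n - 1)
    sd_diff = math.sqrt(variance)
    # Standard error of the mean difference (assumed definition).
    se_diff = sd_diff / math.sqrt(n)
    return mean_diff, sd_diff, se_diff


# Hypothetical site index values (m) for four independent field plots.
predicted = [12.0, 14.0, 16.0, 18.0]
observed = [11.0, 14.0, 17.0, 16.0]
print(evaluation_statistics(predicted, observed))
```

Under these definitions the mean difference measures bias, while the standard deviation and standard error describe the variation of the equation's errors on independent data.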

Estimation of Linkage Disequilibrium and Effective Population Size using Whole Genome Single Nucleotide Polymorphisms in Hanwoo (한우에서 전장의 유전체 정보를 활용한 연관불평형 및 유효집단크기 추정에 관한 연구)

  • Cho, Chung-Il;Lee, Joon-Ho;Lee, Deuk-Hwan
    • Journal of Life Science
    • /
    • v.22 no.3
    • /
    • pp.366-372
    • /
    • 2012
  • This study was conducted to estimate the extent of linkage disequilibrium (LD) and effective population size in Hanwoo using whole-genome single nucleotide polymorphisms (SNPs) genotyped by DNA chip. Using blood samples of 35 young bulls born from 2005 to 2008 and their progeny (N=253) in a Hanwoo nucleus population collected from the Hanwoo Improvement Center, 51,582 SNPs were genotyped using Bovine SNP50 chips. A total of 40,851 SNPs were used in this study after elimination of SNPs with a missing genotyping rate of over 10 percent and monomorphic SNPs (10,730 SNPs). The total autosomal genome length, measured as the sum of the longest syntenic pairs of SNPs by chromosome, was 2,541.6 Mb (mega base pairs). The average distances of all adjacent pairs on each BTA ranged from 0.55 to 0.74 cM. LD decayed exponentially with physical distance. The mean LD ($r^2$) among syntenic SNP pairs was 0.136 at a physical distance of 0-0.1 Mb and 0.06 at 0.1-0.2 Mb. When these results were applied to Luo's formula, about 2,000 phenotypic records were found to be required to achieve power > 0.9 to detect a 5% QTL in the Hanwoo population. The estimated effective population size for the current generation was 84 head, and the estimate for ancestors 50 generations ago was 1,150 head. The average rates of decrease in effective population size per generation were 9.0% at about five generations ago and 17.3% at the current generation. The main cause of the rapid decrease in effective population size was considered to be the intensive use of a few prominent sires since the introduction of artificial insemination technology in Korea. To increase and/or sustain the effective population size, selection of various proven bulls and mating systems that consider genetic diversity are needed.
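
The LD measure $r^2$ reported above can be illustrated with a minimal Python sketch. It uses the standard definition for two biallelic loci, $r^2 = D^2 / (p_A(1-p_A)\,p_B(1-p_B))$ with $D = p_{AB} - p_A p_B$, computed from haplotype counts. SNP chip genotypes are unphased, so real analyses estimate haplotype frequencies first; the abstract does not say how the authors did this, and the function name and counts here are hypothetical.

```python
def ld_r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two biallelic loci from the four haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at locus 2
    p_AB = n_AB / total           # frequency of haplotype AB
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        # A monomorphic locus carries no LD information.
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    d = p_AB - p_A * p_B          # coefficient of disequilibrium D
    return d * d / denominator


# Hypothetical haplotype counts for one SNP pair.
print(ld_r_squared(40, 10, 10, 40))  # partial association
print(ld_r_squared(25, 25, 25, 25))  # loci independent
```

The undefined case for monomorphic loci is one reason such SNPs are removed before LD analysis, as was done in the study.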