• Title/Summary/Keyword: Statistical decision

Anomaly Detection for User Action with Generative Adversarial Networks (적대적 생성 모델을 활용한 사용자 행위 이상 탐지 방법)

  • Choi, Nam woong;Kim, Wooju
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.43-62 / 2019
  • Anomaly detection was once dominated by methods that judged abnormality from statistics derived from the data. This was possible because data used to be low-dimensional, so classical statistical methods worked effectively. In the era of big data, however, data characteristics have become complex, and it has become difficult to accurately analyze and predict the data generated across industry with conventional methods. Supervised learning algorithms such as SVM and decision trees were therefore adopted. However, supervised models predict test data accurately only when the classes are balanced, while most industrial data have heavily imbalanced classes, so their predictions are not always valid. To overcome these drawbacks, many studies now use unsupervised models that are not influenced by the class distribution, such as autoencoders or generative adversarial networks (GANs). In this paper, we propose a method to detect anomalies using GANs. AnoGAN, introduced by Schlegl et al. (2017), is a classification model that performs anomaly detection on medical images; it is built on convolutional neural networks. By contrast, anomaly detection for sequence data with GANs has received far less research attention than image data. Li et al. (2018) did propose a model based on LSTM, a type of recurrent neural network, to classify anomalies in numerical sequence data, but it was not applied to categorical sequence data, nor did it use the feature matching method of Salimans et al. (2016). This suggests that much remains to be tried in the anomaly classification of sequence data with GANs. To learn the sequence data, the GAN is built from LSTMs: the generator is a two-layer stacked LSTM with 32- and 64-dimensional hidden unit layers, and the discriminator is an LSTM with a 64-dimensional hidden unit layer. Earlier work on sequence anomaly detection derived anomaly scores from the entropy of the probability assigned to the actual data; in this paper, as mentioned above, anomaly scores are instead derived with the feature matching technique. In addition, the process of optimizing the latent variables was designed with an LSTM to improve model performance. The modified GAN was more precise than the autoencoder in all experiments and was approximately 7% higher in accuracy. In terms of robustness, the GAN also outperformed the autoencoder: because a GAN learns the data distribution from the real categorical sequences, it is not thrown off by a single dominant normal pattern, whereas the autoencoder is. In the robustness test, the accuracy of the autoencoder was 92% versus 96% for the GAN, and sensitivity was 40% for the autoencoder versus 51% for the GAN. Experiments were also conducted to show how much performance changes with the structure used to optimize the latent variables; sensitivity improved by about 1%. These results offer a new perspective on latent variable optimization, which has previously received little attention. (A minimal sketch of feature-matching anomaly scoring follows this entry.)
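As referenced above, a minimal sketch of feature-matching anomaly scoring for categorical sequences. The layer sizes, vocabulary size, and the stand-in for the generator's latent-search reconstruction are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: feature-matching anomaly score for sequence data, loosely
# following the AnoGAN idea (Schlegl et al., 2017) with the feature matching
# notion of Salimans et al. (2016). All sizes and names are assumptions.
import torch
import torch.nn as nn

class LSTMDiscriminator(nn.Module):
    def __init__(self, n_tokens: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def features(self, x):
        # Intermediate features f(x): the LSTM's final hidden state.
        _, (h, _) = self.lstm(self.embed(x))
        return h[-1]

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x)))

def feature_matching_score(disc, real_seq, generated_seq):
    """Anomaly score = distance between discriminator features of the
    real sequence and of its generated reconstruction (larger = more anomalous)."""
    with torch.no_grad():
        f_real = disc.features(real_seq)
        f_fake = disc.features(generated_seq)
    return torch.norm(f_real - f_fake, dim=-1)

# Toy usage: batch of 4 categorical sequences of length 20 over a
# hypothetical vocabulary of 100 user actions.
disc = LSTMDiscriminator(n_tokens=100)
real = torch.randint(0, 100, (4, 20))
fake = torch.randint(0, 100, (4, 20))  # stands in for G(z*) after latent search
print(feature_matching_score(disc, real, fake))
```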

A Study for Strategy of On-line Shopping Mall: Based on Customer Purchasing and Re-purchasing Pattern (시스템 다이내믹스 기법을 활용한 온라인 쇼핑몰의 전략에 관한 연구 : 소비자의 구매 및 재구매 행동을 중심으로)

  • Lee, Sang-Gun;Min, Suk-Ki;Kang, Min-Cheol
    • Asia Pacific Journal of Information Systems / v.18 no.3 / pp.91-121 / 2008
  • Electronic commerce, commonly known as e-commerce or eCommerce, has become a major business trend. The amount of trade conducted electronically has grown extraordinarily with the development of Internet technology. Most electronic commerce is conducted between businesses and customers; accordingly, most e-commerce research tries to identify customers' needs and behaviors through statistical methods. However, such statistical studies, mostly questionnaire-based, are static: they cannot capture the dynamic relationship between initial purchasing and repurchasing. This study therefore proposes a dynamic research model for analyzing the causes of initial purchasing and repurchasing. The paper is based on system dynamics theory, using a simulation model whose restrictions are grounded in TAM (Technology Acceptance Model), PAM, and TPB (Theory of Planned Behavior). It investigates not only customers' purchasing and repurchasing behavior over time but also their interactive effects on one another. The research model has six scenarios and three steps for analyzing customer behavior: the first step studies purchasing situations, the second studies repurchasing situations, and the third studies the relationship between initial purchasing and repurchasing. The purpose of the six scenarios is to find customers' purchasing patterns under environmental changes. We set six variables in these scenarios by (1) changing the number of products; (2) changing the number of contents in on-line shopping malls; (3) having multimedia files or not on the shopping mall web sites; (4) grading on-line communities; (5) changing the quality of products; and (6) changing customers' degree of confidence in products. The first three variables are applied to the purchasing behavior study, and the others to the repurchasing behavior study. Through the simulation, this paper presents several interrelated findings about customer purchasing behavior. For example, active community participation does not itself increase purchasing but does increase the word-of-mouth effect; likewise, the higher a product's quality, the stronger the word-of-mouth effect. The number of products and the number of contents on the web sites have the same influence on buying behavior. The simulations not only display the result of each scenario but also show how the factors affect one another. Hence, an electronic commerce firm can build a more realistic marketing strategy around consumer behavior through this dynamic simulation research, and dynamic analysis can predict outcomes that support marketing strategy decisions via time-line graphs. Consequently, this dynamic simulation analysis could be a useful research model for building a firm's competitive advantage. However, the simulation model needs further study: with respect to reality, it has some limitations, since several factors that affect buying behavior are missing. The first missing factor is customers' degree of brand recognition. The second is the degree of customer satisfaction. The third is the power of word of mouth in a specific region; word of mouth significantly affects a region's culture and even people's buying behavior. The last missing factor is the user interface environment of the Internet or other on-line shopping tools. To obtain more realistic results, these factors should be essential considerations in future studies. (An illustrative stock-and-flow sketch follows this entry.)
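As referenced above, an illustrative stock-and-flow simulation in the spirit of a system dynamics model: potential customers flow into first-time buyers through advertising and word of mouth, and buyers flow into repeat purchasers. All stocks, rates, and parameter values are invented for illustration, not the authors' calibrated model.

```python
# Hedged sketch: a minimal stock-and-flow (system dynamics) simulation.
# Stocks: potential customers, first-time buyers, repeat purchasers.
import numpy as np

def simulate(steps=100, dt=1.0, market=10_000,
             ad_rate=0.01, wom_rate=0.25, repurchase_rate=0.1):
    potential, buyers, repeaters = market, 0.0, 0.0
    history = []
    for _ in range(steps):
        # Word of mouth scales with the fraction of the market that has bought.
        adopting = (ad_rate + wom_rate * (buyers + repeaters) / market) * potential
        repeating = repurchase_rate * buyers
        potential -= adopting * dt
        buyers += (adopting - repeating) * dt
        repeaters += repeating * dt
        history.append((potential, buyers, repeaters))
    return np.array(history)

h = simulate()
print(h[-1])  # final stock levels after 100 periods
```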

Development of Systematic Process for Estimating Commercialization Duration and Cost of R&D Performance (기술가치 평가를 위한 기술사업화 기간 및 비용 추정체계 개발)

  • Jun, Seoung-Pyo;Choi, Daeheon;Park, Hyun-Woo;Seo, Bong-Goon;Park, Do-Hyung
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.139-160 / 2017
  • Technology commercialization creates economic value by linking a company's R&D processes and outputs to the market, and it matters because it lets a company gain and maintain a sustained competitive advantage. For a specific technology to be commercialized, it passes through the stages of technology planning, research and development, and commercialization, a process that takes considerable time and money. The duration and cost of technology commercialization are therefore important decision inputs for a market entry strategy, and even more important for a technology investor who must rationally evaluate the technology's value. It is thus very important to estimate the duration and cost of commercialization scientifically; however, research on technology commercialization is scarce and related methodologies are lacking. In this study, we propose an evaluation model that estimates the duration and cost of commercializing R&D technology for small and medium-sized enterprises. To accomplish this, we collected public data from the National Science & Technology Information Service (NTIS) and survey data provided by the Small and Medium Business Administration, and developed the estimation model from these data based on the market approach, one of the technology valuation methods. Specifically, we defined the commercialization process as consisting of development planning, development progress, and commercialization, and derived key variables such as stage-wise R&D costs and durations, factors of the technology itself, factors of the technology development, and environmental factors. First, from the data, we estimated the cost and duration at each technology readiness level (basic research, applied research, development research, prototype production, commercialization) for each industry classification. We then developed and verified a research model for each industry classification. The results can be summarized as follows. First, they can be incorporated into technology valuation models to estimate the objective economic value of a technology: the duration and cost from the development stage to the commercialization stage strongly influence the discounting of future sales from the technology, so estimating them scientifically from past data contributes to more reliable valuation. Second, we verified models of several kinds, including statistical and data mining models: the statistical model identifies the important factors for estimating commercialization duration and cost, while the data mining model yields rules or algorithms that can feed an advanced technology valuation system. Finally, this study reaffirms the importance of commercialization costs and durations, which previous studies have not actively examined; the results confirm the significant factors affecting them and show that the factors differ by industry classification. Practically, the results can be reflected in technology valuation systems that national research institutes provide to R&D staff for sophisticated technology valuation. The relevant logic and algorithms can be implemented independently and reflected directly in such a system, so practitioners can use them almost immediately. In conclusion, this study makes both theoretical and practical contributions. (A stage-wise regression sketch follows this entry.)
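As referenced above, a sketch of the stage-wise estimation idea: one regression per technology readiness level, with the total commercialization duration taken as the sum of the stage predictions. The feature names and synthetic data are placeholders, not the NTIS or SMBA variables.

```python
# Hedged sketch: per-stage regression of commercialization duration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
stages = ["basic", "applied", "development", "prototype", "commercialization"]
models = {}
for stage in stages:
    # Stand-in features: R&D cost, team size, prior-stage duration.
    X = rng.normal(size=(200, 3))
    y = 12 + X @ np.array([3.0, 1.5, 2.0]) + rng.normal(scale=2.0, size=200)
    models[stage] = LinearRegression().fit(X, y)  # months for this stage

# Total expected commercialization duration = sum of stage predictions.
x_new = rng.normal(size=(1, 3))
total = sum(m.predict(x_new)[0] for m in models.values())
print(f"estimated duration: {total:.1f} months")
```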

Predicting Regional Soybean Yield using Crop Growth Simulation Model (작물 생육 모델을 이용한 지역단위 콩 수량 예측)

  • Ban, Ho-Young;Choi, Doug-Hwan;Ahn, Joong-Bae;Lee, Byun-Woo
    • Korean Journal of Remote Sensing / v.33 no.5_2 / pp.699-708 / 2017
  • The present study developed an approach for predicting soybean yield using a crop growth simulation model at the regional level, where detailed, site-specific information on cultivation management practices is not easily available as model input. The CROPGRO-Soybean model included in the Decision Support System for Agrotechnology Transfer (DSSAT) was employed, and Illinois, a major soybean production region of the USA, was selected as the study region. As a first step, genetic coefficients representative of each soybean maturity group (MG I~VI) were estimated through two years of sowing date experiments at the Seoul National University Farm (37.27°N, 126.99°E), using domestic and foreign cultivars of diverse maturity. The model with these representative genetic coefficients simulated the developmental stages of cultivars within each maturity group fairly well. Soybean yields for 10 km × 10 km grids across Illinois were simulated from 2000 to 2011 with weather data under 18 simulation conditions, combining three maturity groups, three seeding dates, and two irrigation regimes; planting dates and maturity groups were assigned differently to three longitudinally divided sub-regions. The yearly state yields estimated by averaging all grid yields under non-irrigated and fully irrigated conditions differed substantially from the statistical yields and did not capture the annual upward yield trend driven by improved cultivation technologies. Using the observed grain yields of nine agricultural districts in Illinois and the district yields estimated from the simulated grid yields under the 18 conditions, a multiple regression model was constructed to estimate soybean yield at the agricultural district level; a year variable was added to reflect the yearly yield trend. This model explained the yearly and district yield variation fairly well, with a coefficient of determination of R² = 0.61 (n = 108). Yearly state yields, calculated by weighting the model-estimated district yields by each district's cultivation area, corresponded very closely (R² = 0.80) to the yearly statistical state yields. Furthermore, the model predicted the state yield fairly well for 2012, a year not used in model construction in which severe yield reduction was recorded due to drought. (A minimal regression sketch follows this entry.)
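As referenced above, a sketch of the district-level regression with a year trend term. The synthetic data stand in for the observed district yields and the DSSAT-simulated grid yields; column meanings are assumptions.

```python
# Hedged sketch: regression of observed district yield on simulated yield
# plus a year term to capture the technology-driven trend.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 108  # 9 districts x 12 years, as in the paper
year = rng.integers(2000, 2012, size=n)
sim_yield = rng.normal(3.0, 0.4, size=n)       # simulated yield, t/ha (invented)
obs = 0.6 * sim_yield + 0.03 * (year - 2000) + rng.normal(0, 0.2, n)

X = np.column_stack([sim_yield, year - 2000])  # year enters as a trend term
model = LinearRegression().fit(X, obs)
print("R^2 =", round(model.score(X, obs), 2))
```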

Optimization of Support Vector Machines for Financial Forecasting (재무예측을 위한 Support Vector Machine의 최적화)

  • Kim, Kyoung-Jae;Ahn, Hyun-Chul
    • Journal of Intelligence and Information Systems / v.17 no.4 / pp.241-254 / 2011
  • Financial time-series forecasting is one of the most important issues because it is essential to the risk management of financial institutions. Researchers have therefore tried to forecast financial time series using various data mining techniques such as regression, artificial neural networks, decision trees, and k-nearest neighbor. Recently, support vector machines (SVMs) have become popular in this area because they do not require huge training data and have a low risk of overfitting. However, a user must determine several design factors heuristically in order to use an SVM, notably the choice of kernel function and its parameters and the selection of a proper feature subset. Beyond these, proper selection of an instance subset may also improve forecasting performance by eliminating irrelevant and distorting training instances. Nonetheless, few studies have applied instance selection to SVMs, especially in stock market prediction. Instance selection chooses a proper instance subset from the original training data; it may be considered a method of knowledge refinement that maintains the instance base. This study proposes a novel instance selection algorithm for SVMs that uses a genetic algorithm (GA) to optimize the instance selection process and the kernel parameters simultaneously; we call this model ISVM (SVM with Instance selection). Experiments on stock market data were run with ISVM: the GA searches for optimal or near-optimal kernel parameter values and relevant instances for the SVM, so the chromosome carries two sets of codes, one for the kernel parameters and one for instance selection. For the GA's control parameters, the population size is 50 organisms, the crossover rate 0.7, and the mutation rate 0.1; 50 generations are permitted as the stopping condition. The application data consist of technical indicators and the direction of change in the daily Korea stock price index (KOSPI), with 2,218 trading days in total, split into training, test, and hold-out sets of 1,056, 581, and 581 samples, respectively. This study compares ISVM to several benchmark models: logistic regression (Logit), backpropagation neural networks (ANN), nearest neighbor (1-NN), conventional SVM (SVM), and SVM with optimized parameters (PSVM); in particular, PSVM uses kernel parameters optimized by the genetic algorithm. The experimental results show that ISVM outperforms 1-NN by 15.32%, ANN by 6.89%, Logit and SVM by 5.34%, and PSVM by 4.82% on the hold-out data, while using only 556 of the 1,056 original training instances. In addition, a two-sample test for proportions examined whether ISVM significantly outperforms the benchmarks: ISVM outperforms ANN and 1-NN at the 1% statistical significance level, and performs better than Logit, SVM, and PSVM at the 5% level. (A simplified GA sketch follows this entry.)
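As referenced above, a simplified GA-style sketch of simultaneous instance selection and RBF kernel parameter search. For brevity it uses a small population, elitism plus mutation only (no crossover), and synthetic data, rather than the paper's exact GA configuration (population 50, crossover 0.7, mutation 0.1, 50 generations).

```python
# Hedged sketch: GA-style joint instance selection and kernel tuning for SVM,
# in the spirit of ISVM. Settings and data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def fitness(chrom):
    mask, log_C, log_gamma = chrom
    # Guard against degenerate selections (too few instances or one class).
    if mask.sum() < 10 or len(set(y_tr[mask])) < 2:
        return 0.0
    clf = SVC(C=10**log_C, gamma=10**log_gamma)
    clf.fit(X_tr[mask], y_tr[mask])
    return clf.score(X_val, y_val)

def random_chrom():
    return [rng.random(len(X_tr)) < 0.5,   # instance selection bits
            rng.uniform(-2, 2),            # log10(C)
            rng.uniform(-3, 0)]            # log10(gamma)

def mutate(chrom, rate=0.1):
    mask = chrom[0].copy()
    flips = rng.random(len(mask)) < rate
    mask[flips] = ~mask[flips]
    return [mask, chrom[1] + rng.normal(0, 0.2), chrom[2] + rng.normal(0, 0.2)]

pop = [random_chrom() for _ in range(20)]
for _ in range(10):                        # generations (kept small here)
    scored = sorted(pop, key=fitness, reverse=True)
    pop = scored[:10] + [mutate(p) for p in scored[:10]]

best = max(pop, key=fitness)
print("validation accuracy:", fitness(best), "| instances kept:", best[0].sum())
```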

Development of a Failure Probability Model based on Operation Data of Thermal Piping Network in District Heating System (지역난방 열배관망 운영데이터 기반의 파손확률 모델 개발)

  • Kim, Hyoung Seok;Kim, Gye Beom;Kim, Lae Hyun
    • Korean Chemical Engineering Research / v.55 no.3 / pp.322-331 / 2017
  • District heating was first introduced in Korea in 1985. As the underground thermal piping network has now been in service for more than 30 years, maintenance of the underground thermal pipes has become an important issue, and a variety of complex technologies are required for the periodic inspection and operation management needed to maintain the aged network. In particular, a model is needed to support field decision-making on the economically optimal points for maintenance and replacement. In this study, the analysis was carried out on the repair history and accident data from the operation of the thermal pipe networks of five districts of the Korea District Heating Corporation. A failure probability model was developed using the statistical techniques of qualitative analysis and binomial logistic regression. Qualitative analysis of the maintenance history and accident data showed that the most important causes of pipeline damage were construction defects, pipe corrosion, and poor materials, which together accounted for about 82%. In the statistical model, setting the classification cut-off at 0.25 improved the accuracy of classifying pipes as damaged or undamaged to 73.5%. To establish the failure probability model, model fit was verified through the Hosmer-Lemeshow test, independence tests of the explanatory variables, and a chi-square test of the model. According to the resulting risk analysis, the highest failure probability was for reducer pipes smaller than 250 mm, constructed by company F, more than 10 years old, located on a motorway in the Seoul area, in winter. The results of this study can be used to prioritize maintenance, preventive inspection, and replacement of thermal piping systems. They can also reduce the frequency of pipeline damage and support more proactive network management by enabling accident prevention plans, such as inspection and maintenance, to be established and acted on in advance. (A minimal logistic regression sketch follows this entry.)
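As referenced above, a minimal binomial logistic regression sketch with the paper's 0.25 classification cut-off. The features (pipe age, diameter) and the synthetic data are invented placeholders for the operating records.

```python
# Hedged sketch: failure-probability model with a 0.25 cut-off. A lower
# cut-off trades more false alarms for fewer missed failures.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 1000
age = rng.uniform(0, 30, n)              # pipe age in years (invented)
diameter = rng.choice([125, 250, 500], n)
p_fail = 1 / (1 + np.exp(-(0.15 * age - 0.004 * diameter - 1.5)))
failed = rng.random(n) < p_fail

X = np.column_stack([age, diameter])
model = LogisticRegression().fit(X, failed)

# Classify as "failure" when predicted probability exceeds 0.25, not 0.5.
pred = model.predict_proba(X)[:, 1] > 0.25
print("accuracy at 0.25 cut-off:", (pred == failed).mean())
```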

A New Exploratory Research on Franchisor's Provision of Exclusive Territories (가맹본부의 배타적 영업지역보호에 대한 탐색적 연구)

  • Lim, Young-Kyun;Lee, Su-Dong;Kim, Ju-Young
    • Journal of Distribution Research / v.17 no.1 / pp.37-63 / 2012
  • In franchise business, exclusive sales territory (sometimes EST below) protection is a very important issue from economic, social, and political points of view. It affects the growth and survival of both franchisor and franchisee and often raises social and political conflicts; when the franchisee is unfamiliar with the related laws and regulations, the franchisor has a strong chance of exploiting this. Exclusive sales territory protection by the manufacturer and distributors (wholesalers or retailers) means a sales area restriction under which only certain distributors have the right to sell products or services: a distributor granted an exclusive sales territory can protect its own territory but may be prohibited from entering other regions. Even though exclusive sales territory is quite a critical problem in franchise business, there has been little rigorous research on its reasons, results, evaluation, and future direction based on empirical data. This paper addresses the problem not only through logical and nomological validity but through empirical validation. In pursuing an empirical analysis, we take into account the difficulties of real data collection and of the statistical techniques, and use disclosure document data collected by the Korea Fair Trade Commission instead of the conventional survey method, which is often criticized for its measurement error. Existing theories about exclusive sales territory fall into two groups. The first concerns the effectiveness of exclusive sales territory from both the franchisor's and the franchisee's points of view: the outcome can be positive for franchisors but negative for franchisees, and positive in terms of sales but negative in terms of profit, so variables and viewpoints must be set carefully. The second concerns the motives for protecting exclusive sales territory, which can be classified into four groups: industry characteristics, franchise system characteristics, the capability to maintain an exclusive sales territory, and strategic decision. Within these four groups there are more specific variables and theories, from which we develop nine hypotheses. To validate the hypotheses, data were collected from the government (FTC) homepage, which is an open source. The sample consists of 1,896 franchisors and contains about three years of operating data, from 2006 to 2008; 627 of the franchisors have an exclusive sales territory protection policy, and these are not evenly distributed over the 19 representative industries. Additional data were collected from other government agency homepages, such as Statistics Korea, and combined with various secondary sources to create meaningful variables. All variables are dichotomized by mean or median split unless inherently dichotomous by definition, since each hypothesis is composed of multiple variables and no solid statistical technique can incorporate all these conditions at once. This paper uses a simple chi-square test because the hypotheses and theories are built on quite specific conditions such as industry type, economic condition, company history, and various strategic purposes. It is almost impossible to find samples satisfying all of these conditions, and they cannot be manipulated in experimental settings; more advanced statistical techniques work well on clean data without exogenous variables but poorly on real, complex data. The chi-square test is applied by grouping samples into four cells with two criteria: whether they protect exclusive sales territories, and whether they satisfy the conditions of each hypothesis. The test then asks whether the proportion of franchisors that satisfy the conditions and protect exclusive sales territories significantly exceeds the proportion that satisfy the conditions but do not protect. In fact, the chi-square test is equivalent to Poisson regression, which allows more flexible application. As a result, only three hypotheses are accepted. When the attitude toward risk is high, so the royalty fee is tied to sales performance, EST protection yields poor results, as expected. When the franchisor protects EST in order to recruit franchisees easily, EST protection yields better results. And when EST protection aims to improve the efficiency of the franchise system as a whole, it shows better performance: high efficiency is achieved because EST prevents free riding by franchisees who would exploit others' marketing efforts, encourages proper investments, and distributes franchisees evenly across multiple regions. The other hypotheses are not supported by the significance tests. Exclusive sales territory should be protected from proper motives and administered for mutual benefit; legal restrictions driven by a government agency like the FTC can be misused and cause misunderstandings, so real practices need more careful monitoring and more rigorous study by both academics and practitioners. (A minimal chi-square sketch follows this entry.)
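As referenced above, a sketch of the 2x2 chi-square test described in the abstract: franchisors are cross-classified by whether they protect exclusive sales territories and whether they satisfy a hypothesis's conditions. The cell counts below are invented for illustration, not the study's data.

```python
# Hedged sketch: 2x2 chi-square test of independence.
from scipy.stats import chi2_contingency

#                satisfies condition   does not
table = [[180, 120],   # protects EST
         [90, 150]]    # does not protect EST

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
# A small p-value indicates that the proportion protecting EST differs
# significantly between franchisors that satisfy the condition and those
# that do not.
```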

A Study on the Improvement of Recommendation Accuracy by Using Category Association Rule Mining (카테고리 연관 규칙 마이닝을 활용한 추천 정확도 향상 기법)

  • Lee, Dongwon
    • Journal of Intelligence and Information Systems / v.26 no.2 / pp.27-42 / 2020
  • Traditional companies with offline stores were unable to secure large display space because of cost. This limitation meant that only a limited range of products could be displayed on the shelves, depriving consumers of the opportunity to experience various items. Taking advantage of the virtual space of the Internet, online shopping goes beyond the physical-space limitations of offline shopping and can now display numerous products on web pages, satisfying consumers with a wide variety of needs. Paradoxically, however, this can also leave consumers struggling to compare and evaluate too many alternatives in their purchase decision-making process. As an effort to address this side effect, various kinds of purchase decision support systems have been studied, such as keyword-based item search services and recommender systems. These systems can reduce search time, prevent consumers from leaving while browsing, and contribute to increased sales for the seller. Among them, recommender systems based on association rule mining can effectively detect interrelated products from transaction data such as orders: the statistically derived association between products provides clues for predicting how interested consumers will be in another product. However, since the algorithm is based on transaction counts, products that have not yet sold enough in the early days after launch may be excluded from the recommendation list even though they are highly likely to sell. Such missing items then lack sufficient exposure to consumers to record sufficient sales, falling into a vicious cycle of declining sales and omission from the recommendation list. This outcome is inevitable when recommendations are based on past transaction histories rather than on potential future sales. This study started from the idea that indirectly identifying this potential would help in selecting products to recommend. Since the attributes of a product affect consumers' purchasing decisions, this study reflects them in the recommender system: consumers who visit a product page have shown interest in the product's attributes and are likely to be interested in other products with the same attributes, so a recommender system based on these attributes can achieve a higher acceptance rate. Given that category is one of the main attributes of a product, it can indicate not only direct associations between two items but also potential associations yet to be revealed. Based on this idea, the study devised a recommender system that reflects associations not only between products but also between categories. Through regression analysis, the two kinds of associations were combined into a model that predicts the hit rate of a recommendation. To evaluate the proposed model, another regression model was developed based only on product-level associations, and comparative experiments were designed to resemble the environment in which products are actually recommended in online shopping malls. First, association rules for all possible combinations of antecedent and consequent items were generated from the order data. Then the hit rate of each rule was predicted from the support and confidence calculated by each model. The comparative experiments on order data collected from an online shopping mall show that recommendation accuracy improves when category-level associations are reflected alongside product-level associations: the proposed model showed a 2 to 3 percent improvement in hit rate over the existing model. Practically, this is expected to improve consumers' purchasing satisfaction and increase sellers' sales. (A minimal regression sketch follows this entry.)
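As referenced above, a sketch of the regression idea: predicting a rule's hit rate from item-level and category-level support and confidence, and comparing it against an item-only baseline. The rule metrics and target values are synthetic assumptions, not the shopping mall data.

```python
# Hedged sketch: item-level vs. item+category regression of rule hit rate.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n_rules = 500
item_support = rng.uniform(0, 0.1, n_rules)
item_confidence = rng.uniform(0, 1, n_rules)
cat_support = rng.uniform(0, 0.4, n_rules)      # category-level association
cat_confidence = rng.uniform(0, 1, n_rules)

# Invented ground truth: hit rate depends on both association levels.
hit_rate = (0.5 * item_confidence + 0.3 * cat_confidence
            + rng.normal(0, 0.05, n_rules)).clip(0, 1)

X_base = np.column_stack([item_support, item_confidence])
X_prop = np.column_stack([item_support, item_confidence,
                          cat_support, cat_confidence])
print("baseline R^2:", LinearRegression().fit(X_base, hit_rate).score(X_base, hit_rate))
print("proposed R^2:", LinearRegression().fit(X_prop, hit_rate).score(X_prop, hit_rate))
```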

Ensemble Learning with Support Vector Machines for Bond Rating (회사채 신용등급 예측을 위한 SVM 앙상블학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems / v.18 no.2 / pp.29-45 / 2012
  • Bond rating is regarded as an important event for measuring the financial risk of companies and for determining investors' returns, so predicting companies' credit ratings with statistical and machine learning techniques has been a popular research topic. The statistical techniques traditionally used in bond rating include multiple regression, multiple discriminant analysis (MDA), logistic models (LOGIT), and probit analysis. One major drawback, however, is that they rest on strict assumptions: linearity, normality, independence among predictor variables, and pre-existing functional forms relating the criterion variables and the predictor variables. These strict assumptions have limited their application to the real world. Machine learning techniques used in bond rating prediction include decision trees (DT), neural networks (NN), and the Support Vector Machine (SVM). SVM in particular is recognized as a new and promising classification and regression method: it learns a separating hyperplane that maximizes the margin between two categories, is simple enough to be analyzed mathematically, and achieves high performance in practice. SVM implements the structural risk minimization principle and seeks to minimize an upper bound on the generalization error; its solution may be a global optimum, so overfitting is unlikely. It also does not require many training samples, since it builds prediction models using only the representative samples near the boundaries, called support vectors. Numerous experimental studies have applied SVM successfully across pattern recognition fields. However, three major drawbacks can degrade SVM's performance. First, SVM was originally proposed for binary classification; methods for combining SVMs for multi-class classification, such as One-Against-One and One-Against-All, have been proposed, but they do not perform as well in multi-class problems as SVM does in binary classification. Second, approximation algorithms (e.g., decomposition methods, the sequential minimal optimization algorithm) can reduce multi-class computation time, but may deteriorate classification performance. Third, multi-class prediction suffers from the data imbalance problem, which occurs when the instances of one class greatly outnumber those of another; such data sets often produce a default classifier with a skewed boundary and hence reduced classification accuracy. SVM ensemble learning is one machine learning approach for coping with these drawbacks. Ensemble learning improves the performance of classification and prediction algorithms; AdaBoost is one of the most widely used ensemble techniques. It constructs a composite classifier by sequentially training classifiers while increasing the weight on misclassified observations through iterations, so observations incorrectly predicted by previous classifiers are chosen more often than correctly predicted ones. Boosting thus attempts to produce new classifiers that better predict examples on which the current ensemble performs poorly, reinforcing the training of misclassified observations from the minority class. This paper proposes multiclass Geometric Mean-based Boosting (MGM-Boost) to resolve the multiclass prediction problem. Since MGM-Boost introduces the notion of the geometric mean into AdaBoost, its learning process can consider geometric mean-based accuracy and errors across classes. This study applies MGM-Boost to a real-world bond rating case for Korean companies to examine its feasibility. Ten-fold cross-validation was performed three times with different random seeds to ensure that the comparison among the three classifiers did not happen by chance: for each run, the entire data set is partitioned into ten equal-sized sets, and each set is in turn used as the test set while the classifier trains on the other nine, so the cross-validated folds are tested independently for each algorithm. Through these steps, results were obtained for the classifiers on each of the 30 experiments. In arithmetic mean-based prediction accuracy, MGM-Boost (52.95%) outperforms both AdaBoost (51.69%) and SVM (49.47%); in geometric mean-based prediction accuracy, MGM-Boost (28.12%) likewise outperforms AdaBoost (24.65%) and SVM (15.42%). A t-test examined whether the performance of each classifier over the 30 folds differed significantly: MGM-Boost differs significantly from the AdaBoost and SVM classifiers at the 1% level. These results mean that MGM-Boost can provide robust and stable solutions to multi-class problems such as bond rating. (A sketch of the geometric mean metric follows this entry.)
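As referenced above, a sketch of geometric mean-based accuracy, the notion MGM-Boost builds into AdaBoost's weighting. Only the metric is shown here, on invented labels; the full boosting integration follows the paper.

```python
# Hedged sketch: geometric mean of per-class recall as a multiclass metric.
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])   # imbalanced 3-class labels
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2])

per_class_recall = recall_score(y_true, y_pred, average=None)
geometric_mean = np.prod(per_class_recall) ** (1 / len(per_class_recall))

# Arithmetic accuracy hides minority-class errors; the geometric mean
# collapses toward zero if any single class is predicted poorly.
print("arithmetic accuracy:", (y_true == y_pred).mean())
print("per-class recall:", per_class_recall)
print("geometric mean accuracy:", geometric_mean)
```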

A Study on the Availability of Spatial and Statistical Data for Assessing CO2 Absorption Rate in Forests - A Case Study on Ansan-si - (산림의 CO2 흡수량 평가를 위한 통계 및 공간자료의 활용성 검토 - 안산시를 대상으로 -)

  • Kim, Sunghoon;Kim, Ilkwon;Jun, Baysok;Kwon, Hyuksoo
    • Journal of Environmental Impact Assessment / v.27 no.2 / pp.124-138 / 2018
  • This research examined the availability of spatial data for assessing CO2 absorption rates in the forests of Ansan-si and evaluated the validity of methods that analyze CO2 absorption. To assess annual CO2 absorption statistically, the 1:5,000 Digital Forest-Map (Lim5000) and the Standard Carbon Removal of Major Forest Species (SCRMF) methods were employed; the Land Cover Map (LCM) was also used to verify the annual CO2 absorption estimates. Large variations in CO2 absorption rates occurred before and after 2010, owing to improvements in the precision and accuracy of the Forest Basic Statistics (FBS) in 2010, which produced a rapid increase in recorded growing stock; data from before 2010 therefore need calibration against recent FBS standards. Previous studies using Lim5000 and FBS (2015, 2010) did not take into account the CO2 absorption rates of different tree species. The combination of SCRMF and Lim5000 yielded a CO2 absorption of 42,369 tons, whereas LCM with SCRMF yielded 40,696 tons, a difference of 1,673 tons; homoscedasticity tests between Lim5000 and LCM gave p-values < 0.01. Given that CO2 absorption in forests is an important factor in reducing greenhouse gas emissions, the findings of this study should provide fundamental information to support a wide range of decision-making processes for land use and management. (An illustrative arithmetic sketch follows this entry.)
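As referenced above, a sketch of the standard growing stock-to-CO2 arithmetic (IPCC-style) that species-specific methods like SCRMF build on: stock increment is converted to biomass via wood density, a biomass expansion factor, and a root ratio, then to carbon and finally to CO2. The coefficient values below are invented placeholders, not SCRMF's published values.

```python
# Hedged sketch: growing stock increment (m3/yr) -> annual CO2 removal (tons).
def co2_absorption(stock_increment_m3, wood_density, bef, root_ratio,
                   carbon_fraction=0.5):
    """Convert a growing stock increment to annual CO2 removal in tons."""
    biomass = stock_increment_m3 * wood_density * bef * (1 + root_ratio)
    carbon = biomass * carbon_fraction
    return carbon * 44 / 12          # molecular-weight conversion, C -> CO2

# Placeholder coefficients for two species groups (illustrative only).
print(co2_absorption(10_000, wood_density=0.46, bef=1.4, root_ratio=0.25))
print(co2_absorption(10_000, wood_density=0.50, bef=1.3, root_ratio=0.20))
```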