• Title/Summary/Keyword: cross-validation method


The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.23-45 / 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales, and SNS, and the characteristics of the resulting datasets are also diverse. To secure competitiveness, companies need to improve their decision-making capacity using classification algorithms. However, most of them do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm is appropriate for the characteristics of a dataset has been a task that requires expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features that reflect the characteristics of multi-class datasets. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among those, we included the Herfindahl-Hirschman Index (HHI), originally a market concentration index, to replace IR (Imbalanced Ratio). We also developed a new index called the Reverse ReLU Silhouette Score and added it to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), Contraceptive Method Choice) were selected. Each dataset was classified using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross-validation. 
Oversampling of 10% to 100% was applied to each fold, and the meta-features of the dataset were measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. F1-score was selected as the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The HHI meta-feature proposed in this study had a significant effect on classification performance. (2) The number of variables has a significant effect on classification performance and, unlike the number of classes, its effect is positive. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at a significance level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by classification algorithm were consistent. In the regression analysis by classification algorithm, the number of variables did not have a significant effect for the Naïve Bayes algorithm, unlike for the other classification algorithms. This study makes two theoretical contributions: (1) two new meta-features (HHI and the Reverse ReLU Silhouette Score) were shown to be significant, and (2) the effects of data characteristics on classification performance were investigated using meta-features. The practical contributions are as follows. (1) The results can be utilized in the development of a system that recommends classification algorithms according to the characteristics of datasets. 
(2) Data scientists often search for the optimal algorithm for a given situation by repeatedly adjusting algorithm parameters, because the characteristics of the data differ. In this process, excessive resources are wasted in terms of hardware, cost, time, and manpower. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. The paper consists of an introduction, related research, research model, experiment, and conclusion and discussion.
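
Two of the meta-features named above, HHI and Entropy, can be illustrated on a class-label distribution. The sketch below is not the authors' implementation: it assumes the standard definition of HHI (sum of squared shares) applied to class proportions, and Shannon entropy in bits. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def class_shares(labels):
    """Return the proportion of samples in each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def hhi(labels):
    """Herfindahl-Hirschman Index: sum of squared class shares (assumed definition)."""
    return sum(share * share for share in class_shares(labels))


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    return -sum(share * log2(share) for share in class_shares(labels) if share > 0)
```

Under these definitions a perfectly balanced k-class dataset has HHI equal to 1/k, and HHI approaches 1 as a single class dominates, which is why it can stand in for an imbalance measure when there are more than two classes.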

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo;Park, Byeonghwa
    • Journal of Intelligence and Information Systems / v.21 no.1 / pp.1-13 / 2015
  • As opinion mining in big data applications has been highlighted, a lot of research on unstructured data has been conducted. Social media on the Internet generate unstructured or semi-structured data every second, often written in the natural languages we use in daily life. Many words in human languages have multiple meanings or senses. As a result, it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, which yields incorrect search results that are far from users' intentions. Even though a lot of progress has been made in recent years in enhancing the performance of search engines to provide users with appropriate results, there is still much room for improvement. Word sense disambiguation can play a very important role in natural language processing and is considered one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based approaches. This paper presents a method that automatically generates a corpus for word sense disambiguation by taking advantage of examples in existing dictionaries, avoiding expensive sense-tagging processes. It tests the effectiveness of the method with a Naïve Bayes model, a supervised learning algorithm, using the Korean standard unabridged dictionary and the Sejong Corpus. The Korean standard unabridged dictionary has approximately 57,000 sentences. The Sejong Corpus has about 790,000 sentences tagged with both part-of-speech and senses. In the experiments, the Korean standard unabridged dictionary and the Sejong Corpus were evaluated both in combination and as separate entities using cross-validation. Only nouns, the target of word sense disambiguation, were selected. 
93,522 word senses among 265,655 nouns and 56,914 sentences from related proverbs and examples were additionally combined in the corpus. The Sejong Corpus was easily merged with the Korean standard unabridged dictionary because the Sejong Corpus was tagged based on sense indices defined by the Korean standard unabridged dictionary. Sense vectors were formed after the merged corpus was created. Terms used in creating sense vectors were added to the named entity dictionary of a Korean morphological analyzer. Using the extended named entity dictionary, terms were extracted from the input sentences and term vectors for the sentences were created. Given the extracted term vector and the sense vector model built during the pre-processing stage, the sense-tagged terms were determined by vector space model-based word sense disambiguation. In addition, this study shows the effectiveness of the corpus merged from examples in the Korean standard unabridged dictionary and the Sejong Corpus. The experiment shows that better precision and recall are obtained with the merged corpus. This study suggests that the method can practically enhance the performance of Internet search engines and help us understand the meaning of a sentence more accurately in natural language processing tasks pertinent to search engines, opinion mining, and text mining. The Naïve Bayes classifier used in this study is a supervised learning algorithm based on Bayes' theorem. The Naïve Bayes classifier assumes that all senses are independent. Even though this assumption is not realistic and ignores the correlation between attributes, the Naïve Bayes classifier is widely used because of its simplicity, and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research needs to be carried out to consider all possible combinations and/or partial combinations of all senses in a sentence. 
Also, the effectiveness of word sense disambiguation may be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
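
The Naïve Bayes approach to sense selection described above can be sketched on a toy scale. This is an illustration only, not the paper's system: it assumes a multinomial model with add-one (Laplace) smoothing, a choice the abstract does not specify, and the sense labels and context words in the usage note are invented.

```python
from collections import Counter, defaultdict
from math import log


def train(examples):
    """Fit counts from (sense, context_words) pairs, e.g. dictionary example sentences."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def predict(model, context_words):
    """Return the sense with the highest log posterior under the independence assumption."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = log(n / total)  # log prior of the sense
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context_words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

For instance, after training on a few hypothetical examples for an ambiguous noun with a "finance" sense (contexts such as money, loan, account) and a "river" sense (contexts such as water, shore), `predict(model, ["money", "account"])` selects the "finance" sense.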

Optimization of Multiclass Support Vector Machine using Genetic Algorithm: Application to the Prediction of Corporate Credit Rating (유전자 알고리즘을 이용한 다분류 SVM의 최적화: 기업신용등급 예측에의 응용)

  • Ahn, Hyunchul
    • Information Systems Review / v.16 no.3 / pp.161-177 / 2014
  • Corporate credit rating assessment consists of complicated processes in which various factors describing a company are taken into consideration. Such assessment is known to be very expensive since domain experts must be employed to assess the ratings. As a result, data-driven corporate credit rating prediction using statistical and artificial intelligence (AI) techniques has received considerable attention from researchers and practitioners. In particular, statistical methods such as multiple discriminant analysis (MDA) and multinomial logistic regression analysis (MLOGIT), and AI methods including case-based reasoning (CBR), artificial neural network (ANN), and multiclass support vector machine (MSVM) have been applied to corporate credit rating. Among them, MSVM has recently become popular because of its robustness and high prediction accuracy. In this study, we propose a novel optimized MSVM model and apply it to corporate credit rating prediction in order to enhance accuracy. Our model, named 'GAMSVM (Genetic Algorithm-optimized Multiclass Support Vector Machine),' is designed to simultaneously optimize the kernel parameters and the feature subset selection. Prior studies such as Lorena and de Carvalho (2008) and Chatterjee (2013) show that proper kernel parameters may improve the performance of MSVMs. Also, the results from studies such as Shieh and Yang (2008) and Chatterjee (2013) imply that appropriate feature selection may lead to higher prediction accuracy. Based on these prior studies, we propose to apply GAMSVM to corporate credit rating prediction. As a tool for optimizing the kernel parameters and the feature subset selection, we suggest the genetic algorithm (GA). GA is known as an efficient and effective search method that attempts to simulate biological evolution. By applying genetic operations such as selection, crossover, and mutation, it is designed to gradually improve the search results. 
In particular, the mutation operator helps prevent GA from falling into local optima, so a globally optimal or near-optimal solution can be found. GA has been widely applied to search for optimal parameters or feature subsets of AI techniques including MSVM. For these reasons, we also adopt GA as an optimization tool. To empirically validate the usefulness of GAMSVM, we applied it to a real-world case of credit rating in Korea. Our application is in bond rating, which is the most frequently studied area of credit rating for specific debt issues or other financial obligations. The experimental dataset was collected from a large credit rating company in South Korea. It contained 39 financial ratios of 1,295 companies in the manufacturing industry, and their credit ratings. Using various statistical methods including one-way ANOVA and stepwise MDA, we selected 14 financial ratios as the candidate independent variables. The dependent variable, i.e. credit rating, was labeled as four classes: 1 (A1); 2 (A2); 3 (A3); 4 (B and C). Eighty percent of the data for each class was used for training, and the remaining 20 percent was used for validation. To overcome the small sample size, we applied five-fold cross-validation to our dataset. In order to examine the competitiveness of the proposed model, we also tested several comparative models including MDA, MLOGIT, CBR, ANN and MSVM. For MSVM, we adopted the One-Against-One (OAO) and DAGSVM (Directed Acyclic Graph SVM) approaches because they are known to be the most accurate among the various MSVM approaches. GAMSVM was implemented using LIBSVM, an open-source software package, and Evolver 5.5, a commercial software package that enables GA. The other comparative models were tested using various statistical and AI packages such as SPSS for Windows, Neuroshell, and Microsoft Excel VBA (Visual Basic for Applications). Experimental results showed that the proposed model, GAMSVM, outperformed all the competing models. 
In addition, the model was found to use fewer independent variables while showing higher accuracy. In our experiments, five variables, namely X7 (total debt), X9 (sales per employee), X13 (years after founded), X15 (accumulated earning to total asset), and X39 (the index related to the cash flows from operating activity), were found to be the most important factors in predicting corporate credit ratings. However, the values of the finally selected kernel parameters were found to be almost the same among the data subsets. To examine whether the predictive performance of GAMSVM was significantly greater than that of the other models, we used the McNemar test. As a result, we found that GAMSVM was better than MDA, MLOGIT, CBR, and ANN at the 1% significance level, and better than OAO and DAGSVM at the 5% significance level.
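
The McNemar test used above compares two classifiers on the same validation cases by looking only at the cases where exactly one of them is correct. The following is a minimal sketch, assuming the chi-square form with continuity correction (the abstract does not say which variant was used); the function name is hypothetical.

```python
from math import erfc, sqrt


def mcnemar(y_true, pred_a, pred_b):
    """Return (statistic, p_value) for the McNemar test with continuity correction."""
    triples = list(zip(y_true, pred_a, pred_b))
    # b: model A correct and model B wrong; c: model A wrong and model B correct.
    b = sum(1 for t, a, p in triples if a == t and p != t)
    c = sum(1 for t, a, p in triples if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree, so there is no evidence either way
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value
```

Because the test uses paired predictions, it needs the per-case outputs of both models rather than just their accuracies.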

Accuracy evaluation of microwave water surface current meter for measurement angles in middle flow condition (전자파표면유속계의 측정 각도에 따른 평수기 유속 측정 정확도 분석)

  • Son, Geunsoo;Kim, Dongsu;Kim, Kyungdong;Kim, Jongmin
    • Journal of Korea Water Resources Association / v.53 no.1 / pp.15-27 / 2020
  • Streamflow discharge, a fundamental riverine quantity, plays a crucial role in water resources management and therefore requires accurate in-situ measurement. Recent advances in instrumentation for streamflow discharge measurement have complemented or replaced classical devices and methods. Among various potential methods, surface current meters using microwaves have increasingly been applied not only to flood but also to normal flow discharge measurement, enabling practitioners to measure flow velocity remotely and safely without direct contact with the water. With minimal field preparation, this method has eased flood discharge measurement in difficult in-situ conditions such as extreme floods, because it actively emits 24.125 GHz microwaves and does not rely on natural light. In South Korea, a rectangular instrument named the Microwave Water Surface Current Meter (MWSCM) was developed and commercially released around 2010, and domestic agencies in charge of streamflow observation have paid attention to this approach, regarding it as a potential substitute. Although this new device has been highlighted for efficient flow measurement, there have been few noticeable efforts to evaluate its performance systematically and comprehensively under various measurement and riverine conditions, which has hindered its prompt and widespread use in practice. This study attempted to evaluate the MWSCM in terms of the instrument's monitoring configuration, particularly the tilt and yaw angles. When the instrument is pointed at measurement spots across a given cross-section, the observation campaign inevitably involves different tilt and yaw angles, which can be a major source of error for this type of instrument. 
Focusing on instrument configuration, the instrument was tested in a controlled outdoor river channel at the KICT River Experiment Center under a fixed flow condition with a steady flow supply: a flow speed of around 1 m/s, a channel width of 6 m, and a shallow flow depth of less than 1 m. Detailed velocity measurements with a SonTek micro-ADV were used for validation. The results showed that tilt angles of less than 15 degrees produced much higher deviation, and that the coefficient of variation increased in proportion to the yaw angle. Yaw angles affected accuracy in terms of measurement area.
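
To see why tilt and yaw angles enter the velocity estimate at all, it may help to write out the textbook Doppler relation for a radar aimed at a moving surface. The sketch below is an assumption-laden illustration, not the MWSCM's internal processing: it takes tilt as the angle between the beam and the water surface, yaw as the horizontal angle between the beam and the flow direction, and ignores beam width and surface roughness. Only the 24.125 GHz carrier frequency comes from the abstract.

```python
from math import cos, radians

SPEED_OF_LIGHT = 299_792_458.0  # m/s
CARRIER_HZ = 24.125e9           # microwave frequency stated in the abstract


def surface_velocity(doppler_shift_hz, tilt_deg, yaw_deg=0.0):
    """Convert a Doppler shift into along-flow surface velocity (idealized geometry)."""
    wavelength = SPEED_OF_LIGHT / CARRIER_HZ
    # Velocity component along the beam from the two-way Doppler relation.
    radial = doppler_shift_hz * wavelength / 2.0
    # Project back from the beam direction to the flow direction.
    return radial / (cos(radians(tilt_deg)) * cos(radians(yaw_deg)))
```

In this idealized form, any error in the assumed tilt or yaw angle propagates directly into the velocity through the cosine terms, which is one reason the instrument's angular configuration deserves a dedicated evaluation.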

Estimation of Near Surface Air Temperature Using MODIS Land Surface Temperature Data and Geostatistics (MODIS 지표면 온도 자료와 지구통계기법을 이용한 지상 기온 추정)

  • Shin, HyuSeok;Chang, Eunmi;Hong, Sungwook
    • Spatial Information Research / v.22 no.1 / pp.55-63 / 2014
  • Near surface air temperature data, one of the essential factors in hydrology, meteorology and climatology, have drawn a substantial amount of attention from various academic domains and societies. Meteorological observations, however, have high spatio-temporal constraints because of limits on their number and distribution over the earth's surface. To overcome such limits, many studies have sought to estimate the near surface air temperature from satellite image data at a regional or continental scale with simple regression methods. Alternatively, we applied various Kriging methods (ordinary Kriging, universal Kriging, cokriging, and regression Kriging) in search of an optimal estimation method, based on near surface air temperature data observed from automatic weather stations (AWS) in South Korea throughout 2010 (365 days) and MODIS land surface temperature (LST) data (MOD11A1, 365 images). Because of high spatial heterogeneity, auxiliary data such as land cover and a DEM (digital elevation model) were also analyzed to consider factors that can affect near surface air temperature. Prior to the main estimation, we calculated the root mean square error (RMSE) of temperature differences between the 365-day LST and AWS data by season and land cover. The results show that the coefficient of variation (CV) of RMSE by season is 0.86, whereas the equivalent CV by land cover is 0.00746. Seasonal differences between LST and AWS data were greater than those by land cover. Seasonal RMSE was the lowest in winter (3.72). The results from a linear regression analysis examining the relationship among AWS, LST, and auxiliary data show that the coefficient of determination was the highest in winter (0.818) and the lowest in summer (0.078), indicating a significant level of seasonal variation. Based on these results, we utilized a variety of Kriging techniques to estimate the surface temperature. 
The results of cross-validation in each Kriging model show that the measure of model accuracy was 1.71, 1.71, 1.848, and 1.630 for universal Kriging, ordinary Kriging, cokriging, and regression Kriging, respectively. The estimates from regression Kriging thus proved to be the most accurate among the Kriging methods compared.
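
Cross-validation of a spatial interpolator is commonly done by leaving out one station at a time and predicting it from the rest. The sketch below shows that loop under two stated assumptions: the accuracy measure is RMSE (the abstract does not name the measure used at this step), and inverse distance weighting stands in for Kriging purely to keep the example short. All function names are hypothetical.

```python
from math import hypot, sqrt


def idw_predict(stations, x, y, power=2.0):
    """Inverse distance weighting; a simple stand-in for a Kriging predictor."""
    numerator = denominator = 0.0
    for sx, sy, value in stations:
        distance = hypot(sx - x, sy - y)
        if distance == 0.0:
            return value  # exact hit on a station location
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


def loocv_rmse(stations, predict=idw_predict):
    """Leave-one-out cross-validation RMSE over (x, y, value) station records."""
    squared_error = 0.0
    for i, (x, y, observed) in enumerate(stations):
        others = stations[:i] + stations[i + 1:]
        squared_error += (predict(others, x, y) - observed) ** 2
    return sqrt(squared_error / len(stations))
```

Passing a different `predict` function with the same signature would let several interpolation methods be compared on identical held-out stations.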

Development of Prediction Equation of Diffusing Capacity of Lung for Koreans

  • Hwang, Yong Il;Park, Yong Bum;Yoon, Hyoung Kyu;Lim, Seong Yong;Kim, Tae-Hyung;Park, Joo Hun;Lee, Won-Yeon;Park, Seong Ju;Lee, Sei Won;Kim, Woo Jin;Kim, Ki Uk;Shin, Kyeong Cheol;Kim, Do Jin;Kim, Hui Jung;Kim, Tae-Eun;Yoo, Kwang Ha;Shim, Jae Jeong
    • Tuberculosis and Respiratory Diseases / v.81 no.1 / pp.42-48 / 2018
  • Background: The diffusing capacity of the lung is influenced by multiple factors such as age, sex, height, weight, ethnicity and smoking status. Although a prediction equation for the diffusing capacity of Koreans was proposed in the mid-1980s, this equation is not currently used. The aim of this study was to develop a new prediction equation for the diffusing capacity for Koreans. Methods: Using the data of the Korean National Health and Nutrition Examination Survey, a total of 140 nonsmokers with normal chest X-rays were enrolled in this study. Results: Using linear regression analysis, new prediction equations for diffusing capacity were developed. For men, the following equations were developed: carbon monoxide diffusing capacity (DLco) = -10.4433 - 0.1434 × age (years) + 0.2482 × height (cm); DLco/alveolar volume (VA) = 6.01507 - 0.02374 × age (years) - 0.00233 × height (cm). For women, the prediction equations were as follows: DLco = -12.8895 - 0.0532 × age (years) + 0.2145 × height (cm) and DLco/VA = 7.69516 - 0.02219 × age (years) - 0.01377 × height (cm). All equations were internally validated by the k-fold cross-validation method. Conclusion: In this study, we developed new prediction equations for the diffusing capacity of the lungs of Koreans. A further study is needed to validate the new prediction equations for diffusing capacity.
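
The four regression equations reported above can be written directly as functions. The coefficients are copied from the abstract; the function names and the "male"/"female" string arguments are hypothetical, and the output units are left unstated because the abstract does not give them.

```python
def predict_dlco(sex, age_years, height_cm):
    """Predicted DLco from the abstract's equations (sex is 'male' or 'female')."""
    if sex == "male":
        return -10.4433 - 0.1434 * age_years + 0.2482 * height_cm
    if sex == "female":
        return -12.8895 - 0.0532 * age_years + 0.2145 * height_cm
    raise ValueError("sex must be 'male' or 'female'")


def predict_dlco_va(sex, age_years, height_cm):
    """Predicted DLco/VA from the abstract's equations (sex is 'male' or 'female')."""
    if sex == "male":
        return 6.01507 - 0.02374 * age_years - 0.00233 * height_cm
    if sex == "female":
        return 7.69516 - 0.02219 * age_years - 0.01377 * height_cm
    raise ValueError("sex must be 'male' or 'female'")
```

As a worked case, a 40-year-old man who is 170 cm tall gives DLco = -10.4433 - 5.736 + 42.194 = 26.0147.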

A Comparative Study on Factors Affecting Satisfaction by Travel Purpose for Urban Demand Response Transport Service: Focusing on Sejong Shucle (도심형 수요응답 교통서비스의 통행목적별 만족도 영향요인 비교연구: 세종특별자치시 셔클(Shucle)을 중심으로)

  • Wonchul Kim;Woo Jin Han;Juntae Park
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.23 no.2 / pp.132-141 / 2024
  • In this study, the differences in user satisfaction and the variables influencing the satisfaction with demand response transport (DRT) by travel purpose were compared. The purpose of DRT travel was divided into commuting/school and shopping/leisure travel. A survey conducted on 'Shucle' users in Sejong City was used for the analysis and the least absolute shrinkage and selection operator (LASSO) regression analysis was applied to minimize the overfitting problems of the multilinear model. The results of the analysis confirmed the possibility that the introduction of the DRT service could eliminate the blind spot in the existing public transportation, reduce the use of private cars, encourage low-carbon and public transportation revitalization policies, and provide optimal transportation services to people who exhibit intermittent travel behaviors (e.g., elderly people, housewives, etc.). In addition, factors such as the waiting time after calling a DRT, travel time after boarding the DRT, convenience of using the DRT app, punctuality of expected departure/arrival time, and location of pickup and drop-off points were the common factors that positively influenced the satisfaction of users of the DRT services during their commuting/school and shopping/leisure travel. Meanwhile, the method of transfer to other transport modes was found to affect satisfaction only in the case of commuting/school travel, but not in the case of shopping/leisure travel. To activate the DRT service, it is necessary to consider the five influencing factors analyzed above. In addition, the differentiating factors between commuting/school and shopping/leisure travel were also identified. In the case of commuting/school travel, people value time and consider it to be important, so it is necessary to promote the convenience of transfer to other transport modes to reduce the total travel time. 
Regarding shopping/leisure travel, it is necessary to consider ways to create a facility that allows users to easily and conveniently designate the location of the pickup and drop-off point.
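
LASSO regression, which the study applies to limit overfitting, adds an L1 penalty that can shrink uninformative coefficients exactly to zero. A minimal sketch of the idea follows, assuming the objective (1/(2n))·||y − Xw||² + alpha·||w||₁ solved by cyclic coordinate descent with no intercept; this is a generic textbook formulation, not the authors' analysis code, and the names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the dead zone."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Fit LASSO weights for rows X and targets y by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm > 0 else 0.0
    return w
```

In a satisfaction survey setting, the features whose weights survive with nonzero values are the candidates for influencing factors, while a larger `alpha` removes more of them.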

Clickstream Big Data Mining for Demographics based Digital Marketing (인구통계특성 기반 디지털 마케팅을 위한 클릭스트림 빅데이터 마이닝)

  • Park, Jiae;Cho, Yoonho
    • Journal of Intelligence and Information Systems / v.22 no.3 / pp.143-163 / 2016
  • The demographics of Internet users are the most basic and important source for target marketing or personalized advertisements on digital marketing channels, which include email, mobile, and social media. However, it has gradually become difficult to collect the demographics of Internet users because their activities are anonymous in many cases. Although a marketing department can obtain demographics through online or offline surveys, these approaches are expensive, time-consuming, and likely to include false statements. Clickstream data is the record an Internet user leaves behind while visiting websites. As the user clicks anywhere on a webpage, the activity is logged in semi-structured website log files. Such data allows us to see which pages users visited, how long they stayed there, how often they visited, when they usually visited, which sites they prefer, what keywords they used to find the site, whether they made any purchases, and so forth. For this reason, some researchers have tried to infer the demographics of Internet users from their clickstream data. They derived various independent variables likely to be correlated with the demographics. The variables include search keywords; frequency and intensity by time, day, and month; variety of websites visited; text information from web pages visited; and so on. The demographic attributes to be predicted also vary from paper to paper, and cover gender, age, job, location, income, education, marital status, and presence of children. A variety of data mining methods, such as LSA, SVM, decision tree, neural network, logistic regression, and k-nearest neighbors, were used to build prediction models. However, previous research has not yet identified which data mining method is appropriate for predicting each demographic variable. Moreover, the independent variables studied so far need to be reviewed, combined as needed, and evaluated in order to build the best prediction model. 
The objective of this study is to choose the clickstream attributes most likely to be correlated with the demographics from the results of previous research, and then to identify which data mining method is best suited to predicting each demographic attribute. Among the demographic attributes, this paper focuses on predicting gender, age, marital status, residence, and job. From the results of previous research, 64 clickstream attributes are applied to predict the demographic attributes. The overall process of predictive model building is composed of four steps. In the first step, we create user profiles which include 64 clickstream attributes and 5 demographic attributes. The second step performs dimension reduction of the clickstream variables to address the curse of dimensionality and the overfitting problem. We utilize three approaches, which are based on decision tree, PCA, and cluster analysis. In the third step, we build alternative predictive models for each demographic variable. SVM, neural network, and logistic regression are used for modeling. The last step evaluates the alternative models in terms of model accuracy and selects the best model. For the experiments, we used clickstream data that represents 5 demographics and 16,962,705 online activities for 5,000 Internet users. IBM SPSS Modeler 17.0 was used for our prediction process, and 5-fold cross-validation was conducted to enhance the reliability of our experiments. The experimental results verify that there is a specific data mining method well suited to each demographic variable. For example, age prediction performs best when using decision tree-based dimension reduction and a neural network, whereas the prediction of gender and marital status is most accurate when applying SVM without dimension reduction. 
We conclude that the online behaviors of Internet users, captured through clickstream data analysis, can be used to predict their demographics and can thereby be utilized for digital marketing.
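
The k-fold cross-validation used in this study, and in several of the results above, partitions the samples so that each one is held out exactly once. The sketch below is a plain illustration of that partitioning, not what IBM SPSS Modeler does internally; the function name, the shuffling, and the seed are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A model is then fitted on each `train` set and scored on the matching `test` set, and the k scores are averaged to give a more stable accuracy estimate than a single split.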