• Title/Summary/Keyword: MachineLearning

IPC Multi-label Classification based on Functional Characteristics of Fields in Patent Documents (특허문서 필드의 기능적 특성을 활용한 IPC 다중 레이블 분류)

  • Lim, Sora;Kwon, YongJin
    • Journal of Internet Computing and Services / v.18 no.1 / pp.77-88 / 2017
  • With the advent of a knowledge-based society in which information and knowledge create value, patents, the representative form of intellectual property, have become important, and their number keeps growing. To use this vast amount of patent information effectively, patents must be classified appropriately according to the technological topic of the invention. The IPC (International Patent Classification) is widely used for this purpose. Automatic IPC classification has been studied using data mining and machine learning algorithms to improve on the current practice of categorizing patent documents by hand. However, most previous research has focused on applying existing machine learning methods to patent documents rather than on the characteristics of the data or the structure of the documents. In this paper, we therefore propose using two structural fields, technical field and background, which we consider to influence patent classification; the two fields are selected based on the characteristics of patent documents and the role of each structural field. We also construct a multi-label classification model to reflect the fact that a patent document can have multiple IPC codes. Furthermore, we propose a method that classifies patent documents at the IPC subclass level, which comprises 630 categories, to investigate whether the IPC multi-label classification model can be applied in practice. The effect of the structural fields is examined using 564,793 patents registered in Korea, and a precision of 87.2% is obtained when the title, abstract, claims, technical field, and background are used. From these results, we verify that the technical field and background play an important role in improving the precision of IPC multi-label classification at the IPC subclass level.
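
The abstract reports precision for a multi-label task but does not say how it is averaged. The following is a minimal sketch, assuming micro-averaged precision over sets of predicted IPC subclass codes. The function name `micro_precision` and the toy label sets are illustrative assumptions, not taken from the paper.

```python
def micro_precision(true_labels, predicted_labels):
    """Micro-averaged precision over a list of documents.

    Each element is the set of labels (e.g. IPC subclass codes) of one document.
    Assumption: the paper's exact averaging scheme is not stated in the abstract.
    """
    correct = 0    # predicted labels that are also true labels
    predicted = 0  # all predicted labels
    for truth, pred in zip(true_labels, predicted_labels):
        correct += len(truth & pred)
        predicted += len(pred)
    return correct / predicted if predicted else 0.0


# Toy data (hypothetical): three patents, each with one or more IPC subclasses.
truth = [{"G06F", "G06N"}, {"H04L"}, {"A61K", "C07D"}]
pred = [{"G06F"}, {"H04L", "H04W"}, {"A61K", "C07D"}]
precision = micro_precision(truth, pred)
```

Representing each document's labels as a set keeps the multi-label case simple: a document with several IPC codes contributes one count per predicted code.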

Monitoring Ground-level SO2 Concentrations Based on a Stacking Ensemble Approach Using Satellite Data and Numerical Models (위성 자료와 수치모델 자료를 활용한 스태킹 앙상블 기반 SO2 지상농도 추정)

  • Choi, Hyunyoung;Kang, Yoojin;Im, Jungho;Shin, Minso;Park, Seohui;Kim, Sang-Min
    • Korean Journal of Remote Sensing / v.36 no.5_3 / pp.1053-1066 / 2020
  • Sulfur dioxide (SO2) is released primarily through industrial, residential, and transportation activities, and it forms secondary air pollutants through chemical reactions in the atmosphere. Long-term exposure to SO2 can harm the human body by causing respiratory or cardiovascular disease, which makes effective and continuous monitoring of SO2 crucial. In South Korea, SO2 is monitored at ground stations, but these do not provide spatially continuous information on SO2 concentrations. This research therefore estimated spatially continuous ground-level SO2 concentrations at 1 km resolution over South Korea through the synergistic use of satellite data and numerical models. A stacking ensemble approach, which fuses multiple machine learning algorithms at two levels (i.e., base and meta), was adopted for ground-level SO2 estimation using data from January 2015 to April 2019. Random forest and extreme gradient boosting were used as base models, and multiple linear regression was adopted as the meta-model. The cross-validation results showed that the meta-model improved performance by 25% compared to the base models, yielding a correlation coefficient of 0.48 and a root-mean-square error of 0.0032 ppm. In addition, the temporal transferability of the approach was evaluated on one year of data that was not used in model development. The spatial distribution of ground-level SO2 concentrations from the proposed model agreed with the general seasonality of SO2 and the temporal patterns of emission sources.
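
The two-level structure described above can be illustrated with a small standard-library sketch. It is not the authors' pipeline. As loudly labeled assumptions, a least-squares line and a 1-nearest-neighbour regressor stand in for random forest and extreme gradient boosting, and a one-parameter least-squares blend stands in for the multiple linear regression meta-model. All data and function names are made up.

```python
def fit_line(xs, ys):
    """Base model 1 (stand-in): ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest(xs, ys):
    """Base model 2 (stand-in): 1-nearest-neighbour regressor."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold(fit, xs, ys, k=3):
    """Predict every sample with a model trained without that sample's fold."""
    preds = [0.0] * len(xs)
    for fold in (list(range(i, len(xs), k)) for i in range(k)):
        held = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        for i in fold:
            preds[i] = model(xs[i])
    return preds


def fit_blend_weight(z1, z2, ys):
    """Meta level (simplified): least-squares w for y ~ w*z1 + (1 - w)*z2."""
    num = sum((a - b) * (y - b) for a, b, y in zip(z1, z2, ys))
    den = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return num / den if den else 0.5


def sse(preds, ys):
    """Sum of squared errors."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys))


# Hypothetical 1-D data; real inputs would be satellite and model variables.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9, 6.3, 6.8, 8.1]
z1 = out_of_fold(fit_line, xs, ys)     # base-level predictions, model 1
z2 = out_of_fold(fit_nearest, xs, ys)  # base-level predictions, model 2
w = fit_blend_weight(z1, z2, ys)
stacked = [w * a + (1 - w) * b for a, b in zip(z1, z2)]
```

The meta level is fitted on out-of-fold predictions rather than in-sample ones, so it learns how far to trust each base model on data that model has not seen. On those predictions the blend can never have a larger squared error than either base model alone.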

Bankruptcy Type Prediction Using A Hybrid Artificial Neural Networks Model (하이브리드 인공신경망 모형을 이용한 부도 유형 예측)

  • Jo, Nam-ok;Kim, Hyun-jung;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.21 no.3 / pp.79-99 / 2015
  • The prediction of bankruptcy has been studied extensively in accounting and finance. It can have an important impact on lending decisions and on the profitability of financial institutions in terms of risk management. Many researchers have focused on constructing more robust bankruptcy prediction models. Early studies primarily used statistical techniques such as multiple discriminant analysis (MDA) and logit analysis. However, many studies have demonstrated that artificial intelligence (AI) approaches, such as artificial neural networks (ANN), decision trees, case-based reasoning (CBR), and support vector machines (SVM), have outperformed statistical techniques since the 1990s on business classification problems, because statistical methods rest on rigid assumptions. Previous studies on corporate bankruptcy have largely developed prediction models based on financial ratios and have generally been concerned with predicting whether or not a firm will go bankrupt. Few studies address the specific types of bankruptcy, and most of those that do have reviewed the previous literature or performed a case study. This study therefore uses data mining techniques to develop a model that predicts the specific type of bankruptcy, as well as its occurrence, for Korean small- and medium-sized construction firms in terms of profitability, stability, and activity indices. Such a model would help firms take preventive action before bankruptcy occurs. We propose a hybrid approach using two artificial neural networks (ANNs) for the prediction of bankruptcy types. The first is a back-propagation neural network (BPN) model that uses supervised learning for bankruptcy prediction, and the second is a self-organizing map (SOM) model that uses unsupervised learning to classify bankruptcy data into several types.
Based on the constructed model, we predict the bankruptcy of companies by applying the BPN model to a validation set that was not used in developing the model. The bankruptcy cases predicted by the BPN model are then used to identify the specific types of bankruptcy. To interpret the clusters derived by the SOM model, we calculated, for each cluster, the average of the input variables selected through statistical tests. Each cluster represents a bankruptcy type derived from the data of bankrupt firms, and the input variables, which are financial ratios, are used to interpret the meaning of each cluster. The experimental results show that each of the five bankruptcy types has different characteristics in terms of financial ratios. Type 1 (severe bankruptcy) has inferior financial statements except for EBITDA (earnings before interest, taxes, depreciation, and amortization) to sales. Type 2 (lack of stability) has a low quick ratio, low stockholders' equity to total assets, and high total borrowings to total assets. Type 3 (lack of activity) has a slightly low total asset turnover and fixed asset turnover. Type 4 (lack of profitability) has low retained earnings to total assets and low EBITDA to sales, both of which are indices of profitability. Type 5 (recoverable bankruptcy) includes firms that are in relatively good financial condition compared to the other bankruptcy types even though they are bankrupt. Based on these findings, researchers and practitioners in the credit evaluation field can obtain more useful information about the types of corporate bankruptcy. In this paper, we used the financial ratios of firms to classify bankruptcy types. It is important to select input variables that correctly predict bankruptcy and meaningfully classify its type. In future work, we will include non-financial factors such as the size, industry, and age of the firms.
This will allow us to obtain more realistic clustering results for bankruptcy types by combining qualitative factors and reflecting the domain knowledge of experts.

Spatial Downscaling of Ocean Colour-Climate Change Initiative (OC-CCI) Forel-Ule Index Using GOCI Satellite Image and Machine Learning Technique (GOCI 위성영상과 기계학습 기법을 이용한 Ocean Colour-Climate Change Initiative (OC-CCI) Forel-Ule Index의 공간 상세화)

  • Sung, Taejun;Kim, Young Jun;Choi, Hyunyoung;Im, Jungho
    • Korean Journal of Remote Sensing / v.37 no.5_1 / pp.959-974 / 2021
  • The Forel-Ule Index (FUI) classifies the colors of inland water and seawater found in nature into 21 grades ranging from indigo blue to cola brown. Many studies have analyzed FUI in connection with eutrophication, water quality, and the light characteristics of water systems, and it has been suggested as a possible new water quality index that simultaneously contains optical information on water quality parameters. In this study, the 4 km FUI based on the Ocean Colour-Climate Change Initiative (OC-CCI) was spatially downscaled to a resolution of 500 m using Geostationary Ocean Color Imager (GOCI) data and Random Forest (RF) machine learning. The RF-derived FUI was then examined in terms of its correlation with various water quality parameters measured in coastal areas, its spatial distribution, and its seasonal characteristics. The results showed that the RF-derived FUI achieved higher accuracy (coefficient of determination (R2) = 0.81, root mean square error (RMSE) = 0.7784) than the GOCI-derived FUI estimated by Pitarch's OC-CCI FUI algorithm (R2 = 0.72, RMSE = 0.9708). The RF-derived FUI showed a high correlation with five water quality parameters, namely Total Nitrogen, Total Phosphorus, Chlorophyll-a, Total Suspended Solids, and Transparency, with correlation coefficients of 0.87, 0.88, 0.97, 0.65, and -0.98, respectively. The temporal pattern of the RF-derived FUI reflected its physical relationship with various water quality parameters well and showed strong seasonality. The findings suggest the potential of high-resolution FUI for coastal water quality management around the Korean Peninsula.
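
For readers unfamiliar with the two accuracy measures quoted above, the sketch below shows how they are commonly computed. It assumes R2 is defined as one minus the ratio of residual to total sum of squares; the abstract does not say whether this or the squared correlation was used. The FUI values are made up.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical reference FUI grades and downscaled estimates at four pixels.
reference_fui = [3.0, 5.0, 7.0, 9.0]
estimated_fui = [4.0, 5.0, 6.0, 9.0]
error = rmse(reference_fui, estimated_fui)
score = r_squared(reference_fui, estimated_fui)
```

RMSE is expressed in the same units as the index itself (FUI grades here), which is why it complements the unitless R2.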

Comparative Assessment of Linear Regression and Machine Learning for Analyzing the Spatial Distribution of Ground-level NO2 Concentrations: A Case Study for Seoul, Korea (서울 지역 지상 NO2 농도 공간 분포 분석을 위한 회귀 모델 및 기계학습 기법 비교)

  • Kang, Eunjin;Yoo, Cheolhee;Shin, Yeji;Cho, Dongjin;Im, Jungho
    • Korean Journal of Remote Sensing / v.37 no.6_1 / pp.1739-1756 / 2021
  • Atmospheric nitrogen dioxide (NO2) is mainly caused by anthropogenic emissions. It contributes to the formation of secondary pollutants and ozone through chemical reactions and adversely affects human health. Although ground stations monitor NO2 concentrations in real time in Korea, it is difficult to analyze the spatial distribution of NO2 concentrations from them, especially over areas with no stations. This study therefore conducted a comparative experiment on the spatial interpolation of NO2 concentrations for the year 2020, using two linear regression methods (i.e., multiple linear regression (MLR) and regression kriging (RK)) and two machine learning approaches (i.e., random forest (RF) and support vector regression (SVR)). The four approaches were compared using leave-one-out cross-validation (LOOCV). The daily LOOCV results showed that MLR, RK, and SVR produced an average daily index of agreement (IOA) of 0.57, which was higher than that of RF (0.50). The average daily normalized root mean square error of RK was 0.9483%, slightly lower than those of the other models. MLR, RK, and SVR showed similar seasonal distribution patterns, and the dynamic range of the resulting NO2 concentrations was similar across these three models, whereas that of RF was relatively small. The multivariate linear regression approaches are expected to be a promising method for the spatial interpolation of ground-level NO2 concentrations and other parameters in urban areas.
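
The evaluation scheme above can be sketched as follows: each station is held out in turn, its value is predicted from the remaining stations, and the predictions are scored with the index of agreement. Two assumptions are made here. The IOA is taken to be Willmott's index, which the abstract does not confirm. A simple inverse-distance-weighted interpolator stands in for the MLR, RK, RF, and SVR models of the study. Station coordinates and values are invented.

```python
def index_of_agreement(observed, predicted):
    """Willmott's index of agreement (assumed definition); 1 means perfect."""
    mean_obs = sum(observed) / len(observed)
    num = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    den = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
              for o, p in zip(observed, predicted))
    return 1.0 - num / den if den else 1.0


def idw_predict(stations, target_xy, power=2):
    """Stand-in interpolator: inverse-distance-weighted mean of station values."""
    weighted = []
    for (x, y), value in stations:
        d2 = (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2
        if d2 == 0:
            return value  # target coincides with a station
        weighted.append((1.0 / d2 ** (power / 2), value))
    total = sum(w for w, _ in weighted)
    return sum(w * v for w, v in weighted) / total


def leave_one_out(stations, predict):
    """Predict each station from all the others."""
    preds = []
    for i, (xy, _) in enumerate(stations):
        others = stations[:i] + stations[i + 1:]
        preds.append(predict(others, xy))
    return preds


# Hypothetical stations: ((x, y), NO2 concentration in arbitrary units).
stations = [((0.0, 0.0), 20.0), ((1.0, 0.0), 24.0), ((0.0, 1.0), 22.0),
            ((1.0, 1.0), 27.0), ((0.5, 0.5), 23.0)]
observed = [value for _, value in stations]
predicted = leave_one_out(stations, idw_predict)
ioa = index_of_agreement(observed, predicted)
```

Any model with the same `predict(stations, target_xy)` shape could be passed to `leave_one_out`, which is what makes the comparison between interpolation methods straightforward.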

A Comparative Evaluation of Multiple Meteorological Datasets for the Rice Yield Prediction at the County Level in South Korea (우리나라 시군단위 벼 수확량 예측을 위한 다종 기상자료의 비교평가)

  • Cho, Subin;Youn, Youjeong;Kim, Seoyeon;Jeong, Yemin;Kim, Gunah;Kang, Jonggu;Kim, Kwangjin;Cho, Jaeil;Lee, Yangwon
    • Korean Journal of Remote Sensing / v.37 no.2 / pp.337-357 / 2021
  • Because the growth of paddy rice is affected by meteorological factors, selecting appropriate meteorological variables is essential for building a rice yield prediction model. This paper examines the suitability of multiple meteorological datasets for rice yield modeling in South Korea over 1996-2019 and conducts a hindcast experiment for rice yield using a machine learning method that accounts for the nonlinear relationships between meteorological variables and rice yield. In addition to the ASOS in-situ observations, we used the CRU-JRA ver. 2.1 and ERA5 reanalyses. From these datasets, we extracted four common variables (air temperature, relative humidity, solar radiation, and precipitation) and analyzed the characteristics of each dataset and its association with rice yields. CRU-JRA ver. 2.1 showed overall agreement with the other datasets. While relative humidity had little relationship with rice yields, solar radiation showed a somewhat high correlation with them. Using the air temperature, solar radiation, and precipitation of July, August, and September, we built a random forest model for the hindcast experiments. The model with CRU-JRA ver. 2.1 showed the best performance, with a correlation coefficient of 0.772. Solar radiation had the highest importance among the variables in the prediction model, which accords with general agricultural knowledge. This paper has implications for selecting among multiple meteorological datasets for rice yield modeling.

Regeneration of a defective Railroad Surface for defect detection with Deep Convolution Neural Networks (Deep Convolution Neural Networks 이용하여 결함 검출을 위한 결함이 있는 철도선로표면 디지털영상 재 생성)

  • Kim, Hyeonho;Han, Seokmin
    • Journal of Internet Computing and Services / v.21 no.6 / pp.23-31 / 2020
  • This study generates varied images of railroad surfaces with random defects to serve as training data for better defect detection. Defects on railroad surfaces are caused by various factors, such as friction between track binding devices and adjacent tracks, and can cause accidents such as broken rails, so railroad maintenance for defects is necessary. Accordingly, much research has applied image processing or machine learning to railway surface images to automate railroad inspection and reduce maintenance costs. In general, the performance of image processing and machine learning methods is affected by the quantity and quality of data. For this reason, some studies require dedicated devices or vehicles that acquire images of the track surface at regular intervals to build a database of varied railway surface images. In contrast, to reduce the operating cost of image acquisition, this study constructed a 'Defective Railroad Surface Regeneration Model' by applying methods presented in related studies on the Generative Adversarial Network (GAN). We thereby aimed to detect defects on railroad surfaces even without a dedicated database. The model is designed to learn to generate railroad surfaces that combine different railroad surface textures with the original surface while taking the ground truth of the railroad defects into account. The generated images were used as training data for a defect detection network based on the Fully Convolutional Network (FCN). To validate its performance, we clustered the railroad data and divided it into three subsets: one subset of original railroad texture images and two subsets of other railroad surface texture images. In the first experiment, we used only the original texture images as the training set for the defect detection model.
In the second experiment, we trained on images generated by combining the original images with a few railroad textures from the other images. Each defect detection model was evaluated against the ground truth in terms of intersection over union (IoU) and F1-score. The scores increased by about 10-15% when the generated images were used, compared with using only the original images. This shows that defects can be detected using the existing data and a few different texture images, even for railroad surface images for which no dedicated training database has been constructed.
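
The two segmentation measures named above can be computed from pixel counts, as in this minimal sketch. It assumes binary masks in which 1 marks a defect pixel; the function name and the tiny masks are illustrative and not taken from the paper.

```python
def iou_and_f1(truth_mask, predicted_mask):
    """Intersection over union and F1-score for two binary masks (lists of rows)."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(truth_mask, predicted_mask):
        for t, p in zip(truth_row, pred_row):
            if t and p:
                tp += 1   # defect pixel correctly detected
            elif p:
                fp += 1   # predicted defect where there is none
            elif t:
                fn += 1   # missed defect pixel
    union = tp + fp + fn
    if union == 0:
        return 1.0, 1.0   # both masks empty: treat as perfect agreement
    return tp / union, 2 * tp / (2 * tp + fp + fn)


# Hypothetical 3x4 masks: 1 = defect pixel, 0 = background.
truth_mask = [[0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]]
predicted_mask = [[0, 1, 1, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 0]]
iou, f1 = iou_and_f1(truth_mask, predicted_mask)
```

Both measures ignore true-negative background pixels, which matters for defect detection because defects usually cover only a small part of the image.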

Gridded Expansion of Forest Flux Observations and Mapping of Daily CO2 Absorption by the Forests in Korea Using Numerical Weather Prediction Data and Satellite Images (국지예보모델과 위성영상을 이용한 극상림 플럭스 관측의 공간연속면 확장 및 우리나라 산림의 일일 탄소흡수능 격자자료 산출)

  • Kim, Gunah;Cho, Jaeil;Kang, Minseok;Lee, Bora;Kim, Eun-Sook;Choi, Chuluong;Lee, Hanlim;Lee, Taeyun;Lee, Yangwon
    • Korean Journal of Remote Sensing / v.36 no.6_1 / pp.1449-1463 / 2020
  • As global warming and climate change become more serious, CO2 absorption by forests is increasingly important for coping with greenhouse gas issues. The UN Framework Convention on Climate Change requires national CO2 absorption to be calculated at the local level in a more scientific and rigorous manner. This paper presents the gridded expansion of forest flux observations and the mapping of daily CO2 absorption by the forests in Korea using numerical weather prediction data and satellite images. To account for the sensitive daily changes in plant photosynthesis, we built a machine learning model that retrieves the daily RACA (reference amount of CO2 absorption) with reference to the climax forest in Gwangneung, and we adopted the NIFoS (National Institute of Forest Science) lookup table of CO2 absorption by forest type and age to produce daily AACA (actual amount of CO2 absorption) raster data that reflect the spatial variation of the forests in Korea. In an experiment covering the 1,095 days between Jan 1, 2013 and Dec 31, 2015, our RACA retrieval model showed high accuracy, with a correlation coefficient of 0.948. To achieve tier 3 daily statistics for AACA, long-term and detailed forest surveying should be combined with the model in the future.

Prediction of patent lifespan and analysis of influencing factors using machine learning (기계학습을 활용한 특허수명 예측 및 영향요인 분석)

  • Kim, Yongwoo;Kim, Min Gu;Kim, Young-Min
    • Journal of Intelligence and Information Systems / v.28 no.2 / pp.147-170 / 2022
  • Although the number of patents, one of the core outputs of technological innovation, continues to increase, the number of low-value patents has also increased greatly. Efficient evaluation of patents has therefore become important. The estimation of patent lifespan, which represents the private value of a patent, has been studied for a long time, but in most cases it has relied on linear models. Even where machine learning methods were used, the interpretation or explanation of the relationship between explanatory variables and patent lifespan was insufficient. In this study, patent lifespan (number of renewals) is predicted based on the idea that patent lifespan represents the value of a patent. For the research, 4,033,414 patents that were applied for between 1996 and 2017 and eventually granted were collected from the USPTO (US Patent and Trademark Office). To predict patent lifespan, we use variables that reflect the characteristics of the patent, the patent owner, and the inventor. We build four different models (Ridge Regression, Random Forest, Feed-Forward Neural Network, and Gradient Boosting Models) and tune their hyperparameters through 5-fold cross-validation. The performance of the resulting models is then evaluated, and the relative importance of the predictors is presented. In addition, based on the Gradient Boosting Model, which showed excellent performance, an Accumulated Local Effects plot is presented to visualize the relationship between the predictors and patent lifespan. Finally, we apply Kernel SHAP (SHapley Additive exPlanations) to present the reasons behind the evaluation of individual patents and discuss its applicability to patent evaluation systems.
This study is academically meaningful in that it contributes cumulatively to existing studies on patent lifespan estimation and addresses the limitations of linear models. It is also practically meaningful in that it suggests a method for deriving the basis for evaluating individual patent value and examines its applicability to patent evaluation systems.
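
Kernel SHAP approximates Shapley values, which attribute a single prediction to its input features. As a rough illustration of what is being approximated, the sketch below computes exact Shapley values by enumerating all feature orderings, which is feasible only for a handful of features. It is not Kernel SHAP itself. The toy model, the feature names (claims, citations, family size), and the all-zero baseline are hypothetical.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings.

    Features not yet 'switched on' are held at their baseline value.
    """
    n = len(instance)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for j in order:
            current[j] = instance[j]       # switch feature j to its real value
            updated = model(current)
            totals[j] += updated - previous
            previous = updated
    return [t / len(orders) for t in totals]


def toy_lifespan_model(features):
    """Hypothetical predictor of patent renewals, for illustration only."""
    claims, citations, family_size = features
    return 2.0 * claims + 1.0 * citations + 0.5 * claims * family_size


instance = [3.0, 4.0, 2.0]   # made-up patent
baseline = [0.0, 0.0, 0.0]   # made-up reference point
values = shapley_values(toy_lifespan_model, instance, baseline)
```

The contributions always sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term between claims and family size is split equally between those two features.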

Preliminary Inspection Prediction Model to select the on-Site Inspected Foreign Food Facility using Multiple Correspondence Analysis (차원축소를 활용한 해외제조업체 대상 사전점검 예측 모형에 관한 연구)

  • Hae Jin Park;Jae Suk Choi;Sang Goo Cho
    • Journal of Intelligence and Information Systems / v.29 no.1 / pp.121-142 / 2023
  • As the number and weight of imported foods steadily increase, safety management of imported food to prevent food safety accidents is becoming more important. The Ministry of Food and Drug Safety conducts on-site inspections of foreign food facilities before customs clearance, as well as import inspections at the customs clearance stage. However, because time, cost, and resources are limited, a data-based safety management plan for imported food is needed. In this study, we sought to increase the efficiency of on-site inspection by building a machine learning prediction model that pre-selects the companies expected to fail before the on-site inspection takes place. We collected basic information on 303,272 foreign food facilities and processing businesses from the Integrated Food Safety Information Network, together with 1,689 on-site inspection records from 2019 to April 2022. After preprocessing the foreign food facility data, only the records subject to on-site inspection were extracted using the foreign food facility_code. The resulting dataset consisted of 1,689 records and 103 variables. Of the 103 variables, those with a value of '0' on the Theil-U index were removed, and after dimensionality reduction with Multiple Correspondence Analysis, 49 characteristic variables were finally derived. We built eight different models and tuned their hyperparameters through 5-fold cross-validation. The performance of the resulting models was then evaluated. In selecting companies for on-site inspection, the objective is to maximize recall, the probability of judging nonconforming companies as nonconforming. Among the machine learning algorithms applied, the Random Forest model, which had the highest Recall_macro, AUROC, Average PR, F1-score, and Balanced Accuracy, was evaluated as the best model.
Finally, we apply Kernel SHAP (SHapley Additive exPlanations) to present the reasons why individual facilities were selected as nonconforming and discuss its applicability to the on-site inspection facility selection system. The results of this study are expected to contribute to the efficient use of limited resources such as manpower and budget by establishing an imported food management system built on a data-based, scientific risk management model.
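
Because nonconforming facilities are rare, plain accuracy can look good while most of them are missed, which is presumably why recall is emphasized above. The sketch below computes macro-averaged recall, assuming that "Recall_macro" means the unweighted mean of per-class recall. The labels are invented, with 1 standing for nonconforming.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical inspection outcomes: 1 = nonconforming, 0 = conforming.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
score = macro_recall(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

In this toy case the accuracy is 0.75, yet only half of the nonconforming facilities are found, and the macro-averaged recall of about 0.67 reflects that shortfall more directly.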