Machine Learning Applications


An Improvement in K-NN Graph Construction using re-grouping with Locality Sensitive Hashing on MapReduce (MapReduce 환경에서 재그룹핑을 이용한 Locality Sensitive Hashing 기반의 K-Nearest Neighbor 그래프 생성 알고리즘의 개선)

  • Lee, Inhoe;Oh, Hyesung;Kim, Hyoung-Joo
    • KIISE Transactions on Computing Practices / v.21 no.11 / pp.681-688 / 2015
  • The k-nearest neighbor (k-NN) graph construction is an important operation with many web-related applications, including collaborative filtering, similarity search, and many others in data mining and machine learning. Despite its many elegant properties, the brute-force k-NN graph construction method has a computational complexity of $O(n^2)$, which is prohibitive for large-scale data sets. Thus, the (key, value)-based distributed framework MapReduce is increasingly used together with locality sensitive hashing (LSH), which is efficient for high-dimensional and sparse data. Following a two-stage strategy, we use LSH to divide users into small subsets and then compute the similarity between pairs within each subset using a brute-force method on MapReduce. The candidate-group generation stage is particularly important, since the brute-force calculation is performed in the following step; however, existing methods do not prevent large candidate groups. In this paper, we propose an efficient algorithm for approximate k-NN graph construction that regroups the candidate groups. Experimental results show that our approach is more effective than existing methods in terms of graph accuracy and scan rate.
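
As a concrete illustration of the two-stage idea, the sketch below is a minimal single-machine version in Python (not the paper's MapReduce implementation, and without its regrouping step): random-hyperplane LSH buckets the points, and brute-force cosine similarity is computed only inside each bucket. All names and parameters are illustrative.

```python
import numpy as np
from itertools import combinations

def lsh_knn_graph(X, k=5, n_planes=8, seed=0):
    """Approximate k-NN graph: hash points into LSH buckets using random
    hyperplanes (cosine LSH), then run brute force inside each bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, X.shape[1]))
    # Signature = which side of each random hyperplane a point falls on.
    sigs = (X @ planes.T > 0).astype(int)
    buckets = {}
    for i, sig in enumerate(map(tuple, sigs)):
        buckets.setdefault(sig, []).append(i)
    # Brute-force similarity only within each (small) candidate bucket.
    neighbors = {i: [] for i in range(len(X))}
    for members in buckets.values():
        for i, j in combinations(members, 2):
            sim = X[i] @ X[j] / (np.linalg.norm(X[i]) * np.linalg.norm(X[j]) + 1e-12)
            neighbors[i].append((sim, j))
            neighbors[j].append((sim, i))
    # Keep the k most similar candidates found for each point.
    return {i: [j for _, j in sorted(cands, reverse=True)[:k]]
            for i, cands in neighbors.items()}
```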

Design and Implementation of Memory-Centric Computing System for Big Data Analysis

  • Jung, Byung-Kwon
    • Journal of the Korea Society of Computer and Information / v.27 no.7 / pp.1-7 / 2022
  • Recently, as applications such as big data and machine learning programs that generate large amounts of data internally have become common, main memory alone is often insufficient, making it difficult to execute such programs quickly. In particular, the coronavirus outbreak created a need to determine more quickly whether an entire genome sequence has been genetically altered. When large-capacity data were processed on a computing system equipped with our self-developed memory-pool MOCA host adapter instead of being served from a conventional SSD, performance improved by 16% compared with the SSD-based system. In further benchmark tests, the I/O performance of the system equipped with the memory-pool MOCA host adapter was 92.8%, 80.6%, and 32.8% faster than the SSD system for the SortSampleBam, ApplyBQSR, and GatherBamFiles workflow tasks, respectively. For analyses of large amounts of data, such as genome-analysis pipelines, we judge that the computing system equipped with the memory-pool MOCA host adapter developed in this research can reduce the measurement delays that occur at runtime.

Domain Knowledge Incorporated Local Rule-based Explanation for ML-based Bankruptcy Prediction Model (머신러닝 기반 부도예측모형에서 로컬영역의 도메인 지식 통합 규칙 기반 설명 방법)

  • Soo Hyun Cho;Kyung-shik Shin
    • Information Systems Review / v.24 no.1 / pp.105-123 / 2022
  • Thanks to the remarkable success of Artificial Intelligence (A.I.) techniques, new possibilities for applying them to real-world problems have emerged. One prominent application is the bankruptcy prediction model, which is often used as a basic knowledge base for credit scoring models in the financial industry. As a result, there has been extensive research on how to improve the prediction accuracy of such models. However, despite their impressive performance, machine learning (ML)-based models are difficult to deploy because of their intrinsic opacity, especially in fields that require or value an explanation of the results the model produces. The financial domain is one area where explanations matter to stakeholders such as domain experts and customers. In this paper, we propose a novel approach that incorporates financial domain knowledge into local rule generation to provide instance-level explanations for a bankruptcy prediction model. The results show that the proposed method successfully selects and classifies the extracted rules based on their feasibility and the information they convey to users.
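
The abstract does not spell out the rule-generation algorithm itself; as a generic stand-in, the hedged sketch below extracts a local rule from a shallow surrogate decision tree fitted to perturbations around one instance. This is the common local-surrogate technique, not the authors' domain-knowledge-based rule selection, and every name in it is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def local_rule(model_predict, x, X_background, n_samples=2000, seed=0):
    """Fit a shallow surrogate tree on perturbations around instance x and
    return the rule path the surrogate uses to classify x (generic sketch)."""
    rng = np.random.default_rng(seed)
    scale = X_background.std(axis=0)
    # Perturb x with noise scaled to each feature's spread.
    Z = x + rng.normal(0.0, 0.3, size=(n_samples, x.size)) * scale
    y = model_predict(Z)                      # labels from the black-box model
    tree = DecisionTreeClassifier(max_depth=3, random_state=seed).fit(Z, y)
    t, node, rule = tree.tree_, 0, []
    while t.children_left[node] != -1:        # walk x's path down to a leaf
        f, thr = t.feature[node], t.threshold[node]
        if x[f] <= thr:
            rule.append(f"x[{f}] <= {thr:.3f}"); node = t.children_left[node]
        else:
            rule.append(f"x[{f}] > {thr:.3f}"); node = t.children_right[node]
    return rule
```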

Performance Evaluation and Analysis on Single and Multi-Network Virtualization Systems with Virtio and SR-IOV (가상화 시스템에서 Virtio와 SR-IOV 적용에 대한 단일 및 다중 네트워크 성능 평가 및 분석)

  • Jaehak Lee;Jongbeom Lim;Heonchang Yu
    • The Transactions of the Korea Information Processing Society / v.13 no.2 / pp.48-59 / 2024
  • As hardware functions that natively support virtualization have matured, user applications with diverse workloads can run efficiently on virtualization systems. SR-IOV is a virtualization support function that gives instances direct access to PCI devices, providing high I/O performance by minimizing hypervisor and operating-system intervention. With SR-IOV, network I/O acceleration can be realized in virtualization systems, which otherwise have relatively long I/O paths compared to bare-metal systems and frequent context switches between user and kernel space. To exploit the performance advantages of SR-IOV, network resource management policies that derive optimal network performance when SR-IOV is applied to an instance such as a virtual machine (VM) or container are being actively studied. This paper evaluates and analyzes the network performance of SR-IOV-based I/O acceleration against Virtio in terms of 1) network delay, 2) network throughput, 3) network fairness, 4) performance interference, and 5) multi-network behavior. The contributions of this paper are as follows. First, the network I/O paths of Virtio and SR-IOV in a virtualization system are clearly explained. Second, the network performance of Virtio and SR-IOV is evaluated and analyzed over various performance metrics. Third, the system overhead of SR-IOV networking and the potential for its optimization in a virtualization system with high VM density are experimentally confirmed. The experimental results and analysis are expected to inform network resource management policies for virtualization systems that operate network-intensive services such as smart factories, connected cars, deep learning inference models, and crowdsourcing.
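
As a rough illustration of the network-delay metric, here is a minimal TCP round-trip timer in Python; it assumes an echo server is already listening behind the interface under test (virtio- or SR-IOV-backed) and is not the benchmark harness used in the paper, which would typically rely on dedicated tools.

```python
import socket
import statistics
import time

def rtt_probe(host, port, n=100, payload=b"x" * 64):
    """Measure request/response round-trip times (microseconds) to an echo
    server reachable through the NIC under test. Assumes the echo reply
    arrives in a single segment, which holds for small payloads."""
    samples = []
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(n):
            t0 = time.perf_counter()
            s.sendall(payload)
            s.recv(len(payload))
            samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.mean(samples), statistics.median(samples)

# Usage (hypothetical guest address): rtt_probe("192.0.2.10", 7)
```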

Study on the Seismic Random Noise Attenuation for the Seismic Attribute Analysis (탄성파 속성 분석을 위한 탄성파 자료 무작위 잡음 제거 연구)

  • Jongpil Won;Jungkyun Shin;Jiho Ha;Hyunggu Jun
    • Economic and Environmental Geology / v.57 no.1 / pp.51-71 / 2024
  • Seismic exploration is one of the most widely used geophysical exploration methods, with applications such as resource development, geotechnical investigation, and subsurface monitoring. It is essential for interpreting the geological characteristics of the subsurface because it provides accurate images of subsurface strata. Typically, geological features are interpreted by visually analyzing seismic sections; recently, however, quantitative analysis of seismic data has been extensively researched to accurately extract and interpret target geological features. Seismic attribute analysis can provide quantitative information for geological interpretation based on seismic data, and it is therefore widely used in fields including the analysis of oil and gas reservoirs, the investigation of faults and fractures, and the assessment of shallow gas distributions. However, seismic attribute analysis is sensitive to noise within the seismic data, so additional noise attenuation is required to enhance its accuracy. In this study, four seismic noise attenuation methods are applied and compared to mitigate the random noise of poststack seismic data and enhance the attribute analysis results. FX deconvolution, DSMF, Noise2Noise, and DnCNN are applied to the Youngil Bay high-resolution seismic data to remove random noise, and energy, sweetness, and similarity attributes are then calculated from the denoised data. The characteristics of each noise attenuation method, the noise removal results, and the seismic attribute analysis results are analyzed qualitatively and quantitatively. Based on the advantages and disadvantages of each noise attenuation method and the characteristics of each seismic attribute, we propose a suitable noise attenuation method for improving the results of seismic attribute analysis.
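
Among the attributes mentioned, the energy attribute has a simple common definition: the sum of squared amplitudes within a sliding window along each trace. The NumPy sketch below computes it under that definition; the window length and lack of normalization are assumptions, not the paper's settings.

```python
import numpy as np

def energy_attribute(traces, win=11):
    """Sliding-window energy attribute for a poststack section.
    traces: 2D array of shape (n_samples, n_traces).
    Returns an array of the same shape: per-sample sum of squared
    amplitudes over a centered window of `win` samples."""
    sq = traces.astype(float) ** 2
    kernel = np.ones(win)
    # Convolve each trace (column) with a boxcar to sum squared amplitudes.
    return np.apply_along_axis(
        lambda t: np.convolve(t, kernel, mode="same"), 0, sq)
```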

2023 Survey on User Experience of Artificial Intelligence Software in Radiology by the Korean Society of Radiology

  • Eui Jin Hwang;Ji Eun Park;Kyoung Doo Song;Dong Hyun Yang;Kyung Won Kim;June-Goo Lee;Jung Hyun Yoon;Kyunghwa Han;Dong Hyun Kim;Hwiyoung Kim;Chang Min Park;Radiology Imaging Network of Korea for Clinical Research (RINK-CR)
    • Korean Journal of Radiology / v.25 no.7 / pp.613-622 / 2024
  • Objective: In Korea, radiology has been positioned towards the early adoption of artificial intelligence-based software as medical devices (AI-SaMDs); however, little is known about the current usage, implementation, and future needs of AI-SaMDs. We surveyed current trends and expectations for AI-SaMDs among members of the Korean Society of Radiology (KSR). Materials and Methods: An anonymous and voluntary online survey was open to all KSR members between April 17 and May 15, 2023. The survey focused on experiences of using AI-SaMDs, patterns of usage, levels of satisfaction, and expectations regarding the use of AI-SaMDs, including the roles of industry, government, and the KSR in the clinical use of AI-SaMDs. Results: Among the 370 respondents (response rate: 7.7% [370/4792]; 340 board-certified radiologists; 210 from academic institutions), 60.3% (223/370) had experience using AI-SaMDs. The two most common use cases among the respondents were lesion detection (82.1%, 183/223) and lesion diagnosis/classification (55.2%, 123/223), with the target imaging modalities being plain radiography (62.3%, 139/223), CT (42.6%, 95/223), mammography (29.1%, 65/223), and MRI (28.7%, 64/223). Most users were satisfied with AI-SaMDs (from 67.6% [115/170] for improvement of patient management to 85.1% [189/222] for performance). Regarding the expansion of clinical applications, most respondents expressed a preference for AI-SaMDs that assist in detection/diagnosis (77.0%, 285/370) and perform automated measurement/quantification (63.5%, 235/370). Most respondents indicated that future development of AI-SaMDs should focus on improving practice efficiency (81.9%, 303/370) and quality (71.4%, 264/370). Overall, 91.9% of the respondents (340/370) agreed that there is a need for education or guidelines driven by the KSR regarding the use of AI-SaMDs. Conclusion: The penetration rate of AI-SaMDs in clinical practice and the corresponding satisfaction levels were high among members of the KSR. Most AI-SaMDs have been used for lesion detection, diagnosis, and classification, and most respondents requested KSR-driven education or guidelines on the use of AI-SaMDs.

Neural Network Analysis of Determinants Affecting Purchase Decisions in Fashion Eyewear (신경망분석기법을 이용한 패션 아이웨어 구매결정요소에 관한 연구)

  • Kim Ji Min
    • The Journal of the Convergence on Culture Technology / v.10 no.5 / pp.163-171 / 2024
  • This study applies neural network analysis to examine the factors influencing the fashion eyewear purchasing decisions of women in their 30s and 40s, comparing the findings with those of traditional parametric analysis methods. In fashion, machine learning techniques are used for personalized recommendation systems, but research on such applications in Korea remains insufficient. By reanalyzing a 2017 study that used traditional quantitative methods, this study aims to confirm the utility of neural network methods. Notably, the classification accuracy for preferred sunglasses design is highest, at 86.2%, when the neural network is trained with the L-BFGS-B optimizer and the hyperbolic tangent activation function. The most critical factors influencing purchasing decisions were consumers' occupations and their pursuit of new styles; Korean sunglasses consumers are interpreted as preferring "safe changes." These findings hold for the selection of both frames and lenses. Traditional quantitative analysis suggests that the type of sunglasses preferred varies according to the group to which a consumer belongs, whereas neural network analysis predicts the preferred sunglasses for each individual, thereby facilitating the development of personalized sunglasses recommendation systems.
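
The reported configuration maps naturally onto a standard library setup, for instance scikit-learn's MLPClassifier with solver="lbfgs" (backed by L-BFGS-B) and a tanh activation. The sketch below is an assumed reconstruction, not the authors' code: the data variables and hidden-layer size are placeholders.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: survey features (e.g., occupation, style-seeking scores); y: preferred
# sunglasses design class. Names and layer size are assumptions.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), activation="tanh",
                  solver="lbfgs",  # scikit-learn's lbfgs uses L-BFGS-B
                  max_iter=1000, random_state=0),
)
# clf.fit(X_train, y_train); accuracy = clf.score(X_test, y_test)
```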

The Prediction of DEA based Efficiency Rating for Venture Business Using Multi-class SVM (다분류 SVM을 이용한 DEA기반 벤처기업 효율성등급 예측모형)

  • Park, Ji-Young;Hong, Tae-Ho
    • Asia Pacific Journal of Information Systems / v.19 no.2 / pp.139-155 / 2009
  • For the last few decades, many studies have tried to explore and unveil venture companies' success factors and unique features in order to identify the sources of such companies' competitive advantages over their rivals. Venture companies, generally making the best use of information technology, have tended to give high returns to investors, and many are therefore keen on attracting avid investors' attention. Investors generally make their investment decisions by carefully examining the evaluation criteria of the alternatives. To them, credit rating information provided by international rating agencies such as Standard and Poor's, Moody's, and Fitch is a crucial source on such pivotal concerns as a company's stability, growth, and risk status. However, this type of information is generated only for companies issuing corporate bonds, not for venture companies. Therefore, this study proposes a method for evaluating venture businesses, presenting empirical results based on the financial data of Korean venture companies listed on KOSDAQ in the Korea Exchange. We use a multi-class SVM to predict the DEA-based efficiency rating for venture businesses derived from the proposed method. Our approach sheds light on ways to locate efficient companies that generate high profits. Above all, to determine effective ways to evaluate a venture firm's efficiency, it is important to understand the major contributing factors of such efficiency. This paper therefore rests on two ideas for classifying which companies are the more efficient venture companies: i) constructing a DEA-based multi-class rating for the sample companies, and ii) developing a multi-class SVM-based efficiency prediction model for classifying all companies. First, Data Envelopment Analysis (DEA) is a non-parametric multiple input-output efficiency technique that measures the relative efficiency of decision making units (DMUs) using a linear programming-based model. It is non-parametric because it requires no assumption about the shape or parameters of the underlying production function. DEA has been widely applied to evaluating the relative efficiency of DMUs; recently, a number of DEA-based studies have evaluated the efficiency of various types of companies, such as internet companies and venture companies, and it has also been applied to corporate credit ratings. In this study we utilized DEA to sort venture companies into efficiency-based ratings. The Support Vector Machine (SVM), on the other hand, is a popular technique for solving data classification problems, and we employed it to classify the efficiency ratings of IT venture companies according to the DEA results. The SVM method was first developed by Vapnik (1995). As one of many machine learning techniques, SVM is grounded in statistical learning theory and has shown good generalization performance in classification tasks, resulting in numerous applications in many areas of business. SVM is essentially an algorithm that finds the maximum-margin hyperplane, that is, the hyperplane giving the maximum separation between classes; the support vectors are the points closest to this hyperplane. When the classes are not linearly separable, a kernel function can be used: the inputs are mapped from the original input space into a high-dimensional dot-product feature space in which a linear boundary can be found. Many studies have applied SVM to bankruptcy prediction, financial time-series forecasting, and credit rating estimation. In this study we employed SVM to develop a data mining-based efficiency prediction model, using the Gaussian radial basis function as the kernel. For multi-class SVM, we adopted the one-against-one binary classification approach and two all-together methods, proposed by Weston and Watkins (1999) and Crammer and Singer (2000), respectively. We used corporate information on 154 companies listed on the KOSDAQ market in the Korea Exchange, obtaining their 2005 financial information from KIS (Korea Information Service, Inc.). Using these data, we constructed a multi-class rating with DEA efficiency and built a data mining-based multi-class prediction model. Among the three multi-classification approaches, the Weston and Watkins method achieved the best hit ratio on the test data set. In multi-classification problems such as venture-business efficiency ratings, it is very useful for investors to know the class within a one-class error when the exact class is difficult to determine in the actual market. We therefore also present accuracy within one-class errors, where the Weston and Watkins method showed 85.7% accuracy on our test samples. We conclude that the DEA-based multi-class approach for venture businesses generates more information than a binary classification, whatever the efficiency level. We believe this model can help investors in decision making, as it provides a reliable tool for evaluating venture companies in the financial domain. For future research, we see the need to improve the variable selection process, the kernel parameter selection, generalization, and the sample size for the multi-class setting.
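
Of the three multi-class schemes discussed, the one-against-one approach with a Gaussian RBF kernel is directly available in common libraries; a minimal scikit-learn sketch with placeholder data (not the paper's KIS dataset) follows. The Weston-Watkins and Crammer-Singer all-together formulations require specialized solvers (scikit-learn's LinearSVC covers Crammer-Singer only in the linear case).

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: financial ratios of KOSDAQ venture firms; y: DEA-derived efficiency
# rating classes. Variable names are placeholders for illustration only.
ovo_svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", gamma="scale", C=1.0,
        decision_function_shape="ovo"),  # one-against-one multi-class
)
# ovo_svm.fit(X_train, y_train); pred = ovo_svm.predict(X_test)
```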

KB-BERT: Training and Application of Korean Pre-trained Language Model in Financial Domain (KB-BERT: 금융 특화 한국어 사전학습 언어모델과 그 응용)

  • Kim, Donggyu;Lee, Dongwook;Park, Jangwon;Oh, Sungwoo;Kwon, Sungjun;Lee, Inyong;Choi, Dongwon
    • Journal of Intelligence and Information Systems / v.28 no.2 / pp.191-206 / 2022
  • Recently, utilizing a pre-trained language model (PLM) has become the de facto approach to achieving state-of-the-art performance on various natural language tasks (called downstream tasks), such as sentiment analysis and question answering. However, like any other machine learning method, a PLM tends to depend on the data distribution seen during training and shows worse performance on unseen (out-of-distribution) domains. For this reason, there have been many efforts to develop domain-specific PLMs for fields such as the medical and legal industries. In this paper, we discuss the training of a finance-specific PLM for the Korean language and its applications. Our model, KB-BERT, is trained on a carefully curated financial corpus that includes domain-specific documents such as financial reports. We provide extensive performance evaluation results on three natural language tasks: topic classification, sentiment analysis, and question answering. Compared to state-of-the-art Korean PLMs such as KoELECTRA and KLUE-RoBERTa, KB-BERT shows comparable performance on general datasets based on common corpora such as Wikipedia and news articles, and it outperforms the compared models on finance-domain datasets that require finance-specific knowledge.
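
As an illustration of how such a PLM is applied to a downstream task like topic classification, the sketch below uses the Hugging Face transformers API. Since the abstract gives no public checkpoint identifier for KB-BERT, it loads KLUE-RoBERTa, one of the compared baselines, as a stand-in; the label count and example sentence are invented.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in Korean PLM; KB-BERT's own checkpoint identifier is not public
# in the abstract, so this is an assumption for illustration.
MODEL = "klue/roberta-base"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=3)  # 3 topic classes: an invented example setting

batch = tok(["금리 인상으로 대출 수요가 감소했다."],  # "Loan demand fell on rate hikes."
            padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits  # classification head over topics
print(logits.argmax(dim=-1))
```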

Ensemble of Nested Dichotomies for Activity Recognition Using Accelerometer Data on Smartphone (Ensemble of Nested Dichotomies 기법을 이용한 스마트폰 가속도 센서 데이터 기반의 동작 인지)

  • Ha, Eu Tteum;Kim, Jeongmin;Ryu, Kwang Ryel
    • Journal of Intelligence and Information Systems / v.19 no.4 / pp.123-132 / 2013
  • As smartphones are equipped with various sensors such as the accelerometer, GPS, gravity sensor, gyroscope, ambient light sensor, and proximity sensor, there has been much research on using these sensors to create valuable applications. Human activity recognition is one such application, motivated by welfare uses such as support for the elderly, measurement of calorie consumption, and analysis of lifestyles and exercise patterns. One challenge in using smartphone sensors for activity recognition is that the number of sensors used should be minimized to save battery power. When the number of sensors is restricted, it is difficult to build a highly accurate activity recognizer, because subtly different activities are hard to distinguish with only limited information; the difficulty becomes especially severe when the number of activity classes to be distinguished is large. In this paper, we show that a fairly accurate classifier distinguishing ten different activities can be built using data from only a single sensor, the smartphone accelerometer. Our approach to this ten-class problem is the ensemble of nested dichotomies (END) method, which transforms a multi-class problem into multiple two-class problems. END builds a committee of binary classifiers in a nested fashion using a binary tree. At the root of the tree, the set of all classes is split into two subsets by a binary classifier; at each child node, a subset of classes is again split into two smaller subsets by another binary classifier. Continuing in this way, we obtain a binary tree in which each leaf node contains a single class; this tree can be viewed as a nested dichotomy that makes multi-class predictions. Depending on how the classes are split at each node, the resulting tree differs, and because some classes may be correlated, a particular tree may perform better than others; however, the best tree can hardly be identified without deep domain knowledge. The END method copes with this by building multiple dichotomy trees randomly during learning and combining their predictions during classification, and it is generally known to perform well even when the base learner cannot model complex decision boundaries. As the base classifier at each node of a dichotomy, we used another ensemble classifier, the random forest. A random forest is built by repeatedly generating decision trees, each time with a different random subset of features on a bootstrap sample; by combining bagging with random feature-subset selection, it enjoys more diverse ensemble members than simple bagging. Overall, our ensemble of nested dichotomies can be seen as a committee of committees of decision trees that handles a multi-class problem with high accuracy. The ten activity classes distinguished in this paper are 'Sitting', 'Standing', 'Walking', 'Running', 'Walking Uphill', 'Walking Downhill', 'Running Uphill', 'Running Downhill', 'Falling', and 'Hobbling'. The features used for classification include not only the magnitude of the acceleration vector at each time point but also the maximum, minimum, and standard deviation of the vector magnitude within a time window covering the last 2 seconds. To compare the performance of END with other methods, accelerometer data were collected every 0.1 seconds for 2 minutes per activity from 5 volunteers. Of the 5,900 ($=5\times(60\times2-2)/0.1$) samples collected per activity (the data for the first 2 seconds are discarded because they lack time-window data), 4,700 were used for training and the rest for testing. Although 'Walking Uphill' is often confused with similar activities, END classified all ten activities with a fairly high accuracy of 98.4%, compared with 97.6%, 96.5%, and 97.6% for a decision tree, a k-nearest neighbor classifier, and a one-versus-rest support vector machine, respectively.
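
The described window features translate directly into code; a minimal NumPy sketch follows, assuming a 3-axis signal sampled every 0.1 s so that the 2-second window is 20 samples. The resulting feature rows could then feed a random-forest base classifier such as scikit-learn's RandomForestClassifier.

```python
import numpy as np

def window_features(acc, win=20):
    """acc: (n, 3) accelerometer samples at 10 Hz; win=20 covers the last
    2 seconds. Returns one row per time step from `win` onward: the current
    vector magnitude plus the max, min, and standard deviation of the
    magnitude over the trailing window, as described in the abstract."""
    mag = np.linalg.norm(acc, axis=1)
    feats = []
    for t in range(win, len(mag)):
        w = mag[t - win:t]
        feats.append([mag[t], w.max(), w.min(), w.std()])
    return np.array(feats)

# Usage sketch: X = window_features(acc_samples); labels align with X's rows,
# dropping the first 2 seconds, as in the paper's preprocessing.
```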