• Title/Summary/Keyword: feature model validation

MicroRNA prediction using Bayesian network with biologically relevant feature set (생물학적으로 의미 있는 특질에 기반한 베이지안 네트웍을 이용한 microRNA의 예측)

  • Nam, Jin-Wu;Park, Jong-Sun;Zhang, Byoung-Tak
    • Proceedings of the Korean Information Science Society Conference / 2006.10a / pp.53-58 / 2006
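  • MicroRNAs (miRNAs) are small RNA fragments of about 22 nt that are ultimately produced from precursors with a stem-loop structure. miRNAs bind complementarily to the 3' UTR of an mRNA, repressing gene expression or promoting mRNA degradation. Experimental methods for identifying miRNAs are limited by tissue-specific expression and low expression levels. Computational methods can overcome these limitations to some extent. However, the low sequence conservation of miRNAs makes homology-based prediction difficult. Machine learning methods such as support vector machines (SVM) and naive Bayes have also been applied, but they did not provide a generative model whose biological meaning can be interpreted. In this study, we apply a Bayesian network, which not only predicts miRNAs well but also allows biological knowledge to be obtained from the learned model. For this purpose, selecting biologically meaningful features is important. We chose the position weighted matrix (PWM), Markov chain probability (MCP), loop size, number of bulges, spectrum, free energy profile, and others as candidate features, and then used information-gain feature selection to select the final 25 and 27 features that contribute most to prediction. We trained Bayesian networks on these features and assessed miRNA prediction performance with 10-fold cross-validation. The prediction accuracies for pre-miRNA and mature miRNA were 99.99% and 100.00%, respectively, higher than those of the SVM and naive Bayes methods, and from the learned Bayesian network we could analyze dependencies within pre-miRNAs that agree with previous findings.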
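
The information-gain feature selection mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes every feature has already been discretized into a small set of values; the function names and the dictionary layout of the feature table are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels inside each feature-value group.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_features(table, labels, n_keep):
    """Rank features (name -> column of values) by information gain, keep the best n_keep."""
    gains = {name: information_gain(column, labels) for name, column in table.items()}
    return sorted(gains, key=gains.get, reverse=True)[:n_keep]
```

Calling `top_features(table, labels, 25)` on a table of candidate features would return the 25 highest-ranked feature names under these assumptions; continuous features such as a free energy profile would need binning first.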

A Study on the Drug Classification Using Machine Learning Techniques (머신러닝 기법을 이용한 약물 분류 방법 연구)

  • Anmol Kumar Singh;Ayush Kumar;Adya Singh;Akashika Anshum;Pradeep Kumar Mallick
    • Advanced Industrial SCIence / v.3 no.2 / pp.8-16 / 2024
  • This paper presents a drug classification system whose goal is to predict the appropriate drug for a patient from demographic and physiological traits. The dataset contains the attributes Age, Sex, BP (blood pressure), Cholesterol level, and Na_to_K (sodium-to-potassium ratio), and the objective is to determine the kind of drug being given. The models used are K-Nearest Neighbors (KNN), Logistic Regression, and Random Forest. Each model was trained and tested on the dataset, and GridSearchCV with 5-fold cross-validation was used to tune the hyperparameters. Performance with and without hyperparameter tuning was assessed using accuracy, confusion matrices, and classification reports. The accuracy of the models was 0.7, 0.875, and 0.975 without GridSearchCV and 0.75, 1.0, and 0.975 with GridSearchCV. According to the GridSearchCV results, Logistic Regression is the most suitable of the three models for drug classification, followed by K-Nearest Neighbors. Na_to_K is also an essential feature in predicting the outcome.
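
The hyperparameter search with 5-fold cross-validation described above can be sketched without any library. This is a minimal pure-Python stand-in for the idea behind GridSearchCV, not the scikit-learn implementation and not the authors' code; the helper names, the contiguous fold layout, and the single tuned parameter (the number of neighbors in KNN) are all assumptions made for illustration.

```python
from collections import Counter


def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous test folds of near-equal size."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=5):
    """Mean accuracy of k-NN over n_folds cross-validation folds."""
    scores = []
    for test_idx in k_fold_indices(len(X), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def grid_search(X, y, k_values, n_folds=5):
    """Return (best_k, best_score) over the candidate neighbor counts."""
    results = {k: cv_accuracy(X, y, k, n_folds) for k in k_values}
    best_k = max(results, key=results.get)
    return best_k, results[best_k]
```

In practice the rows would usually be shuffled or stratified before splitting, and numeric attributes such as Age and Na_to_K would be scaled so that no single attribute dominates the distance.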

Regional Realtime Ocean Tide and Storm-surge Simulation for the South China Sea (남중국해 지역 실시간 해양 조석 및 폭풍해일 시뮬레이션)

  • Kim, Kyeong Ok;Choi, Byung Ho;Lee, Han Soo;Yuk, Jin-Hee
    • Journal of Korean Society of Coastal and Ocean Engineers / v.30 no.2 / pp.69-83 / 2018
  • The South China Sea (SCS) is a typical marginal sea characterized by a deep basin, shelf break, shallow shelf, many straits, and complex bathymetry. This study investigated tidal characteristics and propagation and reproduced typhoon-induced storm surge in the region using a regional real-time tide-surge model. The model is based on an unstructured grid that resolves the region of interest in detail, and it is forced by tide at the open boundary and by wind and air pressure at the surface. Typhoon Haiyan, which occurred in 2013 and caused great damage in the Philippines, was chosen as a case study of typhoon impact. The amplitudes and phases of four major constituents were generally reproduced reasonably well, and the tidal distributions of the four constituents were similar to those in previous studies. The modelled tide appeared to be within acceptable levels, considering that previous studies found the tide in this region difficult to reproduce. The free oscillation experiment described well the prevalence of the diurnal tide in the SCS. The tidal residual current and total energy dissipation were discussed to understand the tidal and sedimentary environments. The storm surge caused by Typhoon Haiyan was reasonably simulated using this modeling system. Through validation of the model and an understanding of the tidal characteristics, this study established a regional real-time barotropic tide/water level prediction system for the South China Sea, including the seas around the Philippines.

The Intelligent Determination Model of Audience Emotion for Implementing Personalized Exhibition (개인화 전시 서비스 구현을 위한 지능형 관객 감정 판단 모형)

  • Jung, Min-Kyu;Kim, Jae-Kyeong
    • Journal of Intelligence and Information Systems / v.18 no.1 / pp.39-57 / 2012
  • With the recent introduction of high-tech equipment, attention has focused on interactive exhibits, which can double the exhibition effect through interaction with the audience. An interactive exhibition also makes it possible to measure a variety of audience reactions. Among these reactions, this research uses changes in facial features that can be collected in an interactive exhibition space. It develops an artificial neural network-based model that predicts the audience's response by measuring the change in facial features when a stimulus is given to an audience member in a non-excited state. The emotional state of the audience is represented with a Valence-Arousal model. The research proposes an overall framework composed of six steps. The first step collects data for modeling; the data were collected from people who participated in the 2012 Seoul DMC Culture Open and were used for the experiments. The second step extracts 64 facial features from the collected data and compensates the facial feature values. The third step generates the independent and dependent variables of the artificial neural network model. The fourth step uses a statistical technique to extract the independent variables that affect the dependent variable. The fifth step builds the artificial neural network model and trains it using the training and test sets. The sixth step validates the prediction performance of the model using the validation data set. The proposed model was compared with a statistical predictive model to see whether it performed better. Although the data set in this experiment was noisy, the proposed model showed better results than a multiple regression analysis model. 
If the prediction model of audience reaction is used in a real exhibition, it will be possible to provide countermeasures and services appropriate to the audience's reaction to the exhibits. Specifically, if the audience's arousal toward an exhibit is low, action can be taken to increase it, for instance by recommending other preferred content or by using light or sound to draw attention to the exhibit. Future exhibitions could also be planned to satisfy various audience preferences, and the model is expected to foster a personalized environment for concentrating on the exhibits. However, the proposed model still shows low prediction accuracy, for the following reasons. First, the data cover diverse visitors to a real exhibition, so it was difficult to control the experimental environment; the collected data are therefore noisy, which lowers accuracy. In further research, data will be collected in a better controlled experimental environment to increase the prediction accuracy of the model. Second, changes in facial expression alone are thought to be insufficient for extracting audience emotions. Combining facial expression with other responses, such as sound and audience behavior, should give better results.

An Electric Load Forecasting Scheme with High Time Resolution Based on Artificial Neural Network (인공 신경망 기반의 고시간 해상도를 갖는 전력수요 예측기법)

  • Park, Jinwoong;Moon, Jihoon;Hwang, Eenjun
    • KIPS Transactions on Software and Data Engineering / v.6 no.11 / pp.527-536 / 2017
  • With the recent development of the smart grid industry, the need for an efficient EMS (Energy Management System) has increased. In particular, reducing electric load and energy cost requires sophisticated electric load forecasting and an efficient smart grid operation strategy. In this paper, for more accurate electric load forecasting, we extend the data collected at demand time into high time resolution and construct an artificial neural network-based forecasting model appropriate for the high time resolution data. Furthermore, because machine learning methods cannot reflect the periodicity of time series data in sequence form, we transform such data into continuous data in two-dimensional space. To account for external factors such as temperature and humidity at the same time resolution, we estimate their values using linear interpolation. We then apply the PCA (Principal Component Analysis) algorithm to the feature vector composed of external factors to remove data that have little correlation with the power data. Finally, we evaluate our model through 5-fold cross-validation. The results show that forecasting at higher time resolution improves accuracy, and the best error rate of 3.71% was achieved at the 3-min resolution.
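
Two preprocessing steps from this abstract, linear interpolation of external factors to a finer time resolution and a two-dimensional continuous representation of periodic time, can be sketched as follows. The abstract does not give the exact mapping, so the cosine/sine encoding here is an assumed, commonly used choice, and both function names are hypothetical.

```python
import math


def upsample_linear(values, factor):
    """Linearly interpolate a series so each original interval is split into `factor` steps."""
    out = []
    for a, b in zip(values, values[1:]):
        for step in range(factor):
            out.append(a + (b - a) * step / factor)
    out.append(values[-1])  # keep the final original sample
    return out


def encode_time_of_day(minute, period=1440):
    """Map a minute of the day onto the unit circle so 23:59 and 00:00 end up adjacent."""
    angle = 2 * math.pi * (minute % period) / period
    return math.cos(angle), math.sin(angle)
```

For example, `upsample_linear(temperatures, 5)` would turn readings taken every 15 minutes into an estimate every 3 minutes, and the encoded pair can replace a raw timestamp as two model inputs.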

Schematic Cost Estimation Method using Case-Based Reasoning: Focusing on Determining Attribute Weight (사례기반추론을 이용한 초기단계 공사비 예측 방법: 속성 가중치 산정을 중심으로)

  • Park, Moon-Seo;Seong, Ki-Hoon;Lee, Hyun-Soo;Ji, Sae-Hyun;Kim, Soo-Young
    • Korean Journal of Construction Engineering and Management / v.11 no.4 / pp.22-31 / 2010
  • Because the cost estimated at an early stage greatly influences the project owner's decisions, the importance of early cost estimation is increasing. However, owing to the shortage of information, it depends mainly on the experience and knowledge of the estimator. This tendency led to the case-based reasoning (CBR) method, which solves new problems by adapting previous solutions to similar past problems. The performance of a CBR model is affected by attribute weights, so they must be determined accurately. Previous research relied on mathematical methods or the subjective judgement of the estimator. To address this problem, this study proposes a CBR schematic cost estimation method that uses a genetic algorithm to determine attribute weights. The cost model employs nearest neighbor retrieval to select past cases and estimates the cost of a new case from the cost information of the retrieved cases. Validation on 17 test cases yielded an error rate of 3.57%. This is better than the accuracy rate proposed by AACE and than methods that determine attribute weights using multiple regression analysis or feature counting. The CBR cost estimation method thus improves accuracy by introducing a genetic algorithm for attribute weighting. Moreover, it makes the problem-solving process easier for users to understand than other artificial intelligence methods, and it finds a solution in a short time through the case retrieval algorithm.
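
The retrieval step of such a CBR cost model can be sketched as a weighted nearest neighbor search. This is an illustration only: the attribute weights are passed in by hand rather than found by a genetic algorithm, attributes are assumed to be numeric and normalized to the range 0 to 1, and the case layout (`attrs`, `cost`) and the averaging of retrieved costs are hypothetical choices, not the paper's method.

```python
def weighted_distance(attrs_a, attrs_b, weights):
    """Weighted Euclidean distance between two normalized attribute vectors."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, attrs_a, attrs_b)) ** 0.5


def retrieve(cases, query, weights, k):
    """Return the k past cases whose attributes are closest to the query."""
    ranked = sorted(cases, key=lambda c: weighted_distance(c["attrs"], query, weights))
    return ranked[:k]


def estimate_cost(cases, query, weights, k=2):
    """Estimate the cost of a new case as the mean cost of its k nearest past cases."""
    nearest = retrieve(cases, query, weights, k)
    return sum(c["cost"] for c in nearest) / len(nearest)
```

A larger weight makes differences in that attribute count more heavily during retrieval, which is why the choice of weights directly affects which past cases are selected and therefore the estimate.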

Development of a Korean Speech Recognition Platform (ECHOS) (한국어 음성인식 플랫폼 (ECHOS) 개발)

  • Kwon Oh-Wook;Kwon Sukbong;Jang Gyucheol;Yun Sungrack;Kim Yong-Rae;Jang Kwang-Dong;Kim Hoi-Rin;Yoo Changdong;Kim Bong-Wan;Lee Yong-Ju
    • The Journal of the Acoustical Society of Korea / v.24 no.8 / pp.498-504 / 2005
  • We introduce a Korean speech recognition platform (ECHOS) developed for education and research purposes. ECHOS lowers the entry barrier to speech recognition research and can be used as a reference engine by providing elementary speech recognition modules. It has a simple object-oriented architecture, implemented in the C++ language with the standard template library. The input to ECHOS is digital speech data sampled at 8 or 16 kHz. Its output is the 1-best recognition result, N-best recognition results, and a word graph. The recognition engine is composed of MFCC/PLP feature extraction, HMM-based acoustic modeling, n-gram language modeling, and finite state network (FSN)- and lexical tree-based search algorithms. It can handle various tasks from isolated word recognition to large vocabulary continuous speech recognition. We compare the performance of ECHOS and the hidden Markov model toolkit (HTK) for validation. In an FSN-based task, ECHOS shows similar word accuracy, while the recognition time is doubled because of the object-oriented implementation. For an 8000-word continuous speech recognition task, using a lexical tree search algorithm different from the algorithm used in HTK, it increases the word error rate by a relative 40% but halves the recognition time.

CNN-LSTM-based Upper Extremity Rehabilitation Exercise Real-time Monitoring System (CNN-LSTM 기반의 상지 재활운동 실시간 모니터링 시스템)

  • Jae-Jung Kim;Jung-Hyun Kim;Sol Lee;Ji-Yun Seo;Do-Un Jeong
    • Journal of the Institute of Convergence Signal Processing / v.24 no.3 / pp.134-139 / 2023
  • Rehabilitation patients undergo outpatient treatment and perform daily rehabilitation exercises to recover physical function, with the aim of quickly returning to society after surgical treatment. Unlike exercising in a hospital with the help of a professional therapist, performing rehabilitation exercises alone on a daily basis poses many difficulties for the patient. In this paper, we propose a CNN-LSTM-based real-time monitoring system for upper limb rehabilitation so that patients can perform rehabilitation efficiently and with correct posture on a daily basis. The proposed system measures biological signals through shoulder-mounted hardware equipped with EMG and IMU sensors, applies preprocessing and normalization, and uses the results as the training dataset. The implemented model consists of three convolution stacks with three pooling layers for feature detection and two LSTM layers for classification, and we confirmed a learning result of 97.44% on the validation data. In a comparative evaluation with Teachable Machine, the model scored 93.6% and Teachable Machine 94.4%, so the two showed similar classification performance.

Development of the Cloud Monitoring Program using Machine Learning-based Python Module from the MAAO All-sky Camera Images (기계학습 기반의 파이썬 모듈을 이용한 밀양아리랑우주천문대 전천 영상의 운량 모니터링 프로그램 개발)

  • Gu Lim;Dohyeong Kim;Donghyun Kim;Keun-Hong Park
    • Journal of the Korean earth science society / v.45 no.2 / pp.111-120 / 2024
  • Cloud coverage is a key factor in determining whether to proceed with observations. In the past, human judgment played an important role in weather evaluation for observations. However, the development of remote and robotic observation has diminished the role of human judgment. Moreover, it is not easy to evaluate weather conditions automatically because of the diverse shapes of clouds and their rapid movement. In this paper, we present the development of a cloud monitoring program that applies the machine learning-based Python module "cloudynight" to all-sky camera images obtained at Miryang Arirang Astronomical Observatory (MAAO). The machine learning model was built by training on 39,996 subregions divided from 1,212 images with altitude/azimuth angles and extracting 16 feature spaces. For our trained model, the F1-score on the validation samples was 0.97, indicating good performance in identifying clouds in the all-sky image. The program calculates "Cloudiness" as the ratio of the number of subregions predicted to be covered by clouds to the total number of subregions. For robotic observation, we set a policy that halts the telescope system when the "Cloudiness" exceeds 0.6 during the last 30 minutes. Following this policy, we found no improper halts of the telescope system due to incorrect program decisions. We expect that robotic observation with the 0.7 m telescope at MAAO can be operated successfully using the cloud monitoring program.
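
The halting policy in this abstract can be sketched in a few lines. The sketch takes "Cloudiness" to be the fraction of subregions predicted cloudy, the reading under which a 0.6 threshold is meaningful, and it assumes that "exceeds 0.6 during the last 30 minutes" means every reading in that window is above the threshold. Both interpretations, and the function names, are assumptions; the per-subregion predictions would come from the trained classifier, which is not shown.

```python
def cloudiness(cloudy_flags):
    """Fraction of subregions in one all-sky image that were predicted cloudy."""
    return sum(cloudy_flags) / len(cloudy_flags)


def should_halt(recent_cloudiness, threshold=0.6):
    """Halt only if every cloudiness reading in the recent window exceeds the threshold."""
    return bool(recent_cloudiness) and all(c > threshold for c in recent_cloudiness)
```

The figures in the abstract imply 33 subregions per image (39,996 / 1,212), so a frame with 22 cloudy subregions would give a cloudiness of about 0.67. Requiring the whole window to exceed the threshold keeps a single cloudy frame from stopping the telescope.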

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.23-45 / 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales, and SNS, and dataset characteristics are correspondingly diverse. To secure competitiveness, companies need to improve their decision-making capacity using classification algorithms. However, most of them lack sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm suits the characteristics of a dataset has been a task requiring expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features that reflect the characteristics of multi-class data. The purpose of this study is therefore to analyze empirically whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. Meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among them, we included the Herfindahl-Hirschman Index (HHI), originally a measure of market concentration, to replace the IR (Imbalanced Ratio). We also developed a new index called the Reverse ReLU Silhouette Score and added it to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), Contraceptive Method Choice) were selected. The classes of each dataset were predicted using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross-validation. 
Oversampling of 10% to 100% was applied for each fold, and the meta-features of the dataset were measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. F1-score was selected as the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The HHI meta-feature proposed in this study was significant for classification performance. (2) The number of variables has a significant effect on classification performance and, unlike the number of classes, the effect is positive. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at the 0.01 significance level. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. The results of the analysis by classification algorithm were also consistent. In the regression analysis by classification algorithm, the number of variables has no significant effect for the Naïve Bayes algorithm, unlike the other classification algorithms. This study makes two theoretical contributions: (1) two new meta-features (HHI and Reverse ReLU Silhouette Score) were shown to be significant; (2) the effects of data characteristics on classification performance were investigated using meta-features. As practical contributions, (1) the results can be used to develop a system that recommends classification algorithms according to the characteristics of datasets. 
(2) Because data characteristics differ, many data scientists search for the optimal algorithm for a given situation by repeatedly testing and adjusting algorithm parameters, a process that wastes hardware, cost, time, and manpower. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. The paper consists of an introduction, related research, research model, experiment, and conclusion and discussion.
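
As an illustration of how HHI can serve as a class-imbalance meta-feature, the sketch below computes it in its standard form, the sum of squared shares, applied to class proportions. Whether the paper scales or normalizes the index differently is not stated in the abstract, so the exact form here is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def hhi(labels):
    """Herfindahl-Hirschman Index of a label list: sum of squared class proportions."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())
```

Under this definition a perfectly balanced dataset with k classes scores 1/k and a dataset containing a single class scores 1.0, so higher values indicate stronger concentration in a few classes. Unlike a ratio between the largest and smallest class, it uses the proportions of all classes.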