• Title/Summary/Keyword: KNN


Corpus of Eye Movements in L3 Spanish Reading: A Prediction Model

  • Hui-Chuan Lu;Li-Chi Kao;Zong-Han Li;Wen-Hsiang Lu;An-Chung Cheng
    • Asia Pacific Journal of Corpus Research
    • /
    • v.5 no.1
    • /
    • pp.23-36
    • /
    • 2024
  • This research centers on the Taiwan Eye-Movement Corpus of Spanish (TECS), a specially created corpus comprising eye-tracking data from Chinese-speaking learners of Spanish as a third language in Taiwan. Its primary purpose is to explore the broad utility of TECS in understanding language learning processes, particularly the initial stages of language learning. Constructing this corpus involves gathering data on eye-tracking, reading comprehension, and language proficiency to develop a machine-learning model that predicts learner behaviors and is subsequently validated with a predictability test. The focus is on examining attention in input processing and its relationship to language learning outcomes. The TECS eye-tracking data consist of indicators derived from eye movement recordings made while learners read Spanish sentences with temporal references. These indicators are obtained from eye movement experiments focusing on tense verbal inflections and temporal adverbs. Chinese expresses tense using aspect markers, lexical references, and contextual cues, differing significantly from inflectional languages like Spanish. Chinese-speaking learners of Spanish therefore face particular challenges in learning verbal morphology and tenses. The data from eye movement experiments were structured into feature vectors, with learner behaviors serving as class labels. After categorizing the collected data, we used two types of machine learning methods for classification and regression: Random Forests and the k-nearest neighbors algorithm (KNN). By leveraging these algorithms, we predicted learner behaviors and conducted performance evaluations to enhance our understanding of the nexus between learner behaviors and the language learning process. Future research may further enrich TECS by gathering data from subsequent eye-movement experiments, specifically targeting various Spanish tenses and temporal lexical references during text reading. 
These endeavors promise to broaden and refine the corpus, advancing our understanding of language processing.
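
The abstract above describes feature vectors labeled with learner behaviors and classified with KNN. As a minimal sketch of that idea, the following uses only the Python standard library. The feature names (total fixation time, regression count), the labels, and all values are hypothetical illustrations and are not taken from TECS, and the function is a generic majority-vote KNN, not the authors' implementation.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training vectors closest to query."""
    # Euclidean distance from the query to every training vector
    distances = [
        (math.dist(x, query), label)
        for x, label in zip(train_X, train_y)
    ]
    # Closest vectors first; only the distance is used as the sort key
    distances.sort(key=lambda pair: pair[0])
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical feature vectors: (total fixation time in seconds, regression count)
train_X = [(0.2, 1), (0.3, 0), (0.25, 1), (0.9, 4), (1.1, 5), (1.0, 3)]
# Hypothetical class labels standing in for a learner-behavior category
train_y = ["high", "high", "high", "low", "low", "low"]

prediction = knn_predict(train_X, train_y, (0.95, 4), k=3)
print(prediction)
```

In practice the features would normally be scaled first, since an unscaled feature with a large range dominates the Euclidean distance.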

Performance Evaluation of Machine Learning Model for Seismic Response Prediction of Nuclear Power Plant Structures considering Aging deterioration (원전 구조물의 경년열화를 고려한 지진응답예측 기계학습 모델의 성능평가)

  • Kim, Hyun-Su;Kim, Yukyung;Lee, So Yeon;Jang, Jun Su
    • Journal of Korean Association for Spatial Structures
    • /
    • v.24 no.3
    • /
    • pp.43-51
    • /
    • 2024
  • The dynamic responses of nuclear power plant structures subjected to earthquake loads should be carefully investigated for safety. Because nuclear power plant structures are usually built of reinforced concrete (R.C.), the aging deterioration of R.C. has a considerable effect on their structural behavior. Therefore, the aging deterioration of R.C. nuclear power plant structures should be considered for accurate prediction of their seismic responses. In this study, a machine learning model for seismic response prediction of nuclear power plant structures was developed by considering aging deterioration. The OPR-1000 was selected as an example structure for numerical simulation. The OPR-1000 was originally designated as the Korean Standard Nuclear Power Plant (KSNP), and was re-designated as the OPR-1000 in 2005 for foreign sales. 500 artificial ground motions were generated based on site characteristics of Korea. Elastic modulus, damping ratio, Poisson's ratio, and density were selected to represent material property variation due to aging deterioration. Six machine learning algorithms, namely Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Artificial Neural Networks (ANN), and eXtreme Gradient Boosting (XGBoost), were used to construct the seismic response prediction model. Thirteen intensity measures and four material properties were used as input parameters of the training database. Performance was evaluated using metrics such as root mean square error, mean square error, mean absolute error, and the coefficient of determination. Hyperparameters were optimized through k-fold cross-validation and grid search. The analysis results show that neural networks provide good prediction performance when aging deterioration is considered.
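
The four evaluation metrics named in the abstract above can be computed as in the sketch below, which uses their standard definitions. The function name and the sample values are illustrative assumptions; the abstract does not give the authors' data or code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and the coefficient of determination (R^2)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                          # undefined if y_true is constant
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Made-up observed and predicted responses, for illustration only
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

RMSE and MAE are in the units of the predicted response, while R^2 is unitless, which is one reason such studies often report them together.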

A study on solar radiation prediction using medium-range weather forecasts (중기예보를 이용한 태양광 일사량 예측 연구)

  • Sujin Park;Hyojeoung Kim;Sahm Kim
    • The Korean Journal of Applied Statistics
    • /
    • v.36 no.1
    • /
    • pp.49-62
    • /
    • 2023
  • Solar energy, whose share is rapidly increasing, continues to attract development and investment. With the Green New Deal renewable energy policy and the growing installation of home solar panels, the supply of solar energy in Korea is gradually expanding, and research on accurate demand prediction of power generation is actively underway. Solar radiation prediction is important because solar radiation is the factor that most influences power generation demand prediction. This study differs most from previous work in that it attempts to predict solar radiation using medium-term forecast weather data, which earlier studies did not use. In this paper, we combined multiple linear regression, KNN, random forest, and SVR models with the K-means clustering technique to predict hourly solar radiation by calculating the probability density function for each cluster. Before using the medium-term forecast data, mean absolute error (MAE) and root mean squared error (RMSE) were used as indicators to compare model prediction results. The data, covering March 1, 2017 to February 28, 2022, were converted into daily data to match the medium-term forecast data format. A comparison of predictive performance showed that the best method was to predict daily solar radiation with random forest, group dates with similar climate factors, and calculate the probability density function of solar radiation for each cluster. In addition, when the model was fitted to the medium-term forecast data using this methodology, the prediction error was found to increase with the forecast date. This appears to be due to prediction error in the medium-term forecast weather data itself. 
Future studies should add exogenous variables such as precipitation from among the weather factors available in the medium-term forecast data, or apply time series clustering techniques.

One-probe P300 based concealed information test with machine learning (기계학습을 이용한 단일 관련자극 P300기반 숨김정보검사)

  • Hyuk Kim;Hyun-Taek Kim
    • Korean Journal of Cognitive Science
    • /
    • v.35 no.1
    • /
    • pp.49-95
    • /
    • 2024
  • Polygraph examination, statement validity analysis, and the P300-based concealed information test are the three major examination tools used to determine a person's truthfulness and credibility in criminal procedure. Although polygraph examination is the most common in criminal procedure, it has little admissibility as evidence because of its weak scientific basis. In the 1990s, to compensate for the weak scientific basis of the polygraph, Farwell and Donchin proposed the P300-based concealed information test technique. The P300-based concealed information test has two strengths. First, it is easy to conduct alongside a polygraph. Second, it has an ample scientific basis. Nevertheless, the test is used infrequently because of the number of probe stimuli it requires. A probe stimulus contains undisclosed information that is relevant to the crime or other investigated situation. In the traditional P300-based concealed information test protocol, three or more probe stimuli are required. However, three or more probe stimuli are hard to obtain, because most crime-relevant information is disclosed during an investigation. In addition, the P300-based concealed information test uses the oddball paradigm, which creates an imbalance between the numbers of probe and irrelevant stimuli. This imbalance may cause systematic underestimation of the P300 amplitude of irrelevant stimuli. To overcome these two limitations, a one-probe P300-based concealed information test protocol is explored with various machine learning algorithms. According to this study, the parameters of the modified one-probe protocol are as follows. 
In the condition of female and male face stimuli, the recommended stimulus duration is 400 ms, the recommended number of stimulus repetitions is 60, the recommended analysis method for P300 amplitude is the peak-to-peak method, the recommended cut-off for the guilty condition is 90%, and the recommended cut-off for the innocent condition is 30%. In the condition of two-syllable word stimuli, the recommended stimulus duration is 300 ms, the recommended number of stimulus repetitions is 60, the recommended analysis method for P300 amplitude is the peak-to-peak method, the recommended cut-off for the guilty condition is 90%, and the recommended cut-off for the innocent condition is 30%. It was also confirmed that the logistic regression (LR), linear discriminant analysis (LDA), and K-nearest neighbors (KNN) algorithms are viable methods for analyzing P300 amplitude. The one-probe P300-based concealed information test with machine learning helps increase the use of the P300-based concealed information test and supports determining a person's truthfulness and credibility alongside polygraph examination in criminal procedure.

Optimization of Number of Training Documents in Text Categorization (문헌범주화에서 학습문헌수 최적화에 관한 연구)

  • Shim, Kyung
    • Journal of the Korean Society for information Management
    • /
    • v.23 no.4 s.62
    • /
    • pp.277-294
    • /
    • 2006
  • This paper examines the level of categorization performance in a real-life collection of abstract articles in the fields of science and technology, and tests the optimal number of documents per category in a training set using a kNN classifier. The corpus is built by first choosing categories that hold more than 2,556 documents and then randomly selecting 2,556 documents per category. It is further divided into eight subsets with different numbers of training documents: each set is randomly selected to build training sets ranging from 20 documents (Tr-20) to 2,000 documents (Tr-2000) per category. The categorization performances of the eight subsets are compared. The average performance of the eight subsets is 30% in $F_1$ measure, which is relatively poor compared to the findings of previous studies. The experimental results suggest that among the eight subsets Tr-100 is the optimal size for training a kNN classifier. In addition, the correctness of the subject categories assigned to the training sets is probed by manually reclassifying the training sets, in order to support the above conclusion by establishing a relation between that correctness and categorization performance.
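
The abstract above reports performance in the $F_1$ measure. A minimal sketch of how $F_1$ can be computed per category and then averaged is given below. The abstract does not say whether the authors used macro or micro averaging, so the macro average here is an assumption, and the category names and labels are made up for illustration.

```python
def f1_per_category(true_labels, predicted_labels):
    """Return a dict mapping each category to its F1 score."""
    categories = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for c in categories:
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[c] = 2 * precision * recall / (precision + recall)
        else:
            scores[c] = 0.0
    return scores


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-category F1 scores (macro averaging, assumed)."""
    scores = f1_per_category(true_labels, predicted_labels)
    return sum(scores.values()) / len(scores)


# Hypothetical category assignments for six documents
true_labels = ["physics", "physics", "chem", "chem", "bio", "bio"]
predicted_labels = ["physics", "chem", "chem", "chem", "bio", "physics"]

per_category = f1_per_category(true_labels, predicted_labels)
overall = macro_f1(true_labels, predicted_labels)
print(per_category, overall)
```

Macro averaging weights every category equally, which suits a corpus like the one described, where each category holds the same number of documents.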

Multiple Period Forecasting of Motorway Traffic Volumes by Using Big Historical Data (대용량 이력자료를 활용한 다중시간대 고속도로 교통량 예측)

  • Chang, Hyun-ho;Yoon, Byoung-jo
    • KSCE Journal of Civil and Environmental Engineering Research
    • /
    • v.38 no.1
    • /
    • pp.73-80
    • /
    • 2018
  • In motorway traffic flow control, the conventional approach based on real-time response has given way to an advanced approach based on proactive response. Future traffic conditions over multiple time intervals are crucial input data for advanced motorway traffic flow control. Forecasting multiple-period traffic volumes requires overcoming the uncertainty of the future state, because uncertainty grows as the forecasting horizon expands. Multi-interval forecasting of traffic volumes therefore requires a viable approach to handling future uncertainties. In this paper, a forecasting model is proposed that effectively addresses the uncertainties of the future state based on the temporal evolution of traffic volume states that intrinsically exists in big historical data. The model selects past states from the historical data based on the state evolution of current traffic volumes, and the selected past states are then used to estimate future states. The model was also designed to be suitable for data management systems in practice. Test results demonstrated that the model can effectively overcome the uncertainties over multiple time periods and can generate very reliable predictions in terms of prediction accuracy. Hence, the results indicate that the model can be deployed and used in advanced data management systems.
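
The selection of similar past states described above can be read as a nearest-neighbor search over historical windows. That reading is an assumption made for this sketch, not the authors' stated algorithm. The code finds the `k` past windows closest to the current state and averages the values that followed them over several horizons; the function name, the window length, and the volumes are hypothetical.

```python
import math


def knn_forecast(history, current_state, horizon, k=2):
    """Forecast the next `horizon` values by averaging what followed the
    k historical windows most similar to `current_state`."""
    m = len(current_state)
    candidates = []
    # Slide over the history; each window needs `horizon` values after it
    for start in range(len(history) - m - horizon + 1):
        window = history[start:start + m]
        future = history[start + m:start + m + horizon]
        candidates.append((math.dist(window, current_state), future))
    if not candidates:
        raise ValueError("history is too short for this state length and horizon")
    candidates.sort(key=lambda c: c[0])
    nearest = [future for _, future in candidates[:k]]
    # Average the neighbors' futures separately for each forecast step
    return [sum(f[h] for f in nearest) / len(nearest) for h in range(horizon)]


# Made-up traffic volumes (vehicles per interval) with a repeating pattern
history = [100, 120, 150, 130, 110, 100, 120, 150, 130, 110]
forecast = knn_forecast(history, current_state=[100, 120], horizon=2, k=2)
print(forecast)
```

Because the forecast for every horizon comes directly from observed continuations, the error does not compound step by step as it would when a one-step model is fed its own predictions.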

A Study on Pipe Model Registration for Augmented Reality Based O&M Environment Improving (증강현실 기반의 O&M 환경 개선을 위한 배관 모델 정합에 관한 연구)

  • Lee, Won-Hyuk;Lee, Kyung-Ho;Lee, Jae-Joon;Nam, Byeong-Wook
    • Journal of the Computational Structural Engineering Institute of Korea
    • /
    • v.32 no.3
    • /
    • pp.191-197
    • /
    • 2019
  • As the shipbuilding and offshore plant industries grow larger and more complex, their maintenance and inspection systems become more important. Recently, maintenance and inspection systems based on augmented reality have been attracting much attention for improving workers' understanding of their work and their efficiency, but such systems are often difficult to work with because accurate matching between the augmented model and real-world information is not achieved. To solve this problem, marker-based AR technology is used to attach a specific image to the model. However, the markers get damaged because of the nature of the shipbuilding and offshore plant industry, and the camera needs to be able to detect the entire marker clearly, and thus requires sufficient space to exist between the operator. To overcome the limitations of the existing AR system, this study adopted markerless AR to accurately recognize the actual model of the pipe system, which accounts for the most processes in the shipbuilding and offshore plant industries. The matching methodology. Through this system, it is expected that the twisting of the augmented model caused by the posture of the worker and the confined environment can be reduced.

A Study on the Development of Flight Prediction Model and Rules for Military Aircraft Using Data Mining Techniques (데이터 마이닝 기법을 활용한 군용 항공기 비행 예측모형 및 비행규칙 도출 연구)

  • Yu, Kyoung Yul;Moon, Young Joo;Jeong, Dae Yul
    • The Journal of Information Systems
    • /
    • v.31 no.3
    • /
    • pp.177-195
    • /
    • 2022
  • Purpose: This paper aims to support full operational readiness by establishing an optimal flight plan that takes weather conditions into account, so that military aircraft missions and operations can be performed effectively. It proposes a flight prediction model and rules by analyzing the correlation between flight implementation and cancellation under different weather conditions, using big data collected from historical flight information of military aircraft supplied by Korean manufacturers and meteorological information from the Korea Meteorological Administration. In addition, deriving flight rules from weather information made it possible to identify an efficient method of establishing flight schedules. Design/methodology/approach: This is an analytic study using data mining techniques based on historical data for 44,558 military aircraft flights accumulated by the Republic of Korea Air Force over 36 months from January 2013 to December 2015 and meteorological information provided by the Korea Meteorological Administration. Four steps were taken to develop optimal flight prediction models and to derive rules for flight implementation and cancellation. First, a total of 10 independent variables and one dependent variable were used to develop the optimal model for flight implementation according to weather conditions. Second, optimal flight prediction models were derived using data mining algorithms such as logistic regression, AdaBoost, KNN, random forest, and LightGBM. Third, we collected the opinions of military aircraft pilots with more than 25 years of experience and evaluated the importance of the independent variables using a Python heatmap to develop flight implementation and cancellation rules according to weather conditions. 
Finally, a decision tree model was constructed, and flight rules were derived to see how the weather conditions at each airport affect the implementation and cancellation of flights. Findings: Based on historical flight information of military aircraft and weather information for the flight zones, we developed a flight prediction model using data mining techniques. In developing the optimal flight prediction model for each airbase, the LightGBM algorithm was confirmed to have the best prediction rate in terms of recall. The flight rules were checked against weather conditions, and precipitation, humidity, and total cloud cover were confirmed to have a significant effect on flight cancellation, whereas the effect of visibility was found to be relatively insignificant. When a flight schedule is established, the rules will provide some insight for deciding on flight training more systematically and effectively.

Estimation of River Flow Data Using Machine Learning (머신러닝 기법을 이용한 유량 자료 생산 방법)

  • Kang, Noel;Lee, Ji Hun;Lee, Jung Hoon;Lee, Chungdae
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2020.06a
    • /
    • pp.261-261
    • /
    • 2020
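  • Developing accurate stage-discharge rating curves is essential for securing the continuous flow data that underpin water management. Rating curves form the basis for the design of all hydraulic facilities and are also important for responding to water-related disasters such as floods and droughts. However, flow measurement is generally costly and time-consuming, and as control characteristics change through vegetation growth, cross-section changes, and similar processes, nonlinear patterns such as segment separation and period separation appear, making the data difficult to interpret. In particular, rivers in Korea undergo diverse natural and artificial environmental changes, so detailed analysis by site and period is required. Machine learning refers to a process in which a computer learns from data on its own to build a model and improve its performance. Whereas conventional rating curves derive regression parameters after the developer chooses the type and period of data by judgment, machine learning can learn from all valid data, find correlations among the data, build a model, and continuously improve its performance. Provided that sufficient hydrological data are available, machine learning has the advantage of reflecting complex and variable water resource environments and continuously improving the accuracy of flow estimation. This study built flow estimation models using representative machine learning algorithms and compared and analyzed their performance. The study site is the Geoun-gyo (Geoun Bridge) station in the Han River basin, which has a stable water supply, and the data used include time, water level, flow, and water surface width for 2010 to 2018. The models were implemented with scikit-learn (sklearn), a Python-based machine learning library, and the algorithms applied were random forest regression, decision tree, KNN (K-Nearest Neighbor), and rgboost. The training data were combined by input type into six sets to build the models, which were then evaluated on the test data using RMSE (Root Mean Square Error). The resulting RMSE values ranged rather widely, from 3.67 to 171.46, depending on the combination of model and input data. The best case combined three inputs (water level, year, and water surface width) with the random forest regression model. For comparison, estimating flow for the same test data with the stage-specific rating curves for the Geoun-gyo station in the Korea Annual Hydrological Report (2018) gave an RMSE of 3.76, confirming that machine learning performs at a level similar to segmented rating curves. As a basic study examining the applicability of machine learning techniques to existing hydrological data for producing high-quality flow data, this work is expected to support efficient hydrological measurement and rating curve derivation in Korea. Future work should apply input data such as meteorological data and water intake volumes to identify the various variables that affect the water resource environment and control characteristics, and should further study more sophisticated models such as deep learning.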


The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.1
    • /
    • pp.23-45
    • /
    • 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales sites, and SNS, and dataset characteristics are also diverse. To secure their competitiveness, companies need to improve their decision-making capacity using classification algorithms. However, most do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm is appropriate for the characteristics of a dataset has been a task that requires expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features reflecting the characteristics of multi-class data. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among those, we included the Herfindahl-Hirschman Index (HHI), originally a market concentration index, to replace the IR (Imbalanced Ratio). We also added a newly developed index, the Reverse ReLU Silhouette Score, to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality(red), Contraceptive Method Choice) were selected. The classes of each dataset were predicted using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross-validation. 
Oversampling of 10% to 100% was applied to each fold, and the meta-features of the dataset were measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. F1-score was selected as the dependent variable. The results show that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The HHI meta-feature proposed in this study has a significant effect on classification performance. (2) The number of variables has a significant effect on classification performance and, unlike the number of classes, the effect is positive. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at a significance level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by classification algorithm were consistent. In the regression analysis by classification algorithm, the number of variables has no significant effect for the Naïve Bayes algorithm, unlike for the other classification algorithms. This study makes two theoretical contributions: (1) two new meta-features (HHI, Reverse ReLU Silhouette Score) were shown to be significant; (2) the effects of data characteristics on classification performance were investigated using meta-features. The practical contributions are as follows. (1) The results can be used to develop a classification algorithm recommendation system based on dataset characteristics. 
(2) Because data characteristics differ, many data scientists search for the optimal algorithm for a situation by repeatedly adjusting algorithm parameters, a process that wastes hardware, cost, time, and manpower. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. The paper consists of an introduction, related research, research model, experiment, conclusion, and discussion.
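
The abstract above uses the Herfindahl-Hirschman Index as a class-imbalance meta-feature. The sketch below applies the standard HHI definition (the sum of squared shares) to class proportions. Whether the authors normalize or rescale the index is not stated, so this plain form is an assumption, and the label lists are made up.

```python
from collections import Counter


def herfindahl_hirschman_index(labels):
    """Sum of squared class proportions: 1/n_classes when perfectly
    balanced, approaching 1.0 as a single class dominates."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


# Four equally frequent classes: each share is 0.25
balanced = herfindahl_hirschman_index(["a", "b", "c", "d"])
# One class holds 80% of the samples, two others hold 10% each
skewed = herfindahl_hirschman_index(["a"] * 8 + ["b", "c"])
print(balanced, skewed)
```

Unlike a ratio between the largest and smallest class, this index takes every class share into account, which may be why it suits multi-class datasets.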