• Title/Summary/Keyword: KNN(K-Nearest Neighbor) model

K Nearest Neighbor Joins for Big Data Processing based on Spark (Spark 기반 빅데이터 처리를 위한 K-최근접 이웃 연결)

  • JIAQI, JI;Chung, Yeongjee
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.9 / pp.1731-1737 / 2017
  • K Nearest Neighbor Join (KNN Join) is a simple yet effective method in machine learning. It has been widely used on small datasets. As the volume of data grows, running it on a single machine becomes infeasible because of memory and time constraints. MapReduce, a popular batch-processing model that runs on clusters of many computers, is now widely used for large-scale data processing. Hadoop is a framework that implements MapReduce, but its performance can be improved upon by a newer framework, Spark. In this study, we present a KNN Join implementation based on Spark. By exploiting Spark's in-memory computation, it is faster and more effective than a Hadoop-based implementation. In our experiments, we study the influence of different factors on running time and demonstrate the robustness and efficiency of our approach.
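
As a rough single-machine sketch of what a KNN join computes (this is not the paper's Spark implementation): for every point in a query set R, find its k nearest points in a dataset S. The function name `knn_join`, the `num_partitions` parameter, and the toy data are illustrative assumptions; the block loop only mimics how per-partition results could be merged in a map/reduce style. It requires Python 3.8+ for `math.dist`.

```python
import heapq
import math

def knn_join(R, S, k, num_partitions=2):
    """For each point r in R, return the k nearest points of S (Euclidean)."""
    # Split S into blocks, as a distributed job would split it across workers.
    blocks = [S[i::num_partitions] for i in range(num_partitions)]
    result = {}
    for r in R:
        candidates = []
        for block in blocks:
            # "Map" step: local top-k inside one block.
            local = heapq.nsmallest(k, ((math.dist(r, s), s) for s in block))
            candidates.extend(local)
        # "Reduce" step: merge the local top-k lists into the global top-k.
        result[r] = [s for _, s in heapq.nsmallest(k, candidates)]
    return result

# Hypothetical 2-D points.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (6.0, 6.0), (10.0, 10.0)]
joined = knn_join(R, S, k=2)
```

Merging local top-k lists is sufficient because any global k-nearest neighbor must also be among the k nearest within its own block.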

Development of Freeway Traffic Incident Clearance Time Prediction Model by Accident Level (사고등급별 고속도로 교통사고 처리시간 예측모형 개발)

  • LEE, Soong-bong;HAN, Dong Hee;LEE, Young-Ihn
    • Journal of Korean Society of Transportation / v.33 no.5 / pp.497-507 / 2015
  • Nonrecurrent congestion on freeways is caused primarily by incidents, and the main cause of incidents is traffic accidents. Accurate prediction of traffic incident clearance time is therefore very important in accident management. This study analyzed freeway traffic accident data from 2008 to 2014. The KNN (K-Nearest Neighbor) algorithm was used to develop an incident clearance time prediction model from the historical traffic accident data. Analysis of the accident data showed that accident level significantly affects incident clearance time, so clearance time was categorized by accident level. Data were grouped by traffic volume, number of lanes, and time period to account for traffic conditions and roadway geometry. Factors affecting incident clearance time were then analyzed from the extracted data to identify similar types of accidents. Finally, weights for the detailed factors were calculated for use in the distance metric. The weights were calculated by applying a standardization method based on the normal distribution, and incident clearance time was then predicted. The model showed a lower prediction error (MAPE) than models from previous studies. The improved model developed in this study is expected to contribute to efficient highway operation and management when incidents occur.
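
The general idea of a factor-weighted distance metric and the MAPE error measure can be sketched as follows. This is a minimal illustration, not the authors' model: the feature set, the weight values, the records, and the function names are all assumptions made for the example.

```python
import math

def weighted_knn_predict(query, records, weights, k):
    """Average the clearance times of the k most similar past accidents."""
    def distance(a, b):
        # Weighted Euclidean distance: a larger weight makes a factor more influential.
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))
    nearest = sorted(records, key=lambda rec: distance(query, rec[0]))[:k]
    return sum(t for _, t in nearest) / len(nearest)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical records: (features, clearance time in minutes).
# Features: (traffic volume in 1000 veh/h, number of lanes, hour of day / 24).
history = [
    ((1.0, 2, 0.30), 30.0),
    ((1.2, 2, 0.35), 34.0),
    ((3.0, 4, 0.75), 60.0),
    ((3.2, 4, 0.80), 66.0),
]
weights = (1.0, 0.5, 2.0)  # assumed values, not taken from the paper
pred = weighted_knn_predict((1.1, 2, 0.32), history, weights, k=2)
err = mape([32.0], [pred])
```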

Three Dimensional Object Recognition using PCA and KNN (PCA와 KNN를 이용한 3차원 물체인식)

  • Lee, Kee-Jun
    • The Journal of the Korea Contents Association / v.9 no.8 / pp.57-63 / 2009
  • Object recognition technologies based on PCA (principal component analysis) recognize objects by determining representative features of objects in the model image, extracting feature vectors from objects in an image, and measuring the distance between those vectors and the object representation. Because the point-to-point distance approach frequently causes recognition problems, this study adopted the k-nearest neighbor technique (class-to-class), in which a group of object models of the same class serves as the recognition unit for images arriving in a continuous input stream. However, the robustness of PCA-based recognition depends on several factors, including illumination. When scene constancy is not secured because of varying illumination conditions, the learning performance of the feature detector can be compromised, undermining recognition quality. This paper proposes a new PCA recognition method in which objects in the database can be detected even when illumination differs between the input images and the model images.

A Study of Travel Time Prediction using K-Nearest Neighborhood Method (K 최대근접이웃 방법을 이용한 통행시간 예측에 대한 연구)

  • Lim, Sung-Han;Lee, Hyang-Mi;Park, Seong-Lyong;Heo, Tae-Young
    • The Korean Journal of Applied Statistics / v.26 no.5 / pp.835-845 / 2013
  • Travel time is considered the most typical and preferred traffic information for intelligent transportation systems (ITS). This paper proposes a real-time travel-time prediction method for a national highway, using the K-nearest neighbor (KNN) method. The KNN method, a nonparametric method, is appropriate for a real-time traffic management system because it needs no additional assumptions or parameter calibration. The performance of various models is compared based on mean absolute percentage error (MAPE) and coefficient of variation (CV). In a real application, analysis of traffic data collected from Korean national highways indicates that the proposed model outperforms other prediction models such as the historical average model and the Kalman filter model. The model is expected to improve travel-time reliability by flexibly combining its predicted travel times with travel times from the interval detectors.
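
One common way to apply KNN to a time series without fitting parameters is pattern matching: find the k historical windows most similar to the latest observations and average the values that followed them. The sketch below illustrates that idea only; the window length, the data, and the name `knn_forecast` are assumptions, and the paper's actual state vector may differ.

```python
def knn_forecast(history, recent, k):
    """Predict the next travel time from the k historical windows most similar to `recent`."""
    m = len(recent)
    scored = []
    # Slide a window over the history; pair each window with the value that followed it.
    for i in range(len(history) - m):
        window = history[i:i + m]
        dist = sum((a - b) ** 2 for a, b in zip(window, recent))
        scored.append((dist, history[i + m]))
    scored.sort(key=lambda pair: pair[0])
    neighbors = [nxt for _, nxt in scored[:k]]
    return sum(neighbors) / len(neighbors)

# Hypothetical travel times (minutes) at 5-minute intervals.
past = [10, 11, 13, 16, 14, 12, 10, 11, 13, 17, 15, 12]
forecast = knn_forecast(past, recent=[10, 11, 13], k=2)
```

Here the pattern `[10, 11, 13]` occurred twice in the history and was followed by 16 and 17, so the forecast is their mean.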

A Classification Algorithm Based on Data Clustering and Data Reduction for Intrusion Detection System over Big Data

  • Wang, Qiuhua;Ouyang, Xiaoqin;Zhan, Jiacheng
    • KSII Transactions on Internet and Information Systems (TIIS) / v.13 no.7 / pp.3714-3732 / 2019
  • With the rapid development of networks, the Intrusion Detection System (IDS) plays an increasingly important role in network applications. Many data mining algorithms are used to build IDSs. However, with the advent of the big data era, massive amounts of data are being generated. When dealing with large-scale data sets, most data mining algorithms suffer from a high computational burden, which makes the IDS much less efficient. To build an efficient IDS over big data, we propose a classification algorithm based on data clustering and data reduction. In the training stage, the training data are divided into clusters of similar size by the Mini Batch K-Means algorithm, and the center of each cluster is used as its index. Then, we select representative instances for each cluster to perform data reduction and use the clusters consisting of representative instances to build a K-Nearest Neighbor (KNN) detection model. In the detection stage, we sort the clusters by the distance between the test sample and the cluster indexes, and obtain the k nearest clusters, in which we find the k nearest neighbors. Experimental results show that searching for neighbors by cluster index significantly reduces computational complexity, and that classification with the reduced data of representative instances not only improves efficiency but also maintains high accuracy.
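
The detection stage described above can be sketched roughly as follows, assuming the clusters and their representative instances have already been built (the Mini Batch K-Means and instance-selection steps are omitted). The data layout, the `n_probe` parameter for the number of clusters searched, and the function name are assumptions made for illustration.

```python
import math
from collections import Counter

def cluster_knn_classify(sample, clusters, k, n_probe):
    """Classify `sample` by searching only the n_probe clusters with the nearest centers."""
    # Rank clusters by the distance from the sample to each cluster index (its center).
    ranked = sorted(clusters, key=lambda c: math.dist(sample, c["center"]))
    # Gather candidate instances from the nearest clusters only.
    candidates = [inst for c in ranked[:n_probe] for inst in c["instances"]]
    nearest = sorted(candidates, key=lambda inst: math.dist(sample, inst[0]))[:k]
    # Majority vote over the labels of the k nearest neighbors.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical clusters of representative instances: (features, label).
clusters = [
    {"center": (0.0, 0.0),
     "instances": [((0.1, 0.0), "normal"), ((0.0, 0.2), "normal"), ((0.3, 0.3), "attack")]},
    {"center": (5.0, 5.0),
     "instances": [((5.1, 5.0), "attack"), ((4.9, 5.2), "attack")]},
    {"center": (9.0, 0.0),
     "instances": [((9.0, 0.1), "normal")]},
]
label = cluster_knn_classify((4.8, 4.9), clusters, k=3, n_probe=2)
```

Only the instances in the probed clusters are compared against the sample, which is where the reduction in distance computations comes from.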

A study on the spatial neighborhood in spatial regression analysis (공간이웃정보를 고려한 공간회귀분석)

  • Kim, Sujung
    • Journal of the Korean Data and Information Science Society / v.28 no.3 / pp.505-513 / 2017
  • Recently, numerous small area estimation studies have been conducted to obtain more detailed and accurate estimates. Most of these studies have employed spatial regression models, which require a clear definition of spatial neighborhoods. In this study, we introduce Delaunay triangulation as a method for defining spatial neighborhoods and compare it with the k-nearest neighbor method. A simulation was conducted to determine which of the two methods defines spatial neighborhoods more efficiently, and we demonstrate the performance of the proposed method using land price data.

Performance Comparison of Machine Learning Models for Grid-Based Flood Risk Mapping - Focusing on the Case of Typhoon Chaba in 2016 - (격자 기반 침수위험지도 작성을 위한 기계학습 모델별 성능 비교 연구 - 2016 태풍 차바 사례를 중심으로 -)

  • Jihye Han;Changjae Kwak;Kuyoon Kim;Miran Lee
    • Korean Journal of Remote Sensing / v.39 no.5_2 / pp.771-783 / 2023
  • This study compares the performance of machine learning models for preparing a grid-based flood risk map of Jung-gu, Ulsan, for Typhoon Chaba, which occurred in 2016. Dynamic data such as rainfall and river height, and static data such as building, population, and land cover data, were used to analyze flood risk. The data were constructed as 10 m grid data based on the national point number, and a sample dataset was built using the risk value calculated for each grid cell as the dependent variable and the values of five influencing factors as the independent variables. The sample dataset contains 15,910 samples in total, and the training, verification, and test datasets were randomly extracted at a 6:2:2 ratio to build the machine learning models. The machine learning techniques used were random forest (RF), support vector machine (SVM), and k-nearest neighbor (KNN); prediction accuracy was highest for SVM (91.05%), followed by RF (83.08%) and KNN (76.52%). Deriving the priority of the influencing factors with the RF model confirmed that rainfall and river water level greatly influenced the risk.
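
A random 6:2:2 split of 15,910 samples can be sketched as below. The function name, the fixed seed, and the use of integer indices in place of the actual grid samples are assumptions for illustration; the abstract does not say how the authors drew the split.

```python
import random

def split_622(samples, seed=0):
    """Shuffle and split samples into training, verification, and test sets at 6:2:2."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_verif = len(shuffled) * 2 // 10
    train_set = shuffled[:n_train]
    verif_set = shuffled[n_train:n_train + n_verif]
    test_set = shuffled[n_train + n_verif:]
    return train_set, verif_set, test_set

# Integer indices stand in for the 15,910 grid samples reported in the abstract.
train_set, verif_set, test_set = split_622(range(15910))
sizes = (len(train_set), len(verif_set), len(test_set))
```

With 15,910 samples this yields 9,546 training, 3,182 verification, and 3,182 test samples.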

Research on Classification of Sitting Posture with a IMU (하나의 IMU를 이용한 앉은 자세 분류 연구)

  • Kim, Yeon-Wook;Cho, Woo-Hyeong;Jeon, Yu-Yong;Lee, Sangmin
    • Journal of rehabilitation welfare engineering & assistive technology / v.11 no.3 / pp.261-270 / 2017
  • Bad sitting postures are known to cause a variety of diseases and physical deformities. However, it is not easy to maintain a correct sitting posture for long periods of time. Methods for distinguishing and inducing good sitting posture have therefore been proposed continually. Proposed methods include image processing, pressure sensors attached to the chair, and the IMU (Inertial Measurement Unit). The IMU-based method has the advantages of a simple hardware configuration and freedom from various measurement constraints. In this paper, we studied how to distinguish sitting postures with a small amount of data using just one IMU. A feature extraction method was used to find the data that contribute least to classification. Machine learning algorithms were used to find the best sensor position for classification, and we identified the best-performing machine learning algorithm. The feature extraction method was PCA (Principal Component Analysis). Five machine learning models were used: SVM (Support Vector Machine), KNN (K Nearest Neighbor), K-means, GMM (Gaussian Mixture Model), and HMM (Hidden Markov Model). The results show that the back of the neck is a suitable position for classification because its classification rate was the highest for every model. PCA confirmed that yaw, one of the IMU data channels, contributes least to the classification rate, and removing it did not change the classification rate. SVM and KNN are suitable for classification because their classification rates are higher than those of the other models.

Hyperparameter Tuning Based Machine Learning classifier for Breast Cancer Prediction

  • Md. Mijanur Rahman;Asikur Rahman Raju;Sumiea Akter Pinky;Swarnali Akter
    • International Journal of Computer Science & Network Security / v.24 no.2 / pp.196-202 / 2024
  • Breast cancer (BC) is currently the second most devastating form of cancer in people, particularly in women. In the healthcare industry, machine learning (ML) is commonly employed to predict fatal diseases. Because breast cancer has a favorable prognosis when detected at an early stage, a model was built using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The main aim of this model is to compare the effectiveness of five well-known ML classifiers, namely Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), and Naive Bayes (NB), with the conventional method. To improve on the conventional method, we applied hyperparameter tuning with the grid search method, which improved accuracy, precision, recall, and the F1 score. With the hyperparameter-tuned model in this study, accuracy increased from 94.15% to 98.83%, whereas the accuracy of the conventional method increased from 93.56% to 97.08%. According to this investigation, KNN outperformed all other classifiers in terms of accuracy, achieving a score of 98.83%. In conclusion, our study shows that KNN works well with hyperparameter tuning. These analyses show that the prediction approach in this study is useful for the prognosis of women with breast cancer, with viable performance and more accurate findings than the conventional approach.
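
Grid search simply evaluates every candidate hyperparameter value and keeps the best-scoring one. The sketch below tunes only the `k` of a KNN classifier with leave-one-out validation on made-up one-feature data; it does not use the WDBC dataset or the authors' parameter grid, and all names are illustrative assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training rows nearest to the query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy of a KNN classifier with the given k."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)

def grid_search_k(data, k_grid):
    """Return (best_k, scores) after trying every k in the grid."""
    scores = {k: loo_accuracy(data, k) for k in k_grid}
    best_k = max(scores, key=lambda k: (scores[k], -k))  # ties go to the smaller k
    return best_k, scores

# Hypothetical one-feature samples; (1.3,) is a deliberately mislabeled noisy point.
data = [((1.0,), "benign"), ((1.2,), "benign"), ((1.4,), "benign"), ((1.6,), "benign"),
        ((3.0,), "malignant"), ((3.2,), "malignant"), ((3.4,), "malignant"),
        ((3.6,), "malignant"), ((1.3,), "malignant")]
best_k, scores = grid_search_k(data, k_grid=[1, 3, 5])
```

On this toy data `k=1` is misled by the noisy point, while `k=3` and `k=5` score equally, so the smaller value is selected.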

Simultaneous Optimization of KNN Ensemble Model for Bankruptcy Prediction (부도예측을 위한 KNN 앙상블 모형의 동시 최적화)

  • Min, Sung-Hwan
    • Journal of Intelligence and Information Systems / v.22 no.1 / pp.139-157 / 2016
  • Bankruptcy involves considerable costs, so it can have significant effects on a country's economy. Thus, bankruptcy prediction is an important issue. Over the past several decades, many researchers have addressed topics associated with bankruptcy prediction. Early research on bankruptcy prediction employed conventional statistical methods such as univariate analysis, discriminant analysis, multiple regression, and logistic regression. Later on, many studies began utilizing artificial intelligence techniques such as inductive learning, neural networks, and case-based reasoning. Currently, ensemble models are being utilized to enhance the accuracy of bankruptcy prediction. Ensemble classification involves combining multiple classifiers to obtain more accurate predictions than those obtained using individual models. Ensemble learning techniques are known to be very useful for improving the generalization ability of the classifier. Base classifiers in the ensemble must be as accurate and diverse as possible in order to enhance the generalization ability of an ensemble model. Commonly used methods for constructing ensemble classifiers include bagging, boosting, and random subspace. The random subspace method selects a random feature subset for each classifier from the original feature space to diversify the base classifiers of an ensemble. Each ensemble member is trained by a randomly chosen feature subspace from the original feature set, and predictions from each ensemble member are combined by an aggregation method. The k-nearest neighbors (KNN) classifier is robust with respect to variations in the dataset but is very sensitive to changes in the feature space. For this reason, KNN is a good classifier for the random subspace method. The KNN random subspace ensemble model has been shown to be very effective for improving an individual KNN model. 
The k parameter of KNN base classifiers and the feature subsets selected for base classifiers play an important role in determining the performance of the KNN ensemble model. However, few studies have focused on optimizing the k parameter and feature subsets of base classifiers in the ensemble. This study proposed a new ensemble method that improves upon the performance of the KNN ensemble model by optimizing both the k parameters and the feature subsets of base classifiers. A genetic algorithm was used to optimize the KNN ensemble model and improve the prediction accuracy of the ensemble model. The proposed model was applied to a bankruptcy prediction problem by using a real dataset from Korean companies. The research data included 1800 externally non-audited firms that filed for bankruptcy (900 cases) or non-bankruptcy (900 cases). Initially, the dataset consisted of 134 financial ratios. Prior to the experiments, 75 financial ratios were selected based on an independent sample t-test of each financial ratio as an input variable and bankruptcy or non-bankruptcy as an output variable. Of these, 24 financial ratios were selected by using a logistic regression backward feature selection method. The complete dataset was separated into two parts: training and validation. The training dataset was further divided into two portions: one for training the model and the other for avoiding overfitting. The prediction accuracy against the latter portion was used to determine the fitness value in order to avoid overfitting. The validation dataset was used to evaluate the effectiveness of the final model. A 10-fold cross-validation was implemented to compare the performances of the proposed model and other models. To evaluate the effectiveness of the proposed model, the classification accuracy of the proposed model was compared with that of other models. The Q-statistic values and average classification accuracies of base classifiers were investigated.
The experimental results showed that the proposed model outperformed other models, such as the single model and random subspace ensemble model.
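
The random subspace KNN ensemble that the abstract builds on can be sketched as follows. Each base classifier gets its own feature subset and its own k, and predictions are combined by majority vote. Here the subsets and k values are drawn at random; the genetic-algorithm optimization proposed in the paper is not implemented, and the data, ratios, and function names are assumptions for illustration.

```python
import math
import random
from collections import Counter

def knn_vote(train, query, features, k):
    """KNN prediction using only the selected feature indices."""
    def dist(a, b):
        return math.sqrt(sum((a[j] - b[j]) ** 2 for j in features))
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def build_ensemble(n_features, n_members, subset_size, k_choices, seed=0):
    """Draw a random feature subset and a k value for each base classifier."""
    rng = random.Random(seed)
    return [(sorted(rng.sample(range(n_features), subset_size)), rng.choice(k_choices))
            for _ in range(n_members)]

def ensemble_predict(train, query, members):
    """Majority vote over the base KNN classifiers."""
    votes = Counter(knn_vote(train, query, feats, k) for feats, k in members)
    return votes.most_common(1)[0][0]

# Hypothetical firms described by four scaled financial ratios.
train = [
    ((0.8, 0.7, 0.9, 0.8), "non-bankrupt"),
    ((0.9, 0.8, 0.8, 0.7), "non-bankrupt"),
    ((0.7, 0.9, 0.7, 0.9), "non-bankrupt"),
    ((0.1, 0.2, 0.1, 0.2), "bankrupt"),
    ((0.2, 0.1, 0.2, 0.1), "bankrupt"),
    ((0.1, 0.1, 0.3, 0.2), "bankrupt"),
]
members = build_ensemble(n_features=4, n_members=5, subset_size=2, k_choices=(1, 3))
prediction = ensemble_predict(train, (0.15, 0.1, 0.2, 0.1), members)
```

A genetic algorithm, as in the paper, would search over the `members` list (feature subsets and k values) instead of sampling it once at random.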