• Title/Abstract/Keywords: Classification Algorithms

Applications of Machine Learning Models on Yelp Data

  • Ruchi Singh;Jongwook Woo
    • Asia pacific journal of information systems / v.29 no.1 / pp.35-49 / 2019
  • The paper documents the application of relevant Machine Learning (ML) models to the Yelp dataset (Yelp is a crowd-sourced local business review and social networking site) to analyze, predict, and recommend businesses. Two cloud platforms are used strategically to minimize the effort and time required for the project: seven machine learning algorithms are implemented in Azure ML, four of which are also implemented in Databricks Spark ML. The analyzed Yelp business dataset contains 70 business attributes for more than 350,000 registered businesses. Additionally, review tips and likes from 500,000 users were processed for the project. A recommendation model is built to provide Yelp users with recommendations for business categories based on their previous business ratings as well as the business ratings of other users. A classification model is implemented to predict the popularity of a business, defining a popular business as one with more than 3 stars and an unpopular business as one with fewer than 3 stars. A text analysis model is developed by comparing two algorithms, uni-gram feature extraction and n-feature extraction, in Azure ML Studio and a logistic regression model in Spark. Comparative conclusions are drawn about the efficiency of Spark ML and Azure ML for these models.

Comparative Study on Three Algorithms of the ICD-10 Charlson Comorbidity Index with Myocardial Infarction Patients (Charlson 동반질환의 ICD-10 알고리즘 예측력 비교연구)

  • Kim, Kyoung-Hoon
    • Journal of Preventive Medicine and Public Health / v.43 no.1 / pp.42-49 / 2010
  • Objectives: To compare the performance of three International Statistical Classification of Diseases, 10th Revision (ICD-10) translations of the Charlson comorbidities in predicting in-hospital mortality among patients with myocardial infarction (MI). Methods: MI patients $\geq 20$ years of age with a first admission during 2006 were identified (n = 20,280). Charlson comorbidities were drawn from Health Insurance Claims Data managed by the Health Insurance Review and Assessment Service in Korea. Comparisons across various conditions included (a) three algorithms (the Halfon, Sundararajan, and Quan algorithms), (b) lookback periods (1, 3, and 5 years), (c) data range (admission data; admission and ambulatory data), and (d) diagnosis range (primary diagnosis and first secondary diagnosis; all diagnoses). The performance of each procedure was measured with the c-statistic derived from multiple logistic regression adjusted for age, sex, admission type, and Charlson comorbidity index. A bootstrapping procedure was used to determine the approximate 95% confidence interval. Results: Among the 20,280 patients, the mean age was 63.3 years, 67.8% were men, and 7.1% died while hospitalized. The Quan and Sundararajan algorithms produced higher prevalences than the Halfon algorithm. The c-statistic of the Quan algorithm was slightly higher than, but not significantly different from, those of the other two algorithms under all conditions. There was no evidence that longer lookback periods, additional data, or additional diagnoses improved predictive ability. Conclusions: For health services studies of MI patients using Health Insurance Claims Data, the present results suggest that the Quan algorithm with a 1-year lookback, using the primary diagnosis and the first secondary diagnosis, is adequate for predicting in-hospital mortality.
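
The evaluation described in this abstract (a c-statistic with a bootstrap 95% confidence interval) can be sketched roughly as follows. This is a minimal illustration, not the study's code: it assumes predicted risks are already available (the paper derives them from a multiple logistic regression, which is omitted here), uses a simple percentile bootstrap, and the function names and sample values are hypothetical.

```python
import random

def c_statistic(scores, outcomes):
    """Share of (event, non-event) pairs in which the event case has the
    higher predicted risk; tied pairs count as 0.5."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

def bootstrap_ci(scores, outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the c-statistic (assumed method)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [outcomes[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the two classes
            stats.append(c_statistic([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical predicted risks and outcomes (1 = died in hospital), illustration only.
risk = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.80]
died = [0, 0, 0, 1, 0, 1, 0, 1]
c = c_statistic(risk, died)
low, high = bootstrap_ci(risk, died, n_boot=500)
print(f"c-statistic = {c:.2f}, approximate 95% CI = ({low:.2f}, {high:.2f})")
```

Two algorithms would be compared by computing this statistic for each set of predicted risks and checking whether the intervals overlap; a formal test of the difference is outside this sketch.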

Computational Analysis of PCA-based Face Recognition Algorithms (PCA기반의 얼굴인식 알고리즘들에 대한 연산방법 분석)

  • Hyeon Joon Moon;Sang Hoon Kim
    • Journal of Korea Multimedia Society / v.6 no.2 / pp.247-258 / 2003
  • Principal component analysis (PCA) based algorithms form the basis of numerous algorithms and studies in the face recognition literature. PCA is a statistical technique, and its incorporation into a face recognition system requires numerous design decisions. We make these design decisions explicit by introducing a generic modular PCA algorithm, since some of these decisions are not documented in the literature. We experiment with different implementations of each module and evaluate them using the September 1996 FERET evaluation protocol (the de facto standard method for evaluating face recognition algorithms). We experiment with (1) changing the illumination normalization procedure; (2) studying the effects on algorithm performance of compressing images using JPEG and wavelet compression algorithms; (3) varying the number of eigenvectors in the representation; and (4) changing the similarity measure in the classification process. We perform two experiments. In the first experiment, we report performance results on the standard September 1996 FERET large gallery image sets. The results show that empirical analysis of preprocessing, feature extraction, and matching performance is extremely important for producing optimized performance. In the second experiment, we examine variations in algorithm performance based on 100 randomly generated image sets (galleries) of the same size. The results show that a reasonable threshold for measuring a significant difference in performance between classifiers is 0.10.

A Smart Image Classification Algorithm for Digital Camera by Exploiting Focal Length Information (초점거리 정보를 이용한 디지털 사진 분류 알고리즘)

  • Ju, Young-Ho;Cho, Hwan-Gue
    • Journal of the Korea Computer Graphics Society / v.12 no.4 / pp.23-32 / 2006
  • In recent years, as digital cameras have become popular, users can easily collect hundreds of photos in a single session. Managing hundreds of digital photos is not as simple as keeping paper photos, and managing and classifying a large number of digital photo files can be burdensome and annoying. People therefore hope to use an automated system for managing digital photos, especially for their own purposes. Previous studies, e.g. on content-based image retrieval, focused on the clustering of general images and are not readily applicable to digital photo clustering and classification. Recently, some specialized algorithms for clustering digital camera images have been proposed. These algorithms mainly exploit the statistics of the time gaps between successive photos. Although they showed quite good results in image clustering for digital cameras, many improvements remain to be made. For example, current tools completely ignore the image transformation caused by different focal lengths. In this paper, we present a photo classification method that considers the focal length information recorded in EXIF. We propose an algorithm based on Matching Vector Analysis (MVA) for the classification of digital images taken in everyday activities. Our experiments show that our algorithm achieves success rates of more than 95%, which is competitive among all available methods in terms of sensitivity, specificity, and flexibility.

Data mining Algorithms for the Development of Sasang Type Diagnosis (사상체질 진단검사를 위한 데이터마이닝 알고리즘 연구)

  • Hong, Jin-Woo;Kim, Young-In;Park, So-Jung;Kim, Byoung-Chul;Eom, Il-Kyu;Hwang, Min-Woo;Shin, Sang-Woo;Kim, Byung-Joo;Kwon, Young-Kyu;Chae, Han
    • Journal of Physiology & Pathology in Korean Medicine / v.23 no.6 / pp.1234-1240 / 2009
  • This study compared the effectiveness and validity of various data-mining algorithms for the Sasang type diagnostic test. We compared the sensitivity and specificity indices of nine attribute selection algorithms and eleven class classification algorithms on 31 data sets characterizing Sasang typology, using the 10-fold validation methods implemented in the Waikato Environment for Knowledge Analysis (WEKA). The highest classification validity scores were as follows: 69.9 as the Percentage Correctly Predicted (PCP) index with the Naive Bayes Classifier, 80 as the sensitivity index with LWL for the Tae-Eum type, and 93.5 as the specificity index with the Naive Bayes Classifier for the So-Eum type. The classification algorithm with the highest PCP index after attribute selection, 69.62, was the Naive Bayes Classifier. This study shows that the best-fit algorithm for traditional medicine is case-dependent, and that the characteristics of the clinical circumstances, the data-mining algorithms, and the study purpose should all be considered to obtain the highest validity, even with well-defined data sets. It also confirms that there is no one-size-fits-all algorithm and that many studies involving trial and error are needed. This study will serve as a pivotal foundation for the development of medical instruments for Pattern Identification and Sasang type diagnosis on the basis of traditional Korean Medicine.
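
The three indices reported in this abstract (PCP, sensitivity, and specificity for one type against the rest) can be illustrated with a short sketch. The labels and predictions below are hypothetical and are not taken from the study; the study itself used WEKA, not this code.

```python
def percent_correctly_predicted(actual, predicted):
    """PCP: percentage of cases whose predicted type equals the actual type."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)

def sensitivity_specificity(actual, predicted, positive):
    """One-vs-rest sensitivity and specificity for the type named `positive`."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both the positive type and the other types must occur")
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical diagnoses for ten subjects, for illustration only.
actual = ["Tae-Eum"] * 4 + ["So-Eum"] * 3 + ["So-Yang"] * 3
predicted = ["Tae-Eum", "Tae-Eum", "Tae-Eum", "So-Eum",
             "So-Eum", "So-Eum", "So-Yang",
             "So-Yang", "Tae-Eum", "So-Yang"]

pcp = percent_correctly_predicted(actual, predicted)
sens, spec = sensitivity_specificity(actual, predicted, "Tae-Eum")
print(f"PCP = {pcp:.1f}, sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```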

Performance Evaluation of Machine Learning and Deep Learning Algorithms in Crop Classification: Impact of Hyper-parameters and Training Sample Size (작물분류에서 기계학습 및 딥러닝 알고리즘의 분류 성능 평가: 하이퍼파라미터와 훈련자료 크기의 영향 분석)

  • Kim, Yeseul;Kwak, Geun-Ho;Lee, Kyung-Do;Na, Sang-Il;Park, Chan-Won;Park, No-Wook
    • Korean Journal of Remote Sensing / v.34 no.5 / pp.811-827 / 2018
  • The purpose of this study is to compare machine learning and deep learning algorithms in crop classification using multi-temporal remote sensing data. To this end, the impacts of (1) hyper-parameters and (2) training sample size on machine learning and deep learning algorithms were compared and analyzed for Haenam-gun, Korea and Illinois State, USA. In the comparison experiment, a support vector machine (SVM) was applied as the machine learning algorithm and a convolutional neural network (CNN) as the deep learning algorithm. In particular, a 2D-CNN considering 2-dimensional spatial information and a 3D-CNN extending the 2D-CNN with a time dimension were applied. The experiment showed that the hyper-parameter values of CNN selected in the two study areas, across the various hyper-parameters considered, were similar compared with those of SVM. Based on this result, although optimizing a CNN model takes much time, transfer learning that extends an optimized CNN model to other regions is considered feasible. In the experiments with various training sample sizes, the impact of sample size on CNN was larger than on SVM. This impact was especially pronounced in Illinois State, which has heterogeneous spatial patterns. In addition, the 3D-CNN showed the lowest classification performance in Illinois State, which is attributed to over-fitting caused by the complexity of the model. That is, classification performance was relatively degraded by the heterogeneous patterns and noise in the input data, although the training accuracy of the 3D-CNN model was high. This result implies that a proper classification algorithm should be selected considering the spatial characteristics of the study area. Also, a large number of training samples is necessary to guarantee higher classification performance in CNN, particularly in 3D-CNN.

A study on the rock mass classification in boreholes for a tunnel design using machine learning algorithms (머신러닝 기법을 활용한 터널 설계 시 시추공 내 암반분류에 관한 연구)

  • Lee, Je-Kyum;Choi, Won-Hyuk;Kim, Yangkyun;Lee, Sean Seungwon
    • Journal of Korean Tunnelling and Underground Space Association / v.23 no.6 / pp.469-484 / 2021
  • Rock mass classification results have a great influence on construction schedule and budget as well as on tunnel stability in tunnel design. A total of 3,526 tunnels have been constructed in Korea, and the associated techniques in tunnel design and construction have been continuously developed; however, few studies have addressed how to assess rock mass quality and grade more accurately. As a result, numerous cases show large differences in results depending on the inspectors' experience and judgement. This study therefore aims to suggest a more reliable rock mass classification (RMR) model using machine learning algorithms, which are surging in availability, through analyses based on various rock and rock mass information collected from boring investigations. For this, 11 learning parameters (depth, rock type, RQD, electrical resistivity, UCS, Vp, Vs, Young's modulus, unit weight, Poisson's ratio, RMR) from 13 local tunnel cases were selected, 337 learning data sets and 60 test data sets were prepared, and 6 machine learning algorithms (DT, SVM, ANN, PCA & ANN, RF, XGBoost) were tested with various hyperparameters for each algorithm. The results show that the mean absolute errors in RMR value for the five algorithms other than Decision Tree were less than 8, and that the Support Vector Machine model performed best. The applicability of the model established in this study was confirmed, and this prediction model can be applied for more reliable rock mass classification as additional data are continuously accumulated.

Accuracy Evaluation of Supervised Classification by Using Morphological Attribute Profiles and Additional Band of Hyperspectral Imagery (초분광 영상의 Morphological Attribute Profiles와 추가 밴드를 이용한 감독분류의 정확도 평가)

  • Park, Hong Lyun;Choi, Jae Wan
    • Journal of Korean Society for Geospatial Information Science / v.25 no.1 / pp.9-17 / 2017
  • Hyperspectral imagery is used in land cover classification together with principal component analysis and minimum noise fraction to reduce data dimensionality and noise. Recently, studies on supervised classification using various features that carry spectral information and spatial characteristics have been carried out. In this study, principal component bands and the normalized difference vegetation index (NDVI) were utilized in supervised classification of land cover. To exploit additional information from the hyperspectral imagery that is not included in the principal component bands, we tried to increase the classification accuracy by using the NDVI. In addition, the extended attribute profiles (EAP) generated using a morphological filter were used as input data. The random forest algorithm, a representative supervised classification method, was used. The classification accuracy obtained with various EAP-based features was compared. Two areas were selected for the experiments, and a quantitative evaluation was performed using reference data. The proposed algorithm showed the highest classification accuracies, 85.72% and 91.14%, compared with existing algorithms. Further research will need to develop a supervised classification algorithm and additional input datasets to improve the accuracy of land cover classification using hyperspectral imagery.
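
As a small illustration of the additional band mentioned in this abstract, NDVI is conventionally computed per pixel as (NIR - Red) / (NIR + Red). The sketch below appends it to a pixel's feature vector; the helper names and reflectance values are hypothetical, and which hyperspectral bands the authors used as NIR and Red is not stated in the abstract.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # assumed convention for pixels with no signal
    return (nir - red) / total

def add_ndvi_band(features, nir, red):
    """Return a new feature vector with NDVI appended as an extra band."""
    return list(features) + [ndvi(nir, red)]

# Hypothetical principal component values and reflectances for one pixel.
pc_bands = [0.82, -0.14, 0.03]
pixel = add_ndvi_band(pc_bands, nir=0.5, red=0.1)
print(pixel)
```

A vector like `pixel` would then be passed to the supervised classifier in place of the principal component bands alone.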

Application of Multispectral Remotely Sensed Imagery for the Characterization of Complex Coastal Wetland Ecosystems of southern India: A Special Emphasis on Comparing Soft and Hard Classification Methods

  • Shanmugam, Palanisamy;Ahn, Yu-Hwan;Sanjeevi, Shanmugam
    • Korean Journal of Remote Sensing / v.21 no.3 / pp.189-211 / 2005
  • This paper compares the recently evolved soft classification method based on Linear Spectral Mixture Modeling (LSMM) with the traditional hard classification methods based on the Iterative Self-Organizing Data Analysis (ISODATA) and Maximum Likelihood Classification (MLC) algorithms, in order to achieve appropriate results for mapping, monitoring, and preserving valuable coastal wetland ecosystems of southern India using Indian Remote Sensing Satellite (IRS) 1C/1D LISS-III and Landsat-5 Thematic Mapper image data. The ISODATA and MLC methods were applied to these satellite image data to produce maps of 5, 10, 15, and 20 wetland classes for each of three contrasting coastal wetland sites: Pitchavaram, Vedaranniyam, and Rameswaram. The accuracy of the derived classes was assessed with the simplest descriptive statistic, overall accuracy, and with a discrete multivariate technique, KAPPA accuracy. ISODATA classification produced maps with poor accuracy compared with MLC classification, which produced maps with improved accuracy. However, there was a systematic decrease in overall accuracy and KAPPA accuracy as a larger number of classes was derived from the IRS-1C/1D and Landsat-5 TM imagery by ISODATA and MLC. There were two principal factors behind the decreased classification accuracy, namely spectral overlapping/confusion and inadequate spatial resolution of the sensors. Compared to the former, the limited instantaneous field of view (IFOV) of these sensors caused a number of mixture pixels (mixels) to occur in the image, and their effect on the classification process was a major problem in deriving accurate wetland cover types, in spite of the increasing spatial resolution of new-generation Earth Observation Sensors (EOS).
In order to improve the classification accuracy, a soft classification method based on Linear Spectral Mixture Modeling (LSMM) was used to calculate the spectral mixture and classify the IRS-1C/1D LISS-III and Landsat-5 TM imagery. This method considers the number of reflectance end-members that form the scene spectra, followed by the determination of their nature and finally the decomposition of the spectra into their end-members. To evaluate the LSMM areal estimates, the resulting fractional end-members were compared with the normalized difference vegetation index (NDVI), ground truth data, and the estimates derived from the traditional hard classifier (MLC). The findings revealed that NDVI values and vegetation fractions were positively correlated ($r^2$ = 0.96, 0.95 and 0.92 for Rameswaram, Vedaranniyam and Pitchavaram respectively) and that NDVI and soil fraction values were negatively correlated ($r^2$ = 0.53, 0.39 and 0.13), indicating the reliability of the sub-pixel classification. Compared with ground truth data, the precision of LSMM was 92% for the moisture fraction and 96% for the soil fraction. In general, the LSMM seems well suited to locating small wetland habitats that occur as sub-pixel inclusions and to representing continuous gradations between different habitat types.
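
The two accuracy measures named in this abstract, overall accuracy and the kappa coefficient, are both derived from a confusion matrix. The sketch below uses the standard Cohen's kappa formula on a hypothetical 3-class matrix (rows are reference classes, columns are mapped classes); the numbers are illustrative and do not come from the paper.

```python
def overall_accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: product of row and column marginals, summed over classes.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# Hypothetical confusion matrix for three wetland classes, illustration only.
confusion = [
    [30, 5, 0],
    [4, 25, 6],
    [1, 5, 24],
]
accuracy, kappa = overall_accuracy_and_kappa(confusion)
print(f"overall accuracy = {accuracy:.2f}, kappa = {kappa:.3f}")
```

Kappa is lower than overall accuracy here because it discounts the agreement expected by chance, which is one reason the two measures are often reported together.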

A Study on the Hyperspectral Image Classification with the Iterative Self-Organizing Unsupervised Spectral Angle Classification (반복최적화 무감독 분광각 분류 기법을 이용한 하이퍼스펙트럴 영상 분류에 관한 연구)

  • Jo Hyun-Gee;Kim Dae-Sung;Yu Ki-Yun;Kim Yong-Il
    • Korean Journal of Remote Sensing / v.22 no.2 / pp.111-121 / 2006
  • Classification using the spectral angle is a new approach based on the fact that the spectra of the same type of surface object in remote sensing (RS) data are approximately linearly scaled variations of one another due to atmospheric and topographic effects. There have been many recent studies on unsupervised classification using the spectral angle. Nevertheless, only a few consider the characteristics of hyperspectral data. In this study, we propose ISOMUSAC (Iterative Self-Organizing Modified Unsupervised Spectral Angle Classification), which addresses the shortcomings of previous unsupervised spectral angle classification. ISOMUSAC uses Angle Division for the selection of seed points and calculates the centers of clusters using the spectral angle. In addition, ISOMUSAC performs iterative merging and splitting of clusters. As a result, the proposed algorithm reduces processing time and generates better classification results than previous unsupervised classification algorithms, as shown by visual and quantitative analysis. For quantitative comparison with previous unsupervised spectral angle classification, we propose a Validity Index using the spectral angle.
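
The property this abstract relies on, that linearly scaled spectra have a spectral angle near zero, can be checked with a short sketch. It shows only the angle measure and a nearest-center assignment step; it is not ISOMUSAC itself (Angle Division seeding, merging, and splitting are omitted), and the function names and spectra are hypothetical.

```python
import math

def spectral_angle(a, b):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cosine)))

def assign_to_nearest(pixel, centers):
    """Index of the cluster center with the smallest spectral angle to `pixel`."""
    return min(range(len(centers)), key=lambda i: spectral_angle(pixel, centers[i]))

# Hypothetical cluster centers and a darker (scaled-down) pixel of the first type.
centers = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
shaded_pixel = [0.2, 0.4, 0.6]

print(spectral_angle(centers[0], shaded_pixel))  # close to 0 despite the brightness change
print(assign_to_nearest(shaded_pixel, centers))
```

A Euclidean distance would place the shaded pixel far from its own cluster center, which is why an angle-based measure suits illumination-scaled spectra.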