• Title/Summary/Keyword: tree classification method

Search Result 361, Processing Time 0.028 seconds

A Novel Feature Selection Method in the Categorization of Imbalanced Textual Data

  • Pouramini, Jafar;Minaei-Bidgoli, Behrouze;Esmaeili, Mahdi
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.12 no.8
    • /
    • pp.3725-3748
    • /
    • 2018
  • Text data distribution is often imbalanced. Imbalanced data is one of the challenges in text classification, as it leads to the loss of performance of classifiers. Many studies have been conducted so far in this regard. The proposed solutions are divided into several general categories, include sampling-based and algorithm-based methods. In recent studies, feature selection has also been considered as one of the solutions for the imbalance problem. In this paper, a novel one-sided feature selection known as probabilistic feature selection (PFS) was presented for imbalanced text classification. The PFS is a probabilistic method that is calculated using feature distribution. Compared to the similar methods, the PFS has more parameters. In order to evaluate the performance of the proposed method, the feature selection methods including Gini, MI, FAST and DFS were implemented. To assess the proposed method, the decision tree classifications such as C4.5 and Naive Bayes were used. The results of tests on Reuters-21875 and WebKB figures per F-measure suggested that the proposed feature selection has significantly improved the performance of the classifiers.

A performance improvement methodology of web document clustering using FDC-TCT (FDC-TCT를 이용한 웹 문서 클러스터링 성능 개선 기법)

  • Ko, Suc-Bum;Youn, Sung-Dae
    • The KIPS Transactions:PartD
    • /
    • v.12D no.4 s.100
    • /
    • pp.637-646
    • /
    • 2005
  • There are various problems while applying classification or clustering algorithm in that document classification which requires post processing or classification after getting as a web search result due to my keyword. Among those, two problems are severe. The first problem is the need to categorize the document with the help of the expert. And, the second problem is the long processing time the document classification takes. Therefore we propose a new method of web document clustering which can dramatically decrease the number of times to calculate a document similarity using the Transitive Closure Tree(TCT) and which is able to speed up the processing without loosing the precision. We also compare the effectivity of the proposed method with those existing algorithms and present the experimental results.

Spherical Signature Description of 3D Point Cloud and Environmental Feature Learning based on Deep Belief Nets for Urban Structure Classification (도시 구조물 분류를 위한 3차원 점 군의 구형 특징 표현과 심층 신뢰 신경망 기반의 환경 형상 학습)

  • Lee, Sejin;Kim, Donghyun
    • The Journal of Korea Robotics Society
    • /
    • v.11 no.3
    • /
    • pp.115-126
    • /
    • 2016
  • This paper suggests the method of the spherical signature description of 3D point clouds taken from the laser range scanner on the ground vehicle. Based on the spherical signature description of each point, the extractor of significant environmental features is learned by the Deep Belief Nets for the urban structure classification. Arbitrary point among the 3D point cloud can represents its signature in its sky surface by using several neighborhood points. The unit spherical surface centered on that point can be considered to accumulate the evidence of each angular tessellation. According to a kind of point area such as wall, ground, tree, car, and so on, the results of spherical signature description look so different each other. These data can be applied into the Deep Belief Nets, which is one of the Deep Neural Networks, for learning the environmental feature extractor. With this learned feature extractor, 3D points can be classified due to its urban structures well. Experimental results prove that the proposed method based on the spherical signature description and the Deep Belief Nets is suitable for the mobile robots in terms of the classification accuracy.

Machine learning application to seismic site classification prediction model using Horizontal-to-Vertical Spectral Ratio (HVSR) of strong-ground motions

  • Francis G. Phi;Bumsu Cho;Jungeun Kim;Hyungik Cho;Yun Wook Choo;Dookie Kim;Inhi Kim
    • Geomechanics and Engineering
    • /
    • v.37 no.6
    • /
    • pp.539-554
    • /
    • 2024
  • This study explores development of prediction model for seismic site classification through the integration of machine learning techniques with horizontal-to-vertical spectral ratio (HVSR) methodologies. To improve model accuracy, the research employs outlier detection methods and, synthetic minority over-sampling technique (SMOTE) for data balance, and evaluates using seven machine learning models using seismic data from KiK-net. Notably, light gradient boosting method (LGBM), gradient boosting, and decision tree models exhibit improved performance when coupled with SMOTE, while Multiple linear regression (MLR) and Support vector machine (SVM) models show reduced efficacy. Outlier detection techniques significantly enhance accuracy, particularly for LGBM, gradient boosting, and voting boosting. The ensemble of LGBM with the isolation forest and SMOTE achieves the highest accuracy of 0.91, with LGBM and local outlier factor yielding the highest F1-score of 0.79. Consistently outperforming other models, LGBM proves most efficient for seismic site classification when supported by appropriate preprocessing procedures. These findings show the significance of outlier detection and data balancing for precise seismic soil classification prediction, offering insights and highlighting the potential of machine learning in optimizing site classification accuracy.

AP, IP Prediction For Corpus-based Korean Text-To-Speech (코퍼스 방식 음성합성에서의 개선된 운율구 경계 예측)

  • Kwon, O-Hil;Hong, Mun-Ki;Kang, Sun-Mee;Shin, Ji-Young
    • Speech Sciences
    • /
    • v.9 no.3
    • /
    • pp.25-34
    • /
    • 2002
  • One of the most important factor in the performance of Korean text-to-speech system is the prediction of accentual and intonational phrase boundary. The previous method of prediction shows only the 75-85% which is not proper in the practical and commercial system. Therefore, more accurate prediction must be needed in the practical system. In this study, we propose the simple and more accurate method of the prediction of AP, IP.

  • PDF

One Channel Five-Way Classification Algorithm For Automatically Classifying Speech

  • Lee, Kyo-Sik
    • The Journal of the Acoustical Society of Korea
    • /
    • v.17 no.3E
    • /
    • pp.12-21
    • /
    • 1998
  • In this paper, we describe the one channel five-way, V/U/M/N/S (Voice/Unvoice/Nasal/Silent), classification algorithm for automatically classifying speech. The decision making process is viewed as a pattern viewed as a pattern recognition problem. Two aspects of the algorithm are developed: feature selection and classifier type. The feature selection procedure is studied for identifying a set of features to make V/U/M/N/S classification. The classifiers used are a vector quantization (VQ), a neural network(NN), and a decision tree method. Actual five sentences spoken by six speakers, three male and three female, are tested with proposed classifiers. From a set of measurement tests, the proposed classifiers show fairly good accuracy for V/U/M/N/S decision.

  • PDF

Review of Korean Speech Act Classification: Machine Learning Methods

  • Kim, Hark-Soo;Seon, Choong-Nyoung;Seo, Jung-Yun
    • Journal of Computing Science and Engineering
    • /
    • v.5 no.4
    • /
    • pp.288-293
    • /
    • 2011
  • To resolve ambiguities in speech act classification, various machine learning models have been proposed over the past 10 years. In this paper, we review these machine learning models and present the results of experimental comparison of three representative models, namely the decision tree, the support vector machine (SVM), and the maximum entropy model (MEM). In experiments with a goal-oriented dialogue corpus in the schedule management domain, we found that the MEM has lighter hardware requirements, whereas the SVM has better performance characteristics.

A K-Nearest Neighbor Algorithm for Categorical Sequence Data (범주형 시퀀스 데이터의 K-Nearest Neighbor알고리즘)

  • Oh Seung-Joon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.10 no.2 s.34
    • /
    • pp.215-221
    • /
    • 2005
  • TRecently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. In this Paper, we study how to classify these sequence datasets. There are several kinds techniques for data classification such as decision tree induction, Bayesian classification and K-NN etc. In our approach, we use a K-NN algorithm for classifying sequences. In addition, we propose a new similarity measure to compute the similarity between two sequences and an efficient method for measuring similarity.

  • PDF

Analysis on the Structure of Plant Community in Mt. Yongmun by Classification and Ordination Techniques (Classification 및 Ordination 방법에 의한 융문산 삼림의 식물군집 구조분석)

  • 이경재
    • Journal of Plant Biology
    • /
    • v.33 no.3
    • /
    • pp.173-182
    • /
    • 1990
  • To investigate the structure of the plant community structure of Mt. Yongmun in Kyonggi-do, fifty-four plots were set up by the clumped sampling method. The classification by TWINSPAN and DCA ordination were applied to the study area in order to classify them into several groups based on woody plant and environmental variables. By both techniques, the plant community were divided into two groups by the aspect. the dominant species of south aspect were Pinus densiflora, Quercus aliena, Q. mongolica, Carpinus laxiflora and of north aspect were Q. ongolica, Fraxinus rhynchophylla. The successional trends of tree species in south aspect seem to be from P. densiflora through Q. serrata, Q. aliena, A. mongolica to C. laxiflora. As a result of the analysis for the relationship between the stand scores of DCA and environmental variables, they had a tendency to increase significantly from the P. densiflora and Q. mongolica community to C. laxiflora and F. rhynchophylla community that was the soil moisture, the amount of soil humus and soil pH.

  • PDF

Video retrieval method using non-parametric based motion classification (비-파라미터 기반의 움직임 분류를 통한 비디오 검색 기법)

  • Kim Nac-Woo;Choi Jong-Soo
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.43 no.2 s.308
    • /
    • pp.1-11
    • /
    • 2006
  • In this paper, we propose the novel video retrieval algorithm using non-parametric based motion classification in the shot-based video indexing structure. The proposed system firstly gets the key frame and motion information from each shot segmented by scene change detection method, and then extracts visual features and non-parametric based motion information from them. Finally, we construct real-time retrieval system supporting similarity comparison of these spatio-temporal features. After the normalized motion vector fields is created from MPEG compressed stream, the extraction of non-parametric based motion feature is effectively achieved by discretizing each normalized motion vectors into various angle bins, and considering a mean, a variance, and a direction of these bins. We use the edge-based spatial descriptor to extract the visual feature in key frames. Experimental evidence shows that our algorithm outperforms other video retrieval methods for image indexing and retrieval. To index the feature vectors, we use R*-tree structures.