• Title/Summary/Keyword: Text Classifier

Classifications of Hadiths based on Supervised Learning Techniques

  • AbdElaal, Hammam M.;Bouallegue, Belgacem;Elshourbagy, Motasem;Matter, Safaa S.;AbdElghfar, Hany A.;Khattab, Mahmoud M.;Ahmed, Abdelmoty M.
    • International Journal of Computer Science & Network Security / v.22 no.11 / pp.1-10 / 2022
  • This study aims to build a model capable of classifying hadiths both by the reliability of their narrators (sahih, hassan, da'if, maudu) and by what was attributed to the Prophet Muhammad (saying, doing, describing, reporting) using supervised learning algorithms. Based on the outputs of this model, it seeks to discover a relationship between these two classifications using statistical methods such as chi-square, information gain, and association rules, which might help avoid controversy and unproductive debate over the automatic classification of hadith. The experimental results showed that the classifications are related: most sahih hadiths belong to the saying class, and most maudu hadiths belong to the reporting class. The most accurate classifier was MultinomialNB, which reached an accuracy of up to 0.9708, owing to its ability to handle high-dimensional problems and to identify, during training, the features most relevant to the target data. It was followed by LinearSVC, which reached up to 0.9655, and finally KNeighborsClassifier, which reached up to 0.9644.
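
The classifier named MultinomialNB in the abstract above is a multinomial Naive Bayes model. The following is a minimal from-scratch sketch of that kind of model, not the authors' implementation: the function names, the Laplace smoothing parameter `alpha`, and the toy tokens and labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed word likelihoods from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_like": {w: math.log((word_counts[label][w] + alpha) / denom)
                         for w in vocab},
        }
    return model

def predict(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_like"][w] for w in doc if w in params["log_like"])
    return max(model, key=score)

# Invented toy data: token lists with hypothetical class labels.
docs = [["said", "prophet", "said"], ["said", "told"],
        ["narrated", "reported"], ["reported", "heard", "reported"]]
labels = ["saying", "saying", "reporting", "reporting"]
model = train_multinomial_nb(docs, labels)
print(predict(model, ["said", "heard", "said"]))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.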

A Hierarchical Text Rating System for Objectionable Documents

  • Jeong, Chi-Yoon;Han, Seung-Wan;Nam, Taek-Yong
    • Journal of Information Processing Systems / v.1 no.1 s.1 / pp.22-26 / 2005
  • In this paper, we classify objectionable texts into four rates according to their harmfulness and propose a hierarchical text rating system for objectionable documents. Because documents in the same category share similar words, expressions, and document structure, a text rating system that uses a single classification model has low accuracy. To solve this problem, we separate objectionable documents into several subsets according to their properties and then classify the subsets hierarchically. The proposed system consists of three layers. In each layer, we select features using the chi-square statistic, and the feature weights, calculated with the TF-IDF weighting scheme, are used as input to a non-linear SVM classifier. By using different features and a different number of features in each layer, the hierarchical scheme can characterize the objectionability of documents more effectively and is expected to improve the performance of the rating system. We compared the proposed system with several other text rating systems, and the experimental results show that the proposed system achieves excellent classification performance.
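
The feature pipeline described in the abstract above (chi-square selection followed by TF-IDF weighting) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's system: the function names, the 2x2 contingency formulation, the `tf * log(N / df)` variant of TF-IDF, and the toy documents are assumptions, and the non-linear SVM stage is omitted.

```python
import math

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class table.
    n11: in-class docs with the term, n10: out-of-class docs with the term,
    n01: in-class docs without it,    n00: out-of-class docs without it."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / denom if denom else 0.0

def select_features(docs, labels, target, k):
    """Return the k terms with the highest chi-square score for class `target`."""
    vocab = sorted({w for d in docs for w in d})
    scores = {}
    for term in vocab:
        n11 = sum(1 for d, y in zip(docs, labels) if term in d and y == target)
        n10 = sum(1 for d, y in zip(docs, labels) if term in d and y != target)
        n01 = sum(1 for d, y in zip(docs, labels) if term not in d and y == target)
        n00 = len(docs) - n11 - n10 - n01
        scores[term] = chi_square(n11, n10, n01, n00)
    return sorted(vocab, key=lambda t: -scores[t])[:k]

def tf_idf(doc, docs, features):
    """Weight each selected feature by term frequency times log(N / df)."""
    weights = []
    for term in features:
        df = sum(1 for d in docs if term in d)
        weights.append(doc.count(term) * math.log(len(docs) / df) if df else 0.0)
    return weights

# Invented toy corpus with hypothetical labels.
docs = [["free", "adult", "content"], ["adult", "video"],
        ["news", "weather"], ["weather", "report", "news"]]
labels = ["objectionable", "objectionable", "ok", "ok"]
features = select_features(docs, labels, "objectionable", k=3)
print(features, tf_idf(docs[0], docs, features))
```

In a layered system such as the one described, `select_features` would presumably be called once per layer with a different `k`, and the resulting weight vectors would be passed to that layer's classifier.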

Robust Algorithms for Combining Multiple Term Weighting Vectors for Document Classification

  • Kim, Minyoung
    • International Journal of Fuzzy Logic and Intelligent Systems / v.16 no.2 / pp.81-86 / 2016
  • Term weighting is a popular technique that effectively weighs the term features to improve accuracy in document classification. While several successful term weighting algorithms have been suggested, none of them appears to perform well consistently across different data domains. In this paper we propose several reasonable methods to combine different term weight vectors to yield a robust document classifier that performs consistently well on diverse datasets. Specifically we suggest two approaches: i) learning a single weight vector that lies in a convex hull of the base vectors while minimizing the class prediction loss, and ii) a mini-max classifier that aims for robustness of the individual weight vectors by minimizing the loss of the worst-performing strategy among the base vectors. We provide efficient solution methods for these optimization problems. The effectiveness and robustness of the proposed approaches are demonstrated on several benchmark document datasets, significantly outperforming the existing term weighting methods.
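
The two ideas named in the abstract above, a weight vector inside the convex hull of the base vectors and a mini-max choice against the worst-performing strategy, can be illustrated with a small sketch. The helper names and all numbers are hypothetical, and the paper's actual optimization procedures are not reproduced here.

```python
def convex_combination(base_vectors, coeffs):
    """Mix base term-weight vectors with non-negative coefficients summing to 1."""
    if any(c < 0 for c in coeffs) or abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")
    dim = len(base_vectors[0])
    return [sum(c * v[i] for c, v in zip(coeffs, base_vectors)) for i in range(dim)]

def minimax_choice(losses_by_candidate):
    """Pick the candidate whose worst loss across the base weightings is smallest."""
    return min(losses_by_candidate, key=lambda name: max(losses_by_candidate[name]))

# Two hypothetical base weight vectors over a two-term vocabulary.
mixed = convex_combination([[1.0, 0.0], [0.0, 2.0]], [0.25, 0.75])
print(mixed)

# Hypothetical per-weighting losses for two candidate classifiers.
print(minimax_choice({"a": [0.1, 0.9], "b": [0.4, 0.5]}))
```

Candidate "a" has the lower best-case loss, but "b" is preferred under the mini-max criterion because its worst case (0.5) is smaller than that of "a" (0.9).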

Performance Comparison of Naive Bayesian Learning and Centroid-Based Classification for e-Mail Classification (전자메일 분류를 위한 나이브 베이지안 학습과 중심점 기반 분류의 성능 비교)

  • Kim, Kuk-Pyo;Kwon, Young-S.
    • IE interfaces / v.18 no.1 / pp.10-21 / 2005
  • With the proliferation of the World Wide Web, electronic mail systems have become very widely used communication tools. Research on e-mail classification is important because an e-mail classification system is the major engine of e-mail response management systems, which mine unstructured e-mail messages and categorize them automatically. In this research, we compare the performance of Naive Bayesian learning and Centroid-Based Classification using different data sets from an online shopping mall and a credit card company, and we analyze which method performs better under which conditions. We compare their classification accuracy as a function of the structure and size of the training set and of an increasing number of classes. The experimental results indicate that Naive Bayesian learning performs better, while Centroid-Based Classification is more robust in terms of classification accuracy.
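
As an illustration of the Centroid-Based Classification compared in the abstract above, the following is a minimal sketch. It assumes raw term counts as document vectors and cosine similarity as the closeness measure; the function names and the toy e-mail tokens are invented and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def centroid_train(docs, labels):
    """Average the term-count vectors of each class into one centroid per class."""
    sums = defaultdict(Counter)
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        sums[label].update(doc)
    return {label: {w: c / counts[label] for w, c in tf.items()}
            for label, tf in sums.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid_predict(centroids, doc):
    """Assign the document to the class whose centroid is most similar."""
    vec = Counter(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Invented toy e-mail tokens with hypothetical class labels.
docs = [["refund", "order"], ["order", "delivery"],
        ["card", "limit"], ["card", "payment"]]
labels = ["shopping", "shopping", "card", "card"]
centroids = centroid_train(docs, labels)
print(centroid_predict(centroids, ["order", "refund", "late"]))
```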

Hybrid Approach of Texture and Connected Component Methods for Text Extraction in Complex Images (복잡한 영상 내의 문자영역 추출을 위한 텍스춰와 연결성분 방법의 결합)

  • 정기철
    • Journal of the Institute of Electronics Engineers of Korea SP / v.41 no.6 / pp.175-186 / 2004
  • We present a hybrid approach that combines texture-based and connected component (CC)-based methods for text extraction in complex images. These two primary methods, which are the ones mainly used in this area, are applied sequentially so that each compensates for the other's weak points. An automatically constructed MLP-based texture classifier increases recall rates for complex images with a small amount of user intervention and without explicit feature extraction. CC-based filtering, which relies on shape information obtained using NMF, enhances the precision rate without affecting overall performance. As a result, the combination of texture-based and CC-based methods leads to text extraction that is both robust and efficient. We also improve processing speed by adopting appropriate region marking methods for each input image category.

A Step towards the Improvement in the Performance of Text Classification

  • Hussain, Shahid;Mufti, Muhammad Rafiq;Sohail, Muhammad Khalid;Afzal, Humaira;Ahmad, Ghufran;Khan, Arif Ali
    • KSII Transactions on Internet and Information Systems (TIIS) / v.13 no.4 / pp.2162-2179 / 2019
  • The performance of text classification is highly dependent on the feature selection method. Usually, two tasks are performed when a feature selection method is applied to construct a feature set: 1) assign a score to each feature and 2) select the top-N features. In existing filter-based feature selection methods, the selection of the top-N features is biased by the features' discriminative power and by the empirical process used to determine the value of N. To improve text classification performance by presenting a more illustrative feature set, we present an approach based on a potent representation learning technique, namely the DBN (Deep Belief Network). This algorithm learns from the semantic illustration of documents and uses feature vectors for their formulation. The number of nodes, the number of iterations, and the number of hidden layers are the main parameters of the DBN, and they can be tuned to improve the classifier's performance. The experimental results indicate that the proposed method is effective at increasing classification performance and at helping developers make effective decisions in certain domains.

Predicting numeric ratings for Google apps using text features and ensemble learning

  • Umer, Muhammad;Ashraf, Imran;Mehmood, Arif;Ullah, Saleem;Choi, Gyu Sang
    • ETRI Journal / v.43 no.1 / pp.95-108 / 2021
  • Application (app) ratings are feedback provided voluntarily by users and serve as important evaluation criteria for apps. However, these ratings can often be biased owing to insufficient or missing votes. Additionally, significant differences have been observed between numeric ratings and user reviews. This study aims to predict the numeric ratings of Google apps using machine learning classifiers. It exploits numeric app ratings provided by users as training data and returns authentic mobile app ratings by analyzing user reviews. An ensemble learning model is proposed for this purpose that considers term frequency/inverse document frequency (TF/IDF) features. Three TF/IDF features, including unigrams, bigrams, and trigrams, were used. The dataset was scraped from the Google Play store, extracting data from 14 different app categories. Biased and unbiased user ratings were discriminated using TextBlob analysis to formulate the ground truth, from which the classifier prediction accuracy was then evaluated. The results demonstrate the high potential for machine learning-based classifiers to predict authentic numeric ratings based on actual user reviews.
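
The unigram, bigram, and trigram features mentioned in the abstract above can be extracted as in the sketch below. It only counts word n-grams; the TF/IDF weighting and the ensemble model are omitted, and the function names and the review text are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(tokens, max_n=3):
    """Count unigrams, bigrams, and trigrams (up to max_n) in one review."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

review = "not a good app not good".split()  # invented review text
counts = ngram_counts(review)
print(counts["not"], counts["not a"], counts["not a good"])
```

Higher-order n-grams keep some word order, so a phrase such as "not good" becomes a feature of its own instead of being split into two unrelated unigrams.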

Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets

  • Mehmet F. Karaca
    • KSII Transactions on Internet and Information Systems (TIIS) / v.18 no.3 / pp.591-609 / 2024
  • In this study, all combinations of preprocessing steps were examined for their effects on reducing the number of words, shortening processing time, and classification success, in balanced datasets and in datasets imbalanced at different ratios. The reductions in word count and in processing time provided by preprocessing were interrelated. Classification was more successful on the Turkish datasets, and the English datasets were affected more by whether the dataset was balanced. In highly imbalanced datasets, documents from classes with few documents were misclassified differently by language: in the Turkish datasets they were assigned to the class closest in topic to the correct class, and in the English datasets to the class with many documents. In terms of average scores, the best classification in the Turkish datasets was obtained without lowercasing, with stemming, and with stop-word removal, and in the English datasets with lowercasing, stemming, and stop-word removal. Stemming was the preprocessing method that most increased success in the Turkish datasets, whereas stop-word removal was the most important in the English datasets. The maximum scores revealed that feature selection, feature size, and the classifier affect classification success more than preprocessing does. It was concluded that preprocessing is necessary for text classification because it shortens processing time and can achieve high classification success, that a given preprocessing method does not have the same effect in all languages, and that different preprocessing methods are more successful for different languages.
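
Examining every combination of preprocessing steps, as the abstract above describes, amounts to enumerating on/off flags for each step. The sketch below does this for three steps named in the abstract (lowercasing, stop-word removal, stemming). The tiny stop-word list and the suffix-stripping `naive_stem` are crude stand-ins chosen for illustration, not the tools used in the study.

```python
from itertools import product

STOP_WORDS = {"the", "is", "a", "of"}  # tiny illustrative list, an assumption

def naive_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens, lowercase, remove_stop, stem):
    """Apply the enabled preprocessing steps in a fixed order."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stop:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens

tokens = ["The", "classifier", "is", "processing", "documents"]
combinations = list(product([False, True], repeat=3))  # 2^3 = 8 settings
for flags in combinations:
    print(flags, preprocess(tokens, *flags))
```

Each of the eight outputs could then be fed to the same feature selection and classifier so that only the preprocessing differs between runs.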

Text Detection and Recognition in Outdoor Korean Signboards for Mobile System Applications (모바일 시스템 응용을 위한 실외 한국어 간판 영상에서 텍스트 검출 및 인식)

  • Park, J.H.;Lee, G.S.;Kim, S.H.;Lee, M.H.;Toan, N.D.
    • Journal of the Institute of Electronics Engineers of Korea CI / v.46 no.2 / pp.44-51 / 2009
  • Text understanding in natural images has become an active research field in the past few decades. In this paper, we present an automatic recognition system for Korean signboards with complex backgrounds. The proposed algorithm includes detection, binarization, and extraction of text for the recognition of shop names. First, we use an elaborate detection algorithm to find candidate text regions based on edge histograms in the vertical and horizontal directions, and each detected text region is segmented by a clustering method. Second, the text is divided into individual characters based on connected components whose center of mass lies below the center line, and the characters are recognized using a minimum distance classifier. A shape-based statistical feature, which is well suited to Korean character recognition, is adopted. The system has been implemented on a mobile phone and shows acceptable performance.
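
A minimum distance classifier of the kind mentioned in the abstract above assigns a sample to the class whose prototype is nearest. The sketch below assumes class means as prototypes and Euclidean distance; the two-dimensional feature vectors and class labels are invented and do not represent the paper's shape-based features.

```python
import math
from collections import defaultdict

def class_means(samples, labels):
    """Compute the mean feature vector of each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in grouped.items()}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def min_distance_classify(means, x):
    """Return the class whose mean vector is closest to x."""
    return min(means, key=lambda label: euclidean(x, means[label]))

# Hypothetical 2-D feature vectors for two character classes.
samples = [(0.0, 0.0), (0.0, 2.0), (4.0, 4.0), (6.0, 4.0)]
labels = ["A", "A", "B", "B"]
means = class_means(samples, labels)
print(min_distance_classify(means, (1.0, 1.0)))
```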

Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation (XGBoost와 교차검증을 이용한 품사부착말뭉치에서의 오류 탐지)

  • Choi, Min-Seok;Kim, Chang-Hyun;Park, Ho-Min;Cheon, Min-Ah;Yoon, Ho;Namgoong, Young;Kim, Jae-Kyun;Kim, Jae-Hoon
    • KIPS Transactions on Software and Data Engineering / v.9 no.7 / pp.221-228 / 2020
  • A Part-of-Speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its POS tag, and it is widely used as training data for natural language processing. Such training data is generally assumed to be error-free, but in reality it contains various types of errors, which degrade the performance of systems trained on it. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation. We first train a POS-tagging classifier on the POS-tagged corpus, which contains some errors, and then apply it to the same corpus through cross-validation; however, the classifier cannot detect errors directly because there is no training data for POS tagging errors. We therefore detect errors by comparing the outputs (POS probabilities) of the classifier while adjusting hyperparameters. The hyperparameters are estimated on a small error-tagged corpus, in which text is sampled from the POS-tagged corpus and POS errors are marked up by experts. We use recall and precision, which are widely used in information retrieval, as evaluation metrics. Because not all detected errors can be checked, we show that the proposed method is valid by comparing the distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In the near future, we will apply the proposed method to a dependency tree-tagged corpus and a semantic role-tagged corpus.
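
The general idea in the abstract above, scoring each annotated tag with a model trained on the other folds and flagging low-probability annotations, can be sketched as follows. This is not the authors' method: a per-word tag frequency table stands in for the XGBoost classifier, the fixed `threshold` is an assumed decision rule, and the toy corpus is invented.

```python
from collections import Counter, defaultdict

def train_freq_tagger(pairs):
    """Stand-in model: count how often each word carries each tag."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return counts

def tag_probability(model, word, tag):
    """Relative frequency of `tag` for `word`, or None if the word is unseen."""
    if word not in model:
        return None
    return model[word][tag] / sum(model[word].values())

def detect_errors(corpus, k=3, threshold=0.3):
    """Flag held-out tokens whose annotated tag looks improbable to the fold model."""
    suspects = []
    for fold in range(k):
        train = [corpus[i] for i in range(len(corpus)) if i % k != fold]
        model = train_freq_tagger(train)
        for i in range(fold, len(corpus), k):  # held-out indices of this fold
            word, tag = corpus[i]
            p = tag_probability(model, word, tag)
            if p is not None and p < threshold:
                suspects.append(i)
    return sorted(suspects)

# Invented toy corpus of (word, tag) pairs with two planted inconsistencies.
corpus = [("run", "VERB"), ("run", "VERB"), ("run", "VERB"), ("run", "VERB"),
          ("run", "NOUN"), ("dog", "NOUN"), ("dog", "NOUN"), ("dog", "NOUN"),
          ("dog", "VERB")]
print(detect_errors(corpus))
```

In this sketch the threshold plays the role of a tunable hyperparameter; the abstract indicates that the authors estimate their hyperparameters on a small expert-annotated error-tagged corpus.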