• Title/Summary/Keyword: Text-based classification

Text Categorization for Authorship based on the Features of Lingual Conceptual Expression

  • Zhang, Quan;Zhang, Yun-liang;Yuan, Yi
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.515-521 / 2007
  • Text categorization is an important field in automatic text information processing, and authorship identification can be treated as a special case of text categorization. This paper adopts conceptual-primitive expressions based on the Hierarchical Network of Concepts (HNC) theory, which describes word meanings with hierarchical symbols, to avoid the data sparsity caused by natural-language surface features in text categorization. The KNN algorithm is used as the classifier. Experiments on Chinese text authorship identification show that the proposed approach achieves a high accuracy, so it is feasible for text authorship identification.

Context-based classification for harmful web documents and comparison of feature selecting algorithms

  • Kim, Young-Soo;Park, Nam-Je;Hong, Do-Won;Won, Dong-Ho
    • Journal of Korea Multimedia Society / v.12 no.6 / pp.867-875 / 2009
  • More and richer information sources and services become available on the web every day. However, harmful information, such as adult content, is not appropriate for all users, notably children. Because the internet is a worldwide open network, there is a limit to how far each country's national laws or systems can regulate users who provide harmful content. Nor is it desirable to develop system-specific classification technology for harmful content, because internet users can encounter it in diverse ways, for example through porn sites, harmful spam, or peer-to-peer networks. Research and development of context-based core technologies for classifying harmful content is therefore being emphasized. In this paper, we propose an efficient text filter that blocks harmful text in web documents using context-based technologies. Through implementation and experiment, we also examine which feature selection algorithms are suitable for classifying harmful content, where feature selection is the process of choosing, from all content terms occurring in the documents, those terms that are useful as features for text categorization.

Performance Comparison of Naive Bayesian Learning and Centroid-Based Classification for e-Mail Classification (전자메일 분류를 위한 나이브 베이지안 학습과 중심점 기반 분류의 성능 비교)

  • Kim, Kuk-Pyo;Kwon, Young-S.
    • IE interfaces / v.18 no.1 / pp.10-21 / 2005
  • With the proliferation of the World Wide Web, electronic mail has become a very widely used communication tool. Research on e-mail classification is important because an e-mail classification system is a major engine of e-mail response management systems, which mine unstructured e-mail messages and categorize them automatically. In this research, we compare the performance of Naive Bayesian learning and Centroid-Based Classification on two different data sets, one from an on-line shopping mall and one from a credit card company, and analyze which method performs better under which conditions. We compare their classification accuracy as a function of the structure and size of the training set and of an increasing number of classes. The experimental results indicate that Naive Bayesian learning performs better, while Centroid-Based Classification is more robust in terms of classification accuracy.
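
To make the two methods compared in this abstract concrete, the following is a minimal sketch of each, not the authors' implementation. It assumes bag-of-words features, a multinomial Naive Bayes model with Laplace smoothing, and a centroid classifier that scores by cosine similarity; the toy e-mails, the labels, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy e-mails; these are not data from the paper.
TRAIN = [
    ("refund order delivery late", "shopping"),
    ("order cancel refund item", "shopping"),
    ("card payment declined limit", "card"),
    ("credit card limit increase", "card"),
]


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(samples):
    """Collect per-class word counts, class document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_naive_bayes(model, text):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def train_centroids(samples):
    """Sum the term-frequency vectors of each class.

    Cosine similarity ignores vector length, so the sum ranks classes
    exactly as the mean (the centroid proper) would.
    """
    centroids = defaultdict(Counter)
    for text, label in samples:
        centroids[label].update(tokenize(text))
    return centroids


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_centroid(centroids, text):
    """Return the class whose centroid is most similar to the message."""
    vec = Counter(tokenize(text))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


nb_model = train_naive_bayes(TRAIN)
centroid_model = train_centroids(TRAIN)
```

Calling `predict_naive_bayes(nb_model, "refund for late order")` and `predict_centroid(centroid_model, "refund for late order")` both return `"shopping"` on this toy set. Differences between the two methods only show up on larger, less separable data such as that described in the abstract.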

Novel Optimizer AdamW+ implementation in LSTM Model for DGA Detection

  • Awais Javed;Adnan Rashdi;Imran Rashid;Faisal Amir
    • International Journal of Computer Science & Network Security / v.23 no.11 / pp.133-141 / 2023
  • This work takes a deeper look at the implementation of Adaptive Moment Estimation (Adam) and Adam with Weight Decay (AdamW) in a real-world text classification problem, DGA malware detection. AdamW was introduced as an improved optimizer that decouples weight decay from L2 regularization. This work introduces AdamW+, a novel variant that further simplifies the weight decay implementation in AdamW. LSTM models for DGA malware detection trained with Adam, AdamW, and AdamW+ are evaluated on various DGA families/groups as a multiclass text classification task. The proposed AdamW+ optimizer shows improvement in all standard performance metrics over Adam and AdamW. Analysis of the outcomes shows that the novel optimizer outperforms both Adam and AdamW on text-classification-based problems.
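
The distinction this abstract draws between L2 regularization and decoupled weight decay can be illustrated with a single scalar parameter. The sketch below follows the standard Adam and AdamW update rules. It does not implement AdamW+, whose simplification the abstract does not specify. The function names, the `state` dictionary, and the hyperparameter values are illustrative assumptions.

```python
import math


def new_state():
    """Fresh optimizer state for one scalar parameter (hypothetical helper)."""
    return {"t": 0, "m": 0.0, "v": 0.0}


def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One update of a scalar parameter.

    decoupled=False: Adam with L2 regularization; the decay term is added
                     to the gradient and passes through the moment estimates.
    decoupled=True:  AdamW; the decay is applied directly to the parameter,
                     outside the adaptive gradient step.
    """
    state["t"] += 1
    if not decoupled:
        grad = grad + weight_decay * theta  # L2 penalty folded into the gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    new_theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_theta -= lr * weight_decay * theta  # decay is not rescaled by v_hat
    return new_theta


# With a zero loss gradient, only the decay term moves the parameter.
coupled = adam_step(1.0, 0.0, new_state(), lr=0.1, weight_decay=0.1)
decoupled = adam_step(1.0, 0.0, new_state(), lr=0.1, weight_decay=0.1,
                      decoupled=True)
```

In this zero-gradient case the decoupled step shrinks the parameter by `lr * weight_decay` (to 0.99), while the coupled step is normalized by the second-moment estimate and moves it by roughly the full learning rate (to about 0.9). This is the interaction that decoupling is meant to remove.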

Chinese Prosody Generation Based on C-ToBI Representation for Text-to-Speech (음성합성을 위한 C-ToBI기반의 중국어 운율 경계와 F0 contour 생성)

  • Kim, Seung-Won;Zheng, Yu;Lee, Gary-Geunbae;Kim, Byeong-Chang
    • MALSORI / no.53 / pp.75-92 / 2005
  • Prosody modeling is critical in developing text-to-speech (TTS) systems, where speech synthesis is used to automatically generate natural speech. In this paper, we present a prosody generation architecture based on the Chinese Tone and Break Index (C-ToBI) representation. ToBI is a multi-tier representation system based on linguistic knowledge to transcribe events in an utterance. TTS systems that adopt ToBI as an intermediate representation are known to exhibit higher flexibility, modularity, and domain/task portability than TTS systems with direct prosody generation. However, corpus preparation for practical-level performance is very expensive, because a ToBI-labeled corpus has to be constructed manually by many prosody experts and accurate statistical prosody modeling normally requires a large amount of data. This paper proposes a new method that transcribes C-ToBI labels automatically in Chinese speech. We model Chinese prosody generation as a classification problem and apply conditional Maximum Entropy (ME) classification to it. We empirically verify the usefulness of various natural language and phonology features in order to build well-integrated features for the ME framework.

Optimizing Input Parameters of Paralichthys olivaceus Disease Classification based on SHAP Analysis (SHAP 분석 기반의 넙치 질병 분류 입력 파라미터 최적화)

  • Kyung-Won Cho;Ran Baik
    • The Journal of the Korea institute of electronic communication sciences / v.18 no.6 / pp.1331-1336 / 2023
  • In text-based fish disease classification using machine learning, the model has too many input parameters, yet they cannot be reduced arbitrarily without hurting performance. To solve this problem, this paper proposes a method for optimizing input parameters, specialized for Paralichthys olivaceus disease classification, using SHAP analysis. The proposed method preprocesses disease information extracted from the halibut disease questionnaire by applying the SHAP analysis technique and evaluates a machine learning model using AutoML. Through this, the performance of the AutoML input parameters is evaluated and the optimal input parameter combination is derived. The proposed method is expected to maintain the existing performance while reducing the number of required input parameters, which will contribute to enhancing the efficiency and practicality of text-based Paralichthys olivaceus disease classification.

Classifying Biomedical Literature Providing Protein Function Evidence

  • Lim, Joon-Ho;Lee, Kyu-Chul
    • ETRI Journal / v.37 no.4 / pp.813-823 / 2015
  • Because proteins are the primary elements responsible for biological and biochemical roles in living bodies, protein function is core, basic information for biomedical studies. However, recent advances in biotechnology have created an explosive increase in the amount of published literature, so biomedical researchers have a hard time finding the protein function information they need. In this paper, a classification system for biomedical literature providing protein function evidence is proposed. Note that, despite our best efforts, we have been unable to find previous studies on the proposed issue. To classify papers based on protein function evidence, we should consider whether the main claim of a paper is to assert a protein function. We therefore propose two novel features: protein and assertion. Our experimental results show a classification performance with 71.89% precision, 90.0% recall, and a 79.94% F-measure. In addition, to verify the usefulness of the proposed classification system, two case study applications are investigated: information retrieval for protein function and automatic summarization of protein function text. It is shown that the proposed classification system can be successfully applied to these applications.
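
The F-measure quoted in this abstract can be checked against the reported precision and recall. The sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
f1 = f_measure(0.7189, 0.900)
print(f"F1 = {f1:.4f}")
```

The result is about 0.7993, which agrees with the reported 79.94% to within the rounding of the published precision and recall figures.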

Building a Hierarchy of Product Categories through Text Analysis of Product Description (텍스트 분석을 통한 제품 분류 체계 수립방안: 관광분야 App을 중심으로)

  • Lim, Hyuna;Choi, Jaewon;Lee, Hong Joo
    • Knowledge Management Research / v.20 no.3 / pp.139-154 / 2019
  • With the increasing use of smartphone apps, many apps are being released in various fields. To analyze the current status and trends of apps in a specific field, it is necessary to establish a classification scheme. Various schemes considering users' behavior and the characteristics of apps have been proposed, but because many apps are continually released, a fixed classification scheme must be updated over time. Although many aspects must be considered in establishing a classification scheme, a scheme based on the characteristics of apps makes it possible to grasp app trends. This research proposes a method of establishing an app classification scheme from the descriptions written by app developers. For this purpose, we collected descriptions of apps in the tourism field and identified major categories through topic modeling. Using only the apps corresponding to each topic, we constructed a network of the words contained in the descriptions and identified subcategories based on these word networks. Six topics were selected, and the Clauset-Newman-Moore algorithm was applied to each topic to identify subcategories. Four or five subcategories were identified for each topic.

Topic Extraction and Classification Method Based on Comment Sets

  • Tan, Xiaodong
    • Journal of Information Processing Systems / v.16 no.2 / pp.329-342 / 2020
  • In recent years, emotional text classification has become one of the essential research topics in natural language processing. It has been widely used in the sentiment analysis of reviews of commodities such as hotels and of other commentary corpora. This paper proposes an improved W-LDA (weighted latent Dirichlet allocation) topic model to address the shortcomings of traditional LDA topic models. In the Gibbs sampling of topic words and the calculation of the expected word distribution in the W-LDA topic model, an average weighted value is adopted to prevent topic-related words from being submerged by high-frequency words and thus to improve topic distinction. The method further integrates a support vector machine classifier built on the extracted high-quality document-topic distributions and topic-word vectors. Finally, an efficient integrated method is constructed for the analysis and extraction of emotional words, topic distribution calculation, and sentiment classification. Tests on real teaching evaluation data and on a public comment test set show that the proposed method has distinct advantages over two other typical algorithms in terms of topic differentiation, classification precision, and F1-measure.

Modality-Based Sentence-Final Intonation Prediction for Korean Conversational-Style Text-to-Speech Systems

  • Oh, Seung-Shin;Kim, Sang-Hun
    • ETRI Journal / v.28 no.6 / pp.807-810 / 2006
  • This letter presents a prediction model for sentence-final intonations for Korean conversational-style text-to-speech systems in which we introduce the linguistic feature of 'modality' as a new parameter. Based on their function and meaning, we classify tonal forms in speech data into tone types meaningful for speech synthesis and use the result of this classification to build our prediction model using a tree structured classification algorithm. In order to show that modality is more effective for the prediction model than features such as sentence type or speech act, an experiment is performed on a test set of 970 utterances with a training set of 3,883 utterances. The results show that modality makes a higher contribution to the determination of sentence-final intonation than sentence type or speech act, and that prediction accuracy improves up to 25% when the feature of modality is introduced.
