• Title/Summary/Keyword: Text features

Search Result 580, Processing Time 0.031 seconds

Text Classification with Heterogeneous Data Using Multiple Self-Training Classifiers

  • William Xiu Shun Wong;Donghoon Lee;Namgyu Kim
    • Asia pacific journal of information systems
    • /
    • v.29 no.4
    • /
    • pp.789-816
    • /
    • 2019
  • Text classification is a challenging task, especially when dealing with a huge amount of text data. The performance of a classification model can be varied depending on what type of words contained in the document corpus and what type of features generated for classification. Aside from proposing a new modified version of the existing algorithm or creating a new algorithm, we attempt to modify the use of data. The classifier performance is usually affected by the quality of learning data as the classifier is built based on these training data. We assume that the data from different domains might have different characteristics of noise, which can be utilized in the process of learning the classifier. Therefore, we attempt to enhance the robustness of the classifier by injecting the heterogeneous data artificially into the learning process in order to improve the classification accuracy. Semi-supervised approach was applied for utilizing the heterogeneous data in the process of learning the document classifier. However, the performance of document classifier might be degraded by the unlabeled data. Therefore, we further proposed an algorithm to extract only the documents that contribute to the accuracy improvement of the classifier.

Improvements on Phrase Breaks Prediction Using CRF (Conditional Random Fields) (CRF를 이용한 운율경계추성 성능개선)

  • Kim Seung-Won;Lee Geun-Bae;Kim Byeong-Chang
    • MALSORI
    • /
    • no.57
    • /
    • pp.139-152
    • /
    • 2006
  • In this paper, we present a phrase break prediction method using CRF(Conditional Random Fields), which has good performance at classification problems. The phrase break prediction problem was mapped into a classification problem in our research. We trained the CRF using the various linguistic features which was extracted from POS(Part Of Speech) tag, lexicon, length of word, and location of word in the sentences. Combined linguistic features were used in the experiments, and we could collect some linguistic features which generate good performance in the phrase break prediction. From the results of experiments, we can see that the proposed method shows improved performance on previous methods. Additionally, because the linguistic features are independent of each other in our research, the proposed method has higher flexibility than other methods.

  • PDF

In-depth Recommendation Model Based on Self-Attention Factorization

  • Hongshuang Ma;Qicheng Liu
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.3
    • /
    • pp.721-739
    • /
    • 2023
  • Rating prediction is an important issue in recommender systems, and its accuracy affects the experience of the user and the revenue of the company. Traditional recommender systems use Factorization Machinesfor rating predictions and each feature is selected with the same weight. Thus, there are problems with inaccurate ratings and limited data representation. This study proposes a deep recommendation model based on self-attention Factorization (SAFMR) to solve these problems. This model uses Convolutional Neural Networks to extract features from user and item reviews. The obtained features are fed into self-attention mechanism Factorization Machines, where the self-attention network automatically learns the dependencies of the features and distinguishes the weights of the different features, thereby reducing the prediction error. The model was experimentally evaluated using six classes of dataset. We compared MSE, NDCG and time for several real datasets. The experiment demonstrated that the SAFMR model achieved excellent rating prediction results and recommendation correlations, thereby verifying the effectiveness of the model.

Writer verification using feature selection based on genetic algorithm: A case study on handwritten Bangla dataset

  • Jaya Paul;Kalpita Dutta;Anasua Sarkar;Kaushik Roy;Nibaran Das
    • ETRI Journal
    • /
    • v.46 no.4
    • /
    • pp.648-659
    • /
    • 2024
  • Author verification is challenging because of the diversity in writing styles. We propose an enhanced handwriting verification method that combines handcrafted and automatically extracted features. The method uses a genetic algorithm to reduce the dimensionality of the feature set. We consider offline Bangla handwriting content and evaluate the proposed method using handcrafted features with a simple logistic regression, radial basis function network, and sequential minimal optimization as well as automatically extracted features using a convolutional neural network. The handcrafted features outperform the automatically extracted ones, achieving an average verification accuracy of 94.54% for 100 writers. The handcrafted features include Radon transform, histogram of oriented gradients, local phase quantization, and local binary patterns from interwriter and intrawriter content. The genetic algorithm reduces the feature dimensionality and selects salient features using a support vector machine. The top five experimental results are obtained from the optimal feature set selected using a consensus strategy. Comparisons with other methods and features confirm the satisfactory results.

A Classification Model for Attack Mail Detection based on the Authorship Analysis (작성자 분석 기반의 공격 메일 탐지를 위한 분류 모델)

  • Hong, Sung-Sam;Shin, Gun-Yoon;Han, Myung-Mook
    • Journal of Internet Computing and Services
    • /
    • v.18 no.6
    • /
    • pp.35-46
    • /
    • 2017
  • Recently, attackers using malicious code in cyber security have been increased by attaching malicious code to a mail and inducing the user to execute it. Especially, it is dangerous because it is easy to execute by attaching a document type file. The author analysis is a research area that is being studied in NLP (Neutral Language Process) and text mining, and it studies methods of analyzing authors by analyzing text sentences, texts, and documents in a specific language. In case of attack mail, it is created by the attacker. Therefore, by analyzing the contents of the mail and the attached document file and identifying the corresponding author, it is possible to discover more distinctive features from the normal mail and improve the detection accuracy. In this pager, we proposed IADA2(Intelligent Attack mail Detection based on Authorship Analysis) model for attack mail detection. The feature vector that can classify and detect attack mail from the features used in the existing machine learning based spam detection model and the features used in the author analysis of the document and the IADA2 detection model. We have improved the detection models of attack mails by simply detecting term features and extracted features that reflect the sequence characteristics of words by applying n-grams. Result of experiment show that the proposed method improves performance according to feature combinations, feature selection techniques, and appropriate models.

Design and Development of a Multimodal Biomedical Information Retrieval System

  • Demner-Fushman, Dina;Antani, Sameer;Simpson, Matthew;Thoma, George R.
    • Journal of Computing Science and Engineering
    • /
    • v.6 no.2
    • /
    • pp.168-177
    • /
    • 2012
  • The search for relevant and actionable information is a key to achieving clinical and research goals in biomedicine. Biomedical information exists in different forms: as text and illustrations in journal articles and other documents, in images stored in databases, and as patients' cases in electronic health records. This paper presents ways to move beyond conventional text-based searching of these resources, by combining text and visual features in search queries and document representation. A combination of techniques and tools from the fields of natural language processing, information retrieval, and content-based image retrieval allows the development of building blocks for advanced information services. Such services enable searching by textual as well as visual queries, and retrieving documents enriched by relevant images, charts, and other illustrations from the journal literature, patient records and image databases.

Learner-Generated Digital Listening Materials Using Text-to-Speech for Self-Directed Listening Practice

  • Moon, Dosik
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.12 no.4
    • /
    • pp.148-155
    • /
    • 2020
  • This study investigated learners' perceptions of using self-generated listening materials based on Text to Speech. After taking an online training session to learn how to make listening materials for extensive listening practice outside the classroom, the learners were engaged in practice with self-generated listening materials for 10 weeks in a self-directed way. The results show that a majority of the learners found the TTS-based listening materials helpful to reduce anxiety toward listening and enhance self-confidence and motivation, with a positive effect on improving their listening ability. The learners' general satisfaction can be attributed to some beneficial features of TTS-based listening material, including freedom to choose what they want to learn, convenient accessibility to the material, availability of various native speakers' voices, and novelty of digital tools. This suggests that TTS-based digital listening materials can be a useful educational tool to support learners' self-directed listening practice outside the classroom in EFL settings.

An Algorithm for Text Image Watermarking based on Word Classification (단어 분류에 기반한 텍스트 영상 워터마킹 알고리즘)

  • Kim Young-Won;Oh Il-Seok
    • Journal of KIISE:Software and Applications
    • /
    • v.32 no.8
    • /
    • pp.742-751
    • /
    • 2005
  • This paper proposes a novel text image watermarking algorithm based on word classification. The words are classified into K classes using simple features. Several adjacent words are grouped into a segment. and the segments are also classified using the word class information. The same amount of information is inserted into each of the segment classes. The signal is encoded by modifying some inter-word spaces statistics of segment classes. Subjective comparisons with conventional word-shift algorithms are presented under several criteria.

Modality-Based Sentence-Final Intonation Prediction for Korean Conversational-Style Text-to-Speech Systems

  • Oh, Seung-Shin;Kim, Sang-Hun
    • ETRI Journal
    • /
    • v.28 no.6
    • /
    • pp.807-810
    • /
    • 2006
  • This letter presents a prediction model for sentence-final intonations for Korean conversational-style text-to-speech systems in which we introduce the linguistic feature of 'modality' as a new parameter. Based on their function and meaning, we classify tonal forms in speech data into tone types meaningful for speech synthesis and use the result of this classification to build our prediction model using a tree structured classification algorithm. In order to show that modality is more effective for the prediction model than features such as sentence type or speech act, an experiment is performed on a test set of 970 utterances with a training set of 3,883 utterances. The results show that modality makes a higher contribution to the determination of sentence-final intonation than sentence type or speech act, and that prediction accuracy improves up to 25% when the feature of modality is introduced.

  • PDF

A Method for Short Text Classification using SNS Feature Information based on Markov Logic Networks (SNS 특징정보를 활용한 마르코프 논리 네트워크 기반의 단문 텍스트 분류 방법)

  • Lee, Eunji;Kim, Pankoo
    • Journal of Korea Multimedia Society
    • /
    • v.20 no.7
    • /
    • pp.1065-1072
    • /
    • 2017
  • As smart devices and social network services (SNSs) become increasingly pervasive, individuals produce large amounts of data in real time. Accordingly, studies on unstructured data analysis are actively being conducted to solve the resultant problem of information overload and to facilitate effective data processing. Many such studies are conducted for filtering inappropriate information. In this paper, a feature-weighting method considering SNS-message features is proposed for the classification of short text messages generated on SNSs, using Markov logic networks for category inference. The performance of the proposed method is verified through a comparison with an existing frequency-based classification methods.