• Title/Summary/Keyword: Large-set Classification

An Active Co-Training Algorithm for Biomedical Named-Entity Recognition

  • Munkhdalai, Tsendsuren;Li, Meijing;Yun, Unil;Namsrai, Oyun-Erdene;Ryu, Keun Ho
    • Journal of Information Processing Systems
    • /
    • v.8 no.4
    • /
    • pp.575-588
    • /
    • 2012
  • Exploiting unlabeled text data with a relatively small labeled corpus has been an active and challenging research topic in text mining, due to the recent growth of the amount of biomedical literature. Biomedical named-entity recognition is an essential prerequisite task before effective text mining of biomedical literature can begin. This paper proposes an Active Co-Training (ACT) algorithm for biomedical named-entity recognition. ACT is a semi-supervised learning method in which two classifiers based on two different feature sets iteratively learn from informative examples that have been queried from the unlabeled data. We design a new classification problem to measure the informativeness of an example in unlabeled data. In this classification problem, the examples are classified based on a joint view of a feature set to be informative/non-informative to both classifiers. To form the training data for the classification problem, we adopt a query-by-committee method. Therefore, in the ACT, both classifiers are considered to be one committee, which is used on the labeled data to give the informativeness label to each example. The ACT method outperforms the traditional co-training algorithm in terms of f-measure as well as the number of training iterations performed to build a good classification model. The proposed method tends to efficiently exploit a large amount of unlabeled data by selecting a small number of examples having not only useful information but also a comprehensive pattern.
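The query-by-committee labelling step described in this abstract can be illustrated with a minimal sketch. It assumes that an example counts as informative when the two committee members disagree on it; the abstract does not state the exact criterion, and the two toy classifiers and the token list are hypothetical stand-ins for the paper's feature-set-specific models.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[str], str]


def label_informativeness(
    examples: Sequence[str], clf_a: Classifier, clf_b: Classifier
) -> List[Tuple[str, bool]]:
    """Pair each example with True when the two committee members disagree on it."""
    return [(x, clf_a(x) != clf_b(x)) for x in examples]


# Hypothetical classifiers, each looking at a different "view" of a token.
def by_capitalization(token: str) -> str:
    return "ENTITY" if token[:1].isupper() else "O"


def by_suffix(token: str) -> str:
    return "ENTITY" if token.endswith(("ase", "in")) else "O"


tokens = ["Kinase", "protein", "binds", "Cell"]
labeled = label_informativeness(tokens, by_capitalization, by_suffix)
```

The resulting pairs could then serve as training data for a separate informativeness classifier, which is the role the abstract assigns to the new classification problem.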

Ensemble Based Optimal Feature Selection Algorithm for Efficient Intrusion Detection in Wireless Sensor Network

  • Shyam Sundar S;R.S. Bhuvaneswaran;SaiRamesh L
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.8
    • /
    • pp.2214-2229
    • /
    • 2024
  • A wireless sensor network (WSN) consists of a large number of sensor nodes deployed across geographical locations to collect sensed information, process data, and communicate it to the control station for further processing. Because of the hostile environment in which the sensors are deployed, malicious nodes have many opportunities to perform malicious activities in the network. These security threats affect the performance and lifetime of sensor networks, and several mechanisms exist to address security issues in WSNs, namely cryptography, trust management, Intrusion Detection Systems (IDS), and Intrusion Prevention Systems (IPS). An IDS detects malicious activities and raises an alarm. These malicious activities exploit vulnerabilities in the network layer and affect all layers in the network. Among existing feature selection methods, filter-based methods do not consider the redundancy of the selected features, and wrapper methods carry a high risk of overfitting the intrusion classifier. Because of overfitting, the classification algorithm fails to detect intrusions effectively. The main objective of this paper is to provide an efficient feature selection algorithm that is suitable for any type of classification algorithm and allows intrusions to be detected effectively. In this paper, network security is addressed by proposing a Feature Selection Algorithm using Chi Squared with Ensemble Method (FSChE). The proposed scheme combines a decision tree with the random forest classification algorithm to form an ensemble classifier. Experimental results on the NSL-KDD data set support the feasibility of the proposed scheme in terms of attack detection, packet delivery ratio, and time analysis. The results show that the proposed ensemble method increases overall performance by 10% to 25% with respect to these parameters.
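As a rough illustration of the chi-squared scoring that a filter step of this kind relies on, the sketch below computes the Pearson chi-squared statistic between a discrete feature and the class label, then ranks features by it. This is not the paper's FSChE algorithm; the feature names and toy records are hypothetical, and the ensemble classifier is not shown.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def chi_squared(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Pearson chi-squared statistic for a discrete feature against class labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    score = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n
            score += (observed[(f, l)] - expected) ** 2 / expected
    return score


def select_top_k(
    features: Dict[str, Sequence[Hashable]], labels: Sequence[Hashable], k: int
) -> List[str]:
    """Return the names of the k features with the highest chi-squared score."""
    ranked = sorted(features, key=lambda name: chi_squared(features[name], labels), reverse=True)
    return ranked[:k]


# Hypothetical toy records: one feature tracks the label, the other does not.
labels = ["attack", "attack", "normal", "normal"]
features = {
    "syn_flag": [1, 1, 0, 0],
    "is_weekend": [1, 0, 1, 0],
}
best = select_top_k(features, labels, k=1)
```

A higher score indicates a stronger dependence between the feature and the label, so the selected features would then be passed to whichever classifier is used downstream.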

Evaluation of damage probability matrices from observational seismic damage data

  • Eleftheriadou, Anastasia K.;Karabinis, Athanasios I.
    • Earthquakes and Structures
    • /
    • v.4 no.3
    • /
    • pp.299-324
    • /
    • 2013
  • The current research focuses on the seismic vulnerability assessment of typical Southern European buildings, based on the processing of a large set of observational damage data. The study is a sequel to previous research. The damage statistics have been enriched and a wider damage database (178578 buildings) has been created compared with that of the first paper (73468 buildings), with Damage Probability Matrices (DPMs) derived after elaboration of the results of post-earthquake surveys carried out in the area struck by the near-field Athens earthquake of 7-9-1999. The dataset comprises buildings that developed damage of varying degree, type and extent. Two different parameters are estimated to describe the seismic demand. After the damaged buildings are classified into structural types, they are further categorized according to the level of damage and macroseismic intensity. The relative and cumulative frequencies of the different damage states, for each structural type and each intensity level, are computed and presented in terms of damage ratio. DPMs are obtained for typical structural types and compared with existing matrices derived from regions with similar building stock and soil conditions. A procedure is presented for classifying those buildings that initially could not be assigned to structural types because of limited information and had therefore been disregarded. New proportional DPMs are developed and a correlation analysis is performed against the existing vulnerability relations.
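The relative and cumulative frequencies mentioned in this abstract can be sketched as follows for a single structural type. The building counts, intensity labels, and the three-state damage scale are hypothetical and are not taken from the paper.

```python
from itertools import accumulate
from typing import Dict, List


def damage_probability_matrix(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Turn building counts per damage state into relative frequencies per intensity.

    counts maps an intensity level to a list of building counts, ordered from
    the lightest to the heaviest damage state. Empty rows are skipped.
    """
    dpm = {}
    for intensity, row in counts.items():
        total = sum(row)
        if total > 0:
            dpm[intensity] = [n / total for n in row]
    return dpm


def cumulative_frequencies(dpm: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Running sum of each row: the share of buildings at or below each damage state."""
    return {intensity: list(accumulate(row)) for intensity, row in dpm.items()}


# Hypothetical counts for one structural type: [light, moderate, heavy].
counts = {"VII": [50, 30, 20], "VIII": [20, 40, 40]}
dpm = damage_probability_matrix(counts)
cdf = cumulative_frequencies(dpm)
```

Each row of the matrix sums to one, so rows for different intensities can be compared directly even when the number of surveyed buildings differs.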

A Deep Learning Model for Extracting Consumer Sentiments using Recurrent Neural Network Techniques

  • Ranjan, Roop;Daniel, AK
    • International Journal of Computer Science & Network Security
    • /
    • v.21 no.8
    • /
    • pp.238-246
    • /
    • 2021
  • The rapid rise of the Internet and social media has resulted in a large number of text-based reviews being posted online. In the age of social media, using machine learning technologies to analyze the emotional context of comments aids in understanding the Quality of Service (QoS) of any product or service, and the classification and analysis of user reviews aids in improving QoS. Unlike traditional categorization models, which are based on a set of rules, machine learning algorithms have evolved into a powerful tool for analyzing user sentiment. In sentiment categorization, Bidirectional Long Short-Term Memory (BiLSTM) has shown significant results, and Convolutional Neural Networks (CNN) have shown promising results. Using convolution and pooling layers, a CNN can successfully extract local information. BiLSTM uses LSTMs running in two directions to increase the amount of context available to deep learning models. The proposed hybrid model combines the benefits of these two deep learning algorithms. The data source for analysis and classification was user reviews of Indian Railway Services on Twitter. The proposed hybrid model uses the Keras Embedding technique as its input source. It takes in data and generates lower-dimensional features that lead to a categorization result. The hybrid model's performance was compared using Keras and Word2Vec, and the proposed model showed a significant improvement, with an accuracy of 95.19 percent.

Modified Kernel PCA Applied To Classification Problem (수정된 커널 주성분 분석 기법의 분류 문제에의 적용)

  • Kim, Byung-Joo;Sim, Joo-Yong;Hwang, Chang-Ha;Kim, Il-Kon
    • The KIPS Transactions:PartB
    • /
    • v.10B no.3
    • /
    • pp.243-248
    • /
    • 2003
  • An incremental kernel principal component analysis (IKPCA) is proposed for nonlinear feature extraction from data. One problem with batch kernel principal component analysis (KPCA) is that the computation becomes prohibitive when the data set is large. Another is that, to update the eigenvectors with new data, the whole eigenspace must be recomputed. IKPCA overcomes these problems by incrementally computing the eigenspace model and the empirical kernel map. IKPCA requires less memory than batch KPCA and can easily be improved by re-learning the data. Our experiments show that IKPCA is comparable in performance to batch KPCA for feature extraction and classification on nonlinear data sets.

Novel Category Discovery in Plant Species and Disease Identification through Knowledge Distillation

  • Jiuqing Dong;Alvaro Fuentes;Mun Haeng Lee;Taehyun Kim;Sook Yoon;Dong Sun Park
    • Smart Media Journal
    • /
    • v.13 no.7
    • /
    • pp.36-44
    • /
    • 2024
  • Identifying plant species and diseases is crucial for maintaining biodiversity and achieving optimal crop yields, making it a topic of significant practical importance. Recent studies have extended plant disease recognition from traditional closed-set scenarios to open-set environments, where the goal is to reject samples that do not belong to known categories. However, in open-world tasks, it is essential not only to define unknown samples as "unknown" but also to classify them further. This task assumes that images and labels of known categories are available and that samples of unknown categories can be accessed. The model classifies unknown samples by learning the prior knowledge of known categories. To the best of our knowledge, there is no existing research on this topic in plant-related recognition tasks. To address this gap, this paper utilizes knowledge distillation to model the category space relationships between known and unknown categories. Specifically, we identify similarities between different species or diseases. By leveraging a fine-tuned model on known categories, we generate pseudo-labels for unknown categories. Additionally, we enhance the baseline method's performance by using a larger pre-trained model, dino-v2. We evaluate the effectiveness of our method on the large plant specimen dataset Herbarium 19 and the disease dataset Plant Village. Notably, our method outperforms the baseline by 1% to 20% in terms of accuracy for novel category classification. We believe this study will contribute to the community.

Classification of Magnetic Resonance Imagery Using Deterministic Relaxation of Neural Network (신경망의 결정론적 이완에 의한 자기공명영상 분류)

  • 전준철;민경필;권수일
    • Investigative Magnetic Resonance Imaging
    • /
    • v.6 no.2
    • /
    • pp.137-146
    • /
    • 2002
  • Purpose: This paper introduces an improved classification approach that adopts a deterministic relaxation method and an agglomerative clustering technique for the classification of MRI using a neural network. The proposed approach can solve the problems of convergence to local optima and the computational burden caused by a large number of input patterns when a neural network is used for image classification. Materials and methods: Hopfield neural networks have been applied to various optimization problems. However, a major problem in mapping an image classification problem onto a neural network is that the network is apt to converge to local optima, and convergence toward the global solution with standard stochastic relaxation takes a long time. Therefore, to avoid local solutions and achieve fast convergence toward a global optimum, we apply mean field annealing (MFA) to a Hopfield network during classification. MFA replaces the stochastic nature of the simulated annealing method with a set of deterministic update rules that act on the average value of each variable. By minimizing averages, it is possible to converge to an equilibrium state considerably faster than with standard simulated annealing. Moreover, the proposed agglomerative clustering algorithm, which determines the underlying clusters of the image, provides the initial input values of the Hopfield neural network. Results: The proposed approach, which uses agglomerative clustering and deterministic relaxation, resolves the problem of local optima and achieves fast convergence toward a global optimum when a neural network is used for MRI classification. Conclusion: In this paper, we introduce a new paradigm for classifying MRI that uses clustering analysis and deterministic relaxation of a neural network to improve the classification results.
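The deterministic update rule that the abstract attributes to MFA (taken here to mean mean field annealing) can be sketched on a toy Hopfield-style network. Everything below is an illustrative assumption: the three-unit network, the coupling weights, the bias, and the cooling schedule are hypothetical, and the paper's actual energy function for MRI classification is not given in the abstract.

```python
import math
from typing import List


def mean_field_annealing(
    weights: List[List[float]],
    bias: List[float],
    t_start: float = 5.0,
    t_end: float = 0.05,
    cooling: float = 0.9,
    sweeps: int = 20,
) -> List[float]:
    """Anneal the average state of each unit with deterministic tanh updates.

    Each unit holds a mean value in (-1, 1) instead of a sampled +/-1 state.
    At every temperature the means are relaxed toward a fixed point, then the
    temperature is lowered, so no random sampling is involved.
    """
    n = len(bias)
    v = [0.0] * n
    t = t_start
    while t > t_end:
        for _ in range(sweeps):
            for i in range(n):
                field = bias[i] + sum(weights[i][j] * v[j] for j in range(n) if j != i)
                v[i] = math.tanh(field / t)
        t *= cooling
    return v


# Hypothetical network: units 0 and 1 attract each other, unit 2 repels both.
weights = [
    [0.0, 1.0, -1.0],
    [1.0, 0.0, -1.0],
    [-1.0, -1.0, 0.0],
]
bias = [0.5, 0.0, 0.0]
state = mean_field_annealing(weights, bias)
```

In this toy case the small bias on unit 0 breaks the symmetry at high temperature, and the means saturate toward a consistent labelling as the temperature falls.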

Skyline Query Algorithm in the Categoric Data (범주형 데이터에 대한 스카이라인 질의 알고리즘)

  • Lee, Woo-Key;Choi, Jung-Ho;Song, Jong-Su
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.7
    • /
    • pp.819-823
    • /
    • 2010
  • The skyline query is an effective method for handling large, multi-dimensional data sets. By using the concept of dominance, a skyline query can pinpoint the target data so that the dominated items, about 95% of the data, are efficiently excluded as unnecessary. Most skyline query algorithms, however, have been developed for numerical data sets. This paper pioneers a new domain, categorical data, and suggests corresponding ranking measures for skyline queries. In the experiments, the ACM Computing Classification System is used as the data set, and our methods are evaluated with respect to performance measures such as processing time and precision ratio.
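The dominance test behind a skyline query can be sketched as below. The naive all-pairs scan and the rank mapping for the categorical attribute are assumptions made for illustration; the abstract does not describe the paper's ranking measures, and the item data is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse in every dimension and better in at least one.

    Smaller values are treated as better in every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of all points that no other point dominates."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    ]


# Hypothetical rank mapping that makes a categorical attribute comparable.
TIER_RANK = {"gold": 0, "silver": 1, "bronze": 2}

items = [("A", "gold", 80), ("B", "silver", 60), ("C", "bronze", 70),
         ("D", "gold", 90), ("E", "bronze", 40)]
encoded = {name: (TIER_RANK[tier], price) for name, tier, price in items}
result = skyline(encoded)
```

Here C is excluded because B has both a better tier and a lower price, and D is excluded because A has the same tier at a lower price.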

Neural Network-based Recognition of Handwritten Hangul Characters in Form's Monetary Fields (전표 금액란에 나타나는 필기 한글의 신경망-기반 인식)

  • 이진선;오일석
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.5 no.1
    • /
    • pp.25-30
    • /
    • 2000
  • Hangul is regarded as one of the more difficult character sets to recognize because of its large number of classes and the shape similarity among different characters. Most conventional research has attempted to recognize the 2,350 commonly used characters, but this approach suffers from low recognition performance even though it provides generality. In contrast, recognizing a small character set appearing in specific fields, such as postal addresses or bank checks, is a more practical approach. This paper describes research on recognizing handwritten Hangul characters appearing in monetary fields. A modular neural network is adopted for classification, and three kinds of features are tested. An experiment using the standard Hangul database PE92 showed a correct recognition rate of 91.56%.

Analysis of IT Service Quality Elements Using Text Sentiment Analysis (텍스트 감정분석을 이용한 IT 서비스 품질요소 분석)

  • Kim, Hong Sam;Kim, Chong Su
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.43 no.4
    • /
    • pp.33-40
    • /
    • 2020
  • To satisfy customers, it is important to identify the quality elements that affect their satisfaction. The Kano model has been widely used to identify multi-dimensional quality attributes for this purpose. However, the model suffers from various shortcomings and limitations, especially those related to survey practices, such as the amount of data, respondents' attitudes, and cost. In this research, a model based on text sentiment analysis is proposed, which aims to replace the survey-based data gathering process of Kano models with sentiment analysis. In this model, quality elements are extracted from a set of opinion texts using morpheme analysis. The polarity of the opinions is evaluated using text sentiment analysis, and the polarized text items are transformed into equivalent Kano survey questions. Replies to the transformed survey questions are generated based on the total score of the original data. The question-reply set is then analyzed using both the original Kano evaluation method and the satisfaction index method. The proposed model has been tested on a large amount of evaluation data from public IT service projects. The results show that it can replace the existing practice and promises advantages in the quality and cost of data gathering. The authors hope that the proposed model may serve as a new quality analysis model for a wide range of areas.
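The abstract does not define its satisfaction index method. One widely used formulation, assumed here, is the pair of customer satisfaction coefficients computed from the number of replies classified as attractive (A), one-dimensional (O), must-be (M), and indifferent (I). The reply counts below are hypothetical.

```python
from typing import Dict, Tuple


def satisfaction_coefficients(counts: Dict[str, int]) -> Tuple[float, float]:
    """Return (better, worse) coefficients for one quality element.

    better = (A + O) / (A + O + M + I): satisfaction gained when the element is present.
    worse = -(O + M) / (A + O + M + I): dissatisfaction caused when it is absent.
    """
    a, o, m, i = (counts.get(key, 0) for key in ("A", "O", "M", "I"))
    total = a + o + m + i
    if total == 0:
        raise ValueError("no classified replies for this quality element")
    return (a + o) / total, -(o + m) / total


# Hypothetical reply counts for a single quality element.
better, worse = satisfaction_coefficients({"A": 30, "O": 20, "M": 10, "I": 40})
```

A better value near 1 suggests the element strongly raises satisfaction when present, while a worse value near -1 suggests its absence strongly lowers it.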