• Title/Abstract/Keywords: Global feature selection

35 results found (processing time: 0.022 s)

New Feature Selection Method for Text Categorization

  • Wang, Xingfeng;Kim, Hee-Cheol
    • Journal of information and communication convergence engineering
    • /
    • Vol. 15, No. 1
    • /
    • pp.53-61
    • /
    • 2017
  • The preferred feature selection methods for text classification are filter-based. In a common filter-based feature selection scheme, unique scores are assigned to features; then, these features are sorted according to their scores. The last step is to add the top-N features to the feature set. In this paper, we propose an improved global feature selection scheme wherein its last step is modified to obtain a more representative feature set. The proposed method aims to improve the classification performance of global feature selection methods by creating a feature set representing all classes almost equally. For this purpose, a local feature selection method is used in the proposed method to label features according to their discriminative power on classes; these labels are used while producing the feature sets. Experimental results obtained using the well-known 20 Newsgroups and Reuters-21578 datasets with the k-nearest neighbor algorithm and a support vector machine indicate that the proposed method improves the classification performance in terms of a widely known metric ($F_1$).
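
The difference between plain top-N selection and a class-balanced final step can be sketched as follows. This is only an illustration under stated assumptions: the round-robin rule, the function names `top_n` and `balanced_top_n`, and the toy scores and class labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def top_n(scores, n):
    """Plain global filter: rank features by score and keep the n best."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def balanced_top_n(scores, labels, n):
    """Assumed variant: visit classes in turn, taking each class's best remaining feature."""
    per_class = defaultdict(list)
    # Group features by their class label, best-scored first within each class.
    for feat in sorted(scores, key=scores.get, reverse=True):
        per_class[labels[feat]].append(feat)
    classes = sorted(per_class)
    selected = []
    while len(selected) < n and any(per_class.values()):
        for c in classes:
            if per_class[c] and len(selected) < n:
                selected.append(per_class[c].pop(0))
    return selected

# Hypothetical global scores and per-feature class labels (e.g. from a local method).
scores = {"goal": 0.9, "team": 0.8, "match": 0.7, "stock": 0.6, "vote": 0.5}
labels = {"goal": "sport", "team": "sport", "match": "sport",
          "stock": "finance", "vote": "politics"}

plain = top_n(scores, 3)                      # all three features come from one class
balanced = balanced_top_n(scores, labels, 3)  # one feature per class
```

With these toy inputs the plain ranking keeps only "sport" terms, while the balanced variant covers all three classes, which is the kind of imbalance the abstract sets out to correct.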

Comparing Korean Spam Document Classification Using Document Classification Algorithms

  • 송철환;유성준
    • 한국정보과학회:학술대회논문집
    • /
    • 한국정보과학회 2006 Fall Conference Proceedings, Vol. 33, No. 2 (C)
    • /
    • pp.222-225
    • /
    • 2006
  • Korea has more Internet users than many other countries, and Korean Internet users correspondingly report considerable inconvenience from spam mail. To address this problem, this paper describes a study on filtering Korean spam documents using various feature weighting, feature selection, and document classification algorithms. Nouns are extracted from Korean documents (spam and non-spam) and used as the input features of each classification algorithm. For feature weighting, instead of the conventional approach, we compute a variance value for each feature, apply the Max Value Selection method to select global features, and then apply the traditional feature selection methods MI, IG, and CHI to extract features. The extracted features are then fed to classification algorithms such as Naive Bayes and Support Vector Machine. For the Vector Space Model, the conventional method is used unchanged. As a result, the combination of a Support Vector Machine classifier, TF-IDF Variance Weighting (Combined Max Value Selection), and CHI feature selection achieved a recall of 99.4%, a precision of 97.4%, and an F-measure of 98.39%.
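
As a quick consistency check on the reported figures, the F-measure can be recomputed from the stated precision and recall. The sketch below assumes the balanced F-measure (beta = 1), which the abstract does not state explicitly; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta = 1 gives F1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported in the abstract, as fractions.
f1 = f_measure(precision=0.974, recall=0.994)
print(f"F-measure: {100 * f1:.2f}%")
```

Under that assumption the recomputed value rounds to the 98.39% given in the abstract.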

Biological Feature Selection and Disease Gene Identification using New Stepwise Random Forests

  • Hwang, Wook-Yeon
    • Industrial Engineering and Management Systems
    • /
    • Vol. 16, No. 1
    • /
    • pp.64-79
    • /
    • 2017
  • Identifying disease genes from the human genome is a critical task in biomedical research. Important biological features that distinguish disease genes from non-disease genes have mainly been selected using traditional feature selection approaches. However, these approaches unnecessarily consider many unimportant biological features. As a result, although some existing classification techniques have been applied to disease gene identification, their prediction performance has not been satisfactory. A small set of the most important biological features can enhance the accuracy of disease gene identification and provide potentially useful knowledge for biologists and clinicians, who can further investigate the selected biological features as well as the potential disease genes. In this paper, we propose a new stepwise random forests (SRF) approach for biological feature selection and disease gene identification. The SRF approach consists of two stages. In the first stage, only important biological features are iteratively selected in a forward selection manner based on one-dimensional random forest regression, where the updated residual vector is treated as the current response vector. This yields a small set of important biological features. In the second stage, random forest classification on the selected biological features is applied to identify disease genes. Our extensive experiments show that the proposed SRF approach outperforms existing feature selection and classification techniques in terms of both biological feature selection and disease gene identification.
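
The first stage (forward selection with a residual that is updated after every step) can be sketched as below. To keep the example dependency-free, a one-dimensional least-squares line stands in for the paper's one-dimensional random forest regression; that substitution, the fixed number of steps `k`, and all names and data are assumptions made for illustration only.

```python
def fit_1d(x, r):
    """Fit r ~ a + b*x by least squares and return the fitted values."""
    n = len(x)
    mx, mr = sum(x) / n, sum(r) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:                      # constant feature: predict the mean
        return [mr] * n
    b = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r)) / sxx
    return [mr + b * (xi - mx) for xi in x]

def stepwise_select(columns, y, k):
    """Pick k features one at a time; each step regresses the current residual."""
    residual = list(y)
    selected = []
    for _ in range(k):
        best = None
        for name, x in columns.items():
            if name in selected:
                continue
            pred = fit_1d(x, residual)
            sse = sum((ri - pi) ** 2 for ri, pi in zip(residual, pred))
            if best is None or sse < best[0]:
                best = (sse, name, pred)
        if best is None:              # no candidates left
            break
        _, name, pred = best
        selected.append(name)
        # The updated residual becomes the response for the next step.
        residual = [ri - pi for ri, pi in zip(residual, pred)]
    return selected

# Toy data: y = 3*strong + weak, and "noise" is unrelated to the residual.
columns = {
    "noise":  [0, 1, 0, 1, 0, 1],
    "strong": [0, 1, 2, 3, 4, 5],
    "weak":   [1, -1, 0, 0, -1, 1],
}
y = [3 * s + w for s, w in zip(columns["strong"], columns["weak"])]
selected = stepwise_select(columns, y, k=2)
```

In this toy run the dominant feature is chosen first, and the second step picks the feature that explains what is left in the residual rather than the unrelated one.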

Analyzing empirical performance of correlation based feature selection with company credit rank score dataset - Emphasis on KOSPI manufacturing companies -

  • Nam, Youn Chang;Lee, Kun Chang
    • 한국컴퓨터정보학회논문지
    • /
    • Vol. 21, No. 4
    • /
    • pp.63-71
    • /
    • 2016
  • This paper applies an efficient data mining method that improves score calculation and the construction of a credit rank scoring system. The main idea is to achieve these objectives by applying correlation-based feature selection, which can also be used to quickly verify the appropriateness of existing rank scores. The study selected 2,047 manufacturing companies on the KOSPI market during the period 2009 to 2013 that have credit rank scores assigned by the NICE information service agency. For the relevant financial variables, a total of 80 variables were collected from KIS-Value and DART (Data Analysis, Retrieval and Transfer System). If correlation-based feature selection can identify the more important variables, the required information and cost can be reduced significantly. The analysis shows that the proposed correlation-based feature selection method improves the selection and classification process of the credit rank system, so that accuracy and credibility increase while the cost of building the system decreases.
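
The abstract does not spell out the scoring rule. A minimal sketch of the widely used correlation-based feature selection merit, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), combined with a greedy forward search, is given below. Whether the paper uses exactly this variant is an assumption, and the variable names and toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)

def merit(subset, columns, target):
    """CFS merit: rewards correlation with the target, penalizes redundancy."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

def cfs_forward(columns, target):
    """Greedy forward search: add the best feature while the merit improves."""
    selected, best = [], 0.0
    while True:
        scored = [(merit(selected + [f], columns, target), f)
                  for f in columns if f not in selected]
        if not scored:
            break
        top_merit, top_feat = max(scored, key=lambda t: t[0])
        if top_merit <= best:
            break
        selected.append(top_feat)
        best = top_merit
    return selected

# Toy data: x1_copy duplicates x1, so it adds redundancy but no new information.
columns = {"x1": [0, 0, 1, 1], "x1_copy": [0, 0, 1, 1], "x2": [0, 1, 0, 1]}
target = [a + b for a, b in zip(columns["x1"], columns["x2"])]
chosen = cfs_forward(columns, target)
```

Here the duplicated variable is left out because it lowers the merit, which illustrates how such a method can shrink a set of candidate variables without losing predictive information.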

Combined Features with Global and Local Features for Gas Classification

  • Choi, Sang-Il
    • 한국컴퓨터정보학회논문지
    • /
    • Vol. 21, No. 9
    • /
    • pp.11-18
    • /
    • 2016
  • In this paper, we propose a gas classification method using combined features for an electronic nose system that performs well even when some loss occurs in measuring data samples. We first divide the entire measurement for a data sample into three local sections (stabilization, exposure, and purge); local features are then extracted from each section. The amount of discriminative information in each feature is measured by discriminant analysis. Subsequently, the local features that carry a large amount of discriminative information are chosen to compose the combined features together with the global features that are extracted from the entire measurement section of the data sample. The experimental results show that the combined features produced by the proposed method give better classification performance on a variety of volatile organic compound data than the other feature types, especially when there is data loss.

Discriminative and Non-User Specific Binary Biometric Representation via Linearly-Separable SubCode Encoding-based Discretization

  • Lim, Meng-Hui;Teoh, Andrew Beng Jin
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • Vol. 5, No. 2
    • /
    • pp.374-388
    • /
    • 2011
  • Biometric discretization is the process of transforming the continuous biometric features of an identity into a binary bit string. This paper mainly focuses on improving the global discretization method - a discretization method that does not rely on user-specific information in bit string extraction, which appears to be important in applications that prioritize strong security provision and strong privacy protection. In particular, we demonstrate how the actual performance of global discretization can be further improved by embedding a global discriminative feature selection method and a Linearly Separable Subcode-based encoding technique. In addition, we examine a number of discriminative feature selection measures that can reliably be used for such discretization. Lastly, encouraging empirical results support the feasibility of our approach.

A Feature Selection for the Recognition of Handwritten Characters based on Two-Dimensional Wavelet Packet

  • 김민수;백장선;이귀상;김수형
    • 한국정보과학회논문지:소프트웨어및응용
    • /
    • Vol. 29, No. 8
    • /
    • pp.521-528
    • /
    • 2002
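  • This paper proposes a new feature selection method for character recognition that uses two-dimensional wavelet packets. Principal component analysis (PCA) is the dimensionality reduction technique most commonly used to select key features from the features of image data. However, because PCA relies on an eigensystem, it is sensitive to outliers and noise and tends to select only global features. Sometimes the important features of image data are local information, such as edges or sharp points. In such cases, PCA cannot give good results. The eigensystem also requires considerable computation time. In this paper, the original data are transformed by a two-dimensional wavelet packet basis, the best discriminant basis is searched for, and appropriate features are then selected from it. Compared with PCA, the proposed method selects local as well as global features, and does so with short computation time, owing to the favorable properties of wavelets. To demonstrate the performance of the proposed method, the experimental recognition rates of PCA and the proposed method are analyzed.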

Improved marine predators algorithm for feature selection and SVM optimization

  • Jia, Heming;Sun, Kangjian;Li, Yao;Cao, Ning
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • Vol. 16, No. 4
    • /
    • pp.1128-1145
    • /
    • 2022
  • Owing to the rapid development of information science, data analysis based on machine learning has become an interdisciplinary and strategic area. The marine predators algorithm (MPA) is a novel metaheuristic algorithm inspired by the foraging strategies of marine organisms. Considering the randomness of these strategies, an improved algorithm called the co-evolutionary cultural mechanism-based marine predators algorithm (CECMPA) is proposed. Through this mechanism, search agents in different spaces can share knowledge and experience to improve the performance of the native algorithm. More specifically, CECMPA has a higher probability of avoiding local optima and can find the global optimum quickly. This paper is the first to use CECMPA to perform feature subset selection and optimize the hyperparameters of a support vector machine (SVM) simultaneously. To evaluate its performance, the proposed method is tested on twelve datasets from the University of California Irvine (UCI) repository. Moreover, coronavirus disease 2019 (COVID-19), which is spreading in many countries, provides a real-world application, so CECMPA is also applied to a COVID-19 dataset. The experimental results and statistical analysis demonstrate that CECMPA is superior to the other compared methods in the literature in terms of several evaluation metrics. The proposed method is highly competitive and has promising prospects.

A Study on Android Malware Detection using Selected Features

  • 명상준;김강석
    • 융합정보논문지
    • /
    • Vol. 12, No. 3
    • /
    • pp.17-24
    • /
    • 2022
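  • Mobile malware is surging, and Android, which accounts for most of the global mobile OS market, has become the main target of mobile cybersecurity threats. Consequently, to respond to rapidly evolving malicious apps, there is a growing need for malware detection techniques that use machine learning, one of the technologies for implementing artificial intelligence. This paper proposes a feature screening method that uses feature selection and feature extraction to improve malware detection performance. In the feature screening process, detection performance improved depending on the number of features, and APIs showed relatively better detection performance than permissions. Combining the two feature types yielded a high average detection precision of over 93%, confirming that an appropriate combination of features can increase detection performance.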

Sliced Profile-based Automatic Extraction of Machined Features from CSG Models

  • 이영래
    • 대한산업공학회지
    • /
    • Vol. 20, No. 1
    • /
    • pp.99-112
    • /
    • 1994
  • This paper describes the development of a systematic method of slicing solid parts based on a data structure called the Sliced Profile Data Structure (SPDS). SPDS is an augmented polygon data structure that allows multiple layers of sliced profiles to be connected together. The method consists of five steps: (1) selection of slicing directions, (2) determination of slicing levels, (3) creation of sliced profiles, (4) connection of sliced profiles, and (5) refinement. The presented method is aimed at enhancing the applicability of CSG for manufacturing by overcoming the problems of non-uniqueness and global nature. The SPDS-based method of feature extraction is suitable for recognizing a broad scope of features with detailed information. The method is also suitable for identifying the global relationships among features and is capable of incorporating the context dependency of feature extraction.
