• Title/Summary/Keyword: Feature Selection Methods

An Empirical Study on Improving the Performance of Text Categorization Considering the Relationships between Feature Selection Criteria and Weighting Methods (자질 선정 기준과 가중치 할당 방식간의 관계를 고려한 문서 자동분류의 개선에 대한 연구)

  • Lee Jae-Yun
    • Journal of the Korean Society for Library and Information Science / v.39 no.2 / pp.123-146 / 2005
  • This study aims to find consistent strategies for feature selection and feature weighting that improve the effectiveness and efficiency of a kNN text classifier. Feature selection criteria and feature weighting methods are as important as classification algorithms for achieving good performance in text categorization systems. Most previous studies chose conflicting strategies for feature selection criteria and weighting methods. In this study, the performance of several feature selection criteria is measured considering the storage space for inverted index records and the classification time. The classification experiments examine the performance of IDF as a feature selection criterion and the performance of conventional feature selection criteria, e.g. mutual information, as feature weighting methods. The results suggest that by using measures that prefer low-frequency features both as the feature selection criterion and as the feature weighting method, we can increase the classification speed by as much as three to five times without losing classification accuracy.
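
The idea of using IDF as a selection criterion can be illustrated with a minimal sketch. This is not the author's implementation: the function names `idf_scores` and `select_by_idf`, the plain `log(N / df)` formula, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf_scores(docs):
    """Return {term: idf} for tokenized documents, with idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def select_by_idf(docs, k):
    """Keep the k terms with the highest IDF, i.e. the lowest document frequency."""
    scores = idf_scores(docs)
    # Sort by descending IDF; break ties alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

# Toy collection (hypothetical data).
docs = [
    ["price", "market", "stock"],
    ["market", "trade"],
    ["market", "stock", "bond"],
]
selected = select_by_idf(docs, 3)
print(selected)
```

Because the same scores rank and weight the terms, the selection and weighting strategies stay consistent, which is the point the abstract argues for. Dropping high-frequency terms also removes the longest inverted-index posting lists.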

Landslide susceptibility assessment using feature selection-based machine learning models

  • Liu, Lei-Lei;Yang, Can;Wang, Xiao-Mi
    • Geomechanics and Engineering / v.25 no.1 / pp.1-16 / 2021
  • Machine learning models have been widely used for landslide susceptibility assessment (LSA) in recent years. The large number of inputs or conditioning factors for these models, however, can reduce computational efficiency and make data collection more difficult. Feature selection addresses this problem by selecting the most important features among all factors to reduce the number of input variables. However, two important questions need to be answered: (1) how do feature selection methods affect the performance of machine learning models? and (2) which feature selection method is the most suitable for a given machine learning model? This paper aims to address these two questions by comparing the predictive performance of 13 feature selection-based machine learning (FS-ML) models and 5 ordinary machine learning models on LSA. First, five commonly used machine learning models (i.e., logistic regression, support vector machine, artificial neural network, Gaussian process and random forest) and six typical feature selection methods in the literature are adopted to constitute the proposed models. Then, fifteen conditioning factors are chosen as input variables and 1,017 landslides are used as recorded data. Next, feature selection methods are used to obtain the importance of the conditioning factors to create feature subsets, based on which 13 FS-ML models are constructed. For each of the machine learning models, the best-performing FS-ML model is selected according to the area under the curve value. Finally, five optimal FS-ML models are obtained and applied to the LSA of the studied area. The predictive abilities of the FS-ML models on LSA are verified and compared through the receiver operating characteristic curve and statistical indicators such as sensitivity, specificity and accuracy. The results showed that different feature selection methods have different effects on the performance of LSA machine learning models. FS-ML models generally outperform the ordinary machine learning models. The best FS-ML model is the recursive feature elimination (RFE) optimized RF, and RFE is an optimal method for feature selection.

Nonlinear Feature Transformation and Genetic Feature Selection: Improving System Security and Decreasing Computational Cost

  • Taghanaki, Saeid Asgari;Ansari, Mohammad Reza;Dehkordi, Behzad Zamani;Mousavi, Sayed Ali
    • ETRI Journal / v.34 no.6 / pp.847-857 / 2012
  • Intrusion detection systems (IDSs) have an important effect on system defense and security. Recently, most IDS methods have used transformed features, selected features, or original features. Both feature transformation and feature selection have their advantages. Neighborhood component analysis feature transformation and genetic feature selection (NCAGAFS) is proposed in this research. NCAGAFS is based on soft computing and data mining and uses the advantages of both transformation and selection. This method transforms features via neighborhood component analysis and chooses the best features with a classifier based on a genetic feature selection method. This novel approach is verified using the KDD Cup99 dataset, where it demonstrates higher performance than other well-known methods under various classifiers.

New Feature Selection Method for Text Categorization

  • Wang, Xingfeng;Kim, Hee-Cheol
    • Journal of information and communication convergence engineering / v.15 no.1 / pp.53-61 / 2017
  • The preferred feature selection methods for text classification are filter-based. In a common filter-based feature selection scheme, unique scores are assigned to features; then, these features are sorted according to their scores. The last step is to add the top-N features to the feature set. In this paper, we propose an improved global feature selection scheme wherein its last step is modified to obtain a more representative feature set. The proposed method aims to improve the classification performance of global feature selection methods by creating a feature set representing all classes almost equally. For this purpose, a local feature selection method is used in the proposed method to label features according to their discriminative power on classes; these labels are used while producing the feature sets. Experimental results obtained using the well-known 20 Newsgroups and Reuters-21578 datasets with the k-nearest neighbor algorithm and a support vector machine indicate that the proposed method improves the classification performance in terms of a widely known metric ($F_1$).
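
The common filter scheme described above (score, sort, keep the top N) can be sketched as follows, together with one possible class-balanced variant of the last step. The balanced variant is only an illustration of the general idea of representing all classes almost equally; it is not the authors' algorithm, and the function names, scores, and class labels are hypothetical.

```python
def top_n_global(scores, n):
    """Common filter scheme: sort features by score and keep the top n."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:n]

def top_n_balanced(scores, labels, n):
    """Hypothetical variant: pick features round-robin across class labels.

    `labels` maps each feature to the class it discriminates best, as a
    local feature selection method might assign it (an assumption here).
    """
    by_class = {}
    for feat in sorted(scores, key=lambda f: (-scores[f], f)):
        by_class.setdefault(labels[feat], []).append(feat)
    classes = sorted(by_class)
    selected = []
    # Take the best remaining feature of each class in turn until n are chosen.
    while len(selected) < n and any(by_class[c] for c in classes):
        for c in classes:
            if by_class[c] and len(selected) < n:
                selected.append(by_class[c].pop(0))
    return selected

# Toy scores and labels (hypothetical data).
scores = {"goal": 0.9, "match": 0.8, "team": 0.7, "vote": 0.4, "senate": 0.3}
labels = {"goal": "sport", "match": "sport", "team": "sport",
          "vote": "politics", "senate": "politics"}

global_set = top_n_global(scores, 3)
balanced_set = top_n_balanced(scores, labels, 4)
print(global_set, balanced_set)
```

In this toy case the global top-3 contains only sport terms, while the balanced selection draws equally from both classes.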

A Novel Feature Selection Method in the Categorization of Imbalanced Textual Data

  • Pouramini, Jafar;Minaei-Bidgoli, Behrouze;Esmaeili, Mahdi
    • KSII Transactions on Internet and Information Systems (TIIS) / v.12 no.8 / pp.3725-3748 / 2018
  • Text data distribution is often imbalanced. Imbalanced data is one of the challenges in text classification, as it degrades classifier performance. Many studies have been conducted so far in this regard. The proposed solutions fall into several general categories, including sampling-based and algorithm-based methods. In recent studies, feature selection has also been considered as one of the solutions for the imbalance problem. In this paper, a novel one-sided feature selection method known as probabilistic feature selection (PFS) is presented for imbalanced text classification. PFS is a probabilistic method that is calculated using the feature distribution. Compared to similar methods, PFS has more parameters. To evaluate the performance of the proposed method, the feature selection methods Gini, MI, FAST and DFS were implemented for comparison, and classifiers such as the C4.5 decision tree and Naive Bayes were used. The F-measure results of tests on Reuters-21578 and WebKB suggest that the proposed feature selection method significantly improves the performance of the classifiers.

Effective Multi-label Feature Selection based on Large Offspring Set created by Enhanced Evolutionary Search Process

  • Lim, Hyunki;Seo, Wangduk;Lee, Jaesung
    • Journal of the Korea Society of Computer and Information / v.23 no.9 / pp.7-13 / 2018
  • Recent advances in data gathering techniques have improved the capability of information collection, enabling learning of the relationships between gathered data patterns and application sub-tasks. A pattern can be associated with multiple labels, which demands multi-label learning capability. As a result, multi-label feature selection has received significant attention, since it can improve multi-label learning accuracy. However, existing evolutionary multi-label feature selection methods suffer from an ineffective search process. In this study, we propose an evolutionary search process for the multi-label feature selection problem. The proposed method creates a large set of offspring, or new feature subsets, and then retains the most promising feature subset. Experimental results demonstrate that the proposed method can identify feature subsets giving good multi-label classification accuracy much faster than conventional methods.

A Novel Statistical Feature Selection Approach for Text Categorization

  • Fattah, Mohamed Abdel
    • Journal of Information Processing Systems / v.13 no.5 / pp.1397-1409 / 2017
  • In text categorization, selecting distinctive text features is important because of the high dimensionality of the feature space. Reducing the dimension of the feature space decreases processing time and increases accuracy. In the current study, we introduce a novel statistical feature selection approach for text categorization. This approach measures the term distribution in all collection documents, the term distribution in a certain category and the term distribution in a certain class relative to other classes. The results show the superiority of the proposed method over traditional feature selection methods.

Comparison of Feature Selection Methods in Support Vector Machines (지지벡터기계의 변수 선택방법 비교)

  • Kim, Kwangsu;Park, Changyi
    • The Korean Journal of Applied Statistics / v.26 no.1 / pp.131-139 / 2013
  • Support vector machines (SVM) may perform poorly in the presence of noise variables; in addition, it is difficult to identify the importance of each variable in the resulting classifier. Feature selection can improve the interpretability and the accuracy of SVM. Most existing studies concern feature selection in the linear SVM through penalty functions yielding sparse solutions. In practice, however, one usually adopts nonlinear kernels for classification accuracy. Hence feature selection is still desirable for nonlinear SVMs. In this paper, we compare the performance of nonlinear feature selection methods such as component selection and smoothing operator (COSSO) and kernel iterative feature extraction (KNIFE) on simulated and real data sets.

Comparison of Feature Selection Processes for Image Retrieval Applications

  • Choi, Young-Mee;Choo, Moon-Won
    • Journal of Korea Multimedia Society / v.14 no.12 / pp.1544-1548 / 2011
  • The process of choosing a subset of the original features, so-called feature selection, is considered a crucial preprocessing step in image processing applications. A large pool of techniques has already been developed in the machine learning and data mining fields. In this paper, two approaches, classification without feature selection and classification with feature selection, are investigated to compare their predictive effectiveness. Color co-occurrence features are used to define image features. The standard Sequential Forward Selection algorithm is used for feature selection to identify relevant features and redundancy among relevant features. Four color spaces, RGB, YCbCr, HSV, and Gaussian space, are considered for computing color co-occurrence features. Gray-level image features are also considered for performance comparison. The experimental results are presented.
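
A minimal sketch of Sequential Forward Selection follows: starting from an empty subset, it greedily adds the feature that most improves a criterion and stops when no candidate helps. The criterion `toy_score`, the feature names, and the relevance and redundancy values are assumptions for illustration only; in the paper's setting the criterion would come from the classification experiments.

```python
def sequential_forward_selection(features, score, k):
    """Greedily add the feature that most improves score(subset), up to k features."""
    selected = []
    best = score(selected)
    while len(selected) < k:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Evaluate the criterion with each candidate added to the current subset.
        gains = {f: score(selected + [f]) for f in candidates}
        winner = max(candidates, key=lambda f: gains[f])
        if gains[winner] <= best:
            break  # no candidate improves the criterion
        selected.append(winner)
        best = gains[winner]
    return selected

# Toy criterion (an assumption): summed relevance minus a penalty for redundant pairs.
relevance = {"hue": 5, "saturation": 4, "value": 1, "hue_copy": 5}
redundant = {frozenset(("hue", "hue_copy")): 5}

def toy_score(subset):
    total = sum(relevance[f] for f in subset)
    for i, a in enumerate(subset):
        for b in subset[i + 1:]:
            total -= redundant.get(frozenset((a, b)), 0)
    return total

chosen = sequential_forward_selection(list(relevance), toy_score, 3)
print(chosen)
```

Here `hue_copy` is individually as relevant as `hue` but is never added, because the redundancy penalty cancels its gain once `hue` is in the subset.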

A Study on Feature Selection for kNN Classifier using Document Frequency and Collection Frequency (문헌빈도와 장서빈도를 이용한 kNN 분류기의 자질선정에 관한 연구)

  • Lee, Yong-Gu
    • Journal of Korean Library and Information Science Society / v.44 no.1 / pp.27-47 / 2013
  • This study investigated the classification performance of a kNN classifier using feature selection methods based on document frequency (DF) and collection frequency (CF). The results of the experiments, which used the HKIB-20000 data, were as follows. First, the feature selection methods that kept high-frequency terms and removed low-frequency terms by the CF criterion achieved better classification performance than those using the DF criterion. Second, neither the DF nor the CF method performed well when low-frequency terms were selected first in the feature selection process. Last, combining the CF and DF criteria did not result in better classification performance than using DF or CF alone as the feature selection criterion.
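
The difference between the two criteria can be shown with a short sketch: DF counts the documents that contain a term, while CF counts every occurrence of the term in the collection. The helper names, the threshold, and the toy documents are assumptions, not taken from the study.

```python
from collections import Counter

def document_frequency(docs):
    """DF: number of documents in which each term appears at least once."""
    return Counter(term for doc in docs for term in set(doc))

def collection_frequency(docs):
    """CF: total number of occurrences of each term across the collection."""
    return Counter(term for doc in docs for term in doc)

def keep_high_frequency(freq, min_count):
    """Remove low-frequency terms: keep terms whose count is at least min_count."""
    return sorted(t for t, c in freq.items() if c >= min_count)

# Toy collection (hypothetical data).
docs = [
    ["library", "library", "library", "index"],
    ["index", "search"],
    ["search", "index", "catalog"],
]
df = document_frequency(docs)
cf = collection_frequency(docs)
print(keep_high_frequency(df, 2))
print(keep_high_frequency(cf, 2))
```

A term such as "library", repeated within a single document, survives the CF threshold but not the DF threshold, which is where the two criteria diverge.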