Search | Korea Science

A Clustering Approach for Feature Selection in Microarray Data Classification Using Random Forest

Aydadenta, Husna;Adiwijaya, Adiwijaya
- Journal of Information Processing Systems / v.14 no.5 / pp.1167-1175 / 2018
Microarray data play an essential role in diagnosing and detecting cancer. Microarray analysis allows the examination of gene expression levels in specific cell samples, where thousands of genes can be analyzed simultaneously. However, microarray data have very few samples and very high dimensionality. Therefore, classifying microarray data requires a dimensionality reduction process. Dimensionality reduction can eliminate redundancy in the data, so that only features highly correlated with their class are used in classification. There are two types of dimensionality reduction, namely feature selection and feature extraction. In this paper, we use the k-means algorithm as the clustering approach for feature selection. The proposed approach groups features that have the same characteristics into one cluster, so that redundancy in the microarray data is removed. The features in each cluster are then ranked using the Relief algorithm, and the best-scoring feature of each cluster is obtained. The best features of all clusters are selected and used as features in the classification process, for which the Random Forest algorithm is used. Based on the simulation, the proposed approach achieved accuracies of 85.87%, 98.9%, and 89% on the Colon, Lung Cancer, and Prostate Tumor datasets, respectively. The accuracy of the proposed approach is therefore higher than that of Random Forest without clustering.
https://doi.org/10.3745/JIPS.04.0087
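The selection pipeline in the abstract above (cluster the features with k-means, score them with Relief, keep the best feature of each cluster) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed iteration count, the seed, the Manhattan distance, and the use of a basic two-class Relief that visits every sample are all assumptions. The Random Forest classification step is left out.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on equal-length vectors; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def relief_scores(X, y):
    """Basic Relief: reward features that differ on the nearest miss
    and agree on the nearest hit (Manhattan distance, every sample visited)."""
    n_features = len(X[0])
    weights = [0.0] * n_features

    def dist(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    for i, xi in enumerate(X):
        hits = [x for j, x in enumerate(X) if j != i and y[j] == y[i]]
        misses = [x for j, x in enumerate(X) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda x: dist(xi, x))
        miss = min(misses, key=lambda x: dist(xi, x))
        for f in range(n_features):
            weights[f] += abs(xi[f] - miss[f]) - abs(xi[f] - hit[f])
    return weights

def select_features(X, y, k):
    """Cluster the feature columns, then keep the top Relief feature per cluster."""
    columns = [list(col) for col in zip(*X)]  # one vector per feature
    labels = kmeans(columns, k)
    scores = relief_scores(X, y)
    best = {}
    for f, c in enumerate(labels):
        if c not in best or scores[f] > scores[best[c]]:
            best[c] = f
    return sorted(best.values())
```

Calling `select_features(X, y, k)` on a samples-by-genes matrix `X` with class labels `y` returns at most `k` column indices, one per non-empty cluster, which would then be passed to the classifier.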

Dimension Reduction Method of Speech Feature Vector for Real-Time Adaptation of Voice Activity Detection

Park Jin-Young;Lee Kwang-Seok;Hur Kang-In
- Journal of the Institute of Convergence Signal Processing / v.7 no.3 / pp.116-121 / 2006
In this paper, we propose a dimension reduction method for multi-dimensional speech feature vectors, intended for a real-time adaptation procedure in various noisy environments. The method reduces the dimensions non-linearly by mapping to the likelihoods of the speech feature vector and the noise feature vector. The LRT (Likelihood Ratio Test) is used to classify speech and non-speech. The detection results are similar to those obtained with the multi-dimensional speech feature vector. Speech recognition results on the detected speech data are also similar to those obtained with the multi-dimensional (10th-order MFCC, Mel-Frequency Cepstral Coefficient) speech feature vector.
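As a rough illustration of the idea in the abstract above, the sketch below maps a multi-dimensional feature vector to two numbers (its log-likelihood under a speech model and under a noise model) and applies a likelihood ratio test. The diagonal-covariance Gaussian models, the function names, and the zero threshold are assumptions made for this example; the abstract does not specify the models the authors used.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def reduce_to_likelihoods(x, speech, noise):
    """Map a d-dimensional feature vector to a 2-D vector of log-likelihoods.
    `speech` and `noise` are (mean, variance) pairs (assumed model form)."""
    return (log_gauss(x, *speech), log_gauss(x, *noise))

def is_speech(x, speech, noise, threshold=0.0):
    """Likelihood ratio test in the log domain: speech if the ratio exceeds the threshold."""
    log_speech, log_noise = reduce_to_likelihoods(x, speech, noise)
    return log_speech - log_noise > threshold
```

In a real-time setting the noise model parameters would be re-estimated as the environment changes; that adaptation step is not shown here.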

An Ensemble Classifier using Two Dimensional LDA

Park, Cheong-Hee
- Journal of Korea Multimedia Society / v.13 no.6 / pp.817-824 / 2010
Linear Discriminant Analysis (LDA) has been successfully applied for dimension reduction in face recognition. However, LDA requires a face image to be transformed into a one-dimensional vector, and this process can cause the correlation information among neighboring pixels to be disregarded. In contrast, 2D-LDA uses 2D images directly without a transformation process, and it has been shown to be superior to traditional LDA. Nevertheless, 2D-LDA has some problems. First, it is difficult to determine the optimal number of feature vectors in the reduced-dimensional space. Second, the size of the rectangular windows used in 2D-LDA strongly affects classification accuracy, but there is no reliable way to determine an optimal window size. In this paper, we propose a new algorithm to overcome these problems in 2D-LDA. We adopt an ensemble approach that combines several classifiers obtained by using various window sizes. A practical method to determine the number of feature vectors is also presented. Experimental results demonstrate that the proposed method can overcome the difficulties of choosing an optimal window size and the number of feature vectors.

Effective Dimensionality Reduction of Payload-Based Anomaly Detection in TMAD Model for HTTP Payload

Kakavand, Mohsen;Mustapha, Norwati;Mustapha, Aida;Abdullah, Mohd Taufik
- KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.8 / pp.3884-3910 / 2016
An Intrusion Detection System (IDS) generally processes a large amount of data, much of which is highly redundant or irrelevant. This causes slow training and assessment procedures, high resource consumption, and a poor detection rate. Because of their expensive computational requirements during both training and detection, IDSs are mostly ineffective for real-time anomaly detection. This paper proposes a dimensionality reduction technique based on Principal Component Analysis (PCA) that is able to enhance the performance of IDSs up to constant time O(1). Furthermore, the present study offers a feature selection approach for identifying major components in real time. The PCA algorithm transforms high-dimensional feature vectors into a low-dimensional feature space, which is used to determine the optimal number of factors. The proposed approach was assessed using the HTTP packet payloads of the ISCX 2012 IDS and DARPA 1999 datasets. The experimental outcome demonstrated that the proposed anomaly detection achieved promising results, with a 97% detection rate and a 1.2% false positive rate on the ISCX 2012 dataset and a 100% detection rate and a 0.06% false positive rate on the DARPA 1999 dataset. The proposed anomaly detection also achieved comparable performance in terms of computational complexity when compared with three state-of-the-art anomaly detection systems.
https://doi.org/10.3837/tiis.2016.08.025
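The PCA step mentioned in the abstract above, which maps high-dimensional feature vectors into a low-dimensional space, can be illustrated with a small sketch that finds only the first principal component by power iteration. This is not the TMAD model or the authors' code: the function names, the choice of power iteration, the all-ones starting vector, and the iteration count are assumptions made for the example.

```python
import math

def first_principal_component(X, iters=200):
    """Return (feature means, unit vector) for the leading principal component.
    Assumes the data have non-zero variance and that the all-ones start
    vector is not orthogonal to the leading component."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in X]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(iters):
        # Power iteration: multiply by the covariance matrix, then renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(row, means, v):
    """Project one feature vector onto the principal direction (d values -> 1 value)."""
    return sum((x - m) * c for x, m, c in zip(row, means, v))
```

Once `means` and `v` are fitted offline, `project` costs one dot product per incoming feature vector, which is the kind of fixed per-sample cost a real-time detector needs.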

On Combining Genetic Algorithm (GA) and Wavelet for High Dimensional Data Reduction

Liu, Zhengjun;Wang, Changyao;Zhang, Jixian;Yan, Qin
- Proceedings of the KSRS Conference / 2003.11a / pp.1272-1274 / 2003
In this paper, we present a new algorithm for high dimensional data reduction based on wavelet decomposition and Genetic Algorithm (GA). Comparative results show the superiority of our algorithm for dimensionality reduction and accuracy improvement.

Feature-Based Image Retrieval using SOM-Based R*-Tree

Shin, Min-Hwa;Kwon, Chang-Hee;Bae, Sang-Hyun
- Proceedings of the KAIS Fall Conference / 2003.11a / pp.223-230 / 2003
Feature-based similarity retrieval has become an important research issue in multimedia database systems. The features of multimedia data are useful for discriminating between multimedia objects (e.g., documents, images, video, music scores). For example, images are represented by their color histograms, texture vectors, and shape descriptors, which are usually high-dimensional data. The performance of conventional multidimensional data structures (e.g., the R-tree family, K-D-B tree, grid file, TV-tree) tends to deteriorate as the number of dimensions of the feature vectors increases. The R*-tree is the most successful variant of the R-tree. In this paper, we propose a SOM-based R*-tree as a new indexing method for high-dimensional feature vectors. The SOM-based R*-tree combines a SOM and an R*-tree to achieve search performance that scales better to high dimensionalities. Self-Organizing Maps (SOMs) provide a mapping from high-dimensional feature vectors onto a two-dimensional space. The mapping preserves the topology of the feature vectors. The map is called a topological feature map, and it preserves the mutual relationships (similarity) in the feature space of the input data, clustering mutually similar feature vectors in neighboring nodes. Each node of the topological feature map holds a codebook vector. A best-matching-image-list (BMIL) holds the similar images that are closest to each codebook vector. In a topological feature map, there are empty nodes in which no image is classified. When we build an R*-tree, we use the codebook vectors of the topological feature map, eliminating the empty nodes that cause unnecessary disk access and degrade retrieval performance. We experimentally compare the retrieval time cost of a SOM-based R*-tree with that of a SOM and an R*-tree using color feature vectors extracted from 40,000 images. The results show that the SOM-based R*-tree outperforms both the SOM and the R*-tree, owing to the reduction in the number of nodes required to build the R*-tree and in retrieval time cost.
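The role of the best-matching-image-list and the removal of empty nodes described above can be sketched as follows. The example assumes a SOM has already been trained and is given as a list of codebook vectors; the function names, the squared Euclidean distance, and the dictionary layout of the BMIL are assumptions for illustration, and the R*-tree construction itself is not shown.

```python
def best_matching_unit(vec, codebook):
    """Index of the codebook vector closest to `vec` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda n: sum((a - b) ** 2 for a, b in zip(vec, codebook[n])),
    )

def build_bmil(images, codebook):
    """Map each SOM node index to the list of image ids whose feature vector it best matches."""
    bmil = {}
    for image_id, vec in images.items():
        bmil.setdefault(best_matching_unit(vec, codebook), []).append(image_id)
    return bmil

def nonempty_codebook_vectors(codebook, bmil):
    """Codebook vectors of nodes that hold at least one image.
    Only these would be inserted into the index, so empty nodes are never visited."""
    return [codebook[n] for n in sorted(bmil)]
```

Indexing the non-empty codebook vectors rather than every image vector is what reduces the number of entries the tree has to hold.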

Comparative Analysis of Dimensionality Reduction Techniques for Advanced Ransomware Detection with Machine Learning

Kim Han Seok;Lee Soo Jin
- Convergence Security Journal / v.23 no.1 / pp.117-123 / 2023
To detect advanced ransomware attacks with machine learning-based models, the classification model must be trained on data with a high-dimensional feature space. In this case, the 'curse of dimensionality' phenomenon is likely to occur. Therefore, dimensionality reduction of the features must be performed first in order to increase the accuracy of the learning model and improve execution speed while avoiding the 'curse of dimensionality'. In this paper, we classified ransomware by applying three machine learning models and two feature extraction techniques to two datasets whose feature spaces differ greatly in dimension. In the experiments, the dimensionality reduction techniques did not significantly improve performance in binary classification, and the same held in multi-class classification when the dimension of the feature space was small. However, when the dataset had a high-dimensional feature space, LDA (Linear Discriminant Analysis) showed quite excellent performance.
https://doi.org/10.33778/kcsa.2023.23.1.117

Parts-based Feature Extraction of Speech Spectrum Using Non-Negative Matrix Factorization

박정원;김창근;허강인
- Proceedings of the IEEK Conference / 2003.11a / pp.49-52 / 2003
In this paper, we propose a new speech feature parameter using NMF (Non-Negative Matrix Factorization). NMF can represent multi-dimensional data through effective dimensionality reduction by matrix factorization under a non-negativity constraint, and the reduced data represent parts-based features of the input data. We verify the usefulness of the NMF algorithm for speech feature extraction by applying NMF to the Mel-scaled filter bank output to obtain the feature parameter. Recognition experiments confirm that the proposed feature parameter gives better recognition performance than the commonly used MFCC (Mel-Frequency Cepstral Coefficient).
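A factorization under the non-negativity constraint described above can be sketched with the standard multiplicative update rules for the squared reconstruction error. The sketch is illustrative only: the abstract does not say which NMF algorithm the authors used, and the function names, rank, iteration count, seed, and the small `eps` guard against division by zero are assumptions.

```python
import random

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, rank, iters=500, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix V into W (n x rank) and H (rank x m),
    both non-negative, using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H

def reconstruction_error(V, W, H):
    """Squared Frobenius distance between V and W H."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for vr, rr in zip(V, WH) for v, r in zip(vr, rr))
```

If the rows of `V` are frequency bands (for example filter bank channels) and the columns are frames, the columns of `W` act as non-negative spectral parts and each column of `H` is the reduced feature vector for one frame.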

Feature reduction for classifying high dimensional data sets using support vector machine

Ko, Seok-Ha;Lee, Hyun-Ju
- Proceedings of the IEEK Conference / 2008.06a / pp.877-878 / 2008
We suggest a feature reduction method to classify mouse function data sets, which integrate several biological data sets represented as high-dimensional vectors. To increase classification accuracy and decrease computational overhead, it is important to reduce the dimension of the features. To do this, we employed a Hybrid Huberized Support Vector Machine with kernels used for a kernel logistic regression method. Compared with the support vector machine, this approach shows better accuracy with useful features for each mouse function.

Case based Reasoning System with Two Dimensional Reduction Technique for Customer Classification Model

Kim, Kyoung-Jae;Ahn, Hyun-Chul
- Proceedings of the Korean Institute of Information and Communication Sciences Conference / v.9 no.2 / pp.383-386 / 2005
This study proposes a case-based reasoning (CBR) system with two dimensional reduction techniques. The vertical and horizontal dimensions of the research data are reduced through a hybrid feature and instance selection process using genetic algorithms. We applied the proposed model to a customer classification model that uses customers' demographic characteristics as inputs to predict their buying behavior for a specific product. Experimental results show that the proposed technique may improve classification accuracy and outperform various optimized models of typical CBR systems.
