Semantic-based Genetic Algorithm for Feature Selection

Feature Selection Using a Semantic-based Genetic Algorithm

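  • 김정호 (Department of Computer Engineering, Graduate School, Korea Aerospace University) ;
  • 인주호 (Department of Computer Engineering, Graduate School, Korea Aerospace University) ;
  • 채수환 (School of Electronics and Information & Communication Engineering, Korea Aerospace University)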
  • Received : 2012.01.18
  • Accepted : 2012.06.05
  • Published : 2012.08.31

Abstract

This paper proposes an optimal feature selection method that takes the semantics of features into account, as a preprocessing step for document classification. Feature selection, which removes redundant features and keeps those essential for classification, plays a very important role in the classification task. To capture the meaning of features, we adopt LSA (Latent Semantic Analysis). Because basic LSA is not specialized for classification, we use a supervised LSA that is adapted to classification problems through supervised learning. We then apply a GA (Genetic Algorithm) to the features obtained from the supervised LSA to select a better feature subset. Finally, we project the documents onto the selected feature subset and classify them with an SVM (Support Vector Machine) classifier. We hypothesized that combining supervised LSA, which accounts for semantics, with a GA, which searches for an optimal feature subset, would yield high classification performance and efficiency. Experiments on classifying Internet news articles confirmed high classification performance with a small number of features.
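The GA stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the documents have already been projected onto a handful of latent features (the supervised LSA step is not shown), and it uses a nearest-centroid classifier as a dependency-free stand-in for the SVM. The function names, the toy data, the size penalty, and all parameter values (`pop_size`, `generations`, `p_mut`) are hypothetical choices made for illustration.

```python
import random

def centroid_accuracy(X, y, mask, n_train):
    """Held-out accuracy of a nearest-centroid classifier on the masked features.
    Stand-in for the SVM named in the abstract (assumption, keeps the sketch stdlib-only)."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    # Class centroids are computed from the first n_train documents only.
    centroids = {}
    for label in set(y[:n_train]):
        rows = [X[i] for i in range(n_train) if y[i] == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]

    def predict(row):
        return min(centroids, key=lambda lab: sum(
            (row[j] - c) ** 2 for j, c in zip(cols, centroids[lab])))

    held_out = range(n_train, len(X))
    return sum(predict(X[i]) == y[i] for i in held_out) / len(held_out)

def select_features(X, y, n_train, pop_size=20, generations=30,
                    p_mut=0.1, penalty=0.01, seed=0):
    """Search for a binary feature mask with a simple generational GA."""
    rng = random.Random(seed)
    n = len(X[0])
    cache = {}

    def fitness(mask):
        # Accuracy minus a small size penalty (hypothetical weight) so that
        # smaller subsets win ties; results are cached per mask.
        key = tuple(mask)
        if key not in cache:
            cache[key] = centroid_accuracy(X, y, mask, n_train) - penalty * sum(mask)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        def pick():
            a, b = rng.sample(pop, 2)          # binary tournament selection
            return a if fitness(a) >= fitness(b) else b
        children = [best[:]]                    # elitism: keep the best mask
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best

def make_toy_data(n_docs=40, n_features=6, seed=1):
    """Toy 'latent features': only feature 0 carries class information."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_docs):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] = 4.0 * label + rng.uniform(-1.0, 1.0)
        X.append(row)
        y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    mask = select_features(X, y, n_train=20)
    print("selected mask:", mask)
    print("held-out accuracy:", centroid_accuracy(X, y, mask, 20))
```

This is a wrapper-style design: each candidate subset is scored by training and evaluating a classifier on it, which is why the fitness values are cached. Swapping in an SVM would only require replacing `centroid_accuracy`.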

Cited by

  1. Combined Feature Set and Hybrid Feature Selection Method for Effective Document Classification vol.14, pp.5, 2013, https://doi.org/10.7472/jksii.2013.14.5.49