Performance Improvement of Nearest-neighbor Classification Learning through Prototype Selections


  • Doosung Hwang (Dept. of Computer Science, Dankook University)
  • Received : 2011.02.15
  • Accepted : 2012.03.02
  • Published : 2012.03.25

Abstract

Nearest-neighbor classification predicts the class of an input instance as the most frequent class among its nearest training instances. Although nearest-neighbor classification has no training stage, all training data must be kept for prediction, and generalization performance depends on the quality of the training data. Consequently, as the training set grows, nearest-neighbor classification demands a large amount of memory and long computation times at prediction. In this paper, we propose a prototype selection algorithm that predicts the class of test data using a new set of prototypes drawn from near-boundary training data. Using Tomek links and a distance metric, the proposed algorithm selects boundary data and decides whether each selected instance is added to the prototype set by analyzing its class and distance relationships. In the experiments, the number of selected prototypes was much smaller than the size of the original training data, providing reduced storage and fast prediction in nearest-neighbor classification.
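The boundary-selection idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it detects Tomek links (pairs of mutual nearest neighbors belonging to different classes), keeps those boundary points as the prototype set, and classifies with 1-NN over the prototypes. The paper's additional class/distance screening when deciding whether to add a candidate to the prototype set is omitted, and all function names here are hypothetical.

```python
import numpy as np

def tomek_links(X, y):
    """Return the set of Tomek links: index pairs (i, j) that are
    mutual nearest neighbors and carry different class labels."""
    # full pairwise Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)        # a point is not its own neighbor
    nn = D.argmin(axis=1)              # nearest neighbor of each point
    links = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:        # mutual NN, different classes
            links.add(tuple(sorted((i, j))))   # store each pair once
    return links

def select_prototypes(X, y):
    """Keep only the near-boundary points that participate in a Tomek link."""
    idx = sorted({i for pair in tomek_links(X, y) for i in pair})
    return X[idx], y[idx]

def predict_1nn(prototypes, labels, x):
    """1-NN prediction over the reduced prototype set."""
    return labels[np.linalg.norm(prototypes - x, axis=1).argmin()]

if __name__ == "__main__":
    # toy 1-D data embedded in 2-D: two points per class, one boundary pair
    X = np.array([[0., 0.], [1., 0.], [1.5, 0.], [3., 0.]])
    y = np.array([0, 0, 1, 1])
    P, labels = select_prototypes(X, y)   # keeps only the two boundary points
    print(predict_1nn(P, labels, np.array([0.5, 0.])))
```

On this toy set, only the two points straddling the class boundary survive as prototypes, so storage drops from four points to two while the 1-NN decision near the boundary is unchanged, which mirrors the storage/speed advantage the abstract claims.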

