Impact of Instance Selection on kNN-Based Text Categorization

  • Barigou, Fatiha (Dept. of Computer Science, University of Oran 1 Ahmed Ben Bella)
  • Received : 2014.08.18
  • Accepted : 2016.11.30
  • Published : 2018.04.30

Abstract

With the increasing use of the Internet and electronic documents, automatic text categorization has become imperative. Several machine learning algorithms have been proposed for text categorization. The k-nearest neighbor algorithm (kNN) is known to be one of the best state-of-the-art classifiers for text categorization. However, kNN suffers from limitations such as the high computational cost of classifying new instances. Instance selection techniques have emerged as highly competitive methods for improving kNN through data reduction. However, previous work has evaluated those approaches only on structured datasets; their performance has not been examined in the text categorization domain, where both the dimensionality and the size of the dataset are very high. Motivated by these observations, this paper investigates and analyzes the impact of instance selection on kNN-based text categorization in terms of classification accuracy, classification efficiency, and data reduction.
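To make the idea of instance selection for kNN concrete, the following is a minimal sketch and not the paper's experimental setup. It assumes documents are represented as sparse term-weight dictionaries compared by cosine distance, and it uses a simplified form of Hart's condensed nearest neighbor rule as one illustrative selection method. The toy corpus and the names `cosine_distance`, `knn_predict`, `condensed_nn`, and `reduction_rate` are hypothetical and introduced only for this example.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train, query, k=1):
    """Majority vote among the k training instances nearest to the query."""
    nearest = sorted(train, key=lambda inst: cosine_distance(inst[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def condensed_nn(train):
    """Simplified condensing: grow a subset until 1-NN on the subset
    classifies every training instance correctly."""
    kept = [0]  # indices of selected instances; start from the first one
    changed = True
    while changed:
        changed = False
        for i, (vector, label) in enumerate(train):
            if i in kept:
                continue
            subset = [train[j] for j in kept]
            if knn_predict(subset, vector, k=1) != label:
                kept.append(i)  # misclassified, so the subset needs it
                changed = True
    return [train[j] for j in kept]


def reduction_rate(original, selected):
    """Fraction of training instances removed by the selection step."""
    return 1.0 - len(selected) / len(original)


# Hypothetical toy corpus: (term weights, category)
train = [
    ({"goal": 1.0, "match": 1.0}, "sport"),
    ({"goal": 1.0, "team": 1.0}, "sport"),
    ({"match": 1.0, "team": 1.0}, "sport"),
    ({"stock": 1.0, "market": 1.0}, "finance"),
    ({"stock": 1.0, "bank": 1.0}, "finance"),
    ({"market": 1.0, "bank": 1.0}, "finance"),
]

selected = condensed_nn(train)
prediction = knn_predict(selected, {"goal": 1.0}, k=1)
print(len(train), len(selected), reduction_rate(train, selected), prediction)
```

In this sketch, classifying a new document costs one distance computation per stored instance, so a smaller selected set directly lowers classification time. Whether accuracy is preserved on real text collections is the empirical question the paper studies.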
