A New Statistical Sampling Method for Reducing Computing Time of Machine Learning Algorithms

  • Sunghae Jun (Department of Bioinformatics and Statistics, Cheongju University)
  • Received : 2011.02.08
  • Accepted : 2011.04.03
  • Published : 2011.04.25

Abstract

Accuracy and computing time are key concerns in machine learning. In general, the computing time required for data analysis grows with the size of the given data, so a sampling approach is needed to reduce the size of the training data. However, shrinking the training data also tends to lower the accuracy of the constructed model. To address this trade-off, we propose a new statistical sampling method whose performance is close to that obtained with the full data set. We also suggest a rule for selecting the optimal sampling technique for a given data structure. Using cluster sampling, stratified sampling, and systematic sampling, the proposed method reduces computing time while retaining most of the accuracy. We verify the improved performance of the proposed method by comparing the accuracy and computing time obtained from sampled data against those from the full data on benchmark machine learning data sets.

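The three sampling designs named in the abstract can be sketched in code. The paper's experiments use R with the 'sampling' and 'nnet' packages (refs. 25-27); the following Python sketch only illustrates the general techniques. The function names, the proportional allocation in the stratified case, and the assumption of pre-formed clusters are choices of this sketch, not details from the paper.

```python
import random

def systematic_sample(data, n):
    """Systematic sampling: take every k-th element after a random start."""
    k = len(data) // n                 # sampling interval
    start = random.randrange(k)
    return data[start::k][:n]

def stratified_sample(data, labels, n):
    """Stratified sampling with proportional allocation across class strata."""
    strata = {}
    for x, y in zip(data, labels):
        strata.setdefault(y, []).append(x)
    sample = []
    for members in strata.values():
        # allocate sample size in proportion to stratum size (at least 1)
        m = max(1, round(n * len(members) / len(data)))
        sample.extend(random.sample(members, min(m, len(members))))
    return sample

def cluster_sample(clusters, n_clusters):
    """Cluster sampling: draw whole clusters at random and keep every member."""
    chosen = random.sample(sorted(clusters), n_clusters)
    return [x for c in chosen for x in clusters[c]]
```

A model would then be trained on the reduced sample, and its accuracy and training time compared against training on the full data, which is the comparison the paper reports.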
References

  1. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2001.
  2. T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
  3. A. Ben-Hur, D. Horn, H. T. Siegelmann, V. N. Vapnik, “Support Vector Clustering,” Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.
  4. S. R. Gunn, “Support Vector Machines for Classification and Regression,” Technical Report, University of Southampton, 1998.
  5. V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
  6. V. N. Vapnik, “An Overview of Statistical Learning Theory,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988-999, 1999. https://doi.org/10.1109/72.788640
  7. Z.-J. Chen, B. Liu, X.-P. He, “A SVC Iterative Learning Algorithm Based on Sample Selection for Large Samples,” Proceedings of International Conference on Machine Learning and Cybernetics, vol. 6, pp. 3308-3313, 2007.
  8. M.-H. Ha, L.-F. Zheng, J.-Q. Chen, “The Key Theorem of Learning Theory Based on Random Sets Samples,” Proceedings of International Conference on Machine Learning and Cybernetics, vol. 5, pp. 2826-2831, 2007.
  9. Y. S. Jia, C. Y. Jia, H. W. Qi, “A New Nu-Support Vector Machine for Training Sets with Duplicate Samples,” Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, pp. 4370-4373, 2005.
  10. W. Ng, M. Dash, “An Evaluation of Progressive Sampling for Imbalanced Data Sets,” Proceedings of the Sixth IEEE International Conference on Data Mining, pp. 657-661, 2006.
  11. K.-H. Yang, G.-L. Shan, L.-L. Zhao, “Correlation Coefficient Method for Support Vector Machine Input Samples,” Proceedings of International Conference on Machine Learning and Cybernetics, pp. 2856-2861, 2006.
  12. C. S. Ding, Q. Wu, C. T. Hsieh, M. Pedram, “Stratified Random Sampling for Power Estimation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 6, pp. 465-471, 1998. https://doi.org/10.1109/43.703828
  13. M. Keramat, R. Kielbasa, “A study of stratified sampling in variance reduction techniques for parametric yield estimation,” Proceedings of IEEE International Symposium on Circuits and Systems, vol. 3, pp. 1652-1655, 1997.
  14. P. A. D. I. Santos, Jr., R. J. Burke, J. M. Tien, “Progressive Random Sampling With Stratification,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 37, no. 6, pp. 1223-1230, 2007.
  15. M. Xing, M. Jaeger, H. Baogang, “An Effective Stratified Sampling Scheme for Environment Maps with Median Cut Method,” Proceedings of International Conference on Computer Graphics, Imaging and Visualisation, pp. 384-389, 2006.
  16. The UC Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/
  17. S. K. Thompson, Sampling, 2nd ed., John Wiley & Sons, 2002.
  18. S. Jun, “Support Vector Machine based on Stratified Sampling,” International Journal of Fuzzy Logic and Intelligent Systems, vol. 9, no. 2, pp. 141-146, 2009. https://doi.org/10.5391/IJFIS.2009.9.2.141
  19. S. Jun, “Improvement of SOM using Stratification,” International Journal of Fuzzy Logic and Intelligent Systems, vol. 9, no. 1, pp. 36-41, 2009. https://doi.org/10.5391/IJFIS.2009.9.1.036
  20. S. Jun, “Web Usage Mining Using Evolutionary Support Vector Machine,” Lecture Notes in Artificial Intelligence, vol. 3809, pp. 1015-1020, Springer-Verlag, 2005.
  21. J. Wang, X. Wu, C. Zhang, “Support vector machines based on K-means clustering for real-time business intelligence systems,” International Journal of Business Intelligence and Data Mining, vol. 1, no. 1, pp. 54-64, 2005. https://doi.org/10.1504/IJBIDM.2005.007318
  22. 김영원, 류제복, 박진우, 홍기학 (trans.), Understanding and Application of Survey Sampling (표본조사의 이해와 활용), 교우사, 2006.
  23. R. L. Scheaffer, W. Mendenhall III, R. L. Ott, Elementary Survey Sampling, 6th ed., Duxbury, 2006.
  24. 손건태, Introduction to Computational Statistics: Statistical Simulation and Estimation Algorithms (전산통계개론 - 통계적 모의실험과 추정 알고리즘), 4th ed., 자유아카데미, 2005.
  25. R Development Core Team, R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org, 2010.
  26. Y. Tillé, A. Matei, Survey Sampling-Package 'sampling', R-Project CRAN, 2009.
  27. B. Ripley, Feed-forward Neural Networks and Multinomial Log-Linear Models-Package 'nnet', R-Project CRAN, 2009.

Cited by

  1. Design of Client-Server Model For Effective Processing and Utilization of Bigdata, vol. 22, no. 4, 2016, https://doi.org/10.13088/jiis.2016.22.4.109