A measure of discrepancy based on margin of victory useful for the determination of random forest size

  • Received : 2017.04.17
  • Accepted : 2017.05.16
  • Published : 2017.05.31

Abstract

This study proposes a measure of discrepancy based on the margin of victory (MV) that may be useful in determining the size of a random forest (RF) for classification. Here, MV is the scaled difference between the votes that the two most popular classes of the current RF receive in the infinite RF. A negative MV indicates that the current and infinite RFs disagree about their two most popular classes, so max(-MV, 0) is proposed as a reasonable measure of discrepancy. Based on this measure, we propose a diagnostic statistic for determining RF size and derive its asymptotic distribution. Finally, a simulation study compares the finite-sample performance of the proposed statistic with that of other recently proposed diagnostic statistics.
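The definition of max(-MV, 0) can be illustrated for a single observation with a minimal sketch. It rests on several assumptions: the function name `margin_discrepancy` and its arguments are hypothetical, the infinite-forest vote shares are treated as known inputs (in practice they would have to be estimated), and the scaling of MV, which the abstract does not specify, is omitted, so the plain difference in vote shares is used.

```python
import numpy as np


def margin_discrepancy(current_votes, limit_shares):
    """Return (MV, max(-MV, 0)) for one observation.

    current_votes: per-class vote counts from the current (finite) forest.
    limit_shares:  per-class vote shares of the infinite forest
                   (assumed known here purely for illustration).
    """
    current_votes = np.asarray(current_votes, dtype=float)
    limit_shares = np.asarray(limit_shares, dtype=float)

    # Rank classes by their votes in the current forest.
    order = np.argsort(-current_votes, kind="stable")
    first, second = order[0], order[1]

    # Margin of victory of the current top two classes, evaluated with
    # the infinite-forest shares. Unscaled: the scaling is an open
    # assumption because the abstract does not state it.
    mv = limit_shares[first] - limit_shares[second]

    # The discrepancy is positive only when the infinite forest
    # reverses the ordering of the current top two classes.
    return mv, max(-mv, 0.0)


# Current forest: class 0 leads class 1, but the infinite forest prefers class 1.
mv, disc = margin_discrepancy([26, 24, 0], [0.45, 0.50, 0.05])
print(mv < 0, disc > 0)
```

When the infinite-forest shares preserve the current ordering, MV is positive and the discrepancy is zero, which is the case in which adding more trees would not change the top two classes.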

Acknowledgement

Supported by: National Research Foundation of Korea
