
Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning

  • Sugiyama, Masashi (Department of Computer Science, Tokyo Institute of Technology) ;
  • Liu, Song (Department of Computer Science, Tokyo Institute of Technology) ;
  • du Plessis, Marthinus Christoffel (Department of Computer Science, Tokyo Institute of Technology) ;
  • Yamanaka, Masao (Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology) ;
  • Yamada, Makoto (NTT Communication Science Laboratories, NTT Corporation) ;
  • Suzuki, Taiji (Department of Mathematical Informatics, The University of Tokyo) ;
  • Kanamori, Takafumi (Department of Computer Science and Mathematical Informatics, Nagoya University)
  • Received : 2013.03.12
  • Accepted : 2013.05.03
  • Published : 2013.06.30

Abstract

Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes, such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications, including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than a naive two-step approach of first estimating probability distributions and then approximating the divergence. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the $L^2$-distance are more useful in practice because of their computationally efficient approximability, high numerical stability, and superior robustness against outliers.
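For reference, the divergences named above have the following standard definitions for densities $p(x)$ and $q(x)$ (the factor $\frac{1}{2}$ in the Pearson divergence follows the convention of the density-ratio estimation literature):

$$
\mathrm{KL}(p\|q) = \int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}x, \qquad
\mathrm{PE}(p\|q) = \frac{1}{2}\int q(x)\left(\frac{p(x)}{q(x)}-1\right)^{2}\mathrm{d}x,
$$
$$
\mathrm{rPE}_{\alpha}(p\|q) = \mathrm{PE}\bigl(p \,\big\|\, \alpha p + (1-\alpha)q\bigr)\ \ \text{for}\ \alpha\in[0,1), \qquad
L^{2}(p,q) = \int \bigl(p(x)-q(x)\bigr)^{2}\,\mathrm{d}x.
$$

To illustrate the direct-approximation theme, the sketch below estimates the Pearson divergence from two samples without estimating either density: it fits the density ratio $r(x)=p(x)/q(x)$ by regularized least squares over a Gaussian kernel model, in the spirit of the least-squares importance fitting methods of references 23 and 24. This is a minimal sketch, not the authors' implementation; the function name, the kernel width `sigma`, and the regularization parameter `lam` are illustrative choices, and in practice the latter two would be selected by cross-validation.

```python
import numpy as np

def pearson_divergence(x_nu, x_de, sigma=1.0, lam=1e-3):
    """Directly approximate PE(p||q) from samples x_nu ~ p and x_de ~ q.

    The ratio r(x) = p(x)/q(x) is modeled as a linear combination of
    Gaussian kernels and fitted by regularized least squares; sigma and
    lam are illustrative defaults (normally tuned by cross-validation).
    """
    centers = x_nu[: min(100, len(x_nu))]  # kernel centers from p-samples

    def design(x):
        # Gaussian kernel matrix K(x_i, c_l), shape (len(x), len(centers))
        sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dist / (2.0 * sigma ** 2))

    phi_de = design(x_de)
    H = phi_de.T @ phi_de / len(x_de)  # approximates E_q[psi(x) psi(x)^T]
    h = design(x_nu).mean(axis=0)      # approximates E_p[psi(x)]
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    # With the fitted ratio, PE(p||q) = E_p[r]/2 - 1/2 and E_p[r] ~ h^T theta.
    return 0.5 * h @ theta - 0.5

# Toy check: two unit-variance Gaussians whose means differ by 0.5.
rng = np.random.default_rng(0)
x_p = rng.normal(0.5, 1.0, size=(500, 1))
x_q = rng.normal(0.0, 1.0, size=(500, 1))
print(pearson_divergence(x_p, x_q))  # clearly positive; near 0 when p = q
```

Note that the estimate is obtained from the single linear system $(\hat H+\lambda I)\hat\theta=\hat h$; no density model for $p$ or $q$ is ever constructed, which is exactly the point argued in the abstract.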


References

  1. M. Sugiyama, T. Suzuki, Y. Itoh, T. Kanamori, and M. Kimura, "Least-squares two-sample test," Neural Networks, vol. 24, no. 7, pp. 735-751, 2011. https://doi.org/10.1016/j.neunet.2011.04.003
  2. T. Kanamori, T. Suzuki, and M. Sugiyama, "f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models," IEEE Transactions on Information Theory, vol. 58, no. 2, pp. 708-720, 2012. https://doi.org/10.1109/TIT.2011.2163380
  3. Y. Kawahara and M. Sugiyama, "Sequential change-point detection based on direct density-ratio estimation," Statistical Analysis and Data Mining, vol. 5, no. 2, pp. 114-127, 2012. https://doi.org/10.1002/sam.10124
  4. M. C. du Plessis and M. Sugiyama, "Semi-supervised learning of class balance under class-prior change by distribution matching," in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, 2012, pp. 823-830.
  5. M. Yamanaka, M. Matsugu, and M. Sugiyama, "Salient object detection based on direct density ratio estimation," IPSJ Transactions on Mathematical Modeling and Its Applications, 2013, to appear.
  6. M. Yamanaka, M. Matsugu, and M. Sugiyama, "Detection of activities and events without explicit categorization," IPSJ Transactions on Mathematical Modeling and Its Applications, 2013, to appear.
  7. S. Liu, M. Yamada, N. Collier, and M. Sugiyama, "Change-point detection in time-series data by relative density-ratio estimation," Neural Networks, vol. 43, pp. 72-83, 2013. https://doi.org/10.1016/j.neunet.2013.01.012
  8. M. Sugiyama, "Machine learning with squared-loss mutual information," Entropy, vol. 15, no. 1, pp. 80-112, 2013.
  9. M. Sugiyama and T. Suzuki, "Least-squares independence test," IEICE Transactions on Information and Systems, vol. 94, no. 6, pp. 1333-1336, 2011.
  10. T. Suzuki, M. Sugiyama, T. Kanamori, and J. Sese, "Mutual information estimation reveals global associations between stimuli and biological processes," BMC Bioinformatics, vol. 10, no. 1, p. S52, 2009. https://doi.org/10.1186/1471-2105-10-S1-S52
  11. W. Jitkrittum, H. Hachiya, and M. Sugiyama, "Feature selection via L1-penalized squared-loss mutual information," IEICE Transactions on Information and Systems, 2013, to appear.
  12. T. Suzuki and M. Sugiyama, "Sufficient dimension reduction via squared-loss mutual information estimation," Neural Computation, vol. 25, no. 3, pp. 725-758, 2013. https://doi.org/10.1162/NECO_a_00407
  13. M. Yamada, G. Niu, J. Takagi, and M. Sugiyama, "Computationally efficient sufficient dimension reduction via squared-loss mutual information," JMLR Workshop and Conference Proceedings, vol. 20, pp. 247-262, 2011.
  14. M. Karasuyama and M. Sugiyama, "Canonical dependency analysis based on squared-loss mutual information," Neural Networks, vol. 34, pp. 46-55, 2012. https://doi.org/10.1016/j.neunet.2012.06.009
  15. M. Yamada and M. Sugiyama, "Cross-domain object matching with model selection," JMLR Workshop and Conference Proceedings, vol. 15, pp. 807-815, 2011.
  16. T. Suzuki and M. Sugiyama, "Least-squares independent component analysis," Neural Computation, vol. 23, no. 1, pp. 284-301, 2011. https://doi.org/10.1162/NECO_a_00062
  17. M. Sugiyama, M. Yamada, M. Kimura, and H. Hachiya, "On information-maximization clustering: tuning parameter selection and analytic solution," in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, 2011, pp. 65-72.
  18. M. Kimura and M. Sugiyama, "Dependence maximization clustering with least-squares mutual information," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 15, no. 7, pp. 800-805, 2011. https://doi.org/10.20965/jaciii.2011.p0800
  19. M. Yamada and M. Sugiyama, "Dependence minimizing regression with model selection for nonlinear causal inference under non-Gaussian noise," in Proceedings of the 24th AAAI Conference on Artificial Intelligence, Atlanta, GA, 2010, pp. 643-648.
  20. V. N. Vapnik, Statistical Learning Theory, New York, NY: Wiley, 1998.
  21. M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe, "Direct importance estimation for covariate shift adaptation," Annals of the Institute of Statistical Mathematics, vol. 60, no. 4, pp. 699-746, 2008. https://doi.org/10.1007/s10463-008-0197-x
  22. X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847-5861, 2010. https://doi.org/10.1109/TIT.2010.2068870
  23. T. Kanamori, S. Hido, and M. Sugiyama, "A least-squares approach to direct importance estimation," Journal of Machine Learning Research, vol. 10, pp. 1391-1445, 2009.
  24. M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama, "Relative density-ratio estimation for robust distribution comparison," Neural Computation, vol. 25, no. 5, pp. 1324-1370, 2013. https://doi.org/10.1162/NECO_a_00442
  25. M. Sugiyama, T. Suzuki, T. Kanamori, M. C. du Plessis, S. Liu, and I. Takeuchi, "Density difference estimation," Neural Computation, 2013, to appear.
  26. S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951. https://doi.org/10.1214/aoms/1177729694
  27. S. Amari and H. Nagaoka, Methods of Information Geometry, Providence, RI: American Mathematical Society, 2000.
  28. M. Sugiyama, T. Suzuki, and T. Kanamori, Density Ratio Estimation in Machine Learning, New York, NY: Cambridge University Press, 2012.
  29. C. Cortes, Y. Mansour, and M. Mohri, "Learning bounds for importance weighting," in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, R. Zemel, J. Shawe-Taylor, and A. Culotta, Eds., La Jolla, CA: Neural Information Processing Systems, 2010, pp. 442-450.
  30. K. Pearson, "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling," Philosophical Magazine, vol. 50, no. 302, pp. 157-175, 1900. https://doi.org/10.1080/14786440009463897
  31. S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society B, vol. 28, no. 1, pp. 131-142, 1966.
  32. I. Csiszár, "Information-type measures of difference of probability distributions and indirect observation," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299-318, 1967.
  33. M. Sugiyama, T. Suzuki, and T. Kanamori, "Density ratio matching under the Bregman divergence: a unified framework of density ratio estimation," Annals of the Institute of Statistical Mathematics, vol. 64, no. 5, pp. 1009-1044, 2012. https://doi.org/10.1007/s10463-011-0343-8
  34. Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama, "Direct density ratio estimation for large-scale covariate shift adaptation," Information and Media Technologies, vol. 4, no. 2, pp. 529-546, 2009.
  35. M. Yamada and M. Sugiyama, "Direct importance estimation with Gaussian mixture models," IEICE Transactions on Information and Systems, vol. 92, no. 10, pp. 2159-2162, 2009.
  36. M. Yamada, M. Sugiyama, G. Wichern, and J. Simm, "Direct importance estimation with a mixture of probabilistic principal component analyzers," IEICE Transactions on Information and Systems, vol. 93, no. 10, pp. 2846-2849, 2010.
  37. A. Keziou, "Dual representation of $\phi$-divergences and applications," Comptes Rendus Mathematique, vol. 336, no. 10, pp. 857-862, 2003. https://doi.org/10.1016/S1631-073X(03)00215-2
  38. R. T. Rockafellar, Convex Analysis, Princeton, NJ: Princeton University Press, 1970.
  39. R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267-288, 1996.
  40. R. Tomioka, T. Suzuki, and M. Sugiyama, "Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation," Journal of Machine Learning Research, vol. 12, pp. 1537-1586, 2011.
  41. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407-499, 2004. https://doi.org/10.1214/009053604000000067
  42. O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, Cambridge, MA: MIT Press, 2006.
  43. R. Rifkin, G. Yeo, and T. Poggio, "Regularized least-squares classification," in Advances in Learning Theory: Methods, Models and Applications, J. A. K. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, Eds. Amsterdam, the Netherlands: IOS Press, 2003, pp. 131-154.
  44. M. Sugiyama, M. Krauledat, and K. R. Müller, "Covariate shift adaptation by importance weighted cross validation," Journal of Machine Learning Research, vol. 8, pp. 985-1005, 2007.
  45. T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H. Y. Shum, "Learning to detect a salient object," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353-367, 2011. https://doi.org/10.1109/TPAMI.2010.70
  46. C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423, 1948. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
  47. T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed., Hoboken, NJ: John Wiley & Sons Inc., 2006.
  48. K. Torkkola, "Feature extraction by non-parametric mutual information maximization," Journal of Machine Learning Research, vol. 3, pp. 1415-1438, 2003.
  49. J. Sainui and M. Sugiyama, "Direct approximation of quadratic mutual information and its application to dependence-maximization clustering," IEICE Transactions on Information and Systems, 2013, submitted for publication.
  50. M. Sugiyama, M. Kawanabe, and P. L. Chui, "Dimensionality reduction for density ratio estimation in high-dimensional spaces," Neural Networks, vol. 23, no. 1, pp. 44-59, 2010. https://doi.org/10.1016/j.neunet.2009.07.007
  51. M. Sugiyama, M. Yamada, P. von Bünau, T. Suzuki, T. Kanamori, and M. Kawanabe, "Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search," Neural Networks, vol. 24, no. 2, pp. 183-198, 2011. https://doi.org/10.1016/j.neunet.2010.10.005
  52. M. Yamada and M. Sugiyama, "Direct density-ratio estimation with dimensionality reduction via hetero-distributional subspace analysis," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, San Francisco, CA, 2011, pp. 549-554.

Cited by

  1. Statistical Analysis of Distance Estimators with Density Differences and Density Ratios, vol. 16, no. 2, 2014, https://doi.org/10.3390/e16020921
  2. Minimum Distance Estimation of Milky Way Model Parameters and Related Inference, vol. 3, no. 1, 2015, https://doi.org/10.1137/130935525
  3. On second order efficient robust inference, vol. 88, 2015, https://doi.org/10.1016/j.csda.2015.02.008
  4. Direct Learning of Sparse Changes in Markov Networks by Density Ratio Estimation, vol. 26, no. 6, 2014, https://doi.org/10.1162/NECO_a_00589
  5. Computationally Efficient Class-Prior Estimation under Class Balance Change Using Energy Distance, vol. E99.D, no. 1, 2016, https://doi.org/10.1587/transinf.2015EDP7212
  6. Direct Density Ratio Estimation with Convolutional Neural Networks with Application in Outlier Detection, vol. E98.D, no. 5, 2015, https://doi.org/10.1587/transinf.2014EDP7335
  7. Noisy and incomplete fingerprint classification using local ridge distribution models, vol. 48, no. 2, 2015, https://doi.org/10.1016/j.patcog.2014.07.030
  8. Non-Bayesian Social Learning with Observation Reuse and Soft Switching, vol. 14, no. 2, 2018, https://doi.org/10.1145/3199513