
Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning

  • Sugiyama, Masashi (Department of Computer Science, Tokyo Institute of Technology) ;
  • Liu, Song (Department of Computer Science, Tokyo Institute of Technology) ;
  • du Plessis, Marthinus Christoffel (Department of Computer Science, Tokyo Institute of Technology) ;
  • Yamanaka, Masao (Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology) ;
  • Yamada, Makoto (NTT Communication Science Laboratories, NTT Corporation) ;
  • Suzuki, Taiji (Department of Mathematical Informatics, The University of Tokyo) ;
  • Kanamori, Takafumi (Department of Computer Science and Mathematical Informatics, Nagoya University)
  • Received : 2013.03.12
  • Accepted : 2013.05.03
  • Published : 2013.06.30

Abstract

Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes, such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications, including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than a naive two-step approach of first estimating probability distributions and then approximating the divergence. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the $L^2$-distance are more useful in practice because of their computationally efficient approximability, high numerical stability, and superior robustness against outliers.
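For reference, the divergences named above have the following standard definitions for densities $p(x)$ and $q(x)$ (the factor $\frac{1}{2}$ in the Pearson divergence follows the convention of the density-ratio estimation literature):

$$
\mathrm{KL}(p\|q) = \int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}x, \qquad
\mathrm{PE}(p\|q) = \frac{1}{2}\int q(x)\left(\frac{p(x)}{q(x)}-1\right)^{2}\mathrm{d}x,
$$
$$
\mathrm{rPE}_{\alpha}(p\|q) = \mathrm{PE}\bigl(p \,\big\|\, \alpha p + (1-\alpha)q\bigr)\ \ \text{for}\ \alpha\in[0,1), \qquad
L^{2}(p,q) = \int \bigl(p(x)-q(x)\bigr)^{2}\,\mathrm{d}x.
$$

To illustrate the direct-approximation theme, the sketch below estimates the Pearson divergence from two samples without estimating either density: it fits the density ratio $r(x)=p(x)/q(x)$ by regularized least squares over a Gaussian kernel model, in the spirit of the least-squares importance fitting methods of references 23 and 24. This is a minimal sketch, not the authors' implementation; the function name, the kernel width `sigma`, and the regularization parameter `lam` are illustrative choices, and in practice the latter two would be selected by cross-validation.

```python
import numpy as np

def pearson_divergence(x_nu, x_de, sigma=1.0, lam=1e-3):
    """Directly approximate PE(p||q) from samples x_nu ~ p and x_de ~ q.

    The ratio r(x) = p(x)/q(x) is modeled as a linear combination of
    Gaussian kernels and fitted by regularized least squares; sigma and
    lam are illustrative defaults (normally tuned by cross-validation).
    """
    centers = x_nu[: min(100, len(x_nu))]  # kernel centers from p-samples

    def design(x):
        # Gaussian kernel matrix K(x_i, c_l), shape (len(x), len(centers))
        sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dist / (2.0 * sigma ** 2))

    phi_de = design(x_de)
    H = phi_de.T @ phi_de / len(x_de)  # approximates E_q[psi(x) psi(x)^T]
    h = design(x_nu).mean(axis=0)      # approximates E_p[psi(x)]
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    # With the fitted ratio, PE(p||q) = E_p[r]/2 - 1/2 and E_p[r] ~ h^T theta.
    return 0.5 * h @ theta - 0.5

# Toy check: two unit-variance Gaussians whose means differ by 0.5.
rng = np.random.default_rng(0)
x_p = rng.normal(0.5, 1.0, size=(500, 1))
x_q = rng.normal(0.0, 1.0, size=(500, 1))
print(pearson_divergence(x_p, x_q))  # clearly positive; near 0 when p = q
```

Note that the estimate is obtained from the single linear system $(\hat H+\lambda I)\hat\theta=\hat h$; no density model for $p$ or $q$ is ever constructed, which is exactly the point argued in the abstract.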


References

  1. M. Sugiyama, T. Suzuki, Y. Itoh, T. Kanamori, and M. Kimura, "Least-squares two-sample test," Neural Networks, vol. 24, no. 7, pp. 735-751, 2011. https://doi.org/10.1016/j.neunet.2011.04.003
  2. T. Kanamori, T. Suzuki, and M. Sugiyama, "f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models," IEEE Transactions on Information Theory, vol. 58, no. 2, pp. 708-720, 2012. https://doi.org/10.1109/TIT.2011.2163380
  3. Y. Kawahara and M. Sugiyama, "Sequential change-point detection based on direct density-ratio estimation," Statistical Analysis and Data Mining, vol. 5, no. 2, pp. 114-127, 2012. https://doi.org/10.1002/sam.10124
  4. M. C. du Plessis and M. Sugiyama, "Semi-supervised learning of class balance under class-prior change by distribution matching," in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, 2012, pp. 823-830.
  5. M. Yamanaka, M. Matsugu, and M. Sugiyama, "Salient object detection based on direct density ratio estimation," IPSJ Transactions on Mathematical Modeling and Its Applications, 2013, to appear.
  6. M. Yamanaka, M. Matsugu, and M. Sugiyama, "Detection of activities and events without explicit categorization," IPSJ Transactions on Mathematical Modeling and Its Applications, 2013, to appear.
  7. S. Liu, M. Yamada, N. Collier, and M. Sugiyama, "Change-point detection in time-series data by relative density-ratio estimation," Neural Networks, vol. 43, pp. 72-83, 2013. https://doi.org/10.1016/j.neunet.2013.01.012
  8. M. Sugiyama, "Machine learning with squared-loss mutual information," Entropy, vol. 15, no. 1, pp. 80-112, 2013.
  9. M. Sugiyama and T. Suzuki, "Least-squares independence test," IEICE Transactions on Information and Systems, vol. 94, no. 6, pp. 1333-1336, 2011.
  10. T. Suzuki, M. Sugiyama, T. Kanamori, and J. Sese, "Mutual information estimation reveals global associations between stimuli and biological processes," BMC Bioinformatics, vol. 10, no. 1, p. S52, 2009. https://doi.org/10.1186/1471-2105-10-S1-S52
  11. W. Jitkrittum, H. Hachiya, and M. Sugiyama, "Feature selection via L1-penalized squared-loss mutual information," IEICE Transactions on Information and Systems, 2013, to appear.
  12. T. Suzuki and M. Sugiyama, "Sufficient dimension reduction via squared-loss mutual information estimation," Neural Computation, vol. 25, no. 3, pp. 725-758, 2013. https://doi.org/10.1162/NECO_a_00407
  13. M. Yamada, G. Niu, J. Takagi, and M. Sugiyama, "Computationally efficient sufficient dimension reduction via squared-loss mutual information," JMLR Workshop and Conference Proceedings, vol. 20, pp. 247-262, 2011.
  14. M. Karasuyama and M. Sugiyama, "Canonical dependency analysis based on squared-loss mutual information," Neural Networks, vol. 34, pp. 46-55, 2012. https://doi.org/10.1016/j.neunet.2012.06.009
  15. M. Yamada and M. Sugiyama, "Cross-domain object matching with model selection," JMLR Workshop and Conference Proceedings, vol. 15, pp. 807-815, 2011.
  16. T. Suzuki and M. Sugiyama, "Least-squares independent component analysis," Neural Computation, vol. 23, no. 1, pp. 284-301, 2011. https://doi.org/10.1162/NECO_a_00062
  17. M. Sugiyama, M. Yamada, M. Kimura, and H. Hachiya, "On information-maximization clustering: tuning parameter selection and analytic solution," in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, 2011, pp. 65-72.
  18. M. Kimura and M. Sugiyama, "Dependence maximization clustering with least-squares mutual information," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 15, no. 7, pp. 800-805, 2011. https://doi.org/10.20965/jaciii.2011.p0800
  19. M. Yamada and M. Sugiyama, "Dependence minimizing regression with model selection for nonlinear causal inference under non-Gaussian noise," in Proceedings of the 24th AAAI Conference on Artificial Intelligence, Atlanta, GA, 2010, pp. 643-648.
  20. V. N. Vapnik, Statistical Learning Theory, New York, NY: Wiley, 1998.
  21. M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe, "Direct importance estimation for covariate shift adaptation," Annals of the Institute of Statistical Mathematics, vol. 60, no. 4, pp. 699-746, 2008. https://doi.org/10.1007/s10463-008-0197-x
  22. X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847-5861, 2010. https://doi.org/10.1109/TIT.2010.2068870
  23. T. Kanamori, S. Hido, and M. Sugiyama, "A least-squares approach to direct importance estimation," Journal of Machine Learning Research, vol. 10, pp. 1391-1445, 2009.
  24. M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama, "Relative density-ratio estimation for robust distribution comparison," Neural Computation, vol. 25, no. 5, pp. 1324-1370, 2013. https://doi.org/10.1162/NECO_a_00442
  25. M. Sugiyama, T. Suzuki, T. Kanamori, M. C. du Plessis, S. Liu, and I. Takeuchi, "Density difference estimation," Neural Computation, 2013, to appear.
  26. S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951. https://doi.org/10.1214/aoms/1177729694
  27. S. Amari and H. Nagaoka, Methods of Information Geometry, Providence, RI: American Mathematical Society, 2000.
  28. M. Sugiyama, T. Suzuki, and T. Kanamori, Density Ratio Estimation in Machine Learning, New York, NY: Cambridge University Press, 2012.
  29. C. Cortes, Y. Mansour, and M. Mohri, "Learning bounds for importance weighting," in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, R. Zemel, J. Shawe-Taylor, and A. Culotta, Eds., La Jolla, CA: Neural Information Processing Systems, 2010, pp. 442-450.
  30. K. Pearson, "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling," Philosophical Magazine, vol. 50, no. 302, pp. 157-175, 1900. https://doi.org/10.1080/14786440009463897
  31. S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society B, vol. 28, no. 1, pp. 131-142, 1966.
  32. I. Csiszár, "Information-type measures of difference of probability distributions and indirect observation," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299-318, 1967.
  33. M. Sugiyama, T. Suzuki, and T. Kanamori, "Density ratio matching under the Bregman divergence: a unified framework of density ratio estimation," Annals of the Institute of Statistical Mathematics, vol. 64, no. 5, pp. 1009-1044, 2012. https://doi.org/10.1007/s10463-011-0343-8
  34. Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama, "Direct density ratio estimation for large-scale covariate shift adaptation," Information and Media Technologies, vol. 4, no. 2, pp. 529-546, 2009.
  35. M. Yamada and M. Sugiyama, "Direct importance estimation with Gaussian mixture models," IEICE Transactions on Information and Systems, vol. 92, no. 10, pp. 2159-2162, 2009.
  36. M. Yamada, M. Sugiyama, G. Wichern, and J. Simm, "Direct importance estimation with a mixture of probabilistic principal component analyzers," IEICE Transactions on Information and Systems, vol. 93, no. 10, pp. 2846-2849, 2010.
  37. A. Keziou, "Dual representation of $\phi$-divergences and applications," Comptes Rendus Mathematique, vol. 336, no. 10, pp. 857-862, 2003. https://doi.org/10.1016/S1631-073X(03)00215-2
  38. R. T. Rockafellar, Convex Analysis, Princeton, NJ: Princeton University Press, 1970.
  39. R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267-288, 1996.
  40. R. Tomioka, T. Suzuki, and M. Sugiyama, "Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation," Journal of Machine Learning Research, vol. 12, pp. 1537-1586, 2011.
  41. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407-499, 2004. https://doi.org/10.1214/009053604000000067
  42. O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, Cambridge, MA: MIT Press, 2006.
  43. R. Rifkin, G. Yeo, and T. Poggio, "Regularized least-squares classification," in Advances in Learning Theory: Methods, Models and Applications, J. A. K. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, Eds. Amsterdam, the Netherlands: IOS Press, 2003, pp. 131-154.
  44. M. Sugiyama, M. Krauledat, and K. R. Müller, "Covariate shift adaptation by importance weighted cross validation," Journal of Machine Learning Research, vol. 8, pp. 985-1005, 2007.
  45. T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H. Y. Shum, "Learning to detect a salient object," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353-367, 2011. https://doi.org/10.1109/TPAMI.2010.70
  46. C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423, 1948. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
  47. T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed., Hoboken, NJ: John Wiley & Sons Inc., 2006.
  48. K. Torkkola, "Feature extraction by non-parametric mutual information maximization," Journal of Machine Learning Research, vol. 3, pp. 1415-1438, 2003.
  49. J. Sainui and M. Sugiyama, "Direct approximation of quadratic mutual information and its application to dependence-maximization clustering," IEICE Transactions on Information and Systems, 2013, submitted for publication.
  50. M. Sugiyama, M. Kawanabe, and P. L. Chui, "Dimensionality reduction for density ratio estimation in high-dimensional spaces," Neural Networks, vol. 23, no. 1, pp. 44-59, 2010. https://doi.org/10.1016/j.neunet.2009.07.007
  51. M. Sugiyama, M. Yamada, P. von Bünau, T. Suzuki, T. Kanamori, and M. Kawanabe, "Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search," Neural Networks, vol. 24, no. 2, pp. 183-198, 2011. https://doi.org/10.1016/j.neunet.2010.10.005
  52. M. Yamada and M. Sugiyama, "Direct density-ratio estimation with dimensionality reduction via hetero-distributional subspace analysis," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, San Francisco, CA, 2011, pp. 549-554.

Cited by

  1. Statistical Analysis of Distance Estimators with Density Differences and Density Ratios, vol. 16, no. 2, 2014, https://doi.org/10.3390/e16020921
  2. Minimum Distance Estimation of Milky Way Model Parameters and Related Inference, vol. 3, no. 1, 2015, https://doi.org/10.1137/130935525
  3. On second order efficient robust inference, vol. 88, 2015, https://doi.org/10.1016/j.csda.2015.02.008
  4. Direct Learning of Sparse Changes in Markov Networks by Density Ratio Estimation, vol. 26, no. 6, 2014, https://doi.org/10.1162/NECO_a_00589
  5. Computationally Efficient Class-Prior Estimation under Class Balance Change Using Energy Distance, vol. E99.D, no. 1, 2016, https://doi.org/10.1587/transinf.2015EDP7212
  6. Direct Density Ratio Estimation with Convolutional Neural Networks with Application in Outlier Detection, vol. E98.D, no. 5, 2015, https://doi.org/10.1587/transinf.2014EDP7335
  7. Noisy and incomplete fingerprint classification using local ridge distribution models, vol. 48, no. 2, 2015, https://doi.org/10.1016/j.patcog.2014.07.030
  8. Non-Bayesian Social Learning with Observation Reuse and Soft Switching, vol. 14, no. 2, 2018, https://doi.org/10.1145/3199513