Non-Simultaneous Sampling Deactivation during the Parameter Approximation of a Topic Model

  • Jeong, Young-Seob (Department of Computer Science, Korea Advanced Institute of Science and Technology) ;
  • Jin, Sou-Young (Department of Computer Science, Korea Advanced Institute of Science and Technology) ;
  • Choi, Ho-Jin (Department of Computer Science, Korea Advanced Institute of Science and Technology)
  • Received : 2012.09.06
  • Accepted : 2012.12.21
  • Published : 2013.01.31

Abstract

Since Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) were introduced, many revised or extended topic models have appeared. Because the likelihood of these models is intractable, training any topic model requires an approximation algorithm such as variational approximation, Laplace approximation, or Markov chain Monte Carlo (MCMC). Although these approximation algorithms perform well, training a topic model is still computationally expensive given the large amount of data it requires. In this paper, we propose a new method, called non-simultaneous sampling deactivation, for efficient approximation of the parameters of a topic model. Whereas traditional approximation algorithms sample or obtain every random variable with a single predefined burn-in period, our method is based on the observation that the random variable nodes of a topic model converge at different rates. During the iterative approximation process, the proposed method deactivates each random variable node individually as soon as it converges. Therefore, compared to traditional approaches, in which every node is usually deactivated at the same time, the proposed method improves the efficiency of inference in terms of both time and memory. We do not propose a new approximation algorithm, but a new process applicable to existing approximation algorithms. Through experiments, we demonstrate the time and memory efficiency of the method, and discuss the tradeoff between the efficiency of the approximation process and the consistency of the parameters.
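The core idea can be sketched independently of any particular topic model: in a Gibbs-style sampling loop, check each random variable node for convergence separately and stop updating it as soon as its estimate stabilizes, rather than running every node for one shared burn-in period. The sketch below is illustrative only; the node names, the Gaussian draw standing in for a real conditional sample, and the windowed-mean convergence test are our assumptions, not the paper's actual criterion:

```python
import random

def converged(history, window=20, eps=1e-3):
    """Simple convergence check: the running mean over the last
    `window` samples changes by less than `eps` compared with the
    mean over the previous window."""
    if len(history) < 2 * window:
        return False
    recent = sum(history[-window:]) / window
    earlier = sum(history[-2 * window:-window]) / window
    return abs(recent - earlier) < eps

def sample_with_deactivation(nodes, max_iters=5000, seed=0):
    """Gibbs-style loop in which each node is deactivated (frozen)
    individually once its own estimate has converged, instead of all
    nodes stopping at one shared burn-in point."""
    rng = random.Random(seed)
    active = set(nodes)
    histories = {n: [] for n in nodes}
    stop_iter = {}
    for it in range(max_iters):
        if not active:
            break
        for n in list(active):
            mean, noise = nodes[n]
            # Stand-in for a real conditional draw in a topic model.
            histories[n].append(rng.gauss(mean, noise))
            if converged(histories[n]):
                active.discard(n)  # deactivate only this node
                stop_iter[n] = it
    for n in active:               # nodes that never converged
        stop_iter[n] = max_iters
    estimates = {n: sum(h) / len(h) for n, h in histories.items()}
    return estimates, stop_iter

# Hypothetical nodes: (target value, sampling noise). A low-noise node
# typically stabilizes, and is deactivated, after far fewer iterations.
nodes = {"z_doc1": (0.3, 0.01), "z_doc2": (0.7, 0.2)}
estimates, stop_iter = sample_with_deactivation(nodes)
```

Once a node is deactivated, its samples need no longer be drawn or stored, which is where the time and memory savings described in the abstract come from; the tradeoff is that a node frozen too early may leave its parameter estimate slightly inconsistent with the still-active nodes.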

References

  1. Thomas Hofmann, "Probabilistic latent semantic analysis," in Proc. of 15th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 289-296, July 30-August 1, 1999.
  2. David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, March 2003.
  3. Flora S. Tsai, "A tag-topic model for blog mining," Expert Systems with Applications, vol. 38, no. 5, pp. 5330-5335, May 2011. https://doi.org/10.1016/j.eswa.2010.10.025
  4. David M. Blei and John D. Lafferty, "Dynamic topic models," in Proc. of 23rd International Conference on Machine Learning (ICML), pp. 113-120, June 25-29, 2006.
  5. David Newman, Chaitanya Chemudugunta, and Padhraic Smyth, "Statistical entity-topic models," in Proc. of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 680-686, August 20-23, 2006.
  6. Zhenxing Niu, Gang Hua, Xinbo Gao, and Qi Tian, "Context aware topic model for scene recognition," in Proc. of 25th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2743-2750, June 16-21, 2012.
  7. Young-Seob Jeong and Ho-Jin Choi, "Sequential entity group topic model for getting topic flows of entity groups within one document," in Proc. of 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD), pp. 366-378, May 29-June 1, 2012.
  8. Thomas L. Griffiths and Mark Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences (PNAS), vol. 101, suppl. 1, pp. 5228-5235, April 6, 2004.
  9. Khaled Alsabti, Sanjay Ranka, and Vineet Singh, "An efficient K-means clustering algorithm," in Proc. of 1st IPPS/SPDP Workshop on High Performance Data Mining, March 30-April 3, 1998.
  10. Dan Pelleg and Andrew Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proc. of 17th International Conference on Machine Learning (ICML), pp. 727-734, June 29-July 2, 2000.
  11. Andrew W. Moore, "Very fast EM-based mixture model clustering using multiresolution kd-trees," in Proc. of 12th Conference on Advances in Neural Information Processing Systems (NIPS), pp. 543-549, November 29-December 3, 1998.
  12. Alexander T. Ihler, Erik B. Sudderth, William T. Freeman, and Alan S. Willsky, "Efficient multiscale sampling from products of Gaussian mixtures," in Proc. of 17th Conference on Advances in Neural Information Processing Systems (NIPS), December 9-11, 2003.
  13. Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling, "Fast collapsed Gibbs sampling for latent dirichlet allocation," in Proc. of 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 569-577, August 24-27, 2008.
  14. Limin Yao, David Mimno, and Andrew McCallum, "Efficient methods for topic model inference on streaming document collections," in Proc. of 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 937-946, June 28-July 1, 2009.
  15. Raymond Wan, Vo Ngoc Anh, and Hiroshi Mamitsuka, "Efficient probabilistic latent semantic analysis through parallelization," in Proc. of 5th Asia Information Retrieval Symposium on Information Retrieval Technology (AIRS), pp. 432-443, October 21-23, 2009.
  16. David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling, "Distributed inference for latent dirichlet allocation," in Proc. of 21st Conference on Advances in Neural Information Processing Systems (NIPS), pp. 1081-1088, December 3-6, 2007.
  17. David Mimno and Andrew McCallum, "Organizing the OCA: learning faceted subjects from a library of digital books," in Proc. of 7th ACM/IEEE-CS joint conference on Digital libraries (JCDL), pp. 376-385, June 18-23, 2007.
  18. Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth, "The author-topic model for authors and documents," in Proc. of 20th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 487-494, July 7-11, 2004.
  19. Yohan Jo and Alice H. Oh, "Aspect and sentiment unification model for online review analysis," in Proc. of 4th ACM international conference on Web search and data mining (WSDM), pp. 815-824, February 9-12, 2011.
  20. Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183-233, November, 1999. https://doi.org/10.1023/A:1007665907178
  21. Yee Whye Teh, David Newman, and Max Welling, "A collapsed variational Bayesian inference algorithm for latent dirichlet allocation," in Proc. of 20th Conference on Advances in Neural Information Processing Systems (NIPS), December 4-9, 2006.