Multiple Cause Model-based Topic Extraction and Semantic Kernel Construction from Text Documents

Topic Extraction and Semantic Kernel Construction from Text Documents Based on a Multiple-Cause Model

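  • Jeong-Ho Chang (School of Computer Science and Engineering, Seoul National University)
  • Byoung-Tak Zhang (School of Computer Science and Engineering, Seoul National University)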
  • Published : 2004.05.01

Abstract

Automatic analysis of concepts or semantic relations in text documents enables not only efficient acquisition of relevant information, but also comparison of documents at the concept level. We present a multiple-cause model-based approach to text analysis, in which latent topics are automatically extracted from document sets and similarity between documents is measured by semantic kernels constructed from the extracted topics. In our approach, a document is assumed to be generated by various combinations of underlying topics. A topic is defined as a set of words that are related to a common theme or co-occur frequently within a document. In a network representing a multiple-cause model, each topic is identified by a group of words having high connection weights from a latent node. Learning and inference in multiple-cause models require approximation methods, and we use an approximation by Helmholtz machines. In an experiment on the TDT-2 data set, we extract sets of meaningful words, where each set contains theme-specific terms. Using semantic kernels constructed from latent topics extracted by multiple-cause models, we also achieve significant improvements over the basic vector space model in terms of retrieval effectiveness.

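Automatic analysis of concepts or semantic relations within a document collection enables more efficient information acquisition and comparison of documents at a conceptual level beyond individual words. This paper presents a method, based on a multiple-cause model, that extracts topics from text documents and builds a semantic kernel from them to measure similarity between documents. A text document is assumed to be generated by various combinations of latent topics, and a topic is defined as a set of words that relate to a common theme or at least frequently appear together. The multiple-cause model is represented as a network with a hidden layer, and the word set representing a topic consists of the words with high weights from a hidden node. Facilitating learning and inference in such multiple-cause networks generally requires approximate probability estimation, for which this paper uses the Helmholtz machine. In experiments on the TDT-2 document collection, sets of topically related words were extracted, and in document retrieval experiments on four text collections, using a semantic kernel based on the multiple-cause model analysis yielded a statistically significant improvement in average precision over the basic vector space model.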
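The two ideas in the abstract, reading a topic off the high-weight connections of a latent node and comparing documents through a topic-based kernel, can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the vocabulary, the weight values, the helper names, and in particular the kernel form (cosine similarity after projecting term vectors onto the topic weights), which the abstract does not specify. The weights are hard-coded rather than learned by a Helmholtz machine.

```python
import math

# Hypothetical vocabulary and latent-node-to-word weights (toy values, not from the paper).
VOCAB = ["election", "vote", "senate", "storm", "flood", "rain"]
TOPIC_WEIGHTS = [
    [2.1, 1.8, 1.5, 0.0, 0.1, 0.0],  # latent node 0
    [0.0, 0.1, 0.0, 2.3, 1.9, 1.6],  # latent node 1
]

def top_words(weights, vocab, k=3):
    """Return the k words with the highest connection weight from one latent node."""
    ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

def project(doc, topic_weights):
    """Map a term-frequency vector into topic space (one dot product per topic)."""
    return [sum(w * x for w, x in zip(row, doc)) for row in topic_weights]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def semantic_similarity(d1, d2, topic_weights):
    """Assumed kernel form: cosine similarity of the topic-space projections."""
    return cosine(project(d1, topic_weights), project(d2, topic_weights))

doc_a = [1, 0, 1, 0, 0, 0]  # contains "election" and "senate"
doc_b = [0, 2, 0, 0, 0, 0]  # contains only "vote"
doc_c = [0, 0, 0, 1, 1, 0]  # contains "storm" and "flood"

print(top_words(TOPIC_WEIGHTS[0], VOCAB))
print(cosine(doc_a, doc_b))                               # plain vector space model
print(semantic_similarity(doc_a, doc_b, TOPIC_WEIGHTS))   # topic-based comparison
print(semantic_similarity(doc_a, doc_c, TOPIC_WEIGHTS))
```

In this toy setting, `doc_a` and `doc_b` share no terms, so the basic vector space model treats them as unrelated, while the topic-space comparison rates them as similar because both load on the same latent node. `doc_c` loads on the other node and stays dissimilar to `doc_a`.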
