Multiple Cause Model-based Topic Extraction and Semantic Kernel Construction from Text Documents

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 31, Issue 5, pp. 595-604, 2004, pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)

다중요인모델에 기반한 텍스트 문서에서의 토픽 추출 및 의미 커널 구축

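Jeong-Ho Chang (School of Computer Science and Engineering, Seoul National University);
Byoung-Tak Zhang (School of Computer Science and Engineering, Seoul National University)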

Published: 2004.05.01

Abstract

Automatic analysis of concepts and semantic relations in text documents enables both efficient acquisition of relevant information and comparison of documents at the concept level rather than the word level. We present a multiple cause model-based approach to text analysis in which latent topics are extracted automatically from a document set and the similarity between documents is measured with semantic kernels constructed from the extracted topics. A document is assumed to be generated by various combinations of underlying topics, and a topic is defined as a set of words that relate to a common theme or at least co-occur frequently within documents. The multiple cause model is represented as a network with a hidden layer, and each topic is identified by the group of words that have high connection weights from a latent node. Learning and inference in such networks generally require approximate probability estimation, and we use the approximation provided by Helmholtz machines. In an experiment on the TDT-2 data set, we extract sets of meaningful words, each containing theme-specific terms. In document retrieval experiments on four text collections, semantic kernels built from the topics extracted by multiple cause models yield statistically significant improvements in average precision over the basic vector space model.
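
The two steps the abstract describes, reading topics off the connection weights and comparing documents through a semantic kernel, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy vocabulary, the hand-set word-by-topic weight matrix `W` (in the paper the weights would be learned by a Helmholtz machine), the helper names, and the kernel form K = W Wᵀ, which is a common construction for latent semantic kernels but is not confirmed by the abstract as the authors' exact definition.

```python
import numpy as np

# Hypothetical toy vocabulary and word-by-topic weights (illustrative values only,
# not taken from the paper). Rows are words, columns are latent topic nodes.
vocab = ["election", "vote", "senate", "virus", "vaccine", "outbreak"]
W = np.array([
    [2.1, 0.0],
    [1.8, 0.1],
    [1.5, 0.0],
    [0.0, 2.3],
    [0.1, 1.9],
    [0.0, 1.4],
])


def top_words(W, vocab, k=3):
    """For each latent topic, list the k words with the largest weights."""
    return [[vocab[i] for i in np.argsort(-W[:, t])[:k]]
            for t in range(W.shape[1])]


def semantic_kernel(W):
    """Word-by-word kernel matrix K = W W^T (assumed form, positive semidefinite)."""
    return W @ W.T


def kernel_cosine(x, y, K):
    """Cosine similarity between term vectors x and y under kernel matrix K."""
    den = np.sqrt((x @ K @ x) * (y @ K @ y))
    return float((x @ K @ y) / den) if den > 0 else 0.0


K = semantic_kernel(W)
identity = np.eye(len(vocab))  # identity kernel = basic vector space model

d1 = np.array([1, 0, 0, 0, 0, 0], dtype=float)  # "election"
d2 = np.array([0, 1, 1, 0, 0, 0], dtype=float)  # "vote", "senate"
d3 = np.array([0, 0, 0, 1, 1, 0], dtype=float)  # "virus", "vaccine"

print(top_words(W, vocab))
print("vector space d1~d2:", kernel_cosine(d1, d2, identity))
print("semantic kernel d1~d2:", kernel_cosine(d1, d2, K))
print("semantic kernel d1~d3:", kernel_cosine(d1, d3, K))
```

In this toy setting `d1` and `d2` share no words, so the basic vector space model scores them as unrelated, while the kernel scores them as similar because their words load on the same latent topic. `d1` and `d3` stay dissimilar under both measures.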
