# A Semantic Text Model with a Wikipedia-Based Concept Space

• Kim, Han-Joon (School of Electrical and Computer Engineering, University of Seoul)
• Chang, Jae-Young (Department of Computer Engineering, Hansung University)
• Received : 2014.07.18
• Accepted : 2014.08.19
• Published : 2014.08.31

#### Abstract

Current text mining techniques are limited because conventional text representation models cannot express the semantic or conceptual information in documents written in natural language. Conventional text models, including the vector space model, the Boolean model, the statistical model, and the tensor space model, represent documents as a bag of words. These models express a document only through the literal index terms and frequency-based weights for those terms; that is, they ignore the semantic, sequential-order, and structural information of terms. Most text mining techniques have been developed on the assumption that documents are represented with such bag-of-words models. In the big data era, however, a new text representation paradigm is required that can analyze huge volumes of textual documents more precisely. Our text model treats the 'concept' as an independent space on an equal footing with the 'term' and 'document' spaces used in the vector space model, and it expresses the relatedness among the three spaces. To build the concept space, we use Wikipedia articles, each of which defines a single concept. Consequently, a document collection is represented as a 3-order tensor that carries semantic information; we call the proposed model the text cuboid model. Through experiments on the popular 20 Newsgroups document corpus, we demonstrate the superiority of the proposed text model in terms of document clustering and concept clustering.
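As a rough illustration of the 3-order tensor described in the abstract, the following minimal sketch builds a sparse document × term × concept structure in plain Python. The toy documents, the two stand-in "Wikipedia articles", the function names, and the weighting rule (term frequency in the document multiplied by term frequency in the concept's article) are all assumptions made for this example; the abstract does not specify how the paper computes its weights.

```python
from collections import Counter

# Toy stand-ins for Wikipedia articles; each article defines one concept.
# These texts are hypothetical and exist only for illustration.
concept_articles = {
    "Baseball": "baseball pitcher bat team game",
    "Computer": "computer software hardware program game",
}

# Toy document collection (also hypothetical).
documents = {
    "d1": "the pitcher threw the ball and the team won the game",
    "d2": "the program runs on new computer hardware",
}


def term_counts(text):
    """Count whitespace-separated, lowercased tokens."""
    return Counter(text.lower().split())


def build_cuboid(documents, concept_articles):
    """Return a sparse 3-order tensor as {(doc, term, concept): weight}.

    Assumed weighting: tf(term, doc) * tf(term, concept article).
    A cell exists only when the term occurs in both the document
    and the article that defines the concept.
    """
    concept_tf = {c: term_counts(t) for c, t in concept_articles.items()}
    cuboid = {}
    for doc_id, text in documents.items():
        for term, tf in term_counts(text).items():
            for concept, ctf in concept_tf.items():
                if term in ctf:
                    cuboid[(doc_id, term, concept)] = tf * ctf[term]
    return cuboid


def doc_concept_matrix(cuboid):
    """Collapse the term axis by summing weights per (doc, concept)."""
    matrix = {}
    for (doc_id, _term, concept), weight in cuboid.items():
        key = (doc_id, concept)
        matrix[key] = matrix.get(key, 0) + weight
    return matrix


cuboid = build_cuboid(documents, concept_articles)
doc_concepts = doc_concept_matrix(cuboid)

for key in sorted(doc_concepts):
    print(key, doc_concepts[key])
```

Summing over the term axis gives a document-by-concept view that could feed document clustering, and summing over the document axis instead would give a term-by-concept view for concept clustering. A dictionary keyed by index triples is used here because such a tensor is mostly empty; a dense array would serve equally well at this scale.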

#### Acknowledgement

Supported by: National Research Foundation of Korea
