Mapping Categories of Heterogeneous Sources Using Text Analytics

A Methodology for Multiple Mapping of Heterogeneous Media Categories through Text Analysis

  • Kim, Dasom (Graduate School of Business IT, Kookmin University) ;
  • Kim, Namgyu (School of Management Information Systems, Kookmin University)
  • Kim, Dasom (Graduate School of Business IT, Kookmin University) ;
  • Kim, Namgyu (Graduate School of Business IT, Kookmin University)
  • Received : 2016.08.17
  • Accepted : 2016.12.28
  • Published : 2016.12.31


In recent years, the proliferation of social networking services has led users to use many media simultaneously, depending on their purposes and tastes. When collecting information on a particular theme, they typically draw on several media at once, such as social networking services, Internet news, and blogs. From a management standpoint, however, each source places its documents into categories according to its own policies and standards, which hinders any attempt to study a specific category across different kinds of sources. For example, documents about "an application for foreign travel" may be classified under "Information Technology," "Travel," or "Life and Culture," depending on each source's standard. Likewise, because sources differ in how they define categories and how finely they specify them, similar categories can be named and structured differently from source to source. To overcome these limitations, this study proposes a method for mapping categories between heterogeneous sources while leaving each medium's existing category system unchanged. Specifically, by re-classifying individual documents from the viewpoint of the other sources and storing the results as extra attributes, the study provides a logical layer through which users can search documents from multiple heterogeneous sources with different category names as if they belonged to a single source.
To evaluate the method, 6,000 news articles were collected from two Internet news portals, and experiments compared classification accuracy across sources, between supervised and semi-supervised learning, and between homogeneous and heterogeneous training data. Notably, in some categories, semi-supervised learning with heterogeneous training data achieved higher accuracy than supervised learning and semi-supervised learning that used homogeneous training data. This study makes the following contributions. First, it proposes a logical scheme for integrating and managing heterogeneous media with different classification systems while leaving each existing physical classification system intact. Second, the results show that classification accuracy varies considerably with the heterogeneity of the training data; this is expected to spur further studies that improve the proposed methodology by analyzing the characteristics of each category. Third, as demand grows for searching, collecting, and analyzing documents from diverse media, Internet search is no longer confined to a single medium. Because each medium has its own category structure and names, however, searching for a specific category across heterogeneous media is very difficult in practice. The proposed methodology addresses this difficulty: once users select a desired site, they can query all documents according to that site's category system, while each existing site keeps its own characteristics and structure.
The proposed methodology needs to be complemented in the following respects. First, its performance was compared and evaluated only indirectly, so future studies should test its accuracy more directly. That is, after the documents of a target source are re-classified according to the category system of an existing source, the accuracy of that classification should be verified through evaluation by actual users. Second, the methodology should be refined to raise classification accuracy. Finally, the characteristics of the categories in which heterogeneous semi-supervised learning outperformed supervised learning need to be understood; such an understanding could help in obtaining heterogeneous documents from diverse media and in finding ways to use them to improve document classification accuracy.
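
The logical layer described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a classifier, so a small multinomial Naive Bayes model is swapped in here as an assumption, and the source names (`portal_a`, `portal_b`), the toy documents, and the helper functions are all hypothetical. Only the supervised case is sketched. One classifier is trained per source on that source's own categories, each document is then re-classified under the other source's scheme, and the result is stored in an extra `mapped` attribute while the native category is left untouched.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus: two sources, each with its own category names.
CORPUS = [
    {"source": "portal_a", "category": "IT", "text": "new smartphone app update released"},
    {"source": "portal_a", "category": "IT", "text": "software security patch for mobile devices"},
    {"source": "portal_a", "category": "IT", "text": "app for hotel and trip booking"},
    {"source": "portal_a", "category": "Travel", "text": "cheap flights and hotel deals for summer trip"},
    {"source": "portal_a", "category": "Travel", "text": "best beaches for an overseas trip"},
    {"source": "portal_b", "category": "Life and Culture", "text": "festival food and trip ideas for the weekend"},
    {"source": "portal_b", "category": "Life and Culture", "text": "hotel review and museum tour for families"},
    {"source": "portal_b", "category": "Science", "text": "chip maker unveils faster mobile processor"},
    {"source": "portal_b", "category": "Science", "text": "researchers release software for mobile security"},
]


def train_per_source(corpus):
    """Train one classifier per source, using that source's native categories."""
    by_source = defaultdict(list)
    for doc in corpus:
        by_source[doc["source"]].append(doc)
    return {
        source: NaiveBayes().fit([d["text"] for d in docs], [d["category"] for d in docs])
        for source, docs in by_source.items()
    }


def annotate(corpus, classifiers):
    """Store each document's category under every other source's scheme."""
    for doc in corpus:
        doc["mapped"] = {
            source: clf.predict(doc["text"])
            for source, clf in classifiers.items()
            if source != doc["source"]
        }


def search(corpus, scheme, category):
    """Return all documents that fall under `category` in `scheme`'s category system."""
    hits = []
    for doc in corpus:
        if doc["source"] == scheme:
            label = doc["category"]  # native category, never modified
        else:
            label = doc["mapped"].get(scheme)  # category assigned by re-classification
        if label == category:
            hits.append(doc)
    return hits


classifiers = train_per_source(CORPUS)
annotate(CORPUS, classifiers)
for doc in search(CORPUS, "portal_b", "Life and Culture"):
    print(doc["source"], "|", doc["category"], "|", doc["text"])
```

In this sketch a query is always expressed in one source's category system; documents from that source match on their native category, and documents from the other source match on the stored attribute. Replacing `NaiveBayes` with a semi-supervised learner, or training it on documents from a different source, would correspond to the other experimental settings the abstract mentions.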


Supported by: National Research Foundation of Korea

