Investigation on the Effect of Multi-Vector Document Embedding for Interdisciplinary Knowledge Representation

  • 박종인 (Graduate School of Business IT, Kookmin University)
  • 김남규 (School of Management Information Systems, Kookmin University)
  • Received : 2020.01.18
  • Accepted : 2020.02.25
  • Published : 2020.03.31

Abstract

Text is the most widely used means of exchanging and expressing knowledge and information in the real world. Recently, research on structuring unstructured text data for text analysis has been actively pursued. Doc2Vec, one of the most representative document embedding methods, generates a single vector for each document from all of the words in that document. As a result, the document vector is influenced not only by core words but also by miscellaneous ones. In addition, traditional document embedding algorithms map each document to only one vector, so it is difficult for them to properly represent a complex document that covers interdisciplinary subjects. In this paper, we introduce a multi-vector document embedding method that overcomes these limitations of traditional document embedding. After reviewing the previous study on multi-vector document embedding, we visually analyze the effects of the method. The method differs from the traditional approach in two ways. First, it vectorizes a document using only predefined keywords rather than all of its words. Second, it decomposes the various subjects included in a document and generates multiple vectors for each document. Experiments on about three thousand academic papers revealed that the traditional single-vector approach cannot properly map complex documents because the subjects interfere with one another within each vector. With the multi-vector method, we confirmed that the information and knowledge in complex documents can be represented more accurately by eliminating this interference among subjects.
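The contrast between a single document vector and multiple per-subject vectors can be illustrated with a toy sketch. This is not the authors' implementation: the two-dimensional keyword vectors, the keyword-to-subject mapping, and all function names below are hypothetical, and simple averaging stands in for whatever embedding and decomposition steps the actual method uses.

```python
import math

# Hypothetical 2-D keyword embeddings; real ones would come from a trained model.
KEYWORD_VECTORS = {
    "neural network": [1.0, 0.1],
    "deep learning": [0.9, 0.2],
    "supply chain": [0.1, 1.0],
    "inventory": [0.2, 0.9],
}

# Hypothetical assignment of each keyword to a subject (an assumption for this sketch).
KEYWORD_SUBJECT = {
    "neural network": "ai",
    "deep learning": "ai",
    "supply chain": "logistics",
    "inventory": "logistics",
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def single_vector(keywords):
    """Single-vector view: one vector per document, averaged over all keywords."""
    return mean_vector([KEYWORD_VECTORS[k] for k in keywords])


def multi_vectors(keywords):
    """Multi-vector view: one averaged vector per subject present in the document."""
    groups = {}
    for k in keywords:
        groups.setdefault(KEYWORD_SUBJECT[k], []).append(KEYWORD_VECTORS[k])
    return {subject: mean_vector(vs) for subject, vs in groups.items()}


# A toy interdisciplinary document described by its keywords.
doc = ["neural network", "deep learning", "supply chain", "inventory"]
single = single_vector(doc)
multi = multi_vectors(doc)

# Compare both representations against a pure "ai" direction.
ai_axis = [1.0, 0.0]
print("single vector vs. ai axis:", round(cosine(single, ai_axis), 3))
print("ai sub-vector vs. ai axis:", round(cosine(multi["ai"], ai_axis), 3))
```

In this toy setting the single averaged vector sits between the two subjects, while each per-subject vector stays close to its own subject, which is the kind of interference the abstract describes.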
