Subject-Balanced Intelligent Text Summarization Scheme

주제 균형 지능형 텍스트 요약 기법

  • Yun, Yeoil (College of Business Administration, Kookmin University) ;
  • Ko, Eunjung (Graduate School of Business IT, Kookmin University) ;
  • Kim, Namgyu (College of Business Administration, Kookmin University)
  • 윤여일 (국민대학교 경영대학) ;
  • 고은정 (국민대학교 비즈니스IT전문대학원) ;
  • 김남규 (국민대학교 경영대학)
  • Received : 2019.01.03
  • Accepted : 2019.05.06
  • Published : 2019.06.30

Abstract

Recently, channels such as social media and social networking services generate enormous amounts of data, and the share of unstructured data represented as text has grown geometrically. Because it is difficult to read through all of this text, it is important to access it quickly and grasp its key points. Out of this need for efficient understanding, many studies on text summarization have been proposed for handling and exploiting tremendous amounts of text data. In particular, many recent methods use machine learning and artificial intelligence algorithms to generate summaries objectively and efficiently; this is called "automatic summarization." However, most text summarization methods proposed to date compose the summary according to the frequency of content in the original documents. Such summaries rarely contain low-weight subjects, that is, subjects mentioned only infrequently in the original text. When a summary includes only the major subjects, bias arises and information is lost, so not every subject the documents contain can be ascertained. This bias can be reduced by summarizing with respect to the balance among the document's topics so that every subject is represented, but an unbalanced distribution across those subjects still remains. To retain the balance of subjects in the summary, it is necessary to consider the proportion of every subject in the original documents and to allocate portions to the subjects equally, so that even sentences on minor subjects are sufficiently included. In this study, we propose a "subject-balanced" text summarization method that secures balance among all subjects and minimizes the omission of low-frequency subjects. Subject-balanced summarization rests on two summary-evaluation criteria: "completeness," meaning that the summary should fully cover the content of the original documents, and "succinctness," meaning that the summary should contain minimal internal duplication. The proposed method consists of three phases. The first phase constructs subject-term dictionaries. Topic modeling is used to compute topic-term weights, which indicate how strongly each term relates to each topic. From these weights, the terms most relevant to each topic can be identified, and the subjects of the documents emerge from topics composed of semantically similar terms. A few terms that represent each subject well are then selected; we call these "seed terms." Because the seed terms alone are too few to characterize a subject, sufficiently many terms similar to them are needed for a well-constructed subject dictionary. Word2Vec is used for this word expansion: after training, every term has a word vector, and the cosine similarity between two vectors measures how closely the two terms are related, with higher similarity indicating a stronger relationship. Terms with high similarity to each subject's seed terms are therefore selected, and after filtering these expanded terms, the subject dictionary is finally constructed.
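To make the dictionary-construction phase concrete, the following is a minimal sketch using gensim's Word2Vec. The seed terms, hyperparameters, and the cutoff top_n are illustrative assumptions, not the settings used in the paper.

    from gensim.models import Word2Vec

    def build_subject_dictionaries(tokenized_sentences, seed_terms, top_n=30):
        # Train word vectors on the review corpus; hyperparameters are illustrative.
        model = Word2Vec(sentences=tokenized_sentences, vector_size=100,
                         window=5, min_count=5, workers=4)
        dictionaries = {}
        for subject, seeds in seed_terms.items():
            expanded = set(seeds)
            for seed in seeds:
                if seed not in model.wv:
                    continue  # seed never reached min_count in the corpus
                # most_similar ranks terms by cosine similarity to the seed vector
                for term, _sim in model.wv.most_similar(seed, topn=top_n):
                    expanded.add(term)
            dictionaries[subject] = expanded
        return dictionaries

    # Hypothetical seed terms for the four subjects in the paper's hotel-review example
    seed_terms = {"room": ["room", "bed"], "food": ["breakfast", "restaurant"],
                  "service": ["staff", "service"], "location": ["location", "subway"]}

In the paper the expanded candidates are additionally filtered before each dictionary is finalized; this sketch simply keeps the top-n neighbors of every seed term.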
The second phase allocates a subject to every sentence of the original documents. To grasp the content of each sentence, a frequency analysis is first conducted over the terms that compose the subject dictionaries, and a TF-IDF weight is computed for each subject in each sentence, indicating how much the sentence discusses that subject. Because a TF-IDF weight can grow without bound, the weights of each subject across sentences are normalized to values between 0 and 1. Each sentence is then assigned to the subject with the maximum normalized TF-IDF weight, which finally yields one sentence group per subject. The last phase generates the summary. Sen2Vec is used to measure the similarity between the sentences of each subject, forming a similarity matrix, and by repeatedly selecting sentences it is possible to generate a summary that covers the content of the original documents fully while minimizing duplication within the summary itself. To evaluate the proposed method, 50,000 TripAdvisor reviews were used to construct the subject dictionaries and 23,087 reviews were used to generate summaries. A comparison between the proposed summaries and frequency-based summaries verified that the proposed method better retains the balance of subjects that the documents originally have.
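As an illustration of the second phase, here is a minimal sketch of sentence-to-subject allocation. The TF-IDF variant (sentence-level term frequency with a smoothed subject-level document frequency) and the min-max normalization are simplified stand-ins for the paper's exact weighting.

    import math
    from collections import defaultdict

    def allocate_subjects(tokenized_sentences, dictionaries):
        n = len(tokenized_sentences)
        # Subject-level document frequency: sentences mentioning any dictionary term
        df = {s: sum(1 for toks in tokenized_sentences if set(toks) & terms)
              for s, terms in dictionaries.items()}
        # Raw TF-IDF-style weight of every subject in every sentence
        weights = []
        for toks in tokenized_sentences:
            row = {}
            for s, terms in dictionaries.items():
                tf = sum(1 for t in toks if t in terms)
                row[s] = tf * (math.log((1 + n) / (1 + df[s])) + 1)
            weights.append(row)
        # Min-max normalize each subject's weights across sentences to [0, 1]
        for s in dictionaries:
            vals = [row[s] for row in weights]
            lo, hi = min(vals), max(vals)
            for row in weights:
                row[s] = (row[s] - lo) / (hi - lo) if hi > lo else 0.0
        # Assign each sentence to its highest-weighted subject
        groups = defaultdict(list)
        for i, row in enumerate(weights):
            best = max(row, key=row.get)
            if row[best] > 0:
                groups[best].append(i)
        return groups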

Research on document summarization has recently been active as a way to efficiently manage and exploit the enormous volume of text data generated through various channels. In particular, a variety of automatic summarization techniques that use machine learning and artificial intelligence to derive summaries objectively and efficiently have been devised. However, most automatic text summarization techniques proposed to date compose the summary according to the distribution of content in the original text, and this approach has the limitation that content on low-weight subjects, that is, subjects mentioned infrequently in the original, is unlikely to be included in the summary. To overcome this limitation, this paper proposes an automatic document summarization technique that minimizes the omission of low-frequency subjects. Specifically, the proposed method (i) identifies the various subjects contained in the original text, selects representative terms for each subject, and builds subject-term dictionaries through word embedding; (ii) measures the degree to which each sentence of the original corresponds to each subject; (iii) partitions the sentences by subject and computes the similarity among the sentences of each subject; and (iv) generates the summary automatically so that duplication within the summary is minimized while as much of the original's diverse content as possible is included. To evaluate the proposed methodology, term dictionaries were built from 50,000 TripAdvisor reviews, summarization experiments were conducted on 23,087 reviews, and the subject distributions of the resulting summaries were compared with those of conventional frequency-based summaries. The results confirm that the proposed automatic summarization yields summaries that preserve the balance among the subjects of the original text.
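Step (iv), summary generation, can be sketched as a greedy selection over sentence vectors. The paper derives sentence vectors with Sen2Vec and repeatedly selects sentences from a similarity matrix; the sketch below instead averages word vectors into sentence vectors and uses an MMR-style score as a stand-in for the paper's selection rule, with a hypothetical trade-off weight lam.

    import numpy as np

    def sentence_vector(tokens, wv):
        # Average the available word vectors; zero vector if none are known
        vecs = [wv[t] for t in tokens if t in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def select_summary(sent_vecs, k=3, lam=0.7):
        # Greedy selection: reward similarity to the subject group's centroid
        # (completeness), penalize similarity to already-chosen sentences
        # (succinctness); lam trades the two off and is a hypothetical setting.
        centroid = sent_vecs.mean(axis=0)
        selected, remaining = [], list(range(len(sent_vecs)))
        while remaining and len(selected) < k:
            def score(i):
                coverage = cosine(sent_vecs[i], centroid)
                redundancy = max((cosine(sent_vecs[i], sent_vecs[j])
                                  for j in selected), default=0.0)
                return lam * coverage - (1 - lam) * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected  # indices of summary sentences, in selection order

Running select_summary once per subject group, with k set from each subject's share of the original text, yields the subject-balanced summary the abstract describes.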

Keywords

[Figure 1] Comparison of Various Summarization
[Figure 2] Subject Dictionary Construction and Review Summary Generation
[Figure 3] Research Overview
[Figure 4] Discovering Subjects and Seed Terms
[Figure 5] Word Embedding with Word2Vec
[Figure 6] Similar Term Expansion and Subject Dictionary Construction
[Figure 7] TF-IDF Value of Each Subject
[Figure 8] Normalized Sentence TF-IDF for Subject “Room” and “Service”
[Figure 9] Calculating Sentence Vectors from Word Vectors
[Figure 10] Document Similarity Matrix
[Figure 11] Process of Summary Generation
[Figure 12] Topic Keywords for 20 Topics
[Figure 13] Seed Term Candidates
[Figure 14] Seed Terms of Each Subject
[Figure 15] Term Expansion by Word2Vec
[Figure 16] Subject Dictionaries for “Room”, “Food”, “Service”, and “Location”
[Figure 17] Normalized Sentence Weight for Each Subject
[Figure 18] Sentence Set in Subject “Room”
[Figure 19] Comparison of Subject Distribution
[Figure 20] Summary Report of Subject “Location” for Hotel “E”
[Table 1] The Number of Sentences in Summary (Our Approach)
[Table 2] The Number of Sentences in Summary (Traditional Approach)
