Automatic Extractive Summarization of Newspaper Articles using Activation Degree of 5W1H

육하원칙 활성화도를 이용한 신문기사 자동추출요약

  • Published : 2004.04.01

Abstract

In newspaper writing, 5W1H information (who, what, when, where, why, how) is the most fundamental element for writing and understanding articles. Building on this relation between newspaper articles and the 5W1H, we propose a summarization method based on the activation degree of 5W1H. To overcome the problems of the lead-based and title-based methods, which are known to be the most effective for newspaper summarization, the method extracts 5W1H information from both the title and the lead sentence so that sufficient information is available. For each sentence, a weight is then computed from several factors: the activation degree of 5W1H, the number of 5W1H categories, and the sentence's length and position. These factors help select more important sentences and thereby improve the readability of the summaries. In an experimental evaluation, the proposed method achieved a precision of 74.7%, outperforming the lead-based method. Overall, the 5W1H approach appears promising for automatic summarization of newspaper articles.

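The 5W1H are the most basic elements in writing a newspaper article and play a key role in grasping its content. Noting that newspaper articles are written on the basis of the 5W1H, this paper proposes a newspaper article summarization method that uses the activation degree of 5W1H. To overcome the problems of the lead-based method and the title-based method, known as the best-performing existing summarization techniques, the proposed method combines information from the title and the lead to secure sufficient lexical information. In particular, by reflecting various factors such as the activation degree of 5W1H, the number of 5W1H categories, sentence length, and sentence position in the sentence importance calculation, it ensures that sentences that contain more important information and are also highly readable are selected for the summary. The proposed method achieved a precision of 74.7%, outperforming the existing lead-based method, and experiments demonstrated that it can be used effectively for automatic summarization of newspaper articles.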
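The sentence weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the category names, the definition of activation as the count of title/lead seed terms found in a sentence, the length bounds, the position decay, and the way the factors are combined. The function names `activation`, `sentence_weight`, and `summarize` are hypothetical.

```python
# Hypothetical 5W1H categories; the paper's extraction rules are not given in the abstract.
CATEGORIES = ("who", "what", "when", "where", "why", "how")


def activation(tokens, seed_terms):
    """Assumed activation: per-category count of seed terms (taken from
    the title and lead sentence) that also occur in this sentence."""
    present = set(tokens)
    return {c: len(present & seed_terms.get(c, set())) for c in CATEGORIES}


def sentence_weight(tokens, index, seed_terms, min_len=3, max_len=40):
    """Illustrative combination of the four factors named in the abstract.
    The weights and thresholds are assumptions, not the paper's values."""
    act = activation(tokens, seed_terms)
    total_activation = sum(act.values())
    n_categories = sum(1 for v in act.values() if v > 0)
    # Penalize sentences that are very short or very long.
    length_factor = 1.0 if min_len <= len(tokens) <= max_len else 0.5
    # Earlier sentences receive a larger position bonus.
    position_bonus = 1.0 / (1 + index)
    return (total_activation + n_categories) * length_factor + position_bonus


def summarize(sentences, seed_terms, k=2):
    """Select the k highest-weighted sentences and return them in document order."""
    tokenized = [s.lower().split() for s in sentences]
    scored = [(sentence_weight(t, i, seed_terms), i) for i, t in enumerate(tokenized)]
    top = sorted(scored, key=lambda p: (-p[0], p[1]))[:k]
    return [sentences[i] for _, i in sorted(top, key=lambda p: p[1])]


if __name__ == "__main__":
    # Toy seed terms, as if extracted from a title and lead sentence.
    seeds = {"who": {"mayor"}, "what": {"bridge"}, "where": {"seoul"}}
    article = [
        "The mayor opened a new bridge in Seoul on Monday",
        "Weather was mild that day",
        "The bridge links two districts of Seoul",
        "Officials thanked everyone",
    ]
    for line in summarize(article, seeds, k=2):
        print(line)
```

Whitespace tokenization is used only to keep the sketch short; Korean text would need morphological analysis before 5W1H terms could be matched.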
