A Study on Extracting News Contents from News Web Pages

Lee, Yong-Gu

doi:10.3743/KOSIM.2009.26.1.305

Journal of the Korean Society for Information Management (정보관리학회지)

Volume 26 Issue 1 / Pages 305-320 / 2009 / 1013-0799 (pISSN) / 2586-2073 (eISSN)

Korean Society for Information Management (한국정보관리학회)

뉴스 웹 페이지에서 기사 본문 추출에 관한 연구

Lee, Yong-Gu (School of Information Sciences, University of Pittsburgh)

이용구

Published: 2009.03.30

https://doi.org/10.3743/KOSIM.2009.26.1.305

Abstract

News pages provided through the web contain unnecessary information, which lowers the performance and efficiency of news processing systems. This study proposed news content extraction methods based on sentence identification and on the block-level tags of news web pages. To obtain optimal performance, combinations of these methods were applied. Extraction performed well when sentence identification was applied after hyperlink text had been eliminated from the web pages, and the results improved further when this method was combined with the extraction method based on block-level tags. The extraction methods that used sentence identification were effective at raising extraction recall.

웹을 통해 제공되는 뉴스 페이지의 경우 필요한 정보 뿐 아니라 많은 불필요한 정보를 담고 있다. 이러한 불필요한 정보는 뉴스를 처리하는 시스템의 성능 저하와 비효율성을 가져온다. 이 연구에서는 웹 페이지로부터 뉴스 콘텐츠를 추출하기 위해 문장과 블록에 기반한 뉴스 기사 추출 방법을 제시하였다. 또한 이들을 결합하여 최적의 성능을 가져올 수 있는 방안을 모색하였다. 실험 결과, 웹 페이지에 대해 하이퍼링크 텍스트를 제거한 후 문장을 이용한 추출 방법을 적용하였을 때 효과적이었으며, 여기에 블록을 이용한 추출 방법과 결합하였을 때 더 좋은 결과를 가져왔다. 문장을 이용한 추출 방법은 추출 재현율을 높여주는 효과가 있는 것으로 나타났다.
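
The abstract names three ingredients: eliminating hyperlink text, splitting the page on block-level tags, and keeping text identified as sentences. The following is a minimal Python sketch of how these might fit together, not the paper's actual procedure. The tag sets, the sentence pattern (`SENTENCE_RE`), and the names `BlockCollector` and `extract_news_text` are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed subset of block-level tags used to split a page into text blocks.
BLOCK_TAGS = {"p", "div", "td", "tr", "table", "li", "ul", "ol", "br",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}
# Text inside these tags is discarded (hyperlink text, scripts, styles).
SKIP_TAGS = {"a", "script", "style"}
# Hypothetical sentence test: five or more tokens ending in . ! or ?
SENTENCE_RE = re.compile(r"\S+(?:\s+\S+){4,}[.!?][\"')\]]*$")


class BlockCollector(HTMLParser):
    """Collects visible text per block, ignoring text inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished text blocks, in page order
        self._buf = []         # text fragments of the block being built
        self._skip_depth = 0   # greater than 0 while inside a skipped tag

    def _flush(self):
        # Close the current block, normalizing runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_news_text(html):
    """Return the blocks whose remaining text looks like a sentence."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if SENTENCE_RE.search(block)]


SAMPLE_HTML = """<html><body>
<div id="nav"><a href="/">Home</a> | <a href="/world">World</a></div>
<div id="story">
<p>The council approved the new budget on Monday.</p>
<p>Officials said the plan takes effect next year. <a href="/more">Related stories</a></p>
</div>
<div id="footer">Copyright 2009 Example News</div>
</body></html>"""

if __name__ == "__main__":
    for block in extract_news_text(SAMPLE_HTML):
        print(block)
```

In this sketch, navigation and footer blocks drop out because little or no sentence-like text remains once anchor text is removed. A real news site would likely need a more careful sentence test, particularly for headlines, captions, and links embedded in the article body.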

Cited by

Text Extraction Algorithm using the HTML Logical Structure Analysis vol.16, pp.3, 2015, https://doi.org/10.9728/dcs.2015.16.3.445
