Text Extraction Algorithm using the HTML Logical Structure Analysis

Jeon, Hyun-Gee; KOH, Chan

doi:10.9728/dcs.2015.16.3.445

Journal of Digital Contents Society (디지털콘텐츠학회 논문지)

Volume 16 Issue 3 / Pages 445-455 / 2015 / 1598-2009 (pISSN) / 2287-738X (eISSN)

Digital Contents Society (한국디지털콘텐츠학회)


HTML 논리적 구조분석을 통한 본문추출 알고리즘

Jeon, Hyun-Gee (Seoul National University of Science & Technology); KOH, Chan (Seoul National University of Science & Technology)

전현지; 고찬

Received : 2015.03.15
Accepted : 2015.06.30
Published : 2015.06.30

https://doi.org/10.9728/dcs.2015.16.3.445

Abstract

As Internet and computer technology have advanced, the amount of information has grown explosively. With the emergence of diverse web authoring tools and new web standards, and with more convenient access to the web, a very wide variety of web content is being produced very quickly. However, a web document is typically divided into several blocks that cover various topics. These blocks often deal with topics unrelated to one another, and some of them cannot be regarded as content at all, such as navigation, simple decorations, advertisements, and copyright notices. To address this problem, we extract only the exact body-text area of an HTML web document, so that user requirements are met and information can be learned effectively. We also propose a method for restructuring the result into an optimized web search system that can later manage documents systematically.
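The abstract does not spell out the algorithm itself, so the following Python sketch is only an illustration of the general idea of body-text extraction, not the authors' method. It assumes a simple heuristic: split the page into block-level containers, ignore elements that are usually boilerplate, and keep the block with the most non-link text. The names `BlockCollector`, `extract_main_text`, `SKIP_TAGS`, `BLOCK_TAGS`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold boilerplate rather than body text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Assumption: these tags delimit the candidate blocks of a page.
BLOCK_TAGS = {"body", "div", "td", "article", "section", "main"}


class BlockCollector(HTMLParser):
    """Collects the text of each block, counting how much of it is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # one dict per block, in document order
        self.open_blocks = []  # stack of indices into self.blocks
        self.skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.blocks.append({"text": [], "link_chars": 0})
            self.open_blocks.append(len(self.blocks) - 1)
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.open_blocks:
            self.open_blocks.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text or self.skip_depth or not self.open_blocks:
            return
        # Text is credited to the innermost open block only.
        block = self.blocks[self.open_blocks[-1]]
        block["text"].append(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def extract_main_text(html):
    """Return the text of the block with the most non-link characters."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    best, best_score = "", 0
    for block in parser.blocks:
        text = " ".join(block["text"])
        score = len(text) - block["link_chars"]
        if score > best_score:
            best, best_score = text, score
    return best


SAMPLE = """
<html><body>
<div id="menu"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div id="content"><p>Web documents are divided into blocks.</p>
<p>Only some of those blocks carry the main text.</p></div>
<div id="ad"><a href="/buy">Buy now</a></div>
<footer>Copyright 2015</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

On the sample page, the menu and advertisement blocks score low because almost all of their text is link text, and the footer is skipped outright, so the content block is selected. A real page would need a more careful analysis of the HTML logical structure than this single score provides.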



