Proceedings of the Korea Association of Information Systems Conference (한국정보시스템학회:학술대회논문집)
- 2005.05a / Pages 79-92 / 2005
Web Information Extraction using HTML Tag Pattern
Korean title: HTML 태그페턴을 이용한 웹정보추출시스템 (A Web Information Extraction System Using HTML Tag Patterns)
Abstract
To query the vast number of web pages available on the Internet, the information encoded in those pages must first be extracted and converted into structured data (e.g., relational data for SQL) or semistructured data (e.g., XML data for XQuery). In this paper, we propose a new web information extraction system, PIES, which converts web information into XML documents. PIES is driven by a user-specified target schema and HTML tag pattern descriptions: information is extracted according to the pattern descriptions and then validated against the target schema. We designed a new language for describing extraction rules and a new form of regular expression for describing HTML tag patterns. We implemented PIES and applied it to the US patent web site to evaluate its correctness. It successfully extracted thousands of US patent records and converted them into XML documents.
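The extract-then-validate flow described in the abstract can be illustrated with a minimal Python sketch. It does not reproduce PIES, its rule language, or its tag-pattern regular expressions, none of which are given here. The sample HTML, the field names, the `row_pattern` helper, and the ordered-field "schema" are all hypothetical stand-ins, and ordinary Python regular expressions take the place of the paper's tag-pattern notation.

```python
import re
import xml.etree.ElementTree as ET

# Made-up page fragment; a real patent page has a different layout.
SAMPLE_HTML = """
<table>
  <tr><th>Patent No.</th><td>1,234,567</td></tr>
  <tr><th>Title</th><td>Widget with <i>improved</i> handle</td></tr>
  <tr><th>Inventor</th><td>Doe; Jane</td></tr>
</table>
"""


def row_pattern(label):
    """Hypothetical tag pattern: a <th> holding `label`, then a <td> value."""
    return re.compile(
        r"<th[^>]*>\s*" + re.escape(label) + r"\s*</th>\s*<td[^>]*>(.*?)</td>",
        re.IGNORECASE | re.DOTALL,
    )


# One tag pattern per target field (assumed field names).
TAG_PATTERNS = {
    "number": row_pattern("Patent No."),
    "title": row_pattern("Title"),
    "inventor": row_pattern("Inventor"),
}

# Stand-in for a target schema: required child elements, in order.
TARGET_SCHEMA = ("number", "title", "inventor")


def strip_tags(fragment):
    """Remove inline markup and collapse whitespace in a captured value."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", "", fragment)).strip()


def extract(html):
    """Apply each tag pattern and build an XML record from the matches."""
    record = ET.Element("patent")
    for field, pattern in TAG_PATTERNS.items():
        match = pattern.search(html)
        if match:
            ET.SubElement(record, field).text = strip_tags(match.group(1))
    return record


def validate(record):
    """Accept the record only if every schema field is present and non-empty."""
    tags = [child.tag for child in record]
    return tags == list(TARGET_SCHEMA) and all(child.text for child in record)


record = extract(SAMPLE_HTML)
xml_text = ET.tostring(record, encoding="unicode")
```

Keeping extraction and validation as separate steps mirrors the division of labor in the abstract: the patterns decide what is pulled from the page, while the schema decides whether the result is acceptable as an XML document.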
Keywords