Web Information Extraction and Multidimensional Analysis Using XML

XML을 이용한 웹 정보 추출 및 다차원 분석

  • Byung-Kwon Park (Division of Management Information Science, Dong-A University)
  • Received : 2007.06.13
  • Accepted : 2008.04.02
  • Published : 2008.05.31

Abstract

To analyze the huge number of web pages available on the Internet, we need to extract the information encoded in them. In this paper, we propose a method for extracting information from web pages and converting it into XML documents for multidimensional analysis. For the extraction step, we propose two languages: one for describing web information extraction rules based on the object-oriented model, and another for describing regular expressions over HTML tag patterns that locate the target information. For multidimensional analysis of the resulting XML documents, we propose a method for constructing an XML warehouse and deriving various XML cubes from it, in the same way as is done for relational data. Finally, we demonstrate the validity of our method by applying it to US patent web pages.
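
As a rough illustration of the pipeline summarized above, the following Python sketch matches an HTML tag pattern with a regular expression, converts the matches into an XML element, and counts documents along one dimension. It is not the rule language or the pattern language proposed in the paper, neither of which is shown here. The sample HTML, the pattern ROW, and the functions extract_to_xml and count_by are all hypothetical names made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical HTML fragment, made up for illustration (not a real patent page).
HTML = """
<table>
<tr><th>Title</th><td>Widget fastener</td></tr>
<tr><th>Assignee</th><td>Acme Corp</td></tr>
<tr><th>Filed</th><td>2005</td></tr>
</table>
"""

# Assumed tag pattern: a <th> label followed by a <td> value in the same row.
ROW = re.compile(r"<tr>\s*<th>(.*?)</th>\s*<td>(.*?)</td>\s*</tr>", re.S | re.I)


def extract_to_xml(html, root_tag="patent"):
    """Turn each matched label/value row into a child element of one XML record.

    Assumes every label is already a valid XML element name once lowercased.
    """
    root = ET.Element(root_tag)
    for label, value in ROW.findall(html):
        child = ET.SubElement(root, label.strip().lower())
        child.text = value.strip()
    return root


def count_by(docs, dimension):
    """Count XML records per value of one dimension (a one-dimensional aggregate)."""
    return Counter(doc.findtext(dimension) for doc in docs)


if __name__ == "__main__":
    doc = extract_to_xml(HTML)
    print(ET.tostring(doc, encoding="unicode"))
    print(count_by([doc], "filed"))
```

Real web pages are rarely this regular, so a production extractor would more likely combine an HTML parser with such patterns. The regular expression is used here only because the abstract describes tag patterns in those terms.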

Acknowledgement

Supported by : Dong-A University