GUI-based HTML2XML Wrapper Using Inductive Reasoning

학습 추론을 이용한 GUI 기반의 HTML2XML 래퍼

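  • 장문성 (School of Computer Science and Engineering, Seoul National University) ;
  • 정재목 (School of Computer Science and Engineering, Seoul National University) ;
  • 최일환 (School of Computer Science and Engineering, Seoul National University) ;
  • 김형주 (School of Computer Science and Engineering, Seoul National University)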
  • Published : 2002.08.01

Abstract

A wrapper is a module that extracts and processes information from a specified data source according to a predefined extraction rule. An HTML-XML wrapper (HTML Wrapper for XML) extracts specific information from HTML web sources and outputs it as an XML document. Because writing extraction rules by hand is repetitive and tedious, it should be possible to create them quickly, easily, and with minimal effort. This paper presents a method that minimizes the work of composing extraction rules by integrating GUI-based inductive learning (training) with the existing scripting approach.
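
To make the idea of a rule-driven wrapper concrete, the following is a minimal sketch and not the paper's actual system. It assumes a very simple rule format in which each field is located by the literal text to its left and right, and it infers those delimiters from values a user has marked (a stand-in for GUI training). The names `PAGE`, `learn_delimiters`, `extract`, and `to_xml` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from os.path import commonprefix  # character-wise longest common prefix

# Hypothetical HTML source containing two records.
PAGE = ("<html><body><p><b>Dune</b> costs <i>9.99</i></p>"
        "<p><b>Emma</b> costs <i>4.50</i></p></body></html>")

def learn_delimiters(page, marked_values):
    """Infer (left, right) delimiters from values the user marked.

    Assumption: each marked value is located at its first occurrence.
    left  = longest text shared immediately before every marked value
    right = longest text shared immediately after every marked value
    """
    befores, afters = [], []
    for value in marked_values:
        i = page.index(value)
        befores.append(page[:i])
        afters.append(page[i + len(value):])
    # Common suffix = reversed common prefix of the reversed strings.
    left = commonprefix([b[::-1] for b in befores])[::-1]
    right = commonprefix(afters)
    return left, right

def extract(page, fields):
    """Apply a rule: fields is a list of (name, left, right) tuples."""
    records, pos = [], 0
    while True:
        record = {}
        for name, left, right in fields:
            start = page.find(left, pos)
            if start < 0:
                return records  # no further record in the page
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return records
            record[name] = page[start:end]
            # Resume at the right delimiter, since it may double as the
            # left delimiter of the next field.
            pos = end
        records.append(record)

def to_xml(records, root_tag="books", record_tag="book"):
    """Serialize the extracted records as an XML string."""
    root = ET.Element(root_tag)
    for record in records:
        node = ET.SubElement(root, record_tag)
        for name, value in record.items():
            ET.SubElement(node, name).text = value
    return ET.tostring(root, encoding="unicode")

# "Training": the user marks two titles and two prices.
rule = [
    ("title",) + learn_delimiters(PAGE, ["Dune", "Emma"]),
    ("price",) + learn_delimiters(PAGE, ["9.99", "4.50"]),
]
xml_out = to_xml(extract(PAGE, rule))
print(xml_out)
```

In this sketch the user never writes delimiters by hand; they are generalized from marked examples, and the resulting rule could still be edited as a script afterwards. A real wrapper would also need to handle malformed HTML, repeated values, and optional fields.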
