• Title/Summary/Keyword: HTML Structure

An Efficient Method for Logical Structure Analysis of HTML Tables (HTML 테이블의 논리적 구조분석을 위한 효율적인 방법)

  • Kim Yeon-Seok; Lee Kyong-Ho
    • Journal of Korea Multimedia Society, v.9 no.9, pp.1231-1246, 2006
  • HTML is a format for rendering Web documents visually and uses tables to present relational information. Because HTML is limited as a format for machine processing and management of information, it is important to transform HTML tables into XML documents, which can represent logical structure. As a prerequisite for extracting information from the Web, this paper presents an efficient method for extracting logical structures from HTML tables and transforming them into XML documents. The proposed method consists of two phases: area segmentation and structure analysis. The area segmentation phase removes noisy areas and extracts attribute and value areas by checking visual and semantic coherency. The structure analysis phase then analyzes the hierarchical structure between attribute and value areas and transforms it into an XML representation using a proposed table model. Experimental results on 1,180 HTML tables show that the proposed method outperforms the conventional method, with an average precision of 86.7%.

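
To make the table-to-XML idea in the abstract above concrete, here is a minimal Python sketch. It is not the authors' method: it skips area segmentation and the proposed table model entirely and simply assumes that the first row of a flat table is the attribute area and every later row is a value area. The names `TableRows`, `table_to_xml`, `record`, and `field` are hypothetical, and nested tables or spanning cells are not handled.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows, each a list of cell strings
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_xml(html):
    """Assumption: row 0 holds attribute names, later rows hold values."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    root = ET.Element("table")
    if not parser.rows:
        return ET.tostring(root, encoding="unicode")
    header, *records = parser.rows
    for record in records:
        item = ET.SubElement(root, "record")
        for name, value in zip(header, record):
            field = ET.SubElement(item, "field", name=name)
            field.text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `table_to_xml` on a two-column price table yields one `record` element per data row, with each value labeled by its column header. Real Web tables rarely satisfy the first-row assumption, which is the gap the paper's segmentation phase addresses.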

Design and Implementation of an HTML Pages Modification Detector for Meta-search Engines (메타 검색엔진을 위한 HTML 문서 변경 탐지기의 설계 및 구현)

  • Park, Sang-Wi; O, Jeong-Seok; Lee, Sang-Ho
    • The KIPS Transactions:PartD, v.9D no.3, pp.345-354, 2002
  • HTML pages on the Web change at any time. Such changes can degrade the functionality of meta-search engines, which provide users with the integrated results of several search engines. To solve this problem, we propose an HTML page modification detector. It utilizes information on element positions in HTML pages together with the modified Jaak Vilo algorithm. The detector uses patterns that represent the structure of HTML expressions occurring repeatedly in HTML pages. An experiment is carried out to verify the correctness of the modification detector.
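
As a rough illustration of structure-based change detection, the Python sketch below compares the sequence of opening tags in two versions of a page. This is a deliberately simpler stand-in and not the modified Jaak Vilo algorithm or the element-position patterns described in the abstract. The function names and the `threshold` value of 0.9 are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Record opening tags in document order, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def structure_changed(old_html, new_html, threshold=0.9):
    """Return True when the tag sequences are less similar than threshold.

    The threshold is a hypothetical tuning parameter, not a value
    taken from the paper.
    """
    matcher = SequenceMatcher(None, tag_sequence(old_html), tag_sequence(new_html))
    return matcher.ratio() < threshold
```

With this sketch, a result page whose text changes but whose markup stays the same is reported as unchanged, while a page that switches from a list layout to a table layout is reported as changed. That distinction is what matters to a meta-search engine's result extractor.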

An Approach to Structuralizing Business Information for Internet Shopping Malls (인터넷쇼핑몰의 사업자신원정보 구조화 방안)

  • 장용식
    • Journal of Intelligence and Information Systems, v.10 no.1, pp.27-45, 2004
  • As online shopping grows, the "Consumer Protection Law in Electronic Commerce" obliges each Internet shopping mall to disclose its business information. Although most Internet shopping malls provide this information in a semi-structured format at the bottom of their homepages, the attributes and forms of expression differ from one mall to another. This makes it difficult for consumers to identify the business information and lowers public confidence. Hence this study proposes three approaches to structuring business information so that it is expressed correctly: an HTML-based structure, an XML-based structure, and an XML data island-based structure. The experimental results show that the business information extraction time with the XML data island-based structure is independent of the size of the web document, whereas the extraction time with the HTML-based structure depends on it. By comparing the extraction times, we show that the XML data island-based structure is more efficient and effective than the HTML-based structure.


Analysis of HTML Structure in E-commerce Websites Using Tree Representation (트리 표현을 사용하는 전자상거래 웹 사이트의 HTML 구조 분석)

  • Ventura, Jose E.; Park, Jeong-Sun
    • Journal of the Korea Safety Management & Science, v.13 no.4, pp.201-205, 2011
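  • Consumer demand for personalized products and services rests on successful e-commerce platforms. An important yet frequently overlooked factor in developing a successful e-commerce platform is the HTML structure of its web pages. The HTML structure is a basic factor that determines the speed and ranking of an e-commerce website. This paper proposes an efficient and somewhat unfamiliar visualization technique for analyzing HTML structure, with which developers can discover potential programming errors and possible improvements. The paper illustrates the proposed technique in more detail with a case study.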
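
The abstract does not describe the visualization itself, so the following Python sketch shows only the generic idea of viewing a page as a tag tree: it prints an indented outline in which each level of indentation is one level of nesting. The names `TagTree` and `outline` are hypothetical, and the sketch assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# HTML void elements never have a closing tag, so they must not deepen the tree.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TagTree(HTMLParser):
    """Build an indented outline of the tag hierarchy."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth > 0:
            self.depth -= 1


def outline(html):
    parser = TagTree()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

Printing `outline(page_source)` for a product page makes deeply nested layout wrappers easy to spot, which is the kind of structural issue a tree view is meant to surface.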

A Method of Form-Based HTML Documents Generation (폼에 기반한 HTML 문서 생성 방법)

  • Choe, Jun-Yong; Kim, Byeong-Gi
    • The Transactions of the Korea Information Processing Society, v.6 no.2, pp.292-298, 1999
  • The information structure of a large hypermedia application is usually hierarchical, and the sibling nodes in this structure have the same or similar tags and contents. This paper uses the term "개그" for the common set of sibling nodes in the hierarchical information structure. It proposes a design method that separates form and content in nodes, and an algorithm that generates HTML pages from forms and contents. The method offers reusable forms, maintainable documents, and reduced authoring cost.

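
The separation of form and content described above can be sketched with Python's standard `string.Template`. This is only an illustration of the general idea, not the paper's generation algorithm: `FORM`, `generate_pages`, and the `title` and `body` placeholders are all hypothetical names.

```python
from html import escape
from string import Template

# One shared form for a set of sibling nodes; the placeholders mark
# where each node's own content goes.
FORM = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)


def generate_pages(form, contents):
    """Fill the same form once per content record.

    Values are HTML-escaped so that content cannot alter the form's markup.
    """
    return [
        form.substitute({key: escape(value) for key, value in content.items()})
        for content in contents
    ]
```

Because every sibling page is produced from the same form, changing the layout means editing `FORM` once instead of editing each page, which is the reusability and maintainability benefit the abstract claims.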

Development of an Intelligent Illegal Gambling Site Detection Model Based on Tag2Vec (Tag2vec 기반의 지능형 불법 도박 사이트 탐지 모형 개발)

  • Song, ChanWoo; Ahn, Hyunchul
    • Journal of Intelligence and Information Systems, v.28 no.4, pp.211-227, 2022
  • Illegal gambling through online gambling sites has become a significant social problem. The development of Internet technology and the spread of smartphones have led to the proliferation of illegal gambling sites, so illegal online gambling is now accessible to anyone. To mitigate its negative effects, the Korean government is trying to detect illegal gambling sites by using self-monitoring agents or reporting systems such as 'Nuricops.' However, it is difficult to detect all illegal sites because of limitations such as a lack of staff. Accordingly, several scholars have proposed intelligent techniques for detecting illegal gambling sites. Xu et al. (2019) found that fake or illegal websites generally have unique features in their HTML tag structure, which implies that the HTML tag structure can be important for detecting illegal sites. However, few prior studies have used the HTML tag structure to improve the performance of illegal site detection models. Against this background, our study aims to improve model performance by utilizing the HTML tag structure and proposes Tag2Vec, a modified version of Doc2Vec, as a methodology for vectorizing the HTML tag structure properly. To validate the proposed model, we perform an empirical analysis using a data set consisting of a list of harmful sites from 'The Cheat' and normal sites collected through Google search. The results confirm that the proposed Tag2Vec-based detection model shows better classification accuracy, recall, and F1 score than the URL-based detection model used for comparison. The proposed model is expected to be used effectively to improve the health of our society through intelligent technology.

Automatically Converting HTML Documents with Similar Pattern into XML Documents (유사 패턴을 갖는 HTML 문서의 XML 자동 변환)

  • O, Geum-Yong; Hwang, In-Jun
    • The KIPS Transactions:PartD, v.9D no.3, pp.355-364, 2002
  • Recently, the WWW (World Wide Web) has become a source of a large amount of information, and it is now recognized not only as an information-sharing tool but also as an information repository. Currently, the majority of documents on the Web are created using HTML (Hypertext Markup Language). Although HTML is simple and easy to learn, its inherent inability to describe document structure makes it difficult to retrieve information effectively. One possible solution is to convert such HTML documents into XML (eXtensible Markup Language) documents. XML is a standard markup language for exchanging data on the Web. It can describe a document structure freely by defining its own DTD (Document Type Definition), which makes it possible to integrate, store, and retrieve data on the Web efficiently. In this paper, we propose a converter that automatically converts HTML documents with similar patterns into XML documents by analyzing the document structure and recognizing its path information.
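
The abstract mentions "path information" without defining it. One plausible reading, sketched below in Python under that assumption, is the root-to-node tag path of each piece of text: documents that share a pattern place the same kind of content under the same path, so the path can serve as a key for mapping text to an XML element. The names `PathCollector` and `text_by_path` are hypothetical, and this is not presented as the paper's converter.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements have no closing tag and so never become part of a path.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Group text fragments by the tag path that leads to them."""

    def __init__(self):
        super().__init__()
        self.stack = []                   # currently open tags
        self.by_path = defaultdict(list)  # "html/body/h1" -> [text, ...]

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: only the innermost open tag is closed.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.by_path["/".join(self.stack)].append(text)


def text_by_path(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return dict(parser.by_path)
```

Running `text_by_path` over several pages from the same site and comparing the keys gives a quick view of which paths recur, and those are the candidates for repeated fields in a converted XML document.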

Analysis of User Preferences on the Structure of Digital Textbook Contents (디지털교과서 내용 구성에 관한 사용자 선호도 분석)

  • Kim, Mi-Hye
    • The Journal of the Korea Contents Association, v.9 no.12, pp.900-911, 2009
  • This paper analyzes user preferences on the basic structure of digital textbook contents based on the PDF and HTML formats. The analysis uses data from an online survey on user preferences for the representative structures of PDF- and HTML-based digital textbook contents currently used on the Web. The results show that, in the PDF format, the structure with TOC (Table Of Contents) links on the left of the screen and the main content on the right was preferred by 82% of the respondents. As for the viewing method, a single-page view that fits one textbook page to the width of the computer screen was regarded as the best. Similarly, in the HTML format, the two-frame structure with TOC links in the left frame and the main content in the right frame was preferred by 84% of the respondents. However, the structures of the PDF- and HTML-based digital textbook contents employed by most existing Web sites go against these preferences. Accordingly, future digital textbook development must consider user preferences so that students can read the contents more easily and conveniently.

HTML Tag Depth Embedding: An Input Embedding Method of the BERT Model for Improving Web Document Reading Comprehension Performance (HTML 태그 깊이 임베딩: 웹 문서 기계 독해 성능 개선을 위한 BERT 모델의 입력 임베딩 기법)

  • Mok, Jin-Wang; Jang, Hyun Jae; Lee, Hyun-Seob
    • Journal of Internet of Things and Convergence, v.8 no.5, pp.17-25, 2022
  • Recently, a massive amount of data has been generated as the number of edge devices increases. In particular, the number of raw, unstructured HTML documents has grown. Therefore, MRC (Machine Reading Comprehension), in which a natural language processing model finds the important information within an HTML document, is becoming more important. In this paper, we propose HTDE (HTML Tag Depth Embedding Method), which allows BERT to learn the depth of the HTML document structure. HTDE builds a tag stack from the HTML document for each input token of BERT and then extracts the depth information. We then add an HTML embedding layer, which takes the depth of each token as input, to the input embedding step of BERT. Because tokenization with HTDE identifies the HTML document structure through the relationships among surrounding tokens, HTDE improves the accuracy of BERT on HTML documents. Finally, we demonstrate that the proposed method achieves higher accuracy than the conventional BERT embedding.
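
The tag-stack step described above can be sketched in a few lines of Python. The sketch covers only depth extraction: it splits text on whitespace instead of using BERT's WordPiece tokenizer, and it does not build the embedding layer. `DepthTagger` and `token_depths` are hypothetical names, and the depth convention (number of open tags around a token) is an assumption about how depth is counted.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they never enter the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class DepthTagger(HTMLParser):
    """Pair each whitespace-separated token with its tag-stack depth."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.pairs = []   # (token, depth) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        for token in data.split():
            self.pairs.append((token, len(self.stack)))


def token_depths(html):
    parser = DepthTagger()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

In a model along the lines the abstract describes, the integer depths would index a learned embedding table whose vectors are added to the usual token, segment, and position embeddings. That wiring is left out here because the abstract does not give its details.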

Implementation of an XML-Based Editor/Transformer for Large Volume of Similar Documents (XML 기반의 대용량 유사 문서 편집기/변환기 구현)

  • 황인준
    • The Journal of Society for e-Business Studies, v.9 no.1, pp.21-38, 2004
  • With its recent popularity, the Web is now considered a huge repository of information. Most documents on the Web have been created using HTML (HyperText Markup Language). Even though HTML is simple and easy to learn, it has several features that are obstacles to efficient information retrieval. XML (eXtensible Markup Language) can provide a solution to such problems and, in fact, has already been used in many applications. XML is a standard markup language for exchanging data on the Web. It can describe a document structure freely by defining its DTD, which enables efficient integration and retrieval of data on the Web. In this paper, we propose a versatile and efficient XML document manager. Its features include (i) a form-based XML editor that enables easy creation of new XML documents, (ii) an automatic document converter that transforms HTML documents with similar structure into XML documents, and (iii) a GUI-based DTD editor.
