• Title/Summary/Keyword: HTML documents

149 search results

Web Information Extraction and Multidimensional Analysis Using XML (XML을 이용한 웹 정보 추출 및 다차원 분석)

  • Park, Byung-Kwon
    • Journal of Korea Multimedia Society / v.11 no.5 / pp.567-578 / 2008
  • To analyze the huge number of web pages available on the Internet, we need to extract the information encoded in them. In this paper, we propose a method for extracting web information from web pages and converting it into XML documents for multidimensional analysis. For extracting information from web pages, we propose two languages: one for describing web information extraction rules based on an object-oriented model, and another for describing regular expressions over HTML tag patterns used to locate target information. For multidimensional analysis of XML documents, we propose a method for constructing an XML warehouse and building various XML cubes from it, in the same way as for relational data. Finally, we demonstrate the validity of our method by applying it to US patent web pages.

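The tag-pattern extraction described above can be sketched briefly. In the following minimal Python sketch, the field names and regular expressions are invented stand-ins for the paper's extraction-rule and pattern languages; it only illustrates how regexes over HTML tag patterns can pull target values into an XML record.

```python
# Minimal sketch: extract fields from an HTML page with tag-pattern regexes
# and emit them as an XML document. Field names and patterns are illustrative.
import re
import xml.etree.ElementTree as ET

# Hypothetical extraction rules: field name -> regex over HTML tag patterns.
RULES = {
    "title":    re.compile(r"<h1[^>]*>(.*?)</h1>", re.S | re.I),
    "inventor": re.compile(r"<span class=\"inventor\">(.*?)</span>", re.S | re.I),
}

def extract_to_xml(html: str) -> ET.Element:
    """Apply every rule to the page and wrap the matches in an XML record."""
    record = ET.Element("record")
    for field, pattern in RULES.items():
        for match in pattern.findall(html):
            ET.SubElement(record, field).text = re.sub(r"<[^>]+>", "", match).strip()
    return record

if __name__ == "__main__":
    page = '<h1>Example Patent</h1><span class="inventor">J. Doe</span>'
    print(ET.tostring(extract_to_xml(page), encoding="unicode"))
```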

A Suggestion of Efficient Method for Integrating XML and Voice XML (XML과 Voice XML의 효율적인 통합 방안 제시)

  • 장민석;홍용택
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2001.05a / pp.260-264 / 2001
  • In this paper we suggest a method for translating (or integrating) XML documents into Voice XML documents in order to provide voice communication services. In the forthcoming web environment, XML will certainly overwhelm HTML. In this situation, methods for accessing data through more varied types of terminals are required. The best approach is to use Voice XML, by which the data access method can be extended from the Web to wired and/or wireless terminals at low cost. Thus we suggest a method for integrating an XML-based system with a Voice XML-based one.

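The XML-to-Voice XML translation above can be illustrated with a minimal sketch. The source schema (`<news>`/`<item>`) is an assumed example, not the paper's document structure; the sketch simply wraps element text in VoiceXML prompt blocks.

```python
# Minimal sketch: turn a plain XML data document into a VoiceXML document
# whose prompts read the element values aloud. The source schema (<news><item>)
# is an assumed example, not the paper's actual document structure.
import xml.etree.ElementTree as ET

def xml_to_voicexml(source: str) -> str:
    root = ET.fromstring(source)
    vxml = ET.Element("vxml", version="2.1")
    form = ET.SubElement(vxml, "form")
    for item in root.iter("item"):
        block = ET.SubElement(form, "block")
        ET.SubElement(block, "prompt").text = (item.text or "").strip()
    return ET.tostring(vxml, encoding="unicode")

if __name__ == "__main__":
    source = "<news><item>Stocks rose today.</item><item>Rain is expected.</item></news>"
    print(xml_to_voicexml(source))
```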

A Study on Authorization Policy Management for Semantic Web (시맨틱 웹을 위한 권한부여 정책 관리에 관한 연구)

  • Jo, Sun-Moon
    • Journal of Digital Convergence / v.11 no.9 / pp.189-194 / 2013
  • The Semantic Web supports search, data integration, and automated web services by developing technologies that help computers better understand information on the web. As the amount of information grows and diversifies, there is the problem of efficiently extracting and processing only the information that matches users' demands. The Semantic Web is not completely distinct from the existing web; it extends the current web by giving well-defined meaning to the information placed on it, so that computers and people can work cooperatively. To realize the Semantic Web, the limits of HTML must be overcome. Existing access authorization has not taken information semantics into account because of the limitations of HTML: it is difficult to expand or integrate many related documents using HTML, and a program or software agent, rather than a person, cannot automatically extract the meaning of a document. This study suggests a method of access authorization policy management for the Semantic Web configuration. The policy designed in this study improves the authorization process over the existing method.

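A rough sketch of the kind of access-authorization check discussed above, assuming policies are simple (user, resource prefix, action) rules evaluated over web resources; the rule format and data are invented for illustration, not the policy model designed in the paper.

```python
# Minimal sketch: evaluate access-authorization rules for web resources.
# The policy format (user, resource prefix, allowed action) is an assumption
# made for illustration, not the policy language designed in the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    user: str
    resource_prefix: str   # grants access to every resource under this prefix
    action: str            # e.g. "read" or "write"

POLICY = [
    Rule("alice", "http://example.org/patents/", "read"),
    Rule("bob",   "http://example.org/", "read"),
]

def is_authorized(user: str, resource: str, action: str) -> bool:
    """Allow the request only if some rule covers the user, resource, and action."""
    return any(
        r.user == user and r.action == action and resource.startswith(r.resource_prefix)
        for r in POLICY
    )

if __name__ == "__main__":
    print(is_authorized("alice", "http://example.org/patents/US123", "read"))   # True
    print(is_authorized("alice", "http://example.org/patents/US123", "write"))  # False
```
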
Logistic Regression Ensemble Method for Extracting Significant Information from Social Texts (소셜 텍스트의 주요 정보 추출을 위한 로지스틱 회귀 앙상블 기법)

  • Kim, So Hyeon;Kim, Han Joon
    • KIPS Transactions on Software and Data Engineering / v.6 no.5 / pp.279-284 / 2017
  • Currently, in the era of big data, text mining and opinion mining are used in many domains, and one of their most important research issues is extracting significant information from social media. In this paper, we propose a logistic regression ensemble method for finding the main body text in blog HTML. First, we extract structural features and text features from blog HTML tags. Then we construct a classification model, using logistic regression and an ensemble, that decides whether a given tag contains main body text. One of our important findings is that the main body text can be found through 'depth' features extracted from HTML tags. In our experiment on blog data with diverse topics collected from the web, our tag classification model achieved 99% accuracy and recalled 80.5% of documents containing tags with main body text.

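The classification step above can be sketched with scikit-learn, assuming a small bagged ensemble of logistic regression models over per-tag features. Only the 'depth' feature comes from the abstract; the other features and the toy data are illustrative assumptions.

```python
# Minimal sketch: classify HTML tags as "main body text" or not with a small
# bagged ensemble of logistic regression models. The features (tag depth, text
# length, link density) and the toy training data are illustrative assumptions;
# only the 'depth' feature is taken from the abstract above.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Each row: [tag depth, text length, link density]; label 1 = main body text.
X = np.array([
    [3, 400, 0.02], [4, 350, 0.05], [3, 500, 0.01],   # body-like tags
    [1,  20, 0.90], [2,  15, 0.80], [1,  30, 0.95],   # navigation/ad-like tags
])
y = np.array([1, 1, 1, 0, 0, 0])

# Note: the keyword is `base_estimator` in scikit-learn versions before 1.2.
model = BaggingClassifier(estimator=LogisticRegression(), n_estimators=10, random_state=0)
model.fit(X, y)

print(model.predict([[3, 420, 0.03], [1, 18, 0.85]]))  # expected: [1 0]
```
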
A Universal Smart-phone APP for Processing One-shot Tasks (일회성 작업 처리를 위한 통합 스마트폰 앱)

  • Cha, Shin;So, Sun Sup;Jung, Jinman;Yoon, Young-Sun;Eun, Seongbae
    • Journal of Korea Multimedia Society / v.20 no.3 / pp.562-570 / 2017
  • One-shot tasks, such as a MERSC handling policy or a cinema poster, are too small, diverse, and sporadic to build as dedicated apps or web applications. They are usually shared as notes attached in the field or as messages on smart phones. To support interoperability with Internet web sites, QR/NFC tags are attached to them. The problem with the web approach is that the HTML5 standard does not provide access to smart phone resources such as the camera, audio, magnetic sensors, etc. In this paper, we propose a universal smart phone application for handling various one-shot tasks with the same UI/UX. One-shot tasks are described as HTML5 web documents, and the URLs of those documents are stored in QR/NFC tags. A smart phone scans a tag, and the web document is then retrieved and presented. QR tags can be delivered to other smart phones through messages or SNS. We solve the HTML5 resource access problem by supplying a JavaScript resource access library. We present the overall architecture and the internal structure of the QR/NFC tags, and show that our scheme can be applied to build various one-shot tasks.

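A minimal sketch of the tag workflow described above: encode the URL of an HTML5 one-shot-task document into a QR image and later retrieve the document. The URL is a placeholder and the third-party qrcode package is an assumed convenience, not the paper's implementation.

```python
# Minimal sketch: store the URL of an HTML5 one-shot-task document in a QR tag
# and resolve it the way the universal app would. The URL is a placeholder, and
# the third-party qrcode package is an assumed convenience, not the paper's code.
import urllib.request
import qrcode  # pip install qrcode[pil]

TASK_URL = "https://example.org/tasks/cinema-poster.html"  # hypothetical task document

def publish_tag(url: str, path: str = "one_shot_task.png") -> None:
    """Publisher side: encode the task URL into a QR image to print or share."""
    qrcode.make(url).save(path)

def open_task(url: str) -> str:
    """App side: after scanning a tag, fetch the HTML5 task document to present."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    publish_tag(TASK_URL)
    # open_task(TASK_URL) would then retrieve the document for rendering.
```
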
The Study of technique to find and prove vulnerabilities in ActiveX Control (ActiveX Control 취약점 검사 및 검증 기법 연구)

  • Sohn, Ki-Wook;Kim, Su-Yong
    • Journal of the Korea Institute of Information Security & Cryptology / v.15 no.6 / pp.3-12 / 2005
  • To provide visitors with various services, many web sites distribute ActiveX controls, because ActiveX controls can overcome the limits of HTML documents and script languages. However, a PC can become vulnerable if it has insecure ActiveX controls, because they can be executed from HTML documents. Nevertheless, many web sites provide visitors with ActiveX controls whose security has not been verified. Therefore, third-party verification is needed to remove vulnerabilities in ActiveX controls. In this paper, we introduce the process and technique for finding vulnerabilities. Existing proof-of-concept codes are not valid because ActiveX controls differ from normal applications and domestic environments differ from foreign ones. We therefore also introduce a technique for proving vulnerabilities in ActiveX controls.

A Study on Effective Internet Data Extraction through Layout Detection

  • Sun Bok-Keun;Han Kwang-Rok
    • International Journal of Contents / v.1 no.2 / pp.5-9 / 2005
  • Currently, most Internet documents, including their data, are made from predefined templates, but templates are usually designed only for the main data and are not helpful for information retrieval over indexes, advertisements, header data, etc. Templates in such forms are not appropriate when Internet documents are used as data for information retrieval. To process Internet documents in various areas of information retrieval, it is necessary to detect additional information such as advertisements and page indexes. This study therefore proposes a method for detecting the layout of Web pages by identifying the characteristics and structure of the block tags that affect the layout and by calculating distances between Web pages. The method aims to reduce the cost of automatic Web document processing and improve processing efficiency by providing information about the structure of Web pages through templates, when applied to information retrieval tasks such as data extraction.

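One simple way to realize the "distance between Web pages" idea above is to compare the sequences of block-level tags on two pages. The block-tag set and the plain edit distance in this sketch are simplifying assumptions, not the paper's exact layout-detection algorithm.

```python
# Minimal sketch: represent a page by its sequence of block-level tags and use
# edit distance between those sequences as the page-to-page distance.
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table", "tr", "td", "ul", "ol", "li", "p", "h1", "h2", "h3"}

class BlockTagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.tags.append(tag)

def block_sequence(html: str) -> list:
    parser = BlockTagCollector()
    parser.feed(html)
    return parser.tags

def edit_distance(a: list, b: list) -> int:
    """Classic dynamic-programming edit distance over two tag sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[len(a)][len(b)]

if __name__ == "__main__":
    page1 = "<div><h1>News</h1><p>Body</p><ul><li>Ad</li></ul></div>"
    page2 = "<div><h1>Sports</h1><p>Body</p><ul><li>Ad</li></ul></div>"
    print(edit_distance(block_sequence(page1), block_sequence(page2)))  # 0: same layout
```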

Web-based Chart Generating System Using CML (CML을 이용한 웹 기반 차트출력시스템)

  • Yoon, Hyun-Nim;Kim, Yang-Woo
    • Journal of Internet Computing and Services / v.9 no.5 / pp.47-58 / 2008
  • Charts propagate information more effectively by visualizing various types of information. For this reason, many web developers use charts when they display information on the web. However, using a chart requires installing special dedicated software, and it is difficult to share the finished charts since they are usually in a raster format. Raster images are fixed in size and therefore suffer from distortion when resized. In this paper, we propose a web-based chart generating system using CML (Chart Markup Language) to solve these compatibility and sharing problems. The proposed system first analyzes XML, text, or HTML documents to extract the chart information they contain, and then converts the collected chart information to CML. The converted CML documents are displayed in the web browser using a vector method, which has the advantage of protecting images from distortion even when they are scaled. The CML proposed in this paper is an XML-based chart markup language for modeling chart information. Using CML to present a chart on the web makes the chart information easier to share and convert.

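The CML-to-vector rendering step above can be sketched as follows; the CML element names used here are assumptions made for illustration, not the schema defined in the paper, and SVG stands in for the browser's vector output.

```python
# Minimal sketch: render a chart described in a CML-like XML document as SVG
# (a vector format), so it scales without distortion. The element names
# (<chart>, <item label= value=>) are illustrative assumptions.
import xml.etree.ElementTree as ET

CML = """
<chart title="Visitors">
  <item label="Mon" value="30"/>
  <item label="Tue" value="50"/>
  <item label="Wed" value="20"/>
</chart>
"""

def cml_to_svg(cml: str, bar_width: int = 40, scale: int = 2) -> str:
    chart = ET.fromstring(cml)
    items = chart.findall("item")
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width=str(len(items) * bar_width), height="120")
    for i, item in enumerate(items):
        height = int(item.get("value")) * scale
        ET.SubElement(svg, "rect", x=str(i * bar_width), y=str(120 - height),
                      width=str(bar_width - 5), height=str(height), fill="steelblue")
    return ET.tostring(svg, encoding="unicode")

if __name__ == "__main__":
    print(cml_to_svg(CML))
```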

Design of CSS3 Extensions for Polar-Coordinate Text Layout in Web Documents (웹문서 내의 극좌표계 텍스트 배치를 위한 CSS3 확장사양 설계)

  • Shim, Seung-Min;Lim, Soon-Bum
    • KIISE Transactions on Computing Practices / v.22 no.10 / pp.537-545 / 2016
  • Demand for text arranged in a circular shape is increasing as devices with round displays, such as smart watches, are actively released, and the data visualization field is receiving much attention as the era of big data evolves. However, current web standards do not support drawing circular text. Therefore, the objective of this study was to extend the CSS3 specifications to support circular text layout in web documents. In addition, we implemented a preprocessor so that content written with the CSS3 extensions can be shown in existing browsers. To confirm the wide expressive range of the CSS3 extension, we prepared sample contents and analyzed them.

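What such a preprocessor has to compute can be sketched directly: convert polar coordinates (radius, angle) to Cartesian offsets for each character and emit absolutely positioned spans that existing browsers can render. The paper's actual CSS3 property names are not reproduced here; only the layout math is illustrated.

```python
# Minimal sketch: place each character of a string on a circle by converting
# polar coordinates (radius, angle) to Cartesian offsets and emitting
# absolutely positioned spans. Property names of the paper's CSS3 extension
# are not reproduced; this only shows the math a preprocessor would apply.
import math

def circular_spans(text: str, radius: float = 80.0, start_deg: float = -90.0) -> str:
    """Spread the characters evenly over a full circle around a 100x100 origin."""
    step = 360.0 / max(len(text), 1)
    spans = []
    for i, ch in enumerate(text):
        angle = math.radians(start_deg + i * step)
        x = 100 + radius * math.cos(angle)
        y = 100 + radius * math.sin(angle)
        rotate = start_deg + i * step + 90.0   # keep glyphs tangent to the circle
        spans.append(
            f'<span style="position:absolute;left:{x:.1f}px;top:{y:.1f}px;'
            f'transform:rotate({rotate:.1f}deg)">{ch}</span>'
        )
    return '<div style="position:relative;width:200px;height:200px">' + "".join(spans) + "</div>"

if __name__ == "__main__":
    print(circular_spans("CIRCULAR TEXT "))
```
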
Hyper-Text Compression Method Based on LZW Dictionary Entry Management (개선된 LZW 사전 관리 기법에 기반한 효과적인 Hyper-Text 문서 압축 방안)

  • Sin, Gwang-Cheol;Han, Sang-Yong
    • The KIPS Transactions: Part A / v.9A no.3 / pp.311-316 / 2002
  • LZW is a popular variant of LZ78 for compressing text documents. LZW yields a high compression rate and is widely used by many commercial programs. Its core idea is to assign dictionary entries to the most frequently used character groups: if a group of characters that already has a dictionary entry appears in the input stream, it is replaced by its dictionary index. In this paper, we propose a new, efficient method that uses counters to find the least-used entries in the dictionary. We also achieve a higher compression rate by preassigning dictionary entries to tags widely used in hyper-text documents. Experimental results show that the proposed method is more effective than V.42bis and the Unix compress method, giving 3-8% better compression on the standard Calgary Corpus and 23-24% better compression on HTML documents.
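
The two dictionary-management ideas above can be sketched in a few lines: preassign entries for common HTML tags before compression starts, and keep a usage counter per entry so the least-used entry can be evicted when the dictionary fills. The tag list, dictionary size, and eviction rule are illustrative choices, not the paper's parameters, and a matching decompressor (omitted here) would have to mirror the eviction decisions.

```python
# Minimal sketch: LZW compression with preloaded HTML-tag entries and per-entry
# usage counters used to evict the least-used entry once the dictionary is full.
PRELOADED_TAGS = ["<html>", "</html>", "<body>", "</body>", "<table>", "</table>", "href="]
MAX_ENTRIES = 4096

def lzw_compress(text: str) -> list:
    """Return the list of LZW output codes for `text`."""
    # Single characters first, then the HTML tags (with their prefixes, so the
    # greedy longest-match loop below can actually reach the full tags).
    dictionary = {chr(i): i for i in range(256)}
    for tag in PRELOADED_TAGS:
        for end in range(2, len(tag) + 1):
            dictionary.setdefault(tag[:end], len(dictionary))
    counts = {code: 0 for code in dictionary.values()}

    output, w = [], ""
    for c in text:
        wc = w + c
        if wc in dictionary:
            w = wc
            continue
        code = dictionary[w]
        output.append(code)
        counts[code] += 1
        if len(dictionary) >= MAX_ENTRIES:
            # Dictionary is full: evict the least-used multi-character entry
            # and reuse its code for the new entry.
            victim = min((k for k in dictionary if len(k) > 1),
                         key=lambda k: counts[dictionary[k]])
            new_code = dictionary.pop(victim)
        else:
            new_code = len(dictionary)
        dictionary[wc] = new_code
        counts[new_code] = 0
        w = c
    if w:
        output.append(dictionary[w])
    return output

if __name__ == "__main__":
    html = "<html><body><table></table></body></html>" * 3
    codes = lzw_compress(html)
    print(len(html), "characters ->", len(codes), "codes")
```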