• Title/Summary/Keyword: Web documents

Search Results: 831

On Developing a Semantic Annotation Tool for Managing Metadata of Web Documents based on XMP and Ontology (웹 문서의 메타데이터 관리를 위한 XMP 및 온톨로지 기반의 시맨틱 어노테이션 지원도구 개발)

  • Yang, Kyoung-Mo;Hwang, Suk-Hyung;Choi, Sung-Hee
    • Journal of the Korea Academia-Industrial cooperation Society / v.10 no.7 / pp.1585-1600 / 2009
  • The goal of the Semantic Web is to provide efficient and effective semantic search and web services based on machine-processable semantic information about web resources. Therefore, the process of creating and adding computer-understandable metadata to a variety of web contents, namely semantic annotation, is one of the fundamental technologies of the semantic web. Recently, to manage annotation metadata, the direct approach of embedding metadata into the document itself has mainly been used in semantic annotation. However, most semantic annotation tools for web documents have targeted HTML documents only, and most of them do not support semantic search using the metadata. In this paper, based on these problems and previous works, we propose an Ontology-based Semantic Annotation tool (OSA) to efficiently support semantic annotation of web documents (such as HTML and PDF). We define a semantic annotation model that represents ontological semantic information using RDFS (RDF Schema). Based on the XMP (eXtensible Metadata Platform) standard, the model is encoded directly into the document. By using OSA with XMP, users can perform semantic annotation on web documents while keeping the annotation metadata compatible and manageable. Eventually, the integrated semantic annotation metadata can be used effectively in semantic search over a variety of web contents.
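The XMP encoding described above can be illustrated with a small sketch. The Python snippet below builds an XMP packet whose payload is RDF/XML, the serialization XMP uses for metadata embedded in a document; the "anno" namespace and property names are hypothetical examples, not the paper's actual OSA schema, and writing the packet into a PDF or HTML file would require a separate library.

```python
# Minimal sketch (not the paper's OSA schema): an XMP packet carrying RDF/XML metadata.
from xml.sax.saxutils import escape

def build_xmp_packet(subject_uri: str, properties: dict) -> str:
    """Wrap simple annotation properties in an XMP packet (RDF/XML payload)."""
    # The 'anno' namespace and property names below are illustrative only.
    props = "\n".join(
        f"      <anno:{name}>{escape(value)}</anno:{name}>"
        for name, value in properties.items()
    )
    return f"""<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:anno="http://example.org/annotation#">
    <rdf:Description rdf:about="{escape(subject_uri)}">
{props}
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""

if __name__ == "__main__":
    packet = build_xmp_packet(
        "http://example.org/docs/report.pdf",
        {"topic": "Semantic Web", "annotator": "K. Yang"},
    )
    print(packet)  # a PDF library could then embed this packet in the target file
```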

REL and RDD Web Services System Based on MPEG-21 Framework (MPEG-21 프레임워크 기반의 REL 및 RDD 웹서비스 시스템)

  • Yoon Haw-Mook;Cho Tae-Beom;Jung Hoe-Kyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.10 no.5 / pp.843-850 / 2006
  • The standardization of RDD and REL has been developed by MPEG. REL is a Rights Expression Language, and RDD is the term dictionary that supports the practical application of REL. However, since REL documents can only be edited by users who understand the MPEG-21 framework, a much easier editing system is required, along with an REL consumption system to process and analyze REL documents and an RDD interoperability system to make REL and RDD work together. In this paper, an REL Editing System, an REL Consumption System, and an RDD Web Services System were designed and implemented: the REL Editing System creates REL documents, the REL Consumption System processes and analyzes the edited documents, and the RDD Web Services System handles rights inquiries based on Web Services.

WebDBs : A User oriented Web Search Engine (WebDBs: 사용자 중심의 웹 검색 엔진)

  • 김홍일;임해철
    • The Journal of Korean Institute of Communications and Information Sciences / v.24 no.7B / pp.1331-1341 / 1999
  • This paper proposes WebDBs (Web Database system), which retrieves information registered on the web using a query language similar to SQL. The proposed system automatically extracts the information needed for retrieval from HTML documents dispersed across the web, and it can process SQL-based queries over the extracted information. Most of the query processing time in such a system is spent fetching documents over the network. Since most web retrieval exhibits locality, previously retrieved information is stored in a cache and reused by similar applications. For this, we propose a cache mechanism adapted to user applications that stores cached information together with the query that retrieved it, and we implement a web search engine based on these concepts. (A minimal caching sketch follows this entry.)

  • PDF
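As a rough illustration of the caching idea in the abstract above, the sketch below keys cached results on the query text and reuses previously fetched pages. The class and method names are hypothetical, and WebDBs' SQL-like language is reduced to a trivial CONTAINS-style predicate.

```python
# Sketch of a query-keyed document cache exploiting web locality (illustrative only).
import hashlib
import urllib.request

class QueryCache:
    def __init__(self):
        self._pages = {}    # url -> fetched HTML
        self._results = {}  # query hash -> result list

    def fetch(self, url: str) -> str:
        """Return a page, hitting the network only on a cache miss."""
        if url not in self._pages:
            with urllib.request.urlopen(url) as resp:
                self._pages[url] = resp.read().decode("utf-8", errors="replace")
        return self._pages[url]

    def query(self, query_text: str, urls: list, keyword: str) -> list:
        """Evaluate a trivial 'SELECT url WHERE body CONTAINS keyword' style query."""
        key = hashlib.sha1(query_text.encode("utf-8")).hexdigest()
        if key not in self._results:
            self._results[key] = [u for u in urls if keyword in self.fetch(u)]
        return self._results[key]

if __name__ == "__main__":
    cache = QueryCache()
    hits = cache.query("SELECT url WHERE body CONTAINS 'XML'",
                       ["https://example.com/"], "XML")
    print(hits)  # a repeated or similar query is answered from the cache
```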

XML Document Repository System for structured retrieval (구조 검색을 위한 XML 문서 저장 시스템)

  • 임산송;현득창;정회경
    • The Journal of Information Technology / v.4 no.4 / pp.89-100 / 2001
  • XML (eXtensible Markup Language) was selected and published by the W3C (World Wide Web Consortium) as a representative standard for electronic documents. Structured information can be created and transferred in XML documents, and compared with existing flat file formats, XML allows meaningful units of information to be expressed as structure, so documents can be managed, retrieved, and stored using that structure. Accordingly, the purpose of this paper is to design and implement an XML document repository system that stores and retrieves documents using the structural information of XML. The system was designed to store documents by element, the basic unit of a document, and to retrieve the stored XML information by structural unit. In particular, it was designed to manage and store the structure of various documents effectively by generating schemas and instances from the DTD (Document Type Definition). (A minimal storage sketch follows this entry.)

  • PDF
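A minimal sketch of element-level storage and structural retrieval, under the assumption that element paths serve as the structural key; this is illustrative only and does not reproduce the paper's DTD-driven schema generation.

```python
# Sketch: store an XML document element by element so it can be retrieved by
# structural unit (element paths act as the retrieval key).
import sqlite3
import xml.etree.ElementTree as ET

def store_elements(xml_text: str, db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS elements (path TEXT, tag TEXT, content TEXT)")
    root = ET.fromstring(xml_text)

    def walk(elem, path):
        current = f"{path}/{elem.tag}"
        conn.execute("INSERT INTO elements VALUES (?, ?, ?)",
                     (current, elem.tag, (elem.text or "").strip()))
        for child in elem:
            walk(child, current)

    walk(root, "")
    conn.commit()
    return conn

if __name__ == "__main__":
    doc = "<paper><title>XML Repository</title><body><sec>storage</sec></body></paper>"
    conn = store_elements(doc)
    # structural retrieval: every element stored under /paper/body
    for row in conn.execute("SELECT path, content FROM elements WHERE path LIKE '/paper/body%'"):
        print(row)
```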

Digitalizing Technical Documents of Construction Projects Based on Database and XML (데이터베이스와 XML에 기반한 건설프로젝트 기술문서 전자화)

  • Jung Jong-Hyun
    • Korean Journal of Construction Engineering and Management / v.6 no.4 s.26 / pp.190-198 / 2005
  • This study describes the digitalization of technical documents of construction projects, using a database for storage and XML as the exchange format on the web. First, the requirements for effective digitalization are identified. Second, strategies for using the database and XML are presented: how to store and search the technical documents, how to draw up XML documents for parts of the technical documents, how to arrange the components in their proper hierarchy, and how to manage graphics and mathematical expressions in the database and XML documents. Finally, the validity of the results is discussed through a partial implementation for structural design sheets, which exhibit all the characteristics of technical documents.
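As a small illustration of the database-plus-XML exchange strategy, the sketch below exports one database record of a design sheet to an XML document; the element and column names are made up for the example and are not the paper's actual schema.

```python
# Sketch: export a technical-document record (one database row) to XML for
# exchange on the web. Element and column names are illustrative only.
import xml.etree.ElementTree as ET

def row_to_xml(row: dict) -> str:
    doc = ET.Element("TechnicalDocument")
    for column, value in row.items():
        child = ET.SubElement(doc, column)
        child.text = str(value)
    return ET.tostring(doc, encoding="unicode")

if __name__ == "__main__":
    record = {"Project": "Bridge A", "SheetType": "StructuralDesign",
              "Revision": 3, "Formula": "M = w * L**2 / 8"}
    print(row_to_xml(record))
```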

TripleDiff: an Incremental Update Algorithm on RDF Documents in Triple Stores (TripleDiff: 트리플 저장소에서 RDF 문서에 대한 점진적 갱신 알고리즘)

  • Lee, Tae-Whi;Kim, Ki-Sung;Yoo, Sang-Won;Kim, Hyoung-Joo
    • Journal of KIISE:Databases / v.33 no.5 / pp.476-485 / 2006
  • The Resource Description Framework (RDF), which emerged with the semantic web, is becoming established as a standard for representing information about resources on the World Wide Web. Hence, much research on storing and querying RDF documents has been done, and several RDF storage systems, such as Sesame and Jena, have been developed. However, research on updating RDF documents is still insufficient. When an RDF document is changed, the data in the RDF triple store also needs to be updated, but current triple stores do not support incremental update, so an update can be performed only by deleting the old version and then storing the new document. This method is very inefficient because RDF documents are updated continually, and it becomes even worse when several RDF documents are stored in the same database. In this paper, we propose an incremental update algorithm for RDF documents in triple stores. We use a text matching technique on two versions of an RDF document and compensate the matching result to find the right target triples to be updated. We show through experiments with real-life RDF datasets that our approach updates RDF documents efficiently.
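The incremental-update idea can be sketched with a plain text diff over two N-Triples serializations. This is only an approximation of TripleDiff: the paper's compensation of the text-matching result is not reproduced, and the store is just a Python set.

```python
# Sketch: incremental triple-store update by diffing two versions of an RDF
# document serialized as N-Triples (one triple per line).
import difflib

def incremental_update(store: set, old_ntriples: str, new_ntriples: str) -> None:
    old_lines = old_ntriples.splitlines()
    new_lines = new_ntriples.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            for triple in old_lines[i1:i2]:
                store.discard(triple)     # remove triples no longer present
        if op in ("insert", "replace"):
            for triple in new_lines[j1:j2]:
                store.add(triple)         # add newly introduced triples

if __name__ == "__main__":
    old = '<#a> <#name> "Lee" .\n<#a> <#age> "30" .'
    new = '<#a> <#name> "Lee" .\n<#a> <#age> "31" .'
    store = set(old.splitlines())
    incremental_update(store, old, new)
    print(store)  # only the changed triple is replaced, not the whole document
```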

Methods for Integration of Documents using Hierarchical Structure based on the Formal Concept Analysis (FCA 기반 계층적 구조를 이용한 문서 통합 기법)

  • Kim, Tae-Hwan;Jeon, Ho-Cheol;Choi, Joong-Min
    • Journal of Intelligence and Information Systems / v.17 no.3 / pp.63-77 / 2011
  • The World Wide Web is a very large distributed digital information space. From its origins in 1991, the web has grown to encompass diverse information resources such as personal home pages, online digital libraries, and virtual museums. Some estimates suggest that the web currently includes over 500 billion pages in the deep web. The ability to search and retrieve information from the web efficiently and effectively is an enabling technology for realizing its full potential. With powerful workstations and parallel processing technology, efficiency is not a bottleneck; in fact, some existing search tools sift through gigabyte-size precompiled web indexes in a fraction of a second. But retrieval effectiveness is a different matter. Current search tools retrieve too many documents, of which only a small fraction are relevant to the user query, and the most relevant documents do not necessarily appear at the top of the query output. Current search tools also cannot retrieve documents related to a retrieved document from such a gigantic collection. The most important problem for many current search systems is therefore to increase the quality of search: to provide related documents and to keep the number of unrelated documents in the results as low as possible. To address this problem, CiteSeer proposed ACI (Autonomous Citation Indexing) of articles on the World Wide Web. A citation index indexes the links between articles that researchers make when they cite other articles. Citation indexes are very useful for a number of purposes, including literature search and analysis of the academic literature. In this setting, references contained in academic articles give credit to previous work and provide a link between the citing and cited articles; a citation index indexes the citations an article makes, linking the article with the cited works. Citation indexes were originally designed mainly for information retrieval, and the citation links allow navigating the literature in unique ways: papers can be located independently of language and of the words in the title, keywords, or document, and a citation index allows navigation backward in time (the list of cited articles) and forward in time (which subsequent articles cite the current article). However, CiteSeer cannot index links that researchers do not make explicitly, because it indexes only the links created when researchers cite other articles, and for the same reason it does not scale easily. These problems motivate the design of a more effective search system. This paper presents a method that extracts a subject and predicate from each sentence of a document. Each document is converted into a tabular form in which the extracted predicates are checked against the possible subjects and objects. From this table we build a hierarchical graph of each document and then integrate the graphs of all documents. Using the integrated graph, the area of each document is computed relative to the integrated documents, and relations among documents are marked by comparing these areas. We also propose a method for structural integration of documents that retrieves documents from the graph, so that users can find information more easily. We compared the performance of the proposed approaches with the Lucene search engine using ranking formulas. As a result, the F-measure is about 60%, an improvement of about 15%.
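A minimal sketch in the spirit of the approach, assuming each document has already been reduced to a set of (subject, predicate) attributes: a hierarchy is derived from attribute-set inclusion, and a simple overlap measure stands in for the paper's document "area". Sentence parsing and predicate extraction are not shown.

```python
# Sketch in the spirit of FCA: documents are objects, (subject, predicate) pairs
# are attributes, and a hierarchy is derived from attribute-set inclusion.

def build_hierarchy(doc_attrs: dict) -> list:
    """Return edges (general_doc -> specific_doc) where one attribute set contains another."""
    edges = []
    for a, attrs_a in doc_attrs.items():
        for b, attrs_b in doc_attrs.items():
            if a != b and attrs_a < attrs_b:   # proper subset: a is more general than b
                edges.append((a, b))
    return edges

def overlap_area(doc_attrs: dict, a: str, b: str) -> float:
    """Shared 'area' of two documents: fraction of attributes they have in common."""
    inter = doc_attrs[a] & doc_attrs[b]
    union = doc_attrs[a] | doc_attrs[b]
    return len(inter) / len(union) if union else 0.0

if __name__ == "__main__":
    docs = {
        "d1": {("web", "index"), ("page", "rank")},
        "d2": {("web", "index"), ("page", "rank"), ("citation", "link")},
        "d3": {("citation", "link")},
    }
    print(build_hierarchy(docs))          # [('d1', 'd2'), ('d3', 'd2')]
    print(overlap_area(docs, "d1", "d2"))  # 2 shared attributes out of 3
```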

A Web-based Remote Instruction System on Real-time using Action Synchronization between the Instructor and Learners (교수와 학습자간의 행동 동기화를 이용한 웹 기반의 실시간 원격 강의 시스템)

  • 이부권;박규석;서영건
    • Journal of Korea Multimedia Society / v.3 no.6 / pp.611-616 / 2000
  • Audio and documents are the most commonly used media for delivering content in remote instruction. A number of remote instruction systems have tried to offer video as well, but they have not achieved satisfactory results because of limited network bandwidth. Moreover, they rely on general web browsers through which many unspecified users access the content, so most systems that use continuous media have been unable to offer satisfactory content under these network limitations, and because they use ordinary web browsers they can offer only document (web page) content. In this paper, we propose a web-based real-time remote instruction system using audio and documents, the most important media for information delivery, together with an action synchronization mechanism between web browsers: when the instructor opens web pages on his computer and explains their content, the learners see the same web pages as the instructor and listen to his voice. (A minimal synchronization sketch follows this entry.)

  • PDF
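A minimal sketch of action synchronization, assuming a plain TCP relay in which the instructor's navigation events are broadcast to connected learner viewers; the message format and port are illustrative, not the paper's protocol.

```python
# Sketch: broadcast the instructor's navigation actions to every connected learner.
import socket
import threading

learners = []
lock = threading.Lock()

def accept_learners(host: str = "0.0.0.0", port: int = 9000) -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen()
    while True:
        conn, _addr = srv.accept()
        with lock:
            learners.append(conn)   # a learner's viewer has connected

def broadcast_action(action: str) -> None:
    """Send an instructor action, e.g. 'NAVIGATE http://example.com/slide2'."""
    message = (action + "\n").encode("utf-8")
    with lock:
        for conn in list(learners):
            try:
                conn.sendall(message)
            except OSError:
                learners.remove(conn)   # drop disconnected learners

if __name__ == "__main__":
    threading.Thread(target=accept_learners, daemon=True).start()
    # in a real system the instructor's UI would call this on every page change
    broadcast_action("NAVIGATE http://example.com/lecture/page1.html")
```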

Web Structure Mining by Extracting Hyperlinks from Web Documents and Access Logs (웹 문서와 접근로그의 하이퍼링크 추출을 통한 웹 구조 마이닝)

  • Lee, Seong-Dae;Park, Hyu-Chan
    • Journal of the Korea Institute of Information and Communication Engineering / v.11 no.11 / pp.2059-2071 / 2007
  • If the correct structure of a Web site is known, the information provider can discover users' behavior patterns and characteristics to offer better services, and users can find useful information easily and exactly. It is difficult, however, to extract the exact structure of a Web site because documents on the Web tend to change frequently. This paper proposes a new method for extracting such Web structure automatically. The method consists of two phases. The first phase extracts the hyperlinks among Web documents and constructs a directed graph representing the structure of the Web site; it cannot, however, discover hyperlinks embedded in Flash or Java applets. The second phase finds such hidden hyperlinks by using the Web access log: it first extracts click streams from the access log and then extracts the hidden hyperlinks by comparing them with the directed graph. Several experiments have been conducted to evaluate the proposed method.
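The two phases can be sketched as follows: links are harvested from page sources to form the directed graph, and click-stream edges taken from the access log that are missing from the graph are reported as hidden links. The log format here is a simplified (referrer, requested URL) pair list, not the paper's actual log processing.

```python
# Sketch of the two phases: (1) build a directed graph from <a href> links;
# (2) recover hidden links by comparing access-log click streams with that graph.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def build_graph(pages: dict) -> set:
    """pages: url -> HTML source. Returns directed edges (from_url, to_url)."""
    edges = set()
    for url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        edges.update((url, target) for target in parser.links)
    return edges

def hidden_links(graph: set, click_stream: list) -> set:
    """Edges observed in the access log but absent from the document graph."""
    return {edge for edge in click_stream if edge not in graph}

if __name__ == "__main__":
    pages = {"/index.html": '<a href="/a.html">A</a>'}
    graph = build_graph(pages)
    log = [("/index.html", "/a.html"), ("/index.html", "/flash-menu.html")]
    print(hidden_links(graph, log))   # the Flash-only link shows up here
```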

A Study on Layout Extraction from Internet Documents Through Xpath (Xpath에 의한 인터넷 문서의 레이아웃 추출 방법에 관한 연구)

  • Han Kwang-Rok;Sun Bok-Keun
    • The Journal of the Korea Contents Association / v.5 no.4 / pp.237-244 / 2005
  • Currently, most Internet documents, including news pages, are built from predefined templates, but the templates are usually designed only for the main data and are not helpful for handling indexes, advertisements, header data, etc. in information retrieval. Templates in such forms are not appropriate when Internet documents are used as data for information retrieval, so to process Internet documents in various areas of information retrieval it is necessary to detect additional information such as advertisements and page indexes. This study therefore proposes a method of detecting the layout of web pages by identifying the characteristics and structure of the block tags that affect the layout and by calculating distances between web pages. In experiments, the method successfully extracted the layout of 640 out of 1,000 sample documents, a recall rate of 64%. The method aims to reduce the cost of automatic web document processing and to improve its efficiency when applied to document preprocessing for information retrieval tasks such as data extraction and document summarization. (A minimal distance-computation sketch follows this entry.)

  • PDF
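A rough sketch of the block-tag comparison, assuming a page's layout is summarized by the paths of its block-level tags and two pages are compared with a Jaccard-style distance; this simplifies the paper's actual distance calculation.

```python
# Sketch: describe a page's layout by the paths of its block-level tags and
# compare two pages with a Jaccard distance over those path sets.
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table", "ul", "ol", "p", "form"}

class BlockPathCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag in BLOCK_TAGS:
            self.paths.add("/" + "/".join(self.stack))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def layout_distance(html_a: str, html_b: str) -> float:
    paths = []
    for source in (html_a, html_b):
        collector = BlockPathCollector()
        collector.feed(source)
        paths.append(collector.paths)
    union = paths[0] | paths[1]
    if not union:
        return 0.0
    return 1 - len(paths[0] & paths[1]) / len(union)

if __name__ == "__main__":
    a = "<html><body><div><p>news</p></div><div>ads</div></body></html>"
    b = "<html><body><div><p>other news</p></div></body></html>"
    print(layout_distance(a, b))   # small distance -> likely the same template
```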