• Title/Summary/Keyword: Document Retrieval

Clustering XML Documents Considering The Weight of Large Items in Clusters (클러스터의 주요항목 가중치 기반 XML 문서 클러스터링)

  • Hwang, Jeong-Hee
    • The KIPS Transactions:PartD / v.14D no.1 s.111 / pp.1-8 / 2007
  • As XML, an exchange language for data on the Internet, is used for a growing number of web documents, these documents have become a target of information retrieval. Accordingly, there has been research on the structure, integration, and retrieval of XML documents. This paper proposes a clustering method for XML documents based on frequent structures, as basic research toward efficient query processing and retrieval. First, the trees representing XML documents are decomposed and frequent structures are extracted from them. Second, treating the frequent structures as items of transactions, we perform clustering that considers the weight of large items in order to adjust cluster creation and cluster cohesion. Third, we demonstrate the effectiveness of our method through experiments that compare it with previous methods.
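
The first step of the abstract above (decompose document trees, keep the frequent structures) can be illustrated with a minimal sketch. It assumes that a "structure" is a root-to-node tag path and that "frequent" means appearing in at least `min_support` documents; the abstract does not specify either, and the function names `tag_paths` and `frequent_paths` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def frequent_paths(docs, min_support):
    """Keep the paths that occur in at least min_support documents.

    Assumption: document frequency is the support measure.
    """
    counts = Counter()
    for doc in docs:
        # Each document contributes at most once per path.
        counts.update(tag_paths(doc))
    return {path for path, count in counts.items() if count >= min_support}
```

Each document can then be treated as a transaction whose items are the frequent paths it contains, which is the input the clustering step operates on.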

Personal Electronic Document Retrieval System Using Semantic Web/Ontology Technologies (시멘틱 웹/온톨로지 기술을 이용한 개인용 전자문서 검색 시스템)

  • Kim, Hak-Lae;Kim, Hong-Gee
    • The Journal of Society for e-Business Studies / v.12 no.1 / pp.135-149 / 2007
  • Many applications and software components exist for managing files on a local computer, but it remains difficult to organize personal documents consistently and to find the desired ones precisely. In this paper, we present a document management and retrieval tool named Ontalk. The system provides a semi-automatic metadata generator and an ontology-based search engine for electronic documents. Ontalk can create and import various ontologies in RDFS or OWL for describing the metadata. Built on .NET technology, the system can easily communicate with, or be flexibly plugged into, many different programs.

A Study on Layout Extraction from Internet Documents Through Xpath (Xpath에 의한 인터넷 문서의 레이아웃 추출 방법에 관한 연구)

  • Han Kwang-Rok;Sun Bok-Keun
    • The Journal of the Korea Contents Association / v.5 no.4 / pp.237-244 / 2005
  • Most Internet documents, including news data, are currently built from predefined templates, but the templates usually cover only the main data and do not help information retrieval deal with indexes, advertisements, header data, and the like. Templates of this kind are therefore not appropriate when Internet documents are used as data for information retrieval. To process Internet documents in the various areas of information retrieval, it is necessary to detect additional information such as advertisements and page indexes. This study therefore proposes a method for detecting the layout of web pages by identifying the characteristics and structure of the block tags that affect page layout and by calculating distances between web pages. In the experiment, we successfully extracted 640 documents from 1,000 samples, a recall rate of 64%. The method is intended to reduce the cost and improve the efficiency of automatic web document processing by applying it to document preprocessing for information retrieval tasks such as data extraction and document summarization.

A Design and Implementation of XML Document storing and retrieval Framework based on a variant k-ary complete tree and RDF Metadata (가변 K진 완전트리와 RDF메타정보에 기반한 XML문서 저장 및 검색 프레임워크의 설계 및 구현)

  • 김규태;정회경;이수연
    • Journal of the Korea Institute of Information and Communication Engineering / v.7 no.4 / pp.612-622 / 2003
  • This paper proposes an XML document storage and retrieval framework based on a variant k-ary complete tree and RDF metadata. The framework is composed of a storing module that stores XML documents, a retrieving module that retrieves them, and a connecting module that lets the system interoperate in the web environment. The storing module adopts and implements a DTD-independent, DOM-based decomposition model that assigns unique IDs using a variant k-ary complete tree. The query processing module includes XPath query processing and a content-based retrieval function that uses a word index for content information. To retrieve data more precisely, structural retrieval using RDF metadata is adopted and implemented. To implement the XML document storage and retrieval system effectively in the web environment, APIs using XML-RPC, HTTP GET, PUT, and POST, and SOAP have been adopted and implemented.
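
The ID scheme mentioned above can be illustrated with the textbook numbering of a plain complete k-ary tree. This is a sketch under assumptions: it uses a fixed `k` and root ID 1, and it does not reproduce the paper's "variant" tree, whose details the abstract does not give. The function names are hypothetical.

```python
def child_id(parent: int, j: int, k: int) -> int:
    """ID of the j-th child (1-based) of `parent` in a complete k-ary tree."""
    if not 1 <= j <= k:
        raise ValueError("child index must be between 1 and k")
    return k * (parent - 1) + j + 1


def parent_id(node: int, k: int) -> int:
    """ID of the parent of `node`; the root (ID 1) has no parent."""
    if node <= 1:
        raise ValueError("the root has no parent")
    return (node - 2) // k + 1


def ancestor_ids(node: int, k: int) -> list:
    """IDs from the parent of `node` up to the root, nearest first."""
    path = []
    while node != 1:
        node = parent_id(node, k)
        path.append(node)
    return path
```

With this numbering, parent-child and ancestor relationships between stored elements can be computed arithmetically from the IDs alone, without following stored links.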

Design and Implementation of a SGML Index Manager for Dynamic Environment (동적 환경에 적합한 SGML 인덱스 관리자의 설계 및 구현)

  • Han, Seong-Geun;Son, Jeong-Han;Jang, Jae-U;Kim, Hyeon-Gi;Gang, Hyeon-Gyu
    • The Transactions of the Korea Information Processing Society / v.6 no.10 / pp.2574-2586 / 1999
  • Since an SGML document is composed of elements, the primitive units of information, SGML information retrieval should support retrieval on elements as well as on documents. In addition, the SGML index organization should support partial insertion and deletion of documents in a dynamic environment. To this end, we propose an SGML index organization suited to structure-based retrieval in a dynamic environment. Based on the proposed index organization, we design an SGML index manager that efficiently supports content-based and structure-based retrieval. We implement the SGML index manager on the O2 storage system and compare its performance with that of a conventional SGML index manager. The performance comparison shows that the proposed index structure achieves better retrieval performance than the conventional K-ary complete tree.

Word Embeddings-Based Pseudo Relevance Feedback Using Deep Averaging Networks for Arabic Document Retrieval

  • Farhan, Yasir Hadi;Noah, Shahrul Azman Mohd;Mohd, Masnizah;Atwan, Jaffar
    • Journal of Information Science Theory and Practice / v.9 no.2 / pp.1-17 / 2021
  • Pseudo relevance feedback (PRF) is a powerful query expansion (QE) technique that prepares queries using the top k pseudo-relevant documents and choosing expansion elements. Traditional PRF frameworks have robustly handled vocabulary mismatch between user queries and pertinent documents; nevertheless, expansion elements are chosen without regard to their similarity to the original query's elements. Word embedding (WE) schemes are techniques of significant interest for QE, which falls within the information retrieval domain. Deep averaging networks (DANs) define a framework that relies on average word presence passed through multiple linear layers. The complete query can thus be represented by the average vector of the query terms, and this vector may be used to determine expansion elements pertinent to the entire query. In this study, we propose a DAN-based technique that augments PRF frameworks by integrating WE similarities to facilitate Arabic information retrieval. The technique is based on the principle that the top pseudo-relevant document set is assessed to determine the candidate element distribution and to select expansion terms appropriately, considering their similarity to the average vector representing the initial query elements. The Word2Vec model is used to run the experiments on a standard Arabic TREC 2001/2002 set. The majority of the evaluations indicate that the PRF implementation in the present study offers a significant performance improvement over the baseline PRF frameworks.
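
The selection step described above, ranking candidate terms by similarity to the average vector of the query terms, can be sketched as follows. This is an illustration only: it covers the averaging and cosine-similarity step, not the linear layers of a DAN, and the toy two-dimensional embeddings and the function names are assumptions, not the authors' Word2Vec model or code.

```python
import math

# Hypothetical toy embeddings; a real system would load trained vectors.
TOY_EMBEDDINGS = {
    "river": [1.0, 0.0],
    "bank": [0.8, 0.2],
    "water": [0.9, 0.1],
    "loan": [0.0, 1.0],
    "money": [0.1, 0.9],
}


def average_vector(terms, embeddings):
    """Component-wise mean of the vectors of the known terms."""
    vectors = [embeddings[t] for t in terms if t in embeddings]
    if not vectors:
        raise ValueError("no query term has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_expansion_terms(query_terms, candidates, embeddings, n=2):
    """Pick the n candidates most similar to the averaged query vector."""
    query_vec = average_vector(query_terms, embeddings)
    scored = [
        (cosine(query_vec, embeddings[t]), t)
        for t in candidates
        if t in embeddings and t not in query_terms
    ]
    scored.sort(reverse=True)
    return [term for _, term in scored[:n]]
```

In a PRF setting, `candidates` would be the terms drawn from the top k pseudo-relevant documents, so the similarity score acts as a filter on terms that co-occur with the query but are unrelated to it as a whole.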

Multi-class Support Vector Machines Model Based Clustering for Hierarchical Document Categorization in Big Data Environment (빅 데이터 환경에서 계층적 문서 유형 분류를 위한 클러스터링 기반 다중 SVM 모델)

  • Kim, Young Soo;Lee, Byoung Yup
    • The Journal of the Korea Contents Association / v.17 no.11 / pp.600-608 / 2017
  • Data volumes have recently been growing exponentially with the rapid expansion of the Internet. Because users need only part of all this information, they bear a heavy workload in examining and discovering the necessary content. Information retrieval must therefore provide hierarchical class information and a priority for examination by evaluating the similarity between queries and documents. In this paper, we propose a clustering-based multi-class support vector machine model for hierarchical document categorization that makes semantic search possible by considering word co-occurrence measures. Combining hierarchical document categorization with an SVM classifier gives high performance for the analytical classification of web documents, whose number increases exponentially as the document hierarchy is extended. We expect more information retrieval systems to adopt the proposed model in their development and to provide an accurate and rapid information retrieval service.

Retrieval algorithm for Web Document using XML DOM (XML DOM을 이용한 웹문서 검색 알고리즘)

  • 김노환;정충교
    • Journal of the Korea Computer Industry Society / v.2 no.6 / pp.775-782 / 2001
  • Until recently, Web search engines have presented documents to users according to the number and frequency of the query keywords in each document, on the assumption that the more keywords a document contains, the more relevant it is. This search method poses no problem for ordinary documents such as HTML Web data, which carry no structural information. However, Web data expressed in XML contains structural information and also allows modeling in graphic form. For XML, therefore, a method that depends only on keyword frequency causes considerable trouble. We consider that this problem can be resolved by a query form similar to SQL. Such queries let us retrieve exactly the data we want quickly and clearly by taking full advantage of the structural nature of XML, overcoming the shortcomings of frequency-based engines. In this paper, we aim to design a model of an information retrieval system for XML data using the XML DOM and consider the related algorithm.

Fuzzy operator analyses to improve retrieval effectiveness of the fuzzy set model (퍼지 집합 모델의 검색 효율 개선을 위한 퍼지 연산자의 분석)

  • 이준호;김원용;이윤준;김명호
    • Journal of the Korean Society for information Management / v.10 no.1 / pp.53-63 / 1993
  • The conventional fuzzy set model has been criticized as a retrieval model because the MIN and MAX operators have properties that are adverse to the effective calculation of document values. Since fuzzy set theory was first introduced, a variety of fuzzy operators have been developed that can replace the MIN and MAX operators. We analyze how these operators behave in generating document values, and propose an enhanced fuzzy set model based on a class of fuzzy operators called positively compensatory operators. We also show through performance experiments that the proposed fuzzy set model provides higher retrieval effectiveness.
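
A small sketch can show the kind of behavior the criticism of MIN and MAX refers to. In the conventional fuzzy set model a document's value for "A AND B" is the minimum of its term weights and for "A OR B" the maximum, so only one weight decides the score. The arithmetic mean below is used purely as a stand-in for an operator whose result lies between MIN and MAX; whether it belongs to the paper's class of positively compensatory operators is an assumption, and the weights are made up for illustration.

```python
def and_min(weights):
    """Conventional fuzzy AND: the document value is the smallest term weight."""
    return min(weights)


def or_max(weights):
    """Conventional fuzzy OR: the document value is the largest term weight."""
    return max(weights)


def and_mean(weights):
    """Stand-in averaging operator (assumption): every weight affects the value."""
    return sum(weights) / len(weights)


# Hypothetical term weights for the query "A AND B".
doc_1 = [0.9, 0.3]  # strong match on A, weak on B
doc_2 = [0.3, 0.3]  # weak match on both terms

# MIN cannot tell the two documents apart, the averaging operator can.
min_scores = (and_min(doc_1), and_min(doc_2))
mean_scores = (and_mean(doc_1), and_mean(doc_2))
```

Under MIN both documents receive the same value even though `doc_1` matches term A far better, which is one way a single-operand operator can hurt ranking.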


A Database Approach for Modeling and Querying XML Documents

  • Panseop Shin;Kim, Jeong-Eun;Lee, Jaeho;Haechull Lim
    • Proceedings of the IEEK Conference / 2000.07b / pp.703-706 / 2000
  • In recent years, XML applications have been developed in diverse areas. In particular, XML document repository systems associated with databases are being developed widely. Previous research on XML repository systems has several defects: limitations on updating and retrieving XML documents, limitations in designing a formal retrieval algorithm, and data redundancy. To solve these problems, this paper suggests relational database schemas that overcome the limitations on updating, retrieving, and rebuilding documents. We also suggest a query translation strategy based on two-phase translation, which consists of a pattern-analyzing phase and an SQL-generating phase.
