• Title/Summary/Keyword: 대용량문서 색인 (large-scale document indexing)


A Knowledge-Based Intelligent Information Agent for Animal Domain (동물 영역 지식 기반의 지능형 정보 에이전트)

  • 이용현;오정욱;변영태
    • Korean Journal of Cognitive Science / v.10 no.1 / pp.67-78 / 1999
  • Information providers on the WWW have been increasing rapidly, and they supply a vast amount of information in various fields. As a result, it has become hard for users to get the information they want. Although several search engines help users with keyword-matching methods, it is not easy to find suitable keywords. To solve these problems within a specific domain, we propose an intelligent information agent (HIA: HongIk Information Agent) that converts users' queries into forms that include related domain words, so as to represent the user's intention as closely as possible, and that provides users with the necessary information of the domain. HIA has an ontological knowledge base of the animal domain, supplies the necessary information for queries from users and other agents, and provides relevant web page information. One of the system components is a WebDB, which indexes web pages relevant to the animal domain. The system also supplies new operators by which users can express their intent more clearly, and has a learning mechanism that uses accumulated results and user feedback to behave more intelligently. We implement the system and demonstrate the effectiveness of the information agent with experimental results.
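The query-conversion idea described in the abstract above can be sketched in a few lines. The snippet below is a minimal illustration, assuming a hypothetical hard-coded ontology (ANIMAL_ONTOLOGY) and an expand_query helper invented for this sketch; the actual HIA knowledge base, its operators, and its learning mechanism are not reproduced here.

```python
# Minimal sketch of ontology-based query expansion, in the spirit of the
# abstract above. The animal-domain ontology is a stand-in: a real system
# would load a knowledge base rather than use a hard-coded dict.

# Hypothetical toy ontology: each concept maps to related domain terms.
ANIMAL_ONTOLOGY = {
    "feline": ["cat", "lion", "tiger"],
    "canine": ["dog", "wolf", "fox"],
}

def expand_query(query: str, ontology: dict[str, list[str]]) -> list[str]:
    """Expand each query keyword with related terms from the ontology."""
    expanded = []
    for word in query.lower().split():
        expanded.append(word)
        # Add related terms when the keyword names a known concept...
        expanded.extend(ontology.get(word, []))
        # ...and also add the concept when the keyword is one of its members.
        for concept, members in ontology.items():
            if word in members and concept not in expanded:
                expanded.append(concept)
    return expanded

if __name__ == "__main__":
    print(expand_query("feline habitat", ANIMAL_ONTOLOGY))
    # ['feline', 'cat', 'lion', 'tiger', 'habitat']
```

Expanding "feline habitat" this way lets a plain keyword-matching engine retrieve pages that mention "lion" or "tiger" even though the user never typed those words, which is the gap the abstract's query conversion is meant to close.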


Methods for Integration of Documents using Hierarchical Structure based on the Formal Concept Analysis (FCA 기반 계층적 구조를 이용한 문서 통합 기법)

  • Kim, Tae-Hwan;Jeon, Ho-Cheol;Choi, Joong-Min
    • Journal of Intelligence and Information Systems / v.17 no.3 / pp.63-77 / 2011
  • The World Wide Web is a very large distributed digital information space. Since its origins in 1991, the web has grown to encompass diverse information resources such as personal home pages, online digital libraries, and virtual museums. Some estimates suggest that the web currently includes over 500 billion pages in the deep web. The ability to search and retrieve information from the web efficiently and effectively is an enabling technology for realizing its full potential. With powerful workstations and parallel processing technology, efficiency is not a bottleneck; in fact, some existing search tools sift through gigabyte-size precompiled web indexes in a fraction of a second. But retrieval effectiveness is a different matter. Current search tools retrieve too many documents, of which only a small fraction are relevant to the user query. Furthermore, the most relevant documents do not necessarily appear at the top of the query output order. Also, current search tools cannot retrieve documents related to a retrieved document from a gigantic collection of documents. The most important problem for many current search systems is to increase the quality of search, that is, to provide related documents and to keep the number of unrelated documents in the results as low as possible. To address this problem, CiteSeer proposed ACI (Autonomous Citation Indexing) of articles on the World Wide Web. A "citation index" indexes the links between articles that researchers make when they cite other articles. Citation indexes are very useful for a number of purposes, including literature search and analysis of the academic literature. In this setting, references contained in academic articles are used to give credit to previous work in the literature and provide a link between the "citing" and "cited" articles. A citation index indexes the citations that an article makes, linking the article with the cited works. Citation indexes were originally designed mainly for information retrieval. The citation links allow navigating the literature in unique ways: papers can be located independent of language and of the words in the title, keywords, or document. A citation index allows navigation backward in time (the list of cited articles) and forward in time (which subsequent articles cite the current article?). However, CiteSeer cannot index links between articles that researchers do not explicitly make, because it indexes only the links researchers create when they cite other articles; for the same reason, it does not scale easily. All these problems motivate the design of a more effective search system. This paper presents a method that extracts the subject and predicate of each sentence in a document. Each document is converted into a tabular form in which each extracted predicate is checked against its possible subjects and objects. We build a hierarchical graph of each document from this table and then integrate the graphs of the documents. Against the graph of the entire document set, we calculate the area of each document relative to the integrated graph and mark the relations among documents by comparing these areas. The paper also proposes a method for the structural integration of documents that retrieves documents from the graph, making it easier for users to find information. We compared the performance of the proposed approach with the Lucene search engine using ranking formulas. As a result, the F-measure is about 60%, which is about 15% better.
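The FCA step described above can be illustrated with a small formal context. In the sketch below, sentences play the role of objects and extracted predicates the role of attributes; the incidence table is invented for illustration, and the paper's subject/object extraction, graph integration, and area-based comparison are not reproduced.

```python
from itertools import combinations

# Toy formal context: objects are document sentences, attributes are
# predicates extracted from them. This incidence table is invented; a real
# pipeline would fill it from parsed subject-predicate-object triples.
OBJECTS = ["s1", "s2", "s3"]
ATTRIBUTES = ["eats", "lives_in", "hunts"]
INCIDENCE = {
    "s1": {"eats", "lives_in"},
    "s2": {"eats", "hunts"},
    "s3": {"eats", "lives_in", "hunts"},
}

def extent(attrs):
    """Objects that have every attribute in attrs."""
    return {o for o in OBJECTS if attrs <= INCIDENCE[o]}

def intent(objs):
    """Attributes shared by every object in objs."""
    shared = set(ATTRIBUTES)
    for o in objs:
        shared &= INCIDENCE[o]
    return shared

def formal_concepts():
    """Naively enumerate all (extent, intent) pairs by closing every
    attribute subset; fine for a toy context, exponential in general."""
    concepts = set()
    for r in range(len(ATTRIBUTES) + 1):
        for subset in combinations(ATTRIBUTES, r):
            e = extent(set(subset))
            concepts.add((frozenset(e), frozenset(intent(e))))
    return concepts

for e, i in sorted(formal_concepts(), key=lambda c: len(c[0])):
    print(sorted(e), "<->", sorted(i))
```

The enumerated concepts, ordered by extent inclusion, form the concept lattice that gives each document its hierarchical graph in the approach the abstract describes.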

Implementation of a Large-scale Web Query Processing System Using the Multi-level Cache Scheme (계층적 캐시 기법을 이용한 대용량 웹 검색 질의 처리 시스템의 구현)

  • Lim, Sung-Chae
    • Journal of KIISE: Computing Practices and Letters / v.14 no.7 / pp.669-679 / 2008
  • With the increasing demand for information sharing and search via the web, web search engines have drawn much attention. Although much research has been done on the technical challenges of building a web search engine, the issues surrounding its query processing system are rarely dealt with. Since the software architecture and operational schemes of the query processing system are hard to elaborate, we present related techniques implemented in a commercial system. The implemented system is a very large-scale system that can process 5 million user queries per day using index files built on about 65 million web pages. For performance, we implement a multi-level cache scheme that saves already-returned query results; the multi-level cache is managed across four levels of cache storage. Using the multi-level cache, we improve system throughput by a factor of 4, thereby reducing server cost by around 70%.
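As a rough illustration of the caching idea in this abstract, here is a minimal multi-level LRU cache for query results. The four level capacities, the LRU eviction policy, and the promote-on-hit / demote-on-eviction behavior are assumptions made for this sketch; the commercial system's actual storage areas and policies may differ.

```python
from collections import OrderedDict

class LruLevel:
    """One cache level with LRU eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)        # refresh recency on hit
        return self.entries[key]

    def put(self, key, value):
        """Insert; return the evicted (key, value) pair, if any."""
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)  # evict LRU entry
        return None

class MultiLevelCache:
    """Search levels top-down; promote hits to L1; demote L1 evictions."""
    def __init__(self, capacities=(100, 1_000, 10_000, 100_000)):
        # Four levels, smallest and fastest first; sizes are assumptions.
        self.levels = [LruLevel(c) for c in capacities]

    def get(self, query):
        for i, level in enumerate(self.levels):
            result = level.get(query)
            if result is not None:
                if i > 0:                    # promote hot entries upward
                    del level.entries[query]
                    self.put(query, result)
                return result
        return None

    def put(self, query, result):
        # Insert at L1; cascade each level's eviction into the next level.
        victim = self.levels[0].put(query, result)
        for level in self.levels[1:]:
            if victim is None:
                break
            victim = level.put(*victim)

cache = MultiLevelCache()
cache.put("seoul weather", ["doc12", "doc7"])
print(cache.get("seoul weather"))  # ['doc12', 'doc7']
```

The design intuition matches the abstract's throughput claim: hot queries are answered from the small upper level, while colder results age down through progressively larger levels instead of being recomputed from the index.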