• Title/Summary/Keyword: Document information retrieval

Efficient Internet Information Extraction Using Hyperlink Structure and Fitness of Hypertext Document (웹의 연결구조와 웹문서의 적합도를 이용한 효율적인 인터넷 정보추출)

  • Hwang Insoo
    • Journal of Information Technology Applications and Management / v.11 no.4 / pp.49-60 / 2004
  • While the World-Wide Web offers an incredibly rich base of information organized as hypertext, it does not provide a uniform and efficient way to retrieve specific information. An efficient web crawler is therefore needed to gather useful information in an acceptable amount of time. In this paper, we study the order in which a web crawler should visit URLs to obtain the more important web pages rapidly. We also develop an internet agent for efficient web crawling that uses the hyperlink structure and the fitness of hypertext documents. An experiment on a website shows that the proposed agent outperforms other web crawlers that use the BackLink and PageRank algorithms.
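
The idea of ordering crawl visits by link structure and document fitness can be sketched as a best-first traversal. This is only an illustration, not the agent described in the paper: the in-memory `graph`, the per-page `fitness` scores, the weight `alpha`, and the linear scoring formula are all assumptions made for the example.

```python
def crawl_order(graph, fitness, seed, alpha=0.5):
    """Return the visit order of a best-first crawl over an in-memory link graph.

    graph:   dict mapping a URL to the list of URLs it links to (hypothetical data)
    fitness: dict mapping a URL to a relevance score in [0, 1] (hypothetical data)
    alpha:   assumed weight between backlink count and fitness
    """
    backlinks = {}      # URL -> number of already-visited pages that link to it
    frontier = {seed}   # discovered but not yet visited
    done = set()
    visited = []

    def score(url):
        return alpha * backlinks.get(url, 0) + (1 - alpha) * fitness.get(url, 0.0)

    while frontier:
        # Sorting first makes ties resolve alphabetically, so the order is deterministic.
        url = max(sorted(frontier), key=score)
        frontier.remove(url)
        done.add(url)
        visited.append(url)
        for target in graph.get(url, []):
            backlinks[target] = backlinks.get(target, 0) + 1
            if target not in done:
                frontier.add(target)
    return visited


example_graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": []}
example_fitness = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.5}
order = crawl_order(example_graph, example_fitness, "a")
```

Scores are recomputed at every step because a page's backlink count grows as more pages are visited; a production crawler would more likely use a priority queue with lazy updates.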

A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

  • Sun, Bok-Keun
    • Journal of the Korea Society of Computer and Information / v.20 no.7 / pp.1-7 / 2015
  • Data on the internet can be used in various fields, for example as a data source for IR (Information Retrieval), data mining, and knowledge information services, but it also contains a lot of unnecessary information. Removing the unnecessary data is a problem that must be solved before studying knowledge-based information services built on web page data. In this paper, we solve the problem through the implementation of XTractor (XPath Extractor). Since XPath is used to navigate the elements and attribute data of an XML document, the XPath analysis is carried out through XTractor. XTractor extracts the main text by HTML parsing, XPath grouping, and detecting the XPath that contains the main data. On a large amount of experimental data, the recognition rate and precision were 97.9% and 93.9% respectively, except for a few cases, confirming that the main text of news pages can be extracted properly.
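
The three steps named in the abstract (HTML parsing, XPath grouping, and picking the XPath that holds the main data) can be illustrated with the Python standard library. This is a simplified sketch, not XTractor itself: the paths ignore attributes and positional indices, and "the group with the most text wins" is an assumed heuristic. The class `PathTextCollector` and the function `extract_main_text` are hypothetical names.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"script", "style"}


class PathTextCollector(HTMLParser):
    """Groups text fragments by the tag path (a simplified XPath) that contains them."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] not in SKIP_TAGS:
            self.text_by_path["/" + "/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return the text of the path group holding the most characters (assumed heuristic)."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    groups = collector.text_by_path
    if not groups:
        return ""
    best = max(groups, key=lambda path: sum(len(t) for t in groups[path]))
    return " ".join(groups[best])


SAMPLE_HTML = """
<html><body>
<div><a href="/">Home</a><a href="/news">News</a></div>
<div><p>First paragraph of the story with many words.</p>
<p>Second paragraph continues the story.</p></div>
<div><span>Copyright</span></div>
</body></html>
"""
main_text = extract_main_text(SAMPLE_HTML)
```

Here the two `<p>` elements share the path `/html/body/div/p`, so their text is grouped together and outweighs the navigation links and the footer.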

Design & Evaluation of an Intelligent Model for Extracting the Web User' Preference (웹 사용자의 선호도 추출을 위한 지능모델 설계 및 평가)

  • Kim, Kwang-Nam;Yoon, Hee-Byung;Kim, Hwa-Soo
    • Journal of the Korean Institute of Intelligent Systems / v.15 no.4 / pp.443-450 / 2005
  • In this paper, we propose an intelligent model for extracting web users' preferences and present the results of its evaluation. For this purpose, we analyze the shortcomings of the information retrieval engines currently in use and reflect preference weights in the learner. Because it does not depend on the frequency of each word but intelligently learns patterns of user behavior, the mechanism provides an appropriate set of results for the user's questions. We then propose the concept of a preference trend and its considerations, and present an algorithm for extracting preferences with examples. We also design an intelligent model for extracting behavior patterns and propose an HTML index and an intelligent learning process for preference decisions. Finally, we validate the proposed model by comparing the estimated results (after applying the preference) of document ranking measurement.

Known-Item Retrieval Performance of a PICO-based Medical Question Answering Engine

  • Vong, Wan-Tze;Then, Patrick Hang Hui
    • Asia pacific journal of information systems / v.25 no.4 / pp.686-711 / 2015
  • The performance of a novel medical question-answering engine called CliniCluster and of existing search engines, such as CQA-1.0, Google, and Google Scholar, was evaluated using known-item searching. A known item is a document that has been critically appraised to be highly relevant to a therapy question. Results show that, using CliniCluster, known items were retrieved on average at rank 2 ($\mathrm{MRR}@10 \approx 0.50$), and most of the known items could be identified from the top-10 document lists. In response to ill-defined questions, the known items were ranked lower by CliniCluster and CQA-1.0, whereas for Google and Google Scholar no significant difference in ranking was found between well- and ill-defined questions. Less than 40% of the known items could be identified from the top-10 documents retrieved by CQA-1.0, Google, and Google Scholar. An analysis of the top-ranked documents by strength of evidence revealed that CliniCluster outperformed the other search engines by providing a higher number of recent publications with the highest level of study design. In conclusion, the overall results support the use of CliniCluster in answering therapy questions by ranking highly relevant documents in the top positions of the search results.
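
For reference, MRR@10 is the mean over all questions of the reciprocal rank of the known item, counting a question as 0 when the item is not in the top 10. An MRR of about 0.50 therefore corresponds to an average rank of 2 in the harmonic sense (1 / 0.50 = 2). A minimal sketch follows; the function name `mrr_at_k` and the sample ranks are made up for illustration and are not data from the study.

```python
def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank with cutoff k.

    ranks: for each question, the 1-based rank of its known item,
           or None if the item was not retrieved at all.
    """
    total = 0.0
    for rank in ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(ranks)


# Illustrative ranks only: four questions, one of which missed its known item.
sample_ranks = [1, 2, 4, None]
sample_mrr = mrr_at_k(sample_ranks)
```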

A Way to Speed up Evaluation of Path-oriented Queries using An Abbreviation-paths and An Extendible Hashing Technique (단축-경로와 확장성 해싱 기법을 이용한 경로-지향 질의의 평가속도 개선 방법)

  • Park Hee-Sook;Cho Woo-Hyun
    • The KIPS Transactions:PartD / v.11D no.7 s.96 / pp.1409-1416 / 2004
  • Recently, due to the popularity and explosive growth of the Internet, information exchange over the Internet has been increasing dramatically. XML is also becoming a standard and a major tool for data exchange on the Internet, so speeding up the evaluation of path-oriented queries is a main issue in retrieving XML documents. In this paper, we propose a new indexing technique to improve the search performance of path-oriented queries in document databases. The technique generates an abbreviation-path file for performing path-oriented queries efficiently, whose hash-code values can be used as index keys. The technique can be further enhanced by combining the Extendible Hashing technique with the abbreviation-path file to speed up retrieval.

The Avata Construction System for Image Lossless Scaling (이미지 손실없는 확대/축소가 가능한 아바타 생성 시스템)

  • 김원중;장미화
    • Journal of the Korea Institute of Information and Communication Engineering / v.6 no.2 / pp.181-189 / 2002
  • In this paper, we design and implement an Avata construction system using XML (eXtensible Markup Language) and SVG (Scalable Vector Graphics). Web characters created with the Avata (or Web character) construction system are displayed in the same form without image degradation regardless of terminal type, and users can easily modify the image into the form they want. Compared with existing Web character systems, the reusability of Web character part elements is greatly increased with the Avata construction system of this paper. Because SVG is described by text, graphic retrieval is convenient and applications can easily use SVG documents. SVG can also create web graphic documents dynamically from a database, because all graphic primitives such as lines, polygons, text, and images are easily accessible. Beyond Web characters, the findings of this study may be used to develop technology applicable to other content on the World Wide Web.

Automatic term-network construction for Oral Documents (구술문서에 기초한 자동 용어 네트워크 구축)

  • Park, Soon-Cheol
    • Journal of Korea Society of Industrial Information Systems / v.12 no.4 / pp.25-31 / 2007
  • This paper proposes an automatic term-network construction system. The system uses statistical values of the terms that appear in a document corpus. The research uses 186 oral history documents collected from the Saemangeum area of Chollapuk-do, Korea. The term relationships presented in the term-network are determined by the cosine similarities of the term vectors. About 1700 terms were extracted from the documents. The system can show term relationships from the term-network almost as quickly as a real-time system. This method of term-network construction is expected to become one way to construct ontology systems and to support semantic retrieval systems in the near future.
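
Deciding term relationships by the cosine similarity of term vectors can be sketched as follows. The abstract does not say how the vectors are built or how a relationship is accepted, so the term-by-document count vectors, the `threshold` value, and the function names here are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_term_network(term_vectors, threshold=0.5):
    """Return (term, term, similarity) edges for pairs at or above the assumed threshold."""
    terms = sorted(term_vectors)
    edges = []
    for i, first in enumerate(terms):
        for second in terms[i + 1:]:
            sim = cosine_similarity(term_vectors[first], term_vectors[second])
            if sim >= threshold:
                edges.append((first, second, sim))
    return edges


# Hypothetical counts of each term in three documents.
example_vectors = {
    "sea": [2, 0, 1],
    "fishing": [1, 0, 1],
    "rice": [0, 3, 0],
}
edges = build_term_network(example_vectors)
```

With about 1700 terms, the pairwise comparison covers roughly 1.4 million pairs, so the similarities would normally be computed once and stored to keep lookups fast.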

A Study on Ranking Retrieved Documents Utilizing Term Relationship (용어간 관계를 이용한 검색문헌의 순위부여에 관한 연구)

  • Gang, Il-Jung;Jeong, Yeong-Mi
    • Journal of the Korean Society for information Management / v.8 no.1 / pp.100-116 / 1991
  • This study implements a retrieval system that uses term relationships in a specific domain, together with evidential reasoning, as tools for measuring relevance. For the experiment, technical memoranda documented at the Electronics and Telecommunications Research Institute (ETRI) served as the sample document file. A sample knowledge base was prepared by extracting terms and term relations pertaining to telecommunications from the INSPEC thesaurus. Relations between terms were represented by numerical values according to the type of term relation. The relationship between a query and a document was measured according to the Dempster-Shafer theory of evidence. In the experiment, expanding the search terms through term relations produced a more comprehensive search. The measure of relevance reflected term relations, and search results were listed in descending order of relevance.
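
In the Dempster-Shafer theory of evidence, independent pieces of evidence are merged with Dempster's rule of combination. The sketch below shows only that rule; how the study derives mass values from term relations is not stated in the abstract, so the two mass functions and the frame {relevant, irrelevant} are invented for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule for two mass functions given as {frozenset: mass} dicts."""
    combined = {}
    conflict = 0.0
    for (set_a, mass_a), (set_b, mass_b) in product(m1.items(), m2.items()):
        intersection = set_a & set_b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict")
    # Renormalize so the remaining masses sum to 1.
    return {s: mass / (1.0 - conflict) for s, mass in combined.items()}


# Hypothetical evidence about one document from two matching query terms.
RELEVANT = frozenset({"relevant"})
IRRELEVANT = frozenset({"irrelevant"})
EITHER = RELEVANT | IRRELEVANT  # mass left uncommitted

evidence_1 = {RELEVANT: 0.6, EITHER: 0.4}
evidence_2 = {RELEVANT: 0.5, IRRELEVANT: 0.2, EITHER: 0.3}
belief = combine(evidence_1, evidence_2)
```

Ranking documents by the combined mass on `RELEVANT` would give a descending order of relevance, as the abstract describes for its search results.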

Patent Document Similarity Based on Image Analysis Using the SIFT-Algorithm and OCR-Text

  • Park, Jeong Beom;Mandl, Thomas;Kim, Do Wan
    • International Journal of Contents / v.13 no.4 / pp.70-79 / 2017
  • Images are an important element in patents, and many experts use images to analyze a patent or to check differences between patents. However, there is little research on image analysis for patents, partly because image processing is an advanced technology and because patent images typically consist of visual parts as well as text and numbers. This study suggests two methods that use image processing: the Scale Invariant Feature Transform (SIFT) algorithm and Optical Character Recognition (OCR). The first method works with SIFT and uses image feature points. Through feature matching, it can be applied to calculate the similarity between documents containing these images. In the second method, OCR is used to extract text from the images. Using the numbers extracted from an image, it is possible to extract the corresponding related text from the text passages. Document similarity can then be calculated based on the extracted text. A comparison of the suggested methods with an existing method that calculates similarity from text alone demonstrates their feasibility. Additionally, the correlation between the two similarity measures is low, which shows that they capture different aspects of the patent content.

The Refinement Effect of Foreign Word Transliteration Query on Meta Search (메타 검색에서 외래어 질의 정제 효과)

  • Lee, Jae-Sung
    • The KIPS Transactions:PartB / v.15B no.2 / pp.171-178 / 2008
  • Foreign word transliterations are not used consistently in documents, which hinders the retrieval of some important relevant documents in information retrieval systems based on exact term matching. This paper proposes a meta search method that expands an original foreign word transliteration query into relevant variant queries and refines them in order to retrieve more relevant documents. First, the method expands a transliteration query into its variants using a statistical method. Second, the method selects the valid variants: it submits each variant to the retrieval systems beforehand and checks the validity of each variant by counting the number of times the variant appears in the retrieved documents and by calculating the similarity of the variant's context. Experimental results showed that querying with the variants produced in the first step, which is the baseline method of the test, achieved an average F-measure of 38%, while querying with the refined variants from the second step, which is the proposed method, significantly improved the average F-measure to 81%.