• Title/Abstract/Keywords: Document Retrieval

Search results: 450 items (processing time: 0.023 seconds)

Query Formulation for Heuristic Retrieval in Obfuscated and Translated Partially Derived Text

  • Kumar, Aarti;Das, Sujoy
    • Journal of Information Science Theory and Practice / Vol. 3, No. 1 / pp. 24-39 / 2015
  • Pre-retrieval query formulation is an important step in identifying local text reuse. Local reuse involving heavy obfuscation, paraphrasing, or translation makes the reused text hard to find in a document. This paper studies three pre-retrieval query formulation strategies for heuristic retrieval of low-obfuscated, high-obfuscated, and translated text: (a) query formulation using proper nouns; (b) query formulation using unique words (hapax); and (c) query formulation using the most frequent words. For low and high obfuscation and simulated paraphrasing, keywords with hapax proved slightly more efficient. However, initial results indicate that the simple strategy of query formulation using proper nouns gives promising results and may be better at reducing the size of the corpus for post-processing when identifying local text reuse in obfuscated and translated text.
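
The three query formulation strategies named in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the capitalization heuristic for proper nouns, the tiny stopword list, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "for"}


def tokenize(text):
    """Split text into alphabetic tokens, keeping the original case."""
    return re.findall(r"[A-Za-z]+", text)


def proper_noun_query(text):
    """(a) Crude proper-noun heuristic: capitalized words that are not sentence-initial."""
    nouns = []
    for sentence in re.split(r"[.!?]\s+", text):
        for word in tokenize(sentence)[1:]:
            if word[0].isupper() and word not in nouns:
                nouns.append(word)
    return nouns


def hapax_query(text):
    """(b) Words that occur exactly once in the passage (hapax), minus stopwords."""
    counts = Counter(w.lower() for w in tokenize(text))
    return [w for w, c in counts.items() if c == 1 and w not in STOPWORDS]


def frequent_query(text, k=3):
    """(c) The k most frequent non-stopword terms."""
    counts = Counter(
        w.lower() for w in tokenize(text) if w.lower() not in STOPWORDS
    )
    return [w for w, _ in counts.most_common(k)]


passage = (
    "The report was written in Bhopal. "
    "Copies of the report reached Delhi and Bhopal."
)
proper_nouns = proper_noun_query(passage)
hapax = hapax_query(passage)
frequent = frequent_query(passage, k=2)
```

Each function returns a list of keywords that could be submitted as a query to a retrieval engine; a real system would likely use a part-of-speech tagger rather than capitalization to find proper nouns.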

Measurement of Document Similarity using Word and Word-Pair Frequencies

  • 김혜숙;박상철;김수형
    • Institute of Electronics Engineers of Korea: Conference Proceedings / Proceedings of the IEEK 2003 Summer Conference III / pp. 1311-1314 / 2003
  • In this paper, we propose a method for measuring document similarity. We first examined the single-term method, which uses a lexical analyzer as a preprocessing step to extract nouns and maps each noun to one index term. With this method, the measured similarity is likely to be inflated even when the documents are unrelated. For this reason, a term-phrase method has been reported, which uses the co-occurrence of two words as an index term for measuring document similarity. In this paper, we tried a further method that combines the two to compensate for the weaknesses of each. Six types of features are extracted from the two input documents and fed into a neural network that calculates the final document similarity value. The reliability of our method was demonstrated in a document retrieval experiment.
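
The difference between single-word and word-pair indexing can be illustrated with a short sketch. It is not the paper's method: the six features and the neural network are not reproduced, cosine similarity is a swapped-in stand-in, and treating adjacent words as the co-occurring pair is an assumption made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two frequency Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_counts(tokens):
    """Single-term index: frequency of each word."""
    return Counter(tokens)


def pair_counts(tokens):
    """Word-pair index: frequency of each pair of adjacent words (assumed pairing)."""
    return Counter(zip(tokens, tokens[1:]))


doc_a = "document retrieval system".split()
doc_b = "retrieval document system".split()

word_sim = cosine(word_counts(doc_a), word_counts(doc_b))
pair_sim = cosine(pair_counts(doc_a), pair_counts(doc_b))
```

Here the two token lists contain the same words in a different order, so the word-level similarity is at its maximum while the pair-level similarity is zero, which shows why the two kinds of evidence can usefully be combined.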

A Storage and Retrieval System for Structured SGML Documents using Grove

  • 김학균;조성배
    • Journal of KIISE: Computing Practices and Letters / Vol. 8, No. 5 / pp. 501-509 / 2002
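  • SGML (ISO 8879) is used to share the information in a document, once written, across heterogeneous systems regardless of platform, and to support diverse document formats. SGML documents carry structural information as well as content. As SGML documents become widespread, the need is growing for database construction and retrieval systems that exploit this structural information. However, existing full-text search engines based on index terms cannot make use of a document's structural information. In this paper, we developed an SGML document storage system that is independent of document type and stores document format and content separately, using a data model adapted from Grove, the document model of DSSSL and HyTime. To store structural information without loss, we used ObjectStore, an object-oriented database system. In addition, we improved retrieval performance by building an index structure on elements similar to that of a relational DBMS, and we built a user interface that efficiently combines content-based and structure-based retrieval.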
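
The idea of an element-level index that supports combined structure-based and content-based retrieval can be sketched as below. This is a toy illustration under stated assumptions: the nested-tuple tree stands in for a parsed SGML document, and `build_element_index` and `search` are hypothetical names, not part of Grove, DSSSL, HyTime, or ObjectStore.

```python
from collections import defaultdict

# A tiny document tree: (element_name, children), where each child is either
# another (name, children) tuple or a text string. Assumed stand-in for a
# parsed SGML document.
DOC = ("article", [
    ("title", ["Document retrieval with Grove"]),
    ("section", [
        ("title", ["Storage"]),
        ("para", ["Elements are stored with their structure."]),
    ]),
])


def build_element_index(doc_id, node, index, path=""):
    """Record every element under its name, with its path and direct text."""
    name, children = node
    here = f"{path}/{name}"
    text = " ".join(child for child in children if isinstance(child, str))
    index[name].append((doc_id, here, text))
    for child in children:
        if not isinstance(child, str):
            build_element_index(doc_id, child, index, here)


def search(index, element, keyword):
    """Structure + content query: elements named `element` whose text contains `keyword`."""
    return [
        (doc_id, path)
        for doc_id, path, text in index.get(element, [])
        if keyword.lower() in text.lower()
    ]


index = defaultdict(list)
build_element_index("doc1", DOC, index)
title_hits = search(index, "title", "retrieval")
section_title_hits = search(index, "title", "storage")
para_hits = search(index, "para", "retrieval")
```

Because the index is keyed by element name, a query can restrict a keyword match to one kind of element (for example, only titles), which a plain full-text index cannot do.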

Document Clustering Method using Coherence of Cluster and Non-negative Matrix Factorization

  • 김철원;박선
    • Journal of the Korea Institute of Information and Communication Engineering / Vol. 13, No. 12 / pp. 2603-2608 / 2009
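  • Document clustering is an important document analysis method used in many applications of information retrieval. This paper proposes a new document clustering method that uses non-negative matrix factorization (NMF) for clustering and the coherence of clusters to refine the documents within each cluster. The proposed method can improve clustering performance by using the semantic feature matrix and the semantic variable matrix, which represent the internal structure of the document set. It can also improve clustering efficiency by refining and reassigning the documents within a cluster using cluster coherence based on the similarity between sentences. Experimental results show that the proposed method performs better than other document clustering methods.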

A Re-Ranking Retrieval Model based on Two-Level Similarity Relation Matrices

  • 이기영;은희주;김용성
    • Journal of KIISE: Software and Applications / Vol. 31, No. 11 / pp. 1519-1533 / 2004
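  • Web-based full-text retrieval systems for academic fields allow users only a very limited expression of their information needs, so the analysis of retrieved content and the process of acquiring information are inconsistent, and information is provided indiscriminately. In this paper, we applied a fuzzy retrieval model that organizes the relative importance of terms into a reduced term set in order to address the high time complexity of the retrieval system. We also performed cluster retrieval using a similarity relation matrix that satisfies the properties of a fuzzy compatibility relation, so that user queries are reflected accurately. The re-ranking retrieval model proposed in this paper, which combines the similarities from fuzzy retrieval and document cluster retrieval, was shown to improve precision and recall, the measures of retrieval performance.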

Design and Development of a Multimodal Biomedical Information Retrieval System

  • Demner-Fushman, Dina;Antani, Sameer;Simpson, Matthew;Thoma, George R.
    • Journal of Computing Science and Engineering / Vol. 6, No. 2 / pp. 168-177 / 2012
  • The search for relevant and actionable information is a key to achieving clinical and research goals in biomedicine. Biomedical information exists in different forms: as text and illustrations in journal articles and other documents, in images stored in databases, and as patients' cases in electronic health records. This paper presents ways to move beyond conventional text-based searching of these resources, by combining text and visual features in search queries and document representation. A combination of techniques and tools from the fields of natural language processing, information retrieval, and content-based image retrieval allows the development of building blocks for advanced information services. Such services enable searching by textual as well as visual queries, and retrieving documents enriched by relevant images, charts, and other illustrations from the journal literature, patient records and image databases.

A Study on Effective Internet Data Extraction through Layout Detection

  • Sun Bok-Keun;Han Kwang-Rok
    • International Journal of Contents / Vol. 1, No. 2 / pp. 5-9 / 2005
  • Most Internet documents that contain data are currently built from predefined templates, but these templates usually cover only the main data and do not help information retrieval deal with indexes, advertisements, header data, and so on. Such templates are therefore inadequate when Internet documents are used as data for information retrieval. To process Internet documents in various areas of information retrieval, additional information such as advertisements and page indexes must be detected. This study therefore proposes a method for detecting the layout of Web pages by identifying the characteristics and structure of the block tags that affect layout and by calculating distances between Web pages. The method aims to reduce the cost of automatic Web document processing and to improve processing efficiency by providing template-based information about the structure of Web pages to information retrieval tasks such as data extraction.
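
One way to picture "block tags plus a distance between pages" is the sketch below. It is only an illustration under assumptions: the abstract does not say which tags or which distance the authors use, so the `BLOCK_TAGS` set and the use of `difflib.SequenceMatcher` as the distance are choices made for this example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumed set of layout-affecting block tags; the paper's actual set is not given.
BLOCK_TAGS = {"table", "tr", "td", "div", "ul", "ol", "li", "p", "form"}


class BlockTagCollector(HTMLParser):
    """Collect the sequence of block-level start tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.tags.append(tag)


def block_signature(html):
    """Return the page's layout signature as a list of block tag names."""
    collector = BlockTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def layout_distance(html_a, html_b):
    """0.0 for identical block-tag sequences, 1.0 for completely different ones."""
    matcher = SequenceMatcher(None, block_signature(html_a), block_signature(html_b))
    return 1.0 - matcher.ratio()


page_a = "<div><table><tr><td>News 1</td></tr></table></div>"
page_b = "<div><table><tr><td>News 2</td></tr></table></div>"
page_c = "<ul><li>x</li><li>y</li></ul>"

same_template = layout_distance(page_a, page_b)
different_template = layout_distance(page_a, page_c)
```

Pages generated from the same template differ in text but not in block structure, so their distance is small, and pages with a small mutual distance could be grouped as sharing one layout.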

A Study on Building Structures and Processes for Intelligent Web Document Classification

  • 장영철
    • Journal of Digital Convergence / Vol. 6, No. 4 / pp. 177-183 / 2008
  • This paper aims to offer a solution based on intelligent document classification for building a user-centric information retrieval system that allows user-centric linguistic expression. To this end, it proposes structures for expressing user intention and a fine-grained document classification process that uses EBL, similarity, a knowledge base, and user intention. To overcome the need for large amounts of exact semantic information, a hybrid process is designed that integrates keyword, thesaurus, probability, and user intention information. A user intention tree hierarchy is built, and a method for extracting group intention between keywords and user intentions is proposed. These structures and processes are implemented in the HDCI (Hybrid Document Classification with Intention) system. HDCI consists of two stages: analyzing user intention and classifying web documents. The classification stage is composed of a knowledge base process, a similarity process, and a hybrid coordinating process. With the help of the user-intention structures and the hybrid coordinating process, HDCI can efficiently categorize web documents according to a user's complex linguistic expression with little a priori information.

Automatic Linkage Method Between Email and Block Structure to Store Construction Project Documents in The Blockchain

  • Kim, Eu Wang;Park, Min Seo;Kim, Jong Inn;Wei, Ameng;Kim, Kyoungmin;Kim, Kyong Ju
    • International Conference Proceedings / The 9th International Conference on Construction Engineering and Project Management / pp. 886-892 / 2022
  • In construction projects, documents are commonly exchanged by email because it is convenient. In this study, a method was developed for automatically extracting and organizing block information from email. The method covers the document exchange and archiving processes, which are difficult to manage and vulnerable to loss. Therefore, this study aims to develop a solution that can automatically link email and block information. The block data components are designed to be derived from email exchanges and additional user input. A process for automatically generating blocks, including the extraction and conversion of information, is also proposed. This solution can make project document management more convenient by identifying the document flow and preventing loss of information.

Design and Implementation of Document Filing and Retrieval System in an OSI Environment

  • 임재홍;박용진
    • Journal of the Korean Institute of Telematics and Electronics B / Vol. 31B, No. 2 / pp. 10-20 / 1994
  • This paper describes the design and implementation of the DFR (Document Filing and Retrieval) system, one of the applications of DOAM (Distributed Office Application Model), an international standard of ISO (International Organization for Standardization). On the basis of the international standard, the DFR system is implemented in C on a SUN workstation and a PC/386, and the implementation is verified by tracing the association descriptor and the primitives of the service elements while its operation is tested between client and server. The results of this study show that the DFR system can be implemented on the basis of the international standard, and they contribute to the establishment of functional standards for the DFR system.
