• Title/Summary/Keyword: Related Documents Retrieval

An n-gram-based Indexing Method for Effective Retrieval of Hangul Texts (한글 문서의 효과적인 검색을 위한 n-gram 기반의 색인 방법)

  • 이준호;안정수;박현주;김명호
    • Journal of the Korean Society for Information Management / v.13 no.1 / pp.47-63 / 1996
  • Conventional automatic indexing methods for Hangul texts fall into two groups: one extracts index terms by removing non-indexable segments from word-phrases, and the other generates index terms from the morphemes of word-phrases. The former suffers from the word boundary problem when documents contain many compound nouns. The latter overcomes the word boundary problem by extracting simple nouns, but incurs considerable overhead in developing the extensive linguistic knowledge that the indexing procedure requires. In this paper we propose a new indexing method based on n-grams. This method alleviates the problems of previous indexing methods related to word boundaries and linguistic knowledge. We also compare the effectiveness of the n-gram-based indexing method with that of the previous ones.
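
As a rough illustration of the idea in this abstract, the sketch below indexes text by overlapping character n-grams instead of by segmented words. It is not the paper's method: the abstract does not give the n-gram length or how terms are selected, so the bigram default, the function names `char_ngrams` and `build_index`, and the sample documents are all assumptions. The code also assumes precomposed (NFC) Hangul, so that one syllable is one character.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams for each whitespace-separated word-phrase."""
    grams = []
    for segment in text.split():
        if len(segment) < n:
            # A segment shorter than n is kept whole (an assumption of this sketch).
            grams.append(segment)
        else:
            grams.extend(segment[i:i + n] for i in range(len(segment) - n + 1))
    return grams


def build_index(docs, n=2):
    """Build an inverted index that maps each n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


# Hypothetical sample documents: a compound noun and a related word-phrase.
docs = {1: "정보검색 시스템", 2: "검색엔진"}
index = build_index(docs)
print(sorted(index["검색"]))
```

Because the compound noun "정보검색" yields the bigram "검색" without any segmentation, a query on "검색" reaches both sample documents. No word boundary detection or morphological dictionary is involved.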

Patent Document Similarity Based on Image Analysis Using the SIFT-Algorithm and OCR-Text

  • Park, Jeong Beom;Mandl, Thomas;Kim, Do Wan
    • International Journal of Contents / v.13 no.4 / pp.70-79 / 2017
  • Images are an important element in patents, and many experts use images to analyze a patent or to check differences between patents. However, there is little research on image analysis for patents, partly because image processing is an advanced technology and patent images typically consist of visual parts as well as text and numbers. This study suggests two methods for using image processing: the Scale Invariant Feature Transform (SIFT) algorithm and Optical Character Recognition (OCR). The first method, which works with SIFT, uses image feature points. Through feature matching, it can be applied to calculate the similarity between documents containing these images. In the second method, OCR is used to extract text from the images. Using the numbers extracted from an image, it is possible to extract the corresponding related text from the text passages. Subsequently, document similarity can be calculated based on the extracted text. Comparing the suggested methods with an existing method that calculates similarity based only on text demonstrates their feasibility. Additionally, the correlation between the two similarity measures is low, which shows that they capture different aspects of the patent content.

Similar Documents and Related Researcher Retrieval Method (유서문서 및 관련연구자 검색 방법)

  • Han, Hee-Jun
    • Proceedings of the Korean Information Science Society Conference / 2010.06b / pp.6-9 / 2010
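  • Users of scholarly information rely on search services to obtain the materials they need for research. Most web users generate numerous search queries and submit them to the system to find the information they want, browse the resulting lists of selected items, and finally move to the detail page of an item as the final destination of information acquisition. Likewise, in scholarly information services that provide paper and patent information, the user's final destination is the detailed metadata or the full text of a single item. At that point, a service that offers other types of scholarly information similar to the item being viewed, together with researchers in related research fields, is essential for easily satisfying the user's information needs. NDSL (the national comprehensive science and technology information service) provides similar-document search within a single database (similar papers in paper search, similar patents in patent search), but this falls short of the needs of users who want similar documents across heterogeneous databases. This paper discusses a technique, for a scholarly information search service covering papers, patents, research reports, and trend analysis materials, that uses the user's query together with the search fields and the boosting technique provided by the search engine to retrieve lists of similar documents across heterogeneous content types and the names of researchers in related research fields. By effectively presenting the scholarly information that users want on the final screen of the service, it can reduce the effort of repeated searching and browsing.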

Feature Generation of Dictionary for Named-Entity Recognition based on Machine Learning (기계학습 기반 개체명 인식을 위한 사전 자질 생성)

  • Kim, Jae-Hoon;Kim, Hyung-Chul;Choi, Yun-Soo
    • Journal of Information Management / v.41 no.2 / pp.31-46 / 2010
  • Named-entity recognition (NER), as a part of information extraction, is now used in information retrieval as well as in question-answering systems. Unlike ordinary words, named entities (NEs) are steadily created and changed in documents on the Web, in newspapers, and so on. The generation of new NEs causes an unknown word problem and complicates many application systems that use NER. To alleviate this problem, this paper proposes a new feature generation method for machine learning-based NER. In general, features in machine learning-based NER are related to words, but entities in named-entity dictionaries are related to phrases, so the entities cannot be used directly as features of NER systems. This paper proposes an encoding scheme as a feature generation method that converts phrase entities into word-unit features. Furthermore, with this scheme, entities with semantic information in WordNet can be converted into features of NER systems. Our experiments show that performance increases by about 6% in F1 score and that errors are reduced by about 38%.

Visualizing Fuzzy Set Based on Venn Diagram (벤 다이어그램 기반 퍼지 집합 시각화)

  • Park, Ye-Seul;Park, Jin-Ah
    • 한국HCI학회:학술대회논문집 / 2009.02a / pp.15-20 / 2009
  • Large amounts of data that call for a fuzzy information system require various analyses through fuzzy set visualization. This study therefore proposes a way to visualize fuzzy data sets using a variation of the Venn diagram. For fuzzy data that are related to many topics and have a ranked degree of relation, this approach gives users the results they want by visualizing the intersection, union, and complement sets. That is, it visualizes the set of fuzzy data that relate to many topics at once, the set of all fuzzy data that relate to any of the topics, or the set of fuzzy data not related to a topic. Users control these sets by overlapping or piling them, and the sets are visualized with a Venn diagram, which is user-oriented. One distinct advantage of this visualization is that it delivers the web documents that search engine users and web developers want much more quickly. Furthermore, its applicability can be extended to several purposes by using it for information retrieval.

A Personal Digital Library on a Distributed Mobile Multiagents Platform (분산 모바일 멀티에이전트 플랫폼을 이용한 사용자 기반 디지털 라이브러리 구축)

  • Cho Young Im
    • Journal of KIISE:Software and Applications / v.31 no.12 / pp.1637-1648 / 2004
  • When digital libraries are developed with a traditional client/server system using a single agent in a distributed environment, several problems occur. First, because the search method is one-dimensional, the search results have little relationship to each other. Second, the results do not reflect the user's preferences. Third, whenever a client connects to the server, the user has to be certified again. As a result, document retrieval is less efficient, causing dissatisfaction with the system. I propose a new mobile multiagent platform for a personal digital library to overcome these problems. To develop this platform, I combine the existing DECAF multiagent platform with the Voyager mobile ORB and propose a new negotiation algorithm and a new scheduling algorithm. Although there has been some research on personal digital libraries, I believe there have been few studies on their integration and systemization. For searches of related information, the proposed platform can increase the relatedness of search results by subdividing the related documents, which are classified by a supervised neural network. For user preferences, the search results are optimized by applying modular clients to a neural network. Combining a mobile platform and a multiagent platform yields a new mobile multiagent platform that reduces the network burden. Furthermore, the new negotiation and scheduling algorithms are used to make the PDS more effective. The simulation results demonstrate that as the number of servers and agents increases, the search time for the PDS decreases, while the degree of user satisfaction is four times greater than with the C/S model.

Investigation of Topic Trends in Computer and Information Science by Text Mining Techniques: From the Perspective of Conferences in DBLP (텍스트 마이닝 기법을 이용한 컴퓨터공학 및 정보학 분야 연구동향 조사: DBLP의 학술회의 데이터를 중심으로)

  • Kim, Su Yeon;Song, Sung Jeon;Song, Min
    • Journal of the Korean Society for Information Management / v.32 no.1 / pp.135-152 / 2015
  • The goal of this paper is to explore the field of Computer and Information Science with the aid of text mining techniques by mining Computer and Information Science related conference data available in DBLP (Digital Bibliography & Library Project). Although studies based on bibliometric analysis are the most prevalent way of investigating the dynamics of a research field, we attempt to understand the dynamics of the field by using Latent Dirichlet Allocation (LDA)-based multinomial topic modeling. For this study, we collect 236,170 documents from 353 conferences related to Computer and Information Science in DBLP. We aim to cover conferences in the field of Computer and Information Science as broadly as possible. We analyze the topic modeling results along with datasets collected over the period from 2000 to 2011, including top authors per topic and top conferences per topic. We identify the following four patterns in topic trends in the field of computer and information science during this period: growing (network related topics), shrinking (AI and data mining related topics), continuing (web, text mining, information retrieval, and database related topics), and fluctuating (HCI, information system, and multimedia system related topics).

Standard Translation of Terms of Korean Medicine through Consideration of Chinese-Korean Collated Medical Classics - With focus on 『Eonhaegugeupbang』, 『Eonhaetaesanjipyo』 and 『Eonhaeduchangjipyo』 - (언해의서 비교고찰을 통한 한의학용어의 번역표준안 - 『언해두창집요』, 『언해구급방』, 『언해태산집요』를 중심으로)

  • Ku, Hyunhee;Kim, Hyunkoo;Lee, JungHyun;Oh, Junho;Kwon, Ohmin
    • Korean Journal of Oriental Medicine / v.18 no.3 / pp.49-61 / 2012
  • This article sets out to develop an old Chinese - modern Korean collated terminology by analyzing and paralleling, at the level of the minimum meaning unit, Chinese-Korean translational terms relevant to Korean medicine from "Eonhaegugeupbang", "Eonhaetaesanjipyo" and "Eonhaeduchangjipyo". These works consist of original Chinese texts and their corresponding Korean translations. The article attempts to compile a list of translation standards for Korean medicine terms by classifying cases of translational ambiguity in terms of disease, body position, thumbnail-pressing acupuncture method, and disease-curing method. The above-mentioned books are medical classics written by Huh Jun, the representative physician, and published by the Joseon government. Thus, they are appropriate as historically legitimate medical documents from which to draw the words and terms for an old Chinese - modern Korean collation dictionary. This collation glossary will contribute to more relevant data mining or information retrieval in database systems and search engines over massive Korean medical records, by providing a novel way to obtain synchronized results between the original old Chinese writings and their modern Korean translations. The glossary will promote the collective yet consistent translation of numerous old archives of Korean medicine and of other related fields as well.

Inverse Document Frequency-Based Word Embedding of Unseen Words for Question Answering Systems (질의응답 시스템에서 처음 보는 단어의 역문헌빈도 기반 단어 임베딩 기법)

  • Lee, Wooin;Song, Gwangho;Shim, Kyuseok
    • Journal of KIISE / v.43 no.8 / pp.902-909 / 2016
  • A question answering (QA) system finds an actual answer to the question posed by a user, whereas a typical search engine only finds links to relevant documents. Recent work on open-domain QA systems is receiving much attention in the fields of natural language processing, artificial intelligence, and data mining. However, prior work on QA systems simply replaces all words that are not in the training data with a single token, even though such unseen words are likely to play crucial roles in differentiating the candidate answers from the actual answers. In this paper, we propose a method to compute vectors for such unseen words by taking into account the context in which the words occur. We also propose a model that uses inverse document frequencies (IDF) to process unseen words efficiently by expanding the system's vocabulary. Finally, we validate through experiments that the proposed method and model improve the performance of a QA system.
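
For readers unfamiliar with the term, the sketch below computes inverse document frequency in its textbook form, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The abstract does not say which IDF variant the authors use or how it is combined with word vectors, so this shows only the general quantity. The function name `inverse_document_frequencies` and the toy corpus are assumptions.

```python
import math
from collections import Counter


def inverse_document_frequencies(documents):
    """Compute idf(t) = log(N / df(t)) for every term in a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical toy corpus of tokenized questions and sentences.
corpus = [
    ["what", "is", "the", "capital", "of", "korea"],
    ["seoul", "is", "the", "capital"],
    ["the", "han", "river"],
]
idf = inverse_document_frequencies(corpus)
for term in ("the", "capital", "seoul"):
    print(term, round(idf[term], 3))
```

A term that appears in every document, such as "the" here, gets an IDF of zero, while a rare term such as "seoul" gets the highest value. This is why IDF is a common signal for how informative a rare or unseen word may be.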

Structural Analysis of Scientific Information Usage (해사관계 연구자의 문헌정보 이용에 관한 구조분석)

  • 이철영
    • Journal of the Korean Institute of Navigation / v.4 no.2 / pp.7-38 / 1980
  • Nowadays researchers attach great importance to problems concerning scientific information in science and engineering. There are several reasons for this: ⅰ) the amount of scientific information increases in proportion to the activities of scientists and engineers, so it is difficult to pick out truly valuable information; ⅱ) it becomes more important to use a variety of information as the branches of science spread; ⅲ) since the medium of scientific information is mostly technical papers, it is very difficult to process these papers mechanically and to keep all documents and scientific information for a long time. To cope with these difficulties, many techniques have been developed, for example databases, automatic information retrieval, microfilm, and so on. However, there have been comparatively few investigations into how researchers, who are both users and producers, think about the systematization of scientific information usage. This paper therefore investigates the views and information needs of researchers and proposes a basis for a method of systematizing scientific information usage. The author inspects the actual state of scientific information, reconsiders the methods that have been used, and investigates, by means of a questionnaire based on the Fuzzy-DEMATEL method, how researchers whose interests are closely related to the study of marine affairs think about the problems of scientific information usage. In addition, FSM, a method for structuring a hierarchy for several complex problems on the basis of fuzzy set theory, is adopted as an analysis tool. From the results of the analysis, we can understand the key problems and outline a scenario for systematizing scientific information usage, and these results will be directly applicable to constructing a new system for scientific information usage.
