• Title/Summary/Keyword: Document information retrieval


A Study on the Development of Search Algorithm for Identifying the Similar and Redundant Research (유사과제파악을 위한 검색 알고리즘의 개발에 관한 연구)

  • Park, Dong-Jin;Choi, Ki-Seok;Lee, Myung-Sun;Lee, Sang-Tae
    • The Journal of the Korea Contents Association / v.9 no.11 / pp.54-62 / 2009
  • To avoid redundant investment during the project selection process, it is necessary to check whether a submitted research topic has already been proposed or carried out at other institutions. This check currently relies on search engines that use keyword matching based on Boolean techniques over a national research results database. Although the accuracy and speed of such information retrieval have improved, keyword matching still has fundamental limits. This paper examines an implemented TF-IDF-based algorithm and presents a search engine experiment that retrieves documents similar or redundant with respect to a research proposal and ranks them by priority. In addition to the generic TF-IDF algorithm, feature weighting and k-nearest-neighbor classification are incorporated. Documents extracted from the NDSL (National Digital Science Library) web directory service are used to test the algorithm.
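
A minimal sketch of the kind of pipeline the abstract describes: TF-IDF weighting, cosine-ranked retrieval, and a k-nearest-neighbor vote over the top-ranked documents. The corpus, labels, and parameters below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: rank past proposals by TF-IDF cosine similarity to a new submission,
# then let the k nearest neighbours vote on whether it looks redundant.
# Corpus, labels, and parameters are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

past_proposals = [
    "retrieval of similar research proposals using keyword matching",
    "ocr algorithm for book catalog retrieval",
    "xml path query processing with label paths",
]
redundant_label = np.array([1, 0, 0])            # hypothetical labels

vectorizer = TfidfVectorizer()                   # generic TF-IDF weighting
doc_matrix = vectorizer.fit_transform(past_proposals)

new_proposal = ["search algorithm for identifying similar and redundant research"]
query_vec = vectorizer.transform(new_proposal)

scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranking = scores.argsort()[::-1]                 # priority order, most similar first

k = 2
vote = redundant_label[ranking[:k]].mean()       # simple k-NN vote
print(list(zip(ranking, scores[ranking])), "redundant" if vote >= 0.5 else "new topic")
```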

Improved OCR Algorithm for Efficient Book Catalog Retrieval Technology (효과적인 도서목록 검색을 위한 개선된 OCR알고리즘에 관한 연구)

  • HeWen;Baek, Young-Hyun;Moon, Sung-Ryong
    • Journal of the Institute of Electronics Engineers of Korea CI / v.47 no.1 / pp.152-159 / 2010
  • Existing character recognition algorithms recognize characters only under simple conditions; recognition rates often drop sharply when the input document image is of low quality or contains rotated text or varied fonts and sizes, owing to external noise or data loss. This paper proposes an optical character recognition algorithm for book catalog retrieval that uses bicubic interpolation to handle input images with rotated, blurred, or variously sized and styled text. The proposed algorithm consists of a detection part and a recognition part. The detection part applies the Roberts operator and the Hausdorff distance to locate the book catalog correctly. The recognition part applies bicubic interpolation to compensate for data loss caused by low quality and varied fonts and sizes, and then rotates the interpolated image to correct slant. Experimental results show that the proposed method improves the recognition rate by 6%, with a search time of 1.077 s.
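
A minimal sketch of the preprocessing steps named in the abstract (bicubic upscaling of degraded text followed by a rotation to correct slant), written with OpenCV; the slant-estimation heuristic, thresholding, and scale factor are assumptions, not the authors' parameters.

```python
# Hedged sketch: upscale a low-quality catalog image with bicubic interpolation,
# estimate the text slant, and rotate the image back before running OCR.
# The angle heuristic and scale factor are illustrative assumptions.
import cv2
import numpy as np

def preprocess_for_ocr(image_path: str, scale: float = 2.0) -> np.ndarray:
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Bicubic interpolation to recover detail lost to low resolution / blur.
    up = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # Rough slant estimate from the minimum-area rectangle around text pixels.
    binary = cv2.threshold(up, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:                     # minAreaRect reports angles in (0, 90]
        angle -= 90

    # Rotate to correct the slant before character recognition.
    h, w = up.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(up, rot, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```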

Linear Path Query Processing using Backward Label Path on XML Documents (역방향 레이블 경로를 이용한 XML 문서의 선형 경로 질의 처리)

  • Park, Chung-Hee;Koo, Heung-Seo;Lee, Sang-Joon
    • Journal of the Korean Institute of Intelligent Systems / v.17 no.6 / pp.766-772 / 2007
  • As XML has come into wide use, much research has been done on XML storage and query processing. However, previous work on path query processing has focused mainly on storage and retrieval methods for a single large XML document or for XML documents sharing the same DTD, and it does not efficiently process partial match queries over sets of differently structured documents. To resolve this problem, we propose a new index structure based on a relational table. The method builds a $B^+$-tree index over backward label paths, rather than the forward label paths used in previous work, to store path information, and it uses this index to find the label paths matching a partial match query efficiently during query processing.
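
A minimal sketch of the backward-label-path idea: each root-to-node label path is stored reversed, so a partial match query such as `//chapter/title` becomes a prefix scan, which is what a $B^+$-tree over the reversed paths supports. The sorted list with `bisect` below only stands in for the paper's relational/$B^+$-tree storage, and the element names are invented examples.

```python
# Hedged sketch: index XML label paths in reverse ("backward label paths") so a
# suffix-style partial match query becomes a prefix scan over sorted keys.
import bisect

forward_paths = [                   # hypothetical label paths from heterogeneous documents
    "/book/chapter/title",
    "/book/chapter/section/title",
    "/article/title",
    "/book/author",
]

def backward(path: str) -> str:
    return "/".join(reversed(path.strip("/").split("/")))

index = sorted((backward(p), p) for p in forward_paths)
keys = [k for k, _ in index]

def partial_match(query: str):
    """Return forward paths ending with the given partial path, e.g. 'chapter/title'."""
    prefix = backward(query)
    lo = bisect.bisect_left(keys, prefix)
    out = []
    while lo < len(keys) and keys[lo].startswith(prefix):
        key = keys[lo]
        if len(key) == len(prefix) or key[len(prefix)] == "/":  # label boundary check
            out.append(index[lo][1])
        lo += 1
    return out

print(partial_match("chapter/title"))   # -> ['/book/chapter/title']
```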

An Operation-Based Model of Version Storage and Consistency Management for Fine-Grained Software Objects (미세 단위 소프트웨어 객체를 위한 연산 기반 버전 및 일관성 관리 모델)

  • Rho, Jung-Kyu;Wu, Chi-Su
    • Journal of KIISE: Software and Applications / v.27 no.7 / pp.691-701 / 2000
  • Software documents consist of a number of objects and the relationships between them, and the structure of a document can change frequently. In this paper, we propose a version storage and consistency management model for fine-grained software objects based on the operations applied to edit them. An object has an interface and can be updated only through the operations defined in that interface. Operations applied to objects are recorded in an operation history, which is used to retrieve versions of a document and to manage consistency between documents. Because versions of an object are stored and retrieved through the operation delta, there is no need to compare document versions to extract a delta, and the changes between versions are easy to identify when propagating them. Consistency between documents is managed using the dependencies between objects and the kinds of operations applied to them, so unnecessary version propagation can be avoided. This paper presents a formal model of fine-grained version retrieval and consistency management based on the operations applied to objects.
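
A minimal sketch of operation-based versioning as the abstract describes it: edits are recorded as operations in a history, and a past version is reconstructed by replaying the operation delta instead of diffing stored snapshots. The object model (a simple text fragment) and operation names are illustrative assumptions.

```python
# Hedged sketch: every interface operation is appended to an operation history,
# and a past version is retrieved by replaying the operations up to that point.

class TextObject:
    """A fine-grained software object updated only through its interface operations."""
    def __init__(self):
        self.lines = []
        self.history = []                      # operation delta: (op_name, args)

    def insert_line(self, pos: int, text: str):
        self.history.append(("insert_line", (pos, text)))
        self.lines.insert(pos, text)

    def delete_line(self, pos: int):
        self.history.append(("delete_line", (pos,)))
        del self.lines[pos]

    def version(self, n: int) -> list:
        """Reconstruct the object state after the first n operations."""
        state = []
        for op, args in self.history[:n]:
            if op == "insert_line":
                state.insert(args[0], args[1])
            elif op == "delete_line":
                del state[args[0]]
        return state

obj = TextObject()
obj.insert_line(0, "def f():")
obj.insert_line(1, "    return 1")
obj.delete_line(1)
print(obj.version(2))   # state before the deletion: ['def f():', '    return 1']
```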


Trends Analysis on Research Articles in the Journal of Korean Society for Information Management (『정보관리학회지』 연구의 동향분석)

  • Seo, Eun-Gyoung
    • Journal of the Korean Society for Information Management / v.27 no.4 / pp.7-32 / 2010
  • The aims of this study were to provide an overview of research trends in information science and to trace changes in its main research topics over time using trend analysis. The study examined the topics of research articles published in the Journal of the Korean Society for Information Management between 1984 and 2009. Rather than taking a single snapshot at a given point in time, this study presents a series of such pictures in order to identify trends over time. The period under consideration was divided, somewhat arbitrarily, into three publication windows: 1984-1994, 1995-2002, and 2003-2009. The study revealed that the most productive area was 'Information Service', followed by 'Information Organization' and 'Information System'. The most productive sub-areas were 'Library Service', 'User Study', 'Automatic Document Analysis', 'ILS', 'Thesaurus/Ontology', and 'Digital Library'. Comparison of the intellectual structures of title keywords showed that the key research area in the field of information science was 'Information Retrieval'. Studies of IT applications and service system evaluation have expanded over time.
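
The windowed tallying behind such a trend analysis can be sketched in a few lines of grouping code; the records below are invented placeholders, not data from the study.

```python
# Hedged sketch: count article topics per publication window, as in a trend analysis
# over 1984-1994, 1995-2002, and 2003-2009. Records are invented placeholders.
from collections import Counter, defaultdict

windows = [(1984, 1994), (1995, 2002), (2003, 2009)]
articles = [                              # (year, topic) -- hypothetical records
    (1990, "Information Organization"),
    (1998, "Information Retrieval"),
    (2005, "Information Service"),
    (2007, "Information Service"),
]

counts = defaultdict(Counter)
for year, topic in articles:
    for lo, hi in windows:
        if lo <= year <= hi:
            counts[(lo, hi)][topic] += 1

for window, tally in sorted(counts.items()):
    print(window, tally.most_common())
```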

Multiple Cause Model-based Topic Extraction and Semantic Kernel Construction from Text Documents (다중요인모델에 기반한 텍스트 문서에서의 토픽 추출 및 의미 커널 구축)

  • 장정호;장병탁
    • Journal of KIISE: Software and Applications / v.31 no.5 / pp.595-604 / 2004
  • Automatic analysis of concepts or semantic relations in text documents enables not only efficient acquisition of relevant information but also comparison of documents at the concept level. We present a multiple cause model-based approach to text analysis, in which latent topics are automatically extracted from document sets and the similarity between documents is measured by semantic kernels constructed from the extracted topics. In our approach, a document is assumed to be generated by various combinations of underlying topics. A topic is defined by a set of words that relate to the same theme or co-occur frequently within a document. In a network representing a multiple cause model, each topic is identified by a group of words with high connection weights from a latent node. Because learning and inference in multiple cause models require approximation, we use an approximation based on Helmholtz machines. In an experiment on the TDT-2 data set, we extract sets of meaningful words in which each set contains theme-specific terms. Using semantic kernels constructed from the latent topics extracted by multiple cause models, we also achieve significant improvements in retrieval effectiveness over the basic vector space model.
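
A minimal sketch of the semantic-kernel step: given a word-topic weight matrix, documents are projected into topic space and compared there rather than by raw term overlap. The random topic matrix below merely stands in for one learned by a multiple cause model (e.g. via a Helmholtz machine) and is an assumption, not the paper's trained model.

```python
# Hedged sketch: build a semantic kernel from a word-topic matrix and compare
# documents through shared latent topics instead of raw term overlap.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_topics, n_docs = 1000, 20, 5

X = rng.random((n_docs, vocab_size))      # document-term matrix (e.g. TF weights)
W = rng.random((vocab_size, n_topics))    # word-topic weights from the latent model (placeholder)

Z = X @ W                                 # documents projected into topic space
K = Z @ Z.T                               # semantic kernel: similarity via shared topics

# Normalise to cosine-style similarities for ranking.
norms = np.sqrt(np.diag(K))
similarity = K / np.outer(norms, norms)
print(similarity.round(3))
```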

Attribute-Based Classification Method for Automatic Construction of Answer Set (정답문서집합 자동 구축을 위한 속성 기반 분류 방법)

  • 오효정;장문수;장명길
    • Journal of KIISE: Software and Applications / v.30 no.7_8 / pp.764-772 / 2003
  • The main thrust of our talk will be based on our experience in developing and applying an attribute-based classification technique in the context of an operational answer set driven retrieval system. To alleviate the difficulty and reduce the cost of manually constructing and maintaining answer sets, i.e., knowledge base, we have devised a new method of automating the answer document selection process by using the notion of attribute-based classification, which is in and of itself novel. We attempt to explain through experiments how helpful the proposed method is for the knowledge base construction process.

An Efficient BitmapInvert Index based on Relative Position Coordinate for Retrieval of XML documents (효율적인 XML검색을 위한 상대 위치 좌표 기반의 BitmapInvert Index 기법)

  • Kim, Tack-Gon;Kim, Woo-Saeng
    • Journal of the Institute of Electronics Engineers of Korea CI / v.43 no.1 s.307 / pp.35-44 / 2006
  • Recently, many index techniques for storing and querying XML documents have been studied, and many of them use coordinate-based methods. However, update operations and the query processing needed to express structural relations among elements, attributes, and text impose a large burden. In this paper, we propose an efficient BitmapInvert index technique based on the Relative Position Coordinate (RPC). RPC performs well even under frequent update operations because it represents the relationship between a parent node and its left and right sibling nodes. The BitmapInvert index supports queries through bitwise operations and, using the PostUpdate algorithm, does not cause serious performance degradation on update operations. Overall, performance is improved by reducing the number of node traversals.
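
A minimal sketch of the bitmap-inverted-index idea: each term (or element label) owns one bit per document, and a conjunctive query is answered with a bitwise AND. The RPC coordinates and PostUpdate handling from the paper are not modelled; Python integers serve as bit vectors, and the documents are invented examples.

```python
# Hedged sketch: a bitmap inverted index where conjunctive queries are bitwise ANDs.
from collections import defaultdict

docs = {                         # hypothetical XML fragments flattened to terms
    0: ["book", "title", "xml"],
    1: ["book", "author"],
    2: ["article", "title", "xml"],
}

bitmaps = defaultdict(int)
for doc_id, terms in docs.items():
    for term in terms:
        bitmaps[term] |= 1 << doc_id          # set the bit for this document

def query_and(*terms) -> list:
    """Documents containing all terms, via bitwise AND of their bitmaps."""
    bits = ~0
    for term in terms:
        bits &= bitmaps.get(term, 0)
    return [doc_id for doc_id in docs if bits >> doc_id & 1]

print(query_and("title", "xml"))   # -> [0, 2]
```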

Design and Implementation of On-line Standards Development System on the World Wide Web (WWW상에서의 온라인 정보통신표준 개발 시스템 설계 및 구현)

  • 구경철;김형준;박기식;송기평;조인준;정회경
    • Journal of the Korea Institute of Information and Communication Engineering / v.2 no.4 / pp.559-573 / 1998
  • Standards Development Organizations (SDOs) in the field of information and communication have recognized that more new and more complex standards must be developed in less time. To cope with this challenge, they are building Standards Information Cooperation Network (SICN) or Electronic Document Handling (EDH) systems to make the standards development process more efficient. This paper presents the design and implementation of an Extranet-based web system dedicated to effective on-line standards-making environments. The system, called SICN (Standards Information Cooperation Network), is a workflow-based network application created to foster faster standards development, with functionality such as an electronic signature mechanism, electronic voting, comment gathering, and dynamic links for ready retrieval of standards information stored in a database. This paper also describes the concept of a VSDO (Virtual Standards Development Organization) that supports all the features the relevant standards-making bodies need to carry out their activities in dynamic on-line environments.


Relevance Feedback Agent for Improving Precision in Korean Web Information Retrieval System (한국어 웹 정보검색 시스템의 정확도 향상을 위한 연관 피드백 에이전트)

  • Baek, Jun-Ho;Choe, Jun-Hyeok;Lee, Jeong-Hyeon
    • The Transactions of the Korea Information Processing Society / v.6 no.7 / pp.1832-1840 / 1999
  • Because existing Korean web information retrieval systems generally use Boolean models, it is difficult to retrieve the desired information in a single attempt. In addition, since web documents contain frequent abbreviations and many links, keyword extraction based on inverse document frequency tends to extract inappropriate keywords and adds ambiguity. Users must therefore repeatedly modify their queries until they obtain the proper information. In this paper, we design and implement a relevance feedback agent system to resolve these problems. The system extracts information in response to the user's preferred keywords and stores those keywords in a preference DB table. When the user searches later, the agent adds the relevant stored keywords to the user's queries. As a result, the system reduces the number of query modifications and improves the efficiency of the IR system.
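
A minimal sketch of the feedback loop described above: keywords from documents the user judged relevant are accumulated in a preference store and appended to later queries. The in-memory Counter stands in for the preference DB table, and the expansion rule is an illustrative assumption.

```python
# Hedged sketch: a relevance-feedback agent that remembers keywords from documents
# the user marked relevant and expands later queries with the strongest of them.
from collections import Counter

class FeedbackAgent:
    def __init__(self, top_k: int = 3):
        self.preferences = Counter()      # stands in for the preference DB table
        self.top_k = top_k

    def record_feedback(self, relevant_doc_terms):
        """Accumulate keywords from documents the user marked as relevant."""
        self.preferences.update(relevant_doc_terms)

    def expand_query(self, query_terms):
        """Append the user's strongest preferred keywords to a new query."""
        extra = [t for t, _ in self.preferences.most_common(self.top_k)
                 if t not in query_terms]
        return list(query_terms) + extra

agent = FeedbackAgent()
agent.record_feedback(["information", "retrieval", "korean", "web"])
agent.record_feedback(["retrieval", "precision"])
print(agent.expand_query(["web", "search"]))   # e.g. ['web', 'search', 'retrieval', ...]
```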
