• Title/Abstract/Keywords: Document research

Search results: 1,342 items (processing time: 0.027 seconds)

XML Document Management System

  • Na, Jung-Chan;Lee, Mi-Yeong;Kim, Wan-Seok;Kim, Myeong-Jun;Lee, Gyu-Cheol
    • The Transactions of the Korea Information Processing Society / Vol. 7, No. 2S / pp.711-720 / 2000
  • BADA-IV/XML is a system designed specifically for managing XML documents and serves as a foundational system for various electronic document applications. BADA-IV/XML supports all aspects of the data model, querying, and manipulation operations for managing XML documents. This paper provides an overview of these aspects of BADA-IV/XML and defines schema classes for storing, querying, and maintaining the hierarchical semantics of multimedia documents and the structural semantics of complex documents linked with each other. A multimedia document query language is also designed and implemented to support essential operations for efficiently searching and managing multimedia documents. Finally, simulation results show the performance of the paged VF (Virtual Fragmentation) model and of the search model using element identifiers, compared with a general model.

Design and Implementation of BADA-IV/XML Query Processor Supporting Efficient Structure Querying

  • 이명철;김상균;손덕주;김명준;이규철
    • The Journal of Information Technology and Database / Vol. 7, No. 2 / pp.17-32 / 2000
  • As XML emerges as the next-generation standard language for electronic documents on the Internet, the number of XML documents containing vast amounts of information is increasing substantially, through both the conversion of existing documents to XML and the creation of new XML documents. Consequently, an XML document retrieval system becomes essential for searching large quantities of XML documents that are stored in and managed by a DBMS. In this paper, we describe the design and implementation of the BADA-IV/XML query processor, which supports content-based, structure-based, and attribute-based retrieval. We design an XML query language based on XQL (XML Query Language) of W3C and tightly coupled with OQL (a query language for object-oriented databases). XML documents are stored and maintained in BADA-IV, an object-oriented database management system developed by ETRI (Electronics and Telecommunications Research Institute). The storage data model is based on DOM (Document Object Model); therefore, retrieval of XML documents is basically executed by DOM tree traversal. We improve search performance using a Node ID, which represents a node's hierarchy information in an XML document. Assuming that the DOM tree is a complete k-ary tree, we show that the Node ID technique is superior to DOM tree traversal in terms of node fetch counts.
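The node-fetch argument can be illustrated with level-order numbering of a complete k-ary tree. The abstract does not give the exact Node ID scheme, so the numbering below (root = 1, children numbered left to right, level by level) is an assumption, and the function names are hypothetical. The point of the sketch is that parent and ancestor relationships follow from arithmetic on the IDs alone, without fetching any intermediate node.

```python
from typing import List, Optional

def parent_id(node_id: int, k: int) -> Optional[int]:
    """Parent of node_id in a complete k-ary tree numbered in level order from 1."""
    if node_id == 1:
        return None  # the root has no parent
    return (node_id - 2) // k + 1

def child_ids(node_id: int, k: int) -> List[int]:
    """IDs of the k children of node_id under the same numbering."""
    first = k * (node_id - 1) + 2
    return list(range(first, first + k))

def ancestors(node_id: int, k: int) -> List[int]:
    """All ancestors of node_id, nearest first, computed without touching the tree."""
    result = []
    current = parent_id(node_id, k)
    while current is not None:
        result.append(current)
        current = parent_id(current, k)
    return result

def is_ancestor(candidate: int, node_id: int, k: int) -> bool:
    """True if candidate lies on the path from node_id up to the root."""
    return candidate in ancestors(node_id, k)
```

With k = 3, `ancestors(14, 3)` yields the path to the root purely from the ID, whereas a DOM traversal would fetch each node on that path.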


Discriminator of Similar Documents Using Syntactic and Semantic Analysis

  • Kang, Won-Seog;Hwang, Do-Sam;Kim, Jung H.
    • The Journal of the Korea Contents Association / Vol. 14, No. 3 / pp.40-51 / 2014
  • Owing to the importance of document copyright, the need to detect document duplication and plagiarism is increasing. Many studies have sought to meet this need, but duplication detection remains difficult because of technological limitations in natural language processing. This paper designs and implements a discriminator of similar documents using natural language processing techniques. The system discriminates similar documents using morphological analysis, syntactic analysis, and weights on low-frequency terms and idioms. To evaluate the system, we analyze the correlation between human discrimination and term-based discrimination, and between human discrimination and the proposed discrimination. This analysis shows that the proposed discrimination needs improvement. Future research should define document types and improve the processing technique appropriate for each type.

Fast, Flexible Text Search Using Genomic Short-Read Mapping Model

  • Kim, Sung-Hwan;Cho, Hwan-Gue
    • ETRI Journal / Vol. 38, No. 3 / pp.518-528 / 2016
  • Searching an extensive document database for documents that are locally similar to a given query document, and subsequently detecting the similar regions between such documents, is considered an essential task in information retrieval and data management. In this paper, we present a framework for this task. The proposed framework employs the method of short-read mapping, which is used in bioinformatics to reveal similarities between genomic sequences. Documents are treated as biological objects; consequently, edit operations between locally similar documents are viewed as an evolutionary process. Accordingly, we are able to apply the method of evolution tracing to the detection of similar regions between documents. In addition, we propose heuristic methods to address issues associated with the different stages of the proposed framework, for example, a frequency-based fragment ordering method and a locality-aware interval aggregation method. Extensive experiments covering various search scenarios are conducted, and the results indicate that the proposed framework outperforms existing methods.
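A heavily simplified sketch of the seed-and-aggregate idea behind short-read mapping, applied to text. This is not the authors' framework: it uses exact matching of fixed-length character fragments only, omits the frequency-based ordering and evolution tracing, and all function names and parameters are hypothetical.

```python
from collections import defaultdict

def build_index(docs, k):
    """Map every length-k substring (seed) to the (doc_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for i in range(len(text) - k + 1):
            index[text[i:i + k]].append((doc_id, i))
    return index

def map_query(query, index, k):
    """Cut the query into non-overlapping length-k fragments ("reads") and collect exact hits."""
    hits = defaultdict(list)
    for q in range(0, len(query) - k + 1, k):
        for doc_id, offset in index.get(query[q:q + k], []):
            hits[doc_id].append(offset)
    return dict(hits)

def aggregate(offsets, k, gap):
    """Merge hit offsets into [start, end) intervals when neighbouring hits are at most `gap` apart."""
    intervals = []
    for off in sorted(offsets):
        if intervals and off - intervals[-1][1] <= gap:
            intervals[-1][1] = max(intervals[-1][1], off + k)
        else:
            intervals.append([off, off + k])
    return [tuple(iv) for iv in intervals]

docs = {"a": "the quick brown fox jumps", "b": "lorem ipsum dolor"}
index = build_index(docs, k=5)
hits = map_query("quick brown", index, k=5)
regions = {doc_id: aggregate(offs, k=5, gap=0) for doc_id, offs in hits.items()}
```

Here only document "a" receives hits, and its two adjacent fragment hits are merged into a single candidate region.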

Adaptive Binarization for Camera-based Document Recognition

  • Kim, In-Jung
    • Journal of Korea Society of Industrial Information Systems / Vol. 12, No. 3 / pp.132-140 / 2007
  • The quality of a camera image is worse than that of a scanner image because of lighting variation and inaccurate focus. This paper proposes a binarization method for camera-based document recognition that is tolerant of low-quality camera images. Based on an existing method reported to be effective in previous evaluations, we enhanced adaptability to images with low contrast caused by low intensity and inaccurate focus. Furthermore, by applying an additional small window in the binarization process, the method effectively extracts the fine detail of character structure, which is often degraded by conventional methods. In experiments, we applied the proposed method as well as other methods to a document recognizer and compared the performance on many camera images. The results showed that the proposed method is effective for recognizing document images captured by a camera.
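The abstract does not name the existing method it builds on. As a stand-in, the sketch below uses Sauvola-style local thresholding, a common window-based adaptive binarization, to show why a per-pixel threshold copes with low-contrast images where a fixed global threshold fails. The function name and the parameter defaults are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def sauvola_binarize(img, radius=1, k=0.2, dynamic_range=128.0):
    """Binarize a grayscale image (list of rows) with a Sauvola-style local threshold.

    For each pixel, the threshold is m * (1 + k * (s / R - 1)), where m and s are the
    mean and standard deviation of the surrounding window and R is dynamic_range.
    Returns 1 for background (brighter than the threshold) and 0 for ink.
    """
    height, width = len(img), len(img[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            m = mean(window)
            s = pstdev(window)
            threshold = m * (1 + k * (s / dynamic_range - 1))
            out[y][x] = 1 if img[y][x] > threshold else 0
    return out

# A dim, low-contrast image: background 120, one vertical stroke at 80.
dim_image = [[120, 120, 80, 120, 120] for _ in range(3)]
binary = sauvola_binarize(dim_image)
```

A global threshold of 128 would mark every pixel of `dim_image` as ink, while the local threshold isolates the stroke column.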


PMCN: Combining PDF-modified Similarity and Complex Network in Multi-document Summarization

  • Tu, Yi-Ning;Hsu, Wei-Tse
    • International Journal of Knowledge Content Development & Technology / Vol. 9, No. 3 / pp.23-41 / 2019
  • This study combines the concept of degree centrality in complex networks with the Term Frequency * Proportional Document Frequency ($TF^*PDF$) algorithm; the combined method, called PMCN (PDF-Modified similarity and Complex Network), constructs relationship networks among sentences for writing news summaries. The PMCN method is a multi-document summarization extension of the ideas of Bun and Ishizuka (2002), who first published the $TF^*PDF$ algorithm for detecting hot topics. In their $TF^*PDF$ algorithm, Bun and Ishizuka defined the publisher of a news item as its channel. If the PDF weight of a term is higher than the weights of other terms, then the term is hotter than the other terms. However, this study attempts to develop summaries for news items. Because the $TF^*PDF$ algorithm summarizes daily news, PMCN replaces the concept of "channel" with "the date of the news event" and uses the resulting chronological ordering for a multi-document summarization algorithm, whose F-measure scores were 0.042 and 0.051 higher than LexRank for the well-known d30001t and d30003t tasks, respectively.
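The term weighting that PMCN builds on can be sketched as follows. The formula in the comments is the $TF^*PDF$ weight as it is commonly stated for Bun and Ishizuka's algorithm; it is not given in this abstract and should be checked against the original paper. The function name and input layout are hypothetical, and the PMCN-specific steps (substituting dates for channels, building the sentence network) are not shown.

```python
import math
from collections import Counter

def tf_pdf(channels):
    """Compute TF*PDF term weights.

    channels: dict mapping a channel name to a list of documents,
              each document being a list of terms.

    W_j = sum over channels c of |F_jc| * exp(n_jc / N_c), where
      |F_jc| = F_jc / sqrt(sum_k F_kc^2)   (normalized term frequency in channel c)
      n_jc   = number of documents in channel c that contain term j
      N_c    = total number of documents in channel c
    """
    weights = Counter()
    for docs in channels.values():
        freq = Counter(term for doc in docs for term in doc)
        norm = math.sqrt(sum(f * f for f in freq.values()))
        doc_freq = Counter(term for doc in docs for term in set(doc))
        for term, f in freq.items():
            weights[term] += (f / norm) * math.exp(doc_freq[term] / len(docs))
    return dict(weights)

channels = {
    "c1": [["a", "b"], ["a"]],
    "c2": [["a"], ["c"]],
}
weights = tf_pdf(channels)
```

In this toy input, term "a" appears in both channels and in most documents, so it receives the highest weight.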

The XML Compression Algorithm Supporting Query Processing For Compressed Documents

  • 강영준;이석재;유재수
    • Proceedings of the Korea Contents Association Conference / Proceedings of the 2003 Fall Conference of the Korea Contents Association / pp.195-203 / 2003
  • With the spread of the Internet, digitalization and the shift to knowledge-based information are in progress. In particular, numerous users create various works and use services on the web, and most of these works use XML. XML facilitates document reuse because it separates content from style. It also allows the logical structure of a document to be redefined to suit the developer's requirements. However, an XML document is much larger than a plain text document because it carries document type information and adds numerous tags to represent the document's structure. To make use of the limited storage of palmtops, PDAs, and similar devices, documents must be compressed and handled efficiently. Research on compression techniques for handling XML documents efficiently has recently been under way to address this problem, but existing work does not support query processing over the compressed documents. In this paper, we design and implement an XML compression algorithm that compresses XML documents and processes queries on the compressed documents faster and more efficiently than previous techniques.


XML Document Clustering Based on Sequential Pattern

  • Hwang, Jeong-Hee;Ryu, Keun-Ho
    • The KIPS Transactions:PartD / Vol. 10D, No. 7 / pp.1093-1102 / 2003
  • As Internet use grows, the amount of information is increasing rapidly, and XML, a standard for web data, offers flexible data representation. Web-based electronic document systems such as EDMS (Electronic Document Management System) and ebXML (e-business extensible Markup Language) have therefore been adopting XML as the standard for documents and their exchange. Research is thus required on methods that can manage and search structured XML documents effectively. In this paper, we propose a clustering method based on the structural similarity among many XML documents, using typical structures extracted from each document by sequential pattern mining in a pre-clustering process. The proposed algorithm improves clustering accuracy by computing a cost that considers cluster cohesion and inter-cluster similarity.
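To make "structural similarity" concrete, the sketch below compares two XML documents by the sets of root-to-node tag paths they contain, using Jaccard similarity. This is a deliberately simpler substitute for the sequential pattern mining and cost function described in the abstract, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    """Return the set of root-to-node tag paths, ignoring text and attributes."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, ())
    return paths

def structural_similarity(xml_a, xml_b):
    """Jaccard similarity of the two documents' tag-path sets (1.0 = same structure)."""
    paths_a, paths_b = tag_paths(xml_a), tag_paths(xml_b)
    return len(paths_a & paths_b) / len(paths_a | paths_b)

doc_a = "<doc><title/><body><p/></body></doc>"
doc_b = "<doc><title/><body><p/><p/></body></doc>"
doc_c = "<doc><title/><author/></doc>"
```

Documents `doc_a` and `doc_b` differ only in repeated elements and score as structurally identical, while `doc_c` shares only part of the structure.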

Implementation of Digital Document Management DRM System with OMA Structure

  • Shin Young-Chan;Choi Hyo-Sik;Kim Yong-Goo;Choi Seoko-Jin;Ryou Jae-Cheol
    • Journal of the Korea Institute of Information Security & Cryptology / Vol. 16, No. 2 / pp.45-54 / 2006
  • As digital documents come into widespread use in various fields, control over their usage is needed. Digital Rights Management (DRM) will therefore become a key component of digital document systems; without a proper DRM system for digital documents, there is a real risk of losing important information when an attacker intrudes into a critical system. This paper designs and implements a digital document DRM system based on the OMA (Open Mobile Alliance) DRM model and OpenOffice. The system is designed to provide both appropriate security and document compatibility.

Object detection in financial reporting documents for subsequent recognition

  • Sokerin, Petr;Volkova, Alla;Kushnarev, Kirill
    • International journal of advanced smart convergence / Vol. 10, No. 1 / pp.1-11 / 2021
  • Document page segmentation is an important step in building a quality optical character recognition module. The study examined existing work on page segmentation and focused on developing a segmentation model that has greater functional significance for application in an organization, as well as broad capabilities for managing the quality of the model. The main problems of document segmentation were highlighted, which include complex backgrounds and intersecting objects. As classes for detection, not only the classic text, table, and figure were selected, but also additional types, such as signature, logo, and table without borders (or with partially missing borders). This made it possible to pose a non-trivial task of detecting non-standard document elements. The authors compared existing neural network architectures for object detection based on published research data. The most suitable architecture was RetinaNet. To ensure that the quality of the model can be controlled, a method based on neural network modeling using the RetinaNet architecture is proposed. During the study, several models were built, and their quality was assessed on the test sample using the mean average precision metric. The best result among the constructed algorithms was shown by a model that includes four neural networks: the first detects tables and tables without borders, the second seals and signatures, the third pictures and logos, and the fourth text. The analysis revealed that the approach based on four neural networks showed the best results on the test sample for most detection classes, in accordance with the objectives of the study. The method proposed in the article can be used to recognize other objects. A promising direction for continuing the analysis is the segmentation of tables, where the areas of the table that differ in function act as classes: heading, cell with a name, cell with data, and empty cell.
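Mean average precision for object detection is normally built on intersection over union (IoU), which decides whether a predicted box matches a ground-truth box. The sketch below shows that building block only; the abstract does not state which IoU threshold or box format the authors used, so the `(x_min, y_min, x_max, y_max)` layout and the function name are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])

    inter_w = max(0.0, inter_x2 - inter_x1)
    inter_h = max(0.0, inter_y2 - inter_y1)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

predicted_table = (0, 0, 2, 2)
ground_truth_table = (1, 1, 3, 3)
overlap = iou(predicted_table, ground_truth_table)
```

A detection is typically counted as correct when its IoU with a ground-truth box of the same class exceeds a chosen threshold, and average precision is then computed per class from the ranked detections.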