• Title/Summary/Keyword: Document Image Retrieval

Development of an Automated ESG Document Review System using Ensemble-Based OCR and RAG Technologies

  • Eun-Sil Choi
    • Journal of the Korea Society of Computer and Information / v.29 no.9 / pp.25-37 / 2024
  • This study proposes a novel automation system that integrates Optical Character Recognition (OCR) and Retrieval-Augmented Generation (RAG) technologies to make the ESG (Environmental, Social, and Governance) document review process more efficient. In the OCR stage, the proposed system improves text recognition accuracy by applying an ensemble model-based image preprocessing algorithm and hybrid information extraction models. In the RAG pipeline, it improves the reliability of information retrieval and answer generation by using layout analysis algorithms, re-ranking algorithms, and ensemble retrievers. The system's performance was evaluated using certificate images from online portals and corporate internal regulations obtained from various sources, such as company websites. The results showed an accuracy of 93.8% for certification reviews and 92.2% for company regulation reviews, indicating that the proposed system effectively supports human evaluators in the ESG assessment process.
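
The abstract above mentions ensemble retrievers but does not say how their results are combined. As a minimal sketch of one common option, the following fuses two ranked lists with reciprocal rank fusion (RRF). RRF, the constant `k=60`, and the document ids are assumptions chosen for illustration, not details taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) in every list that contains it;
    the scores are summed across lists. `k` dampens the weight of top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword retriever and a dense (embedding) retriever.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the usual motivation for combining retrievers before a re-ranking step.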

Image Based Text Matching Using Local Crowdedness and Hausdorff Distance (지역 밀집도 및 Hausdorff 거리를 이용한 영상기반 텍스트 매칭)

  • Son, Hwa-Jeong;Kim, Ji-Soo;Park, Mi-Seon;Yoo, Jae-Myeong;Kim, Soo-Hyung
    • The Journal of the Korea Contents Association / v.6 no.10 / pp.134-142 / 2006
  • In this paper, we investigate whether the Hausdorff distance, which is used to measure image similarity, is also effective for document retrieval. The proposed method uses local crowdedness and a Hausdorff distance to locate text images by determining whether a pair of images scanned at different times comes from the same text. To reduce processing time, one of the disadvantages of the Hausdorff distance algorithm, we adopt local crowdedness for feature point extraction. We apply the proposed method to 190 same-class pairs and 190 different-class pairs collected from postal envelope images. The results show that the modified Hausdorff distance proposed in this paper performs well in locating the text region and calculating the degree of similarity between two images. Accuracy improves by 2.7% and 9.0% compared with a binary correlation method and the original Hausdorff distance method, respectively.
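
For readers unfamiliar with the measure, here is a minimal sketch of the classical Hausdorff distance between two sets of feature points, together with a mean-based variant that is less sensitive to outliers. The abstract does not describe the paper's own modification, so the mean-based variant and the sample points are assumptions for illustration only.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Classical (symmetric) Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def directed_mean(a, b):
    """Average distance from a point in `a` to its nearest point in `b`."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def mean_hausdorff(a, b):
    """Mean-based variant: a single outlier no longer dominates the result."""
    return max(directed_mean(a, b), directed_mean(b, a))


# Hypothetical feature points from two scanned text images.
points_a = [(0, 0), (1, 0), (2, 0)]
points_b = [(0, 0), (1, 0), (2, 3)]

print(hausdorff(points_a, points_b))       # driven by the outlier (2, 3)
print(mean_hausdorff(points_a, points_b))  # averaged over all points
```

Both functions compare every point pair, so the cost grows with the product of the set sizes, which is why reducing the number of feature points matters for processing time.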

Jointly Image Topic and Emotion Detection using Multi-Modal Hierarchical Latent Dirichlet Allocation

  • Ding, Wanying;Zhu, Junhuan;Guo, Lifan;Hu, Xiaohua;Luo, Jiebo;Wang, Haohong
    • Journal of Multimedia Information System / v.1 no.1 / pp.55-67 / 2014
  • Image topic and emotion analysis is an important component of online image retrieval, which has become very popular in the fast-growing social media community. However, because of the gaps between images and texts, little work in the literature detects an image's topics and emotions in a unified framework, although topics and emotions are two levels of semantics that often work together to describe an image comprehensively. In this work, we propose a unified model, the Joint Topic/Emotion Multi-Modal Hierarchical Latent Dirichlet Allocation (JTE-MMHLDA) model, which extends the previous LDA, mmLDA, and JST models to capture topic and emotion information at the same time from heterogeneous data. Specifically, a two-level graphical structured model is built so that topics and emotions are shared across the whole document collection. Experimental results on a Flickr dataset indicate that the proposed model efficiently discovers images' topics and emotions, significantly outperforming the text-only system by 4.4% and the vision-only system by 18.1% in topic detection, and the text-only system by 7.1% and the vision-only system by 39.7% in emotion detection.

Improvement OCR Algorithm for Efficient Book Catalog Retrieval Technology (효과적인 도서목록 검색을 위한 개선된 OCR알고리즘에 관한 연구)

  • HeWen, HeWen;Baek, Young-Hyun;Moon, Sung-Ryong
    • Journal of the Institute of Electronics Engineers of Korea CI / v.47 no.1 / pp.152-159 / 2010
  • Existing character recognition algorithms recognize characters only under simple conditions. Their recognition rates often drop drastically when the input document image is of low quality or contains rotated text or text in various fonts or sizes, because of external noise or data loss. This paper proposes an optical character recognition algorithm that uses bicubic interpolation for catalog retrieval when the input image is blurred or contains rotated text or text in various fonts and sizes. The algorithm consists of a detection part and a recognition part. The detection part applies the Roberts and Hausdorff distance algorithms to detect the book catalog correctly. The recognition part applies bicubic interpolation to compensate for data loss caused by low quality and by text in various fonts and sizes. Rotation is then applied to the interpolated image to correct slant. Experimental results show that the proposed method improves the recognition rate by 6% with a search time of 1.077 s.

The Project and Prospects of Old Documents Information Systems in Korea (한국 고문헌 정보시스템의 구축 및 전망)

  • Kang Soon-Ae
    • Journal of the Korean Society for Library and Information Science / v.31 no.4 / pp.83-112 / 1997
  • The purpose of this paper is to describe the issues involved in planning the best information systems for Korean old books. It analyzes: i) the range of definitions of old books, ii) their characteristics and the current state of processing old documents, iii) the scope of automation and the building up of library institutions, iv) the construction of Korean old book information systems, v) a case study, and vi) the evaluation and vision of the system. The old document information system has been organized on the basis of library network systems led by the National Central Library; the implemented system has subsystems such as a cataloging system, an annotation system, a full-text or image-based system, and a retrieval system. The case study presents two examples, built at the National Central Library and Sung Kyun Kwan University. Finally, the paper provides evaluation criteria and a vision for libraries designing old document information systems.

Auto Detection System of Personal Information based on Images and Document Analysis (이미지와 문서 분석을 통한 개인 정보 자동 검색 시스템)

  • Cho, Jeong-Hyun;Ahn, Cheol-Woong
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.15 no.5 / pp.183-192 / 2015
  • This paper proposes a Personal Information Auto Detection (PIAD) system to prevent leakage of personal information in document and image files, for use by mobile service providers. The proposed system automatically detects images and documents that contain personal information and shows the result to the user. PIAD is divided into a selection step, for fast and accurate image retrieval, and an analysis step composed of SURF, erosion and dilation, and the FindContours algorithm. With the selection and analysis steps, the proposed PIAD system achieved more than 98% accuracy, detecting 267 of 272 images.

Design of Multimedia Document Retrieval System Using Relations between Media (미디어간 상호 연관성을 이용한 멀티미디어 문서 검색 시스템의 설계)

  • 이성환;유채곤;이원호;황치정
    • Proceedings of the Korean Information Science Society Conference / 1998.10b / pp.274-276 / 1998
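  • Multimedia is widely used in many fields as a means of conveying information effectively. Research is therefore needed on techniques for efficiently storing, retrieving, and presenting multimedia documents. The various media used within a multimedia document, such as audio, video, image, and text, are related not only spatially and temporally within the document but also in terms of content. In this paper, we extract the features of and relations between the media used in multimedia documents, apply segmentation techniques suited to each medium's characteristics in order to manage the media efficiently, and design a system that stores, retrieves, and presents them while taking their content relations into account.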

An Image-based Word Matching Method for Large volume Printed Hangul Document Retrieval (대용량 인쇄 한글 문서 검색을 위한 영상 기반 단어 매칭 방법)

  • 진영범;오일석
    • Proceedings of the Korean Information Science Society Conference / 2000.10b / pp.461-463 / 2000
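  • Searching for keywords in machine-printed document images is a core technology essential to many applications, but converting the images to text manually or with OCR software is limited by its high cost. Because original documents are now often stored as images, this paper adopts a retrieval method based on image-based matching. Features are the most important factor in character or word matching. So that the method can be used for large-volume Hangul document retrieval, as in digital libraries where the words to be matched number from tens of millions to billions, this paper proposes a fast retrieval method using four-directional profile features, which are relatively simple to extract and whose dimensionality is easy to adjust. Experiments showed that meaningful retrieval performance can be obtained even with simple features of about 8 dimensions.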
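
The abstract does not define its four-directional profile feature, so the following is only a sketch under one plausible reading: for each side of a binary word image (left, right, top, bottom), measure how far each row or column runs before the first ink pixel, then average those distances into a few bins per side. With two bins per side this gives an 8-dimensional vector. The function names, the pooling scheme, and the L1 distance are all assumptions for illustration.

```python
def first_ink(line):
    """Number of background pixels before the first ink pixel (len if none)."""
    for index, value in enumerate(line):
        if value:
            return index
    return len(line)


def pool(profile, bins, scale):
    """Average `profile` over `bins` equal-width chunks, normalised to [0, 1]."""
    n = len(profile)
    pooled = []
    for b in range(bins):
        chunk = profile[b * n // bins:(b + 1) * n // bins]
        pooled.append(sum(chunk) / (len(chunk) * scale))
    return pooled


def profile_features(image, bins=2):
    """Build a 4 * bins feature vector from a binary image (1 = ink)."""
    height, width = len(image), len(image[0])
    if height < bins or width < bins:
        raise ValueError("image is smaller than the number of bins")
    columns = [list(col) for col in zip(*image)]
    left = [first_ink(row) for row in image]
    right = [first_ink(row[::-1]) for row in image]
    top = [first_ink(col) for col in columns]
    bottom = [first_ink(col[::-1]) for col in columns]
    return (pool(left, bins, width) + pool(right, bins, width)
            + pool(top, bins, height) + pool(bottom, bins, height))


def l1_distance(u, v):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))


# Hypothetical 4x4 word images: a ring-shaped glyph and a blank patch.
ring = [[0, 1, 1, 0],
        [1, 0, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 0]]
blank = [[0] * 4 for _ in range(4)]

ring_features = profile_features(ring)
blank_features = profile_features(blank)
print(len(ring_features), l1_distance(ring_features, blank_features))
```

Changing `bins` adjusts the dimensionality without touching the rest of the code, which matches the abstract's point that the feature size is easy to tune.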

A Review of Access Conditions of the W3 and the Inline Image/Sound Processing of HTML Document for Utilizing of the Virtual Library (W3 가상도서관 활용을 위한 HTML 문서작성과 이미지/사운드 처리)

  • 유사라
    • Journal of the Korean Society for Information Management / v.12 no.1 / pp.45-66 / 1995
  • Information users in the mid-1990s who know the Internet and its useful information services now expect virtual library services. In particular, the increasing demand for hypertext and hypermedia information in Internet settings has centered on the W3 with the man-page information. Accordingly, this paper describes the access methods along with brief concepts of the W3 and explains URLs and HTML. It also gives the retrieval layouts of unformatted data, including images and sounds, and then lists information sources and software for W3 clients and servers to help readers keep up with the most recently posted version of W3.

Recognition of Word-level Attributes in Machine-printed Document Images (인쇄 문서 영상의 단어 단위 속성 인식)

  • Gwak, Hui-Gyu;Kim, Su-Hyeong
    • Journal of KIISE:Software and Applications / v.28 no.5 / pp.412-421 / 2001
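  • This paper proposes a method for extracting attribute information for the individual words in a document image. Word-level attribute recognition has a variety of applications, such as improving the accuracy and speed of word image matching, raising the recognition rate of OCR systems, and reproducing documents, and through meta-information extraction it can be used for image retrieval and summary generation. The proposed system considers five attributes of a word image: language (Korean, English), style (bold, italic, regular, underlined), character size (10, 12, 14 points), number of characters (Korean: 2, 3, 4, 5; English: 4, 5, 6, 7, 8, 9, 10), and typeface (Myeongjo, Gothic). For attribute recognition, it uses two features for language recognition, three for style recognition, one each for character size and character count, one for Korean typeface recognition, and two for English typeface recognition. The classifiers, a neural network, a quadratic discriminant function (QDF), and a linear discriminant function (LDF), are arranged hierarchically. Experiments on 26,400 word images combining the five attributes demonstrate that the proposed method achieves excellent attribute recognition performance with only a small number of features.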
