• Title/Summary/Keyword: Text features

Web Image Caption Extraction using Positional Relation and Lexical Similarity (위치적 연관성과 어휘적 유사성을 이용한 웹 이미지 캡션 추출)

  • Lee, Hyoung-Gyu;Kim, Min-Jeong;Hong, Gum-Won;Rim, Hae-Chang
    • Journal of KIISE: Software and Applications / v.36 no.4 / pp.335-345 / 2009
  • In this paper, we propose a new web image caption extraction method that considers the positional relation between a caption and an image and the lexical similarity between a caption and the main text containing it. The positional relation represents where the caption is located relative to the corresponding image, in terms of distance and direction. The lexical similarity indicates how likely the main text is to generate the caption of the image. Compared with previous image caption extraction approaches, which use only the independent features of images and captions, the proposed approach improves caption extraction recall and precision, with a 28% gain in F-measure, by adding positional relation and lexical similarity features.
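
The two feature families described in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption: the lexical score is a smoothed unigram likelihood of the caption under the main text, and the positional features are the distance and angle between box centres. The function names, the add-alpha smoothing, and the `(x, y, width, height)` box layout are all hypothetical.

```python
import math
from collections import Counter


def caption_log_likelihood(caption_tokens, main_text_tokens, alpha=1.0):
    """Average log-probability of the caption under a unigram model of the main text.

    Add-alpha smoothing is an assumption of this sketch, not the paper's model.
    """
    if not caption_tokens:
        return float("-inf")
    counts = Counter(main_text_tokens)
    vocab = set(main_text_tokens) | set(caption_tokens)
    total = len(main_text_tokens) + alpha * len(vocab)
    log_p = sum(math.log((counts[t] + alpha) / total) for t in caption_tokens)
    return log_p / len(caption_tokens)


def positional_features(caption_box, image_box):
    """Distance and direction (degrees) from the image centre to the caption centre.

    Boxes are (x, y, width, height) tuples in page coordinates (hypothetical layout);
    with y growing downward, 90 degrees means the caption is directly below the image.
    """
    cap_cx = caption_box[0] + caption_box[2] / 2
    cap_cy = caption_box[1] + caption_box[3] / 2
    img_cx = image_box[0] + image_box[2] / 2
    img_cy = image_box[1] + image_box[3] / 2
    dx, dy = cap_cx - img_cx, cap_cy - img_cy
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))
```

A candidate whose words also occur in the surrounding text scores higher than an unrelated one, and both scores could be fed to a classifier alongside the distance and direction values.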

BERT-based Classification Model for Korean Documents (한국어 기술문서 분석을 위한 BERT 기반의 분류모델)

  • Hwang, Sangheum;Kim, Dohyun
    • The Journal of Society for e-Business Studies / v.25 no.1 / pp.203-214 / 2020
  • Technical documents such as patents and R&D project reports need to be classified in order to understand trends in technology convergence, interdisciplinary joint research, technology development, and so on. Text mining techniques have mainly been used to classify these documents. However, text mining algorithms have the disadvantage that the features representing the technical documents must be extracted manually. In this study, we propose a BERT-based document classification model that automatically extracts document features from the text of national R&D projects and classifies the documents. We then verify the applicability and performance of the proposed model for document classification.

Development of Spatial Reference System Component with Open GIS Simple Features Specification (개방형 GIS의 단순개체 사양을 이용한 공간 기준 좌표계 컴포넌트의 개발)

  • Lee, Dae-Hee;Biun, Su-Yun;Lim, Sam-Sung
    • Journal of Korea Spatial Information System Society / v.2 no.1 s.3 / pp.57-62 / 2000
  • The Open GIS Consortium (OGC) provides the Simple Features Specification for OLE/COM, a system object technology that offers interoperability and reusability. In this research, a Spatial Reference System (SRS) component is developed with ATL based on the OGC specification. The component provides 44 map projections and transformations between different geographic coordinate systems using the seven-parameter (Bursa-Wolf) and Molodensky methods. A user can set up all the objects and attributes that make up an SRS, create an SRS, and save its settings in a predefined text format, WellKnownText. The SRS component can be easily integrated into a variety of GIS software, which reduces system development time and makes it easy to define new reference systems.
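
As an illustration of the seven-parameter (Bursa-Wolf) transformation mentioned in this abstract, here is a minimal sketch of the usual small-angle form, with three translations, three rotations, and one scale factor applied to geocentric X, Y, Z coordinates. It is not the component's ATL code. The sign of the rotation terms follows the position-vector convention, which is an assumption; some parameter sets use the opposite (coordinate-frame) convention. The function name and argument order are hypothetical.

```python
def bursa_wolf(xyz, tx, ty, tz, rx, ry, rz, scale_ppm):
    """Apply a seven-parameter datum shift to geocentric coordinates.

    xyz        -- (X, Y, Z) in metres
    tx, ty, tz -- translations in metres
    rx, ry, rz -- small rotations in radians (position-vector convention, assumed)
    scale_ppm  -- scale change in parts per million
    """
    x, y, z = xyz
    s = 1.0 + scale_ppm * 1e-6
    # Small-angle rotation matrix: [[1, -rz, ry], [rz, 1, -rx], [-ry, rx, 1]]
    x2 = tx + s * (x - rz * y + ry * z)
    y2 = ty + s * (rz * x + y - rx * z)
    z2 = tz + s * (-ry * x + rx * y + z)
    return (x2, y2, z2)
```

With all seven parameters set to zero the function returns its input unchanged, which is a quick sanity check before plugging in a published parameter set.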

Development of Vaccine with Artificial Intelligence: By Analyzing OP Code Features Based on Text and Image Dataset (OP Code 특징 기반의 텍스트와 이미지 데이터셋 연구를 통한 인공지능 백신 개발)

  • Choi, Hyo-Kyung;Lee, Se-Eun;Lee, Ju-Hyun;Hong, Rae-Young;Choi, Won-Hyok;Kim, Hyung-Jong
    • Journal of the Korea Institute of Information Security & Cryptology / v.29 no.5 / pp.1019-1026 / 2019
  • Because existing methods are limited in detecting newly introduced malware, developing artificial intelligence vaccines is becoming increasingly important. Existing artificial intelligence vaccines have low detection accuracy because they do not scan all parts of a file. In this paper, we suggest an enhanced malware detection method based on the unique OP Code features of malware files. Specifically, we tested the method with a Random Forest model trained on text datasets and an Inception V3 model trained on image datasets. As a result, the highest detection accuracy was about 80%.

An Implementation of Hangul Handwriting Correction Application Based on Deep Learning (딥러닝에 의한 한글 필기체 교정 어플 구현)

  • Jae-Hyeong Lee;Min-Young Cho;Jin-soo Kim
    • Journal of Korea Society of Industrial Information Systems / v.29 no.3 / pp.13-22 / 2024
  • With the proliferation of digital devices, the significance of handwritten text in daily life is gradually diminishing. As the use of keyboards and touch screens increases, a decline in Korean handwriting quality is being observed across a broad spectrum of Korean documents written by everyone from young students to adults. However, Korean handwriting remains necessary for many documents, as it retains unique individual features while ensuring readability. To this end, this paper implements an application designed to improve and correct the quality of handwritten Korean script. The application uses the CRAFT (Character-Region Awareness For Text Detection) model to detect handwriting areas and VGG-Feature-Extraction as a deep learning model to learn the features of the handwritten script. The application also presents the reliability of the user's handwritten Korean script on a syllable-by-syllable basis as a recognition rate and suggests the most similar fonts among the candidate fonts. Furthermore, various experiments confirm that the proposed application provides an excellent recognition rate comparable to conventional commercial OCR systems.

Scene Text Extraction in Natural Images Using Color Variance Feature (색 변화 특징을 이용한 자연이미지에서의 장면 텍스트 추출)

  • 송영자;최영우
    • Proceedings of the IEEK Conference / 2003.07e / pp.1835-1838 / 2003
  • Text in natural images contains significant and detailed information about the images. To extract such text correctly, we suggest a text extraction method that uses a color variance feature. In general, text in an image differs in color from its background. If we express these variations in the three-dimensional RGB color space, we can emphasize text regions that are hard to capture with methods that use intensity variations in gray-level images. The results are robust even for images affected by lighting variations. In this paper, color variation is measured by color variance. First, horizontal and vertical variance images are obtained independently, and we find that text regions have high variance in both directions. Then, the two images are logically ANDed to remove non-text components that have high variance in only one direction. We applied the proposed method to many kinds of natural images and confirmed that the proposed feature helps find text regions that can be missed with the following features: intensity variations in gray-level images and/or color continuity in color images.
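
The variance-and-AND step described in this abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the image is a list of rows of `(r, g, b)` tuples, color variance is taken as the sum of per-channel variances over a short window, and the window size and threshold are hypothetical values.

```python
def _color_variance(pixels):
    """Sum of per-channel variances for a list of (r, g, b) tuples."""
    n = len(pixels)
    total = 0.0
    for channel in range(3):
        mean = sum(p[channel] for p in pixels) / n
        total += sum((p[channel] - mean) ** 2 for p in pixels) / n
    return total


def text_mask(image, half_window=1, threshold=100.0):
    """Mark pixels whose horizontal AND vertical color variance both exceed the threshold.

    half_window and threshold are illustrative values, not taken from the paper.
    """
    height, width = len(image), len(image[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            row_window = [image[y][i]
                          for i in range(max(0, x - half_window), min(width, x + half_window + 1))]
            col_window = [image[j][x]
                          for j in range(max(0, y - half_window), min(height, y + half_window + 1))]
            horizontal = _color_variance(row_window) > threshold
            vertical = _color_variance(col_window) > threshold
            mask[y][x] = horizontal and vertical  # logical AND of the two variance images
    return mask
```

A pattern of vertical stripes has high horizontal variance but no vertical variance, so the AND rejects it, while a checkerboard-like region varies in both directions and is kept.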

Properties of chi-square statistic and information gain for feature selection of imbalanced text data (불균형 텍스트 데이터의 변수 선택에 있어서의 카이제곱통계량과 정보이득의 특징)

  • Mun, Hye In;Son, Won
    • The Korean Journal of Applied Statistics / v.35 no.4 / pp.469-484 / 2022
  • Since a large text corpus contains on the order of a hundred thousand unique words, text data is a typical example of high-dimensional data. Various feature selection methods have therefore been proposed for dimension reduction. Feature selection can improve prediction accuracy, and the reduced data size also improves computational efficiency. The chi-square statistic and the information gain are two of the most popular measures for identifying interesting terms in text data. In this paper, we investigate the theoretical properties of the chi-square statistic and the information gain. We show that the two filtering metrics share theoretical properties such as non-negativity and convexity. However, they differ in that the information gain is prone to select more negative features than the chi-square statistic in imbalanced text data.
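
For reference, both metrics can be computed from a 2x2 term-by-class contingency table. The sketch below uses the standard textbook definitions rather than anything specific to this paper. The cell names `a`, `b`, `c`, `d` and the helper `is_negative_feature`, which treats a term as negative when it occurs less often in the class than expected, are assumptions made for illustration.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: in-class docs with the term      b: out-of-class docs with the term
    c: in-class docs without the term   d: out-of-class docs without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def _entropy(counts):
    """Shannon entropy (bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k)


def information_gain(a, b, c, d):
    """Class entropy minus the expected class entropy given term presence/absence."""
    n = a + b + c + d
    h_class = _entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * _entropy([a, b]) + ((c + d) / n) * _entropy([c, d])
    return h_class - h_given_term


def is_negative_feature(a, b, c, d):
    """True when the term is negatively associated with the class (ad < bc)."""
    return a * d < b * c
```

Both scores are zero when the term and the class are independent and grow with the strength of association in either direction, which is why the sign has to be checked separately.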

Feature Extraction to Detect Hoax Articles (낚시성 인터넷 신문기사 검출을 위한 특징 추출)

  • Heo, Seong-Wan;Sohn, Kyung-Ah
    • Journal of KIISE / v.43 no.11 / pp.1210-1215 / 2016
  • Readership of online newspapers has grown with the proliferation of smart devices. However, fierce competition between Internet newspaper companies has led to a large increase in the number of hoax articles. Hoax articles are those whose title does not convey the content of the main story, which gives readers the wrong information about the content. We note that hoax articles have certain characteristics, such as unnecessary celebrity quotations, a mismatch between the title and the content, or incomplete sentences. Based on these, we extract and validate features to identify hoax articles. We build a large-scale training dataset by analyzing text keywords in replies to articles and thus extract five effective features. We evaluate a support vector machine classifier on the extracted features and observe 92% accuracy on our validation set. We also present a selective bigram model for measuring the consistency between the title and the content, which can be used effectively to analyze short texts in general.

Block Classification of Document Images by Block Attributes and Texture Features (블록의 속성과 질감특징을 이용한 문서영상의 블록분류)

  • Jang, Young-Nae;Kim, Joong-Soo;Lee, Cheol-Hee
    • Journal of Korea Multimedia Society / v.10 no.7 / pp.856-868 / 2007
  • We propose an effective method for block classification in a document image. The gray-level document image is converted to a binary image for block segmentation. The binary image is smoothed to find the location and size of each block, and the inner block height of each block is obtained during this smoothing. The gray-level image is divided into blocks using this location information. SGLDMs (spatial gray level dependence matrices) are built from each gray-level document block, and seven second-order statistical texture features, which capture the document attributes, are extracted from the SGLDM for the (0,1) direction. Document image blocks are classified into two groups, text and non-text, by applying the nearest neighbor rule to the inner block height. The seven texture features extracted from the SGLDM are then used to classify the blocks into five detailed categories: small font, large font, table, graphic, and photo. These document blocks can be used not only for structure analysis in document recognition but also in various application areas.
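
To make the SGLDM step concrete, the sketch below builds a normalized co-occurrence matrix for the (0,1) displacement, read here as zero rows down and one column right, and derives a few common second-order statistics from it. The abstract does not name its seven features, so the four shown here (energy, contrast, entropy, homogeneity) are an assumed, partial selection. The function names and the non-symmetric counting are also choices of this sketch.

```python
import math


def sgldm(block, levels, dy=0, dx=1):
    """Normalized co-occurrence matrix of gray levels for displacement (dy, dx).

    block is a list of rows of integer gray levels in the range [0, levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(block), len(block[0])
    pairs = 0
    for y in range(height):
        for x in range(width):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < height and 0 <= x2 < width:
                counts[block[y][x]][block[y2][x2]] += 1
                pairs += 1
    if pairs == 0:
        raise ValueError("block is too small for the requested displacement")
    return [[c / pairs for c in row] for row in counts]


def texture_features(p):
    """A few second-order statistics of a normalized co-occurrence matrix p."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    return {
        "energy": sum(v * v for _, _, v in cells),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
    }
```

A flat block yields maximal energy and zero contrast, whereas a block that alternates between gray levels yields high contrast, which is the kind of difference a nearest neighbor classifier can exploit.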

A Viewer Preference Model Based on Physiological Feedback (CogTV를 위한 생체신호기반 시청자 선호도 모델)

  • Park, Tae-Suh;Kim, Byoung-Hee;Zhang, Byoung-Tak
    • Journal of the Korean Institute of Intelligent Systems / v.24 no.3 / pp.316-322 / 2014
  • A movie recommendation system is proposed that learns a viewer's preference model from multimodal features of video content and the implicit responses they evoke in the viewer, in a synchronized manner. In this system, facial expression, body posture, and physiological signals are measured to estimate the viewer's affective states in response to stimuli consisting of low-level and affective features from video, audio, and text streams. Experimental results show that a viewer's arousal response, measured by electrodermal activity, can be predicted from the auditory and text features of a video stimulus, in order to estimate how interesting the video is.