• Title/Summary/Keyword: text information

Search Result 4,361, Processing Time 0.035 seconds

A Novel Character Segmentation Method for Text Images Captured by Cameras

  • Lue, Hsin-Te;Wen, Ming-Gang;Cheng, Hsu-Yung;Fan, Kuo-Chin;Lin, Chih-Wei;Yu, Chih-Chang
    • ETRI Journal
    • /
    • v.32 no.5
    • /
    • pp.729-739
    • /
    • 2010
  • Due to the rapid development of mobile devices equipped with cameras, instant translation of any text seen in any context is possible. Mobile devices can serve as a translation tool by recognizing the texts presented in the captured scenes. Images captured by cameras will embed more external or unwanted effects which need not to be considered in traditional optical character recognition (OCR). In this paper, we segment a text image captured by mobile devices into individual single characters to facilitate OCR kernel processing. Before proceeding with character segmentation, text detection and text line construction need to be performed in advance. A novel character segmentation method which integrates touched character filters is employed on text images captured by cameras. In addition, periphery features are extracted from the segmented images of touched characters and fed as inputs to support vector machines to calculate the confident values. In our experiment, the accuracy rate of the proposed character segmentation system is 94.90%, which demonstrates the effectiveness of the proposed method.

A bio-text mining system using keywords and patterns in a grid environment

  • Kwon, Hyuk-Ryul;Jung, Tae-Sung;Kim, Kyoung-Ran;Jahng, Hye-Kyoung;Cho, Wan-Sup;Yoo, Jae-Soo
    • Proceedings of the Korea Society for Industrial Systems Conference
    • /
    • 2007.02a
    • /
    • pp.48-52
    • /
    • 2007
  • As huge amount of literature including biological data is being generated after post genome era, it becomes difficult for researcher to find useful knowledge from the biological databases. Bio-text mining and related natural language processing technique are the key issues in the intelligent knowledge retrieval from the biological databases. We propose a bio-text mining technique for the biologists who find Knowledge from the huge literature. At first, web robot is used to extract and transform related literature from remote databases. To improve retrieval speed, we generate an inverted file for keywords in the literature. Then, text mining system is used for extracting given knowledge patterns and keywords. Finally, we construct a grid computing environment to guarantee processing speed in the text mining even for huge literature databases. In the real experiment for 10,000 bio-literatures, the system shows 95% precision and 98% recall.

  • PDF

Detecting Spam Data for Securing the Reliability of Text Analysis (텍스트 분석의 신뢰성 확보를 위한 스팸 데이터 식별 방안)

  • Hyun, Yoonjin;Kim, Namgyu
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.42 no.2
    • /
    • pp.493-504
    • /
    • 2017
  • Recently, tremendous amounts of unstructured text data that is distributed through news, blogs, and social media has gained much attention from many researchers and practitioners as this data contains abundant information about various consumers' opinions. However, as the usefulness of text data is increasing, more and more attempts to gain profits by distorting text data maliciously or nonmaliciously are also increasing. This increase in spam text data not only burdens users who want to obtain useful information with a large amount of inappropriate information, but also damages the reliability of information and information providers. Therefore, efforts must be made to improve the reliability of information and the quality of analysis results by detecting and removing spam data in advance. For this purpose, many studies to detect spam have been actively conducted in areas such as opinion spam detection, spam e-mail detection, and web spam detection. In this study, we introduce core concepts and current research trends of spam detection and propose a methodology to detect the spam tag of a blog as one of the challenging attempts to improve the reliability of blog information.

A Implementation of Keyword Extraction Algorithm Using Anchor Text for Web's Conceptual Knowledge (웹의 개념지식을 위한 Anchor Text에서의 키워드 추출 알고리즘의 구현)

  • 조남덕;배환국;김기태
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2000.10b
    • /
    • pp.72-74
    • /
    • 2000
  • 인터넷을 효과적으로 검색하기 위하여 검색엔진을 많이 이용하고 있다. 그런데 문서의 키워드를 추출할 적에 지금까지는 Anchor Text를 염두에 두지 않았었다. Anchor Text는 사람이 직접 요약한 것이고(요약성), 하이퍼링크를 포함하는 웹 문서에 반드시 존재하므로(보편성) 그 하이퍼링크가 가리키는 곳의 문서의 키워드를 추출에 적합한 용도가 될 수 있다. 웹 그래프는 이러한 Anchor Text를 이용하여 키워드를 추출함으로써 문서와 문서간, 단어와 단어간의 관계(연관성)까지도 나타내 줄 수 있게 한 검색 엔진 시스템이다. 그러나 Anchor Text 자체가 본문의 내용이 아니고, Anchor Text를 작성한 사람에 따라 다르게 작성되며, 본문의 내용과 무관한 내용도 작성할 수 있다. 따라서 Anchor Text 자체를 어떠한 여과 없이 문서의 키워드로 받아들이긴 힘들다. 본 논문에서는 TFIDF를 통해 좀 더 정확성이 있는 키워드를 추출하였다.

  • PDF

Unstructured Data Quantification Scheme Based on Text Mining for User Feedback Extraction (사용자 의견 추출을 위한 텍스트 마이닝 기반 비정형 데이터 정량화 방안)

  • Jo, Jung-Heum;Chung, Yong-Taek;Choi, Seong-Wook;Ok, Changsoo
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.41 no.4
    • /
    • pp.131-137
    • /
    • 2018
  • People write reviews of numerous products or services on the Internet, in their blogs or community bulletin boards. These unstructured data contain important emotions and opinions about the author's product or service, which can provide important information for future product design or marketing. However, this text-based information cannot be evaluated quantitatively, and thus they are difficult to apply to mathematical models or optimization problems for product design and improvement. Therefore, this study proposes a method to quantitatively extract user's opinion or preference about a specific product or service by utilizing a lot of text-based information existing on the Internet or online. The extracted unstructured text information is decomposed into basic unit words, and positive rate is evaluated by using existing emotional dictionaries and additional lists proposed in this study. This can be a way to effectively utilize unstructured text data, which is being generated and stored in vast quantities, in product or service design. Finally, to verify the effectiveness of the proposed method, a case study was conducted using movie review data retrieved from a portal website. By comparing the positive rates calculated by the proposed framework with user ratings for movies, a guideline on text mining based evaluation of unstructured data is provided.

GNI Corpus Version 1.0: Annotated Full-Text Corpus of Genomics & Informatics to Support Biomedical Information Extraction

  • Oh, So-Yeon;Kim, Ji-Hyeon;Kim, Seo-Jin;Nam, Hee-Jo;Park, Hyun-Seok
    • Genomics & Informatics
    • /
    • v.16 no.3
    • /
    • pp.75-77
    • /
    • 2018
  • Genomics & Informatics (NLM title abbreviation: Genomics Inform) is the official journal of the Korea Genome Organization. Text corpus for this journal annotated with various levels of linguistic information would be a valuable resource as the process of information extraction requires syntactic, semantic, and higher levels of natural language processing. In this study, we publish our new corpus called GNI Corpus version 1.0, extracted and annotated from full texts of Genomics & Informatics, with NLTK (Natural Language ToolKit)-based text mining script. The preliminary version of the corpus could be used as a training and testing set of a system that serves a variety of functions for future biomedical text mining.

Bilingual document analysis and character segmentation using connected components (연결요소를 이용한 한.영 혼용문서의 구조분석 및 낱자분리)

  • 김민기;권영빈;한상용
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.22 no.3
    • /
    • pp.410-422
    • /
    • 1997
  • In this paper, we descried a bottom-up document structure analysis method in bilingual Korean-English document. We proposed a character segmentation method based on the layout information of connected component of each character. In many researches, a document has been analyzed into text blocks and graphics. We analyzed a document into four parts: text, table, graphic, and separator. A text is recursively subdivided into text blocks, text lines, words, and characters. To extract the character in bilingual text, we proposed a new method of word of word separation of Korean or English. Futhermore, we used a character merging and segmentation method in accordance with the properties of Hangul on the Korean word blocks. Experimental results on the various documents show that the proposed method is very effectively operated on the document structure analysis and the character segmentation.

  • PDF

An Extracting Text Area Using Adaptive Edge Enhanced MSER in Real World Image (실세계 영상에서 적응적 에지 강화 기반의 MSER을 이용한 글자 영역 추출 기법)

  • Park, Youngmok;Park, Sunhwa;Seo, Yeong Geon
    • Journal of Digital Contents Society
    • /
    • v.17 no.4
    • /
    • pp.219-226
    • /
    • 2016
  • In our general life, what we recognize information with our human eyes and use it is diverse and massive. But even the current technologies improved by artificial intelligence are exorbitantly deficient comparing to human visual processing ability. Nevertheless, many researchers are trying to get information in everyday life, especially concentrate effort on recognizing information consisted of text. In the fields of recognizing text, to extract the text from the general document is used in some information processing fields, but to extract and recognize the text from real image is deficient too much yet. It is because the real images have many properties like color, size, orientation and something in common. In this paper, we applies an adaptive edge enhanced MSER(Maximally Stable Extremal Regions) to extract the text area in those diverse environments and the scene text, and show that the proposed method is a comparatively nice method with experiments.

A novel, reversible, Chinese text information hiding scheme based on lookalike traditional and simplified Chinese characters

  • Feng, Bin;Wang, Zhi-Hui;Wang, Duo;Chang, Ching-Yun;Li, Ming-Chu
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.8 no.1
    • /
    • pp.269-281
    • /
    • 2014
  • Compared to hiding information into digital image, hiding information into digital text file requires less storage space and smaller bandwidth for data transmission, and it has obvious universality and extensiveness. However, text files have low redundancy, so it is more difficult to hide information in text files. To overcome this difficulty, Wang et al. proposed a reversible information hiding scheme using left-right and up-down representations of Chinese characters, but, when the scheme is implemented, it does not provide good visual steganographic effectiveness, and the embedding and extracting processes are too complicated to be done with reasonable effort and cost. We observed that a lot of traditional and simplified Chinese characters look somewhat the same (also called lookalike), so we utilize this feature to propose a novel information hiding scheme for hiding secret data in lookalike Chinese characters. Comparing to Wang et al.'s scheme, the proposed scheme simplifies the embedding and extracting procedures significantly and improves the effectiveness of visual steganographic images. The experimental results demonstrated the advantages of our proposed scheme.

Text Extraction from Complex Natural Images

  • Kumar, Manoj;Lee, Guee-Sang
    • International Journal of Contents
    • /
    • v.6 no.2
    • /
    • pp.1-5
    • /
    • 2010
  • The rapid growth in communication technology has led to the development of effective ways of sharing ideas and information in the form of speech and images. Understanding this information has become an important research issue and drawn the attention of many researchers. Text in a digital image contains much important information regarding the scene. Detecting and extracting this text is a difficult task and has many challenging issues. The main challenges in extracting text from natural scene images are the variation in the font size, alignment of text, font colors, illumination changes, and reflections in the images. In this paper, we propose a connected component based method to automatically detect the text region in natural images. Since text regions in mages contain mostly repetitions of vertical strokes, we try to find a pattern of closely packed vertical edges. Once the group of edges is found, the neighboring vertical edges are connected to each other. Connected regions whose geometric features lie outside of the valid specifications are considered as outliers and eliminated. The proposed method is more effective than the existing methods for slanted or curved characters. The experimental results are given for the validation of our approach.