A Study on Research Trends of Graph-Based Text Representations for Text Mining (텍스트 마이닝을 위한 그래프 기반 텍스트 표현 모델의 연구 동향)

  • Chang, Jae-Young
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.13 no.5 / pp.37-47 / 2013
  • Text mining is a research area concerned with retrieving high-quality hidden information, such as patterns, trends, or distributions, by analyzing unformatted text. Because text mining works on unstructured text, the text must first be represented in a simple text model before it can be analyzed. So far, the most frequently used model is the VSM (Vector Space Model), in which a text is represented as a bag of words. Recently, however, many studies have applied graph-based text models to represent semantic relationships between words. In this paper, we survey research trends in graph-based text representation models for text mining. We also discuss future models for graph-based text mining.
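
The contrast between a bag of words and a graph-based representation can be illustrated with a minimal sketch. The directed adjacency graph used here is only one common construction and is an assumption for illustration, not a model taken from the survey; the function names `bag_of_words` and `adjacency_graph` are hypothetical.

```python
from collections import Counter, defaultdict


def bag_of_words(text):
    """VSM-style representation: word -> frequency, word order discarded."""
    return Counter(text.lower().split())


def adjacency_graph(text):
    """Directed graph: edge (a, b) counts how often word b directly follows word a.

    Assumption: adjacency is used as a stand-in for a relationship between words.
    """
    tokens = text.lower().split()
    edges = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        edges[(left, right)] += 1
    return dict(edges)


doc_a = "dog bites man"
doc_b = "man bites dog"

# The two bags are identical, while the graphs keep the word order apart.
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
same_graph = adjacency_graph(doc_a) == adjacency_graph(doc_b)
```

In this toy case the two sentences have the same bag of words but different edge sets, which is the kind of structural information a bag of words cannot carry.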

The Binarization of Text Regions in Natural Scene Images, based on Stroke Width Estimation (자연 영상에서 획 너비 추정 기반 텍스트 영역 이진화)

  • Zhang, Chengdong;Kim, Jung Hwan;Lee, Guee Sang
    • Smart Media Journal / v.1 no.4 / pp.27-34 / 2012
  • In this paper, a novel text binarization method is presented that can deal with complex conditions such as shadows, non-uniform illumination due to highlights or object projection, and messy backgrounds. To locate the target text region, a focus line is assumed to pass through the text region. Next, connected component analysis and stroke width estimation based on the location of the focus line are used to locate the bounding box of the text region and the box of each connected component. A series of classifications is applied to identify whether each connected component (CC) is text or non-text. In addition, a modified K-means clustering method based on the HCL color space is applied to reduce the color dimension. A text binarization procedure based on the location of text components and seed color pixels is then used to generate the final result.
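
The abstract does not say how stroke width is estimated along the focus line. As a rough sketch under that caveat, one simple heuristic is to take the most common run length of foreground pixels on the focus row of an already binarized image. The functions `run_lengths` and `estimate_stroke_width` and the toy image are assumptions for illustration, not the authors' procedure.

```python
from collections import Counter


def run_lengths(row):
    """Lengths of consecutive runs of foreground (1) pixels in one scan line."""
    runs, current = [], 0
    for pixel in row:
        if pixel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:  # a run that touches the right border
        runs.append(current)
    return runs


def estimate_stroke_width(binary_image, focus_row):
    """Most common foreground run length on the focus row (hypothetical heuristic)."""
    runs = run_lengths(binary_image[focus_row])
    if not runs:
        return None  # the focus row crosses no foreground pixels
    return Counter(runs).most_common(1)[0][0]


# Toy binary image: the middle row crosses three strokes of widths 2, 2 and 3.
image = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
width = estimate_stroke_width(image, focus_row=1)
```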

Text Location and Extraction for Business Cards Using Stroke Width Estimation

  • Zhang, Cheng Dong;Lee, Guee-Sang
    • International Journal of Contents / v.8 no.1 / pp.30-38 / 2012
  • Text extraction and binarization are important pre-processing steps for text recognition. The performance of text binarization is strongly related to the accuracy of the recognition stage. In the proposed method, the first stage, based on line detection and shape feature analysis, locates the business card and detects its shape in a complex environment. In the second stage, several local regions containing possible text components are separated based on the projection histogram. In each local region, the pixels are grouped into several connected components based on connected component labeling and the projection histogram. Each connected component is then classified as a text region, or rejected as a non-text region, based on feature analysis such as the size of the connected component and stroke width estimation.
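
Connected component labeling followed by a size-based rejection step can be sketched as follows. This is a generic breadth-first labeling with 4-connectivity; the size thresholds in `keep_text_like` are hypothetical values chosen for the toy image, not parameters from the paper.

```python
from collections import deque


def connected_components(binary_image):
    """Return a list of components, each a list of (row, col) foreground pixels."""
    rows, cols = len(binary_image), len(binary_image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary_image[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    # Visit the 4-connected neighbours of the current pixel.
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary_image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components


def keep_text_like(components, min_pixels=3, max_pixels=50):
    """Reject components that are too small or too large (hypothetical thresholds)."""
    return [p for p in components if min_pixels <= len(p) <= max_pixels]


image = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]
components = connected_components(image)
text_candidates = keep_text_like(components)
```

A real pipeline would combine the size test with further features, such as the stroke width estimate mentioned in the abstract.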

Table based Matching Algorithm for Soft Categorization of News Articles in Reuter 21578

  • Jo, Tae-Ho
    • Journal of Korea Multimedia Society / v.11 no.6 / pp.875-882 / 2008
  • This research proposes an alternative to machine learning based approaches for text categorization. To use machine learning based approaches for any text mining task, documents must be encoded into numerical vectors, which causes two problems: huge dimensionality and sparse distribution. Although there are various text mining tasks, such as text categorization, text clustering, and text summarization, the scope of this research is restricted to text categorization. The idea of this research is to avoid the two problems by encoding a document or documents into a table instead of numerical vectors. The goal is therefore to improve the performance of text categorization by proposing approaches that are free from the two problems.
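
The abstract does not define the table format or the matching rule, so the following is only a sketch of the general idea: each document and each category is kept as a word-to-weight table, and a document may receive several labels (soft categorization) when enough of its weight matches a category table. The similarity measure, the threshold, and the category data are all assumptions.

```python
from collections import Counter


def encode_as_table(text):
    """Encode text as a table mapping each word to its frequency (no fixed-size vector)."""
    return Counter(text.lower().split())


def table_similarity(doc_table, category_table):
    """Share of the document's total word weight that also occurs in the category table."""
    total = sum(doc_table.values())
    if total == 0:
        return 0.0
    shared = sum(w for word, w in doc_table.items() if word in category_table)
    return shared / total


def soft_categorize(text, category_tables, threshold=0.5):
    """Return every category whose similarity reaches the (hypothetical) threshold."""
    doc = encode_as_table(text)
    return sorted(
        name for name, table in category_tables.items()
        if table_similarity(doc, table) >= threshold
    )


# Made-up category tables for illustration only.
categories = {
    "grain": encode_as_table("wheat corn harvest export tonnes"),
    "trade": encode_as_table("export import tariff deficit"),
}
labels = soft_categorize("wheat export tariff", categories)
```

Because only the words that actually occur are stored, the tables avoid the fixed, mostly empty dimensions of a vector encoding.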

Deep-Learning Approach for Text Detection Using Fully Convolutional Networks

  • Tung, Trieu Son;Lee, Gueesang
    • International Journal of Contents / v.14 no.1 / pp.1-6 / 2018
  • Text, as one of the most influential inventions of humanity, has played an important role in human life since ancient times. The rich and precise information embodied in text is very useful in a wide range of vision-based applications; for example, text data extracted from images can provide information for automatic annotation, indexing, language translation, and assistance systems for impaired persons. Natural-scene text detection is therefore an important and active research topic in computer vision and document analysis. Previous methods perform poorly due to numerous false-positive and true-negative regions. In this paper, a fully-convolutional-network (FCN)-based method with a supervised architecture is used to localize textual regions. The model was trained directly on images, with pixel values used as inputs and binary ground truth used as labels. The method was evaluated on the ICDAR-2013 dataset and proved to be comparable to other feature-based methods. It could expedite future research on text detection using deep-learning-based approaches.
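
The network itself is beyond a short example, but the binary ground truth used as the label can be sketched. The snippet below assumes that text annotations come as axis-aligned boxes with inclusive pixel coordinates, which is an assumption and not something the abstract specifies; `make_label_mask` is a hypothetical helper.

```python
def make_label_mask(height, width, text_boxes):
    """Build a binary mask: 1 for pixels inside any text box, 0 elsewhere.

    Each box is (top, left, bottom, right) with inclusive coordinates (assumed format).
    """
    mask = [[0] * width for _ in range(height)]
    for top, left, bottom, right in text_boxes:
        # Clip the box to the image so out-of-range annotations do not fail.
        for r in range(max(top, 0), min(bottom, height - 1) + 1):
            for c in range(max(left, 0), min(right, width - 1) + 1):
                mask[r][c] = 1
    return mask


mask = make_label_mask(3, 5, [(0, 1, 1, 2)])
```

A pixel-wise classifier such as an FCN would then be trained to predict this mask from the image pixels.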

The Development and Effects of the Text-Based Media Literacy Program for Young Children (텍스트 중심 유아 미디어 리터러시 교육 프로그램 개발 및 적용 효과)

  • Lee, Jae-Eun;Cho, Eun-Jin
    • Korean Journal of Child Studies / v.38 no.1 / pp.77-93 / 2017
  • Objective: The purpose of this study was to develop a text-based media literacy program and to examine its effects on young children's understanding and expression of media text. Methods: The participants were 54 5-year-old kindergarteners assigned to an experimental or a control group, with 27 children per group. The text-based media literacy program was based on the ADDIE model and was administered to the experimental group for 8 weeks. The pre- and post-test instruments measured media text understanding and expression ability and were patterned after those used by British Film Institute (2003) and other major studies. Results: The experimental group showed higher levels of media text understanding and expression than the control group. Conclusion: The results are discussed with respect to their implications for educational practice and future research.

Text Categorization for Authorship based on the Features of Lingual Conceptual Expression

  • Zhang, Quan;Zhang, Yun-liang;Yuan, Yi
    • Proceedings of the Korean Society for Language and Information Conference / pp.515-521 / 2007
  • Text categorization is an important field in automatic text information processing, and authorship identification can be treated as a special case of text categorization. This paper adopts conceptual primitive expressions based on the Hierarchical Network of Concepts (HNC) theory, which describes word meanings with hierarchical symbols, in order to avoid the data sparseness caused by natural language surface features in text categorization. The KNN algorithm is used as the classifier. An experiment was carried out on authorship identification of Chinese texts. The results show that the proposed processing mode achieves a high accuracy rate and is therefore feasible for text authorship identification.
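
A k-nearest-neighbour classifier of the kind mentioned above can be sketched in a few lines. The two-dimensional feature vectors below are made-up placeholders standing in for concept-based features; how HNC features are actually computed is not described in the abstract, and the distance metric and value of `k` are assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training items closest to `query`.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors for texts by two authors.
train = [
    ((0.9, 0.1), "author_a"),
    ((0.8, 0.2), "author_a"),
    ((0.7, 0.3), "author_a"),
    ((0.2, 0.8), "author_b"),
    ((0.1, 0.9), "author_b"),
    ((0.3, 0.7), "author_b"),
]
predicted = knn_predict(train, (0.75, 0.25))
```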

A Stroke-Based Text Extraction Algorithm for Digital Videos (디지털 비디오를 위한 획기반 자막 추출 알고리즘)

  • Jeong, Jong-Myeon;Cha, Ji-Hun;Kim, Kyu-Heon
    • Journal of the Korean Institute of Intelligent Systems / v.17 no.3 / pp.297-303 / 2007
  • In this paper, a stroke-based text extraction algorithm for digital video is proposed. The proposed algorithm consists of four stages: text detection, text localization, text segmentation, and geometric verification. The text detection stage ascertains whether a given frame in a video sequence contains text. This is accomplished by applying morphological operations to the pixels with a higher possibility of being stroke-based text, which are called seed points. In the text localization stage, morphological operations on the edges that include seed points are applied, followed by horizontal and vertical projections. The text segmentation stage classifies the projected areas into text and background regions according to their intensity distribution. Finally, in the geometric verification stage, the segmented areas are verified using prior knowledge of video text characteristics.
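
The projection step of the localization stage can be sketched as follows, assuming the morphological operations have already produced a binary mask of candidate text pixels. The helper names and the single-box simplification are assumptions; a full system would split the profiles into several spans to find multiple text lines.

```python
def projection_profiles(binary_mask):
    """Return (horizontal, vertical) profiles: foreground counts per row and per column."""
    horizontal = [sum(row) for row in binary_mask]
    vertical = [sum(col) for col in zip(*binary_mask)]
    return horizontal, vertical


def nonzero_span(profile):
    """First and last index with a non-zero count, or None if the profile is empty."""
    indices = [i for i, value in enumerate(profile) if value > 0]
    return (indices[0], indices[-1]) if indices else None


def localize_text(binary_mask):
    """Bounding box (top, left, bottom, right) of all foreground pixels, inclusive."""
    horizontal, vertical = projection_profiles(binary_mask)
    rows, cols = nonzero_span(horizontal), nonzero_span(vertical)
    if rows is None or cols is None:
        return None
    return (rows[0], cols[0], rows[1], cols[1])


mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
box = localize_text(mask)
```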

Text Detection based on Edge Enhanced Contrast Extremal Region and Tensor Voting in Natural Scene Images

  • Pham, Van Khien;Kim, Soo-Hyung;Yang, Hyung-Jeong;Lee, Guee-Sang
    • Smart Media Journal / v.6 no.4 / pp.32-40 / 2017
  • In this paper, a robust text detection method based on the edge-enhanced contrast extremal region (CER) is proposed, using the stroke width transform (SWT) and tensor voting. First, the edge-enhanced CER extracts a number of covariant regions, which are stable connected components, from the input images. Next, the SWT is computed from the distance map and used to eliminate non-text regions. Then, the candidate text regions are verified by tensor voting, which uses the center points from the previous step as input to compute curve salience values. Finally, connected component grouping is applied to cluster nearby characters. The proposed method is evaluated on the ICDAR2003 and ICDAR2013 text detection competition datasets, and the experimental results show high accuracy compared to previous methods.
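
A distance map of the kind the stroke width transform is built from can be sketched with a breadth-first search. The city-block metric and the toy image are assumptions; the abstract does not state which distance transform is used or how stroke widths are read off it.

```python
from collections import deque


def distance_map(binary_image):
    """City-block distance from each pixel to the nearest background (0) pixel.

    Background pixels get 0. If the image has no background, entries stay None.
    """
    rows, cols = len(binary_image), len(binary_image[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if not binary_image[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    # Multi-source BFS: distances grow by one per step away from the background.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < rows and 0 <= nx < cols and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


# A horizontal stroke three pixels thick.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
dist = distance_map(image)
ridge = max(value for row in dist for value in row)
```

The largest values lie along the centre of the stroke, so regions whose ridge values vary widely are less likely to be text strokes of uniform width.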

Corpus-based evaluation of French text normalization (코퍼스 기반 프랑스어 텍스트 정규화 평가)

  • Kim, Sunhee
    • Phonetics and Speech Sciences / v.10 no.3 / pp.31-39 / 2018
  • This paper aims to present a taxonomy of non-standard words (NSW) for developing a French text normalization system and to propose a method for evaluating this system based on a corpus. The proposed taxonomy of French NSWs consists of 13 categories, including 2 types of letter-based categories and 9 types of number-based categories. In order to evaluate the text normalization system, a representative test set including NSWs from various text domains, such as news, literature, non-fiction, social-networking services (SNSs), and transcriptions, is constructed, and an evaluation equation is proposed reflecting the distribution of the NSW categories of the target domain to which the system is applied. The error rate of the test set is 1.64%, while the error rate of the whole corpus is 2.08%, reflecting the NSW distribution in the corpus. The results show that the literature and SNS domains are assessed as having higher error rates compared to the test set.
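
The evaluation equation itself is not given in the abstract. One plausible form, shown here purely as an assumption, weights each category's error rate by that category's share in the target domain. The category names and numbers are invented for illustration and are not results from the paper.

```python
def weighted_error_rate(category_error, category_share):
    """Sum of per-category error rates weighted by each category's share in the domain.

    Shares are normalised, so raw counts may be passed instead of proportions.
    """
    total = sum(category_share.values())
    return sum(
        category_error[name] * share / total
        for name, share in category_share.items()
    )


# Hypothetical per-category error rates and category counts in a target domain.
errors = {"letter_based": 0.10, "number_based": 0.02}
shares = {"letter_based": 1, "number_based": 3}
overall = weighted_error_rate(errors, shares)
```

With this form, the same per-category error rates yield different overall rates for domains whose NSW distributions differ.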