• Title/Summary/Keyword: text feature

Text Region Extraction using Pattern Histogram of Character-Edge Map in Natural Images (문자-에지 맵의 패턴 히스토그램을 이용한 자연이미지에서의 텍스트 영역 추출)

  • Park, Jong-Cheon;Hwang, Dong-Guk;Lee, Woo-Ram;Kwon, Kyo-Hyun;Jun, Byoung-Min
    • Proceedings of the KAIS Fall Conference / 2006.11a / pp.220-224 / 2006
  • Text in natural images carries important information about the scene, so extracting it enables many important applications. In this paper, we propose a text region extraction method that uses the pattern histogram of a character-edge map. We extract edges with the Canny edge detector and create 16 kinds of edge maps from the extracted edges. We then build 8 kinds of character-edge maps, which capture character features, by combining the edge maps. Text regions are extracted using the 8 kinds of character-edge maps and the 16 kinds of edge maps. Text candidate regions are verified by analyzing the character-edge map pattern histogram and the structural features of the text region. The proposed method was tested on various kinds of natural images and effectively extracted text regions from images with complex backgrounds, various letters, and various text colors.

Corrupted Region Restoration based on 2D Tensor Voting (2D 텐서 보팅에 기반 한 손상된 텍스트 영상의 복원 및 분할)

  • Park, Jong-Hyun;Toan, Nguyen Dinh;Lee, Guee-Sang
    • The KIPS Transactions:PartB / v.15B no.3 / pp.205-210 / 2008
  • A new approach is proposed for restoring corrupted regions and for segmentation in natural text images. The challenge is to fill in the corrupted regions on the basis of color feature analysis using second-order symmetric stick tensors. It is shown how feature analysis can benefit from tensor voting with chromatic and achromatic components. The proposed method is applied to text images corrupted by various types of noise. First, we decompose an image into chromatic and achromatic components. Second, selected feature vectors are analyzed with second-order symmetric stick tensors. The tensors are then redefined using voting information from neighboring voters, which restores the corrupted regions. Last, mode estimation and segmentation are performed by adaptive mean shift and a separated clustering method, respectively. The approach is automatic, which makes it easy to fill in corrupted regions containing completely different structures and surrounding backgrounds. Applications of the proposed method include the restoration of damaged text images and the removal of superimposed noise or streaks. The results show that the proposed approach is efficient and robust in restoring and segmenting corrupted text images.

Context-based classification for harmful web documents and comparison of feature selecting algorithms

  • Kim, Young-Soo;Park, Nam-Je;Hong, Do-Won;Won, Dong-Ho
    • Journal of Korea Multimedia Society / v.12 no.6 / pp.867-875 / 2009
  • More and richer information sources and services become available on the web every day. However, harmful information, such as adult content, is not appropriate for all users, notably children. Because the internet is a worldwide open network, national laws and systems are limited in their ability to regulate users who provide harmful content. It is also undesirable to develop classification technology tied to one specific system, because internet users can encounter harmful content in diverse ways, for example through porn sites, harmful spam, or peer-to-peer networks. Research and development of context-based core technologies for classifying harmful content is therefore being emphasized. In this paper, we propose an efficient text filter that blocks harmful text in web documents using context-based technologies. Feature selection is the process of selecting, from all content terms that occur in documents, the terms that are useful as features for text categorization. Through implementation and experiment, we examine which feature selection algorithms are suitable for classifying harmful content.

Modality-Based Sentence-Final Intonation Prediction for Korean Conversational-Style Text-to-Speech Systems

  • Oh, Seung-Shin;Kim, Sang-Hun
    • ETRI Journal / v.28 no.6 / pp.807-810 / 2006
  • This letter presents a prediction model for sentence-final intonations for Korean conversational-style text-to-speech systems in which we introduce the linguistic feature of 'modality' as a new parameter. Based on their function and meaning, we classify tonal forms in speech data into tone types meaningful for speech synthesis and use the result of this classification to build our prediction model using a tree structured classification algorithm. In order to show that modality is more effective for the prediction model than features such as sentence type or speech act, an experiment is performed on a test set of 970 utterances with a training set of 3,883 utterances. The results show that modality makes a higher contribution to the determination of sentence-final intonation than sentence type or speech act, and that prediction accuracy improves up to 25% when the feature of modality is introduced.

Automatic conversion of machining data by the recognition of press mold (프레스 금형의 특징형상 인식에 의한 가공데이터 자동변환)

  • 최홍태;반갑수;이석희
    • Proceedings of the Korean Operations and Management Science Society Conference / 1994.04a / pp.703-712 / 1994
  • This paper presents an automatic conversion of machining data from the orthographic views of a press mold using feature recognition rules. The system consists of the following 6 modules: separation of views, function support, dimension text recognition, feature recognition, dimension text check, and feature processing. With minimum user intervention, the system recognizes basic features such as holes, slots, pockets, and clamping parts, and thus automatically converts the CAD drawing details of a press mold into machining data using a 2D CAD system instead of an expensive 3D modeler. The system was developed on an IBM-PC in the environment of AutoCAD R12, AutoLISP, and MetaWare High C. When applied to many sample drawings, the system was verified to provide a good interface between CAD and CAM.

The Use of MSVM and HMM for Sentence Alignment

  • Fattah, Mohamed Abdel
    • Journal of Information Processing Systems / v.8 no.2 / pp.301-314 / 2012
  • In this paper, two new approaches are presented for aligning English-Arabic sentences in bilingual parallel corpora, based on the Multi-Class Support Vector Machine (MSVM) and Hidden Markov Model (HMM) classifiers. A feature vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuation score, and cognate score values. One set of manually prepared data was used to train the MSVM and the HMM, and another set was used for testing. The MSVM and HMM results outperform those of the length-based approach. Moreover, these new approaches are valid for any language pair and are quite flexible, since the feature vector may contain fewer, more, or different features than the ones used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.
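
The feature vector described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names and the exact definitions of the three scores (a length ratio, shared punctuation marks, and shared tokens as a stand-in for cognates) are assumptions made for this example.

```python
import string
from collections import Counter


def length_score(src: str, tgt: str) -> float:
    """Ratio of the shorter to the longer sentence length in characters (assumed definition)."""
    longest = max(len(src), len(tgt))
    return min(len(src), len(tgt)) / longest if longest else 1.0


def punctuation_score(src: str, tgt: str) -> float:
    """Share of punctuation marks the two sentences have in common (assumed definition)."""
    p_src = Counter(ch for ch in src if ch in string.punctuation)
    p_tgt = Counter(ch for ch in tgt if ch in string.punctuation)
    total = max(sum(p_src.values()), sum(p_tgt.values()))
    if total == 0:
        return 1.0
    # Counter intersection keeps the minimum count of each shared mark.
    return sum((p_src & p_tgt).values()) / total


def cognate_score(src: str, tgt: str) -> float:
    """Share of tokens (numbers, names, symbols) appearing verbatim on both sides (assumed definition)."""
    def tokens(sentence: str) -> set:
        return {t.strip(string.punctuation).lower() for t in sentence.split()} - {""}

    t_src, t_tgt = tokens(src), tokens(tgt)
    shortest = min(len(t_src), len(t_tgt))
    return len(t_src & t_tgt) / shortest if shortest else 0.0


def feature_vector(src: str, tgt: str) -> list:
    """Feature vector for one candidate sentence pair, to be passed to a classifier."""
    return [length_score(src, tgt), punctuation_score(src, tgt), cognate_score(src, tgt)]


# Hypothetical sentence pair used only to exercise the functions.
vec = feature_vector("Hello, world 2012.", "Bonjour, monde 2012.")
print(vec)
```

A classifier such as an SVM would then be trained on vectors like `vec`, labeled as aligned or not aligned. Adding or removing a feature only changes the list returned by `feature_vector`, which matches the flexibility the abstract mentions.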

A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning (토픽모델링과 딥 러닝을 활용한 생의학 문헌 자동 분류 기법 연구)

  • Yuk, JeeHee;Song, Min
    • Journal of the Korean Society for information Management / v.35 no.2 / pp.63-88 / 2018
  • This research evaluated differences in classification performance across feature selection methods (the LDA topic model and Doc2Vec, which is based on word embedding using deep learning), feature corpus sizes, and classification algorithms. To find the feature corpus with the highest classification performance, experiments were conducted with feature corpora composed differently according to location within the document and with feature corpora of different sizes. In the experiments using deep learning, the training frequency was evaluated and the information used for context inference was specifically considered. This study constructed a biomedical document dataset, Disease-35083, which consists of biomedical scholarly documents provided by PMC and categorized by disease category. The study verifies which type and size of feature corpus produces the highest performance and also suggests feature corpora that can be extended to specific features while remaining efficient in training time. Additionally, this research compares deep learning with existing methods and suggests an appropriate method for each classification environment.

Slab Region Localization for Text Extraction using SIFT Features (문자열 검출을 위한 슬라브 영역 추정)

  • Choi, Jong-Hyun;Choi, Sung-Hoo;Yun, Jong-Pil;Koo, Keun-Hwi;Kim, Sang-Woo
    • The Transactions of The Korean Institute of Electrical Engineers / v.58 no.5 / pp.1025-1034 / 2009
  • In a steelmaking production line, each steel slab is given a unique identification number. This identification number, the slab management number (SMN), gives information about the use of the slab. SMNs have been identified by humans for several years, but this is expensive, inaccurate, and a heavy burden on the workers. Consequently, an automatic recognition system is desirable to improve efficiency. Generally, a recognition system consists of text localization, text extraction, character segmentation, and character recognition. For exact SMN identification, every stage of the recognition system must succeed. Text localization in particular is a very important and difficult stage. Because of the many text-like patterns in a complex background and the high fuzziness between the slab and the background, directly extracting the text region is difficult. If the slab region containing the SMN can be detected precisely, the text localization algorithm can be made simpler and the processing time of the overall recognition system will be reduced. This paper describes slab region localization using SIFT (Scale Invariant Feature Transform) features in the image. First, the SIFT algorithm is applied to the captured background and slab images, and the features of the two images are matched by the Nearest Neighbor (NN) algorithm. However, the correct matching rate can be low when two images are matched. Thus, to remove incorrect matches between the features of the two images, the geometric locations of the two matched feature points are used. Finally, a search rectangle method is applied to the correctly matched features, and the top boundary and side boundaries of the slab region are determined. Through these processes, we can reduce the search region for extracting the SMN from the slab image. In most cases, the search region for text extraction is heuristically fixed [1][2]. However, the proposed algorithm is more analytic than other algorithms, because the search region is not fixed and the slab region is searched in the whole image. Experimental results show that the proposed algorithm performs well.
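
The matching steps in this abstract (nearest-neighbor matching of descriptors, then removal of incorrect matches using the geometric locations of the matched points) can be sketched as below. This is only an illustration under stated assumptions: the descriptors are tiny hand-made vectors rather than real SIFT output, and the geometric check (keep matches whose displacement is close to the median displacement, controlled by a hypothetical `tol` parameter) is a stand-in, since the abstract does not spell out the paper's actual rule.

```python
import math
from statistics import median


def nearest_neighbor_matches(desc_a, desc_b):
    """Pair each descriptor in A with its closest descriptor in B (Euclidean distance)."""
    matches = []
    for i, da in enumerate(desc_a):
        j = min(range(len(desc_b)), key=lambda k: math.dist(da, desc_b[k]))
        matches.append((i, j))
    return matches


def filter_by_displacement(matches, pts_a, pts_b, tol=5.0):
    """Keep matches whose (dx, dy) lies within tol pixels of the median displacement.

    Assumed geometric consistency check, chosen for illustration only.
    """
    dxs = [pts_b[j][0] - pts_a[i][0] for i, j in matches]
    dys = [pts_b[j][1] - pts_a[i][1] for i, j in matches]
    med_dx, med_dy = median(dxs), median(dys)
    return [
        m for m, dx, dy in zip(matches, dxs, dys)
        if abs(dx - med_dx) <= tol and abs(dy - med_dy) <= tol
    ]


# Hypothetical keypoint locations and 2-D descriptors (real SIFT descriptors have 128 dimensions).
pts_a = [(10, 10), (20, 15), (30, 40), (50, 5)]
pts_b = [(15, 12), (25, 17), (35, 42), (90, 80)]
desc_a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2)]
desc_b = [(0.9, 0.1), (0.1, 0.9), (1.0, 1.1), (0.5, 0.25)]

matches = nearest_neighbor_matches(desc_a, desc_b)
good = filter_by_displacement(matches, pts_a, pts_b)
print(matches)
print(good)
```

In this toy data the fourth pair is the nearest neighbor in descriptor space but moves very differently from the other three, so the geometric check discards it. The surviving points would then feed a boundary search such as the search rectangle method named in the abstract.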

A Basic Thinking of Pansori Reading Text Appearance -A study on version of <Chunhyangjun>- (판소리 독서물 탄생의 기반 사유 -<춘향전> 필사본을 통한 고찰-)

  • Cha, Chounghwan
    • (The) Research of the performance art and culture / no.23 / pp.313-346 / 2011
  • This thesis investigates the basic thinking behind the appearance of Pansori reading texts. Among Pansori reading texts, some versions include unfamiliar contents and scenes. These were created by the writers of the Pansori reading texts. Why did the writers create them? First, writers created new contents and scenes in order to show their knowledge. Reading texts with this feature include the 28-page version of Chunhyangjun belonging to Kim Kwang-sun, the 87-page version belonging to Sa Jae-dong, and the 154-page version belonging to Hong Yun-pyo. These reading texts are tied to the knowledge culture of the late Chosun period. Second, writers created new contents and scenes in order to reenact the scene of festivities. Reading texts with this feature include the 75-page version of Chunhyangjun belonging to Kyungsang University and the 52-page version belonging to Keimyung University. The former shows the story field and the Pansori field; the latter shows the play field of Walja. Third, writers created new contents and scenes in order to lampoon yangban authority. Reading texts with this feature include the 72-page version of Chunhyangjun belonging to Chungnam University and its affiliated versions, and the 59-page version belonging to Park Sun-ho and its affiliated versions.

A Comparative Study of Feature Extraction Methods for Authorship Attribution in the Text of Traditional East Asian Medicine with a Focus on Function Words (한의학 고문헌 텍스트에서의 저자 판별 - 기능어의 역할을 중심으로 -)

  • Oh, Junho
    • Journal of Korean Medical classics / v.33 no.2 / pp.51-59 / 2020
  • Objectives: We study which "feature" is most appropriate for effectively performing authorship attribution of texts of Traditional East Asian Medicine. Methods: The authorship attribution performance of the Support Vector Machine (SVM) was compared by cross validation, depending on whether function words or content words were used, whether single words or collocations were used, and whether IDF weights were applied, using the 'Variorum of the Nanjing' as the experimental corpus. Results: The combination 'function words/uni-bigram/TF' performed best, with an accuracy of 0.732, and the combination 'content words/unigram/TFIDF' showed the lowest accuracy, 0.351. Conclusions: The results show the following about authorship attribution of texts of Traditional East Asian Medicine. First, function words play a more important role than content words. Second, collocations were relatively important among content words, while single words were more important among function words. Third, unlike in general text analysis, IDF weighting resulted in worse performance.
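
The feature combinations compared in this abstract (function words, unigrams plus bigrams, TF versus TF-IDF weighting) can be illustrated with a small sketch. Everything specific here is an assumption for the example: the English `FUNCTION_WORDS` list, the two sample sentences, whitespace tokenization, bigrams built over the filtered function-word sequence, and the plain `log(N / df)` form of IDF. The paper's own corpus is classical medical text and its exact settings are not given in the abstract.

```python
import math
from collections import Counter

# Hypothetical function-word list; the study works on classical East Asian medical texts.
FUNCTION_WORDS = {"the", "of", "is", "in", "and", "to"}


def function_word_features(text, n_values=(1, 2)):
    """Unigram and bigram features built from the function words of a text."""
    tokens = [t for t in text.lower().split() if t in FUNCTION_WORDS]
    feats = []
    for n in n_values:
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats


def term_frequencies(features):
    """Relative frequency (TF) of each feature within one document."""
    counts = Counter(features)
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()} if total else {}


def inverse_document_frequencies(docs_features):
    """IDF as log(N / df); a feature present in every document gets weight 0."""
    n_docs = len(docs_features)
    df = Counter(f for feats in docs_features for f in set(feats))
    return {f: math.log(n_docs / d) for f, d in df.items()}


# Hypothetical two-document corpus.
docs = [
    "the pulse of the heart is in the wrist",
    "the cause of the illness is cold",
]
features = [function_word_features(d) for d in docs]
tf = [term_frequencies(f) for f in features]
idf = inverse_document_frequencies(features)
tfidf = [{f: w * idf[f] for f, w in doc.items()} for doc in tf]

print(tf[0]["the"], tfidf[0]["the"], tfidf[0]["in"])
```

In this toy corpus, IDF assigns zero weight to any feature that occurs in every document, which is typical of frequent function words. The sketch only shows the mechanics of the two weightings and does not by itself explain the accuracy difference reported in the abstract.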