• Title/Abstract/Keywords: Text Preprocessing

124 results found (processing time: 0.025 s)

Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets

  • Mehmet F. Karaca
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 18, No. 3 / pp. 591-609 / 2024
  • This study examined all combinations of preprocessing methods for their effect on reducing the number of words, shortening processing time, and classification success, using balanced datasets and datasets imbalanced at different ratios. The reductions in word count and in processing time achieved by preprocessing were interrelated. Turkish datasets were classified more successfully, and English datasets were more sensitive to whether the dataset was balanced. In highly imbalanced datasets, documents from classes with few documents were misclassified differently by language: in Turkish datasets they were assigned to the topically closest class, whereas in English datasets they were assigned to the class with many documents. In terms of average scores, the best classification in Turkish datasets was obtained without lowercasing, with stemming, and with stop-word removal; in English datasets it was obtained with lowercasing, stemming, and stop-word removal. Stemming was the preprocessing method that most increased success in Turkish datasets, whereas stop-word removal was the most important in English datasets. The maximum scores showed that feature selection, feature size, and the classifier affect classification success more than preprocessing does. The study concludes that preprocessing is necessary for text classification because it shortens processing time and can yield high classification success, that a given preprocessing method does not have the same effect in all languages, and that different preprocessing methods are more successful for different languages.
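The idea of evaluating every combination of preprocessing steps can be sketched as follows. This is a minimal illustration, not the study's code: the stop-word list, the crude `naive_stem` suffix stripper, and the sample sentence are all assumptions made for the example, and a real experiment would use language-specific stemmers and stop-word lists.

```python
from itertools import product

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are"}


def naive_stem(token):
    """Very crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, lowercase, stem, remove_stop_words):
    """Apply the selected preprocessing steps to whitespace-split tokens."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        # Case-insensitive match so "The" is removed even without lowercasing.
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


def all_combinations(text):
    """Run all 2^3 on/off combinations of (lowercase, stem, remove_stop_words)."""
    return {flags: preprocess(text, *flags) for flags in product([False, True], repeat=3)}


sample = "The Cats are chasing the Dogs in the Garden"
combos = all_combinations(sample)
for flags, tokens in combos.items():
    # Print the flag triple, the number of distinct tokens, and the tokens.
    print(flags, len(set(tokens)), tokens)
```

Comparing the distinct-token counts across the eight settings gives a rough feel for how each step shrinks the vocabulary, which is the quantity the abstract relates to processing time.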

A Study on Improvement of Image Classification Accuracy Using Image-Text Pairs

  • 김미희;이주혁
    • Journal of IKEEE / Vol. 27, No. 4 / pp. 561-566 / 2023
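  • Advances in deep learning have made a wide range of computer vision research possible. Within computer vision, deep learning has shown high accuracy and performance in image processing. However, most image processing methods rely only on the visual information in the image. When image-text pairs are used, text data associated with an image, such as descriptions and annotations, can provide additional context and visual information that is hard to obtain from the image alone. This paper proposes a deep learning model that analyzes images and text using image-text pairs. The proposed model achieved classification accuracy about 11% higher than a deep learning model that uses image information only.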

An End-to-End Sequence Learning Approach for Text Extraction and Recognition from Scene Image

  • Lalitha, G.;Lavanya, B.
    • International Journal of Computer Science & Network Security / Vol. 22, No. 7 / pp. 220-228 / 2022
  • Images often carry useful information, so detecting text in scene images is important. The purpose of the proposed work is to recognize text in scene images, for example images of boards placed along highways. Scene text detection on highway boards plays a vital role in road safety measures. In the initial stage, preprocessing techniques are applied to sharpen the image and enhance its features. Likewise, morphological operators are applied to remove the narrow gaps between objects. We propose a two-phase algorithm for extracting and recognizing text from scene images. In phase I, text is extracted from the scene image by applying image preprocessing techniques such as blurring, erosion, and top-hat, followed by thresholding and morphological gradient with fixed kernel sizes; the Canny edge detector is then applied to detect the text in the scene image. In phase II, the text is recognized using MSER (Maximally Stable Extremal Regions) and OCR. The work aims to detect text in scene images from the popular dataset repositories SVT, ICDAR 2003, and MSRA-TD 500, whose images were captured under various illumination conditions and angles. The proposed algorithm achieves higher accuracy in minimal execution time compared with state-of-the-art methods.
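Two of the phase I operations named above, erosion and morphological gradient followed by thresholding, can be sketched in plain Python. This is only an illustration under assumptions: the square window, the radius, the threshold level, and the synthetic 7x7 image are made up for the example, and the sketch does not cover blurring, top-hat, Canny, MSER, or OCR. In practice an image library would normally be used instead of nested lists.

```python
def _window(image, y, x, radius):
    """Collect pixel values in the square window around (y, x), clipped at the borders."""
    h, w = len(image), len(image[0])
    return [
        image[j][i]
        for j in range(max(0, y - radius), min(h, y + radius + 1))
        for i in range(max(0, x - radius), min(w, x + radius + 1))
    ]


def erode(image, radius=1):
    """Grayscale erosion: each pixel becomes the minimum of its window."""
    return [[min(_window(image, y, x, radius)) for x in range(len(image[0]))]
            for y in range(len(image))]


def dilate(image, radius=1):
    """Grayscale dilation: each pixel becomes the maximum of its window."""
    return [[max(_window(image, y, x, radius)) for x in range(len(image[0]))]
            for y in range(len(image))]


def morphological_gradient(image, radius=1):
    """Dilation minus erosion, which highlights object boundaries."""
    dilated, eroded = dilate(image, radius), erode(image, radius)
    return [[d - e for d, e in zip(d_row, e_row)]
            for d_row, e_row in zip(dilated, eroded)]


def threshold(image, level):
    """Binarize: 1 where the value reaches `level`, else 0."""
    return [[1 if v >= level else 0 for v in row] for row in image]


# Synthetic 7x7 image with a bright 3x3 block in the middle (illustrative only).
image = [[200 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)]
         for y in range(7)]
edges = threshold(morphological_gradient(image), 100)
for row in edges:
    print(row)
```

The output marks a ring around the bright block while its interior and the far background stay 0, which is the boundary-emphasizing behaviour that makes the gradient useful before edge detection.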

Correction of Signboard Distortion by Vertical Stroke Estimation

  • Lim, Jun Sik;Na, In Seop;Kim, Soo Hyung
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 7, No. 9 / pp. 2312-2325 / 2013
  • In this paper, we propose a method that corrects the distortion of the text area in Korean signboard images as a preprocessing step to improve character recognition. Perspective distortion in Korean signboard text can cause a low recognition rate. The proposed method consists of four main steps and eight sub-steps; the main steps are potential vertical component detection, vertical component detection, text-boundary estimation, and distortion correction. First, potential vertical line component detection consists of four sub-steps: edge detection for each connected component, pixel distance normalization along the edge, dominant-point detection along the edge, and removal of horizontal components. Second, vertical line component detection consists of removing diagonal components and extracting vertical line components. Third, text-boundary (outline) estimation consists of detecting the left and right boundary lines. Finally, the distortion of the text image is corrected by a bilinear transformation based on the estimated outline. We compared OCR recognition rates before and after applying the proposed algorithm. The recognition rate for the distortion-corrected signboard images is 29.63% higher at the character level and 21.9% higher at the text level than for the original images.
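The final correction step, a bilinear transformation driven by an estimated outline, can be illustrated with a small sketch. It assumes the four corners of the text region are already known (the paper estimates them from vertical strokes, which is not reproduced here); the function names, the nearest-neighbour sampling, and the tiny test image are assumptions made for the example.

```python
def bilinear_map(u, v, quad):
    """Map (u, v) in the unit square to a point inside `quad`.

    quad = (top_left, top_right, bottom_right, bottom_left), each an (x, y) pair.
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = quad
    x = (1 - u) * (1 - v) * x0 + u * (1 - v) * x1 + u * v * x2 + (1 - u) * v * x3
    y = (1 - u) * (1 - v) * y0 + u * (1 - v) * y1 + u * v * y2 + (1 - u) * v * y3
    return x, y


def rectify(image, quad, out_w, out_h):
    """Resample the region `quad` of `image` into an out_w x out_h rectangle.

    Each output pixel is looked up in the source with nearest-neighbour sampling.
    """
    h, w = len(image), len(image[0])
    out = []
    for row in range(out_h):
        v = row / (out_h - 1) if out_h > 1 else 0.0
        line = []
        for col in range(out_w):
            u = col / (out_w - 1) if out_w > 1 else 0.0
            x, y = bilinear_map(u, v, quad)
            # Clamp to the image bounds before indexing.
            xi = min(max(int(round(x)), 0), w - 1)
            yi = min(max(int(round(y)), 0), h - 1)
            line.append(image[yi][xi])
        out.append(line)
    return out


# Identity check on a 3x3 image: an axis-aligned quad returns the image unchanged.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
quad = ((0, 0), (2, 0), (2, 2), (0, 2))
print(rectify(image, quad, 3, 3))
```

Passing a skewed quadrilateral instead of the axis-aligned one stretches that region into the output rectangle, which is how a slanted text area would be straightened before OCR.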

Interactive Typography System using Combined Corner and Contour Detection

  • Lim, Sooyeon;Kim, Sangwook
    • International Journal of Contents / Vol. 13, No. 1 / pp. 68-75 / 2017
  • Interactive typography is a process in which a user communicates by interacting with text and a moving factor. This research covers interactive typography that responds in real time to a user's gestures. To make the system language-independent, the entered text data is preprocessed into image data. The image data is then recognized and interaction points are set, using computer vision techniques such as the Harris corner detector and contour detection. User interaction is achieved using skeleton information tracked by a depth camera. By synchronizing the user's skeleton information acquired by Kinect (a depth camera) with the typography components (interaction points), all user gestures are linked with the typography in real time. An experiment was conducted in both English and Korean, in which users reported an 81% satisfaction level with an interactive typography system whose text components moved discretely in accordance with the users' gestures. The experiment showed that sensibility varied depending on the size and speed of the text and on the interactive alteration. The results show that interactive typography can potentially be an accurate communication tool, and not merely a uniform text transmission system.

Building Hybrid Stop-Words Technique with Normalization for Pre-Processing Arabic Text

  • Atwan, Jaffar
    • International Journal of Computer Science & Network Security / Vol. 22, No. 7 / pp. 65-74 / 2022
  • In natural language processing, commonly used words such as prepositions are referred to as stop-words; they have no inherent meaning and are therefore ignored in indexing and retrieval tasks. Removing stop-words from Arabic text significantly reduces the size of a text corpus, which improves the effectiveness and performance of Arabic-language processing systems. This study investigated the effectiveness of stop-word list elimination combined with normalization as a preprocessing step. The idea was to merge a statistical method with a linguistic method to attain the best efficacy, and to compare the effects of this two-pronged approach on reducing corpus size for Arabic natural language processing systems. Three stop-word lists were considered: an Arabic Text Lookup Stop-list, a Frequency-based Stop-list using Zipf's law, and a Combined Stop-list. An experiment was conducted using a selected file from the Arabic Newswire data set, in which the size of the corpus was compared after removing the words contained in each list. The results showed that the best reduction in size was achieved by using the Combined Stop-list with normalization, with a word count reduction of 452930 and a compression rate of 30%.
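The combination of a lookup list with a frequency-based list, and the resulting size reduction, can be sketched as below. This is a simplified illustration under assumptions: it uses a toy English corpus rather than Arabic Newswire, a plain top-k frequency cut-off as a stand-in for the Zipf's-law-based list, a hypothetical hand-built lookup list, and it omits normalization entirely.

```python
from collections import Counter


def frequency_stoplist(tokens, top_k):
    """Take the top_k most frequent tokens as a frequency-based stop-list."""
    return {word for word, _ in Counter(tokens).most_common(top_k)}


def remove_stop_words(tokens, stoplist):
    """Drop every token that appears in the stop-list."""
    return [t for t in tokens if t not in stoplist]


def reduction_rate(before, after):
    """Fraction of tokens removed from the corpus."""
    return (len(before) - len(after)) / len(before)


corpus = "the cat sat on the mat and the dog sat on the rug".split()
lookup_list = {"on", "and"}                      # hypothetical hand-built list
freq_list = frequency_stoplist(corpus, top_k=1)  # most frequent token only
combined = lookup_list | freq_list               # union of both lists

filtered = remove_stop_words(corpus, combined)
rate = reduction_rate(corpus, filtered)
print(filtered, round(rate, 3))
```

Running the same measurement with `lookup_list`, `freq_list`, and `combined` in turn mirrors the three-way comparison described in the abstract, on a much smaller scale.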

A Study on NLP Text Preprocessing for Digital Forensic Investigation

  • 이성원;김도현
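    • Korea Institute of Information and Communication Engineering: Conference Proceedings / Korea Institute of Information and Communication Engineering 2022 Spring Conference / pp. 189-191 / 2022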
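  • In modern society, messenger services are essential for communicating with others, and criminals are no exception. Messenger data is therefore an essential target of analysis in digital forensic investigations; notable examples are the 2018 Burning Sun scandal and the 2019 Nth Room case, in which messenger data served as key evidence in solving the crimes. As messenger services have become widespread, large volumes of messenger data are stored on digital devices, and analyzing this data takes considerable time during digital forensic investigations, so text mining research is needed to address this effectively. This paper studies various natural language preprocessing methods suited to the characteristics of instant messages, in order to perform effective NLP analysis of instant messengers.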


Text Line Segmentation of Handwritten Documents by Area Mapping

  • Boragule, Abhijeet;Lee, GueeSang
    • Smart Media Journal / Vol. 4, No. 3 / pp. 44-49 / 2015
  • Text line segmentation is a preprocessing step in OCR that can significantly influence the accuracy of document analysis applications. This paper proposes a novel methodology for the text line segmentation of handwritten documents. First, the average width of the connected components is used to form a 1-D Gaussian kernel, and a smoothing operation is then applied to the input binary image. Adaptive binarization of the smoothed image forms the final text lines. The segmentation method involves two stages. In the first, the large connected components are labelled as a unique text line using text line area mapping. In the second, the segmentation is refined using the Euclidean distance between the text line and the small connected components. The group of uniquely labelled text candidates achieves promising segmentation results. The proposed approach works well on Korean and English handwritten documents captured using a camera.
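The smoothing idea can be sketched as a horizontal 1-D Gaussian filter that merges nearby characters on a row into one blob. This is a rough illustration, not the paper's method: `sigma` is passed in directly rather than derived from the average connected-component width, a fixed global threshold stands in for adaptive binarization, and the area mapping and refinement stages are not shown.

```python
import math


def gaussian_kernel_1d(sigma):
    """Build a normalized 1-D Gaussian kernel covering about three sigmas each side."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(image, kernel):
    """Convolve every row with the kernel; pixels outside the image count as 0."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = x + k - radius
                if 0 <= xi < n:
                    acc += w * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def binarize(image, threshold):
    """Fixed global threshold (a simplification of adaptive binarization)."""
    return [[1 if v >= threshold else 0 for v in row] for row in image]


# Row 0 holds two "characters" separated by a one-pixel gap; row 1 is blank.
image = [
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
kernel = gaussian_kernel_1d(sigma=1.0)
binary = binarize(smooth_rows(image, kernel), threshold=0.3)
for row in binary:
    print(row)
```

After smoothing, the gap between the two characters is filled, so they fall into a single horizontal component, while the blank row stays empty and keeps neighbouring lines apart.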

Identifying Research Trends in the Emergency Medical Technician Field Using Topic Modeling

  • 이정은;김무현
    • The Korean Journal of Emergency Medical Services / Vol. 26, No. 2 / pp. 19-35 / 2022
  • Purpose: This study aimed to identify research topics in the emergency medical technician (EMT) field and examine research trends. Methods: In this study, 261 research papers published between January 2000 and May 2022 were collected, and EMT research topics and trends were analyzed using topic modeling techniques. This study used a text mining technique and was conducted using data collection flow, keyword preprocessing, and analysis. Keyword preprocessing and data analysis were done with the RStudio Version 4.0.0 program. Results: Keywords were derived through topic modeling analysis, and eight topics were ultimately identified: patient treatment, various roles, the performance of duties, cardiopulmonary resuscitation, triage systems, job stress, disaster management, and education programs. Conclusion: Based on the research results, it is believed that a study on the development and application of education programs that can successfully increase the emergency care capabilities of EMTs is needed.

Creating Knowledge from Construction Documents Using Text Mining

  • Shin, Yoonjung;Chi, Seokho
    • International Conference Proceedings / The 6th International Conference on Construction Engineering and Project Management / pp. 37-38 / 2015
  • Many documents containing important and useful knowledge have been generated over time in the construction industry. Such text-based knowledge plays an important role in decision-making and business strategy development: it serves as best practice for upcoming projects and delivers lessons learned for better risk management and project control. Creating practical, usable knowledge from construction documents is therefore necessary to improve business efficiency. This study proposes a system that creates knowledge from construction documents using text mining; the design comprises three main steps: text mining preprocessing, weight calculation for each term, and visualization. A system prototype was developed as a pilot study of the system design. This study is significant because it validates, through the developed prototype, a knowledge-creation system design based on text mining and visualization. Automated visualization was found to significantly reduce the time and effort spent processing existing data and reading a range of documents to get to their core, and it helped the system provide insight into the construction industry.
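The term-weighting step can be illustrated with a small sketch. The abstract does not say which weighting scheme the prototype uses, so TF-IDF is an assumption chosen here because it is a common default; the whitespace tokenization and the three sample documents are likewise made up for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using plain TF-IDF.

    TF is the term count divided by the document length; IDF is
    log(number of documents / number of documents containing the term).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical construction-document snippets.
docs = ["crane safety risk", "crane schedule delay", "crane cost overrun"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    # Show terms from highest to lowest weight.
    print(doc, sorted(w.items(), key=lambda kv: -kv[1]))
```

A term that appears in every document, such as "crane" here, gets weight 0, while terms specific to one document rank highest; those high-weight terms are the natural candidates to pass on to a visualization step.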
