• Title/Summary/Keyword: Transformer Encoder

Korean automatic spacing using pretrained transformer encoder and analysis

  • Hwang, Taewook;Jung, Sangkeun;Roh, Yoon-Hyung
    • ETRI Journal
    • /
    • v.43 no.6
    • /
    • pp.1049-1057
    • /
    • 2021
  • Automatic spacing in Korean is used to correct spacing units in a given input sentence. The demand for automatic spacing has been increasing owing to frequent incorrect spacing in recent media, such as the Internet and mobile networks. Therefore, herein, we propose a transformer encoder that reads a sentence bidirectionally and can be pretrained using an out-of-task corpus. Notably, our model exhibited the highest character accuracy (98.42%) among the existing automatic spacing models for Korean. We experimentally validated the effectiveness of bidirectional encoding and pretraining for automatic spacing in Korean. Moreover, we conclude that pretraining is more important than fine-tuning and data size.

Low Lumination Image Enhancement with Transformer based Curve Learning

  • Yulin Cao;Chunyu Li;Guoqing Zhang;Yuhui Zheng
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.9
    • /
    • pp.2626-2641
    • /
    • 2024
  • Images taken in low-lumination conditions suffer from low contrast and loss of information. Low-lumination image enhancement algorithms are required to improve the quality and broaden the applications of such images. In this study, we propose a new low-lumination image enhancement architecture consisting of a transformer-based curve learning module and an encoder-decoder-based texture enhancer. Considering the high effectiveness of curve matching, we constructed a transformer-based network to estimate the learnable curve for pixel mapping. Curve estimation requires global relationships, which can be extracted through the transformer framework. To further improve texture detail, we introduced an encoder-decoder network to extract local features and suppress noise. Experiments on the LOL and SID datasets showed that the proposed method not only performs competitively with state-of-the-art techniques but is also highly efficient.

Hyperparameter experiments on end-to-end automatic speech recognition

  • Yang, Hyungwon;Nam, Hosung
    • Phonetics and Speech Sciences
    • /
    • v.13 no.1
    • /
    • pp.45-51
    • /
    • 2021
  • End-to-end (E2E) automatic speech recognition (ASR) has achieved promising performance gains with the introduction of the self-attention network, Transformer. However, due to training time and the number of hyperparameters, finding the optimal hyperparameter set is computationally expensive. This paper investigates the impact of hyperparameters in the Transformer network to answer two questions: which hyperparameter plays a critical role in the task performance and training speed. The Transformer network for training has two encoder and decoder networks combined with Connectionist Temporal Classification (CTC). We trained the model on Wall Street Journal (WSJ) SI-284 and tested it on dev93 and eval92. Seventeen hyperparameters were selected from the ESPnet training configuration, and varying ranges of values were used for the experiments. The results show that the "num blocks" and "linear units" hyperparameters in the encoder and decoder networks reduce Word Error Rate (WER) significantly. However, the performance gain is more prominent when they are altered in the encoder network. Training duration also increased linearly as the values of "num blocks" and "linear units" grew. Based on the experimental results, we collected the optimal values for each hyperparameter and reduced the WER up to 2.9/1.9 on dev93 and eval92, respectively.
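
The abstract above reports results as Word Error Rate. As a rough illustration of that metric (not the ESPnet scoring code, which is not shown in this listing), WER can be sketched as the word-level edit distance divided by the reference length. The function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

WER is usually quoted as a percentage, so the value returned here would be multiplied by 100 before reporting.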

Poly-encoder based COVID-19 Question and Answering with Task Adaptation (Poly-encoder기반의 COVID-19 질의 응답 태스크)

  • Lee, Seolhwa;Lim, Heuiseok
    • Annual Conference on Human and Language Technology
    • /
    • 2020.10a
    • /
    • pp.188-191
    • /
    • 2020
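  • This study proposes a Poly-encoder-based task for COVID-19 question answering. A COVID-19 question answering system should deliver up-to-date information to people quickly and reliably. Retrieval-based question answering systems operate on pairwise computations, and among pretrained transformer-based pairwise methods, the Poly-encoder has been shown to outperform the existing Cross-encoder and Bi-encoder in practical usability and performance [1]. In particular, the Poly-encoder combines high accuracy with fast response times and has performed well on a variety of retrieval-based tasks. Accordingly, for Poly-encoder-based COVID-19 question answering, we created two types of tasks, a conventional question answering task and a persona-based question answering task, and trained the model on them. We also built an automatic crawler to collect data so that the model reflects the latest information from reliable sources. Finally, we constructed a dataset with the help of experts and validated the model on question-answer and query-question pairs.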
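
To make the pairwise scoring idea concrete, here is a minimal NumPy sketch of the scoring step as the Poly-encoder is commonly described: several context vectors are pooled by attention against the candidate embedding, and the pooled vector is then compared with the candidate by a dot product. This is an assumption-laden illustration, not the authors' implementation. The function name `poly_encoder_score` and the toy vectors are hypothetical, and in a real system the vectors would come from pretrained transformer encoders.

```python
import numpy as np


def poly_encoder_score(context_codes: np.ndarray, candidate: np.ndarray) -> float:
    """Score one candidate against m context vectors.

    context_codes: shape (m, d), the m context representations (toy values here).
    candidate:     shape (d,), the candidate (answer) embedding.
    """
    logits = context_codes @ candidate        # (m,) similarity per context code
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    context_vec = weights @ context_codes     # (d,) attention-pooled context
    return float(context_vec @ candidate)     # final dot-product score


codes = np.array([[1.0, 0.0], [0.0, 1.0]])   # hypothetical context codes
cand = np.array([1.0, 0.0])                  # hypothetical candidate embedding
print(poly_encoder_score(codes, cand))
```

Because the candidate embedding does not depend on the query, it can be precomputed, which is where the speed advantage over a Cross-encoder comes from in this style of model.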

U-net with vision transformer encoder for polyp segmentation in colonoscopy images (비전 트랜스포머 인코더가 포함된 U-net을 이용한 대장 내시경 이미지의 폴립 분할)

  • Ayana, Gelan;Choe, Se-woon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2022.10a
    • /
    • pp.97-99
    • /
    • 2022
  • For the early identification and treatment of colorectal cancer, accurate polyp segmentation is crucial. However, polyp segmentation is a challenging task, and the majority of current approaches struggle with two issues. First, the position, size, and shape of individual polyps vary greatly (intra-class inconsistency). Second, there is a significant degree of similarity between polyps and their surroundings under certain circumstances, such as motion blur and light reflection (inter-class indistinction). U-net, which uses convolutional neural networks as its encoder and decoder, is considered a standard for tackling this task. We propose an updated U-net architecture that replaces the encoder part with a vision transformer network for polyp segmentation. The proposed architecture performed better than the standard U-net architecture on the polyp segmentation task.

A study on application of DCT algorithm with MVP(Multimedia Video Processor) (MVP(Multimedia Video Processor)를 이용한 DCT알고리즘 구현에 관한 연구)

  • 김상기;정진현
    • 제어로봇시스템학회:학술대회논문집
    • /
    • 1997.10a
    • /
    • pp.1383-1386
    • /
    • 1997
  • The discrete cosine transform (DCT) is the most popular block transform coding in lossy mode. The DCT is close to the statistically optimal transform, the Karhunen-Loève transform. In this paper, a DCT encoder module is built on the TMS320C80, based on JPEG and MPEG, which are international standards for image compression. The DCT encoder consists of three parts: a transformer, a vector quantizer, and an entropy encoder.
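
As an illustration of the transform stage only (the quantizer and entropy coder are omitted), the following NumPy sketch computes an orthonormal 2D DCT-II on a square block, such as the 8x8 blocks used in JPEG. It is a generic textbook formulation rather than the TMS320C80 implementation, and the names `dct_matrix`, `dct2`, and `idct2` are hypothetical.

```python
import numpy as np


def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)       # the DC row uses a smaller scale factor
    return c


def dct2(block: np.ndarray) -> np.ndarray:
    """2D DCT of a square block: transform the columns, then the rows."""
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T


def idct2(coeffs: np.ndarray) -> np.ndarray:
    """Inverse 2D DCT; the basis is orthonormal, so the inverse is the transpose."""
    c = dct_matrix(coeffs.shape[0])
    return c.T @ coeffs @ c


flat = np.full((8, 8), 1.0)          # a constant block has only a DC component
print(np.round(dct2(flat), 6)[0, :3])
```

For a constant block, all the energy lands in the single DC coefficient, which is the energy-compaction property that makes the DCT useful for lossy compression.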

Performance Evaluation of Vision Transformer-based Pneumonia Detection Model using Chest X-ray Images (흉부 X-선 영상을 이용한 Vision transformer 기반 폐렴 진단 모델의 성능 평가)

  • Junyong Chang;Youngeun Choi;Seungwan Lee
    • Journal of the Korean Society of Radiology
    • /
    • v.18 no.5
    • /
    • pp.541-549
    • /
    • 2024
  • The various structures of artificial neural networks, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been extensively studied and have served as the backbone of numerous models. Among these, the transformer architecture has demonstrated its potential for natural language processing and become a subject of in-depth research. Currently, the techniques can be adapted for image processing through modifications of the internal structure, leading to the development of Vision transformer (ViT) models. ViTs have shown high accuracy and performance with large datasets. This study aims to develop a ViT-based model for detecting pneumonia using chest X-ray images and to quantitatively evaluate its performance. Various architectures of the ViT-based model were constructed by varying the number of encoder blocks, and different patch sizes were applied for network training. The performance of the ViT-based model was also compared to that of CNN-based models such as VGGNet, GoogLeNet, and ResNet. The results showed that the training efficiency and accuracy of the ViT-based model depended on the number of encoder blocks and the patch size, and the F1 scores of the ViT-based model ranged from 0.875 to 0.919. The training efficiency of the ViT-based model with a large patch size was superior to that of the CNN-based models, and the pneumonia detection accuracy of the ViT-based model was higher than that of VGGNet. In conclusion, the ViT-based model can potentially be used for pneumonia detection using chest X-ray images, and the clinical availability of the ViT-based model would be improved by this study.
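
Since the abstract varies the patch size, it may help to see how patch size determines the token sequence a ViT encoder receives. The sketch below splits an image into non-overlapping square patches and flattens each one. It is a generic illustration, not the authors' preprocessing, and the function name `patchify` and the toy image are hypothetical.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)   # (rows, cols, patch_h, patch_w, C)
    return x.reshape((h // patch) * (w // patch), patch * patch * c)


toy = np.arange(64).reshape(8, 8, 1)   # toy single-channel 8x8 "image"
print(patchify(toy, 4).shape)          # 4 patches of 16 values each
print(patchify(toy, 2).shape)          # 16 patches of 4 values each
```

Doubling the patch size cuts the number of tokens by a factor of four, which is consistent with the abstract's observation that larger patches train more efficiently.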

High-Speed Transformer for Panoptic Segmentation

  • Baek, Jong-Hyeon;Kim, Dae-Hyun;Lee, Hee-Kyung;Choo, Hyon-Gon;Koh, Yeong Jun
    • Journal of Broadcast Engineering
    • /
    • v.27 no.7
    • /
    • pp.1011-1020
    • /
    • 2022
  • Recent high-performance panoptic segmentation models are based on transformer architectures. However, transformer-based panoptic segmentation methods are generally slower than convolution-based methods, since the attention mechanism in the transformer has quadratic complexity w.r.t. image resolution. The sine and cosine computation for positional embedding in the transformer is also a bottleneck for computation time. To address these problems, we adopt three modules to speed up the inference runtime of transformer-based panoptic segmentation. First, we perform channel-level reduction using depth-wise separable convolution on the inputs of the transformer decoder. Second, we replace sine and cosine-based positional encoding with convolution operations, called conv-embedding. Third, we apply separable self-attention to the transformer encoder to reduce the complexity from quadratic to linear in the number of image pixels. As a result, when all three modules are used, the proposed model achieves 44% higher frames per second than the baseline on the ADE20K panoptic validation dataset.
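
For context on the sine and cosine positional encoding that the paper replaces, here is a sketch of the standard 1D sinusoidal form from the original Transformer. The exact variant used in the paper's baseline is not stated in this listing, so treat this as an assumption. The function name `sinusoidal_positional_encoding` is hypothetical.

```python
import numpy as np


def sinusoidal_positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Standard 1D sinusoidal encoding: sin on even channels, cos on odd channels."""
    if dim % 2:
        raise ValueError("dim must be even")
    pos = np.arange(num_positions)[:, None]       # (num_positions, 1)
    i = np.arange(dim // 2)[None, :]              # (1, dim / 2)
    angles = pos / (10000.0 ** (2 * i / dim))     # one frequency per channel pair
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


pe = sinusoidal_positional_encoding(16, 8)
print(pe.shape)
```

Every position needs `dim` trigonometric evaluations, so for dense image features the cost grows with the number of pixels, which is the overhead the abstract refers to.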

Assessing Techniques for Advancing Land Cover Classification Accuracy through CNN and Transformer Model Integration (CNN 모델과 Transformer 조합을 통한 토지피복 분류 정확도 개선방안 검토)

  • Woo-Dam SIM;Jung-Soo LEE
    • Journal of the Korean Association of Geographic Information Studies
    • /
    • v.27 no.1
    • /
    • pp.115-127
    • /
    • 2024
  • This research aimed to construct models with various structures based on the Transformer module and to perform land cover classification, thereby examining the applicability of the Transformer module. For the classification of land cover, the Unet model, which has a CNN structure, was selected as the base model, and a total of four deep learning models were constructed by combining both the encoder and decoder parts with the Transformer module. Training of each deep learning model was repeated 10 times under the same conditions to evaluate generalization performance. The evaluation of classification accuracy showed that Model D, which used the Transformer module in both the encoder and decoder structures, achieved the highest overall accuracy, with an average of approximately 89.4% and an average Kappa coefficient of about 73.2%. In terms of training time, the CNN-based models were the most efficient. However, the Transformer-based models yielded an average improvement of 0.5% in classification accuracy based on the Kappa coefficient. It is considered necessary to refine the model by considering variables such as hyperparameters and image patch sizes when integrating with CNN models. A common issue identified in all models during land cover classification was the difficulty of detecting small-scale objects. To reduce this misclassification, it is deemed necessary to explore the use of high-resolution input data and to integrate multidimensional data that includes terrain and texture information.
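
The two metrics quoted above, overall accuracy and the Kappa coefficient, can both be derived from a confusion matrix. The sketch below shows the standard Cohen's kappa computation. The function name `overall_accuracy_and_kappa` and the 2x2 matrix are made up for illustration and are not data from the paper.

```python
import numpy as np


def overall_accuracy_and_kappa(confusion) -> tuple[float, float]:
    """Overall accuracy and Cohen's kappa from a square confusion matrix.

    Rows are reference classes, columns are predicted classes.
    """
    cm = np.asarray(confusion, dtype=float)
    total = cm.sum()
    observed = np.trace(cm) / total                       # overall accuracy
    # Agreement expected by chance, from the row and column marginals.
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (observed - expected) / (1.0 - expected)
    return float(observed), float(kappa)


toy_cm = [[50, 10],
          [5, 35]]   # hypothetical counts
print(overall_accuracy_and_kappa(toy_cm))
```

Kappa discounts the agreement expected by chance, which is why it sits well below overall accuracy when classes are imbalanced, as in the 89.4% versus 73.2% figures above.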

Bird's Eye View Semantic Segmentation based on Improved Transformer for Automatic Annotation

  • Tianjiao Liang;Weiguo Pan;Hong Bao;Xinyue Fan;Han Li
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.8
    • /
    • pp.1996-2015
    • /
    • 2023
  • High-definition (HD) maps can provide precise road information that enables an autonomous driving system to effectively navigate a vehicle. Recent research has focused on leveraging semantic segmentation to achieve automatic annotation of HD maps. However, the existing methods suffer from low recognition accuracy in automatic driving scenarios, leading to inefficient annotation processes. In this paper, we propose a novel semantic segmentation method for automatic HD map annotation. Our approach introduces a new encoder, known as the convolutional transformer hybrid encoder, to enhance the model's feature extraction capabilities. Additionally, we propose a multi-level fusion module that enables the model to aggregate different levels of detail and semantic information. Furthermore, we present a novel decoupled boundary joint decoder to improve the model's ability to handle the boundary between categories. To evaluate our method, we conducted experiments using the Bird's Eye View point cloud images dataset and the Cityscapes dataset. Comparative analysis against state-of-the-art methods demonstrates that our model achieves the highest performance. Specifically, our model achieves an mIoU of 56.26%, surpassing SegFormer by 1.47% in mIoU. This innovation promises to significantly enhance the efficiency of automatic HD map annotation.
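
The mIoU figure above is the mean of per-class intersection-over-union values. A minimal sketch of the usual computation from a confusion matrix follows. The function name `mean_iou` and the toy counts are hypothetical, and benchmark-specific details such as ignored labels are left out.

```python
import numpy as np


def mean_iou(confusion) -> float:
    """Mean intersection-over-union from a square confusion matrix.

    Rows are ground-truth classes, columns are predicted classes.
    Classes that never appear in either are skipped.
    """
    cm = np.asarray(confusion, dtype=float)
    intersection = np.diag(cm)                                # true positives
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection    # TP + FP + FN
    valid = union > 0
    return float((intersection[valid] / union[valid]).mean())


toy_cm = [[3, 1],
          [1, 5]]   # hypothetical pixel counts
print(mean_iou(toy_cm))
```

Because each class contributes equally to the mean, mIoU penalizes poor performance on rare classes more than pixel accuracy does.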