• Title/Summary/Keyword: Transformer-Encoder

45 search results

Korean automatic spacing using pretrained transformer encoder and analysis

  • Hwang, Taewook; Jung, Sangkeun; Roh, Yoon-Hyung
    • ETRI Journal / v.43 no.6 / pp.1049-1057 / 2021
  • Automatic spacing in Korean corrects the spacing units of a given input sentence. Demand for automatic spacing has been increasing owing to frequent spacing errors in recent media such as the Internet and mobile networks. We therefore propose a transformer encoder that reads a sentence bidirectionally and can be pretrained on an out-of-task corpus. Notably, our model achieved the highest character accuracy (98.42%) among existing automatic spacing models for Korean. We experimentally validated the effectiveness of bidirectional encoding and pretraining for automatic spacing in Korean, and conclude that pretraining matters more than fine-tuning or training-data size.
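
As an illustration of the bidirectional, character-level setup this abstract describes, below is a minimal sketch (not the authors' exact model): a Transformer encoder that tags each character of a sentence with a binary space/no-space label. The vocabulary size and dimensions are placeholders, and positional encoding and pretraining are omitted for brevity.

```python
# Minimal sketch, assuming a character-level binary tagging formulation;
# positional encoding and the pretraining stage are omitted for brevity.
import torch
import torch.nn as nn

class SpacingTagger(nn.Module):
    def __init__(self, vocab_size=12000, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, 2)  # space / no-space per character

    def forward(self, char_ids):                 # char_ids: (batch, seq_len)
        h = self.encoder(self.embed(char_ids))   # full bidirectional self-attention
        return self.classifier(h)                # (batch, seq_len, 2) logits

logits = SpacingTagger()(torch.randint(0, 12000, (2, 50)))
```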

Hyperparameter experiments on end-to-end automatic speech recognition

  • Yang, Hyungwon; Nam, Hosung
    • Phonetics and Speech Sciences / v.13 no.1 / pp.45-51 / 2021
  • End-to-end (E2E) automatic speech recognition (ASR) has achieved promising performance gains with the introduction of the self-attention network, the Transformer. However, owing to training time and the number of hyperparameters, finding the optimal hyperparameter set is computationally expensive. This paper investigates the impact of hyperparameters in the Transformer network to answer two questions: which hyperparameters play a critical role in task performance, and which in training speed. The trained Transformer network consists of encoder and decoder networks combined with Connectionist Temporal Classification (CTC). We trained the model on Wall Street Journal (WSJ) SI-284 and tested on dev93 and eval92. Seventeen hyperparameters were selected from the ESPnet training configuration, and varying ranges of values were used in the experiments. The results show that the "num blocks" and "linear units" hyperparameters in the encoder and decoder networks reduce the Word Error Rate (WER) significantly, and the performance gain is more prominent when they are altered in the encoder network. Training duration also increased linearly as the values of "num blocks" and "linear units" grew. Based on the experimental results, we collected the optimal value of each hyperparameter and reduced the WER to as low as 2.9/1.9 on dev93 and eval92, respectively.
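
For readers unfamiliar with ESPnet-style configuration, the sketch below shows where the two most influential hyperparameters from this study, "num blocks" and "linear units", would sit in a Transformer CTC/attention setup. The key names mirror common ESPnet2 configs; treat them and the values as illustrative, not the paper's tuned optimum.

```python
# Illustrative ESPnet2-style configuration, expressed as a Python dict.
# Key names and values are assumptions, not the paper's exact setup.
config = {
    "encoder": "transformer",
    "encoder_conf": {
        "num_blocks": 12,      # encoder depth; the paper reports the largest WER gains here
        "linear_units": 2048,  # feed-forward layer width
    },
    "decoder": "transformer",
    "decoder_conf": {
        "num_blocks": 6,
        "linear_units": 2048,
    },
    "model_conf": {"ctc_weight": 0.3},  # hybrid CTC/attention training
}
```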

Poly-encoder based COVID-19 Question and Answering with Task Adaptation (Poly-encoder기반의 COVID-19 질의 응답 태스크)

  • Lee, Seolhwa; Lim, Heuiseok
    • Annual Conference on Human and Language Technology / 2020.10a / pp.188-191 / 2020
  • This study proposes a Poly-encoder-based approach to the COVID-19 question-answering task. A COVID-19 question-answering system must deliver up-to-date information to people quickly and reliably. Retrieval-based question-answering systems rely on pairwise computation; among pretrained transformer-based pairwise methods, the Poly-encoder has been shown to outperform the existing Cross-encoder and Bi-encoder in both practicality and performance [1]. In particular, the Poly-encoder combines high accuracy with fast response times and has performed well on a variety of retrieval-based tasks. Accordingly, this study trained models on two task types for Poly-encoder-based COVID-19 question answering: a conventional question-answering task and a persona-based question-answering task. In addition, an automatic crawler was built to collect data so that up-to-date information from trustworthy sources could be reflected in the model. Finally, an expert-curated dataset was constructed to validate the model on question-answer and query-question matching.
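
Since the abstract leans on the Poly-encoder's pairwise scoring, here is a compact sketch of that mechanism (after the Poly-encoder paper cited as [1]; shapes and dimensions are illustrative): m learned codes attend over the context tokens, and the candidate attends over the resulting global vectors before a final dot-product score.

```python
import torch
import torch.nn.functional as F

def poly_encoder_score(ctx_tokens, codes, cand_emb):
    """Poly-encoder scoring sketch; shapes are illustrative.
    ctx_tokens: (seq, dim) context token embeddings from a pretrained encoder
    codes:      (m, dim)   m learned query codes
    cand_emb:   (dim,)     candidate (response) embedding
    """
    # m global context vectors: each code attends over the context tokens
    attn = F.softmax(codes @ ctx_tokens.T, dim=-1)    # (m, seq)
    global_ctx = attn @ ctx_tokens                    # (m, dim)
    # the candidate attends over the m context vectors, then a dot-product score
    w = F.softmax(global_ctx @ cand_emb, dim=-1)      # (m,)
    ctx_final = w @ global_ctx                        # (dim,)
    return ctx_final @ cand_emb                       # scalar relevance score

score = poly_encoder_score(torch.randn(20, 64), torch.randn(4, 64), torch.randn(64))
```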

U-net with vision transformer encoder for polyp segmentation in colonoscopy images (비전 트랜스포머 인코더가 포함된 U-net을 이용한 대장 내시경 이미지의 폴립 분할)

  • Ayana, Gelan; Choe, Se-woon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.10a / pp.97-99 / 2022
  • Accurate polyp segmentation is crucial for the early identification and treatment of colorectal cancer. However, polyp segmentation is a challenging task, and the majority of current approaches struggle with two issues. First, the position, size, and shape of individual polyps vary greatly (intra-class inconsistency). Second, under certain circumstances, such as motion blur and light reflection, polyps are highly similar to their surroundings (inter-class indistinction). U-net, which uses convolutional neural networks as its encoder and decoder, is considered a standard for tackling this task. We propose an updated U-net architecture that replaces the encoder with a vision transformer network for polyp segmentation. The proposed architecture performed better than the standard U-net architecture on the polyp segmentation task.
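
A minimal sketch of the architecture described above, under simplifying assumptions (no skip connections, no pretrained ViT weights, illustrative dimensions): a ViT-style patch encoder feeds a small convolutional upsampling decoder that outputs a one-channel polyp mask.

```python
# Sketch only: the skip connections of a true U-net and the positional
# embedding of a full ViT are omitted for brevity.
import torch
import torch.nn as nn

class ViTUNet(nn.Module):
    def __init__(self, img=224, patch=16, dim=256, depth=4, heads=4):
        super().__init__()
        self.grid = img // patch                                   # 14x14 token grid
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)  # ViT patchify
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.Sequential(                              # conv upsampling decoder
            nn.Upsample(scale_factor=4), nn.Conv2d(dim, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4), nn.Conv2d(64, 1, 3, padding=1),  # 1-channel mask
        )

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        t = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, 196, dim) tokens
        t = self.encoder(t)
        f = t.transpose(1, 2).reshape(-1, t.size(-1), self.grid, self.grid)
        return self.decoder(f)                              # (B, 1, 224, 224) logits

mask = ViTUNet()(torch.randn(1, 3, 224, 224))
```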

A study on application of DCT algorithm with MVP(Multimedia Video Processor) (MVP(Multimedia Video Processor)를 이용한 DCT알고리즘 구현에 관한 연구)

  • 김상기; 정진현
    • Institute of Control, Robotics and Systems: Conference Proceedings / 1997.10a / pp.1383-1386 / 1997
  • The discrete cosine transform (DCT) is the most popular block transform for lossy coding, and it is close to the statistically optimal transform, the Karhunen-Loève transform. In this paper, a DCT encoder module is implemented on the TMS320C80 based on JPEG and MPEG, the international standards for image compression. The DCT encoder consists of three parts: a transformer, a vector quantizer, and an entropy encoder.
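
As a worked example of the transform stage named above, the following computes the orthonormal 8x8 2-D DCT-II used in JPEG-style coders via SciPy; the quantizer and entropy-coder stages are omitted.

```python
# 8x8 block DCT-II, the transform stage of a JPEG-style encoder.
import numpy as np
from scipy.fftpack import dct

def block_dct2(block):
    """Orthonormal 2-D DCT-II of an 8x8 pixel block (row pass, then column pass)."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

block = np.random.randint(0, 256, (8, 8)).astype(float) - 128  # level-shifted pixels
coeffs = block_dct2(block)  # coeffs[0, 0] is the DC term; the rest are AC terms
```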

High-Speed Transformer for Panoptic Segmentation

  • Baek, Jong-Hyeon; Kim, Dae-Hyun; Lee, Hee-Kyung; Choo, Hyon-Gon; Koh, Yeong Jun
    • Journal of Broadcast Engineering / v.27 no.7 / pp.1011-1020 / 2022
  • Recent high-performance panoptic segmentation models are based on transformer architectures. However, transformer-based panoptic segmentation methods are inherently slower than convolution-based methods, since the attention mechanism in the transformer requires quadratic complexity with respect to image resolution. The sine and cosine computations for positional embedding in the transformer are another bottleneck in computation time. To address these problems, we adopt three modules to speed up the inference runtime of transformer-based panoptic segmentation. First, we perform channel-level reduction using depth-wise separable convolution on the inputs of the transformer decoder. Second, we replace sine- and cosine-based positional encoding with convolution operations, called conv-embedding. Third, we apply separable self-attention to the transformer encoder to lower the quadratic complexity to linear in the number of image pixels. As a result, when all three modules are used, the proposed model achieves 44% more frames per second than the baseline on the ADE20K panoptic validation dataset.
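
Hedged sketches of two of the three speed-up modules follow; the module names come from the abstract, but the exact layer choices and dimensions here are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# 1) Channel-level reduction with a depth-wise separable convolution.
def depthwise_separable(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # depth-wise pass
        nn.Conv2d(in_ch, out_ch, 1),                          # point-wise pass reduces channels
    )

# 2) "Conv-embedding": positional information from a convolution instead of
# sine/cosine tables, in the spirit of conditional positional encodings.
class ConvEmbedding(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.pos = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)

    def forward(self, x):          # x: (B, C, H, W) feature map
        return x + self.pos(x)     # position-dependent offset, no trig computation

x = torch.randn(1, 256, 64, 64)
y = ConvEmbedding(64)(depthwise_separable(256, 64)(x))  # (1, 64, 64, 64)
```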

Assessing Techniques for Advancing Land Cover Classification Accuracy through CNN and Transformer Model Integration (CNN 모델과 Transformer 조합을 통한 토지피복 분류 정확도 개선방안 검토)

  • Woo-Dam SIM; Jung-Soo LEE
    • Journal of the Korean Association of Geographic Information Studies / v.27 no.1 / pp.115-127 / 2024
  • This research constructed models with various structures based on the Transformer module and performed land cover classification, thereby examining the applicability of the Transformer module. For land cover classification, the Unet model, which has a CNN structure, was selected as the base model, and a total of four deep learning models were constructed by combining the encoder and decoder parts with the Transformer module. During training, each deep learning model was trained 10 times under the same conditions to evaluate generalization performance. The evaluation of classification accuracy showed that Model D, which uses the Transformer module in both the encoder and decoder, achieved the highest overall accuracy, averaging approximately 89.4% with an average Kappa coefficient of about 73.2%. In terms of training time, the CNN-based models were the most efficient; however, the Transformer-based models improved classification accuracy by an average of 0.5% in terms of the Kappa coefficient. It is considered necessary to refine the models by examining variables such as hyperparameters and image patch sizes during integration with CNN models. A common issue identified in all models during land cover classification was difficulty detecting small-scale objects. To reduce this misclassification, it is deemed necessary to explore high-resolution input data and to integrate multidimensional data that includes terrain and texture information.
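
Because the model comparisons above are stated in terms of the Kappa coefficient, here is a short, self-contained example of how Cohen's kappa is computed from a confusion matrix; the matrix values are made up for illustration.

```python
import numpy as np

def cohens_kappa(cm):
    """Cohen's kappa from a square confusion matrix (rows: true, cols: predicted)."""
    n = cm.sum()
    p_observed = np.trace(cm) / n                                 # agreement on the diagonal
    p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2   # chance agreement
    return (p_observed - p_expected) / (1 - p_expected)

cm = np.array([[50, 3, 2], [4, 40, 6], [1, 5, 39]])  # illustrative 3-class matrix
print(round(cohens_kappa(cm), 3))
```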

Bird's Eye View Semantic Segmentation based on Improved Transformer for Automatic Annotation

  • Tianjiao Liang; Weiguo Pan; Hong Bao; Xinyue Fan; Han Li
    • KSII Transactions on Internet and Information Systems (TIIS) / v.17 no.8 / pp.1996-2015 / 2023
  • High-definition (HD) maps provide the precise road information that an autonomous driving system needs to navigate a vehicle effectively. Recent research has focused on leveraging semantic segmentation to automate HD map annotation. However, existing methods suffer from low recognition accuracy in automated driving scenarios, leading to inefficient annotation processes. In this paper, we propose a novel semantic segmentation method for automatic HD map annotation. Our approach introduces a new encoder, the convolutional transformer hybrid encoder, to enhance the model's feature extraction capability. Additionally, we propose a multi-level fusion module that enables the model to aggregate different levels of detail and semantic information. Furthermore, we present a novel decoupled boundary joint decoder to improve the model's handling of boundaries between categories. To evaluate our method, we conducted experiments on the Bird's Eye View point cloud images dataset and the Cityscapes dataset. Comparative analysis against state-of-the-art methods demonstrates that our model achieves the highest performance: an mIoU of 56.26%, surpassing SegFormer by 1.47% mIoU. This innovation promises to significantly enhance the efficiency of automatic HD map annotation.
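
For reference, the mIoU metric quoted above is computed per class as intersection over union, averaged over the classes present; a minimal sketch with made-up label maps follows.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                     # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 19, (512, 512))  # e.g., 19 Cityscapes classes
gt = np.random.randint(0, 19, (512, 512))
print(mean_iou(pred, gt, 19))
```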

Time-Series Forecasting Based on Multi-Layer Attention Architecture

  • Na Wang; Xianglian Zhao
    • KSII Transactions on Internet and Information Systems (TIIS) / v.18 no.1 / pp.1-14 / 2024
  • Time-series forecasting is used extensively in the real world. Recent research has shown that Transformers, with a self-attention mechanism at their core, exhibit better performance on such problems. However, most existing Transformer models for time-series prediction use the traditional encoder-decoder architecture, which is complex and leads to low processing efficiency, limiting the ability to mine deep time dependencies by increasing model depth. In addition, the quadratic computational complexity of the self-attention mechanism increases computational overhead and reduces processing efficiency. To address these issues, this paper designs an efficient multi-layer attention-based time-series forecasting model with the following characteristics: (i) it abandons the traditional encoder-decoder Transformer architecture and builds the prediction model on a multi-layer attention mechanism, improving its ability to mine deep time dependencies; (ii) a cross-attention module is designed to enhance information exchange between the historical and predictive sequences; (iii) a recently proposed sparse attention mechanism is applied to reduce computational overhead and improve processing efficiency. Experiments on multiple datasets show that our model significantly outperforms current advanced Transformer methods for time-series forecasting, including LogTrans, Reformer, and Informer.
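
A compact sketch of the cross-attention idea in point (ii): queries drawn from the predictive sequence attend over keys and values from the historical sequence. The dimensions and the use of PyTorch's stock multi-head attention are illustrative assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

# Cross attention: the forecast window queries the observed history.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
history = torch.randn(8, 96, 64)   # (batch, past steps, features)
future = torch.randn(8, 24, 64)    # (batch, forecast steps, features)
fused, _ = attn(query=future, key=history, value=history)  # (8, 24, 64)
```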

Lip and Voice Synchronization Using Visual Attention (시각적 어텐션을 활용한 입술과 목소리의 동기화 연구)

  • Dongryun Yoon; Hyeonjoong Cho
    • The Transactions of the Korea Information Processing Society / v.13 no.4 / pp.166-173 / 2024
  • This study explores lip-sync detection, focusing on the synchronization between lip movements and voices in videos. Typical lip-sync detection techniques crop the facial area of a given video and use the lower half of the cropped box as input to a visual encoder that extracts visual features. To emphasize the articulatory region of the lips for more accurate lip-sync detection, we propose using a pretrained visual-attention-based encoder. The Visual Transformer Pooling (VTP) module, originally designed for the lip-reading task of predicting a script from visual information alone, without audio, is employed as the visual encoder. Our experimental results demonstrate that, despite having fewer learnable parameters, the proposed method outperforms the latest model, VocaList, on the LRS2 dataset, achieving a lip-sync detection accuracy of 94.5% with five context frames. Moreover, our approach outperforms VocaList by approximately 8% in lip-sync detection accuracy even on an untrained dataset, Acappella.
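
In the spirit of the attention-based pooling the VTP encoder performs (a hedged sketch, not VTP's actual implementation), learned queries can attend over a frame's spatial features so that informative lip regions dominate the pooled descriptor.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=128, num_queries=1, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))  # learned queries
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                         # feats: (B, H*W, dim) per frame
        q = self.queries.expand(feats.size(0), -1, -1)
        pooled, weights = self.attn(q, feats, feats)  # weights show where the model attends
        return pooled.squeeze(1)                      # (B, dim) frame descriptor

vec = AttentionPool()(torch.randn(4, 49, 128))        # 7x7 spatial grid, flattened
```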