• Title/Summary/Keyword: Parallel Processing


Cross-Lingual Style-Based Title Generation Using Multiple Adapters (다중 어댑터를 이용한 교차 언어 및 스타일 기반의 제목 생성)

  • Yo-Han Park;Yong-Seok Choi;Kong Joo Lee
    • KIPS Transactions on Software and Data Engineering / v.12 no.8 / pp.341-354 / 2023
  • The title of a document is a brief summarization of the document. Readers can understand a document more easily if we provide its title in their preferred style and language. In this research, we propose a cross-lingual and style-based title generation model using multiple adapters. Training such a model requires a parallel corpus in several languages with different styles, which is quite difficult to construct; however, a monolingual title generation corpus of a single style can be built easily. Therefore, we apply a zero-shot strategy to generate a title in a different language and with a different style for an input document. The baseline model is a Transformer consisting of an encoder and a decoder, pre-trained on several languages. The model is then equipped with multiple adapters for translation, languages, and styles. After the model learns a translation task from a parallel corpus, it learns a title generation task from a monolingual title generation corpus. When training the model on a task, we activate only the adapter that corresponds to that task. When generating a cross-lingual and style-based title, we activate only the adapters that correspond to the target language and the target style. Experimental results show that the proposed model performs on par with a pipeline model that first translates into the target language and then generates a title. Natural language generation has changed significantly with the emergence of large-scale language models; nevertheless, research on improving generation performance with limited resources and limited data must continue, and this study explores the significance of such research.
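
The per-task adapter switching described in this abstract can be pictured with a minimal PyTorch-style sketch; the bottleneck design, dimensions, and adapter names below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of per-task adapter activation: a bottleneck adapter per
# task/language/style, with only the selected adapters applied at a time.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

class AdapterBank(nn.Module):
    """Holds named adapters; routes hidden states through the active ones only."""
    def __init__(self, d_model: int, names: list[str]):
        super().__init__()
        self.adapters = nn.ModuleDict({n: Adapter(d_model) for n in names})

    def forward(self, x: torch.Tensor, active: list[str]) -> torch.Tensor:
        for name in active:          # activate only task-relevant adapters
            x = self.adapters[name](x)
        return x

# Hypothetical usage: generate with the Korean + "headline style" adapters active.
bank = AdapterBank(d_model=512, names=["translate", "lang_ko", "style_headline"])
h = torch.randn(2, 10, 512)                       # (batch, seq, hidden)
out = bank(h, active=["lang_ko", "style_headline"])
print(out.shape)                                  # torch.Size([2, 10, 512])
```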

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization (GPGPU 자원 활용 개선을 위한 블록 지연시간 기반 워프 스케줄링 기법)

  • Thuan, Do Cong;Choi, Yong;Kim, Jong Myon;Kim, Cheol Hong
    • KIPS Transactions on Computer and Communication Systems / v.6 no.5 / pp.219-230 / 2017
  • General-Purpose Graphics Processing Units (GPGPUs) employ a massively parallel architecture and multithreading to exploit parallelism. With programming models such as CUDA and OpenCL, GPGPUs excel at exploiting the plentiful thread-level parallelism exposed by parallel applications. Unfortunately, modern GPGPUs cannot efficiently utilize their available hardware resources for many general-purpose applications. One of the primary reasons is the inefficiency of existing warp/thread block schedulers in hiding long-latency instructions, resulting in lost opportunities to improve performance. This paper studies the effects of the hardware thread scheduling policy on GPGPU performance. We propose a novel warp scheduling policy that alleviates the drawbacks of the traditional round-robin policy. The proposed warp scheduler first classifies the warps of a thread block into two groups, warps with long latency and warps with short latency, and then schedules the long-latency warps before the short-latency warps. Furthermore, to support the proposed warp scheduler, we also propose a supplemental technique that dynamically reduces the number of streaming multiprocessors to which thread blocks are assigned when a high degree of contention is encountered at the memory and interconnection network. Based on our experiments on a 15-streaming-multiprocessor GPGPU platform, the proposed warp scheduling policy provides an average IPC improvement of 7.5% over the baseline round-robin warp scheduling policy. This paper also shows that GPGPU performance can be improved by approximately 8.9% on average when the two proposed techniques are combined.
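
The scheduling idea lends itself to a toy sketch; the latency classification below (a fixed set of long-latency opcodes) is an assumed stand-in for the paper's policy, not its actual hardware.

```python
# Toy sketch of the scheduling idea: classify a thread block's warps by
# whether their next instruction has long latency (e.g., a global memory
# load), then issue long-latency warps first so their stalls overlap with
# the short-latency warps issued afterwards.
from dataclasses import dataclass

LONG_LATENCY_OPS = {"global_load", "texture_load"}  # illustrative set

@dataclass
class Warp:
    wid: int
    next_op: str  # opcode of the next instruction to issue

def schedule_block(warps: list[Warp]) -> list[Warp]:
    long_group = [w for w in warps if w.next_op in LONG_LATENCY_OPS]
    short_group = [w for w in warps if w.next_op not in LONG_LATENCY_OPS]
    return long_group + short_group  # long-latency warps issue first

block = [Warp(0, "add"), Warp(1, "global_load"), Warp(2, "mul"), Warp(3, "global_load")]
print([w.wid for w in schedule_block(block)])  # [1, 3, 0, 2]
```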

Low Space Complexity Bit Parallel Multiplier For Irreducible Trinomial over GF($2^n$) (삼항 기약다항식을 이용한 GF($2^n$)의 효율적인 저면적 비트-병렬 곱셈기)

  • Cho, Young-In;Chang, Nam-Su;Kim, Chang-Han;Hong, Seok-Hie
    • Journal of the Institute of Electronics Engineers of Korea SD / v.45 no.12 / pp.29-40 / 2008
  • The efficient hardware design of finite field multiplication is a very important research topic for the efficient implementation of cryptosystems based on arithmetic in the finite field GF($2^n$). We use a special generating trinomial $f(x)=x^n+x^k+1$ to construct a bit-parallel multiplier over the finite field with low space complexity. To reduce processing time, the hardware architecture of the proposed multiplier is similar to that of the existing Mastrovito multiplier. The complexity of the proposed multiplier depends on the degree of the intermediate term $x^k$, and its space complexity is lower than that of the existing multiplier by $2k^2-2k+1$. The time complexity of the proposed multiplier is equal to that of the existing multiplier or increases by $1T_X$ ($10\%{\sim}12.5\%$), but the space complexity is reduced by up to 25%.
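
As a rough software illustration of arithmetic modulo an irreducible trinomial (the paper's contribution is the hardware structure, which this does not model), one can reduce a carry-less product by folding $x^n$ back to $x^k + 1$:

```python
# GF(2^n) multiplication modulo f(x) = x^n + x^k + 1, with field elements
# packed as bit vectors into Python integers (bit i = coefficient of x^i).
def gf2n_mul(a: int, b: int, n: int, k: int) -> int:
    # Carry-less (polynomial) multiplication over GF(2).
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    # Reduce modulo the trinomial: x^n == x^k + 1, folding high bits down.
    for i in range(2 * n - 2, n - 1, -1):
        if prod >> i & 1:
            prod ^= 1 << i                                # clear x^i
            prod ^= (1 << (i - n + k)) | (1 << (i - n))   # add x^(i-n+k) + x^(i-n)
    return prod

# Example in GF(2^7) with f(x) = x^7 + x^3 + 1 (hypothetical parameters):
print(bin(gf2n_mul(0b1011001, 0b0110101, n=7, k=3)))
```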

Comparison of Korean Real-time Text-to-Speech Technology Based on Deep Learning (딥러닝 기반 한국어 실시간 TTS 기술 비교)

  • Kwon, Chul Hong
    • The Journal of the Convergence on Culture Technology / v.7 no.1 / pp.640-645 / 2021
  • A deep learning based end-to-end TTS system consists of a Text2Mel module that generates a spectrogram from text and a vocoder module that synthesizes speech signals from the spectrogram. Recently, by applying deep learning technology to TTS systems, the intelligibility and naturalness of the synthesized speech have improved to the level of human vocalization. However, the inference speed for synthesizing speech is very slow compared to conventional methods. The inference speed can be improved by applying non-autoregressive methods, which generate speech samples in parallel, independently of previously generated samples. In this paper, we introduce FastSpeech, FastSpeech 2, and FastPitch as Text2Mel technologies, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as vocoder technologies applying the non-autoregressive method, and we implement them to verify whether they can run in real time. Experimental results show that, by the obtained RTF, all the presented methods are fully capable of real-time processing. The size of each trained model is about tens to hundreds of megabytes, except for WaveGlow, so the models can be applied to embedded environments where memory is limited.
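
The real-time claim rests on the real-time factor (RTF); a minimal sketch of the usual definition (synthesis wall-clock time over audio duration, with a dummy synthesizer standing in for the models) follows:

```python
# RTF = time to synthesize / duration of the generated audio.
# RTF < 1 means the system runs faster than real time.
import time

def real_time_factor(synthesize, text: str, sample_rate: int) -> float:
    """synthesize(text) -> 1-D sequence of audio samples (hypothetical interface)."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)

# Dummy synthesizer standing in for a non-autoregressive TTS model:
fake_tts = lambda text: [0.0] * (22050 * 2)   # 2 seconds of silence at 22.05 kHz
print(real_time_factor(fake_tts, "안녕하세요", sample_rate=22050))
```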

The Motion Estimator Implementation with Efficient Structure for Full Search Algorithm of Variable Block Size (다양한 블록 크기의 전역 탐색 알고리즘을 위한 효율적인 구조를 갖는 움직임 추정기 설계)

  • Hwang, Jong-Hee;Choe, Yoon-Sik
    • Journal of the Institute of Electronics Engineers of Korea SD / v.46 no.11 / pp.66-76 / 2009
  • Motion estimation occupies the largest part of a video encoding system, so a motion estimator with an efficient structure is required for real-time operation, and it is desirable to design a dedicated hardware module that performs the encoding process at high speed. This paper proposes a motion estimation detection (MED) block, a block that computes the 41 SADs (sums of absolute differences), and a block for minimum-SAD calculation and motion vector generation, all based on parallel processing. Parallel processing effectively reduces the amount of computation. The minimum-SAD calculation and MED blocks use a pre-computation technique to reduce the switching activity of the input signals, which enables high-speed operation. The MED and 41-SAD calculation blocks are composed of adder trees, which create the critical path, so the adder-tree structure replaces the commonly used ripple carry adder (RCA) with a carry skip adder (CSA), enabling the adder tree to operate at high speed. In addition, since key variables such as the search-range control signal can easily be controlled from the outside, the efficiency of the hardware structure is increased. Simulation and FPGA verification results show that the delay of the MED block, which generates the critical path in the motion estimator, is reduced by about 19.89% compared to the conventional structure.
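
The 41-SAD structure can be illustrated in software: for H.264-style variable block sizes, the sixteen 4x4 SADs are the building blocks from which every larger partition's SAD is summed. A NumPy sketch (illustrative only, not the hardware):

```python
# SAD for variable-block-size full search: the 16x16 macroblock partitions
# (one 16x16, two 16x8, two 8x16, four 8x8, ... sixteen 4x4) total 41 SADs,
# and each larger partition's SAD is a sum of 4x4 SADs.
import numpy as np

def sad(cur: np.ndarray, ref: np.ndarray) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(cur.astype(int) - ref.astype(int)).sum())

def sads_4x4(cur16: np.ndarray, ref16: np.ndarray) -> np.ndarray:
    """The sixteen 4x4 SADs of a 16x16 macroblock."""
    return np.array([[sad(cur16[4*i:4*i+4, 4*j:4*j+4],
                          ref16[4*i:4*i+4, 4*j:4*j+4]) for j in range(4)]
                     for i in range(4)])

rng = np.random.default_rng(0)
cur = rng.integers(0, 256, (16, 16), dtype=np.uint8)
ref = rng.integers(0, 256, (16, 16), dtype=np.uint8)
s44 = sads_4x4(cur, ref)
print(s44.sum() == sad(cur, ref))   # True: the 16x16 SAD is the sum of 4x4 SADs
```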

The Design of 10-bit 200MS/s CMOS Parallel Pipeline A/D Converter (10-비트 200MS/s CMOS 병렬 파이프라인 아날로그/디지털 변환기의 설계)

  • Chung, Kang-Min
    • The KIPS Transactions:PartA / v.11A no.2 / pp.195-202 / 2004
  • This paper introduces the design of a parallel pipeline high-speed analog-to-digital converter (ADC) for high-resolution video applications that require very precise sampling. The overall architecture of the ADC is a 4-channel, time-interleaved, 10-bit pipeline structure allowing a 200 MSample/s sampling rate, a 4-fold improvement over the per-channel sampling speed. The key building blocks are the front-end sample-and-hold amplifier (SHA), the dynamic comparator, and the 2-stage fully differential operational amplifier. The 1-bit DAC, comparator, and gain-of-2 amplifier used inside each stage were integrated into a single switched-capacitor architecture, allowing high-speed operation as well as low power consumption. In this work, the gain of the operational amplifier was enhanced significantly using a negative resistance element. In the ADC, a delay line is designed for each stage using D flip-flops to align the bit signals and minimize timing error in the conversion. The converter dissipates 280 mW from a 3.3 V supply. Measured performance includes DNL and INL of +0.7/-0.6 LSB and +0.9/-0.3 LSB, respectively.
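
A small NumPy sketch of the time-interleaving principle (illustrative, not the circuit): four phase-staggered 50 MS/s channels recombine into one 200 MS/s stream.

```python
# 4-channel time interleaving: sub-ADC c samples indices c, c+4, c+8, ...
# at fs/4 each, and the digital outputs are re-interleaved at full rate.
import numpy as np

fs = 200e6          # aggregate sampling rate
channels = 4        # sub-ADCs
n = 16              # samples to illustrate

t = np.arange(n) / fs
signal = np.sin(2 * np.pi * 10e6 * t)        # 10 MHz test tone

sub_streams = [signal[c::channels] for c in range(channels)]  # 50 MS/s each

# Re-interleave the channel outputs back into the full-rate stream.
recombined = np.empty(n)
for c, s in enumerate(sub_streams):
    recombined[c::channels] = s
print(np.allclose(recombined, signal))       # True
```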

A Bottom-up Algorithm to Find the Densest Subgraphs Based on MapReduce (맵리듀스 기반 상향식 최대 밀도 부분그래프 탐색 알고리즘)

  • Lee, Woonghee;Kim, Younghoon
    • Journal of KIISE / v.44 no.1 / pp.78-83 / 2017
  • Finding dense subgraphs in social networks, such that the people in a subgraph belong to a particular community or share common interests, has been a recurring problem in numerous studies. However, previous algorithms have focused only on finding the single densest subgraph. We suggest a bottom-up heuristic algorithm that finds a densest subgraph by growing it from a given starting node, repeatedly adding the adjacent node with the maximum degree. Furthermore, since this approach matches well with parallel processing, we also implement a parallel version of the algorithm on the MapReduce framework. In experiments using various graph data, we confirmed that the proposed algorithm finds the densest subgraphs in fewer steps than other related approaches, and that it scales efficiently for many given starting nodes.
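
A minimal sketch of the bottom-up heuristic as read from the abstract (using networkx for brevity; the density bookkeeping is an assumption, not the authors' exact procedure):

```python
# Grow a subgraph from a start node by repeatedly adding the adjacent node
# of maximum degree, keeping the densest node set seen so far
# (density = edges / nodes, the usual densest-subgraph objective).
import networkx as nx

def bottom_up_densest(g: nx.Graph, start, max_steps: int = 50) -> set:
    sub = {start}
    best, best_density = set(sub), 0.0
    for _ in range(max_steps):
        frontier = {v for u in sub for v in g.neighbors(u)} - sub
        if not frontier:
            break
        sub.add(max(frontier, key=g.degree))    # greedy: max-degree neighbor
        density = g.subgraph(sub).number_of_edges() / len(sub)
        if density > best_density:
            best, best_density = set(sub), density
    return best

g = nx.karate_club_graph()
print(sorted(bottom_up_densest(g, start=0)))
```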

Analysis tool for the diffusion model using GPU: SNUDM-G (GPU를 이용한 확산모형 분석 도구: SNUDM-G)

  • Lee, Dajung;Lee, Hyosun;Koh, Sungryong
    • Korean Journal of Cognitive Science / v.33 no.3 / pp.155-168 / 2022
  • In this paper, we introduce SNUDM-G, a diffusion model analysis tool with improved computational speed. Although the diffusion model has been applied to explain various cognitive tasks, its use has been limited by computational difficulties. In particular, SNUDM (Koh et al., 2020), one of the diffusion model analysis tools, is slow because it sequentially generates 20,000 simulated trials when approximating the diffusion process. To overcome this limitation, we propose using graphics processing units (GPUs) when approximating the diffusion process with a random walk process. Since the 20,000 trials can be generated in parallel on the GPU, estimation is faster than generating the data sequentially. Analyzing the data of Experiment 1 of Ratcliff et al. (2004) and recovering the parameters with the GPU-based SNUDM-G and the CPU-based SNUDM, we found that SNUDM-G estimated slightly higher values for certain parameters than SNUDM but estimated the parameters much faster. This result shows that the tool enables more efficient diffusion model analysis for various cognitive tasks, and further suggests that the processing speed of various cognitive models can be improved by using graphics processing units.
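
The parallelization idea can be sketched with NumPy on the CPU (standing in for the GPU kernels): all trials advance one random-walk step at a time, rather than simulating one trial after another.

```python
# Vectorized random-walk approximation of the diffusion process: every trial
# takes a step per iteration; trials stop when they hit a decision boundary.
import numpy as np

def random_walk_rts(n_trials=20000, drift=0.1, boundary=1.0, dt=0.001,
                    sigma=1.0, max_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n_trials)                  # evidence for every trial at once
    rt = np.full(n_trials, np.nan)          # response times
    choice = np.zeros(n_trials)             # 1 = upper boundary, 0 = lower
    alive = np.ones(n_trials, dtype=bool)
    for step in range(1, max_steps + 1):
        x[alive] += drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(alive.sum())
        hit_up = alive & (x >= boundary)
        hit_lo = alive & (x <= -boundary)
        rt[hit_up | hit_lo] = step * dt
        choice[hit_up] = 1
        alive &= ~(hit_up | hit_lo)
        if not alive.any():
            break
    return rt, choice

rt, choice = random_walk_rts()
print(np.nanmean(rt), choice.mean())        # mean RT and upper-boundary rate
```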

A study on performance improvement considering the balance between corpus in Neural Machine Translation (인공신경망 기계번역에서 말뭉치 간의 균형성을 고려한 성능 향상 연구)

  • Park, Chanjun;Park, Kinam;Moon, Hyeonseok;Eo, Sugyeong;Lim, Heuiseok
    • Journal of the Korea Convergence Society / v.12 no.5 / pp.23-29 / 2021
  • Recent deep learning-based natural language processing studies improve performance by training on large amounts of data collected from various sources. However, naively combining data from various sources into one training set may prevent performance improvement. In machine translation, corpora deviate in translation type (liberal, literal), style (colloquial, written, formal, etc.), domain, and so on, and combining such corpora into one training set can adversely affect performance. In this paper, we propose a new Corpus Weight Balance (CWB) method that considers the balance between parallel corpora in machine translation. Experiments show that a model trained on the balanced corpus outperforms the existing model. In addition, we propose a corpus construction process that can build a high-quality parallel corpus even from a monolingual corpus while coexisting with the human translation market.
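
The paper's CWB weights are its own contribution; as a generic illustration of balanced sampling across parallel corpora (temperature-based weighting is an assumed stand-in, not the CWB formula), consider:

```python
# Draw training batches from several parallel corpora with balancing weights
# instead of raw concatenation, so small corpora are not drowned out.
import random

corpora = {                       # hypothetical corpora of very different sizes
    "colloquial": [("안녕", "hi")] * 1000,
    "formal": [("안녕하세요", "hello")] * 100000,
    "technical": [("병렬 처리", "parallel processing")] * 5000,
}

# Temperature-based balancing: weight ~ size^(1/T); T > 1 upweights small corpora.
T = 3.0
weights = {name: len(c) ** (1 / T) for name, c in corpora.items()}

def sample_batch(batch_size: int = 8):
    names = random.choices(list(corpora), weights=list(weights.values()), k=batch_size)
    return [random.choice(corpora[n]) for n in names]

print(sample_batch(4))
```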

A Study on GPU-based Iterative ML-EM Reconstruction Algorithm for Emission Computed Tomographic Imaging Systems (방출단층촬영 시스템을 위한 GPU 기반 반복적 기댓값 최대화 재구성 알고리즘 연구)

  • Ha, Woo-Seok;Kim, Soo-Mee;Park, Min-Jae;Lee, Dong-Soo;Lee, Jae-Sung
    • Nuclear Medicine and Molecular Imaging / v.43 no.5 / pp.459-467 / 2009
  • Purpose: Maximum likelihood-expectation maximization (ML-EM) is a statistical reconstruction algorithm derived from a probabilistic model of the emission and detection processes. Although ML-EM has many advantages in accuracy and utility, its use is limited by the computational burden of iterative processing on a CPU (central processing unit). In this study, we developed a parallel computing technique on the GPU (graphics processing unit) for the ML-EM algorithm. Materials and Methods: Using a GeForce 9800 GTX+ graphics card and CUDA (compute unified device architecture), NVIDIA's parallel computing technology, the projection and backprojection in the ML-EM algorithm were parallelized. The time spent per iteration on the projection, on the errors between measured and estimated data, and on the backprojection was measured. Total time included the latency of data transmission between RAM and GPU memory. Results: The total computation times of the CPU- and GPU-based ML-EM with 32 iterations were 3.83 and 0.26 sec, respectively; in this case, the computing speed was improved about 15-fold on the GPU. When the number of iterations was increased to 1024, the CPU- and GPU-based computations took 18 min and 8 sec in total, respectively. The improvement was about 135-fold and was caused by delays in the CPU-based computation after a certain number of iterations. On the other hand, the GPU-based computation showed very little variation in time delay per iteration due to the use of shared memory. Conclusion: GPU-based parallel computation for ML-EM significantly improved computing speed and stability. The developed GPU-based ML-EM algorithm could easily be modified for other imaging geometries.
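
One ML-EM update has a compact matrix form; the NumPy sketch below is a CPU stand-in for the CUDA projection/backprojection kernels, with a toy system matrix rather than a real scanner geometry.

```python
# One ML-EM iteration: forward-project the current image, compare with the
# measurement, backproject the ratio, and scale by the sensitivity image
# (column sums of the system matrix A).
import numpy as np

def ml_em(A: np.ndarray, y: np.ndarray, n_iters: int = 32) -> np.ndarray:
    """A: system matrix (detectors x voxels); y: measured counts."""
    x = np.ones(A.shape[1])                 # uniform initial image
    sens = A.sum(axis=0)                    # sensitivity image
    for _ in range(n_iters):
        proj = A @ x                        # forward projection
        ratio = np.divide(y, proj, out=np.zeros_like(y), where=proj > 0)
        x *= (A.T @ ratio) / np.maximum(sens, 1e-12)   # backprojection + update
    return x

rng = np.random.default_rng(1)
A = rng.random((64, 16))                    # toy system matrix
x_true = rng.random(16)
y = rng.poisson(A @ x_true).astype(float)   # noisy emission data
print(ml_em(A, y, 32).round(2))
```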