• Title/Summary/Keyword: 연산지연 (computation delay)

Efficient Scheduling Schemes for Low-Area Mixed-radix MDC FFT Processor (저면적 Mixed-radix MDC FFT 프로세서를 위한 효율적인 스케줄링 기법)

  • Jang, Jeong Keun;Sunwoo, Myung Hoon
    • Journal of the Institute of Electronics and Information Engineers / v.54 no.7 / pp.29-35 / 2017
  • This paper presents a high-throughput, area-efficient mixed-radix fast Fourier transform (FFT) processor based on efficient scheduling schemes. The proposed FFT processor supports 64-, 128-, 256-, and 512-point FFTs for orthogonal frequency division multiplexing (OFDM) systems and achieves high throughput by using a mixed-radix algorithm and an eight-parallel multipath delay commutator (MDC) architecture. The paper proposes new scheduling schemes that reduce the size of the read-only memories (ROMs) and complex constant multipliers without increasing the number of delay elements or computation cycles, thus further reducing hardware complexity. The proposed mixed-radix MDC FFT processor is designed and implemented in Samsung 65 nm complementary metal-oxide semiconductor (CMOS) technology. The experimental results show that the area of the proposed FFT processor is 0.36 mm$^2$. Furthermore, the proposed processor achieves throughput rates of up to 2.64 GSample/s at 330 MHz.
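
The reported peak throughput is consistent with simple arithmetic on the figures in the abstract. The sketch below assumes (this is not stated in the abstract) that each of the eight parallel paths accepts one sample per clock cycle; the function names are illustrative only, and pipeline latency is ignored.

```python
def mdc_throughput(parallel_paths, clock_hz):
    """Samples per second, assuming one sample per path per clock cycle."""
    return parallel_paths * clock_hz


def input_cycles_per_fft(points, parallel_paths):
    """Clock cycles needed to stream one FFT's inputs into the parallel paths."""
    return points // parallel_paths


rate = mdc_throughput(8, 330e6)
print(f"{rate / 1e9:.2f} GSample/s")

# Input cycles for each supported FFT size under the same assumption.
for n in (64, 128, 256, 512):
    print(n, input_cycles_per_fft(n, 8))
```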

Fast Intermode Decision of Scalable Video Coding using Statistical Hypothesis Testing (스케일러블 비디오 부호화에서 통계적 가설 검증 기법을 이용한 프레임 간 모드 결정)

  • Lee, Bum-Shik;Kim, Mun-Churl;Hahm, Sang-Jin;Lee, Keun-Sik;Park, Keun-Soo
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2006.11a / pp.111-115 / 2006
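  • Scalable Video Coding (SVC) is a new compression standard currently being standardized by the Joint Video Team (JVT) of MPEG (Moving Picture Experts Group) and VCEG (Video Coding Experts Group). It has a layered structure to support temporal, spatial, and quality scalability, and in particular it adopts a hierarchical B-picture structure for temporal scalability. Because the base layer of SVC is compatible with H.264|AVC, motion estimation and mode decision use seven different block sizes: $16{\times}16$, $16{\times}8$, $8{\times}16$, $8{\times}8$, $8{\times}4$, $4{\times}8$, and $4{\times}4$. The hierarchical B-picture structure used in SVC codes every picture in a GOP (Group of Pictures) as a B-picture except the key pictures (I and P pictures), so both the computational load and the encoding delay increase sharply compared with H.264|AVC. This is because B-pictures use the bidirectional motion vectors LIST0 and LIST1 and use multiple reference pictures in both directions. This paper introduces a fast inter-mode decision algorithm, based on statistical hypothesis testing, that is applicable to scalable video coding. The proposed method applies statistical hypothesis testing to $16{\times}16$ macroblocks and $8{\times}8$ sub-macroblocks; it compares pixel values between the current block and the reconstructed reference block so that the RD (rate-distortion) optimization-based mode decision finishes early, which enables fast inter-mode decision. By speeding up inter-mode decision, the proposed method reduces the computation and complexity of the scalable video encoder by up to 57%. The accompanying bit-rate increase and quality degradation are negligibly small: at most a 1.74% bit-rate increase and a 0.08 dB PSNR decrease.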
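
The abstract does not say which statistical test is used, so the following is only a rough sketch of the general idea of early termination by hypothesis testing. It assumes a one-sample t-test on the pixel differences between the current block and the reference block; the function name, the threshold value, and the handling of zero variance are all hypothetical choices, not the authors' method.

```python
import math


def early_terminate(current, reference, t_threshold=2.0):
    """Return True when the mean pixel difference is not significantly
    different from zero, i.e. the larger block size may be accepted
    without testing the remaining modes (hypothetical criterion)."""
    diffs = [c - r for c, r in zip(current, reference)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0:
        # All differences are identical: accept only if they are all zero.
        return mean == 0
    t = mean / math.sqrt(var / n)
    return abs(t) < t_threshold


block = [10, 12, 11, 13] * 4            # flattened 4x4 block of pixel values
same = early_terminate(block, block)    # identical blocks
shifted = early_terminate(block, [p - 10 for p in block])  # constant offset
```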

A Study on the Construction of Parallel Multiplier over $GF(2^m)$ ($GF(2^m)$ 상에서의 병렬 승산기 설계에 관한 연구)

  • Han, Sung-Il
    • Journal of the Korea Society of Computer and Information / v.17 no.3 / pp.1-10 / 2012
  • A low-complexity multiplication over $GF(2^m)$ and a corresponding multiplier circuit are proposed using cyclic-shift coefficients and an irreducible trinomial. The proposed circuit has a parallel input/output architecture and, owing to the properties of the cyclic-shift coefficients and the modular computation with the irreducible trinomial, has lower complexity than other designs. The proposed multiplier is composed of $2m^2$ 2-input AND gates and $m(m+2)$ 2-input XOR gates, without memories or switches. The minimum propagation delay is $T_A+(2+\lceil \log_2 m \rceil)T_X$. The proposed circuit architecture is well suited to VLSI implementation because it is simple, regular, and modular.
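
To make the cost figures concrete, the sketch below evaluates the gate counts and the number of XOR levels from the abstract for a given $m$, and pairs them with a plain shift-and-add software reference for multiplication modulo a trinomial $x^m + x^k + 1$. The reference routine is a generic textbook method, not the proposed parallel circuit; $T_A$ and $T_X$ are presumably the AND-gate and XOR-gate delays, and the caller is assumed to supply an irreducible trinomial.

```python
def multiplier_cost(m):
    """Gate counts and XOR depth from the abstract's formulas."""
    and_gates = 2 * m * m
    xor_gates = m * (m + 2)
    xor_levels = 2 + (m - 1).bit_length()  # (m-1).bit_length() == ceil(log2(m))
    return and_gates, xor_gates, xor_levels


def gf2m_mul(a, b, m, k):
    """Shift-and-add product of a and b in GF(2^m) modulo x^m + x^k + 1.
    Elements are integers whose bits are polynomial coefficients."""
    poly = (1 << m) | (1 << k) | 1
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> m:           # degree reached m: reduce by the trinomial
            a ^= poly
    return result


cost = multiplier_cost(8)               # (128, 80, 5)
product = gf2m_mul(0b0010, 0b1000, 4, 1)  # x * x^3 in GF(2^4), x^4 + x + 1
```

With $m = 8$ the formulas give 128 AND gates, 80 XOR gates, and a delay of $T_A + 5T_X$.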

Comparison of Distance Transforms in Space-leaping for High Speed Fetal Ultrasound Volume Visualization (고속 초음파 태아영상 볼륨 가시화를 위한 공간도약 거리변환 비교)

  • Park, Hye-Jin;Song, Soo-Min;Kim, Myoung-Hee
    • Journal of the Korea Society for Simulation / v.16 no.3 / pp.57-63 / 2007
  • In real-time rendering of a fetus, empty-space leaping during ray traversal is the most frequently used acceleration technique. The main idea is to skip empty voxel samples that do not contribute to the resulting image; this shortens rendering time by avoiding data sampling while a ray traverses an empty region, saving a substantial number of interpolations. Calculating, for every voxel, the distance to the nearest object boundary reduces the number of sampling operations. Among well-known distance maps, those that estimate the true distance, such as the Euclidean distance, take a long time to compute because of complicated floating-point operations, whereas those that use approximate distance functions, such as city-block and chessboard, are faster to compute but may introduce sampling errors. In this paper, therefore, we analyze the characteristics of several distance maps and compare the number of samples and the rendering time. We aim to suggest the most appropriate distance map for rendering a fetus in ultrasound images.
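
The three distance functions named in the abstract differ only in how they combine the per-axis offsets between a voxel and a boundary voxel. The minimal sketch below (function names are illustrative) shows the standard definitions and the ordering chessboard ≤ Euclidean ≤ city-block, which is why the two integer metrics only approximate the true distance.

```python
import math


def euclidean(offset):
    """True straight-line distance; needs a floating-point square root."""
    return math.sqrt(sum(c * c for c in offset))


def city_block(offset):
    """Sum of absolute per-axis offsets (never smaller than Euclidean)."""
    return sum(abs(c) for c in offset)


def chessboard(offset):
    """Largest absolute per-axis offset (never larger than Euclidean)."""
    return max(abs(c) for c in offset)


offset = (3, 4, 0)  # voxel offset to the nearest boundary voxel
distances = (chessboard(offset), euclidean(offset), city_block(offset))
print(distances)
```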

Implementation of a 3D Graphics Hardwired T&L Accelerator based on a SoC Platform for a Mobile System (SoC 플랫폼 기반 모바일용 3차원 그래픽 Hardwired T&L Accelerator 구현)

  • Lee, Kwang-Yeob;Koo, Yong-Seo
    • Journal of the Institute of Electronics Engineers of Korea SD / v.44 no.9 / pp.59-70 / 2007
  • In this paper, we propose an effective T&L (Transform & Lighting) processor architecture for a real-time 3D graphics acceleration SoC (System on a Chip) in a mobile system. We designed floating-point arithmetic IPs for the T&L processor and verified them on a SoC platform. The designed T&L processor uses a 24-bit floating-point data format and a 16-bit fixed-point data format, and its pipeline keeps the Transform and Lighting processes balanced through parallel computation of 3D graphics. The delay of the pipeline when processing only the Transform operation is almost the same as the delay when processing both the Transform and Lighting operations. The T&L processor was implemented and verified on the SoC platform. It operates at 80 MHz on a Xilinx Virtex-4 FPGA, and the measured processing speed is 20M vertices/sec.
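
As a rough software illustration of the two stages that such a pipeline balances, the sketch below applies a 4×4 transform to a homogeneous vertex and evaluates a Lambertian diffuse term. It uses ordinary Python floats rather than the 24-bit floating-point format, and the abstract does not specify the lighting model, so the diffuse term is an assumption. The last line simply divides the reported clock rate by the reported vertex rate.

```python
def transform(matrix, vertex):
    """Multiply a 4x4 row-major matrix by a homogeneous vertex (x, y, z, w)."""
    return tuple(sum(matrix[r][c] * vertex[c] for c in range(4)) for r in range(4))


def diffuse(normal, light_dir):
    """Lambertian diffuse term for unit vectors, clamped at zero."""
    return max(0.0, sum(n * l for n, l in zip(normal, light_dir)))


translate = [[1, 0, 0, 2],
             [0, 1, 0, 3],
             [0, 0, 1, 4],
             [0, 0, 0, 1]]
moved = transform(translate, (1, 1, 1, 1))
lit = diffuse((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
cycles_per_vertex = 80e6 / 20e6  # clock rate divided by vertex rate
```

At 80 MHz and 20M vertices/sec, the figures correspond to four clock cycles per vertex.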

An Efficient Index Buffer Management Scheme for a B+ tree on Flash Memory (플래시 메모리상에 B+트리를 위한 효율적인 색인 버퍼 관리 정책)

  • Lee, Hyun-Seob;Joo, Young-Do;Lee, Dong-Ho
    • The KIPS Transactions:PartD / v.14D no.7 / pp.719-726 / 2007
  • Recently, NAND flash memory has been used as a storage device in various mobile computing devices such as MP3 players, mobile phones, and laptops because of its shock resistance, low power consumption, and non-volatility. However, because the characteristics of flash memory are very distinct, disk-based systems and applications may suffer severe performance degradation when directly adopted on flash memory storage systems. In particular, when a B-tree is constructed, inserting, deleting, and reorganizing records may cause intensive overwrite operations, which can result in severe performance degradation on NAND flash memory. In this paper, we propose an efficient buffer management scheme, called IBSF, which eliminates redundant index units in the index buffer and thereby delays the time at which the index buffer fills up. Consequently, IBSF significantly reduces the number of write operations to flash memory when constructing a B-tree. We also show through various experiments that IBSF yields better performance on flash memory than the related technique BFTL.
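
The abstract does not describe how IBSF detects redundant index units, so the following is not the IBSF algorithm. It is a minimal sketch of the general idea only, assuming that a newer index unit for the same key makes the older one redundant; the class name, the capacity rule, and the flush counter are hypothetical.

```python
class IndexBuffer:
    """Toy index buffer that keeps only the latest index unit per key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.units = {}    # key -> latest (operation, value)
        self.flushes = 0   # stands in for batches of writes to flash

    def add(self, key, op, value=None):
        # A newer unit for the same key replaces the older, redundant one,
        # so the buffer fills more slowly than an append-only buffer.
        self.units[key] = (op, value)
        if len(self.units) >= self.capacity:
            self.flush()

    def flush(self):
        self.units.clear()
        self.flushes += 1


buf = IndexBuffer(capacity=3)
buf.add(1, "insert", "a")
buf.add(1, "update", "b")
buf.add(1, "update", "c")
buf.add(2, "insert", "x")
buf.add(2, "delete")
```

Five index units arrive, but only two slots are occupied, so no flush has been triggered yet; an append-only buffer of the same capacity would already have flushed once.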

Compact CNN Accelerator Chip Design with Optimized MAC And Pooling Layers (MAC과 Pooling Layer을 최적화시킨 소형 CNN 가속기 칩)

  • Son, Hyun-Wook;Lee, Dong-Yeong;Kim, HyungWon
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.9 / pp.1158-1165 / 2021
  • This paper proposes a CNN accelerator in which the pooling-layer operation is optimized and incorporated into the multiplication-and-accumulation (MAC) unit to reduce memory size. To optimize the memory and data-path circuits, quantized 8-bit integer weights are used instead of 32-bit floating-point weights for pre-training on the MNIST data set. To reduce chip area, the proposed CNN model is reduced to a convolutional layer, a $4{\times}4$ max-pooling layer, and two fully connected layers. All operations use a dedicated MAC with approximate adders and multipliers. Performing the convolution and pooling operations simultaneously in the proposed architecture reduces the internal memory size by 94%. The proposed accelerator chip is designed in the TSMC 65 nm GP CMOS process. Its size, 0.8 × 0.9 = 0.72 mm$^2$, is about half that of the chip in our previous paper. The presented CNN accelerator chip achieves 94% accuracy and an inference time of 77 µs per MNIST image.
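
One way to see why merging pooling into the MAC stage saves memory is to compare a reference that stores the whole convolution output with a fused loop that keeps only a running maximum per pooling window. The sketch below is a plain software illustration under assumed conditions (square input, stride-1 valid convolution, non-overlapping pooling); it is not the chip's data path. With a $4{\times}4$ window the fused version stores 1 value instead of 16 per window, a 93.75% reduction of that buffer, which is consistent with, though not confirmed as the origin of, the 94% figure in the abstract.

```python
def conv_at(image, kernel, r, c):
    """One output value of a valid, stride-1 2D convolution (MAC loop)."""
    k = len(kernel)
    return sum(image[r + i][c + j] * kernel[i][j]
               for i in range(k) for j in range(k))


def conv_then_pool(image, kernel, pool):
    """Reference: store the full feature map, then max-pool it."""
    n = len(image) - len(kernel) + 1
    fmap = [[conv_at(image, kernel, r, c) for c in range(n)] for r in range(n)]
    m = n // pool
    return [[max(fmap[pr * pool + i][pc * pool + j]
                 for i in range(pool) for j in range(pool))
             for pc in range(m)] for pr in range(m)]


def fused_conv_pool(image, kernel, pool):
    """Fused: keep only a running maximum for each pooling window."""
    n = len(image) - len(kernel) + 1
    m = n // pool
    pooled = [[None] * m for _ in range(m)]
    for r in range(m * pool):
        for c in range(m * pool):
            value = conv_at(image, kernel, r, c)
            current = pooled[r // pool][c // pool]
            if current is None or value > current:
                pooled[r // pool][c // pool] = value
    return pooled


image = [[(r * 7 + c * 3) % 11 for c in range(10)] for r in range(10)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
reference = conv_then_pool(image, kernel, 4)
fused = fused_conv_pool(image, kernel, 4)
```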

Design and Implementation of NVM-based Concurrent Journaling Scheme (저널링 파일 시스템을 위한 비휘발성 메모리 기반 병행적 저널링 기법의 설계 및 구현)

  • Pak, Suehee;Lee, Eunyoung;Han, Hyuck
    • The Journal of the Korea Contents Association / v.21 no.7 / pp.157-163 / 2021
  • A single write operation in a file system can modify multiple pieces of data, but these changes are not written to disk atomically. Thus, for file system consistency, conventional journaling guarantees crash consistency at the cost of system performance. Using non-volatile memory as the journal space is known to alleviate this performance degradation thanks to the low latency and byte-level accessibility of non-volatile memory. However, none of the journaling techniques that consider non-volatile memory provides scalability. In this paper, first, the journal space on non-volatile memory is divided into multiple regions for scalable journaling, which disperses operations that would otherwise concentrate in one region. Second, a journal-area-specific operator structure is used to accelerate data write operations to storage devices. We apply the proposed technique to JFS and evaluate it on multi-core servers equipped with high-performance storage devices. The evaluation results show that the proposed technique performs better than the existing NVM-based journaling file system technique.
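
The idea of dividing the journal space into regions can be pictured with a small in-memory model. The sketch below is hypothetical: Python lists stand in for regions of non-volatile memory, each region has its own lock, and the region is chosen from a writer ID by a modulo rule that the abstract does not specify.

```python
import threading


class PartitionedJournal:
    """Toy journal split into independently locked regions."""

    def __init__(self, num_regions):
        self.regions = [[] for _ in range(num_regions)]
        self.locks = [threading.Lock() for _ in range(num_regions)]

    def region_for(self, writer_id):
        # Assumed placement rule: spread writers evenly over the regions.
        return writer_id % len(self.regions)

    def append(self, writer_id, record):
        idx = self.region_for(writer_id)
        # Writers mapped to different regions never contend on the same lock.
        with self.locks[idx]:
            self.regions[idx].append(record)
        return idx


journal = PartitionedJournal(num_regions=4)
for writer in range(8):
    journal.append(writer, f"tx-{writer}")
sizes = [len(region) for region in journal.regions]
```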

Analyzing Fine-Grained Resource Utilization for Efficient GPU Workload Allocation (GPU 작업 배치의 효율화를 위한 자원 이용률 상세 분석)

  • Park, Yunjoo;Shin, Donghee;Cho, Kyungwoon;Bahn, Hyokyung
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.19 no.1 / pp.111-116 / 2019
  • Recently, GPUs have expanded their application domains from graphics processing to various kinds of parallel workloads. However, current GPU systems focus on maximizing each workload's parallelism through simplified control rather than considering the diverse characteristics of workloads. This paper classifies the resource usage characteristics of GPU workloads into computing-bound, memory-bound, and dependency-latency-bound, and quantifies the fine-grained bottleneck of each workload for efficient workload allocation. For example, even among computing-bound workloads, we identify the exact bottleneck resource, such as the single function unit, double function unit, or special function unit. Our analysis implies that workloads can be allocated together if their fine-grained bottleneck resources differ, even when they are all computing-bound, which can eventually contribute to efficient workload allocation on GPUs.
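
The allocation rule implied by the analysis can be sketched as a simple pairing procedure. Everything below is hypothetical: the bottleneck labels, the `Workload` type, and the greedy pairing are illustrative choices, and the abstract proposes no specific allocation algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    bottleneck: str  # hypothetical label, e.g. "single_fu", "double_fu", "memory"


def can_colocate(a, b):
    """Two workloads may share a GPU when their bottleneck resources differ."""
    return a.bottleneck != b.bottleneck


def greedy_pairs(workloads):
    """Pair each workload with the first remaining one it can share with."""
    remaining = list(workloads)
    pairs = []
    while remaining:
        first = remaining.pop(0)
        partner = next((w for w in remaining if can_colocate(first, w)), None)
        if partner is not None:
            remaining.remove(partner)
        pairs.append((first, partner))  # partner is None when it runs alone
    return pairs


jobs = [Workload("A", "single_fu"), Workload("B", "single_fu"),
        Workload("C", "double_fu"), Workload("D", "memory")]
pairs = [(a.name, b.name if b else None) for a, b in greedy_pairs(jobs)]
```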

A Multiobjective Genetic Algorithm for Static Scheduling of Real-time Tasks (다목적 유전 알고리즘을 이용한 실시간 태스크의 정적 스케줄링 기법)

  • 오재원;김희천;우치수
    • Journal of KIISE:Software and Applications / v.31 no.3 / pp.293-307 / 2004
  • We consider the problem of scheduling the tasks of a precedence-constrained task graph, where each task has its own execution time and deadline, onto a set of identical processors in a way that simultaneously minimizes the number of processors required and the total tardiness of tasks. Most existing approaches tend to focus on minimizing the total tardiness of tasks. In other methods, solutions to this problem are usually computed by combining the two objectives into a simple criterion to be optimized. In this paper, the minimization is carried out using a multiobjective genetic algorithm (GA) that independently considers both criteria by using a vector-valued cost function. We present various GA components that are well suited to the problem of task scheduling, such as a non-trivial encoding strategy, a domination-based selection operator, and a heuristic crossover operator. We also provide three local improvement heuristics that facilitate the fast convergence of GAs. The experimental results show that, compared with five previously used methods, such as list-scheduling algorithms and a specific genetic algorithm, the performance of our algorithm was comparable or better for 178 out of 180 randomly generated task graphs.
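
A domination-based selection operator rests on Pareto dominance over the vector-valued cost. The sketch below shows the standard dominance test for two minimized objectives, here taken to be (number of processors, total tardiness); the sample cost vectors are made up for illustration, and the paper's actual selection operator may differ in detail.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b in every objective and
    strictly better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(costs):
    """Cost vectors that no other vector in the list dominates."""
    return [c for c in costs if not any(dominates(o, c) for o in costs)]


# Hypothetical schedules as (processors, total tardiness).
costs = [(3, 10), (4, 2), (5, 0), (4, 8), (6, 1)]
front = pareto_front(costs)
print(front)
```

Here (4, 8) is dominated by (4, 2) and (6, 1) by (5, 0), so neither would be preferred by a domination-based selection, while the remaining three schedules are mutually incomparable trade-offs.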