Search | Korea Science

Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline

Oh, Jaeg-Eun;Hwang, Seok-Joong;Nguyen, Huong Giang;Kim, A-Reum;Kim, Seon-Wook;Kim, Chul-Woo;Kim, Jong-Kook
- ETRI Journal / v.30 no.4 / pp.576-586 / 2008
In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32-bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.
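The lockstep idea in the abstract above (one shared fetch/decode stage feeding several private execution back ends) can be illustrated with a toy software model. This is only a sketch under stated assumptions: the two-opcode instruction format, `run_lockstep`, and the register names are hypothetical and are not taken from the paper or from the AE32000C core.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy ISA: each instruction is (opcode, dest, src_a, src_b).
Instruction = Tuple[str, str, str, str]

OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}


def run_lockstep(program: List[Instruction], lanes: List[Dict[str, int]]) -> int:
    """Run one instruction stream over several per-thread register files.

    Each instruction is fetched and decoded once (the shared front end),
    then executed separately in every lane (the duplicated back end).
    Returns the number of fetch/decode operations performed.
    """
    fetches = 0
    for opcode, dest, src_a, src_b in program:  # shared fetch + decode
        fetches += 1
        op = OPS[opcode]
        for regs in lanes:  # per-thread execute and write-back
            regs[dest] = op(regs[src_a], regs[src_b])
    return fetches


# Loop body "y = a * x + b"; each lane runs one iteration on different data.
body: List[Instruction] = [("mul", "t", "a", "x"), ("add", "y", "t", "b")]
lanes = [{"a": 2, "b": 1, "x": x} for x in range(4)]  # 4-way, x = 0..3
fetch_count = run_lockstep(body, lanes)
results = [regs["y"] for regs in lanes]
```

In this model the fetch count depends only on the length of the loop body, not on the number of lanes, which is the resource-sharing effect the abstract describes.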

A Study on the Automatic Design of Content Addressable Memory (연상 메모리의 자동설계에 관한 연구)

김종선;백인천;박노경;차균현
- The Journal of Korean Institute of Communications and Information Sciences / v.15 no.10 / pp.857-867 / 1990
Since the CAM structure is as regular as that of RAM or PLA, a CAM generator program is easy to implement. The program outputs the CAM layout in CIF (Caltech Intermediate Format), and a graphic display program for debugging and displaying the CAM generator output is implemented on a PC/AT in C with the MS C (5.0) graphics library. The CIF parser is written with YACC (Yet Another Compiler Compiler) and LEX (Lexical Analyzer) in order to flatten the CIF data. A program for plotting the layout output on a ROLAND XY plotter is also developed. By combining the programs described above, everything from CIF generation to layout plotting can be executed from a pull-down menu according to the user's choice.

A Study on the design of the compiler for a TTL simulator (TTL 시뮬레이터의 콤파일러 설계에 관한 연구)

신철재;김용득
- Journal of the Korean Institute of Telematics and Electronics / v.14 no.2 / pp.17-27 / 1977
A special-purpose minicomputer was designed around a single-bus system built from integrated circuits, and a method was studied for easily constructing the compiler using 16-bit instructions divided into fields. With a fundamental cycle of 160 nanoseconds, the optimum operating time for a TTL IC was equal to the access time of the main memory unit. As a result, the circuits were very simple, and the simulator functioned well for all the programs.

A Review of Data Management Techniques for Scratchpad Memory (스크래치패드 메모리를 위한 데이터 관리 기법 리뷰)

DOOSAN CHO
- The Journal of the Convergence on Culture Technology / v.9 no.1 / pp.771-776 / 2023
Scratchpad memory is a software-controlled on-chip memory designed and used to mitigate the disadvantages of existing cache memories. Existing cache memories have TAG-related hardware control logic, so users cannot directly control cache misses, and their sizes are large and energy consumption is relatively high. Scratchpad memory has advantages in terms of size and energy consumption because it eliminates such hardware overhead, but there is a burden on software to manage data. In this study, data management techniques of scratchpad memory were classified and examined, and ways to maximize the advantages were discussed.
https://doi.org/10.17703/JCCT.2023.9.1.771

Maximum Stack Memory Usage Estimation Through Target Binary File Analysis in Microcontroller Environment (마이크로컨트롤러 환경에서 타깃 바이너리 파일 분석을 통한 최대 스택 메모리 사용량 예측 기법)

Choi, Kiho;Kim, Seongseop;Park, Daejin;Cho, Jeonghun
- IEMEK Journal of Embedded Systems and Applications / v.12 no.3 / pp.159-167 / 2017
Software safety is a key issue in embedded systems for the automotive and aviation industries. Various software testing approaches have been proposed to achieve software safety, such as those in ISO 26262 Part 6 for the automotive environment. Although stack analysis is one of the classic and basic approaches, stack memory usage is hard to estimate exactly because of the uncertainty of the target code generated by the compiler and because of complex nested interrupts. In this paper, we propose an approach that statically analyzes the maximum stack usage from the target binary code rather than the source code and that also accounts for nested interrupts when determining the exact stack memory size. In our approach, determining the maximum stack usage is divided into three steps: data extraction from the ELF file, construction of the call graph, and consideration of nested interrupt configurations to determine the stack size required by the ISRs (Interrupt Service Routines). Experimental results of the maximum stack usage estimation show that the proposed approach is helpful for optimizing stack memory size and checking the stability of programs in embedded systems, especially those that support nested interrupts.
https://doi.org/10.14372/IEMEK.2017.12.3.159
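The last two steps named in the abstract above (call graph construction and nested-interrupt consideration) can be sketched as a worst-case path search. This is a minimal illustration, not the authors' tool: the frame sizes, function names, and priority table are hypothetical stand-ins for data that would come from the ELF file, and the sketch assumes an acyclic call graph, at most one active ISR per priority level, and no extra hardware-pushed interrupt context.

```python
from functools import lru_cache
from typing import Dict, List

# Hypothetical per-function stack frame sizes in bytes (would come from the ELF).
FRAME_SIZE: Dict[str, int] = {
    "main": 16, "read_sensor": 24, "filter": 40, "log": 32,
    "isr_timer": 20, "isr_uart": 28, "isr_adc": 12,
}
# Hypothetical static call graph: caller -> list of callees (no recursion).
CALLS: Dict[str, List[str]] = {
    "main": ["read_sensor", "log"],
    "read_sensor": ["filter"],
    "isr_uart": ["log"],
}
# ISR entry points grouped by priority level. Assumption: ISRs at the same
# level cannot preempt each other, so only one per level is on the stack.
ISR_BY_PRIORITY: Dict[int, List[str]] = {
    1: ["isr_timer", "isr_adc"],
    2: ["isr_uart"],
}


@lru_cache(maxsize=None)
def max_stack(func: str) -> int:
    """Worst-case stack bytes used by func and its deepest call chain."""
    callees = CALLS.get(func, [])
    deepest_callee = max((max_stack(c) for c in callees), default=0)
    return FRAME_SIZE[func] + deepest_callee


def worst_case_stack(entry: str) -> int:
    """Deepest task path plus the costliest ISR at every priority level."""
    nested = sum(
        max(max_stack(isr) for isr in isrs)
        for isrs in ISR_BY_PRIORITY.values()
    )
    return max_stack(entry) + nested


task_depth = max_stack("main")
total_depth = worst_case_stack("main")
```

With the sample data, the deepest task path is `main -> read_sensor -> filter`, and the nested-interrupt term adds the worst ISR chain from each priority level on top of it.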

Advanced Calendar Queue Scheduler Design Methodology (진보된 캘린더 큐 스케줄러 설계방법론)

Kim, Jin-Sil;Chung, Won-Young;Lee, Jung-Hee;Lee, Yong-Surk
- The Journal of Korean Institute of Communications and Information Sciences / v.34 no.12B / pp.1380-1386 / 2009
In this paper, we propose a CQS (Calendar Queue Scheduler) architecture designed for processing multimedia and time-sensitive traffic in home networks. As home traffic grows and diversifies (VoIP, VOD, IPTV, and best-effort traffic), the need for managing QoS (Quality of Service) is being discussed. Grouping traffic by application or service is an effective way to guarantee QoS under restricted conditions. The proposed design targets the home gateway, which corresponds to the receiver endpoint in end-to-end QoS, and is suitable for supporting multimedia traffic within restricted network resources and for optimizing queue sizes. We then estimated the area of each module and each memory. The area of each module is expressed relative to a 2×1 NAND gate (11.09), synthesized with Magnachip 0.18 CMOS libraries using Synopsys Design Compiler. We verified that memory accounts for 85.38% of the entire CQS. Each memory size was extracted with CACTI 5.3 (in mm²). As the number of memory entries increases, the increment in memory area gradually increases, and the day size chosen for one year clearly affects the total CQS area. We also discuss the design methodology and the operation of each module when implementing the CQS in hardware.
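For readers unfamiliar with the "day" and "year" terms in the abstract above, the following is a minimal software sketch of the generic calendar queue data structure. It is not the hardware design proposed in the paper: the class name, the fixed bucket count, and the sample traffic labels are illustrative assumptions, and the sketch assumes no item is enqueued with a time earlier than the last dequeued one.

```python
import bisect
from typing import List, Optional, Tuple

Event = Tuple[int, str]  # (time, label)


class CalendarQueue:
    """Fixed-size calendar queue: num_days buckets, each day_width time units."""

    def __init__(self, num_days: int, day_width: int) -> None:
        self.num_days = num_days
        self.day_width = day_width
        self.buckets: List[List[Event]] = [[] for _ in range(num_days)]
        self.size = 0
        self.now = 0  # time of the most recently dequeued event

    def enqueue(self, time: int, label: str) -> None:
        day = (time // self.day_width) % self.num_days
        bisect.insort(self.buckets[day], (time, label))  # keep each day sorted
        self.size += 1

    def dequeue(self) -> Optional[Event]:
        if self.size == 0:
            return None
        first_slot = self.now // self.day_width  # absolute index of today
        for offset in range(self.num_days):  # scan one full year
            slot = first_slot + offset
            bucket = self.buckets[slot % self.num_days]
            # The head is due only if it falls before the end of this day.
            if bucket and bucket[0][0] < (slot + 1) * self.day_width:
                return self._pop(bucket)
        # Nothing is due this year: jump directly to the earliest event.
        earliest = min((b for b in self.buckets if b), key=lambda b: b[0])
        return self._pop(earliest)

    def _pop(self, bucket: List[Event]) -> Event:
        event = bucket.pop(0)
        self.size -= 1
        self.now = event[0]
        return event


queue = CalendarQueue(num_days=4, day_width=10)  # one "year" = 40 time units
for time, label in [(25, "voip"), (3, "iptv"), (47, "vod"),
                    (12, "best-effort"), (95, "late")]:
    queue.enqueue(time, label)

order = []
while queue.size:
    order.append(queue.dequeue()[0])
```

The number of days and the day width fix how many buckets must be stored, which is one way to see why the day size chosen for a year influences memory area in a hardware realization.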

32 Bit RISC Core modeling using SystemC

최홍미;박성모
- Proceedings of the IEEK Conference / 2002.06b / pp.325-328 / 2002
In this paper, we present a SystemC model of a 32-bit RISC core which is based on the ARM7TDMI architecture. The RISC core was first modeled in C for architecture verification and then refined down to a level that captures concurrent behavior and hardware timing using the SystemC class library. It was driven at the timed functional level using a handshake protocol and compiled with a standard C++ compiler. The functional simulation result was verified by comparing the memory contents and execution results with those from the ARMulator of ADS (ARM Developer Suite).

The Efficient Execution of Functional Language Loops on the Multithreaded Architectures (다중스레드 구조에서 함수 언어 루프의 효과적 실행)

Ha, Sang-Ho
- The Transactions of the Korea Information Processing Society / v.7 no.3 / pp.962-970 / 2000
Multithreading is attractive in that it can tolerate memory latency and synchronization by effectively overlapping communication with computation. While several compiler techniques have been developed to produce multithreaded code from functional language programs, much work remains to implement loops effectively. Executing loops in a multithreaded style usually incurs overheads that can severely reduce the benefit of multithreading. This paper suggests several architectural and compiler methods that can optimize loop execution under multithreading. We then simulate and analyze them for a matrix multiplication program.

A Study on Effect of Code Distribution and Data Replication for Multicore Computing Architectures

Cho, Doosan
- International Journal of Advanced Culture Technology / v.9 no.4 / pp.282-287 / 2021
A multicore system must be able to take full advantage of the program's instruction and data parallelism. This study introduces the data replication technique as a support technique to maximize the program's instruction and data parallelism. Instruction level parallelism can be limited by data dependency. In this case, if data is replicated to each processor core and used, instruction level parallelism can be used to the maximum. The technique proposed in this study can maximize the performance improvement effect when applied to scientific applications such as matrix multiplication operation.
https://doi.org/10.17703/IJACT.2021.9.4.282

Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

Huangfu, Yijie;Zhang, Wei
- Journal of Computing Science and Engineering / v.11 no.2 / pp.69-77 / 2017
Recent GPUs have adopted cache memory to benefit general-purpose GPU (GPGPU) programs. However, unlike CPU programs, GPGPU programs typically have considerably less temporal/spatial locality. Moreover, the L1 data cache is used by many threads that access a data size typically considerably larger than the L1 cache, making it critical to bypass L1 data cache intelligently to enhance GPU cache performance. In this paper, we examine GPU cache access behavior and propose a simple hardware-based GPU cache bypassing method that can be applied to GPU applications without recompiling programs. Moreover, we introduce a hybrid method that integrates static profiling information and hardware-based bypassing to further enhance performance. Our experimental results reveal that hardware-based cache bypassing can boost performance for most benchmarks, and the hybrid method can achieve performance comparable to state-of-the-art compiler-based bypassing with considerably less profiling cost.
https://doi.org/10.5626/JCSE.2017.11.2.69
