• Title/Summary/Keyword: vCPU

Event Routing Scheme to Improve I/O Latency of SMP VM (SMP 가상 머신의 I/O 지연 시간 감소를 위한 이벤트 라우팅 기법)

  • Shin, Jungsub;Kim, Hagyoung
    • Journal of KIISE / v.42 no.11 / pp.1322-1331 / 2015
  • Under the hypervisor scheduler, a vCPU (virtual CPU) is in one of two states: running or stopped. When a vCPU is stopped, incoming events are delayed until that vCPU returns to the running state. The latency in handling such events sent to the vCPU is regarded as I/O latency. Since an SMP (symmetric multiprocessing) VM (virtual machine) incorporates multiple vCPUs, the event latency on an SMP VM can vary depending on which vCPU receives the event. In this paper, we propose a new scheme, named event routing, that delivers events according to the operating state of each vCPU to reduce event latency on an SMP VM. We implemented the proposed event routing scheme in the Xen ARM hypervisor and confirmed the reduction in I/O latency by measuring network RTT (round trip time) and TCP bandwidth under a variety of test conditions. Compared with native Xen ARM, network RTT decreases by up to 94% and TCP bandwidth increases by up to 35%.
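
The core idea of state-aware event routing can be illustrated with a small sketch. This is not the paper's Xen ARM implementation; the names `Vcpu`, `VcpuState`, and `route_event` are hypothetical, and the policy shown (keep the default target if it is running, otherwise pick any running vCPU, otherwise fall back to the default target) is an assumed simplification of "sending events according to the operation state of each vCPU".

```python
from dataclasses import dataclass
from enum import Enum


class VcpuState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"


@dataclass
class Vcpu:
    vcpu_id: int
    state: VcpuState


def route_event(vcpus: list[Vcpu], default_target: int) -> int:
    """Pick the vCPU that should receive an incoming event (hypothetical policy).

    1. If the default target vCPU is running, deliver to it.
    2. Otherwise deliver to the first running vCPU, so the event is handled
       without waiting for the stopped vCPU to be scheduled again.
    3. If no vCPU is running, keep the default target; the event waits.
    """
    by_id = {v.vcpu_id: v for v in vcpus}
    if by_id[default_target].state is VcpuState.RUNNING:
        return default_target
    for v in vcpus:
        if v.state is VcpuState.RUNNING:
            return v.vcpu_id
    return default_target


if __name__ == "__main__":
    vcpus = [Vcpu(0, VcpuState.STOPPED), Vcpu(1, VcpuState.RUNNING)]
    # vCPU 0 is stopped, so the event is routed to the running vCPU 1.
    print(route_event(vcpus, default_target=0))  # 1
```

A real hypervisor would also have to consider interrupt affinity inside the guest and the cost of re-targeting the event; the sketch only captures the state check that motivates the scheme.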

Design of Electronic Control Unit for Parking Assist System (주차 보조 시스템을 위한 ECU 설계)

  • Choi, Jin-Hyuk;Lee, Seongsoo
    • Journal of IKEEE / v.24 no.4 / pp.1172-1175 / 2020
  • An automotive ECU integrates a CPU core, IVN (in-vehicle network) controllers, a memory interface, sensor interfaces, I/O interfaces, and so on. Current automotive ECUs are often developed with proprietary processor architectures. However, demand for standard processors such as ARM and RISC-V is increasing rapidly for software compatibility in autonomous vehicles and connected cars. In this paper, an automotive ECU for a parking assist system is designed based on RISC-V, an open instruction set architecture. It includes a 32-bit RISC-V CPU core, IVN controllers such as CAN and LIN, memory interfaces such as ROM and SRAM, and I/O interfaces such as SPI, UART, and I2C. Fabricated in 65nm CMOS technology, it has an operating frequency of 50MHz, an area of 0.37㎟, and a gate count of 55,310 gates.

Real-Time Scheduling Method to assign Virtual CPU in the Multicore Mobile Virtualization System (멀티코아 모바일 가상화 시스템에서 가상 CPU 할당 실시간 스케줄링 방법)

  • Kang, Yongho;Keum, Kimoon;Kim, Seongjong;Jin, Kwangyoun;Kim, Jooman
    • Journal of Digital Convergence / v.12 no.3 / pp.227-235 / 2014
  • Mobile virtualization is an approach to mobile device management in which two virtual platforms are installed on a single wireless device. A smartphone, for example, might have one virtual environment for business use and one for personal use. Mobile virtualization might also allow one device to run two different operating systems, so that the same phone can run both RTOS and Android apps. In this paper, we propose techniques to virtualize the cores of a multicore processor, allowing any number of vCPUs exposed to an OS to be reassigned to any subset of the pCPUs. We then propose a real-time scheduling method for assigning vCPUs to pCPUs. The proposed technique solves the problem of increased real-time process execution time when interrupts are handled, and enables faster processing than the previous algorithm.

A Performance Study on CPU-GPU Data Transfers of NVIDIA Tegra and Tesla GPUs (NVIDIA Tegra와 Tesla GPU에서의 CPU-GPU 데이터 전송성능 연구)

  • Kwon, Oh-Kyoung;Gu, Gibeom
    • Proceedings of the Korea Information Processing Society Conference / 2021.11a / pp.39-42 / 2021
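  • Recently, as GPU performance has improved in HPC and artificial intelligence, GPU use has become widespread, but GPU programming remains a major obstacle in terms of difficulty. In particular, because host memory and GPU memory must be managed separately, research on convenience and performance is being actively conducted, and various CPU-GPU memory transfer programming methods have been proposed. This study compares the performance of CPU-GPU data transfer techniques on NVIDIA Tegra devices and an NVIDIA SMX-based V100 GPU card. Because NVIDIA Tegra devices provide integrated CPU-GPU memory, they show performance characteristics that differ from conventional GPU devices with respect to CPU-GPU memory transfer methods. As the experimental workload for the performance comparison, we used a two-dimensional matrix transposition example that is frequently used in HPC applications. Through experiments, we compared, for each GPU device, the difference in GPU kernel performance according to the CPU-GPU memory transfer method, the difference in transfer performance between page-locked and pageable memory, and finally the overall performance.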

A Performance Study on CPU-GPU Data Transfers of Unified Memory Device (통합메모리 장치에서 CPU-GPU 데이터 전송성능 연구)

  • Kwon, Oh-Kyoung;Gu, Gibeom
    • KIPS Transactions on Computer and Communication Systems / v.11 no.5 / pp.133-138 / 2022
  • Recently, as GPU performance has improved in HPC and artificial intelligence, GPU use has become more common, but GPU programming is still a big obstacle in terms of productivity. In particular, due to the difficulty of managing host memory and GPU memory separately, research is being actively conducted on convenience and performance, and various CPU-GPU memory transfer programming methods have been suggested. Meanwhile, many recent SoC (System on a Chip) products, such as the Apple M1 and NVIDIA Tegra, bundle the CPU, GPU, and integrated memory into one large silicon package. In this study, we investigate the performance of CPU-GPU data transfers on such integrated memory devices, which show characteristics different from the conventional environment in which host (CPU) memory and GPU memory are separate. Here, we compare performance by CPU-GPU data transfer method on NVIDIA SoC chips, which are integrated memory devices, and on NVIDIA SMX-based V100 GPU devices. As the experimental workload for the performance comparison, we used a two-dimensional matrix transposition example frequently used in HPC applications. We analyzed the following performance factors: the difference in GPU kernel performance according to the CPU-GPU memory transfer method for each GPU device, the difference in transfer performance between page-locked and pageable memory, the overall performance, and the performance by workload size. Through these experiments, we confirmed that the NVIDIA Xavier can maximize the benefits of integrated memory in the SoC chip by supporting I/O cache coherence.
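
The workload in both studies is a 2D matrix transposition. GPU transpose kernels are commonly written in a tiled (blocked) form so that each block of threads reads a tile along rows and writes it along columns. The following plain-Python sketch shows only that tiled access pattern on the CPU; it does not reproduce the authors' CUDA kernels or any of the transfer methods, and the function name `transpose_tiled` and the default tile size of 32 are assumptions for illustration.

```python
def transpose_tiled(matrix, tile=32):
    """Return the transpose of a rows x cols list-of-lists matrix.

    The loops walk the input in tile x tile blocks, mirroring how a GPU
    transpose kernel assigns one tile per thread block.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    out = [[0] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile):          # tile origin (row)
        for c0 in range(0, cols, tile):      # tile origin (column)
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c][r] = matrix[r][c]
    return out


if __name__ == "__main__":
    m = [[1, 2, 3],
         [4, 5, 6]]
    print(transpose_tiled(m, tile=2))  # [[1, 4], [2, 5], [3, 6]]
```

On a real GPU the tile is staged through shared memory so that both the global-memory reads and writes are coalesced; the CPU sketch keeps only the loop structure.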

A study of Power analysis Attack Mitigation for RISC-V processor (RISC-V 프로세서에 대한 전력 분석 완화 기법 연구)

  • Kibong Kang;Yunheung Paek
    • Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.358-361 / 2024
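  • The RISC-V ISA, developed at UC Berkeley in 2010, is the subject of extensive research and development because, unlike x86 and Arm, it is free and open source. The RISC-V ISA uses a RISC instruction set, and efforts toward commercial use continue in many areas, from server and desktop CPUs to IoT devices. However, we found that side-channel attack countermeasures are implemented only to a limited extent compared with commercial CPUs, and in particular that countermeasures against power analysis, one class of side-channel attack, are lacking. Therefore, in this paper, we analyze power analysis attacks and hardware countermeasures for several architectures, including RISC-V, and describe the countermeasures that should additionally be applied to RISC-V.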
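
Power analysis exploits the fact that a device's power draw depends on the data it processes; a common first-order model is the Hamming weight of an intermediate value. The sketch below illustrates that model and Boolean (XOR) masking, one widely studied countermeasure. The abstract does not say which countermeasures the paper recommends for RISC-V, so this is a generic illustration, and all function names are hypothetical.

```python
import secrets


def hamming_weight(value: int) -> int:
    """Number of set bits: a common first-order model of data-dependent power."""
    return bin(value).count("1")


def mask_byte(secret: int, mask: int) -> int:
    """Boolean (XOR) masking: the device only ever handles secret ^ mask."""
    return secret ^ mask


def unmask_byte(masked: int, mask: int) -> int:
    """Remove the mask to recover the original byte."""
    return masked ^ mask


def mean_masked_leakage(secret: int) -> float:
    """Average Hamming-weight leakage of the masked byte over all 256 masks."""
    return sum(hamming_weight(mask_byte(secret, m)) for m in range(256)) / 256


if __name__ == "__main__":
    key_byte = 0x3C
    mask = secrets.randbelow(256)  # a fresh random mask for each operation
    assert unmask_byte(mask_byte(key_byte, mask), mask) == key_byte
    # Unmasked, the modeled leakage depends on the data: 0 vs 8.
    print(hamming_weight(0x00), hamming_weight(0xFF))
    # With a uniformly random mask, the expected leakage is 4.0 for every byte.
    print(mean_masked_leakage(0x00), mean_masked_leakage(0xFF))
```

Because `secret ^ mask` is uniformly distributed when the mask is uniform, the averaged leakage no longer depends on the secret. First-order masking alone does not stop higher-order attacks, and in hardware, glitches can still leak unmasked values.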

Real-time FCWS implementation using CPU-FPGA architecture (CPU-FPGA 구조를 이용한 실시간 FCWS 구현)

  • Han, Sungwoo;Jeong, Yongjin
    • Journal of IKEEE / v.21 no.4 / pp.358-367 / 2017
  • Advanced Driver Assistance Systems (ADAS), such as the Front Collision Warning System (FCWS), are currently being developed. FCWS requires high processing speed because it must operate in real time while driving. In addition, a low-power system is required to operate in an automotive embedded system. In this paper, FCWS is implemented on a CPU-FPGA architecture in an embedded system to enable real-time processing. For lane detection, Inverse Perspective Mapping (IPM) and sliding window methods are used to achieve fast operation. For vehicle detection, a Convolutional Neural Network (CNN) with a high recognition rate, accelerated by parallel processing on the FPGA, is used. The proposed architecture was verified on an Intel Cyclone V SoC (System on Chip) FPGA, which combines a low-power ARM Cortex-A9 core with on-board FPGA fabric. The FCWS runs at 44 FPS at HD resolution, which is real time, and its energy efficiency is about 3.33 times higher than that of a high-performance PC environment.
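
The sliding-window lane search can be sketched as follows, assuming the frame has already been warped to a bird's-eye view by IPM and binarized (1 = lane pixel). The search starts at the column histogram peak of the lower half of the image and moves window by window toward the top, re-centering on the lane pixels it finds. The function names and the parameters `n_windows`, `margin`, and `min_pixels` are hypothetical, and the paper's actual implementation and tuning are not reproduced.

```python
def histogram_peak(binary, start_row):
    """Column with the most lane pixels from start_row to the bottom of the image."""
    counts = [0] * len(binary[0])
    for row in binary[start_row:]:
        for col, pixel in enumerate(row):
            counts[col] += pixel
    return max(range(len(counts)), key=counts.__getitem__)


def sliding_window_lane(binary, n_windows=4, margin=2, min_pixels=1):
    """Track one lane line upward through a bird's-eye binary image.

    Returns the estimated lane x-position for each window, bottom to top.
    """
    rows, cols = len(binary), len(binary[0])
    x = histogram_peak(binary, rows // 2)  # start from the lower half
    win_h = rows // n_windows
    centers = []
    for w in range(n_windows):
        y_high = rows - w * win_h  # exclusive bottom edge of this window
        y_low = y_high - win_h
        x_lo, x_hi = max(0, x - margin), min(cols, x + margin + 1)
        xs = [c for r in range(y_low, y_high)
              for c in range(x_lo, x_hi) if binary[r][c]]
        if len(xs) >= min_pixels:
            x = round(sum(xs) / len(xs))  # re-center on the detected pixels
        centers.append(x)
    return centers


if __name__ == "__main__":
    # Synthetic 8x10 image: lane at column 3 in the bottom half, column 4 above.
    img = [[1 if c == (4 if r < 4 else 3) else 0 for c in range(10)]
           for r in range(8)]
    print(sliding_window_lane(img))  # [3, 3, 4, 4]
```

Searching only inside a narrow window around the previous center is what makes the method fast: each step touches `win_h * (2 * margin + 1)` pixels instead of the whole row band.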

Cascade CNN with CPU-FPGA Architecture for Real-time Face Detection (실시간 얼굴 검출을 위한 Cascade CNN의 CPU-FPGA 구조 연구)

  • Nam, Kwang-Min;Jeong, Yong-Jin
    • Journal of IKEEE / v.21 no.4 / pp.388-396 / 2017
  • Since face detection involves many variables, such as varied poses, illumination, and occlusion, a high-performance detection system is required. Although CNNs excel at image classification, CNN operations require high-performance hardware resources, whereas small and mobile systems need low-cost, low-power environments. In this paper, therefore, a CPU-FPGA integrated system is designed based on a 3-stage cascade CNN architecture using a small FPGA. An adaptive Region of Interest (ROI) is applied to reduce the number of CNN operations by using face information from the previous frame. A Field Programmable Gate Array (FPGA) is used to accelerate the CNN computations: the accelerator reads multiple feature maps at once on the FPGA and performs Multiply-Accumulate (MAC) operations in parallel for convolution. The system is implemented on an Altera Cyclone V FPGA with an embedded ARM Cortex-A9 and on-chip SRAM. It runs at 30 FPS on HD-resolution input images, and the CPU-FPGA integrated system achieved 8.5 times the power efficiency of a system using only the CPU.
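
The convolution that the FPGA accelerator parallelizes reduces to repeated Multiply-Accumulate (MAC) operations. The sequential sketch below shows a single-channel, "valid" (no padding) convolution written as explicit MACs; like most CNN frameworks it does not flip the kernel (it is technically cross-correlation). The name `conv2d_valid` is hypothetical, and the paper's accelerator design, multi-feature-map reads, and fixed-point details are not reproduced.

```python
def conv2d_valid(image, kernel):
    """Single-channel 2D convolution (no padding, stride 1) as explicit MACs."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0  # accumulator register
            # A hardware accelerator would unroll these two loops and run the
            # kh * kw multiply-accumulates in parallel.
            for u in range(kh):
                for v in range(kw):
                    acc += image[i + u][j + v] * kernel[u][v]  # one MAC
            out[i][j] = acc
    return out


if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    k = [[1, 0],
         [0, 1]]
    print(conv2d_valid(img, k))  # [[6, 8], [12, 14]]
```

Each output pixel costs `kh * kw` MACs, which is why the adaptive ROI in the paper, by shrinking the region that is convolved, directly cuts the number of operations.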

Implementation of an 8-Channel Statistical Multiplexer (8-채널 통계적 다중화기의 구현)

  • 이종락;조동호
    • Journal of the Korean Institute of Telematics and Electronics / v.21 no.5 / pp.79-89 / 1984
  • In this paper, we present the development of a microprocessor-based 8-channel statistical multiplexer (SMUX). The hardware consists of one Z-80A CPU board clocked at 4 MHz, one 16 Kbyte ROM board for program storage, one 16 Kbyte dynamic RAM board, and three I/O boards, all connected through an S-100 compatible tristate bus. The SMUX currently multiplexes 8 channels with data rates ranging from 50 bps to 9600 bps, but it can be reduced to 4 channels with a slight software modification and the removal of one terminal I/O board. The system specifications meet CCITT recommendations X.25 (link level), V.24, V.28, X.3, and X.28. Significant features of the SMUX are its ability to handle 4 input codes (ASCII, EBCDIC, Baudot, Transcode), a dynamic buffer management algorithm, a diagnostic facility, and the efficient use of a single CPU for all system operations. Throughout the paper, we explain in detail how the hardware and software of the SMUX system were designed for efficiency.
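
The difference between statistical multiplexing and fixed-slot time-division multiplexing is that link capacity is spent only on channels that currently have data, so each transmitted unit must carry its channel address. The sketch below illustrates that idea with a round-robin scan over per-channel buffers. The `(channel, byte)` frame format and the functions `stat_mux` and `stat_demux` are hypothetical; the paper's dynamic buffer management algorithm and its X.25 link-level framing are not reproduced.

```python
from collections import deque


def stat_mux(channel_buffers, max_frames):
    """Round-robin over channels, emitting (channel, byte) only for channels with data."""
    queues = [deque(buf) for buf in channel_buffers]
    frames = []
    ch = 0
    while len(frames) < max_frames and any(queues):
        if queues[ch]:
            # Idle channels are skipped, so they consume no link capacity.
            frames.append((ch, queues[ch].popleft()))
        ch = (ch + 1) % len(queues)
    return frames


def stat_demux(frames, n_channels):
    """Rebuild each channel's byte stream from the addressed frames."""
    streams = [bytearray() for _ in range(n_channels)]
    for ch, byte in frames:
        streams[ch].append(byte)
    return [bytes(s) for s in streams]


if __name__ == "__main__":
    buffers = [b"AB", b"", b"C"] + [b""] * 5  # 8 channels, only two active
    frames = stat_mux(buffers, max_frames=64)
    print(frames)  # [(0, 65), (2, 67), (0, 66)]
    assert stat_demux(frames, 8) == buffers
```

With fixed TDM, the same traffic would occupy one slot per channel per cycle, including the six idle ones; here only three frames are sent.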

Development of Real-Time Objects Segmentation for Dual-Camera Synthesis in iOS (iOS 기반 실시간 객체 분리 및 듀얼 카메라 합성 개발)

  • Jang, Yoo-jin;Kim, Ji-yeong;Lee, Ju-hyun;Hwang, Jun
    • Journal of Internet Computing and Services / v.22 no.3 / pp.37-43 / 2021
  • In this paper, we study how objects from the front and back cameras can be recognized in real time in a mobile environment, so that the regions of object pixels can be segmented and synthesized through image processing. To this end, we applied the DeepLabV3 machine learning model to the dual cameras provided by Apple's iOS. We also propose methods that use Apple's Core Image and Core Graphics libraries for image synthesis and postprocessing. Furthermore, we improved CPU usage compared with previous work and compared the throughput and results of the Depth and DeepLabV3 approaches. Finally, we developed a camera application using these two methods.
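
Once a segmentation model such as DeepLabV3 has produced a per-pixel object mask, the dual-camera synthesis amounts to a masked blend of the two frames. The plain-Python sketch below shows that blend on grayscale values in [0, 1]; it is not the Core Image or Core Graphics pipeline used in the paper, and the function name `composite` is hypothetical.

```python
def composite(foreground, background, mask):
    """Per-pixel blend: out = m * fg + (1 - m) * bg, with mask values m in [0, 1]."""
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


if __name__ == "__main__":
    fg = [[1.0, 1.0, 1.0]]    # segmented object, e.g. from the front camera
    bg = [[0.0, 0.0, 0.0]]    # scene, e.g. from the back camera
    mask = [[1.0, 0.5, 0.0]]  # 1 = object, 0 = background, 0.5 = soft edge
    print(composite(fg, bg, mask))  # [[1.0, 0.5, 0.0]]
```

Using a soft (fractional) mask rather than a hard 0/1 mask is what lets edge postprocessing smooth the boundary between the object and the new background.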