
Collaborative Inference for Deep Neural Networks in Edge Environments

  • Meizhao Liu (State Grid Jiangsu Electric Power Co Ltd, Information & Telecommunication Branch) ;
  • Yingcheng Gu (State Grid Jiangsu Electric Power Co Ltd, Information & Telecommunication Branch) ;
  • Sen Dong (State Key Lab. for Novel Software Technology, Nanjing University) ;
  • Liu Wei (State Grid Jiangsu Electric Power Co Ltd, Information & Telecommunication Branch) ;
  • Kai Liu (State Grid Jiangsu Electric Power Co Ltd, Information & Telecommunication Branch) ;
  • Yuting Yan (State Key Lab. for Novel Software Technology, Nanjing University) ;
  • Yu Song (State Grid Jiangsu Electric Power Co Ltd, Information & Telecommunication Branch) ;
  • Huanyu Cheng (State Grid Jiangsu Electric Power Co Ltd, Information & Telecommunication Branch) ;
  • Lei Tang (State Grid Jiangsu Electric Power Co Ltd, Information & Telecommunication Branch) ;
  • Sheng Zhang (State Key Lab. for Novel Software Technology, Nanjing University)
  • Received : 2023.11.21
  • Accepted : 2024.06.12
  • Published : 2024.07.31

Abstract

Recent advances in deep neural networks (DNNs) have greatly improved the accuracy and universality of various intelligent applications, at the expense of increasing model size and computational demand. Since the resources of end devices are often too limited to deploy a complete DNN model, offloading DNN inference tasks to cloud servers is a common approach to bridging this gap. However, due to the limited bandwidth of the WAN and the long distance between end devices and cloud servers, this approach may lead to significant data transmission latency. Therefore, device-edge collaborative inference has emerged as a promising paradigm to accelerate the execution of DNN inference tasks, where DNN models are partitioned to be sequentially executed on both end devices and edge servers. Nevertheless, collaborative inference in heterogeneous edge environments with multiple edge servers, end devices and DNN tasks has been overlooked in previous research. To fill this gap, we investigate the optimization problem of collaborative inference in a heterogeneous system and propose a scheme CIS, i.e., collaborative inference scheme, which jointly combines DNN partition, task offloading and scheduling to reduce the average weighted inference latency. CIS decomposes the problem into three parts to achieve the optimal average weighted inference latency. In addition, we build a prototype that implements CIS and conduct extensive experiments to demonstrate the scheme's effectiveness and efficiency. Experiments show that CIS reduces the average weighted inference latency by 29% to 71% compared to four existing schemes.

Keywords

1. Introduction

In recent times, deep neural networks (DNNs), positioned as a cornerstone technology for Artificial Intelligence (AI) and Machine Learning (ML) [1], have achieved remarkable development. This technology has been widely applied in various fields including Computer Vision [2], Natural Language Processing [3] and Speech Recognition [4].

Nevertheless, with the improvement of universality and accuracy, the scale of DNN models is also growing, which means more memory and computational resources are required. For instance, executing inference on a 224x224 image with VGG16 entails processing over 138 million parameters through more than 15 billion operations. If executed on a Nexus 5 smartphone, the task would take approximately 16 seconds [5], which is obviously intolerable for real-time tasks. Consequently, to meet the memory and computation requirements, DNN inference tasks are typically offloaded to cloud servers with extensive computational resources. However, this traditional cloud computing paradigm encounters several challenges. First, it struggles to meet the real-time requirements of some Internet of Things (IoT) applications when the network condition is poor. Second, massive data may impose a considerable burden on network communication and cloud server processing. Moreover, concerns regarding privacy leaks due to data transmission to the cloud cannot be ignored [6]. To address these problems, device-edge collaborative inference has emerged as a promising paradigm to promote edge intelligence.

Model partition is an important technology in collaborative inference. Motivated by the significant reduction in data size of some intermediate layers compared to that of the input layer, a DNN model is partitioned so that the inference task can be sequentially executed on the end device and the edge server. A proper partition can make full use of the computational resources of servers within limited communication overhead [7]. Most prior research on collaborative inference has been limited to the simple scenario involving a single task and a single server. However, realistic scenarios often encompass multiple edge servers (ESs) and multiple end devices (EDs) with distinct DNN tasks. Meanwhile, different end devices and edge servers, encompassing smartphones, base stations, and gateways, may exhibit different computational capacities, forming heterogeneous edge environments in which collaborative inference needs to be considered.

This paper studies the DNN partition, task offloading and scheduling problem in heterogeneous collaborative inference systems, which aims to minimize the average weighted inference latency of DNN tasks. In this problem, each task can be partitioned at different layers according to the computational capacities of devices and network conditions, so we must determine the layers at which the DNN task is partitioned, i.e., the partition strategy. Prior explorations have been predominantly limited to DNNs with chain topology. However, many advanced DNN models adopt a DAG topology, e.g., GoogLeNet [8] and ResNet [9], which brings new challenges to collaborative inference. Besides, each task can be offloaded to one of the servers in the system, so we must determine the server to which the task is offloaded, i.e., the offloading strategy. What's more, more than one task may be offloaded to the same server, so we must determine the order in which the tasks are executed, i.e., the scheduling strategy. The FCFS (First-Come-First-Serve) policy is commonly adopted by previous works [11]. However, in real-world scenarios, different tasks often have different priorities. For example, in a smart home system, the priority of tasks responsible for the security system needs to be higher than that of other tasks (such as audio control). When multiple tasks are offloaded to the same server, the scheduling strategy has an undeniable impact on their weighted inference latency. Hence, the FCFS policy can hardly adapt to such priority-aware scenarios.

To fill these gaps, this paper deeply studies the collaboration of EDs and ESs in a heterogeneous scenario. We formulate this problem as an ILP (integer linear programming) problem and denote it as POSP, short for task Partition, Offloading and Scheduling Problem. Then a heuristic scheme CIS, i.e., collaborative inference scheme, is proposed for POSP. The main contributions of this paper are summarized as follows:

1) This paper puts forward the collaborative inference in a heterogeneous scenario. The stated problem seeks to minimize the average weighted inference latency by optimizing partition strategy, scheduling strategy and offloading strategy.

2) This paper builds a system model for heterogeneous collaborative inference, and proposes a scheme CIS to minimize average weighted inference latency based on this model. CIS decouples the optimization problem into three subproblems: DNN partition, task offloading and task scheduling.

3) Based on DADS [7], a widely used scheme for DNN partition, algorithm MCP is proposed to handle the partition of different DNN models regardless of their topology. Then we design the SWRTF policy for the task scheduling problem since tasks have different priorities. At last, CIS utilizes branch and bound to obtain the final strategies, traversing the feasible solutions in a breadth-first manner with proper pruning.

4) Extensive experiments are conducted to verify the performance of this scheme. The comprehensive and in-depth analysis of the results demonstrates that our scheme can greatly reduce the inference latency compared with current approaches.

2. Related Work

Collaborative inference is a significant research direction in edge intelligence, which means end devices complete the DNN inference tasks with the assistance of edge servers or cloud servers.

Kang et al. [16] initially proposed layer-wise partition of DNN models as an approach to enable collaborative inference. However, their approach is limited to linearly-structured DNNs and proved ineffective for more general Directed Acyclic Graph (DAG) structured DNNs. Given that many DNNs exhibit DAG structures, Hu et al. [7] modeled the partition of these DNNs as a min-cut problem and provided a method for computing optimal partition points using max-flow solutions. On this basis, they introduced a system named DADS (Dynamic Adaptive DNN Splitting) that can handle model partition in dynamic network environments. Zhang et al. [17] noted that the min-cut-based partition method has a high time complexity, making it challenging to adapt to scenarios with rapidly changing network conditions. Consequently, they simplified the problem and introduced a two-stage system called QDMP for finding the optimal partitioning point. Wang et al. [18] proposed a hierarchical scheduling optimization strategy called DeepInference-L. By executing computations and data transfers between layers in a pipelined manner, they further reduced the overall latency of collaborative inference. Furthermore, Duan et al. [19] considered the scenarios where multiple DNN inference tasks run on a single mobile device. They employed convex optimization techniques to comprehensively address multi-task partition and scheduling strategies. However, these studies only consider the scenarios of a single device and a single server, which is not applicable to general edge computing scenarios.

Gao et al. [10] designed a dynamic evaluation strategy under a time slot model, dividing a DNN inference task into multiple subtasks and dynamically determining its offloading strategy. Tang et al. [11] proposed an iterative alternating optimization (IAO) algorithm to solve the problem of task partition in a multi-user scenario. Mohammed et al. [12] proposed that, in the context of fog computing, a DNN model can be divided into multiple parts, each of which can be executed at fog nodes or locally. Combined with matching theory, an adaptive dynamic task partition and scheduling system DINA was proposed, which can greatly reduce the inference latency. Although the aforementioned studies take multiple devices into consideration, they ignore the fact that offloading all tasks to a single edge server would lead to excessive load on that server and underutilization of resources on other servers.

To address this problem, Yang et al. [13] introduced an edge-device collaborative inference system called CoopAI, which employs a novel partition algorithm to offload a DNN inference task onto multiple edge servers. By analyzing the characteristics of DNN inference, it permits servers to pre-fetch necessary data, reducing the cost of data exchange and consequently reducing inference latency. Liao et al. [14] delved into the DNN partitioning and task offloading challenges in heterogeneous edge computing scenarios. They conducted an analysis of the task offloading issue involving multiple terminal devices and multiple edge servers. Employing an optimal matching algorithm, they proposed an algorithm that comprehensively addresses both partitioning and offloading concerns, thereby reducing overall system inference latency and energy consumption. Shi et al. [15] presented an offline partitioning and scheduling algorithm, GSPI, for enhancing the speed of DNN inference tasks in a multi-user multi-server setting. However, it's important to note that they exclusively considered scenarios where all users execute the same DNN inference task.

This paper focuses on the collaborative DNN inference problem in the scenario with multiple end devices and multiple edge servers. Given the varying computational capacities of end devices and their distinct upload bandwidths to different edge servers, the manner in which DNN models are partitioned and the selection of servers to which tasks are offloaded significantly impact the inference latency. By comprehensively considering multiple factors, we propose the collaborative inference scheme CIS for heterogeneous edge computing environments.

3. System Model and Problem Formulation

We first introduce the heterogeneous collaborative inference system mentioned above in Section 3.1. Then we formalize our problem in Section 3.2 with the objective of minimizing the average weighted inference latency and with decision variables comprising the partition strategy 𝑃, offloading strategy 𝑋 and scheduling strategy Φ.

3.1 Heterogeneous Collaborative Inference System

An edge computing system comprises a set of end devices and a set of resource-constrained edge servers. As shown in Fig. 1, each ED is equipped with a pretrained DNN model and executes the DNN inference task of this model. To accelerate the execution of DNN inference tasks, each DNN model can be partitioned at layer level and then offloaded to one of the ESs. The EDs and ESs are connected in a LAN, where each ES is accessible to each ED.


Fig. 1. Collaborative inference system in heterogeneous edge computing scenarios

We denote the set of 𝑛 end devices as 𝐷 = {𝑑1, 𝑑2, ⋯, 𝑑𝑛}. For convenience, we use task 𝑗 to denote the task on 𝑑𝑗. Each end device 𝑑𝑗 is associated with two parameters: 𝑤𝑗 and Cap𝑑(𝑗). Here 𝑤𝑗 represents the priority of task 𝑗, and a task with greater 𝑤𝑗 has a higher priority. Cap𝑑(𝑗) represents the computational capacity of ED 𝑑𝑗 measured in FLOPS. It is worth noting that in the heterogeneous system, these parameters can vary across EDs. In this paper, we assume each task can be partitioned at most once, which means each task can be offloaded to at most one server.

We denote the set of 𝑚 edge servers as 𝑆 = {𝑠1, 𝑠2, ⋯, 𝑠𝑚}. Let Cap𝑠(𝑖) denote the computational capacity of 𝑠𝑖, measured in FLOPS. Let 𝑏𝑖j denote the bandwidth between 𝑠𝑖 and 𝑑𝑗. In this paper, we assume that each 𝑏𝑖j is given and constant. ESs pre-load the DNN models of the tasks offloaded to them. After the intermediate data is sent to the server, the task is added to a waiting list to wait for scheduling. Once a server is idle, it selects a task from the waiting list to execute, and the execution cannot be interrupted until the task is finished. Table 1 lists the main symbols used in this article.

Table 1. Main notations


3.2 System Model

3.2.1 DNN Layer-level Computation and Output data Model

DNN models are usually composed of a series of layers, such as convolutional layers, excitation layers, activation layers, pooling layers and fully connected layers. To compute the inference latency of a DNN, we must analyze the computational cost and output data size of each layer of the DNN model.

Layer-level Computational Cost: We measure the computational cost of each layer in FLoating point OPerations (FLOPs), which represent the number of basic mathematical operations (such as addition, subtraction, multiplication, etc.) to be performed. Similar methods have been used in [14]. Let com(𝑣) denote the computational cost of layer 𝑣. Since some layers, like the activation layer, have a very small computational cost, we only consider the main DNN layers whose computational cost has an impact on the inference latency of the model, as follows:

• Convolutional Layer: The convolutional layer is one of the most basic layers in DNNs. It performs convolution operations on the input data through a set of convolution kernels to extract local features at different locations. The computational cost of a convolutional layer depends on the size of the input feature map and the size and number of the convolution kernels. For a convolutional layer 𝑣, assuming the size of the input feature map is 𝑤in × ℎin, the size of the convolution kernel is 𝑤𝑘 × ℎ𝑘, the number of channels of the input feature map is 𝐶in, the number of channels of the output feature map is 𝐶out, and the stride is 𝑤𝑠 × ℎ𝑠, its computational cost is:

\(\begin{align}\operatorname{com}(v)=\left(\frac{w_{\text {in }}-w_{k}}{w_{s}}+1\right) *\left(\frac{h_{\text {in }}-h_{k}}{h_{s}}+1\right) * C_{\text {in }} * C_{\text {out }} * w_{k} * h_{k} * 2\end{align}\),        (1)

where \(\begin{align}\left(\frac{w_{in}-w_{k}}{w_{s}}+1\right) *\left(\frac{h_{in}-h_{k}}{h_{s}}+1\right)\end{align}\) represents the number of output positions of each output channel; for each output position, 𝐶in ∗ 𝑤𝑘 ∗ ℎ𝑘 multiplications and approximately the same number of additions are required, which accounts for the factor of 2.

• Fully Connected Layer: Fully connected layer is also one of the most basic layers in deep neural networks. By connecting each input neuron to an output neuron and giving each connection a weight, the features extracted from the previous layers are combined and integrated to generate the final output. For fully connected layer 𝑣, assuming that the dimension of the input feature vector is 𝑑𝑖n and the dimension of the output feature vector is 𝑑𝑜ut, then its computational cost is:

\(\begin{align}\operatorname{com}(v)=\left(d_{in}+\left(d_{in}-1\right)\right) * d_{out}\end{align}\),       (2)

where 𝑑in denotes the number of multiplicative operations per output neuron, and (𝑑in − 1) denotes the number of additive operations per output neuron.

Output Data Size: We use data(𝑣) to denote the output data size of layer 𝑣. For layer 𝑣, assuming its output feature map has 𝐶out channels of size 𝑤out × ℎout, the output data size of layer 𝑣 is:

\(\begin{align}\operatorname{data}(v)=C_{out} * w_{out} * h_{out}\end{align}\).       (3)
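To make the cost model concrete, the following Python snippet is a minimal sketch of Eqs. (1)-(3); the layer shapes in the usage example are illustrative values, not measurements from the paper.

```python
def conv_flops(w_in, h_in, c_in, c_out, w_k, h_k, w_s, h_s):
    """Eq. (1): FLOPs of a convolutional layer (multiplications + additions)."""
    out_w = (w_in - w_k) // w_s + 1
    out_h = (h_in - h_k) // h_s + 1
    return out_w * out_h * c_in * c_out * w_k * h_k * 2

def fc_flops(d_in, d_out):
    """Eq. (2): FLOPs of a fully connected layer."""
    return (d_in + (d_in - 1)) * d_out

def output_data_size(c_out, w_out, h_out):
    """Eq. (3): number of output activations of a layer."""
    return c_out * w_out * h_out

if __name__ == "__main__":
    # A 3x3 convolution with 64 output channels on a 3x224x224 input
    # (stride 1, padding ignored, as in Eq. (1)).
    print(conv_flops(224, 224, 3, 64, 3, 3, 1, 1))   # ~1.7e8 FLOPs
    print(fc_flops(4096, 1000))                      # a VGG-style classifier layer
    print(output_data_size(64, 222, 222))            # elements to transmit
```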

If the tensor size of the input image is (3 × 224 × 224), the computation and output data size of each layer of the MobileNet_V2 model are shown in Fig. 2.


Fig. 2. Computation and output data size of each layer of MobileNetV2 model.

3.2.2 DNN Partition Model

The inference process of a DNN is actually a process of forward propagation: starting from the input layer and gradually moving forward, each layer conducts a series of calculations on its own input and sends the results to its subsequent layers as their input. Thus, given a DNN model 𝑀, we can represent 𝑀 as a DAG (directed acyclic graph) 𝐺 = <𝑉, 𝐸>, where 𝑣𝑖 ∈ 𝑉 corresponds to one layer and the directed edge 𝑒𝑖j ∈ 𝐸 represents the dependency between 𝑣𝑖 and 𝑣𝑗. It should be emphasized that each vertex may have multiple edges starting from it and multiple edges ending at it. For example, Fig. 3(a) shows a piece of GoogLeNet [8], which can be modeled as a DAG as shown in Fig. 3(b).


Fig. 3. A piece of GoogLeNet (a) and the DAG corresponding to it (b)

In the context of DNNs, it should be noted that the computational cost and output size of each layer are different and independent of each other, which provides an opportunity for DNN partition. DNN partition divides a DNN model into two parts so that they can be executed on different devices. Formally, we define 𝑝𝑗 = <𝑉𝑗𝑙, 𝑉𝑗𝑟> as a partition of task 𝑗 which partitions its vertex set 𝑉𝑗 into two disjoint subsets 𝑉𝑗𝑙 and 𝑉𝑗𝑟. The layers corresponding to the vertices in 𝑉𝑗𝑙 are executed on an ED, and the layers corresponding to the vertices in 𝑉𝑗𝑟 are executed on an ES. Fig. 3(b) shows a partition of the piece of GoogLeNet mentioned above.

Thus, for task 𝑗 and one of its partitions 𝑝𝑗, the local computational cost is:

\(\begin{align}L_{j}\left(p_{j}\right)=\sum_{v \in V_{j}^{l}} \operatorname{com}(v)\end{align}\),       (4)

the transmission data size is:

\(\begin{align}C_{j}\left(p_{j}\right)=\sum_{v \in V_{j}^{c}} \operatorname{data}(v)\end{align}\),       (5)

and the remote computational cost is:

\(\begin{align}R_{j}\left(p_{j}\right)=\sum_{v \in V_{j}^{r}} \operatorname{com}(v)\end{align}\),       (6)

where 𝑉𝑗𝑙 and 𝑉𝑗𝑟 denote the set of layers executed on the ED and the set of layers executed on the ES, respectively. 𝑉𝑗𝑐 represents the set of layers that need to send their output to the ES, i.e., each layer in 𝑉𝑗𝑐 belongs to 𝑉𝑗𝑙 and has a successor layer in 𝑉𝑗𝑟.
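The following sketch shows how Eqs. (4)-(6) can be evaluated for a given partition of a DAG; the toy graph and its com/data values are assumed purely for illustration.

```python
def partition_costs(succ, com, data, local_set):
    """Return (L_j, C_j, R_j) for a partition <V_l, V_r> of a task's DAG.

    succ: dict mapping each layer to the list of its successor layers.
    local_set: the set V_l of layers executed on the end device; all other
    layers form V_r and run on the edge server.
    """
    remote_set = set(succ) - set(local_set)
    L = sum(com[v] for v in local_set)                        # Eq. (4)
    R = sum(com[v] for v in remote_set)                       # Eq. (6)
    # V_c: local layers with at least one successor on the server (Eq. (5));
    # each layer's output is counted once even if several successors need it.
    cut_layers = [v for v in local_set
                  if any(u in remote_set for u in succ[v])]
    C = sum(data[v] for v in cut_layers)
    return L, C, R

if __name__ == "__main__":
    # Toy branching DAG: a -> b, a -> c, b -> d, c -> d.
    succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    com = {"a": 4e6, "b": 8e6, "c": 6e6, "d": 2e6}     # FLOPs per layer
    data = {"a": 3e5, "b": 1e5, "c": 1e5, "d": 1e4}    # output elements per layer
    print(partition_costs(succ, com, data, local_set={"a", "b"}))
```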

3.2.3 Task Scheduling Model

In reality, there are often fewer edge servers than end devices, so it is common for multiple tasks to be offloaded to the same server. In our system, a server can only execute one task at a time, so tasks need to wait for scheduling before execution. As a result, the scheduling policy, i.e., the execution order of tasks, has a great impact on the average weighted inference latency. Previous related works schedule tasks in a first-come-first-serve (FCFS) manner [11], but this cannot solve our problem well because tasks have different priorities. For example, suppose there are three tasks offloaded to the same server. We define the arrival time of a task as the time it takes before the intermediate data is sent to the server, including the local computing time and the data transmission time. Also, we define the server computing time as the time of task execution on the server. Then the three tasks' (arrival time, server computing time, priority) tuples are (5,5,3), (7,2,2) and (3,6,1), respectively. Fig. 4 shows the results of two scheduling strategies, where the left one follows FCFS and the right one follows a different order. Their average weighted inference latencies are 83/3 and 24 respectively, which shows that different scheduling strategies have an important impact on inference latency. To formally represent the scheduling strategy, let 𝜙(𝑠𝑖) denote the task sequence on 𝑠𝑖; then the scheduling strategy of the system can be represented as Φ = {𝜙(𝑠1), 𝜙(𝑠2), ⋯, 𝜙(𝑠𝑚)}. The kth scheduled task on 𝑠𝑖 can be represented as 𝜙𝑘(𝑠𝑖), where 1 ≤ 𝑘 ≤ |𝜙(𝑠𝑖)|.


Fig. 4. Two different scheduling strategies
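The arithmetic of this example can be replayed with the short script below; the two task orderings are our reading of the left and right schedules in Fig. 4.

```python
# Each task is (arrival time, server computing time, priority); all tasks are
# released at time 0, so a task's finish time equals its inference latency.

def avg_weighted_latency(tasks, order):
    """Run tasks on one server in the given order and return the average
    weighted inference latency (priority * finish time, averaged)."""
    server_free, total = 0.0, 0.0
    for idx in order:
        arrival, server_time, weight = tasks[idx]
        start = max(server_free, arrival)       # may wait for the busy server
        server_free = start + server_time
        total += weight * server_free           # weighted finish time
    return total / len(tasks)

tasks = [(5, 5, 3), (7, 2, 2), (3, 6, 1)]
print(avg_weighted_latency(tasks, order=[2, 0, 1]))   # FCFS order: 83/3 ~ 27.67
print(avg_weighted_latency(tasks, order=[0, 1, 2]))   # alternative order: 24.0
```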

3.2.4 Task Offloading and Inference Latency Model

In the heterogeneous system, due to the disparity in computational capacity and bandwidth between ESs, it is obvious that the task offloading strategy affects the inference latency. Formally, we use a set of binary variables 𝑋 = {𝑥11, 𝑥12, ⋯, 𝑥1m, ⋯, 𝑥𝑛1, 𝑥𝑛2, ⋯, 𝑥𝑛m} to denote the offloading strategy. Specifically, 𝑥𝑖j = 1 if and only if task 𝑗 is offloaded to ES 𝑠𝑖; otherwise, 𝑥𝑖j = 0. Since each task can be offloaded to at most one ES, we have \(\sum_{i=1}^{m} x_{ij} \leq 1, \forall 1 \leq j \leq n\).

Once a task is partitioned and decided to be offloaded to an ES, the execution of this task can be divided into four stages: local computing, data transmission, waiting for scheduling and remote computing. Therefore, if task 𝑗 is partitioned by 𝑝𝑗 and offloaded to ES 𝑠𝑖, its inference latency is:

\(\begin{align}T_{ij}=t_{ij}^{\text{local}}+t_{ij}^{\text{trans}}+t_{ij}^{\text{wait}}+t_{ij}^{\text{remote}}\end{align}\),       (7)

where \(t_{ij}^{\text{local}}\) denotes the latency of local computing:

\(\begin{align}t_{ij}^{\text{local}}=\frac{L_{j}(p_{j})}{\operatorname{Cap}_{d}(j)}\end{align}\),       (8)

\(t_{ij}^{\text{trans}}\) denotes the latency of data transmission:

\(\begin{align}t_{i j}^{\text {trans }}=\frac{C_{j}\left(p_{j}\right)}{b_{i j}}\end{align}\),       (9)

\(t_{ij}^{\text{wait}}\) denotes the latency of waiting for scheduling:

\(\begin{align}t_{ij}^{\text{wait}}=\max\left(t_{ij'}^{\text{local}}+t_{ij'}^{\text{trans}}+t_{ij'}^{\text{wait}}+t_{ij'}^{\text{remote}},\; t_{ij}^{\text{local}}+t_{ij}^{\text{trans}}\right)-\left(t_{ij}^{\text{local}}+t_{ij}^{\text{trans}}\right)\end{align}\),       (10)

where 𝑗′ denotes the task scheduled immediately before 𝑗 on 𝑠𝑖: if 𝑗 is 𝜙𝑘(𝑠𝑖), then 𝑗′ is 𝜙𝑘−1(𝑠𝑖). That is, task 𝑗 waits from its arrival until the previously scheduled task 𝑗′ finishes, or does not wait at all if 𝑗′ has already finished. \(t_{ij}^{\text{remote}}\) denotes the latency of remote computing:

\(\begin{align}t_{ij}^{\text{remote}}=\frac{R_{j}(p_{j})}{\operatorname{Cap}_{s}(i)}\end{align}\).       (11)

Assuming that all tasks start at the same time, the average weighted inference latency of the entire system can then be computed; it is formulated as the objective function 𝑇 in Eq. (13) below.

3.2.5 Problem Formulation

The QoE [23] of applications based on DNN models improves with the reduction of inference latency. Since different tasks have different priorities, we try to minimize the average weighted inference latency 𝑇 of the system by collaborative inference. Considering the computational capacity and network bandwidth of the system, an inference latency reduction problem is formulated in this subsection. We refer to this optimization problem with partition strategy 𝑃, offloading strategy 𝑋 and scheduling strategy Φ as POSP, i.e., the task Partition, Offloading and Scheduling Problem, and define it as follows:

\(\begin{align}\begin{aligned} \text { POSP: } & \min _{P, \Phi, X} T \\ \text { s.t. } & \text { C1: } V_{j}^{l} \cup V_{j}^{r}=V_{j}, V_{j}^{l} \cap V_{j}^{r}=\emptyset, \forall 1 \leq j \leq n \\ & \text { C2: } \sum_{i=1}^{m} x_{i j} \leq 1, \forall 1 \leq j \leq n \\ & \text { C3: } x_{i j}=1 \text { or } 0, \forall 1 \leq j \leq n, \forall 1 \leq i \leq m \\ & \text { C4: } \phi\left(s_{i}\right)=\left\{j \mid x_{i j}=1\right\} .\end{aligned}\end{align}\).       (12)

The optimization objective function 𝑇 is the average weighted inference latency of the entire system, which can be formulated as:

\(\begin{align}T=\frac{1}{n} \sum_{i=1}^{m} \sum_{j=1}^{n} x_{i j} w_{j} T_{i j}\end{align}\).       (13)

Constraint 𝐶1 guarantees the validity of the partition strategy of each task. Constraints 𝐶2 and 𝐶3 ensure that each task is offloaded to at most one ES, or even completed entirely locally. Constraint 𝐶4 guarantees that the task sequence on 𝑠𝑖 is consistent with the offloading strategy 𝑋. Due to the existence of multiple variables and the high coupling between them, problem POSP is too complex to be solved directly. Thus, we need to further analyze and decompose problem POSP to obtain the optimal strategies.

4. Algorithm Design

We have modeled our problem as an optimization problem POSP and pointed out that it is complex owing to the existence of multiple decision variables. In this section, we first reveal several structural properties of our problem and then decompose the problem into three parts. After that, we propose an algorithm MCP for DNN partition, a scheduling policy SWRTF for task scheduling and an algorithm BBO for task offloading to solve POSP step by step.

4.1 Problem Decomposition

Since POSP has multiple sets of variables that are coupled with each other, we need to decouple them to decompose the complex problem. The main idea of our scheme is to give the corresponding partition strategy 𝑃 and scheduling strategy Φ for each offloading strategy 𝑋, thus binding 𝑃 and Φ to 𝑋, which means that as long as 𝑋 is determined, 𝑃 and Φ can be determined accordingly. Then we just need to solve the new problem 𝒫 which only has 𝑋 as its decision variables. Hence, our scheme can be decomposed into three steps: 1) give the partition strategy 𝑃 for each offloading strategy 𝑋, 2) give the scheduling strategy Φ for each offloading strategy 𝑋, 3) generate the new problem 𝒫 which only has 𝑋 as its decision variables and solve it to give the optimal offloading strategy 𝑋. Since steps 1) and 2) have bound 𝑃 and Φ to 𝑋, with the result of step 3) we can give the final strategies 𝑃, Φ and 𝑋 for problem POSP.


Fig. 5. CIS flow

4.1.1 DNN Partition

Most DNNs have many different partitions, which makes the problem more difficult to solve. However, some partitions lead to excessive inference latency and are thus almost impossible to be part of the optimal solution. Therefore, we can reduce the solution space by selecting an optimal partition 𝑝𝑖j for each DNN task 𝑗 when trying to offload it to ES 𝑠𝑖. Based on the method proposed in [7], we design an algorithm MCP, i.e., min-cut based partition, which first constructs a latency graph 𝐺′ for each task 𝑗 and ES 𝑠𝑖 based on the DAG 𝐺 of the DNN model of task 𝑗, converting the optimal partition problem into the minimum weighted s–t cut problem of 𝐺′, and then obtains the optimal partition 𝑝𝑖j using a min-cut algorithm.

For task 𝑗 and ES 𝑠𝑖, let 𝐺 = <𝑉, 𝐸> denote the corresponding DAG of the DNN model of task 𝑗. We can construct a weighted DAG 𝐺′, i.e., its latency graph, as follows:

1) Add a source node 𝑠 and a sink node 𝑡 to 𝐺′.

2) Add the remote computing edges 𝐸remote: For each node 𝑣 ∈ 𝑉, add an edge from 𝑠 to 𝑣, whose weight is com(𝑣)/Cap𝑠(𝑖), i.e., the time for executing this layer on 𝑠𝑖.

3) Add the local computing edges 𝐸local: For each node 𝑣 ∈ 𝑉, add an edge from 𝑣 to 𝑡, whose weight is com(𝑣)/Cap𝑑(𝑗), i.e., the time for executing this layer on 𝑑𝑗.

4) Add the data transmission edges 𝐸trans: It is worth noting that the output data of a layer only needs to be transmitted at most once even if the layer has more than one successor. Thus, there are two cases in which we add data transmission edges:

a) For each node 𝑣 ∈ 𝑉, if it only has one successor 𝑣′ in 𝐺, add an edge from 𝑣 to 𝑣′, whose weight is \(\begin{align}\frac{\operatorname{data}(v)}{b_{i j}}\end{align}\), i.e., the time for transmitting 𝑣’s output data from 𝑑𝑗 to 𝑠𝑖.

b) For each node 𝑣 ∈ 𝑉, if it has more than one successor in 𝐺, we first add a virtual node 𝑣virtual to 𝐺′, then add an edge from 𝑣 to 𝑣virtual, whose weight is \(\begin{align}\frac{\operatorname{data}(v)}{b_{i j}}\end{align}\), and then add edges from 𝑣virtual to all the successors of 𝑣 in 𝐺, whose weights are positive infinity.

Fig. 6 shows an example of constructing the latency graph of a DAG. At this stage, the optimal partition problem has been converted into the minimum weighted s–t cut problem of the latency graph 𝐺′. For an s–t cut 𝐶 of 𝐺′, the edges in 𝐶 comprise three parts: remote computing edges, local computing edges and data transmission edges. Thus, the value of 𝐶 is exactly equal to the inference latency of the DNN task under the partition induced by 𝐶, without considering the time waiting for scheduling. We then use a min-cut algorithm to get the minimum weighted s–t cut of 𝐺′ and obtain the optimal partition 𝑝𝑖j from it. In the following steps of determining the scheduling and offloading strategies, we suppose task 𝑗 is partitioned at 𝑝𝑖j when offloaded to 𝑠𝑖. Now, we have bound 𝑃 to 𝑋, i.e., the partition strategy 𝑃 can be determined accordingly once the offloading strategy 𝑋 is determined. Since the time complexity of the min-cut algorithm is 𝑂(|𝑉|² |𝐸| ln |𝑉|), where |𝑉| is the number of nodes in the DAG and |𝐸| is the number of edges, and the time complexity of constructing the latency graph is 𝑂(|𝑉| + |𝐸|), the time complexity of MCP is 𝑂(𝑚𝑛|𝑉′|² |𝐸′| ln |𝑉′|), where |𝑉′| and |𝐸′| are the largest numbers of nodes and edges among all the DAGs, respectively.


Fig. 6. Constructing the latency graph

Algorithm 1. MCP

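A minimal sketch of the MCP construction and min-cut step is given below, assuming the networkx package for the minimum s–t cut; the toy DAG and the cost values are illustrative, not taken from the paper.

```python
import networkx as nx

def mcp_partition(succ, com, data, cap_d, cap_s, b_ij):
    """Return (cut value, V_l, V_r) for one task j and one ES s_i."""
    g = nx.DiGraph()
    for v in succ:
        g.add_edge("s", v, capacity=com[v] / cap_s)   # remote computing edge
        g.add_edge(v, "t", capacity=com[v] / cap_d)   # local computing edge
    for v, children in succ.items():
        if len(children) == 1:                        # single successor
            g.add_edge(v, children[0], capacity=data[v] / b_ij)
        elif len(children) > 1:                       # output is sent only once
            hub = ("virtual", v)
            g.add_edge(v, hub, capacity=data[v] / b_ij)
            for u in children:
                g.add_edge(hub, u)                    # no capacity => infinite
    # The cut value equals the inference latency without the waiting time.
    cut_value, (s_side, _) = nx.minimum_cut(g, "s", "t")
    local = {v for v in succ if v in s_side}          # V_l: layers on the ED
    remote = set(succ) - local                        # V_r: layers on the ES
    return cut_value, local, remote

if __name__ == "__main__":
    succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}   # branching DAG
    com = {"a": 4e6, "b": 8e6, "c": 6e6, "d": 2e6}              # FLOPs
    data = {"a": 3e5, "b": 1e5, "c": 1e5, "d": 1e4}             # output sizes
    print(mcp_partition(succ, com, data, cap_d=1e8, cap_s=1e9, b_ij=1e6))
```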

4.1.2 Task Scheduling

To give the scheduling strategy Φ for each offloading strategy 𝑋, we design a new scheduling policy for tasks offloaded to the same ES called “shortest weighted remaining time first” (SWRTF). In particular, SWRTF has three rules:

1) The ES will not be idle unless there are no tasks in the waiting list.

2) Tasks are executed non-preemptively, that is, once a task starts executing, other tasks must wait until the execution of the task ends.

3) The task with the shortest weighted remaining time RT𝑗 is scheduled first, where \(RT_{j}=t_{ij}^{\text{remote}}/w_{j}\).

For a given task 𝑗 and ES 𝑠𝑖, if task 𝑗 is decided to be offloaded to 𝑠𝑖, then the partition 𝑝𝑖j of task 𝑗 is given as described in Section 4.1.1. Suppose the set of tasks offloaded to 𝑠𝑖 is 𝐽𝑖 = {𝑗1, 𝑗2, ⋯, 𝑗𝑞}. Then for each 𝑗𝑘 in 𝐽𝑖, the local computation 𝐿𝑗𝑘(𝑝𝑖𝑗𝑘), transmission data size 𝐶𝑗𝑘(𝑝𝑖𝑗𝑘) and remote computation 𝑅𝑗𝑘(𝑝𝑖𝑗𝑘) of 𝑗𝑘 are given. Thus \(t_{ij_k}^{\text{local}}\), \(t_{ij_k}^{\text{trans}}\) and \(t_{ij_k}^{\text{remote}}\) can be obtained from Eqs. (8), (9) and (11). Thus, the arrival time, i.e., the time of local computing and data transmission, and the weighted remaining time of each task are given. Then the scheduling strategy 𝜙(𝑠𝑖) for these tasks can be given based on SWRTF. The same goes for the other ESs. Now, we have bound Φ to 𝑋, i.e., the scheduling strategy Φ can be determined accordingly once the offloading strategy 𝑋 is determined.
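The following is a minimal simulation of the SWRTF policy on a single ES under our reading of the three rules; it is applied to the three tasks of Fig. 4 purely for illustration.

```python
# Each task is (id, arrival time, t_remote, priority w); all tasks are
# released at time 0, so a task's finish time equals its inference latency.

def swrtf_schedule(tasks):
    """Return the execution order and the average weighted inference latency."""
    pending, clock, order, total = list(tasks), 0.0, [], 0.0
    while pending:
        clock = max(clock, min(t[1] for t in pending))  # rule 1: idle only if no task waits
        ready = [t for t in pending if t[1] <= clock]
        nxt = min(ready, key=lambda t: t[2] / t[3])     # rule 3: min RT_j = t_remote / w_j
        pending.remove(nxt)
        clock += nxt[2]                                 # rule 2: non-preemptive execution
        order.append(nxt[0])
        total += nxt[3] * clock                         # weighted finish time
    return order, total / len(tasks)

print(swrtf_schedule([("t1", 5, 5, 3), ("t2", 7, 2, 2), ("t3", 3, 6, 1)]))
# -> (['t3', 't2', 't1'], 26.33...)
```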

4.1.3 Task Offloading

Based on the practical considerations on DNN partition and task scheduling in Sections 4.1.1 and 4.1.2, we can bind the partition strategy 𝑃 and scheduling strategy Φ to the offloading strategy 𝑋, i.e., we just need to decide the offloading strategy 𝑋, and then the partition strategy 𝑃 and scheduling strategy Φ are decided accordingly. As a result, our initial optimization problem POSP can be transformed into a 0-1 integer optimization problem which only has one set of variables 𝑋:

\(\begin{align}\begin{array}{ll}\mathcal{P}: & \min _{X} T \\ \text { s.t. } & \text { C1: } x_{i j}=1 \xrightarrow{\text { yields }} p_{j}=p_{i j} \\ & \text { C2: } \sum_{i=1}^{m} x_{i j} \leq 1, \forall 1 \leq j \leq n \\ & \text { C3: } x_{i j}=1 \text { or } 0, \forall 1 \leq j \leq n, \forall 1 \leq i \leq m \\ & \text { C4: } \phi\left(s_{i}\right)=\left\{j \mid x_{i j}=1\right\}\end{array}\end{align}\).      (14)

Actually, this is a variant of the generalized assignment problem (GAP) in which the cost of assigning a task to a server, i.e., its weighted inference latency 𝑤𝑗𝑇𝑖j, can be changed by the other tasks assigned to the same server, since the waiting time \(t_{ij}^{\text{wait}}\) is affected by 𝑋. Motivated by the algorithm proposed by Ross and Soland [20], we design a heuristic algorithm BBO, i.e., branch and bound optimization, to solve this problem. The detailed introduction of BBO is in Section 4.2. Once the offloading strategy 𝑋 is determined by BBO, the partition strategy 𝑃 and scheduling strategy Φ can be determined accordingly, thus determining the solution of the initial problem POSP.

4.2 Branch and Bound Optimization Algorithm (BBO)

Actually, branch and bound is a way to traverse the entire solution space of the problem with pruning to limit the time complexity: branching generates the solutions and bounding enables pruning. Following branch and bound, the solution set of 𝒫 is separated into two mutually exclusive and collectively exhaustive subsets based on the 0-1 dichotomy of a variable's value, and so are the subsets thus created. Fig. 7 gives an example of separating the solution set of 𝒫. Each separation creates two new candidate problems whose solution sets differ only in the value assigned to a particular variable. We use BFS (breadth-first search) to traverse the solution sets with bounding and pruning.


Fig. 7. An example of the solution space of this problem

The main processing procedures for each candidate problem 𝒫𝑘 are:

1) Bounding: relax 𝒫𝑘 to get a lower bound Θ𝑘 of the objective function in this branch according to the relaxed problem 𝒫ℛ𝑘, and update the upper bound of the objective function of problem 𝒫 by substituting a feasible solution into the objective function.

2) Branching: select a variable as the separation variable to further separate the solution set.

3) Pruning: check whether the lower bound of this branch is so large that the branch needs to be pruned. Detailed explanations of these procedures are given below; for notational convenience, let 𝒫𝑘 denote the candidate problem of a branch and 𝐹𝑖 denote the set of tasks which have been fixed to be offloaded to ES 𝑠𝑖 in 𝒫𝑘.

4.2.1 Bounding

Bounding is the procedure that bounds the upper bound and lower bound of each candidate problem for pruning. Relaxing the problem to get the bound is a traditional method in branch and bound algorithms. The current candidate problem 𝒫𝑘 is too complex to solve directly since the assignment cost of each task is tightly related to the offloading strategy of the other tasks, i.e., the assignment cost of each task is unknown at the beginning. Thus, we can get the relaxed problem 𝒫ℛ𝑘 by fixing the assignment cost of each task relative to each ES before solving the problem. In the current candidate problem 𝒫𝑘, the assignment of some tasks has been decided. Therefore, we can first compute the total weighted inference latency of an ES 𝑠𝑖 considering only the tasks that have been decided to be offloaded to it. Then, for each task 𝑗 whose assignment has not been decided, the assignment cost 𝐶𝑖j equals the increment of the total weighted inference latency of 𝑠𝑖 after assigning task 𝑗 to 𝑠𝑖:

\(C_{ij}=T_{i}(j)-T_{i}\),       (15)

where 𝑇𝑖 denotes the existing total weighted inference latency of the tasks in 𝐹𝑖 utilizing SWRTF and 𝑇𝑖(𝑗) is the new total weighted inference latency of the tasks in \(F_{i} \cup\{j\}\) utilizing SWRTF. The relaxed problem 𝒫ℛ𝑘 can be formalized as:

\(\begin{align}\begin{array}{l} \mathcal{P} \mathcal{R}_{k}: \min _{X} T^{\prime} \\ \text { s.t. } \\ \text { C1: } \sum_{i=1}^{m} x_{i j} \leq 1, \forall 1 \leq j \leq n \text {, } \\ \text { C2: } x_{i j}=1 \text { or } 0, \forall 1 \leq j \leq n, \forall 1 \leq i \leq m \text {, } \end{array}\end{align}\),       (16)

where \(\begin{align}T^{\prime}=\frac{1}{n} \sum_{i=1}^{m} \sum_{j=1}^{n} x_{i j} C_{i j}\end{align}\) is the average weighted assignment cost. Since 𝐶𝑖j is known, problem 𝒫ℛ𝑘 has an obvious solution 𝑋𝑘 obtained by assigning every unassigned task \(j \notin \bigcup_{i=1}^{m} F_{i}\) to the ES that minimizes 𝐶𝑖j. Substituting this solution into 𝒫ℛ𝑘 yields the lower bound of this branch, denoted by Θ𝑘. It is obvious that 𝑋𝑘 is a feasible solution for 𝒫𝑘, so we can calculate a valid objective function value 𝑇𝑣alid𝑘 of the initial problem 𝒫 by substituting 𝑋𝑘 into 𝒫𝑘. Since the problem seeks the minimum objective function value, 𝑇𝑣alid𝑘 is used to update the upper bound Ω of 𝒫.

Function ProblemRelax(𝒫𝑘)


4.2.2 Branching

Branching is the procedure that separates the solution space of the problem into different sub-problems for traversal. To select a 𝑥𝑖j ∈ 𝑋 as the separation variable, i.e., to separate the problem according to the value of 𝑥𝑖j, we compute a "re-offloading profit" 𝛿𝑖j for each ES 𝑠𝑖 and each task 𝑗 whose offloading target has not yet been determined in the current candidate problem, that is, \(j \notin \bigcup_{i=1}^{m} F_{i}\). Let 𝑋𝑘(𝑖) denote the solution obtained by modifying the feasible solution 𝑋𝑘 so that task 𝑗 is re-offloaded to ES 𝑠𝑖. It is obvious that 𝑋𝑘(𝑖) is also a feasible solution for 𝒫𝑘. Then 𝛿𝑖j can be defined as the reduction of the objective function 𝑇 in 𝒫𝑘:

\(\delta_{ij}=T(X_{k})-T(X_{k}(i))\),       (17)

where 𝑇(𝑋𝑘) denotes the value of the objective function obtained by substituting 𝑋𝑘 into problem 𝒫𝑘. The selected separation variable 𝑥𝑖j is the one that has the maximum 𝛿𝑖j among those with \(j \notin \bigcup_{i=1}^{m} F_{i}\) in the candidate problem 𝒫𝑘. Then, the solution set is further separated into two subsets, one with 𝑥𝑖j = 1 and the other with 𝑥𝑖j = 0.

Function MaxProfit(𝒫𝑘, X𝑘)


4.2.3 Pruning

Pruning is the procedure that limits the time complexity of the traversal of the solution space. To avoid redundant computing, we safely discard some solution sets based on two rules. First, as described in the bounding procedure, we maintain the upper bound Ω of the initial problem 𝒫 and compute a lower bound Θ𝑘 for each candidate problem 𝒫𝑘. If Θ𝑘 > Ω, this branch can obviously be pruned since it cannot generate the optimal solution. Second, during the BFS of the solution sets, we set a threshold 𝜔 on the maximum number of candidate problems in each level. In particular, the candidate problems in the same level are processed in a round, and if the number of candidate problems in a round exceeds 𝜔, those with the largest lower bounds Θ𝑘 are pruned.

4.2.4 Algorithm Design and Analysis

Algorithm 2 illustrates the pseudocode of our BBO algorithm. We maintain two problem sets 𝑄 and 𝑄′ for BFS and an upper bound Ω of the initial problem 𝒫 (line 1). The main loop of this algorithm is the BFS process over the solution sets, and each loop can be decomposed into three steps. In step 1, we compute the lower bound Θ𝑘 and the feasible solution 𝑋𝑘 of the candidate problem 𝒫𝑘 utilizing the function ProblemRelax(𝒫𝑘) (line 4). In step 2, if the lower bound Θ𝑘 of 𝒫𝑘 is not greater than Ω, we update the upper bound Ω (lines 6-7) and select a separation variable 𝑥𝑖j for branching with the function MaxProfit(𝒫𝑘, 𝑋𝑘) (lines 8-11). After conducting steps 1 and 2 for each candidate problem in this round (line 3), we check whether the problem set for the next round is too large and remove some candidate problems if necessary (lines 12-14). At last, we return the solution with the minimum objective function value (lines 21-22). There are at most 𝑚𝑛 processing rounds, we need to process at most 𝜔 candidate problems in each round, and the time complexity of processing each candidate problem is 𝑂(𝑛² log 𝑛). Thus, the time complexity of the BBO algorithm is 𝑂(𝜔𝑚𝑛³ log 𝑛).

Algorithm 2. BBO (Branch and Bound Optimization)

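The following is a compact, self-contained sketch of the BBO skeleton under a few simplifications: server costs are evaluated with an SWRTF simulator, every task must be offloaded to some ES (local-only execution is omitted), branching uses a simple regret rule in place of the full MaxProfit re-offloading profit, and children inherit the parent's bound as the key for the level-width pruning with ω.

```python
def server_cost(jobs):
    """Total weighted latency of jobs (arrival, t_remote, weight) under SWRTF."""
    pending, clock, total = list(jobs), 0.0, 0.0
    while pending:
        clock = max(clock, min(t[0] for t in pending))   # never idle needlessly
        ready = [t for t in pending if t[0] <= clock]
        nxt = min(ready, key=lambda t: t[1] / t[2])      # SWRTF rule 3
        pending.remove(nxt)
        clock += nxt[1]
        total += nxt[2] * clock                          # weighted finish time
    return total

def bbo(tasks, servers, omega=32):
    """tasks[j][i] = (arrival, t_remote, weight) of task j if offloaded to ES i."""
    n = len(tasks)

    def objective(assign):                               # true average weighted latency
        return sum(server_cost([tasks[j][i] for j in tasks if assign[j] == i])
                   for i in servers) / n

    def relax(fixed, banned):
        """Bounding: greedy completion of a partial assignment (cf. ProblemRelax)."""
        assign = dict(fixed)
        base = {i: server_cost([tasks[j][i] for j in fixed if fixed[j] == i])
                for i in servers}
        bound, branch = sum(base.values()), None
        for j in tasks:
            if j in fixed:
                continue
            # C_ij: increase of ES i's weighted latency when j joins F_i (Eq. (15))
            costs = {i: server_cost([tasks[k][i] for k in fixed if fixed[k] == i]
                                    + [tasks[j][i]]) - base[i]
                     for i in servers if (i, j) not in banned}
            if not costs:
                return float("inf"), None, None          # j banned everywhere
            ranked = sorted(costs, key=costs.get)
            assign[j] = ranked[0]
            bound += costs[ranked[0]]
            regret = (costs[ranked[1]] - costs[ranked[0]]
                      if len(ranked) > 1 else float("inf"))
            if branch is None or regret > branch[0]:
                branch = (regret, j, ranked[0])          # simplified MaxProfit
        return bound / n, assign, branch

    best_val, best_assign = float("inf"), None
    level = [({}, frozenset())]                          # root candidate problem
    while level:                                         # breadth-first search
        children = []
        for fixed, banned in level:
            theta, assign, branch = relax(fixed, banned)
            if assign is None or theta > best_val:
                continue                                 # prune this branch
            val = objective(assign)
            if val < best_val:                           # update upper bound (Omega)
                best_val, best_assign = val, assign
            if branch is None:
                continue                                 # all tasks already fixed
            _, j, i = branch
            children.append(({**fixed, j: i}, banned, theta))       # x_ij = 1
            children.append((fixed, banned | {(i, j)}, theta))      # x_ij = 0
        children.sort(key=lambda c: c[2])                # level-width pruning
        level = [(f, b) for f, b, _ in children[:omega]]
    return best_val, best_assign

if __name__ == "__main__":
    # Two ESs and three tasks; all numbers are illustrative.
    tasks = {"t1": {"e1": (5, 5, 3), "e2": (6, 4, 3)},
             "t2": {"e1": (7, 2, 2), "e2": (8, 3, 2)},
             "t3": {"e1": (3, 6, 1), "e2": (2, 7, 1)}}
    print(bbo(tasks, servers=["e1", "e2"], omega=8))
```

In the full algorithm, ProblemRelax and MaxProfit play the roles of relax() and the branching rule here, and the threshold ω caps the width of each BFS level as described in Section 4.2.3.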

Since BBO gives the offloading strategy 𝑋, the partition strategy 𝑃 and scheduling strategy Φ can be given accordingly based on Section 4.1.

5. Implementation and Evaluation

In this section, we first introduce the prototype setup for our experiment, and then compare our scheme with several existing schemes.

5.1 Prototype Setup

To evaluate the performance of our scheme, we build a heterogeneous device-edge system prototype. We use two laptops to act as the edge servers that assist the end devices in executing their inference tasks, one equipped with a 6-core 2.60GHz Intel CPU and 16 GB RAM and the other equipped with a 4-core 1.60GHz Intel CPU and 8 GB RAM. The end devices comprise two Raspberry Pis, each equipped with a 4-core ARM Cortex-A72 CPU, a Jetson TX2 equipped with a 4-core ARM Cortex-A57 CPU, and a Jetson Xavier NX equipped with a 6-core ARMv8 CPU. All end devices are connected to the edge servers through a LAN. We use pretrained DNN models from the standard implementations in the PyTorch package. An AlexNet model is deployed on Raspberry Pi 1, a MobileNet_V2 model on Raspberry Pi 2, a ResNet18 model on the Jetson TX2 and a VGG19 model on the Jetson Xavier NX. The priorities of the tasks on the four end devices are 1, 2, 3 and 4, respectively. The settings are listed in Table 2. We use tiny-ImageNet [21], a subset of the ILSVRC2012 classification dataset, as our dataset.
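For reference, the four pretrained models can be instantiated from torchvision roughly as follows (this assumes the torchvision ≥ 0.13 weights API; older releases use the pretrained=True argument instead); the partition and deployment logic of CIS itself is not shown.

```python
from torchvision import models

nets = {
    "AlexNet":      models.alexnet(weights=models.AlexNet_Weights.DEFAULT),
    "MobileNet_V2": models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT),
    "ResNet18":     models.resnet18(weights=models.ResNet18_Weights.DEFAULT),
    "VGG19":        models.vgg19(weights=models.VGG19_Weights.DEFAULT),
}
for name, net in nets.items():
    net.eval()                      # inference mode for latency measurements
    print(name, sum(p.numel() for p in net.parameters()))
```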

Table 2. Experimental Settings


5.2 Benchmarks

We evaluate our scheme by comparing the performance with four naive schemes and a SOTA (state-of-the-art) scheme as follows:

• Local-Only (LO): All inference tasks are executed locally.

• Edge-Only (EO): All EDs offload their tasks to the edge servers without DNN partition, the selection of the edge server to offload is random and the execution order of tasks on the server is FCFS.

• DADS [7] with random allocation and FCFS scheduling (RA-FS): We first use DADS to compute the optimal partition of each task for each ES, and then randomly allocate tasks to ESs where tasks are scheduled in the FCFS manner.

• DADS with random allocation and SWRTF scheduling (RA): We first use DADS to compute the optimal partition of each task for each ES, and then randomly allocate tasks to ESs where tasks are scheduled in the SWRTF manner.

• CCORAO [22]: This is a SOTA (state-of-the-art) algorithm for cloud assisted mobile edge computing in vehicular networks. The main idea of CCORAO is to decide the offloading strategy and resource allocation strategy iteratively. We adapt it to our problem POSP, i.e., it decides the offloading strategy and scheduling strategy iteratively.

5.3 Experimental Results

We first test the inference latency of each task on different devices. It can be seen in Fig. 8 that the computing capacity of different devices varies and the inference latency when tasks are executed locally is too high to support some real-time applications. Thus, we need to carefully design the collaborative inference scheme to reduce the inference latency in this heterogeneous system.


Fig. 8. The inference latency of different models on different devices.

Then we conduct multiple experiments by modifying the bandwidth between devices. The bandwidth configurations for the experiments are shown in Table 3. We use four different bandwidth configurations to simulate different network conditions of the system. Configuration 1 simulates the situation in which the overall network condition of the system is good, Configuration 2 simulates the situation in which the overall network condition of the system is bad, Configuration 3 simulates the situation in which the network condition varies between different devices, and Configuration 4 simulates the situation in which the network conditions from the end devices to the two ESs differ greatly.

Table 3. Four bandwidth configurations


Table 4 shows the partition and offloading strategies for tasks given by CIS at different configurations, where 2/11 means the model AlexNet has 11 layers and is partitioned at the 2nd layer, 0/11 means offloading the entire task to the edge server and 11/11 means executing the entire task locally. We can draw from the results that the better the network condition, the earlier the tasks are offloaded. When the system suffers adverse network conditions, the end devices tend to execute their tasks locally.

Table 4. Partition and offloading strategy at different configurations.


Fig. 9 shows the average weighted inference latency of different schemes. It can be seen that our scheme CIS has similar performance to CCORAO, and both can reduce the average weighted inference latency by about 29% to 71% compared to the other four naive schemes. Further analysis of the experimental results reveals that the result of LO is stable but cannot be optimal; the result of EO is influenced by the network conditions, and when the network is poor it may lead to intolerable latency; the results of RA-FS and RA are relatively better since they improve upon LO and EO. CIS and CCORAO jointly consider DNN partition, task offloading and scheduling, so they can always obtain the optimal solutions when the problem space is small.


Fig. 9. The average weighted inference latency of different schemes at different configurations.

5.4 Simulation Experiments

In this section, a series of simulation experiments are conducted to further evaluate the performance of our proposed scheme CIS. Initially, we evaluate the performance of CIS under different network conditions in Section 5.4.1. Then we test the schemes when the size and number of tasks change in Section 5.4.2. At last, the robustness of CIS under different computational capacity patterns for the multiple edge servers and devices is validated in Section 5.4.3. For the numerical analysis, the computational capacities of EDs, the computational capacities of ESs and the bandwidth between each ED and ES follow a uniform distribution in the ranges of [1, 5] FLOPS, [10, 20] FLOPS and [0.1, 2.0] Mbps, respectively. The DNN models in our experiments include AlexNet, MobileNet_V2, ResNet18 and VGG19.

5.4.1 Performance with different network conditions

In this section, we evaluate the performance of CIS under different network conditions. The simulation scenario has 12 EDs and 6 ESs; the computational capacities of the EDs are [1.2, 1.4, 1.6, 1.8, 2, 2.2, 2.4, 2.6, 2.8, 2.9, 2.3, 4.0], the computational capacities of the ESs are [11, 15, 17.5, 20, 24, 22], and the priorities of the tasks on the EDs are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. An AlexNet model is deployed on 𝑑1, 𝑑5, 𝑑9, a MobileNet_V2 model on 𝑑2, 𝑑6, 𝑑10, a ResNet18 model on 𝑑3, 𝑑7, 𝑑11, and a VGG19 model on 𝑑4, 𝑑8, 𝑑12. For convenience, we set the bandwidth between all devices to be the same. Fig. 10 shows the simulation results at different bandwidths. The inference latencies of CIS and CCORAO are similar and always smaller than those of the other schemes, since the problem space is small and they can always get the optimal solution. The inference latency of LO does not vary with bandwidth, while the performance of the other five schemes improves as the network bandwidth increases. At the beginning, the improvement is quite significant since the network condition is the main bottleneck; when the bandwidth is high enough, the improvement gets smaller as the computational capacity becomes the main bottleneck. Overall, our proposed scheme CIS performs well under different network conditions.


Fig. 10. Average weighted inference latency at different bandwidths.

5.4.2 Performance with different numbers of tasks

In this section, we evaluate the performance of CIS with different numbers of tasks. We fix 100 ESs whose computational capacities are randomly taken from [10, 20] FLOPS. Then we conduct a series of experiments with different numbers of EDs, i.e., different numbers of tasks. For each number of EDs, we conduct the experiment 50 times. In each experiment, the computational capacity of each ED is randomly taken from [1, 5] FLOPS, the DNN model deployed on it is randomly picked from AlexNet, MobileNet_V2, ResNet18 and VGG19, and the bandwidth between an ED and an ES is randomly taken from [0.1, 2.0] Mbps. We take the results of scheme LO as the baseline and compute the relative average weighted inference latency of the other five schemes, i.e., the average weighted inference latency of each scheme divided by that of LO. Then the average over the 50 experiments is taken to compare the different schemes. Fig. 11 shows the simulation results. There are four main observations regarding these results:


Fig. 11. Relative average weighted inference latency with different numbers of tasks

• The performance of each scheme decreases as the number of tasks increases. This is intuitive since the number and computational capacities of the ESs are fixed.

• EO is always the worst scheme and can lead to more than 140% average weighted inference latency compared to LO. This is because the ES to offload to is selected randomly for each task, which means many tasks may be offloaded to the same ES, thus increasing the result. Likewise, RA-FS and RA can also perform worse than LO since the selection of the ES to offload to is random. They are better than EO since they first partition the DNN models, which reduces the computation overhead of the ESs.

• The performance gap between CIS and RA-FS or RA increases as the number of tasks increases, because the number of tasks offloaded to the same ES is small at the beginning, so the scheduling strategy has little impact on the results. The more tasks there are, the more important it is to decide proper offloading and scheduling strategies.

• Compared to CCORAO, when the number of tasks is small, the advantage of CIS is not obvious since both can obtain the optimal solutions. However, as the number of tasks increases, CCORAO cannot conduct enough rounds of iteration to reach convergence. Although CIS also discards some subproblems to limit its complexity, our careful design of the heuristic rules for discarding subproblems helps maintain good results.

5.4.3 Performance with different computational capacity patterns

In this section, we validate the robustness of CIS under different computational capacity patterns for the multiple edge servers and devices. We fix the numbers of EDs and ESs at 300 and 100, respectively. Each DNN model is deployed on 75 EDs. The computational capacities of the ESs and EDs are randomly taken from [10, 20] FLOPS and [1, 5] FLOPS, respectively. We conduct the experiment 100 times and compare the relative average weighted inference latency of the other five schemes. The results are shown in Fig. 12. Compared with EO, RA-FS and RA, our scheme CIS has obvious advantages. Overall, the inference latency when taking CIS is always lower. What's more, the results of CIS are less dispersed, which means unacceptable results rarely occur. This is mainly because the first three strategies randomly select the offloading strategy, which is obviously not feasible when the number of tasks is large. When it comes to CCORAO and CIS, both of them almost always demonstrate a certain level of performance improvement compared to the baseline scheme LO. However, the results of CIS exhibit a more concentrated distribution and a lower maximum inference latency. This indicates that our carefully designed heuristic rules for discarding subproblems in Section 4.2.3 are indeed effective.


Fig. 12. Relative average weighted inference latency with different computational capacity patterns

6. Conclusion

In this paper, we study DNN inference acceleration in a heterogeneous edge computing scenario. We present a comprehensive analysis of collaborative inference in the heterogeneous scenario and point out the complexity of this problem. A scheme CIS is proposed which jointly combines DNN partition, task offloading and task scheduling to accelerate DNN inference tasks. Extensive experiments are conducted to evaluate our scheme. With a detailed analysis of the evaluation results, CIS is validated to be effective in reducing the average weighted inference latency of the system.

Acknowledgement

This work was supported in part by the Science and Technology Project of State Grid Co., LTD (Research on data aggregation and dynamic interaction technology of enterprise-level real-time measurement data center, 5108-202218280A-2-399-XG).

References

  1. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol.521, pp.436-444, May, 2015.
  2. J. Chen, and X. Ran, "Deep Learning With Edge Computing: A Review," Proceedings of the IEEE, vol.107, no.8, pp.1655-1674, Aug. 2019.
  3. J. Chai, and A. Li, "Deep Learning in Natural Language Processing: A State-of-the-Art Survey," in Proc. of 2019 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 1-6, 2019.
  4. W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4960-4964, 2016.
  5. J. Mao, X. Chen, K. W. Nixon, C. Krieger, and Y. Chen, "MoDNN: Local distributed mobile computing system for Deep Neural Network," in Proc. of Design, Automation & Test in Europe Conference & Exhibition, pp.1396-1401, 2017.
  6. Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica, "Low Latency Geo-distributed Data Analytics," ACM SIGCOMM Computer Communication Review, vol.45, no.4, pp.421-434, Aug. 2015.
  7. C. Hu, W. Bao, D. Wang, and F. Liu, "Dynamic Adaptive DNN Surgery for Inference Acceleration on the Edge," in Proc. of IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pp.1423-1431, 2019.
  8. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," in Proc. of 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-9, 2015.
  9. K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016.
  10. M. Gao, R. Shen, L. Shi, W. Qi, J. Li, and Y. Li, "Task Partitioning and Offloading in DNN-Task Enabled Mobile Edge Computing Networks," IEEE Transactions on Mobile Computing, vol.22, no.4, pp.2435-2445, Apr. 2023.
  11. X. Tang, X. Chen, L. Zeng, S. Yu, and L. Chen, "Joint Multiuser DNN Partitioning and Computational Resource Allocation for Collaborative Edge Intelligence," IEEE Internet of Things Journal, vol.8, no.12, pp.9511-9522, 2021.
  12. T. Mohammed, C. Joe-Wong, R. Babbar, and M. Di Francesco, "Distributed Inference Acceleration with Adaptive DNN Partitioning and Offloading," in Proc. of IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pp.854-863, 2020.
  13. C.-Y. Yang, J.-J. Kuo, J.-P. Sheu, and K.-J. Zheng, "Cooperative Distributed Deep Neural Network Deployment with Edge Computing," in Proc. of ICC 2021 - IEEE International Conference on Communications, pp.1-6, 2021.
  14. Z. Liao, W. Hu, J. Huang, and J. Wang, "Joint multi-user DNN partitioning and task offloading in mobile edge computing," Ad Hoc Networks, vol.144, 2023.
  15. L. Shi, Z. Xu, Y. Sun, Y. Shi, Y. Fan, and X. Ding, "A DNN inference acceleration algorithm combining model partition and task allocation in heterogeneous edge computing system," Peer-to-Peer Networking and Applications, vol.14, pp.4031-4045, 2021.
  16. Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge," ACM SIGARCH Computer Architecture News, vol.45, no.1, pp.615-629, 2017.
  17. S. Zhang, Y. Li, X. Liu, S. Guo, W. Wang, J. Wang, B. Ding, and D. Wu, "Towards Real-time Cooperative Deep Inference over the Cloud and Edge End Devices," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol.4, no.2, pp.1-24, Jun. 2020.
  18. N. Wang, Y. Duan, and J. Wu, "Accelerate Cooperative Deep Inference via Layer-wise Processing Schedule Optimization," in Proc. of 2021 International Conference on Computer Communications and Networks, pp.1-9, 2021.
  19. Y. Duan, and J. Wu, "Joint Optimization of DNN Partition and Scheduling for Mobile Cloud Computing," in Proc. of ICPP '21: Proceedings of the 50th International Conference on Parallel Processing, pp.1-10, 2021.
  20. G. T. Ross, and R. M. Soland, "A branch and bound algorithm for the generalized assignment problem," Mathematical programming, vol.8, pp.91-103, Dec. 1975.
  21. Y. Le, and X. Yang, "Tiny imagenet visual recognition challenge," CS 231N, vol.7, no.7, 2015. https://cs231n.stanford.edu/reports/2015/pdfs/yle_project.pdf
  22. J. Zhao, Q. Li, Y. Gong and K. Zhang, "Computation Offloading and Resource Allocation for Cloud Assisted Mobile Edge Computing in Vehicular Networks," IEEE Transactions on Vehicular Technology, vol.68, no.8, pp.7944-7956, 2019.
  23. Y. Chen, K. Wu, Q. Zhang, "From QoS to QoE: A Tutorial on Video Quality Assessment", IEEE Communications Surveys & Tutorials, vol.17, no.2, pp.1126-1165, 2015.