A Diversified Message Type Forwarding Strategy Based on Reinforcement Learning in VANET

  • Xu, Guoai (School of Cyberspace Security, Beijing University of Posts and Telecommunications) ;
  • Liu, Boya (School of Cyberspace Security, Beijing University of Posts and Telecommunications) ;
  • Xu, Guosheng (School of Cyberspace Security, Beijing University of Posts and Telecommunications) ;
  • Zuo, Peiliang (Beijing Electronic Science and Technology Institute)
  • Received : 2021.12.18
  • Accepted : 2022.09.01
  • Published : 2022.09.30

Abstract

The development of the Vehicular Ad hoc Network (VANET) has greatly improved the efficiency and safety of social transportation, and routing strategies for VANET have received considerable attention from both academia and industry. However, studies on dynamically matching routing policies to the message types of VANET are in short supply, which affects the operational efficiency and security of VANET to a certain extent. This paper studies the message types in VANET and fully considers the urgency and reliability requirements of message forwarding for each type. Based on the diversified types of messages to be transmitted, and taking diversified message forwarding strategies suitable for VANET scenarios as candidate actions, an adaptive routing method for VANET message types based on reinforcement learning (RL) is proposed. The key parameters of the method, such as state, action and reward, are reasonably designed. Simulation and analysis show that the proposed method converges quickly and that its comprehensive performance is clearly better than that of the comparison methods in terms of timeliness and reliability.

Keywords

1. Introduction

As of the beginning of 2020, 330 million vehicles around the world had been interconnected [1]. VANET provides users with safe, efficient, convenient and comfortable intelligent messaging services, has become an important part of modern intelligent transportation, and has attracted close attention from researchers worldwide [2]. Among the many research fields of VANET, routing is one of the key technologies constraining its development, so many researchers have attempted to optimize routing methods suitable for VANET over the years.

The Greedy Perimeter Stateless Routing (GPSR) algorithm is a typical location-based routing protocol [3]. Nodes in this protocol do not need to maintain routing tables, and the protocol is simple and easy to implement, so it has become a common routing protocol for VANET systems. Applying reinforcement learning [4] algorithms to vehicular networks can optimize quality-of-service parameters in VANET to different degrees, such as delay, bandwidth and security, which provides a better basis for the design and improvement of routing strategies in VANET. Reinforcement learning is an important field of artificial intelligence whose central problem is how an agent obtains the maximum reward through interaction with the environment and thereby learns the strategy with the best return. Since the state-action matching problem in reinforcement learning resembles the dynamic routing process, reinforcement learning has been widely applied to routing, which also provides a good solution for the design and improvement of message forwarding strategies in VANET.

J. Li [5] proposed a Q-learning-based routing algorithm for wireless sensor networks, in which each node is marked as a state, a state transition is defined as an action, and the optimal path is established by traversing the routing table. M. Yuan et al. [6] proposed a multi-priority message-oriented VANET routing algorithm based on Q-values, which aims to solve the load-balancing problem. By considering location information and received signal strength, the routing problem is transformed into a Q-learning optimization process under fuzzy constraints in [7]. C. Wu et al. [8] proposed an improved Q-learning routing protocol, QLAODV, which can effectively deal with the high-speed dynamic movement and frequent topology changes of mobile ad hoc networks. R. Plate et al. [9] proposed a Q-learning routing method, QKS, that combines kinematics and selective sweeping to address the slow convergence of the Q-learning algorithm. To solve the multicast problem and improve the performance of the MAODV routing protocol, G. Santhi et al. [10] proposed the MANET multicast routing protocol QLMAODV by applying the Q-learning algorithm to the existing MAODV protocol. Y. Sun et al. [11] proposed a location-based reinforcement learning routing protocol, PBQR, which preemptively selects a sub-optimal route before the current active route fails in order to learn the network state; it defines stability and continuity factors, uses the Q-learning algorithm to evaluate the quality of neighbor nodes, selects the next-hop node based on node location information, and enhances link stability and reliability. J. Wu et al. [12] proposed an adaptive routing protocol based on reinforcement learning (ARPRL); by designing a new Q-value update function and using the DATA forwarding mechanism and the MAC-layer feedback mechanism to assist in updating the Q-value table, it effectively solves routing loops, link interruptions and other issues. S. Jiang et al. [13] designed a Q-learning-assisted geographic routing scheme to improve packet delivery and end-to-end delay. J. Aznar-Poveda et al. [14] proposed joint beacon rate and transmission power control based on Q-learning and policy evaluation to ensure the timeliness of message forwarding.

Currently, most research focuses on algorithm innovation and improvement under a single fixed objective. However, the increasingly complex needs of users lead to a diversity of forwarded message types, so designing a message forwarding strategy that selects different forwarding protocols for different message types, thereby reducing network overhead, is a key direction in vehicular networking research. To the best of the authors' knowledge, there is very little research on routing methods related to the message types of VANET. In the "Network 5.0 Technology" white paper released in September 2021 [15], internet information is divided into 10 levels of certainty. This paper defines four message types that take into account the timeliness and reliability requirements of the message transmission process, and introduces two routing algorithms with completely different transmission processes. The RL algorithm converges quickly when the model is not too complex and can meet the low-delay requirement of the scheme; for this reason, we use reinforcement learning to intelligently match transmission strategies to the different message types.

The remainder of the paper is organized as follows. Section 2 introduces the related preliminary knowledge. Section 3 describes the system model in detail. Section 4 covers the algorithm modeling combined with reinforcement learning and gives the details of the proposed algorithm. Section 5 verifies the performance of the proposed method. The last section concludes the paper.

2. Preliminary knowledge

2.1 GPSR

Location-based protocols are a promising routing solution for VANET in terms of performance. GPSR is a typical location-based routing algorithm. Each node obtains its own location through a positioning system and uses the greedy forwarding algorithm to select the node within its communication range that is closest to the destination as the next forwarder. If no neighbor is closer to the destination, perimeter forwarding is used to route around the void and avoid routing holes. A node switches between greedy forwarding and perimeter forwarding as the situation requires until the communication is completed.
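As an illustration only, the following Python sketch shows the greedy next-hop selection step described above; the coordinates and function names are hypothetical, and a full GPSR implementation would additionally perform perimeter (face) routing over a planarized neighbor graph when greedy forwarding stalls.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def greedy_next_hop(current_pos, dest_pos, neighbors):
    """Pick the neighbor geographically closest to the destination.

    neighbors maps node id -> (x, y) position for nodes inside the
    communication range. Returns (node_id, needs_perimeter); if no
    neighbor is closer to the destination than the current node, GPSR
    would switch to perimeter forwarding instead.
    """
    best_id, best_dist = None, euclidean(current_pos, dest_pos)
    for node_id, pos in neighbors.items():
        d = euclidean(pos, dest_pos)
        if d < best_dist:
            best_id, best_dist = node_id, d
    if best_id is None:
        return None, True           # local maximum: fall back to perimeter mode
    return best_id, False

# Hypothetical usage
next_hop, use_perimeter = greedy_next_hop((0, 0), (100, 40),
                                          {"n2": (30, 10), "n3": (10, 60)})
```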

To date, many scholars have improved the GPSR protocol [16-19], especially to make it applicable to VANET. H. Yuan et al. [20] proposed the GPSR-TM protocol, in which the relationship between vehicles is expressed as a social network and the location, speed and social attributes of neighbor nodes are considered when choosing the next-hop forwarding node. Wang et al. [21] proposed a prediction-based GPSR algorithm: at each broadcast, the locations of neighbor nodes are predicted from their speed and direction, and the next-hop node is then selected. A. Benmir et al. [22] proposed sending the same data packet along two different paths to maximize its reception probability; simulation results show that the scheme outperforms GPSR in some performance metrics. S. Younes et al. [23] proposed a novel Extended Kalman Filter Greedy Perimeter Stateless Routing protocol, EKF-GPSR, which uses a stochastic prediction model based on an EKF to obtain node information when transmitting data, instead of relying on information from the last exchanged beacon messages, which is most likely outdated.

2.2 Reinforcement learning

Reinforcement learning learns a mapping from the state space to the action space that enables the agent to obtain the greatest reward while interacting with the environment [24]. One of the most important RL algorithms is Q-learning, which involves the action-value function Q(s, a), the expected reward of taking action a in state s. Its update rule is:

\(\begin{aligned}Q(s, a) \leftarrow Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)\end{aligned}\)       (1)

The Markov decision process is usually used to model reinforcement learning. The ε-greedy policy is a way to reach a good compromise between exploration and exploitation, and it can be described as:

\(\begin{aligned}\boldsymbol{a}=\left\{\begin{array}{llll}\text { randomly action } & \text { with } & \text { probability } & \varepsilon \\ \arg \max _{a \in A} Q(a, s) & \text { with } & \text { probability } & 1-\varepsilon\end{array}\right.\\\end{aligned}\)       (2)

The application model in this paper is single-step reinforcement learning, which corresponds to the theoretical model of the K-armed bandit [25].
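To make the exploration-exploitation compromise of (2) concrete, the following is a minimal Python sketch of ε-greedy action selection over a small action set; the action names, Q-values and ε are illustrative assumptions, not values from the paper.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Select an action per Eq. (2): explore with probability epsilon,
    otherwise pick the action with the largest Q-value."""
    actions = list(q_values)
    if random.random() < epsilon:
        return random.choice(actions)               # exploration
    return max(actions, key=lambda a: q_values[a])  # exploitation

# Hypothetical Q-table over the two meta-actions introduced in Section 4.2
q = {"GPSR": 0.42, "D-S": 0.57}
chosen = epsilon_greedy(q, epsilon=0.1)
```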

2.3 Dempster-Shafer (D-S) evidence theory

Dempster-Shafer (D-S) evidence theory was proposed by the Harvard mathematician A. P. Dempster and further developed by his student Shafer [26]. D-S evidence theory is a complete theory for dealing with uncertainty: it captures both the objectivity of things and the subjectivity of human estimates of them. Its most distinctive feature is that uncertain information is described by "interval estimation" rather than "point estimation"; it distinguishes between the unknown and the uncertain and shows great flexibility in accurately reflecting the collected evidence. It is commonly defined as follows:

(1) Recognition frame: the symbol Θ commonly stands for the recognition frame (frame of discernment), which contains all possible answers to a question, of which only one is correct.

(2) Mass function: m(·) denotes the mass function, which reflects the degree of trust, where m : 2Θ → [0,1].

(3) Trust function: Bel(A) is defined as the sum of the basic probability assignments of all subsets of A, which represents the total trust in A, that is: Bel : 2Θ → [0,1], \(\begin{aligned}\operatorname{Bel}(A)=\sum_{B \subseteq A} m(B), A \subseteq \Theta\end{aligned}\).

(4) Likelihood function: Pl(A) represents the degree to which trust in A is not denied; it is the sum of the basic probability assignments of all subsets that intersect A. The plausibility of A can be expressed as: Pl : 2Θ → [0,1], \(\begin{aligned}\operatorname{Pl}(A)=1-\operatorname{Bel}(\bar{A}), A \subseteq \Theta\end{aligned}\).

3. System model

The routing scenario covered in this paper is shown in Fig. 1. It is worth noting that, owing to the vulnerability of wireless communication, adversaries will launch various attacks against routing. Malicious nodes exhibit various malicious behaviors: they may execute eavesdropping attacks, denial-of-service attacks, impersonation attacks, black hole attacks and so on [27]. Malicious nodes basically disguise themselves as normal nodes in order to participate in message forwarding. After messages reach malicious nodes, these nodes analyze and exploit the messages to damage the network or gain illegal benefits. In terms of impact on VANET performance, the behavior of malicious nodes mainly increases the end-to-end transmission delay and reduces overall network performance. In order to simulate the real scene as closely as possible, we add malicious nodes to the system model. In this paper, malicious nodes specifically refer to "black hole" nodes [28]; this assumption makes the description of the system model more intuitive. The behavior of a black hole node can be described as follows: after receiving a message, the node discards it instead of forwarding it to the next relay node. Malicious nodes therefore increase the end-to-end transmission delay, waste network resources, and reduce the performance of the entire network.


Fig. 1. Schematic diagram of message forwarding strategy scenario

We assume the communication range of vehicle nodes is limited, so nodes need to cooperate with each other to complete data forwarding. Information such as the position, speed and trust value of vehicle nodes within the communication range can be obtained periodically, and a message is transmitted to the destination node through relay nodes. In the figure, the source node (vehicle) selects a matching forwarding action according to the type of the message to be forwarded and transmits the message to a relay node. The relay node is selected according to the message type, node distance, speed, trust value and other factors. The relay nodes participating in forwarding execute the policy in turn to complete the communication process. After the communication is completed, the trust values and the Q-values of the corresponding actions of the nodes participating in the communication are updated. For ease of understanding, the calculation processes for updating the Q-value and the trust value are detailed below.

For state si , the update process of the average reward value Qm(si, aj) of the m-th attempt of action aj is as follows:

\(\begin{aligned}Q_{m}\left(s_{i}, a_{j}\right)=\frac{1}{m}\left((m-1) \times Q_{m-1}\left(s_{i}, a_{j}\right)+\gamma_{m}\right)\end{aligned}\)       (3)

Where state si ∈ S , S is the state space, action aj ∈ A , A is the action space, γm represents the reward value obtained in the m-th attempt.
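The following is a minimal sketch, in Python, of the incremental-average update of Eq. (3), assuming a per-(state, action) attempt counter; the variable and state names are illustrative.

```python
def update_q(q_table, counts, state, action, reward):
    """Incremental-average update of Eq. (3):
    Q_m(s, a) = ((m - 1) * Q_{m-1}(s, a) + reward) / m."""
    counts[(state, action)] = counts.get((state, action), 0) + 1
    m = counts[(state, action)]
    prev = q_table.get((state, action), 0.0)
    q_table[(state, action)] = ((m - 1) * prev + reward) / m

# Hypothetical usage: a Type I message sent with the D-S meta-action, reward 1.0
q_table, counts = {}, {}
update_q(q_table, counts, "type_I", "D-S", 1.0)
```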

According to D-S theory, the comprehensive trust value of a node includes an objective trust value and a subjective trust value. The objective trust value is the trust value obtained through direct communication with the source node, while the subjective trust value is the trust value obtained through communication between the node and its neighbor nodes. Each node holds trust values for the other nodes. Since the comprehensive trust value must be calculated through the communication process when performing the D-S meta-action, the network overhead is larger than that of the GPSR algorithm. An example of the trust value calculation is as follows. Assume that the forwarding node n1 has neighbor nodes n2 and n3. According to D-S evidence theory, the objective trust values of n2 and n3 are given by the trust functions Bel(n2) = mass(n2) and Bel(n3) = mass(n3), and the subjective trust value is Bel(Θ) = mass(Θ). The comprehensive trust value of each neighbor node is then obtained from the likelihood function:

Pl(n2) = Bel(n2) + Bel(Θ) = mass(n2) + mass(Θ)        (4)

Pl(n3) = Bel(n3) + Bel(Θ) = mass(n3) + mass(Θ)       (5)

Thus, n1 gets the trust values Pl(n2) and Pl(n3) of n2 and n3 .
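The sketch below reproduces the trust-value computation of Eqs. (4)-(5) for this example; the mass values are made-up numbers that sum to 1 over {n2, n3, Θ}.

```python
def comprehensive_trust(mass):
    """Compute Pl(n) = Bel(n) + Bel(Theta) for every concrete neighbor,
    following Eqs. (4)-(5). `mass` maps each neighbor id (and "Theta",
    the unassigned belief) to its basic probability assignment."""
    theta = mass.get("Theta", 0.0)
    return {node: m + theta for node, m in mass.items() if node != "Theta"}

# Hypothetical mass assignment held by n1 about its neighbors n2 and n3
mass_n1 = {"n2": 0.5, "n3": 0.3, "Theta": 0.2}
trust = comprehensive_trust(mass_n1)   # {'n2': 0.7, 'n3': 0.5}
```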

4. QLMTR strategy model

This section describes the Q-learning-based message type routing (QLMTR) strategy. To construct a strategy model based on Q-learning, the state space, action space and reward function need to be established; each is described in detail below.

4.1 State space

As a preliminary study of message forwarding strategies, and to simplify the presentation of the method, we further condense the message classification defined in [15]. Specifically, message types are described and analyzed according to their urgency and reliability requirements, and the state space S is defined as the set of message types. It is worth mentioning that the number and definition of message types can be adjusted to specific needs in practical applications. In this paper, S is classified as follows:

Type I messages are defined as messages with urgent and high reliability requirements, referring to information that directly affects the driving safety of the vehicle, such as the driving state of the vehicle itself. This information requires high timeliness and reliability and cannot be lost during transmission. Transmission is required to succeed within t1 communication interactions.

Type II messages correspond to messages with urgent but low reliability requirements, referring to information that slightly affects the driving process of the vehicle, such as road surface information, road section information, etc. This information is required to be delivered to the vehicle node in real time. During the execution of the algorithm, the source node tries to send at most t2 times.

Type III messages correspond to messages with non-urgent but high reliability requirements, referring to information that might affect the driving safety of vehicles, such as congestion, traffic density and other information. This kind of information must be accurately transmitted to the vehicle nodes. During the execution of the algorithm, the source node tries to send at most t3 times.

Type IV messages are defined as messages with non-urgent and low reliability requirements, referring to news, information, entertainment and other content that serves drivers but does not affect driving safety. During the execution of the algorithm, the source node tries to send at most t4 times.

Considering the timeliness requirements of different message types, in general, t1 ≤ t2 ≤ t3 ≤ t4 .
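As a hedged illustration of the state space, the sketch below encodes the four message types with their urgency/reliability requirements and maximum attempt counts; the concrete attempt limits for Types I and II are placeholders (the simulations in Section 5 use t3 = 3 and t4 = 4).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MessageType:
    name: str
    urgent: bool
    reliable: bool
    max_attempts: int   # t_i: maximum transmissions allowed for the source

# Placeholder attempt limits satisfying t1 <= t2 <= t3 <= t4
STATE_SPACE = {
    "I":   MessageType("I",   urgent=True,  reliable=True,  max_attempts=1),
    "II":  MessageType("II",  urgent=True,  reliable=False, max_attempts=2),
    "III": MessageType("III", urgent=False, reliable=True,  max_attempts=3),
    "IV":  MessageType("IV",  urgent=False, reliable=False, max_attempts=4),
}
```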

4.2 Action space

The meta-actions are message transmission based on the GPSR routing algorithm (hereinafter the GPSR meta-action) and message transmission based on the improved GPSR routing algorithm using D-S evidence theory (hereinafter the D-S meta-action). The meta-actions on their own, together with the combined actions composed of the two, form the action space A.

We design the action space in this way because GPSR is a typical routing protocol and, similarly, D-S evidence theory has been shown to work well for trust measurement. To illustrate the method and performance of the intelligent routing strategy proposed in this paper, we select these two typical methods as meta-actions. It should be pointed out that the action space can be built from different routing methods according to actual needs.

4.2.1 Meta-action

GPSR meta-action. The message transmission action is executed according to the GPSR standard algorithm.

D-S meta-action. The D-S meta-action uses trust measurement to handle the possible presence of malicious nodes in VANET. The source node uses D-S evidence theory to measure the comprehensive trust value of each neighbor node and selects a node with a high comprehensive trust value as the forwarding node; each forwarding node repeats this procedure until the message reaches the destination node.

4.2.2 Combined action

The number of actions in the action space A is related to the number of transmission attempts allowed for the source node. This paper takes the t2 transmissions allowed for Type II messages as an example. For convenience of explanation, let t2 = 2; the corresponding actions in the action space are shown in Table 1.

Table 1. Action combination table

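To show how such an action space can be generated, the sketch below enumerates all meta-action sequences of length t over the two meta-actions (later attempts are only executed if the previous transmission failed); for t = 3 and t = 4 this yields the 8 and 16 combinations discussed in the simulations. The single-letter labels G and D follow the abbreviations used in Section 5 (e.g., GGG, GGD).

```python
from itertools import product

def action_space(t, meta_actions=("G", "D")):
    """All combined actions for a message allowed at most t transmissions:
    every length-t sequence over the meta-actions."""
    return ["".join(seq) for seq in product(meta_actions, repeat=t)]

print(action_space(2))                              # ['GG', 'GD', 'DG', 'DD']
print(len(action_space(3)), len(action_space(4)))   # 8 16
```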

4.3 Reward function

As different message types have different requirements for forwarding actions, this paper lets the number of transmissions and the transmission success rate correspond to the urgency and reliability requirements, respectively; for instance, network overhead need not be considered for message types with high timeliness and reliability requirements. The reward functions of the four types of messages are as follows.

Since urgent and reliable (Type I) messages require a small number of transmissions and a high transmission success rate, the reward value function of this state is defined as \(\begin{aligned}\gamma=\left(\frac{\alpha_{1}}{t}+\beta_{1}\right) R\end{aligned}\).

The two message states that are urgent but have low reliability requirements (Type II) and non-urgent with low reliability requirements (Type IV) do not require a high transmission success rate. The reward value function of these states is defined as:

\(\begin{aligned}\gamma=\left(\frac{\alpha_{2}}{t_{G P S R} C_{G P S R}+t_{D-S} C_{D-S}}+\beta_{2}\right) R\end{aligned}\)       (6)

For a message state that is not urgent but requires high reliability (Type III), the number of transmissions can be relatively high, so the reward value function in this state is defined as:

\(\begin{aligned}\gamma=\left(\frac{\alpha_{3}}{t_{G P S R} C_{G P S R}+t_{D-S} C_{D-S}}+\beta_{3}\right) R\end{aligned}\)       (7)

Among them, R is the indicator function: R = 1 when the transmission is successful and R = 0 otherwise. α(⋅) and β(⋅) are weighting factors that represent the weights of the number of transmissions and of transmission success in the reward, respectively, with α(⋅) > 0, β(⋅) ≥ 0 and α(⋅) + β(⋅) = 1. t is the number of transmissions; tGPSR and tD-S are the numbers of transmissions using the GPSR meta-action and the D-S meta-action among the t transmissions, with tGPSR + tD-S = t. CGPSR and CD-S are the network overheads of the GPSR meta-action and the D-S meta-action, respectively; generally, CD-S ≥ CGPSR.
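As a hedged sketch of these reward functions, the Python code below computes γ for a given message type; the weighting factors α and β are example values only (the paper uses type-specific αi, βi with α + β = 1), and the overheads follow the simulation settings CGPSR = 1 and CD-S = 30.

```python
def reward(msg_type, success, t_gpsr, t_ds,
           alpha=0.8, beta=0.2, c_gpsr=1.0, c_ds=30.0):
    """Reward gamma for one message, per Section 4.3.

    msg_type: "I", "II", "III" or "IV"; success: whether the message arrived;
    t_gpsr / t_ds: number of GPSR / D-S transmissions actually performed.
    """
    R = 1.0 if success else 0.0      # indicator function
    t = t_gpsr + t_ds                # total number of transmissions
    if msg_type == "I":
        # urgent + reliable: penalize the raw number of transmissions
        return (alpha / t + beta) * R
    # Types II, III, IV: penalize the weighted network overhead instead
    overhead = t_gpsr * c_gpsr + t_ds * c_ds
    return (alpha / overhead + beta) * R

# Hypothetical usage: Type III message delivered after one GPSR and one D-S attempt
gamma = reward("III", success=True, t_gpsr=1, t_ds=1)
```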

4.4 Algorithm Details

The specific process of the forwarding strategy is shown in Algorithm 1. In this paper we assume that each vehicle is equipped with a GPS positioning module, so vehicles can obtain the location, speed, trust value and other information of their neighbor nodes. For each forwarded message, the reward value of the selected action is calculated according to the reward function corresponding to the message type. The trust value is calculated based on D-S theory, as described in the system model section. We let R denote the node communication radius and L the distance between nodes.

Algorithm 1. Details of the QLMTR strategy

Initialize source node p1 and destination node p2 .

p1 periodically sends and receives Hello packets to obtain neighbor nodes within the communication radius R .

If Lp1p2 ≤ R, then the source node directly completes the communication with the destination node.

Else (Lp1p2 > R), p1 selects the corresponding action according to the type of message to be forwarded and updates the reward Q-value.

If p1 selects the GPSR meta-action, then the GPSR routing algorithm is directly executed to complete the communication.

Else p1 selects D-S meta-action, then calculates the trust value of each neighbor node, and executes the improved GPSR algorithm based on D-S to complete the communication.

End If

End If

p1 and p2 update their trust values for the nodes participating in the forwarding process. At the same time, p1 updates the Q-value with reference to the reward function corresponding to the current state.
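The following is a hedged, self-contained Python sketch of the decision step in Algorithm 1, simplified to a single choice between the two meta-actions (combined actions would be handled analogously) and omitting the trust-value bookkeeping. The callables run_gpsr, run_ds and reward_fn are hypothetical placeholders for the GPSR transmission, the D-S-improved GPSR transmission and the type-specific reward of Section 4.3.

```python
import random

def qlmtr_forward(msg_type, distance, radius, q_table, counts,
                  run_gpsr, run_ds, reward_fn, epsilon=0.1):
    """One QLMTR forwarding decision for a source/destination pair.

    run_gpsr / run_ds: callables () -> (success, t_gpsr, t_ds) standing in
    for the two meta-actions; reward_fn follows Section 4.3.
    """
    if distance <= radius:
        return "direct"          # destination already within communication range

    # epsilon-greedy selection over the meta-actions, per Eq. (2)
    actions = ("GPSR", "D-S")
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: q_table.get((msg_type, a), 0.0))

    success, t_gpsr, t_ds = (run_gpsr if action == "GPSR" else run_ds)()

    # incremental-average Q update, per Eq. (3)
    counts[(msg_type, action)] = counts.get((msg_type, action), 0) + 1
    m = counts[(msg_type, action)]
    prev = q_table.get((msg_type, action), 0.0)
    gamma = reward_fn(msg_type, success, t_gpsr, t_ds)
    q_table[(msg_type, action)] = ((m - 1) * prev + gamma) / m
    return action
```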

5. Simulation and analysis

To simplify the description, we only show the simulation results for Type I, Type III and Type IV messages. The simulation parameter settings are shown in Table 2. In the simulation, each message type is simulated 200 times for each number of malicious nodes. To highlight the large difference between the GPSR meta-action and the D-S meta-action in terms of network overhead, their overheads are set to 1 and 30, respectively. The specific simulation results below demonstrate the effectiveness and advantages of the proposed strategy.

Table 2. Simulation parameter settings


5.1 Simulation results for Type I messages

The Q-value changes of the two meta-actions for Type I messages under different numbers of malicious nodes are shown in Fig. 2. It can be clearly seen from the figure that, for urgent and reliable messages, in the presence of malicious nodes the sending node can accurately find the transmission method that should be used (i.e., the D-S meta-action) through the strategy designed in this paper. In particular, it can also be seen that, as the number of malicious nodes increases, the Q-value of the GPSR meta-action shows a clear downward trend, whereas the Q-value of the D-S meta-action remains essentially unchanged. This is because the transmission method based on D-S evidence theory can accurately identify malicious nodes in VANET and prevent such nodes from participating in the forwarding of messages with high timeliness and reliability requirements. The transmission method based on the GPSR algorithm does not have this capability, so as the number of malicious nodes increases, more failed transmissions inevitably occur.


Fig. 2. Q-value of type I message meta-action VS Number of malicious nodes

5.2 Simulation results for Type III messages

For non-urgent and reliable messages, since the maximum number of transmissions allowed is 3, there are 8 meta-action combinations to choose from. The Q-value changes of the different combinations with the number of malicious nodes are shown in Fig. 3. It can be clearly seen from the figure that the Q-value of the GGG combination is higher when the number of malicious nodes is low (i.e., fewer than 17); as the number of malicious nodes increases, its Q-value decreases significantly. Meanwhile, the Q-value curves of the DGG, DDG, DGD and DDD combinations are exactly the same. For the reason given above, they all use the D-S method for the first transmission, which has a high success rate, so the second and third transmissions are not needed and their Q-values are identical. To further compare the remaining three combinations (GGD, GDG and GDD), Fig. 4 shows their Q-value changes. The Q-value of the GGD combination is higher than that of the other two in more than half of the scenarios, which shows that, under the proposed strategy, the sending node tends to use the GPSR method first and the D-S method for the last transmission. This is because Type III messages are relatively insensitive to timeliness and individual transmissions are allowed to fail, but the information must be successfully delivered within the maximum number of transmissions. It can also be concluded that, with the proposed strategy, nodes in VANET can adaptively adjust the transmission method according to the network situation and their own sending requirements, achieving the transmission goal with the highest overall benefit.


Fig. 3. Q-value of type III message meta-action VS Number of malicious nodes


Fig. 4. Changes in the Q-value of GGD, GDG, and GDD in Type III messages as malicious nodes increase

5.3 Simulation results for Type IV messages

For Type IV messages, the maximum number of transmissions is set to 4, so the action space is composed of 16 actions. The Q-values of these actions under different numbers of malicious nodes are shown in Fig. 5. Similarly, the GGGG combination is the most sensitive to the number of malicious nodes: when the number of malicious nodes exceeds 40, its Q-value is much lower than that of the other 15 combinations. In addition, it can be seen that the combinations whose first transmission uses the D-S method (i.e., DXXX-type combinations) have Q-values that remain unchanged and identical.

To further compare the remaining combinations, Table 3 shows the combination with the largest Q-value for each number of malicious nodes. Although no combination shows an absolute advantage, it can be observed from the table that, when four transmissions are allowed, combinations that use the GPSR method in the first two transmissions account for 96% of all combinations with the largest Q-value. This is consistent with the principle of the proposed strategy: for non-urgent messages with low reliability requirements, more attempts should first be made with the GPSR method, which has lower network overhead. However, when there are many malicious nodes in the network, the GPSR method has a high probability of transmission failure, in which case the D-S method should be used. It should be added that the Q-values of the remaining combinations do not differ much (see Fig. 5), and their overall trends remain the same, which shows that meta-action selection when sending Type IV messages is relatively flexible.


Fig. 5. Q-value of category IV message meta-action VS Number of malicious nodes

Table 3. Actions with the maximum Q-value for different malicious nodes in Type IV messages


5.4 Comparison of transmission times

This paper uses Type III messages as an example to illustrate the comparison of transmission times; the transmission times of the other message types follow a similar trend and, due to space limitations, the related content is omitted. Fig. 6 shows the average number of transmissions for the eight meta-action combinations when transmitting Type III messages. An interesting result can be observed: when the sending node uses the D-S method for the first transmission (i.e., DXX-type combinations), the total number of transmissions is always 1. The reason is consistent with the previous analysis: when the D-S method is used, the message can be successfully transmitted in one attempt. For the same reason, when the second transmission is a D-S meta-action (GDG and GDD), the number of transmissions does not exceed two. As for the remaining GGG and GGD combinations, the figure shows that their average transmission counts increase with the number of malicious nodes. This is because malicious nodes reduce the transmission success rate of the GPSR method, so more transmissions are needed in exchange for transmission success.


Fig. 6. The average number of transmissions of meta-actions in Type III messages

5.5 Comparison of the number of transmission nodes

Similarly, only the average number of nodes participating in forwarding for the different meta-action combinations of Type III messages is used as an example; the changes for the other three message types are similar. It can be seen from Fig. 7 that, as the number of malicious nodes increases, the average number of participating nodes of the DGG, DDG, DGD and DDD combinations, which perform only one transmission, also increases. This is because the reduction in the number of reliable forwarding nodes affects the transmission path. Similarly, the average number of participating nodes of the remaining four combinations also increases significantly, and, owing to the larger number of transmissions, the numbers of participating nodes of the GGG and GGD combinations are significantly greater than those of the GDG and GDD combinations.


Fig. 7. The average number of participating nodes for meta-actions in Type III messages

5.6 Convergence

The strategy proposed in this paper is based on single-step reinforcement learning, which means that the action selection faced by the sending node and the forwarding nodes is a two-choice or multiple-choice process, so the learning process has low complexity and fast convergence. According to simulation observations, for the four message types studied in this paper, the update process of the corresponding meta-action Q-values converges within 200 message transmissions. Due to space limitations, this paper only presents the convergence of Type III messages with 16 malicious nodes. As shown in Fig. 8, the Q-value update process is obtained by averaging over 1000 runs; the convergence process for the other message types is similar. It can be clearly seen from the figure that, because the reward and punishment functions of the respective learning processes differ, the different meta-action combinations converge at somewhat different speeds. However, all combinations converge to a stable Q-value within 58 message transmissions. Therefore, the proposed strategy has a fast convergence rate.


Fig. 8. Convergence of the Q-value learning process for the meta-actions of type III messages, with 16 malicious nodes

5.7 Strategic advantage

Since there is no existing forwarding strategy aimed at the diversity of message categories in the VANET scenario, this paper compares the proposed strategy with the GPSR algorithm and with the improved GPSR algorithm based on D-S evidence theory in terms of message forwarding reliability and network overhead, and further compares the three methods in terms of transmission delay and complexity, to reflect the advantages of the proposed strategy.

In the GPSR algorithm, a node obtains the status of the n nodes around it and uses greedy forwarding to select the node nearest to the destination to transmit messages. The GPSR algorithm does not need to maintain a routing table, but it needs to planarize the network topology, with a complexity of O(n), where n is the density of neighbor nodes. For the improved GPSR algorithm based on D-S evidence theory, a node obtains the status of the n nodes around it, uses D-S evidence theory to calculate node trust values, and selects a node with a high trust value for message transmission. Since the method only adds the superposition of node trust values on top of GPSR, its complexity is O(n). In our strategy, a node obtains the status of the n nodes around it and selects GPSR or improved GPSR transmission according to the type of message to be transmitted. Furthermore, the complexity of machine learning algorithms is generally assessed at the point of use, which here mainly amounts to querying the mapping table, since the training process can be completed in advance. In the application of the proposed method, the action output is obtained simply by querying the mapping table, and the table-query complexity is O(log2 N), where N is the number of elements in the mapping table. So the complexity of QLMTR is O(log2 N) + O(n), which can be further reduced to O(log2 N). Moreover, since the state space of the proposed scheme is not large, this complexity is completely acceptable. The complexity comparison of the three methods is shown in Table 4.

Table 4. Complexity comparison of the three methods


It should be noted that, although the proposed method is more complex in application than the two comparison methods (which only need to obtain one or a few parameters of the surrounding nodes, without other calculation or storage), its complexity is still very low for the nodes and is completely acceptable.

Without loss of generality, this paper again uses Type III messages as an example to illustrate the performance advantages of the algorithm (strategy). Fig. 9 shows the comparison between the proposed strategy and the GPSR algorithm in terms of transmission success rate, and between the proposed strategy and the improved GPSR algorithm based on D-S evidence theory in terms of network overhead. As can be seen from the figure, the proposed strategy is significantly better than the GPSR algorithm, which is sensitive to malicious nodes, in terms of transmission success rate. At the same time, the strategy guarantees the message transmission success rate with relatively small overhead. Overall, the proposed strategy matches the type of the transmitted message and can flexibly adjust the trade-off between transmission success rate, network overhead and other factors.


Fig. 9. Comparison of the proposed strategy and comparison method in terms of transmission success rate and network overhead

To facilitate the comparison of transmission delays, we choose a one-way message arrival process to calculate the time taken to transmit a message. To eliminate the influence of transmission failures on the delay calculation, we assume that the proposed strategy and the two comparison algorithms all retransmit an unlimited number of times until the transmission succeeds. Specifically, the transmission delay \(T_{delay}\) is defined as \(\begin{aligned}T_{delay}=(\bar{N}-1) \times T_{prepare}+(\bar{M}+1) \times T_{action}\end{aligned}\), where \(\bar{N}\) is the average number of transmissions of the message, \(T_{prepare}\) is the preparation delay between two repeated transmissions at the sending node, \(\bar{M}\) is the average number of forwarding nodes that participate until the transmission succeeds, and \(T_{action}\) is the per-hop delay between transmission nodes for the specific meta-action. See Table 2 for the values of these parameters.
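For concreteness, here is a one-function sketch of this delay formula with hypothetical parameter values (the actual values of T_prepare and T_action are those listed in Table 2, which is not reproduced here).

```python
def transmission_delay(n_avg, m_avg, t_prepare, t_action):
    """T_delay = (N_avg - 1) * T_prepare + (M_avg + 1) * T_action."""
    return (n_avg - 1) * t_prepare + (m_avg + 1) * t_action

# Hypothetical numbers: 1.4 transmissions on average, 5 forwarding nodes,
# 20 ms retransmission preparation delay, 2 ms per-hop action delay
print(transmission_delay(1.4, 5, 0.020, 0.002))  # delay in seconds
```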

Fig. 10 shows the comparison of the transmission delays for Type III messages between the strategy proposed in this paper and the two transmission protocols. It is not difficult to see from the figure that, as the number of malicious nodes increases, the message transmission delays of all three strategies show an obvious upward trend. Among them, the GPSR protocol is sensitive to malicious nodes, and its number of transmissions grows with the number of malicious nodes, so its delay increases most rapidly. In addition, it can be seen that, when the number of malicious nodes is low, the strategy proposed in this paper retains the low-transmission-delay advantage of the GPSR protocol. As the number of malicious nodes increases, the proposed strategy tilts toward the D-S meta-action when selecting the sending protocol, so the delay does not rise rapidly with the number of malicious nodes. It is not difficult to infer from the figure that, when the number of malicious nodes increases further, the proposed strategy will abandon the GPSR meta-action, so its delay will coincide with that of the improved GPSR algorithm based on D-S evidence theory. Therefore, in terms of delay performance, the strategy proposed in this paper is superior to the commonly used GPSR algorithm and to the improved GPSR algorithm based on D-S evidence theory proposed in this paper.


Fig. 10. Comparison of the performance of the three protocols in terms of transmission delay

6. Concluding remarks

In reality, messages of different types have different transmission requirements in VANET application scenarios. This paper considered the impact of malicious nodes on network security and defined the message types accordingly. Meanwhile, drawing on the ability of reinforcement learning to spontaneously select actions adapted to environmental needs through the exploration and exploitation of the learning process, this paper designed a forwarding strategy whose state space is the set of message types and whose action space consists of the GPSR algorithm, the improved GPSR algorithm based on D-S evidence theory, and combinations of the two. Numerous simulations have demonstrated that the proposed strategy can meet the transmission requirements of different message types in the VANET scene. In addition, the state space and meta-action elements of the proposed strategy can be adjusted according to the actual application, so the strategy has broad application prospects.

References

  1. Cyber Security Administration of the Ministry of Industry and Information Technology, White Paper on Network Security of Internet of Vehicles, 2020. [R/OL].
  2. M. Syfullah, J. M. Lim and F. L. Siaw, "Mobility-Based Clustering Algorithm for Multimedia Broadcasting over IEEE 802.11p-LTE-enabled VANET," KSII Transactions on Internet and Information Systems, vol. 13, no. 3, pp. 1213-1237, 2019. https://doi.org/10.3837/tiis.2019.03.006
  3. A. BENGAG, M. E. Boukhari, "Enhancing GPSR routing protocol based on Velocity and Density for real-time urban scenario," in Proc. of 2020 International Conference on Intelligent Systems and Computer Vision (ISCV), pp. 1-5, 2020.
  4. R. A. Nazib, S. Moh, "Reinforcement Learning-Based Routing Protocols for Vehicular Ad Hoc Networks: A Comparative Survey," IEEE Access, vol. 9, pp. 27552-27587, 2021. https://doi.org/10.1109/ACCESS.2021.3058388
  5. Jianyong Li, Research on optimization of learning algorithm for routing in wireless sensor networks, Southwest University, 2016.
  6. Ming Yuan, Research on Vanet Routing Algorithm Based on Reinforcement Learning, Xidian University, 2017.
  7. Celimuge Wu, Satoshi Ohzahata, and Toshihiko Kato, "Flexible, Portable, and Practicable Solution for Routing in VANETs: A Fuzzy Constraint Q-Learning Approach," IEEE Transactions on Vehicular Technology, 62(9), 4251-4263, 2013. https://doi.org/10.1109/TVT.2013.2273945
  8. Celimuge Wu, Kazuya Kumekawa, and Toshihiko Kato, "Distributed Reinforcement Learning Approach for Vehicular Ad Hoc Networks," IEICE Transactions on Communications, E93.B(6), 1431-1442, 2010. https://doi.org/10.1587/transcom.E93.B.1431
  9. Plate R, Wakayama C, "Utilizing kinematics and selective sweeping in reinforcement learning-based routing algorithms for underwater networks," Ad Hoc Networks, 34(NOV.), 105-120, 2015. https://doi.org/10.1016/j.adhoc.2014.09.012
  10. G. Santhi, A. Nachiappan, M. Z. Ibrahime, R. Raghunadhane and M. K. Favas, "Q-learning based adaptive QoS routing protocol for MANETs," in Proc. of 2011 International Conference on Recent Trends in Information Technology (ICRTIT), pp. 1233-1238, 2011.
  11. Yiming Lin, A Reinforcement Learning-based Routing Protocol in VANETs, Xiamen University, 2017.
  12. Jinqiao Wu, Min Fang, Xiao Li, "Reinforcement Learning Based Mobility Adaptive Routing for Vehicular Ad-Hoc Networks," Wireless Personal Communications, 101(4), 2143-2171, 2018. https://doi.org/10.1007/s11277-018-5809-z
  13. Shanshan Jiang, Zhitong Huang, Yuefeng Ji, "Adaptive UAV-Assisted Geographic Routing With Q-Learning in VANET," IEEE Communications Letters, vol. 25, no. 4, pp. 1358-1362, April 2021. https://doi.org/10.1109/LCOMM.2020.3048250
  14. J. Aznar-Poveda, A. -J. Garcia-Sanchez, E. Egea-Lopez and J. Garcia-Haro, "MDPRP: A Q-Learning Approach for the Joint Control of Beaconing Rate and Transmission Power in VANETs," IEEE Access, vol. 9, pp. 10166-10178, 2021. https://doi.org/10.1109/ACCESS.2021.3050625
  15. Network 5.0 Industry and Technology innovation Alliance, Network 5.0 Technology White Paper (2.0), 2021. [R/OL].
  16. C. Lin, S. Yuan, S. Chiu and M. Tsai, "ProgressFace: An Algorithm to Improve Routing Efficiency of GPSR-Like Routing Protocols in Wireless Ad Hoc Networks," IEEE Transactions on Computers, vol. 59, no. 6, pp. 822-834, June 2010. https://doi.org/10.1109/TC.2010.47
  17. Alsaqour, R., Abdelhaq, M., Saeed, R., Uddin, M., Alsukour, O., Al-Hubaishi, M. and Alahdal, T., "Dynamic packet beaconing for GPSR mobile ad hoc position-based routing protocol using fuzzy logic," Journal of Network and Computer Applications, vol. 47, pp. 32-46, 2015. https://doi.org/10.1016/j.jnca.2014.08.008
  18. Wang, T., Anwar, S., Sun, H. and Zhou, Y., "Modified greedy perimeter stateless routing for vehicular ad hoc networking algorithm," International Journal of Sensor Networks, vol. 27(3), pp.163-171, 2018. https://doi.org/10.1504/ijsnet.2018.10014316
  19. X. Yang, M. Li, Z. Qian and T. Di, "Improvement of GPSR Protocol in Vehicular Ad Hoc Network," IEEE Access, vol. 6, pp. 39515-39524, 2018. https://doi.org/10.1109/access.2018.2853112
  20. H. Yuan, J. Geng, C. Liu, F. Bian, and T. Surapunt, "An improved GPSR routing algorithm based on vehicle trajectory mining," in Proc. of the 5th International Conference Geo-Spatial Knowledge and Intelligence, pp. 343-349, 2018.
  21. C. Wang, Q. Fan, X. Chen, and W. Xu, "Prediction based greedy perimeter stateless routing protocol for vehicular self-organizing network," in Proc. of IOP Conference Series: Materials Science and Engineering, vol. 322, p. 052019, 2018.
  22. A. Benmir, A. Korichi, A. Bourouis, M. Alreshoodi and L. Al-Jobouri, "An Enhanced GPSR Protocol for Vehicular Ad hoc Networks," in Proc. of 2019 11th Computer Science and Electronic Engineering (CEEC), pp. 85-89, 2019.
  23. S. Younes, M. Khelifi, A. Alioua and I. Souici, "EKF-GPSR: An Extended Kalman Filter for Efficient Routing in Vehicular Networks," in Proc. of 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 1-6, 2021.
  24. M. Dabbaghjamanesh, A. Moeini and A. Kavousi-Fard, "Reinforcement Learning-Based Load Forecasting of Electric Vehicle Charging Station Using Q-Learning Technique," IEEE Transactions on Industrial Informatics, vol. 17, no. 6, pp. 4229-4237, June 2021. https://doi.org/10.1109/TII.2020.2990397
  25. Zhihua Zhou, Machine Learning, Beijing: Tsinghua University Press, 2016.
  26. Yuanyuan Meng, Liancheng Xu, Min Ren, Yanfei Wang, "D-S Evidence Fusion Method Based on High Conflict Correction," Computer Engineering, 44(1), 79-83, 90, 2018.
  27. K. Gu, X. Dong, X. Li and W. Jia, "Cluster-Based Malicious Node Detection for False Downstream Data in Fog Computing-Based VANETs," IEEE Transactions on Network Science and Engineering, vol. 9, no. 3, pp. 1245-1263, 1 May-June 2022. https://doi.org/10.1109/TNSE.2021.3139005
  28. A. M. El-Semary and H. Diab, "BP-AODV: Blackhole Protected AODV Routing Protocol for MANETs Based on Chaotic Map," IEEE Access, vol. 7, pp. 95197-95211, 2019. https://doi.org/10.1109/access.2019.2928804