Adversarial Attacks on Heterogeneous Multi-Agent Deep Reinforcement Learning System with Time-Delayed Data Transmission

This paper studies the gradient-based adversarial attacks on cluster-based, heterogeneous, multi-agent, deep reinforcement learning (MADRL) systems with time-delayed data transmission. The structure of the MADRL system consists of various clusters of agents. The deep Q-network (DQN) architecture presents the first cluster's agent structure. The other clusters are considered as the environment of the first cluster's DQN agent. We introduce two novel observations in data transmission, termed on-time and time-delay observations. The proposed observations are considered when the data transmission channel is idle, and the data is transmitted on time or delayed. By considering the distance between the neighboring agents, we present a novel immediate reward function by appending a distance-based reward to the previously utilized reward to improve the MADRL system performance. We consider three types of gradient-based attacks to investigate the robustness of the proposed system data transmission. Two defense methods are proposed to reduce the effects of the discussed malicious attacks. We have rigorously shown the system performance based on the DQN loss and the team reward for the entire team of agents. Moreover, the effects of the various attacks before and after using defense algorithms are demonstrated. The theoretical results are illustrated and verified with simulation examples.


Introduction
Reinforcement learning (RL) is the process of learning a mapping from states to actions that maximizes a reward signal through the interaction of an agent with a specific environment [1,2]. Deep reinforcement learning (DRL) combines RL with deep learning (DL), two subfields of machine learning (ML) [3][4][5]. The advantage of DRL is that it addresses the high-dimensional problems that RL algorithms encounter [4,6,7]. Q-learning, a type of RL algorithm, learns action values in a specific state [1]. Despite its advances, Q-learning has one major flaw: similarly to dynamic programming, it can only update data within a two-dimensional array {state × action} [8]. The deep Q-network (DQN) algorithm, which merges Q-learning with deep neural networks (NN), was introduced to cope with this two-dimensional array problem [7,9], and it has been used in a wide range of applications [10][11][12]. There are two main reasons for using the DQN algorithm instead of other DRL approaches in this work: (i) its stability in performing complicated tasks, which is the consequence of utilizing randomly sampled experience replay and a target network; (ii) its ability to predict the Q-value function.

Contributions
We studied the time-delayed data transmission problem between agents in a cluster-based, heterogeneous, multi-agent deep reinforcement learning (MADRL) system under adversarial attacks. The paper contributions are: (i) in addition to the leaderless multi-agent system (MAS), we proposed a leader-follower MAS, such that the preassigned leader in each cluster communicates with the leaders of other clusters as well as the agents of its own cluster; (ii) we considered two novel observations in data transmission, called on-time and time-delay observations, and we investigated their effects on the DQN loss and team reward; (iii) we proposed a novel immediate reward function that considers the package length, packet header size, and distance between neighboring agents to improve the MAS performance in terms of approximated cumulative team discounted reward during time-delayed data transmission; (iv) we considered the fast gradient sign method (FGSM), fast gradient method (FGM), and basic iterative method (BIM) adversaries (gradient-based attacks) to attack the DQN algorithm, and we investigated the effects of such attacks on MAS performance and time-delayed data transmission; (v) we introduced two defense algorithms against the performed adversarial attacks. In the proposed defense methods, the DQN agent's deep NN learns from a state that produces the maximum perturbation value and uses its negative feedback to improve the system performance during an adversarial attack.

Related Research
The DQN algorithm has been used in data transmission between multiple agents. In [13], the transfer learning (TL) approach is combined with the DQN algorithm to introduce a multi-source transfer double DQN (MTDDQN). The MTDDQN is based on actor learning and enables the collection, summarization, and transfer of action knowledge by the RL agent between multiple agents [13]. Compared to [13], the current paper uses one DQN agent in a cluster-based, heterogeneous MAS for on-time and time-delayed data transmission.
Data transmission in MAS has been investigated in various scenarios and for linear and nonlinear systems [14]. For instance, periodic event-triggered output regulation for linear MAS by considering a leader-follower topology is proposed in [15]. An adaptive event-triggered consensus control of linear MAS with directed leader-follower topology in the presence of a cyber-attack that affects the control input without modification in communication topology is developed in [16].
A dynamic event-triggered asynchronous control integrating fuzzy models with directed topology is presented in [17]. A new adaptive event-triggered leaderless consensus control of nonlinear MAS, including directed topology, can be found in [18]. Moreover, some studies address data exchanges between linear and nonlinear systems as a heterogeneous MAS; e.g., a leaderless and a leader-follower consensus of heterogeneous second-order MAS on time scales using an asynchronous impulsive approach is presented in [19]. These previous studies show that there is little research on data transmission within homogeneous or heterogeneous MADRL systems, and that the majority of the research has focused on linear and nonlinear MAS. A heterogeneous MAS based on carrier-sense multiple access (CSMA) that utilizes a DRL algorithm in data transmission, termed carrier-sense deep reinforcement learning multiple access (CS-DLMA), is introduced in [20]. The CS-DLMA uses an α-fairness objective to measure system performance. Inspired by [20] and using CS-DLMA, we study time-delayed data transmission between agents of a leaderless MADRL system. The same study is carried out for a leader-follower MADRL system. Note that CSMA is an access control protocol in which an agent in the network checks the state of the data channel before transmitting data.
Cyber-attacks can happen to any system, especially those that transmit data. Various adversarial attacks pose a threat to ML algorithms and DL systems [21,22]. Adversarial attacks mislead ML algorithms by manipulating input data to undermine algorithm performance, access the ML model, and modify model behavior [23,24]. Therefore, it is important to study the effects of various attacks on ML algorithms [25]. This paper uses three types of gradient-based adversarial attacks, termed FGSM [26], FGM [27,28], and BIM [27,29], to investigate their effects on the DQN algorithm and, consequently, on MADRL system performance. Reference [20] does not consider cyber-attacks on the system. The authors of [30] have examined the adverse effects of the FGSM attack on DRL-based traffic signal control for single-intersection and multiple-intersection cases; however, its effects on sending and receiving data are not investigated. Hence, the FGSM attack together with the FGM and BIM attacks, which try to fool the NN, are considered in this paper to check their impacts on the data transmission robustness. The authors of [31] have used the discussed attacks to target the observation set provided by the RL algorithm environment; similarly, we apply the three types of adversarial attacks to target the environmental state produced for the DRL algorithm.
There are various defense techniques for ML algorithms, and these adversarial defense methods are used to improve the robustness of a designed model [32]. Among all the presented methods [33][34][35][36], the best defense procedure occurs when the adversarial examples are fed to the NN training process [28]. In this regard, we use the worst perturbation as the input of the NN to train the model and achieve robustness against malicious attacks. This paper scrutinizes adversarial attack issues facing ML algorithms and studies the robustness of time-delayed data transmission between agents of a cluster-based, heterogeneous MADRL system under three types of gradient-based malicious attacks (FGSM, FGM, and BIM adversaries). This study shows how the leaderless or leader-follower MAS performs under time-delayed data transmission. After an attack on a leader-follower MADRL system, this paper presents two adversarial attack defense approaches against gradient-based attacks. This paper does not study the detectability of attacks, but rather how to reverse them.
A potential application of this research is to use the proposed system in smart grids to make the grid more reliable, secure, and efficient. Moreover, by converting static agents to mobile agents and considering relevant contributions to the novel architecture, this system can be used for data transmission between the agents of all types of multi-agent autonomous vehicles, e.g., multi-rescue robots.
After giving a brief introduction on various aspects of this research in Section 1, the background is explained in Section 2. The methodology of the proposed approach is offered in Section 3. Results and discussion on the introduced system and its behavior during adversarial attacks are provided in Section 4. The conclusion and future works are presented in Section 5. A preliminary conference version of this manuscript was submitted and accepted previously [37]. New results on data transmission robustness of the proposed leader-follower MAS due to adversarial attacks are presented for additional comparison and evaluation in this paper.

Background
Decision-making is based on the information received from the environment by an RL or DRL agent. The dynamics of the environment for decision-making are represented by a finite Markov decision process (MDP). The 5-tuple $M = \langle s, a, T, R, \gamma \rangle$ presents an MDP for an RL and DRL system, where $s$ is a finite set of environmental states and $a$ is a finite action set. Moreover, $T(s_t, a_t, s_{t+1}) \to [0, 1]$ is the state-transition probability function: the agent takes action $a_t$ in state $s_t$ and is transferred to state $s_{t+1}$ to take the next action. Further, $R(s_t, a_t, s_{t+1}) = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \to \mathbb{R}^{n}$ is a cumulative reward function, where $r_{t+k+1}$ is the immediate reward and the discount factor $\gamma$ sets the trade-off between an immediate reward and potential future rewards.
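As a small illustration of the cumulative reward described above, the discounted sum can be evaluated for a finite reward sequence. This is only a sketch: the function name and the sample rewards are assumptions made for the example, and a finite horizon stands in for the infinite sum.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards, so each step applies
    # exactly one multiplication by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical immediate rewards r_{t+1}, r_{t+2}, r_{t+3}.
sample_rewards = [1.0, 0.0, 2.0]
sample_return = discounted_return(sample_rewards, gamma=0.5)
print(sample_return)
```

A discount factor close to 1 weights future rewards almost as much as immediate ones, whereas a small discount factor makes the agent short-sighted.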
In the leaderless MAS scenario, all agents communicate with their cluster-mates, as well as agents of other clusters. In the leader-follower MAS scenario, in each cluster, only the preassigned leader communicates with the other agents in the same cluster as well as leaders of different clusters. Thus, data transmission occurs between the leader and the followers of one cluster as well as leaders of clusters. The leaderless and leader-follower MAS are considered as the graph G = (V, E ), where V is the set of all agents, and E ⊆ {(i, j)|i ∈ V, j ∈ V } is the set of all communication links between agents. The agents i and j communicate if and only if (i, j) ∈ E [38].
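The two communication topologies can be sketched as edge sets of the graph G. The snippet below is a minimal illustration under stated assumptions: links are treated as undirected, and the cluster assignment (which agent leads which followers) is hypothetical, chosen only to show the construction.

```python
from itertools import combinations


def leaderless_edges(agents):
    """Leaderless case: every pair of agents shares a communication link."""
    return {frozenset(pair) for pair in combinations(agents, 2)}


def leader_follower_edges(clusters):
    """Leader-follower case.

    `clusters` maps each leader to the list of its followers. Links exist
    between a leader and its own followers, and between any two leaders.
    """
    edges = set()
    for leader, followers in clusters.items():
        for follower in followers:
            edges.add(frozenset((leader, follower)))
    for pair in combinations(clusters.keys(), 2):
        edges.add(frozenset(pair))
    return edges


def can_communicate(i, j, edges):
    """Agents i and j communicate if and only if the link (i, j) is in E."""
    return frozenset((i, j)) in edges


# Hypothetical assignment: agents 1, 3, and 5 lead three clusters.
clusters = {1: [2], 3: [4], 5: []}
edges_leaderless = leaderless_edges([1, 2, 3, 4, 5])
edges_leader_follower = leader_follower_edges(clusters)
```

In this sketch, two followers of different clusters (for example agents 2 and 4) are linked in the leaderless graph but not in the leader-follower graph.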

Methodology
In this Section, the leaderless and leader-follower topologies are introduced. Then, the components of the DQN algorithm (observation, action, state, and reward) are explained. Afterward, the DQN loss for on-time and time-delayed data transmission is justified. Three types of adversarial attacks to target the proposed leader-follower MAS (state of the DQN agent) are extended and explained. Finally, two defense methods against performed adversarial attacks are introduced.

Leaderless and Leader-Follower Topologies
A generic illustration of the MADRL system topology including N static, heterogeneous agents, and P clusters is shown in Figure 1. The leaderless and leader-follower MAS scenarios can be envisioned from the presented topology in Figure 1. The goal of each static agent in this topology is to transfer data with the maximum average reward.

Observation
In this paper, the observation describes the state of the data transmission channels, which are either idle or busy [20]. If the channel between a pair of agents is busy at time step $t$, no data can be transmitted at time step $t+1$ because another agent is transmitting data. However, if this channel is idle, data can be transmitted. The transmitted data either reaches its destination successfully or collides along the way, gets corrupted, and does not reach its destination. Therefore, the observation set defined in [20] is $o_t = \{\text{busy}, \text{idle}, \text{successful}, \text{collided}\}$.
We propose a modified observation set to use in our MADRL system. We add on-time and time-delayed arrival states to the observation set. As in [20], it is first checked whether the data channel between each pair of agents is idle or busy. If the channel is idle, the data can be transferred successfully on-time, successfully with time-delay, or collided. Therefore, in the introduced scenarios of this paper, the novel observation set $o_t = \{\text{busy}, \text{idle}, \text{on-time}, \text{time-delay}, \text{collided}\}$ is proposed.
The lengths of the transferred packets in the network are different and belong to the set $R_c \in \{1, 2, \ldots, R_c^{\max}\}$. When the observation is on-time, each agent in the network transmits the packet during the next $R_c$ mini-slots, with the action time duration in the set $T_d(a_t) \in \{1, 2, \ldots, R_c^{\max}\}$. With the time-delay observation, each agent in the network transmits the packet during the next $R_c$ mini-slots, with the action time duration in the set $T'_d(a_t) \in \{R_c^{\max}+1, R_c^{\max}+2, \ldots\}$. In both on-time and time-delay observations, when an agent transmits a packet in a data transmission channel, no other agent sends data on that channel, so a collision is avoided. In the following, $T_d(a_t)$ and $T'_d(a_t)$ are abbreviated as $T_d$ and $T'_d$, respectively. When the observation is collided, the agent transmits the packet during the next $R_c$ mini-slots; however, another agent transmits data in at least one of those $R_c$ mini-slots. Note that each mini-slot is the time required to perform CSMA. In this paper, each mini-slot is considered a time step.
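The observation logic described above can be sketched as a small classifier. This is an illustrative reading rather than the authors' implementation: the function signature, the `transmitted` flag, and the value used for the maximum packet length are assumptions.

```python
def classify_observation(channel_busy, transmitted, collided,
                         action_duration, rc_max):
    """Map one channel event to an element of the novel observation set.

    action_duration is the action time duration in mini-slots and
    rc_max is the maximum packet length R_c^max.
    """
    if channel_busy:
        return "busy"
    if not transmitted:
        # Channel is free and the agent chose not to send anything.
        return "idle"
    if collided:
        return "collided"
    # A duration within {1, ..., rc_max} counts as on-time;
    # anything longer is a time-delayed arrival.
    if action_duration <= rc_max:
        return "on-time"
    return "time-delay"


# Hypothetical maximum packet length of 10 mini-slots.
print(classify_observation(False, True, False, action_duration=3, rc_max=10))
print(classify_observation(False, True, False, action_duration=11, rc_max=10))
```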

Action
Action is one of the significant components of RL and DRL algorithms [39]. In general, the agent receives the corresponding state from the environment at time step $t$ and performs the appropriate action accordingly. Depending on the quality of the action performed at time step $t$, the agent receives the reward associated with that action at time step $t+1$. According to the observation set in this paper, the agent first checks whether the data channel is idle or busy. This check should take less than one mini-slot. The actions performed in this paper are based on [20], as follows:
• No Selection: If the channel is busy at time step $t$ (checked in less than one mini-slot), the DQN agent does not take any action at the next time step. Hence, $a_{t+1} = 0$.
• Uniform Selection with Probability ε: If the channel is idle at time step $t$, the DQN agent chooses an action (to transfer or not to transfer a packet) at time step $t+1$. If the agent transmits a packet of length $R_c$ during the next $R_c$ mini-slots, then the action at time step $t+1$ is $a_{t+1} = R_c$, where $R_c \in \{0, 1, 2, \ldots, R_c^{\max}\}$. This action selection method is a uniform random selection with probability ε (exploration) using the ε-greedy algorithm.
• Non-uniform Selection with Probability 1 − ε: If the channel is idle at time step $t$, the DQN agent can choose an action to transfer or not to transfer a packet at time step $t+1$. According to the conventional ε-greedy DQN algorithm, the selected action is the one with the maximum Q-value in $\{Q(s, a; \theta) \mid a \in A\}$, where $Q$ is a parametric function of state $s$, action $a$, and the parameter vector $\theta$ containing the weights of the NN, and $A$ is the set of actions. Therefore, the action at time step $t+1$ is given by (1), where $\theta^{-}$ denotes the target Q-value weight. This action selection method is a non-uniform selection with probability 1 − ε (exploitation) using the ε-greedy algorithm.
Note that the ε-greedy algorithm is a widely used policy-based exploration approach in RL and DRL algorithms [1,40].
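The three action rules above can be sketched as a single selection function. This is a minimal sketch, assuming that the Q-values for the current state are already available as a list indexed by action, with index 0 meaning "do not transmit"; the function and parameter names are hypothetical.

```python
import random


def select_action(channel_busy, q_values, epsilon, rng=random):
    """Epsilon-greedy action selection over packet lengths 0..R_c^max.

    q_values[a] is the estimated Q-value of action a for the current state.
    """
    if channel_busy:
        # No selection: the agent stays silent when the channel is busy.
        return 0
    if rng.random() < epsilon:
        # Exploration: uniform random choice among all actions.
        return rng.randrange(len(q_values))
    # Exploitation: action with the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for actions 0, 1, and 2.
print(select_action(False, [0.1, 0.9, 0.3], epsilon=0.0))
```

With `epsilon=0.0` the call is purely greedy, which makes the behavior deterministic and easy to check.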

State
In this paper, there are two types of states: the channel state $cs_{t+1}$ and the DQN algorithm state $s_{t+1}$ [20]. The DQN algorithm state used in the DRL process is based on the channel state. The channel state at time step $t+1$ and the DQN algorithm state built from it follow the definitions of [20], where $L$ is the state history length, i.e., the number of past time steps tracked by the DQN algorithm.
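One common way to track a fixed number of past time steps is a bounded buffer. The sketch below assumes that the DQN state is formed from the most recent L channel states; this is an assumption made for illustration, and the class name, the padding value, and the content of each channel state are hypothetical.

```python
from collections import deque


class StateHistory:
    """Keep the last `history_length` channel states as the DQN state."""

    def __init__(self, history_length, empty_state=None):
        # Pre-fill with a placeholder so the state always has length L.
        self._buffer = deque([empty_state] * history_length,
                             maxlen=history_length)

    def push(self, channel_state):
        """Append the newest channel state and return the DQN state."""
        # With maxlen set, the oldest entry is dropped automatically.
        self._buffer.append(channel_state)
        return tuple(self._buffer)


# Hypothetical channel states written as (action, observation) pairs.
history = StateHistory(history_length=3)
history.push((0, "busy"))
history.push((2, "on-time"))
history.push((3, "time-delay"))
dqn_state = history.push((1, "collided"))
print(dqn_state)
```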

Reward
Selecting a reward function is usually based on what the RL and DRL systems are supposed to do [41]. In this paper, the selection of the reward function first depends on the data specifications, including the length of a sent package and the packet header size [20]. The larger the packet header size, the more problems it causes in sending data in the channel (time-delayed data transmission or data collision). Therefore, an oversized packet header causes the DQN agent to receive less reward. Afterward, we propose another component to obtain a more precise average reward for each agent. Distances between agents may be significant for sending and receiving data and for maintaining distances between clusters. Additionally, the distance between two mobile agents is crucial to avoid collisions (the study of mobile agents is beyond the scope of this paper). Hence, we consider the length of a sent package, the packet's header size, and the distance between a pair of agents in determining the immediate reward function.
In this paper, the utilized rewards are as follows:
• At time step $t$, if the transferred package does not reach the destination (an agent from another cluster or an agent from the same cluster) successfully and collides on the way, the immediate reward for the $i$-th agent at time step $t+1$ is $r^{i}_{t+1} = 0$.
• Using the observation set of [20], if the data packet is successfully transferred by each agent in the network, the immediate reward for the $i$-th agent at time step $t+1$ is given by (2), where $H_p$ is the packet's header and is a part of each mini-slot.
• By proposing on-time and time-delay observations, we append other components to the immediate reward and present a new immediate reward for each agent. Considering the constant $\kappa \in \mathbb{R}^{*}_{+}$ with $\kappa \in [1, \infty)$, the new immediate reward for the $i$-th agent is given by (3), where $\kappa = 1$ if the data packet is transferred by each agent in the network successfully and on-time, and $\kappa > 1$ if the data packet is transferred by each agent in the network successfully and with time-delay.
• Considering the distance between agents that transmit data to each other, we propose another type of immediate reward for the $i$-th agent using the combined immediate reward function [41]. If the data packet is successfully transferred by the $i$-th agent to the $j$-th agent on-time, the novel distance-based immediate reward for the $i$-th agent at time step $t+1$ is given by (4). If the data packet is transferred by the $i$-th agent to the $j$-th agent successfully but with time-delay, the distance-based immediate reward for the $i$-th agent at time step $t+1$ when $\kappa > 1$ is given by (5).
The combined immediate reward function originates from [41]. In this paper, it is obtained based on the positions $(x_i, y_i)$ and $(x_j, y_j)$ of the $i$-th and $j$-th agents, respectively. Nevertheless, the original combined immediate reward function, defined by [41], is based on the current position and the desired position of the $i$-th agent in the MARL system.
Typically, the formal definition of the DQN target Q-value of a state-action pair $(s_t, a_t)$ uses the cumulative reward function $R(s_t, a_t, s_{t+1})$, the discount factor $\gamma \in (0, 1)$, and the target Q-value weight $\theta^{-}$. According to [20], the DQN target Q-value of a state-action pair $(s_t, a_t)$ is given by (7). The Q-value function $Q(s_t, a_t; \theta)$ is defined by using the state-transition probability function $T(s_t, a_t, s_{t+1})$, which defines the conditional probabilities between the states, and $\theta$ is the Q-value weight. According to the gradient method, the parameter $\theta$ is updated as in (9), where $\alpha$ is the learning rate. Both the immediate reward choices (2)-(5) and the final Q-value function, obtained by updating the parameter $\theta$ in (9), are connected to the actual data transmission through two aspects: (i) the characteristics of the transferred packets, including the package length and packet header size; (ii) the distances between neighboring agents, since an unregulated distance between agents delays data transmission.
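To make the roles of the target weight and the learning rate concrete, the sketch below uses the textbook one-step DQN target and a plain gradient-descent step. It is not claimed to match the exact target and update used in this paper, which may include additional terms such as the action time duration; the toy Q-function and all numeric values are assumptions.

```python
def td_target(reward, next_q_target, gamma):
    """Textbook one-step target: r + gamma * max_a' Q(s', a'; theta_minus)."""
    return reward + gamma * max(next_q_target)


def gradient_step(theta, gradient, alpha):
    """Plain gradient descent: theta <- theta - alpha * gradient."""
    return [w - alpha * g for w, g in zip(theta, gradient)]


# Hypothetical toy setting: a single state with Q(s, a; theta) = theta[a].
theta = [0.0, 0.0]
action = 1
target = td_target(reward=1.0, next_q_target=[0.5, 2.0], gamma=0.9)

# The squared error (target - theta[action])**2 has gradient
# -2 * (target - theta[action]) with respect to theta[action].
gradient = [0.0, 0.0]
gradient[action] = -2.0 * (target - theta[action])

theta = gradient_step(theta, gradient, alpha=0.01)
print(target, theta)
```

The target is computed from the frozen weights, so it stays fixed while `theta` moves toward it by a fraction controlled by `alpha`.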

Remark 1.
The learning process in the DQN algorithm is more stable than the Q-learning process because the update rule introduces a delay between the time when the Q-value $Q(s_t, a_t; \theta)$ is updated and the time when the target network $Q(s_{t+1}, a_{t+1}; \theta^{-})$ is updated [42]. Therefore, the target network remains unchanged during this delay.

Theorem 1. Suppose that the MAS including $N$ agents is modeled by a graph $G$, and the learning process is performed by the DQN algorithm. If the $i$-th agent transfers data to the $j$-th agent successfully and with time-delay, then the average approximated cumulative team discounted reward of a state-action pair $(s_t, a_t)$ satisfies inequality (10).

Proof. To avoid time-delay, we consider the action time duration $T_d \in \{1, 2, \ldots, R_c^{\max}\}$. With the time-delay, we assume that the action time duration is unbounded above, as expressed in (12). By considering a specific value of $\gamma \in (0, 1)$, (13) holds, and for high values of the action time duration, (14) follows. Using (13) and (14), (15) is obtained for the $i$-th agent. According to Remark 1, by considering the maximum target network value among the possible actions that can be taken from the next state and using (15), (16) is valid. Utilizing (12) and (16) yields an intermediate inequality, from which (18) follows. To achieve the least amount of training loss, the difference between the target Q-value and the predicted Q-value should converge to zero; hence, the limit condition (19) can be considered for the $i$-th agent. Substituting (19) in (18) yields (20). According to the monotone convergence condition, (21) holds. By considering (21), inequality (20) is modified to give inequality (23). By averaging each side of inequality (23) and redistributing each side over time $t$, (24) is obtained. Therefore, by considering inequality (24) and condition (25) for a MAS including $N$ agents, inequality (10) is proven.

Theorem 2. Suppose that graph $G$ as a MAS includes $N$ agents.
The distance between the $i$-th agent and the $j$-th agent is $d_{ij}$ such that $\xi \le d_{ij} \le \lambda$, where $\xi, \lambda \in \mathbb{R}^{*}_{+}$ are constant values and $\xi \neq \lambda$. Using the results of [41], it is expected that by considering $\xi \le d_{ij} \le \lambda$, the distance-based immediate reward (5) improves the DQN learning process and compensates for the negative effect of the time-delayed data transmission. Therefore, the average approximated cumulative team discounted reward of a state-action pair $(s_t, a_t)$ satisfies the following inequality.

Proof. In the case of time-delayed data transmission, the distance-based immediate reward, which is calculated based on the distance $d_{ij}$ between the $i$-th and $j$-th neighboring agents, assists the learning process of the DQN agent. Therefore, this immediate reward compensates for the negative effect of time-delayed data transfer at time step $t$ and leads the agent to take a more appropriate action at the next time step $t+1$.
Since the agents are static and the distance $d_{ij}$ between them is constant, the distance-based immediate reward (in combination with the package length and packet header size) helps the DQN agent to adjust the learning process over time in terms of data transmission speed. Hence, the Q-value is improved at each time step by speeding up the data transmission. This trend continues, so that at higher time steps the approximated cumulative team discounted reward of time-delayed data transmission increases more than that of on-time data transmission.

DQN Loss
The output layer loss function of the DQN algorithm's NN is the mean squared error (MSE) loss function. As the DQN loss decreases, the DQN reward increases; therefore, the DQN reward performance can be predicted by observing the DQN loss behavior. In [43], for uniform action selection, the DQN loss function is given by (28), where $B_e$ is the experience replay mini-batch size. Using the non-uniform action (1), with the set $R_c \in \{0, 1, 2, \ldots, R_c^{\max}\}$ as possible actions, and applying the target Q-value (7) as well as the predicted Q-value, the DQN loss function for non-uniform action selection with action time duration $T_d \in \{1, 2, \ldots, R_c^{\max}\}$ is defined by (29), where $e_t = (s_t, a_t, T_d, r_{t+1}, s_{t+1})$ is the experience at time $t$ that is obtained from the channel-level experience $(cs_t, a_t, T_d, r_{t+1}, cs_{t+1})$. Note that Equation (28) is derived from Equation (29) if $T_d = 1$. Moreover, the time-delayed DQN loss function for action time duration $T'_d \in \{R_c^{\max}+1, R_c^{\max}+2, \ldots\}$ is given by (30) for the experience $e'_t = (s_t, a_t, T'_d, r_{t+1}, s_{t+1})$ at time $t$ that is obtained from the channel-level experience $(cs_t, a_t, T'_d, r_{t+1}, cs_{t+1})$. Note that the average loss calculations based on experiences $e_t$ and $e'_t$ are performed from $m = 1$ to $B_e$, the experience replay mini-batch size.
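The averaging over the mini-batch can be illustrated with a generic MSE computation between target and predicted Q-values. This sketch shows only the averaging step; the function name and the sample values are assumptions, and how the targets are formed for on-time or time-delayed experiences is left outside the example.

```python
def dqn_mse_loss(targets, predictions):
    """Mean squared error between target and predicted Q-values.

    Both lists hold one entry per experience in the replay mini-batch,
    so their common length plays the role of the mini-batch size B_e.
    """
    if not targets or len(targets) != len(predictions):
        raise ValueError("targets and predictions must be non-empty "
                         "and of equal length")
    squared_errors = [(y - q) ** 2 for y, q in zip(targets, predictions)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical mini-batch of two experiences.
print(dqn_mse_loss([1.0, 2.0], [0.0, 4.0]))
```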

Adversarial Attacks
Three types of gradient-based adversarial attacks are considered to benchmark the data transmission robustness of the proposed leader-follower MAS with the new observation set and the proposed distance-based immediate reward (Figure 2). The changes made by this type of attack are very subtle, but they can still affect the system's performance. Before an attack occurs, the DQN algorithm aims to reduce the average training loss in a given time step and enhance the average reward.

FGSM Adversarial Attack
The FGSM is a type of attack proposed in [26]. Our methodology involves attacking the system's state by having the FGSM adversary make very small changes to the state over a period of time to increase the system's average training loss. Using the gradient of the loss function with respect to the state, FGSM maximizes the perturbation and minimizes the difference between the perturbed and original inputs [26,30,44]. In this regard, using (29) and (30), for on-time and time-delayed data transmission, respectively, the FGSM attack signal (perturbation) $\eta$ is obtained by (31), where $\epsilon$ is the attack magnitude that ensures the perturbations are small, and $\mathrm{sign}(\cdot)$ is the sign function. Further, $\nabla_s$ denotes the gradient of the DQN agent's loss function with respect to the model state $s$ for the correct action $a$, and $\theta$ denotes the model parameters. After adding the attack signal to the state $s$, the adversarial input $s^{adv}$ is calculated by (32), with the $L_\infty$-norm bound $\|\eta\|_\infty \le \epsilon$ for the optimal perturbation $\eta$. Using (30) and (32), the adversarial input $s^{adv}$ for time-delayed data transmission is given by (33). Once $s^{adv}$ is calculated, it is fed to the NN and replaces the primary input $s$ of the NN. The NN is fooled and trained based on the adversarial input $s^{adv}$.
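In its standard form, the FGSM perturbation scales the sign of the loss gradient by the attack magnitude and adds it to the state. The sketch below follows that standard form and takes the gradient as a given list of numbers; computing the gradient from a trained DQN is outside the example, and the function names and sample values are assumptions.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fgsm_perturbation(loss_gradient, epsilon):
    """Standard FGSM signal: epsilon * sign(gradient of loss w.r.t. state)."""
    return [epsilon * sign(g) for g in loss_gradient]


def apply_perturbation(state, perturbation):
    """Adversarial input: original state plus the attack signal."""
    return [s + p for s, p in zip(state, perturbation)]


# Hypothetical state and loss gradient; epsilon matches the magnitude
# reported in the experimental setup.
state = [1.0, 0.0, 2.0]
loss_gradient = [0.3, -1.2, 0.0]
eta = fgsm_perturbation(loss_gradient, epsilon=0.6)
state_adv = apply_perturbation(state, eta)
print(eta, state_adv)
```

Because every component of the perturbation is either zero or of size epsilon, the infinity-norm bound on the perturbation holds by construction.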

FGM Adversarial Attack
The FGM attack signal is a generalization of the FGSM attack signal and is calculated as in (34). Using (34), the adversarial input $s^{adv}$ is calculated by (35), with the $L_2$-norm bound $\|\eta\|_2 \le \epsilon$ for the optimal perturbation $\eta$. By substituting (30) in (35), the adversarial input $s^{adv}$ for time-delayed data transmission is obtained. The training procedure is performed similarly to the FGSM adversarial attack.
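A common form of the FGM perturbation rescales the loss gradient to have Euclidean norm equal to the attack magnitude, instead of taking its sign. The sketch below uses that common form; the handling of a zero gradient and all sample values are assumptions.

```python
import math


def fgm_perturbation(loss_gradient, epsilon):
    """Common FGM signal: epsilon * gradient / ||gradient||_2."""
    norm = math.sqrt(sum(g * g for g in loss_gradient))
    if norm == 0.0:
        # No direction to follow when the gradient vanishes (assumption).
        return [0.0 for _ in loss_gradient]
    return [epsilon * g / norm for g in loss_gradient]


# Hypothetical gradient with Euclidean norm 5.
eta = fgm_perturbation([3.0, 4.0], epsilon=0.6)
eta_norm = math.sqrt(sum(p * p for p in eta))
print(eta, eta_norm)
```

Unlike FGSM, the perturbation keeps the relative size of the gradient components, so coordinates with a larger gradient are perturbed more.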

BIM Adversarial Attack
The BIM attack is a simple and straightforward extension of the FGSM attack proposed by [29]. Instead of applying the perturbation in a single step, this method applies the fast gradient step multiple times with a small step size. The BIM attack signal and the adversarial input $s^{adv}$ are given by (36) and (37), where $\beta = \epsilon / T$ is a small step size and $T$ is the number of iterations. Using (30) in (37), the adversarial input $s^{adv}_{t+1}$ for time-delayed data transmission is obtained.
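The iterative idea can be sketched as repeated signed-gradient steps of size epsilon divided by the number of iterations. This is a simplified sketch: the loss gradient is supplied as a callable, the toy quadratic loss is hypothetical, and no separate clipping step is included because the total displacement per coordinate cannot exceed epsilon with this step size.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def bim_attack(state, loss_gradient, epsilon, iterations):
    """Apply `iterations` signed-gradient steps of size epsilon / iterations."""
    beta = epsilon / iterations
    adversarial = list(state)
    for _ in range(iterations):
        # The gradient is re-evaluated at the current adversarial input.
        gradient = loss_gradient(adversarial)
        adversarial = [s + beta * sign(g)
                       for s, g in zip(adversarial, gradient)]
    return adversarial


def toy_loss_gradient(state):
    """Gradient of the hypothetical loss sum(s_k**2)."""
    return [2.0 * s for s in state]


state = [1.0, -1.0]
state_adv = bim_attack(state, toy_loss_gradient, epsilon=0.6, iterations=3)
print(state_adv)
```

A small iteration count is used here only to keep the example readable; the mechanism is the same for larger counts.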

First Adversarial Attack Defense
We provide a simple but effective approach to defend against adversarial attacks and mitigate their destructive effects on the MADRL system performance (Figure 3). In the proposed Algorithm 1, which is based on NN behavior, we apply the $\arg\max$ operation to the perturbation vector $\eta$ to find the argument that gives the maximum value of $\eta$. In other words, we seek the state that provides the maximum perturbation value. We assume that the FGSM, FGM, or BIM adversarial attacks are detectable. Once one of the FGSM, FGM, or BIM adversarial attacks is detected, the state vector $s^{*}$ is calculated based on the set of states (inputs) and substituted with the perturbation vector $\eta$ as in Equation (38), where $\eta$ and $s^{*}$ are $n \times 1$ vectors. The state vector $s^{*}$, which determines the maximum perturbation value, has the worst effect on the MAS performance; however, the deep NN learns from the state vector $s^{*}$ and uses its negative feedback to improve the system performance during an adversarial attack. For the BIM adversarial attack, Equation (38) is presented in the form $s^{*}_{n \times 1} = \arg\max(\cdot)$.

BIM Adversarial Attack Defense
Using (36) and (37), the state vector $s^{*}$ and the adversarial input $s^{adv}_{t+1}$ are calculated to defend against the BIM adversarial attack.

Second Adversarial Attack Defense
We provide another effective method to mitigate the destructive effects of gradient-based attacks on the MADRL system performance and to defend against the discussed adversarial attacks. We assume that the FGSM, FGM, or BIM adversarial attacks are detectable. In Algorithm 2, which is an extension of Algorithm 1, once one of the FGSM, FGM, or BIM adversarial attacks is detected, a convert function changes the sign of the state. Hence, before the attacker can confuse the NN, the state is modified and replaced with the correct state that was fed to the NN. This is done to mislead the attacker so that the attacker generates the attack signal $\eta$ based on the converted state. Changing the sign of the state not only fools the attacker and reduces the attack's destructive effects but also allows the attack signal generated by the attacker to be used for appropriate NN training. The remainder of Algorithm 2 performs similarly to Algorithm 1.

Results and Discussion
We illustrate results for on-time and time-delayed data transmission between agents of heterogeneous MAS with and without a leader, using the DQN algorithm. Additionally, the impacts of FGSM, FGM, and BIM attacks, as well as the consequences of the defense algorithms on the proposed leader-follower system, are illustrated and shown numerically.
Two types of graphs $G$ are considered: complete (leaderless) and connected (leader-follower) graphs. The leaderless and leader-follower scenarios, including N = 5 static, heterogeneous agents and P = 3 clusters, are illustrated in Figure 4. In both types of graphs, the DQN agent's internal structure consists of a feed-forward NN architecture for training, including 36 layers with the Adam optimizer and the MSE loss function. Note that we used trial and error to choose the number of NN hidden layers: for each candidate number of layers, we performed the learning process five times to reach a definite result. The activation function of all 36 layers is the rectified linear unit (ReLU) function. The DQN agent learning rate is α = 0.01, the discount factor is γ = 0.999, the experience replay mini-batch size is $B_e = 64$, and the constant positive real number used to calculate the immediate reward $r^{i}_{t+1}$ is κ = 4 if the data packet is transferred in the network with time-delay. The packet's header for all agents is $H_p = 0.5$. The threshold to determine on-time or time-delayed data transmission is 11 mini-slots. To compute the attack signal, the attack magnitude is $\epsilon = 0.6$, and the number of iterations is $T_{\max} = 30{,}000$. The five agents' two-dimensional positions $(x_i, y_i)$, where $i \in \{1, 2, \ldots, 5\}$, are used to obtain the distance-based immediate rewards of (4) and (5) and remain constant during the total time steps because the agents are static. In contrast, when agents are mobile, their positions should be updated and added to the list of former positions at every time step, as we will investigate in subsequent research. Moreover, each element of the novel observation set has an associated returned value. All scenarios are carried out over 20,000 time steps for the data transmission part of the experiment.
The experiments are performed during the 30,000 timesteps while investigating the data transmission robustness due to various adversarial attacks. The results are shown after five times training to ensure the reliability of the results.
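The settings listed above can be collected in a single configuration object. The following is only a minimal sketch: the class, field, and function names are hypothetical, and the rule that a transmission taking exactly 11 mini-slots still counts as on-time is an assumption, since the text does not say whether the threshold is inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Values quoted in the text; all field names are illustrative."""
    n_agents: int = 5                # N
    n_clusters: int = 3              # P
    n_layers: int = 36               # feed-forward layers, all with ReLU
    learning_rate: float = 0.01      # alpha
    discount: float = 0.999          # gamma
    batch_size: int = 64             # B_e, experience replay mini-batch
    kappa: float = 4.0               # reward constant under time-delay
    packet_header: float = 0.5       # H_p
    delay_threshold: int = 11        # mini-slots
    attack_magnitude: float = 0.6    # magnitude of the attack signal
    attack_iterations: int = 30_000  # T_max
    transmission_steps: int = 20_000
    robustness_steps: int = 30_000


def classify_transmission(mini_slots: int, cfg: ExperimentConfig) -> str:
    """Label a transmission by comparing its duration with the threshold.

    Assumption: exactly `delay_threshold` mini-slots is still on-time.
    """
    return "on-time" if mini_slots <= cfg.delay_threshold else "time-delay"


cfg = ExperimentConfig()
for slots in (8, 11, 15):
    print(slots, classify_transmission(slots, cfg))
```

Freezing the dataclass keeps the settings immutable across the five repeated training runs, which makes it harder to change a value between runs by accident.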
We have used, modified, and extended part of the code given in [45] in our implementation. The algorithms were executed on a system with a 3.60 GHz Intel Core i7-7700 (x64-based) processor, 16 GB of installed RAM, and a 64-bit operating system.

Multi-Agent Performance Analysis
According to Table 1, when neither the distance-based reward nor time-delay is considered, both the leaderless and leader-follower MAS scenarios achieve a higher team reward than when the packets are transferred in the network with time-delay. Without the distance-based reward, introducing time-delay reduces the team reward by 27.04% for the leaderless MAS and by 9.91% for the leader-follower MAS. Figure 5 illustrates the reward convergence of a heterogeneous MAS, including N = 5 agents in P = 3 different clusters, during 20,000 time steps for the leaderless and leader-follower scenarios by considering on-time and time-delay observations. Based on the results in Table 1 and Figure 5, a delay in sending data reduces the team reward in both the leaderless and leader-follower scenarios.

Table 1. Comparison of each agent's average reward and DQN loss of a heterogeneous MAS, including N = 5 agents in P = 3 different clusters, during 20,000 time steps without considering the distance-based reward.

As can be seen from Tables 1 and 2, the simulation confirms the claim of Theorem 2: the distance-based immediate reward improves the system performance despite the time-delayed data transmission, regardless of whether the system is leaderless or leader-follower. Moreover, according to Table 1, when neither the distance-based reward nor time-delay is considered, the DQN algorithm in both the leaderless and leader-follower MAS scenarios achieves a lower average loss than when the packets are transferred in the network with time-delay. With time-delay, the average DQN loss increases by 102.97% for the leaderless MAS and by 14.67% for the leader-follower MAS.
Figure 6 shows the DQN loss convergence of a heterogeneous MAS, including N = 5 agents in P = 3 different clusters, during 20,000 time steps for the leaderless and leader-follower scenarios by considering on-time and time-delay observations. The large fluctuations in the loss after 10,000 time steps in Figure 6b,d are due to the delay in data transmission.

Table 2. Comparison of each agent's average reward and DQN loss of a heterogeneous MAS, including N = 5 agents in P = 3 different clusters, during 20,000 time steps by considering the novel distance-based reward.

According to Table 2, when the distance-based reward is included, both the leaderless and leader-follower MAS scenarios achieve a higher team reward with time-delay than when the packets are transferred in the network on time: the team reward increases by 27.72% for the leaderless MAS and by 30.18% for the leader-follower MAS. Figure 7 shows the reward convergence of a heterogeneous MAS, including N = 5 agents in P = 3 different clusters, during 20,000 time steps for the leaderless and leader-follower scenarios by considering on-time and time-delay observations as well as the distance-based reward. Based on the results in Table 2 and Figure 7, the proposed distance-based immediate reward, in combination with the previous immediate reward, compensates for the negative effects of data transmission delays in both the leaderless and leader-follower topologies.

Based on Table 2, with the distance-based reward and time-delay, the DQN algorithm in both the leaderless and leader-follower MAS scenarios incurs a higher loss than when the packets are transferred in the network on time: the average DQN loss increases by 7.54% for the leaderless MAS and by 93.94% for the leader-follower MAS. Figure 8 demonstrates the DQN loss convergence of a heterogeneous MAS, including N = 5 agents in P = 3 different clusters, during 20,000 time steps for the leaderless and leader-follower scenarios by considering on-time and time-delay observations as well as the distance-based reward. The time-delayed data transmission causes large fluctuations in the loss after 10,000 time steps in Figure 8b,d. As can be seen from Tables 1 and 2, in scenarios where data is transmitted with time-delay, the average DQN loss is higher than in the cases where data is transferred on time.

Performance Analysis of the Proposed MAS under Adversarial Attacks
According to Table 3, with time-delayed data transmission and the distance-based reward, the team reward of the leader-follower MADRL system, including N = 5 agents in P = 3 clusters, equals 0.3415 without an adversarial attack (Figure 9a). Under the same conditions, the DQN loss of the system is 6996.28 (Figure 10a). Under the FGSM attack, the team reward decreases by 5.24% to 0.3236 (Figure 9b), and the DQN loss increases by 199.01% to 20,920.10 (Figure 10b). Under the FGM attack, the team reward decreases by 10.57% to 0.3054 (Figure 9c), and the DQN loss increases by 760.92% to 60,232.71 (Figure 10c). Under the BIM attack, the team reward decreases by 14.23% to 0.2929 (Figure 9d), and the DQN loss increases by 299.49% to 27,949.57 (Figure 10d). Hence, the time-delayed data transmission of the proposed leader-follower MADRL system is not robust to any of the three adversarial attacks over 30,000 time steps: after an attack, its team reward decreases and its DQN loss increases.
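For readers unfamiliar with the three attacks, the sketch below shows their standard textbook update rules on a toy quadratic loss. It is not the implementation used in the experiments: the loss, the weight vector `W`, the input `x0`, and the BIM step size and iteration count are hypothetical stand-ins for the DQN loss and its observations, and only the attack magnitude 0.6 is taken from the setup.

```python
import math
from typing import Callable, List

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def sign(v: float) -> int:
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(x: Vector, grad_fn: GradFn, eps: float) -> Vector:
    """Fast gradient sign method: one step of size eps along sign(grad)."""
    g = grad_fn(x)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def fgm(x: Vector, grad_fn: GradFn, eps: float) -> Vector:
    """Fast gradient method: one step of L2 length eps along the gradient."""
    g = grad_fn(x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(x)  # zero gradient: no direction to perturb along
    return [xi + eps * gi / norm for xi, gi in zip(x, g)]


def bim(x: Vector, grad_fn: GradFn, eps: float, step: float, n_iter: int) -> Vector:
    """Basic iterative method: repeated small sign steps, each followed by
    clipping back into the L-infinity ball of radius eps around x."""
    x_adv = list(x)
    for _ in range(n_iter):
        g = grad_fn(x_adv)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv


# Toy differentiable loss (hypothetical): L(x) = 0.5 * (w . x)^2.
W = [1.0, -2.0, 0.5]


def loss(x: Vector) -> float:
    return 0.5 * sum(wi * xi for wi, xi in zip(W, x)) ** 2


def loss_grad(x: Vector) -> Vector:
    residual = sum(wi * xi for wi, xi in zip(W, x))
    return [residual * wi for wi in W]


x0 = [1.0, 1.0, 1.0]
EPS = 0.6  # attack magnitude from the experimental setup
x_fgsm = fgsm(x0, loss_grad, EPS)
x_fgm = fgm(x0, loss_grad, EPS)
x_bim = bim(x0, loss_grad, EPS, step=0.1, n_iter=10)
for name, x_adv in (("clean", x0), ("FGSM", x_fgsm), ("FGM", x_fgm), ("BIM", x_bim)):
    print(name, loss(x_adv))
```

FGSM and BIM bound the perturbation in the L-infinity norm, whereas FGM bounds it in the L2 norm, so the same magnitude can produce perturbations of quite different overall size across the three attacks.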

Performance Analysis of the Proposed MAS after Applying First Adversarial Attack Defense
According to Tables 3 and 4, the proposed adversarial attack defense Algorithm 1 mitigates the destructive effects of the FGSM, FGM, and BIM attacks over 30,000 time steps. Against the FGSM attack, the team reward of the leader-follower MADRL system rises from 0.3236 to 0.3433 (+6.08%) after applying the defense (Figure 11b), and the DQN loss falls from 20,920.10 to 8081.50 (−61.36%) (Figure 12b). Against the FGM attack, the team reward rises from 0.3054 to 0.3342 (+9.43%) (Figure 11c), and the DQN loss falls from 60,232.71 to 6966.24 (−88.43%) (Figure 12c). Against the BIM attack, the team reward rises from 0.2929 to 0.3336 (+13.89%), and the DQN loss falls from 27,949.57 to 3705.51 (−86.74%) (Figures 11d and 12d).

According to Tables 3 and 5, the proposed adversarial attack defense Algorithm 2 also mitigates the destructive effects of the FGSM, FGM, and BIM attacks over 30,000 time steps. Against the FGSM attack, the team reward of the leader-follower MADRL system rises from 0.3236 to 0.3563 (+10.10%) after applying the defense (Figure 13b), and the DQN loss falls from 20,920.10 to 2905.81 (−86.11%) (Figure 14b). Against the FGM attack, the team reward rises from 0.3054 to 0.3187 (+4.35%) (Figure 13c), and the DQN loss falls from 60,232.71 to 4100.37 (−93.19%) (Figure 14c). Against the BIM attack, the team reward rises from 0.2929 to 0.3292 (+12.39%), and the DQN loss falls from 27,949.57 to 4526.58 (−83.80%) (Figures 13d and 14d).
Figure 15a,b shows the team reward and the DQN loss, respectively, before and after applying defense Algorithm 1 against the various adversarial attacks. Figure 16a,b shows the same quantities for defense Algorithm 2.
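The percentages quoted for the two defenses are relative changes with respect to the attacked system, that is, 100 · (after − before) / before. The short script below recomputes them from the team rewards and DQN losses quoted in the text; the function and variable names are illustrative. The recomputed values agree with the reported ones up to rounding in the last digit.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# (under attack, after Algorithm 1, after Algorithm 2), as quoted in the text
team_reward = {
    "FGSM": (0.3236, 0.3433, 0.3563),
    "FGM": (0.3054, 0.3342, 0.3187),
    "BIM": (0.2929, 0.3336, 0.3292),
}
dqn_loss = {
    "FGSM": (20920.10, 8081.50, 2905.81),
    "FGM": (60232.71, 6966.24, 4100.37),
    "BIM": (27949.57, 3705.51, 4526.58),
}

for attack in team_reward:
    r_att, r_d1, r_d2 = team_reward[attack]
    l_att, l_d1, l_d2 = dqn_loss[attack]
    print(
        f"{attack}: reward {relative_change(r_att, r_d1):+.2f}% (Alg. 1), "
        f"{relative_change(r_att, r_d2):+.2f}% (Alg. 2); "
        f"loss {relative_change(l_att, l_d1):+.2f}% (Alg. 1), "
        f"{relative_change(l_att, l_d2):+.2f}% (Alg. 2)"
    )
```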

Variety of Agents
According to [20], the proposed topology contains three types of agents assigned to P = 3 different clusters. The agents in the first cluster use a DQN architecture. The agents in the second cluster follow the ALOHA protocol [46][47][48], and the agents in the third cluster follow the time division multiple access (TDMA) protocol [49]. This paper focuses on the behavior of the DQN agents (first cluster) and its effect on the performance of the other clusters of agents in the MADRL system in different situations.
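To make the behavior of the non-learning clusters concrete, the sketch below models the two fixed protocols as simple per-slot transmit rules on a shared channel. It rests on several assumptions that the text does not specify: the ALOHA agent is modeled as slotted ALOHA with a fixed transmit probability `q`, the TDMA agent owns fixed slots in a repeating frame, and a slot succeeds only when exactly one agent transmits. All class names and parameter values are hypothetical.

```python
import random
from typing import Iterable, Optional


class TdmaAgent:
    """Transmits only in its assigned slots of a repeating frame."""

    def __init__(self, frame_length: int, assigned_slots: Iterable[int]) -> None:
        self.frame_length = frame_length
        self.assigned_slots = set(assigned_slots)

    def transmits(self, t: int) -> bool:
        return (t % self.frame_length) in self.assigned_slots


class AlohaAgent:
    """Slotted ALOHA (assumed variant): transmits with probability q per slot."""

    def __init__(self, q: float, seed: Optional[int] = None) -> None:
        self.q = q
        self.rng = random.Random(seed)

    def transmits(self, t: int) -> bool:
        return self.rng.random() < self.q


def slot_outcome(n_transmitters: int) -> str:
    """Shared-channel outcome for one slot under the collision model above."""
    if n_transmitters == 0:
        return "idle"
    return "success" if n_transmitters == 1 else "collision"


# Hypothetical environment seen by the DQN agent: one TDMA and one ALOHA agent.
agents = [TdmaAgent(frame_length=10, assigned_slots=[0, 1, 2]), AlohaAgent(q=0.2, seed=0)]
counts = {"idle": 0, "success": 0, "collision": 0}
for t in range(1000):
    counts[slot_outcome(sum(agent.transmits(t) for agent in agents))] += 1
print(counts)
```

Under this model, the idle slots are the ones a learning agent could use without causing a collision, which is why the channel state matters to the DQN agent in the first cluster.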

Conclusions
We studied the on-time and time-delayed data transmission of a leaderless (complete graph), heterogeneous MADRL system using the DQN algorithm, as well as that of a leader-follower (connected graph), heterogeneous MADRL system. The investigation was carried out on a cluster-based MAS, and the MADRL system's performance was studied under various conditions. We proposed a novel immediate reward that includes a new version of the distance-based reward. We used three types of adversarial attacks to check the data transmission robustness of the MADRL system, and we introduced two approaches to defend against these attacks and mitigate their destructive effects. The results of the various scenarios were demonstrated and compared with each other numerically.
Future work will consider agents with different NN architectures in the MADRL system to reach position consensus. Further, adversarial attack detection will be considered, and another adversarial attack defense approach will be introduced. Furthermore, the proposed model will be extended to examine obstacle and collision avoidance.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: