RL-Based Detection, Tracking, and Classiﬁcation of Malicious UAV Swarms through Airborne Cognitive Multibeam Multifunction Phased Array Radar

: Detecting, tracking, and classifying unmanned aerial vehicles (UAVs) in a swarm presents signiﬁcant challenges due to their small and diverse radar cross-sections, multiple ﬂight altitudes, velocities, and close trajectories. To overcome these challenges, adjustments of the radar parameters and/or position of the radar (for airborne platforms) are often required during runtime. The runtime adjustments help to overcome the anomalies in the detection, tracking, and classiﬁcation of UAVs. The runtime adjustments are performed either manually or through ﬁxed algorithms, each of which can have its limitations for complex and dynamic scenarios. In this work, we propose the use of multi-agent reinforcement learning (RL) to carry out the runtime adjustment of the radar parameters and position of the radar platform. The radar used in our work is a multibeam multifunction phased array radar (MMPAR) placed onboard UAVs. The simulations show that the cognitive adjustment of the MMPAR parameters and position of the airborne platform using RL helps to overcome anomalies in the detection, tracking, and classiﬁcation of UAVs in a swarm. A comparison with other artiﬁcial intelligence (AI) algorithms shows that RL performs better due to the runtime learning of the environment through rewards.


Introduction
Unmanned aerial vehicles (UAVs), commonly referred to as drones, have gained tremendous popularity over the past decade [1,2]. Nowadays, UAVs are used in numerous applications [3,4], and their usage is expected to increase in the future [5]. However, UAVs can also be exploited for malicious purposes, posing significant threats [6]. The main reason behind these threats is the limitations in the early detection, tracking, and classification of malicious UAVs at long ranges due to their small radar cross-section (RCS) and their ability to fly close to the terrain [6]. Additionally, the challenges of detecting, tracking, and classifying UAVs are heightened when they fly in a swarm, as these UAVs can have a varying RCS and velocity, and follow complex, time-varying trajectories in close proximity to each other.
Numerous research efforts are underway to develop novel methods for UAV detection, tracking, and classification [7]. Detection methods can be broadly classified into two categories: non-radar-based and radar systems [6]. Popular non-radar-based methods include electro-optical/infrared, radio frequency (RF) analysis, and the analysis of sound emissions from the UAV [6,8]. The majority of non-radar systems have no active emissions. The passive detection of UAVs using non-radar systems has limitations discussed in [6]. Compared to non-radar methods, radar-based methods are popular and are widely used • Common UAVs cannot carry a large multifunction radar due to weight and power constraints. To overcome this limitation, we propose an approach that employs multiple small MMPARs carried by UAVs. These radar nodes are networked to provide a cumulative radar response, allowing us to detect, track, and classify malicious UAVs in a swarm. • By using multiple MMPAR nodes, our approach eliminates the risk of a single-point failure due to malfunction or external jamming. • Using multiple MMPAR nodes onboard UAVs provides superior spatial coverage and mobility compared to a single radar node on the ground. • The use of multiple radar beams in MMPAR and the ability to adaptively schedule the beams help to accurately resolve multiple targets in the range, Doppler, and angular domains. This is because the multiple beams provide better spatial coverage and resolution, while adaptive scheduling ensures that the beams are directed towards areas of interest, where potential targets may be present. Additionally, the use of multiple beams also helps to mitigate the effects of clutter and interference, which can degrade radar performance. • The multifunction beams generated by the MMPAR can be used for other tasks in addition to the main radar task. For example, the beams can be used for communication purposes or for RF passive listening, which can be performed simultaneously by sharing the radar resources adaptively. This allows for the efficient utilization of the MMPAR resources, enabling the UAVs to perform multiple tasks with a single device. • Anomaly detection and removal is an important aspect of our approach, as it helps to reduce false alarms and improve the accuracy of the system. By using RL, the system is able to learn from its environment and adjust its parameters to improve its performance. This can help to identify and remove anomalies in real time, leading to the more reliable and efficient detection, tracking, and classification of UAVs. • Our approach works optimally in complex and dynamically changing scenarios, e.g., UAV swarms, clutter, and jamming.
To the best of our knowledge, MMPAR using multi-agent RL to detect, track, and classify UAVs in a swarm is not available in the literature. A comparison of our work with the corresponding literature is provided in Table 1. The rest of the paper is organized as follows: Section 2 provides the details of the MMPAR setup onboard UAV, the multi-agent RL used in our work is given in Section 3, simulation setup and results are provided in Section 4, and Section 5 concludes the paper.

Multibeam Multifunction Phased Array Radar Onboard UAVs
In this section, the beam steering and scheduling of MMPAR are discussed. The details of MMPAR and target parameters are also provided in this section.

Steering and Scheduling of Multifunction Phased Array Beams
In our approach, we consider that there are N UAVs, each of which carries an MMPAR. The MMPAR on multiple UAVs is networked through communication beams. Each MMPAR has B steerable phased array beams. Each beam performs a given function. The radiation pattern of the j-th beam for j = 1, 2, · · · , B is given as where F j θ, φ is the overall radiation pattern of the j-th beam, and θ, and φ represent the angles in the azimuth and elevation planes, respectively, F | ∀j. The angular step between any two steering angles in the azimuth and elevation planes is represented as ∆θ and ∆φ, respectively.
Our approach uses six beams, as shown in Figure 1, denoted by B = 6. These beams include two communication beams, a volume search beam, a cued search beam, a track beam, and a passive RF listener beam. Each beam has unique characteristics that depend on its function. For instance, the half-power beamwidths of the communication, volume search, and passive RF listener beams are larger compared to the cued search beam. The tracking beam has the highest angular resolution and gain and is used to estimate the final parameters of the target. Although the large beamwidths of the volume search, communication, and passive RF listener beams allow for large spatial area coverage during a scan, the angular resolution is relatively small. Furthermore, to maintain simplicity, we have taken the steering limits of the beams to be the same in both the azimuth and elevation planes.
Algorithm 1 outlines the working of the MMPAR and beam scheduling. The six beams are scheduled adaptively based on the detected targets. At the beginning of the update interval δt, beam scheduling requests are received. The scheduling of beams for the duration δt is determined based on the received requests, the current state of the beams, the target state, and the priority sequence. The details of the functions performed and the scheduling of the beams are as follows: (x (rad) (t), y (rad) (t), z (rad) (t), v (rad) (t)) Volume search beam scanning in the azimuth and elevation planes, respectively: Communication beams aligned to fixed nodes on ground Communications with other radar platform onboard UAV • To improve the detection and tracking of UAVs, the volume search beam is designed with the largest coverage among all the beams. This enables the radar to scan a vast volume of space during the given time interval δt. However, once a target is detected, the scheduling priority of the volume search beam corresponding to that target is reduced for future update intervals. This approach ensures that the other beams with higher scheduling priority can focus on tracking the target with more accuracy and efficiency. • The cued search beam is used to confirm the presence of a target detected by the volume search. Both the cued search and volume search beams can scan simultaneously. • The tracking beam is directed towards the coordinates provided by the cued search beam after a time interval of δt. As the coverage area of the track beam is significantly larger than the physical tracks covered by a target during δt, the target remains within the coverage area. Due to its high angular and range resolution, the tracking beam can distinguish between multiple targets in the range and angular bins, and estimate their position, RCS, and Doppler estimates. Additionally, the tracking beam has a higher priority (after confirming target presence) than other beams. Targets are tracked adaptively, meaning that mobile/fast-moving targets are visited frequently compared to static/slow targets to update their state. Figure 3 illustrates the interaction of different types of beams with targets.

MMPAR onboard UAV2
Active RF emission by the target Malicious UAVs in the swarm Figure 3. Two UAVs equipped with MMPAR are utilized to detect malicious UAVs operating as a swarm. The multifunction beams of MMPAR are employed for searching, tracking, passive listening (for active RF emissions), and classifying targets. However, one of the malicious UAVs in the swarm is causing interference with its active RF emissions. Additionally, one of the beams is utilized for communication between the radar platforms.
• The passive RF listener beam is not scheduled in a fixed sequence but rather is randomly applied at different time intervals of δt. Its purpose is to detect active RF emissions from a malicious UAV, including intentional RF emissions for jamming and the RF link of the UAV for further analysis. Figure 3 illustrates the passive RF listener beam that is used to detect directed RF emissions from a malicious UAV.
Algorithm 1 Pseudo-code for beam scheduler and working of MMPAR 1: procedure RADARBEAMSCHEDULER 2: % Beam scheduling and radar operation. 3: % i = 1, 2, · · · , N are the number of MMPAR, each onboard a UAV. 4: % Input to the procedure are the MMPAR beams and corresponding resources: S (rad) i,j (t), ∀i,j = 1, 2, · · · , 6, given in (11). 5: At the start of every update interval δt: 6: RF passive listener beam applied if rand > 0.5, where rand is a random generator in the range [0 1). 7: if No target is detected then 8: Volume search and communication beams continue during δt. 9: end if 10: if a target is detected then 11: Request the handover of current relay-based communications to the GS nodes. 12: Store the coordinates of the target during the current volume search as x (tar,vol) (t), y (tar,vol) (t), z (tar,vol) (t) .

13:
At the same δt, direct the cued search beam toward the coordinates of the target obtained during volume search. 14: if cued search confirms the target then 15: Store the coordinates of the target as x (tar,cue) (t), y (tar,cue) (t), z (tar,cue) (t) . 16: At the next update interval t + δt, direct the track beam toward the estimated position of the target obtained from the cued search. The coordinates of the target after applying the track beam are estimated as x (tar,trk) (t + δt), y (tar,trk) (t + δt), z (tar,trk) (t + δt) . The RCS and Doppler estimates of the target are also obtained.

17:
At every next update interval (provided the target is present in the previous interval) the track beam will be assigned the highest priority. The state of the target S (tar) i is updated at every δt. 18: Also, the targets are tracked adaptively, i.e., targets with high maneuverability are visited often compared to slow-moving targets. 19: else 20: Confirm targets again using volume search. 21: end if 22: else 23: Continue using communication and search beams. 24: end if 25: During all δt, the communications between the MMPAR onboard UAVs continue.

MMPAR Parameters
The radar parameters assigned to each type of beam are unique. The configurable radar and radar platform parameters are provided in Table 2. The range of radar parameters provided in Table 2 can be represented as where P (TX,max) and P (TX,min) are the maximum and minimum transmit power, respectively, ∆P (TX) represents the step change in the transmit power, f (c,max) and f (c,min) are the maximum and minimum center frequency, respectively, and ∆ f (c) is the step change in the center frequency. The maximum and minimum bandwidth ranges are represented by BW (max) and BW (min) , respectively, ∆BW is the step change in the bandwidth, τ (max) and τ (min) are the maximum and minimum pulsewidths, respectively, and ∆τ is the step change between the pulsewidth values. The maximum and minimum azimuth half-power beamwidths are represented as θ (b,max) and θ (b,min) , respectively, and ∆θ (b) is the step change in the beamwidth in the azimuth plane. φ (b,max) and φ (b,min) are the maximum and minimum half-power beamwidths in the elevation plane, and ∆φ (b) is the step change in the beamwidth in the elevation plane. The maximum and minimum ranges of the PRF are represented as PRF (max) and PRF (min) , respectively, and ∆PRF is the step change in the PRF. N (PC) is the number of intrapulse modulation options available, N (Wav) is the number of waveform types available, and N (Pol) is the number of polarization options available. Moreover, the range (magnitude) of the velocity of an i-th airborne radar platform (UAV) is Different types of beams are assigned different radar parameters from (10). The parameters assigned to simultaneously scheduled beams should be smaller than or equal to the total resources of the parameters available. The parameters assigned to beams remain the same unless a change is recommended by the cognitive block. Let S (rad) (i,j) (t) represent the j-th beam at the i-th radar onboard the UAV at time t. The beam S (rad) (i,j) (t) and the corresponding radar parameters assigned are given as Table 2. Radar and radar platform parameters.

Serial # Radar Parameter Representation
Antenna gain based on beamwidths in the azimuth and elevation planes Pulse repetition frequency Intrapulse modulation/pulse compression PC 9 Waveform type Wav 10 Polarization Radar platform position x (rad) ,y (rad) , z (rad) 12 Radar platform velocity v (rad)

Reinforcement Learning in Our Approach
In this section, we first discuss the states of the targets, followed by a discussion of anomalies in the detection, tracking, and classification of targets. We then present the implementation of multiagent Q-learning to remove these anomalies. Additionally, we discuss the implementation of supervised AI algorithms for the same purpose.

State of the Targets
We consider that there are M malicious UAVs (taken as targets) in the swarm. Each MMPAR onboard a UAV detects a subset of the total targets given as k = 1, 2, · · · , M i . The RCS, position, and velocity estimates (based on Doppler) of the k-th target detected by i-th , respectively. Classification is performed by assigning a class category C (tar) (i,k) to each target based on its RCS and velocity. For simplification, we defined two class categories. One class category is the fixed wing and the other is the rotary wing.
Moreover, each k-th target is detected at a given signal-to-noise ratio (SNR) on i-th MMPAR, represented as SNR (tar) (k,i) . The SNR for the search, cued, track, and passive listening beams are given, respectively, as SNR (pl,tar) where t (s) (i,j) and Ω (i,j) are the scan time and solid angle, respectively, for the volume search beam (j = 3) at i-th radar, T (s) i , L i , and B (n) i represent the system temperature, cumulative losses, and noise bandwidth of the radar receiver, respectively, at the i-th radar, k is the Boltzman constant, R is the range of the target, and EIRP (tar) is the effective isotropic radiated power (EIRP) emitted by the active RF emitter onboard a target. The cumulative losses are due to the atmosphere and obstructions from the clutter. The clutter includes birds and buildings. The birds and buildings are modeled using non-homogeneous point clutter.
The RCS, position, velocity, class category, and detected SNR (obtained during tracking and passive listening) are considered as the state of the target. The current state of the targets (numbered as k = 1, 2, · · · , M i ) observed at the i-th UAV are given as Similarly, the past states of the targets over nδt observation intervals stored in corresponding arrays are given as Both the current and past states of the targets over nδt update time intervals are provided to the RL cognitive block shown in Figure 4.

Anomalies
An anomaly can be defined as a random fluctuation in the state of a target, which can be caused by external factors such as clutter and RF interference in the environment, or by limitations in radar processing. Such limitations include range and Doppler ambiguities, range resolution, angular resolution, radar receiver sensitivity, and dynamic range. Anomalies for different types and scenarios of radar systems are covered in the literature [23][24][25]. Anomalies are expected to increase in complex and dynamic scenarios, such as those involving time-varying clutter and UAV swarms. These anomalies can result in the inaccurate detection, tracking, and classification of targets. At a given time t, an anomaly in the state of a k-th target observed at i-th MMPAR can be represented as where ρ (tar) (i,k) (t), and SNR (tar) (i,k) (t) represent the target state with anomalies due to random fluctuations, and X (rcs) (t), X (pos) (t), X (v) (t), X (C) (t), and X (SNR) (t) are the random processes for RCS, position, velocity, classification, and SNR, respectively, and are represented by Gaussian processes.
Algorithm 2 Pseudo-code for multi-agent RL algorithm using Q-learning and MMPAR data 1: procedure STATEANOMALYACTIONREWARD 2: The input to the algorithm are target states S (tar) (t), radar parameters S (rad) (t), radar platform position x (rad) ,y (rad) , z (rad) , and policy for a given scenario.

3:
Consider that we are at i-th MMPAR onboard a UAV.
do not conform to the policy for anomaly then 5: Identify the anomaly. 6: if Anomaly is already present in the anomaly list then 7: For a k-th anomaly, p th state, reward r, and j-th action from Table 3, the corresponding Q-value is calculated as: 8: where β is the learning rate, r is the reward, γ is the future discount rate, and Q represents maximum expected future rewards given a future state S , and action a . 9: Update the Q-learning table. 10: else 11: Update the anomaly list. 12: end if 13: end if 14: return Q-learning table given in Table 4. 15: end procedure An anomaly detector is used to identify anomalies during radar processing. In Figure 4, an anomaly detector for i-th MMPAR onboard a UAV is shown. The inputs to the anomaly detector are the present and past states of the targets, S  (t − δt : t − nδt), respectively (provided in Section 3.1). The anomaly detector in Figure 4 also contains a policy and an anomaly classifier block corresponding to a given terrain. An anomaly is identified by comparing the current and past states of a target over duration nδt with a given policy. The anomaly detection procedure is described in Algorithm 2. A list of the possible anomalies during the detection, tracking, and classification of UAVs in a swarm is provided in Table 5. Table 3. There are six anomalies identified in our approach. Three states can arise after an anomaly occurs and an action is taken. Rewards are assigned for each state and action. The total reward for an action corresponding to a state is the addition of a fixed reward and overhead of the action represented by O. The Q-values are calculated using this reward (see Algorithm 2).

Anomalies
States Reward for a 1 Reward for a 2 Reward for a (a) N Anomaly not removed Anomaly (1) Anomaly removed but reappears after t > δt Anomaly removed and does not reappear Anomaly not removed Anomaly (6) Anomaly removed but reappears after t > δt Anomaly removed and does not reappear Table 4. Anomaly and possible actions to remove said anomaly in a given scenario. The rewards and Q-values corresponding to actions are also provided. Create a dynamically changing scenario by introducing multiple targets and randomness representing noise and clutter, and active RF interference emissions.

4:
Identify anomalies and select optimum action based on Q-value. See Table 4.

5:
if Anomaly persists after an optimum action then 6: Take the action with lesser priority (smaller Q-value than the previous). 7: if Anomaly is removed after lesser priority action then 8: Update the Q-learning table with an updated Q-value for the new action. Calculate the percentage of recurrence of anomalies after corrective actions are applied. 14: % For supervised AI algorithms 15: For an anomaly, the AI algorithm is trained with an optimum action label based on the target state S

16:
Calculate the percentage of recurrence of anomalies for different supervised 17: AI algorithms. 18: return Percentage recurrence of anomalies for RL and supervised AI algorithms. 19: end procedure Table 5. List of anomalies observed at nδt update intervals.

1
Targets not correctly detected (because targets are not resolved in range and angular domains). 2 Targets detected with SNR below a threshold (either due to clutter or RF interference). 3 Velocity variations above a threshold. 4 RCS fluctuations above a threshold. 5 Active RF emissions from malicious UAVs. 6 Classification of targets changes (due to the above listed anomalies).

Multi-Agent Q-Learning
If an anomaly persists for a duration of nδt, then corrective action is required to remove the anomaly. Such corrective action involves changing the runtime radar parameters and/or repositioning the radar platform. Table 6 provides a list of possible actions to remove the anomalies listed in Table 5. The combined set of available actions A i (t) at a given time t and at the i-th MMPAR can be represented as Choosing an action from Table 6 may not result in an optimal outcome. It is preferable to select an action that can maximize current and future returns, i.e., remove current and potential future anomalies with minimum overhead. After an action is taken, we can reach three states given in Table 3. Each state has an associated fixed reward. There is also an overhead for each action represented by O. The total reward is the addition of a fixed reward for the state-action pair and overhead for the action shown in Table 3. This reward is used to calculate the Q-value given in Algorithm 2. The optimum Q-values are obtained after multiple iterations of training. Overall, the anomalies, corresponding optimum actions, and Q-values are listed in Table 4. Change of number of pulses a 6 7 Change in transmit power a 7 8 Change of antenna beamwidth a 8 9 Change in the polarization of the phased array a 9 10 Change in type of sounding signal a 10 11 Changes in the beam scheduling a 11 12 Change of position of the radar platform onboard UAV a 12 Algorithm 2 provides the implementation of multi-agent Q-learning using MMPAR data. Each MMPAR onboard a UAV acts as an RL agent. The state of each agent is represented by the current and previous states of the targets after an anomaly is detected. Q-values are learned using the Q-learning algorithm and rewards from the environment for a given state-action pair. The optimum action based on the Q-value is updated in Table 4. To evaluate the Q-learning table in a simulated environment scenario, multiple targets, noise, clutter, and active RF emissions as interference are introduced to create a complex and dynamic scenario. Algorithm 3 outlines the evaluation procedure.
The learning rate in Algorithm 2 is controlled through β. We prefer to adopt a balanced learning rate for our approach. Therefore, we set the value of β = 0.75 initially and then gradually decrease β as the number of samples from Q(s, a) increases. This approach will help the convergence of RL and at the same time ensure a balanced learning rate. Similarly, we set the value of the future discount rate γ to 0.85 to balance the current and future rewards in our approach.
A Markov decision process (MDP) is a mathematical framework used to model decision-making problems in situations where outcomes are partly random and partly under the control of a decision maker. In the case of the presented approach, the MDP represents the states of the targets, the occurrence of anomalies, the actions taken to remove the anomalies, and the corresponding Q-values shown in Figure 5.
The goal of the MDP is to achieve an anomaly-free state for radar operation. If an anomaly arises, the best action is chosen based on the optimum Q-value to remove the anomaly and move to the desired state, as shown in Figure 5. If the action based on an optimum Q-value is not able to remove the anomaly, the Q-value is updated for the state-action pair using a negative reward. A second priority action is used to remove the anomaly if the first action fails, and the process repeats. By using an MDP, the approach can systematically and optimally address anomalies in radar operation.

Anomaly Removal Using Supervised AI Algorithms
In addition to RL, supervised AI algorithms are also used to remove anomalies. The AI algorithms used are Naive Bayes (NB), Classification Decision Tree (CDT), Linear Discriminant Analysis (LDA), and Random Forest (RF). The AI algorithms are trained to provide the best action label corresponding to an anomaly in a given scenario. The training data D (Tr) and model for the AI algorithms of the i-th MMPAR onboard a UAV after anomaly identification are given as where Mdl 1 , Mdl 2 , Mdl 3 , and Mdl 4 are the AI models corresponding to NB, CDT, LDA, and RF classifiers, respectively, f is the modeling function of the classifier, and C (Tr) is the class identifier assigned to training data. The class identifiers here are action labels, provided in Table 6. During the evaluation phase, the optimum action (class label) is predicted as where p = 1, 2, 3, 4, D (Eval) represents the evaluation data of the target collected by the radar, and predict represents the prediction function.

Anomaly6
Compare Yes Desired state (anomaly-free) Figure 5. A Markov decision process is used to represent our approach. Initially, the state of the target is examined for anomalies at a particular radar onboard a UAV. If an anomaly is detected, a set of actions is available to correct the anomaly and transition to the desired state. Each action corresponds to a Q-value. The action with the highest Q-value is selected.

Simulation Setup and Results
In this section, the simulation setup, results of our approach, and comparison of our approach with the supervised AI algorithms are provided.

Simulation Setup
The simulations are carried out in Matlab. The MMPAR onboard the UAV is generated using the Matlab Radar Toolbox. The parameters of the radar are provided in Table 7. The radar resources shared by different beams (obtained from the range of parameters in Table 7) are provided in Figure 6. The maximum azimuth and elevation scan limits are [−55 • : 55 • ] and [0 • : −35 • ] in the azimuth and elevation planes, respectively. The update rate for the system simulation is f (u) = 20 Hz, and the corresponding update interval is δt = 50 ms. The cumulative losses of the i-th radar represented by L i vary between 0 and 12 dB. Table 7. Simulation parameters of the radar. Wav Noise radar waveform 8

Serial # Radar Parameters
Pol Vertical 9 Radar noise figure  5 dB  We consider two UAV platforms, each equipped with an MMPAR. These UAVs are labeled as UAV1 and UAV2, respectively. Both UAV1 and UAV2 start from an initial position and follow a straight-line trajectory at a constant velocity towards the targets. UAV1 has three targets in front of it, while UAV2 scans six targets. All targets are airborne and within the range of the radars. Figure 7 shows a snapshot of UAV2 carrying MMPAR scanning the six targets using volume, cued, and track beams at different time instances. At the top figure, the positions of the beams and targets at different sectors of the azimuth and elevation planes are shown. In the bottom figure, an (x, y) plot of the beams is presented, and the position of targets is shown. It can be observed from Figure 7 that only volume and cued search beams are used at t 1 for targets 1 and 6. At time t 2 , volume and cued search beams are used for targets 2, 3, 4, whereas a track beam is used for targets 1 and 5. In Figure 8, an airborne platform carrying a malicious RF emitter emits a beam towards UAV1 carrying MMPAR. The RF listener beam of MMPAR onboard UAV1 is used to listen to the RF emissions. The other two targets do not emit RF energy, as shown in Figure 8. Furthermore, suburban and urban terrains are considered in the simulations. In both the suburban and urban terrains, clutter consisting of birds and buildings is considered. The height of the buildings in the urban terrain is larger than in the suburban terrain. The clutter is modeled as non-homogeneous point clutter occurring at random intervals of simulation. The clutter reduces the SNR of the radar beams. The fluctuation in the SNR of different beams due to clutter is modeled using Equations (12)- (15).

Results and Analysis
The simulations introduce anomalies to the radar environment by adding randomness according to (18). The simulation duration is 200 s, comprising 4000 update intervals. The first two anomalies and their respective priority actions for the MMPAR onboard UAV2 are shown in Figures 9 and 10. For the first anomaly in Figure 9, the goal is to have above 300 correctly detected targets throughout the simulation duration, and if the number of correctly detected targets is 300 or less, it is considered an anomaly. Anomaly1 is obtained for both suburban and urban scenarios. The priority actions taken to remove the anomaly include increasing the peak power, reducing the beamwidth, and increasing the number of pulses and bandwidth. The peak power varies between 150 × 10 3 and 400 × 10 3 , the beamwidth is reduced by 1 degree in both the azimuth and elevation planes, the number of pulses is increased from 10 to 35, and the bandwidth increases from 10 × 10 6 to 35 × 10 6 . The results show that the priority actions help to increase the number of correctly detected targets above the anomaly threshold of 300. Furthermore, it can be observed that the number of correctly detected targets is smaller for the urban scenario compared to the suburban scenario, mainly due to tall building clutter in the urban scenario.   In Figure 10, the second anomaly from Table 4 is shown, and we aim to keep the number of targets detected above the SNR threshold of 10 dB above 300. The priority actions are shown for both suburban and urban scenarios. The three priority actions selected based on the highest Q-values include increasing peak power, decreasing the beamwidth, and increasing the number of pulses. As illustrated in Figure 10, applying the priority actions increases the number of occurrences when the SNR of the detected targets is above the threshold. Similar to Figure 9, the number of targets above the SNR threshold for suburban scenarios is higher compared to urban scenarios. The other anomalies and corresponding actions from Table 4 can also be plotted.  Figure 9. Anomaly1 at MMPAR onboard UAV2 occurs when the number of correctly detected targets over a fixed simulation duration is less than 300. Corresponding actions are taken to remove the anomaly, which include varying the peak power in the range of 150 × 10 3 W: 50 × 10 3 : 400 × 10 3 W, reducing the beamwidth   Figure 10. Anomaly2 occurs when the number of targets detected with an SNR above the 10 dB threshold is less than 300. Similar actions to those in Figure 9 are taken to remove the anomaly and increase the number of targets detected with an SNR above 10 dB. Anomaly2 and corresponding actions using RL to remove Anomaly2 at UAV2 for suburban and urban scenarios are shown. The MMPAR onboard UAV1 uses a passive RF listener beam to detect active RF emissions from a target UAV, as depicted in Figure 8. The frequency sweep is performed using the passive listener beam to detect the active emissions, and the direction of arrival is determined by the same beam, shown as a line in the figure. Another line in Figure 8 shows the detection caused by the reflection of RF energy from a non-emitting target within the emitter beam footprint towards the RF listener beam.

Comparison of RL with Other AI Algorithms
In addition to RL, we also utilize the four supervised AI algorithms discussed in Section 3.4 to address anomalies. These algorithms predict the best action to remove the anomaly, and we compare their effectiveness with the Q-learning RL algorithm. The difference between the actual and estimated positions at MMPAR onboard UAV2 for suburban and urban scenarios is shown in Figure 11. The position error in Figure 11 is lowest for RL and highest without using any algorithm for anomaly removal. The position error is higher for the urban scenario compared to the suburban due to the tall building clutter in the urban scenario. Moreover, the average position error values for NB, CDT, LDA, and RF are close for urban compared to suburban scenarios. We also compare the percentage of anomaly recurrence after applying the optimal action, as outlined in Algorithm 3. Figure 12 shows the results of this comparison.  Figure 11. Difference between the actual and estimated positions (position error) of targets for MMPAR at UAV2 for suburban and urban scenarios. The position error due to radar anomalies, using RL and other AI algorithms and without any algorithm for anomaly removal, is shown. The average position error is lowest for RL and highest when no algorithm is used.   Figure 12. The percentage of anomaly recurrence after taking optimum actions based on Q-values is presented in the graph for suburban and urban scenarios. The simulations were conducted over a duration of 200 seconds for MMPAR onboard UAV1 and UAV2. It is observed that the highest percentage of anomaly recurrence occurs when no AI algorithm is used, while the lowest percentage is obtained using RL. Additionally, the percentage of anomaly recurrence is higher for six targets as compared to three targets. (a) Suburban scenario. (b) Urban scenario. Figure 12 shows that RL outperforms other supervised AI algorithms in handling anomalies. This is mainly due to RL's continuous situational awareness of environment feedback in the form of rewards and the ability to update its policy during runtime. In contrast, once a supervised AI algorithm is trained, it tries to find the best match based on training without taking any runtime feedback from the environment. Furthermore, the performance of supervised AI algorithms is expected to decrease in complex and dynamic scenarios with rapidly changing environments. Additionally, the recurrence percentage of anomalies is higher for all AI algorithms when six targets are present compared to three targets.
The runtime calculation demand is smaller for RL compared to supervised AI algorithms. Only basic initial training is required for the RL algorithm in our approach to learn about the best action for an anomaly (see Algorithm 2). During the runtime, a calculation overhead for our approach will only arise when an optimum action has not been selected and due to sub-optimum action, we have to select another action. On the other hand, supervised AI algorithms require a large dataset for training. The size of the training dataset can vary with the environment. Also, the results of supervised AI algorithms directly depend on the training dataset. Overall, the runtime calculation demand for supervised AI algorithms is higher compared to RL. The class label (action) selection based on comparison with the training data results in high runtime calculation demand for supervised AI algorithms. Moreover, if a new anomaly scenario arises that has not been covered in the training, it will be difficult to select an optimum action for the new anomaly using supervised AI learning. On the contrary, RL will be able to learn about the optimum action for the new anomaly scenario from the environment.

Conclusions and Future Work
In this work, we have used MMPAR onboard UAVs to detect, track, and classify malicious UAVs in a swarm. In addition to radar operation, the MMPAR onboard UAVs in our study can also support communications. During radar operation, we identify and remove anomalies in the detection, tracking, and classification using optimal actions. Multi-agent RL is utilized to select optimal actions for a given anomaly. The optimal actions are chosen based on the highest Q-values from Q-learning. We have also provided a performance comparison of RL with selected supervised AI algorithms. The results show that the RL-based approach is better at handling anomalies in a dynamic environment compared to supervised AI algorithms. In our future work, we plan to implement centralized multi-agent Q-learning and compare the results with the current decentralized multi-agent Q-learning.
We also plan the real-world implementation of our proposed method. Major limitations for real-world implementation are the large weight, power, and computation requirements of MMPAR. Therefore, MMPAR is challenging to place on small and medium-sized UAVs.
Author Contributions: Conceptualization, W.K.; visualization, Q.Y.; writing-original draft preparation, W.K.; supervision, I.G. All authors have read and agreed to the published version of the manuscript.
Funding: This work has been supported in part by the National Science Foundation under grant number CNS-1814727.