The core of the collaborative secure localization lies in DL-based anomaly detection that extracts spatio-temporal features. The detection part consists of two components: a spatial feature extraction module based on the GAT and a temporal feature extraction module based on the VAE. The learned representations are used to build the reconstruction difference, which reflects the deviation between the expected and the collected ranging data and serves as an indirect indicator of anomalies.
4.3.1. Spatial Feature Extraction Module
To efficiently capture the feature correlations of the multidimensional ranging sequence in the spatial dimension, we propose to use a graph, $G = (V, E)$, to represent and learn spatial feature correlations, where $V$ is the set of nodes, each corresponding to a specific sensing UAV, and $E$ is the set of edges representing the spatial feature relationships between UAVs identified by the satellite. The feature vector of each sensing UAV $v_i$ is set as the delayed ranging sequence $x_i = [x_i^1, x_i^2, \ldots, x_i^w]$, and the basis of the feature extraction lies in the similarity between the ranging sequences, calculated as
$$\mathrm{sim}(x_i, x_j) = \frac{\sum_{k=1}^{w} x_i^k\, x_j^k}{\sqrt{\sum_{k=1}^{w} \big(x_i^k\big)^2}\,\sqrt{\sum_{k=1}^{w} \big(x_j^k\big)^2}}, \tag{6}$$
where $x_i^k$ and $x_j^k$ represent the $k$-th element of the vectors $x_i$ and $x_j$, respectively.
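For illustration, the following minimal sketch computes the pairwise similarity of Formula (6) for a small swarm; the array shapes and the helper name are assumptions made for this example rather than the paper's implementation.

```python
# Minimal sketch of Formula (6): cosine similarity between delayed ranging
# sequences. Shapes and names are illustrative assumptions.
import numpy as np

def ranging_similarity(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity; X has shape (num_uavs, window_len)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)   # guard against zero-norm rows
    return Xn @ Xn.T                        # sim[i, j] = cos(x_i, x_j)

X = np.random.rand(5, 20)    # 5 sensing UAVs, 20-step ranging window
S = ranging_similarity(X)    # (5, 5) similarity matrix
```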
Specifically, we define an adjacency matrix $A$ to represent the potential correlation between the ranging values of collaborative sensing UAVs associated with each edge. All elements of $A$ are initialized to 1 by default and are updated through a GAT model that explores the correlation factors between nodes in the graph data and learns to aggregate the potential spatial relationships between UAVs. An example of the network layer is shown in Figure 5. To obtain a low-dimensional representation of the network, instead of directly using the similarity given in Formula (6), all UAVs share a learnable weight matrix $W$, which is used for the linear transformation of the original features. For UAV $v_i$, the shared weight matrix $W$ is first used to linearly transform its own and its adjacent UAVs' original features $x_i$, $x_j$ to obtain $W x_i$ and $W x_j$. Then, self-attention is applied to itself and its adjacent UAVs, based on the following function:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_K}}\right) V, \tag{7}$$
where $Q$ is the query vector, i.e., $W x_i$, representing $v_i$'s attention to other adjacent UAVs; $K$ is the key vector, which is the information carrier matched against $Q$, and the similarity between $Q$ and $K$ is calculated to measure the importance of different adjacent UAVs to $v_i$; and $V$ is the value vector, storing and updating the feature information of $v_j$. In the proposed setting, both $K$ and $V$ equal $W x_j$, and $d_K$ is the dimension of $K$. Here, the Softmax function is used to convert the similarity scores between $v_i$ and $v_j$ into a probability distribution, representing the relative weights of different adjacent UAVs.
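A minimal sketch of the scaled dot-product attention in Equation (7) is given below, with $K = V = W x_j$ as in the text; the tensor shapes and variable names are illustrative assumptions.

```python
# Sketch of the scaled dot-product attention in Equation (7); the setting
# K = V = W x_j follows the text, while shapes are assumptions.
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                               # dimension of K
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of Q and K
    weights = torch.softmax(scores, dim=-1)         # relative weights
    return weights @ V                              # weighted value update

W = torch.randn(16, 20)      # shared learnable weight matrix W
x = torch.rand(5, 20)        # raw features of v_i and its 4 neighbors
h = x @ W.T                  # linear transform: W x
out = scaled_dot_product_attention(h[:1], h, h)   # v_i attends to neighbors
```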
Subsequently, a shared attention mechanism is adopted in each GAT layer to learn the correlation weights between adjacent UAVs. The attention weight between $v_i$ and $v_j$ at the $l$-th layer, which represents the value of attention that UAV $v_i$ pays to $v_j$, is calculated as
$$e_{ij}^{(l)} = \mathrm{LeakyReLU}\!\left(a^{\top}\left[W^{(l)} h_i^{(l-1)} \,\Big\|\, W^{(l)} h_j^{(l-1)}\right]\right), \tag{8}$$
where $h_i^{(l-1)}$ and $h_j^{(l-1)}$ represent the feature embedding vectors at the $(l-1)$-th layer, with $h_i^{(0)} = x_i$; $a \in \mathbb{R}^{2d'}$ represents the weight of the single-layer feedforward network adjacent to the attention mechanism; $d'$ is the feature dimension after the linear transformation using $W^{(l)}$; $\|$ represents the concatenation operation; and LeakyReLU is an activation function. To facilitate comparison of the attention degrees to different adjacent UAVs, the Softmax function is used to normalize the attention coefficients of all adjacent UAVs of $v_i$, including itself, i.e., $j \in \mathcal{N}_i \cup \{i\}$, to obtain the attention weight:
$$\alpha_{ij}^{(l)} = \frac{\exp\!\big(e_{ij}^{(l)}\big)}{\sum_{k \in \mathcal{N}_i \cup \{i\}} \exp\!\big(e_{ik}^{(l)}\big)}, \tag{9}$$
where $\alpha_{ij}^{(l)}$ represents the normalized attention coefficient between UAVs $v_i$ and $v_j$. Based on $\alpha_{ij}^{(l)}$, the features of adjacent UAVs, including the UAV's own features, are aggregated, and a weighted sum is performed to obtain the updated feature embedding vector $h_i^{(l)}$ of UAV $v_i$ at the $l$-th layer:
$$h_i^{(l)} = \sigma\!\left(\sum_{j \in \mathcal{N}_i \cup \{i\}} \alpha_{ij}^{(l)}\, W^{(l)} h_j^{(l-1)}\right), \tag{10}$$
where $\sigma$ represents the ReLU activation function.
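The following sketch implements one such GAT layer (Equations (8)-(10)) under the assumption of a dense adjacency mask with self-loops; the class and attribute names are illustrative, not the paper's code.

```python
# Sketch of one GAT layer (Equations (8)-(10)); the dense adjacency mask
# and module names are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared weight W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # feedforward a
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, in_dim), adj: (N, N) with self-loops on the diagonal
        Wh = self.W(h)                                    # (N, out_dim)
        N = Wh.size(0)
        # Concatenate every (i, j) pair: [W h_i || W h_j], Equation (8)
        pairs = torch.cat(
            [Wh.unsqueeze(1).expand(N, N, -1),
             Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = self.leaky_relu(self.a(pairs)).squeeze(-1)    # (N, N) scores
        e = e.masked_fill(adj == 0, float("-inf"))        # neighbors only
        alpha = torch.softmax(e, dim=-1)                  # Equation (9)
        return F.relu(alpha @ Wh)                         # Equation (10)

layer = GATLayer(in_dim=20, out_dim=16)
h = torch.rand(5, 20)
adj = torch.ones(5, 5)        # all elements initialized to 1, as in A
out = layer(h, adj)           # (5, 16) updated embeddings
```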
To enhance the generalization ability of the GAT model, this paper adopts the multi-head attention mechanism shown in Figure 6 to more comprehensively capture graph structure information. First, the query weight matrix $W_Q$, key weight matrix $W_K$, and value weight matrix $W_V$ are extracted separately from the embedding vector $h_i^{(l-1)}$. Then, after the scaled dot-product calculation and matrix concatenation, the multi-head output is obtained. Under the multi-head attention mechanism, the update method for UAV $v_i$'s embedding is as follows:
$$h_i^{(l)} = \sigma\!\left(\frac{1}{M} \sum_{m=1}^{M} \sum_{j \in \mathcal{N}_i \cup \{i\}} \alpha_{ij}^{(l),m}\, W^{(l),m}\, h_j^{(l-1)}\right), \tag{11}$$
where $M$ is the number of attention heads and $\alpha_{ij}^{(l),m}$ represents the normalized attention coefficient calculated by the $m$-th attention head. Averaging over the heads is used instead of concatenation to maintain the consistency of the dimensions of $h_i^{(l)}$ and the original input $x_i$. After multiple GAT layer operations, this module outputs the spatial feature embedding vectors of all UAVs, $H = \{h_1, h_2, \ldots, h_N\}$.
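A possible realization of the averaged multi-head update in Equation (11), reusing the GATLayer sketch above, is shown below; the head count is an arbitrary illustrative choice.

```python
# Sketch of the averaged multi-head update in Equation (11), reusing the
# GATLayer from the previous sketch; M = 4 heads is an arbitrary choice.
import torch
import torch.nn as nn

class MultiHeadGATLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            GATLayer(in_dim, out_dim) for _ in range(num_heads))

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Averaging (rather than concatenating) the per-head outputs keeps
        # the embedding dimension fixed across layers, as in the text.
        # Note: each head applies its own activation here, a slight
        # simplification of Equation (11), where sigma wraps the average.
        return torch.stack([head(h, adj) for head in self.heads]).mean(dim=0)
```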
This paper optimizes the GAT model of the spatial feature extraction module through negative sampling, specifically using the binary cross-entropy loss function:
$$\mathcal{L}_{\mathrm{GAT}} = -\sum_{(v_i, v_j) \in \Omega_p} \log \mathrm{sigmoid}\big(\mathrm{sim}(h_i, h_j)\big) - \sum_{(v_i, v_k) \in \Omega_n} \log \Big(1 - \mathrm{sigmoid}\big(\mathrm{sim}(h_i, h_k)\big)\Big), \tag{12}$$
where $\mathrm{sim}(\cdot, \cdot)$ represents the cosine similarity metric function, and $\Omega_p$ and $\Omega_n$ are the sets of positive node pairs and negative node pairs in graph $G$, respectively. A pair $(v_i, v_j) \in \Omega_p$ indicates that nodes $v_i$ and $v_j$ have an adjacency relationship, forming a positive node pair with strong association or high feature similarity, while a pair $(v_i, v_k) \in \Omega_n$ consists of two randomly sampled non-adjacent UAVs, forming a negative node pair with weak or no association. The goal of training the GAT model is to maximize the similarity of the feature embeddings of positive node pairs while minimizing the similarity of the feature embeddings of negative node pairs.
4.3.2. Temporal Feature Extraction Module
After obtaining the spatial feature embedding vector $h_i$ for each node, a VAE model based on an attention mechanism is further used to extract the temporal dependencies of the multidimensional ranging sequence. Figure 7 details the basic structure of the temporal feature extraction module and the training method of the proposed anomaly detection model. First, an LSTM-based attention mechanism is used to capture the importance of different time steps within the time window. Compared to ordinary LSTM networks, the attention-based LSTM learns non-fixed weight parameters $\alpha_t$ during training, i.e., it assigns dynamic weights to the inputs and then emphasizes the information of important time steps based on the degree of correlation between time steps.
The hidden state $s_{t-1}$ and cell state $c_{t-1}$ from the previous time step can be calculated using the LSTM recurrence in Equation (13):
$$s_t,\, c_t = \mathrm{LSTM}\big(x_t,\, s_{t-1},\, c_{t-1}\big). \tag{13}$$
Combined with the input $x_t$ at the current time step, they are fed into a linear layer to obtain $e_t$:
$$e_t = W_e\left[s_{t-1} \,\|\, c_{t-1} \,\|\, x_t\right] + b_e, \tag{14}$$
where $W_e$ and $b_e$ represent the weight and bias term of the linear layer, respectively. After Softmax normalization, the weight $\alpha_t$ is obtained:
$$\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{w} \exp(e_\tau)}. \tag{15}$$
For UAV $v_i$, its feature embedding vector $h_i$ is processed through the LSTM layer and the attention layer, and the output $\tilde{h}_i$ is given by
$$\tilde{h}_i = \sum_{t=1}^{w} \alpha_t\, s_t. \tag{16}$$
Aggregating the $\tilde{h}_i$ of all UAVs yields the entire output $\tilde{H}$ of the spatial feature embedding vectors after passing through the attention mechanism-based LSTM, i.e., the weighted embedding of spatio-temporal features.
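A minimal sketch of the attention-based LSTM of Equations (13)-(16) is given below; the per-step scoring over $[s_{t-1} \,\|\, c_{t-1} \,\|\, x_t]$ follows the text, while the dimensions and the LSTMCell-based loop are assumptions.

```python
# Sketch of the attention-based LSTM (Equations (13)-(16)); dimensions and
# the explicit LSTMCell loop are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim, hidden_dim)
        # Linear scoring layer over [s_{t-1} || c_{t-1} || x_t], Eq. (14)
        self.score = nn.Linear(2 * hidden_dim + in_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, in_dim); returns the weighted summary of Eq. (16)
        B, T, _ = x.shape
        s = x.new_zeros(B, self.cell.hidden_size)
        c = x.new_zeros(B, self.cell.hidden_size)
        states, scores = [], []
        for t in range(T):
            # e_t scored from the previous states and the current input
            scores.append(self.score(torch.cat([s, c, x[:, t]], dim=-1)))
            s, c = self.cell(x[:, t], (s, c))   # LSTM recurrence, Eq. (13)
            states.append(s)
        alpha = torch.softmax(torch.cat(scores, dim=-1), dim=-1)  # Eq. (15)
        S = torch.stack(states, dim=1)          # (B, T, hidden_dim)
        return (alpha.unsqueeze(-1) * S).sum(dim=1)               # Eq. (16)

model = AttentionLSTM(in_dim=16, hidden_dim=32)
out = model(torch.rand(5, 20, 16))   # 5 UAVs, 20-step window
```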
Subsequently, the resulting $\tilde{H}$, which has sequentially extracted spatial feature embeddings and temporal dependency features, is input into an unsupervised VAE model. Through training, the feature patterns of normal sequence data are learned, enabling more effective discrimination of significant differences between normal samples and potential anomalous samples and providing a basis for anomaly detection.
The VAE compresses the high-dimensional features $\tilde{H}$ into a low-dimensional latent representation $z$ through dimensionality reduction, and then reconstructs $\tilde{H}$ based on $z$ to obtain the reconstructed output $\hat{H}$. Specifically, the VAE maps an originally simple probability distribution to the true probability distribution of the training set, where $\hat{H}$ is generated based on the sampling and parameters of the input high-dimensional features $\tilde{H}$. The latent representation $z$ both contains the key information of $\tilde{H}$ and satisfies a normal distribution. The probability of $\tilde{H}$ can be calculated from $z$ using the total probability formula:
$$p(\tilde{H}) = \int p(z)\, p(\tilde{H} \mid z)\, \mathrm{d}z, \tag{17}$$
where $p(z)$ is the probability of the latent representation $z$, and $p(\tilde{H} \mid z)$ is the probability of $\tilde{H}$ given $z$. However, $z$ resides in a high-dimensional space and cannot be enumerated, making it difficult to compute $p(\tilde{H})$. Moreover, the posterior distribution is also difficult to compute:
$$p(z \mid \tilde{H}) = \frac{p(\tilde{H} \mid z)\, p(z)}{p(\tilde{H})}. \tag{18}$$
Therefore, this paper introduces the encoder of the VAE as an inference model $q_\phi(z \mid \tilde{H})$ to approximate the posterior distribution $p_\theta(z \mid \tilde{H})$, and the decoder $p_\theta(\tilde{H} \mid z)$ as a generative model to address the above problems, where $\theta$ and $\phi$ are the learnable parameters of the generative model and inference model, respectively. During VAE model training, the mean $\mu$ and variance $\sigma^2$ parameters of the latent space representation $z$ are trained using samples. To enable backpropagation, the reparameterization trick is used, i.e., sampling noise $\epsilon$ from a standard normal distribution and computing $z$ based on it:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \tag{19}$$
where $\odot$ represents element-wise multiplication, and $\epsilon$ does not participate in the gradient calculation process.
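The reparameterization of Equation (19) can be sketched as follows; the log-variance parameterization is a common convention assumed here for numerical stability.

```python
# Sketch of the reparameterization trick in Equation (19); the encoder that
# produces mu and log_var is assumed, not shown.
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    sigma = torch.exp(0.5 * log_var)   # standard deviation from log-variance
    eps = torch.randn_like(sigma)      # noise from N(0, I); no gradient path
    return mu + sigma * eps            # z = mu + sigma ⊙ eps

mu, log_var = torch.zeros(5, 8), torch.zeros(5, 8)
z = reparameterize(mu, log_var)        # differentiable w.r.t. mu, log_var
```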
The temporal feature extraction module uses the attention mechanism-based VAE to capture the importance of UAVs in different time windows, learn the latent distribution patterns of normal sequences, and establish a corresponding probability distribution model. The model is trained by maximizing the likelihood of the input. Since the input likelihood is difficult to compute directly, the problem is transformed into maximizing the evidence lower bound of the log-likelihood. Thus, the loss function for training the attention mechanism-based VAE is expressed as
$$\mathcal{L}_{\mathrm{VAE}} = \mathrm{MSE}\big(\tilde{H}, \hat{H}\big) + D_{\mathrm{KL}}\Big(q_\phi(z \mid \tilde{H}) \,\big\|\, \mathcal{N}(0, I)\Big). \tag{20}$$
$\mathcal{L}_{\mathrm{VAE}}$ consists of two parts: a reconstruction loss and a KL divergence regularization term. To train the VAE reconstruction part while simultaneously learning the variable weights of the attention mechanism, the mean squared error (MSE) between the spatial feature embedding vector $\tilde{H}$ and the VAE output $\hat{H}$ is computed in the reconstruction loss part to reflect the difference between them. The optimization goal is to make them as similar as possible, which helps identify anomalous samples that cause significant reconstruction errors during testing. The KL divergence regularization term minimizes the KL divergence between the approximate posterior and the prior of the latent representation $z$, ensuring that the $z$ generated by the inference model $q_\phi(z \mid \tilde{H})$ conforms as closely as possible to a standard normal distribution.
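A sketch of the loss in Equation (20) follows, assuming a diagonal Gaussian approximate posterior so that the KL term takes its standard closed form.

```python
# Sketch of the VAE loss in Equation (20): MSE reconstruction term plus the
# closed-form KL divergence to a standard normal prior (assuming a diagonal
# Gaussian approximate posterior).
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    recon = F.mse_loss(x_hat, x, reduction="sum")   # reconstruction loss
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

x = torch.rand(5, 20)                        # input embedding (toy)
x_hat = x + 0.05 * torch.randn(5, 20)        # toy reconstruction
mu, log_var = torch.zeros(5, 8), torch.zeros(5, 8)
loss = vae_loss(x, x_hat, mu, log_var)
```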
Finally, we use a joint loss function for end-to-end training of the temporal and spatial feature processing modules, simultaneously optimizing the spatial correlation modeling capability and the temporal-dependency-based normal pattern reconstruction capability of the proposed anomaly detection model, and use the hyperparameter $\lambda$ to balance the relative importance of the two parts:
$$\mathcal{L} = \mathcal{L}_{\mathrm{VAE}} + \lambda\, \mathcal{L}_{\mathrm{GAT}}. \tag{21}$$
Algorithm 1 describes the training process of the proposed anomaly detection model. Additionally, during model training and testing, unlike the common practice of calculating reconstruction errors per timestamp in general anomaly detection models, this paper calculates the reconstruction errors of the ranging data at different timestamps along the UAV dimension, providing a basis for the subsequent UAV-dimensional scoring mechanism. The UAV-dimensional reconstruction error is taken as the absolute error between the original input and the reconstructed output of the VAE reconstruction part at the corresponding position.
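As a sketch of this UAV-dimensional error computation, assuming the inputs are arranged as (UAV, timestamp) arrays:

```python
# Sketch of the UAV-dimension reconstruction error described above: absolute
# errors are kept per UAV rather than aggregated per timestamp. The (uav,
# time) axis layout and the mean-based score are illustrative assumptions.
import numpy as np

def uav_reconstruction_error(x: np.ndarray, x_hat: np.ndarray) -> np.ndarray:
    """Absolute error per (UAV, timestamp); x, x_hat: (num_uavs, T)."""
    return np.abs(x - x_hat)

x = np.random.rand(5, 20)                     # original ranging-derived input
x_hat = x + 0.01 * np.random.randn(5, 20)     # toy VAE reconstruction
err = uav_reconstruction_error(x, x_hat)      # (5, 20) per-position errors
score = err.mean(axis=1)                      # one anomaly score per UAV
```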