We begin with an overview of the network structure during the pre-training phase, explaining how the network enables SSL and the advantages it provides. We then describe what is required to use the model in downstream tasks, namely the network structure and the loss function.
3.1. Pre-Training
Definition of Self-Supervised Representation Learning: automatically learning features, in the absence of signal labels, whose recognition ability applies to multiple downstream tasks.
In the pre-training phase, the state-of-the-art MAE is used to learn a time-frequency representation of QSAS, as shown in Figure 2. MAE randomly masks patches of the input image and reconstructs the missing pixels in order to learn a low-dimensional representation of the data. After the STFT, we feed the time-frequency images to the autoencoder. The design rules of the autoencoder are as follows:
An asymmetric encoder–decoder network architecture is used. The encoder operates only on visible patches, obtaining their latent feature representation through convolution operations (the blue blocks in the figure). The lightweight decoder reconstructs the signal in the time-frequency domain from these latent features and the mask tokens (orange blocks).
A high mask ratio increases task difficulty and improves the model's generalization ability. When the mask ratio is high, the model must reconstruct the original data from limited information, which forces it to learn more robust feature representations. In addition, a high mask ratio reduces the model's dependence on local details, improving its understanding of the overall structure. However, an excessively high mask ratio leads to information loss and underfitting of the model, so the ratio must be tuned sensibly in practice.
Figure 2. Simple illustration of the autoencoder network model.
The details of the autoencoder network are shown in Figure 3, where k denotes the convolution kernel size, s the stride (i.e., the sampling interval), and p the padding. The small yellow blocks are ConvNeXt V2 blocks and the small blue blocks are downsampling layers, both detailed on the right of the figure. The data conversion process is as follows: first, the whole time-frequency image is downsampled and divided into multiple patches, compressing the input image into a low-dimensional representation. Then we mask randomly selected patches of the image, converting the dense tensor into a sparse tensor. After feature extraction by the encoder's sparse convolution layers, the sparse tensor is converted back into a dense tensor. Finally, the decoder restores the masked regions.
The mask is generated as follows. First, a mask tensor is defined with the same shape as the input image and an initial value of all ones. Then a random set of rectangles, representing the regions to be masked, is generated, and the values of the corresponding rectangular regions in the mask tensor are set to 0. To accommodate the convolution operations, we upsample the mask to the size of the last layer and add a dimension with the unsqueeze function, so that its type matches the input data. Finally, the mask tensor is used as a multiplicative factor on the input patches to obtain the masked image.
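A minimal PyTorch sketch of this rectangle-based masking is given below; the number and maximum size of the rectangles are illustrative choices, not values from the paper.

```python
import torch

def generate_rect_mask(x, num_rects=8, max_frac=0.25):
    """Mask random rectangles of the input batch x (shape B, C, H, W).
    Returns the masked images and the binary mask (1 = visible, 0 = masked)."""
    B, _, H, W = x.shape
    mask = torch.ones(B, 1, H, W, device=x.device)        # initialised to all ones
    for b in range(B):
        for _ in range(num_rects):                        # rectangles to hide
            rh = torch.randint(1, max(2, int(H * max_frac)) + 1, (1,)).item()
            rw = torch.randint(1, max(2, int(W * max_frac)) + 1, (1,)).item()
            top = torch.randint(0, H - rh + 1, (1,)).item()
            left = torch.randint(0, W - rw + 1, (1,)).item()
            mask[b, :, top:top + rh, left:left + rw] = 0  # zero out the region
    # The mask broadcasts over channels; at other feature resolutions it can be
    # resized with F.interpolate and expanded with unsqueeze, as in the text.
    return x * mask, mask
```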
3.1.1. Encoder
The encoder, shown on the light green background in Figure 3, consists of data downsampling, dense-to-sparse conversion, the convolutional network layers, and sparse-to-dense conversion. Data downsampling uses a convolution layer with a kernel size of 4 and a stride of 4 to segment the image into patches. The details of the convolutional network layers are given in the following subsections. What makes the encoder unique is the sparse–dense handling: we treat the masked input as a set of sparse patches, i.e., a two-dimensional sparse pixel array, and use sparse convolution to process only the visible part.
Because of the disorder of sparse data, voxelizing the points and applying convolution to three-dimensional grids is a natural solution [47]. Sparse convolution is a practical substitute for vanilla 3D convolution: it skips non-active regions and operates only where the convolution kernel covers active voxels. Active voxels are stored as sparse tensors for the convolution operations, i.e., the sparse input data are transformed into a sparse matrix that stores only the non-zero positions (active sites) and the corresponding weight values. Submanifold sparse convolution (SSC) is used to reduce the influence of sparsity: it is computed only when the center of the convolution kernel slides over an active site of the sparse matrix. The size and shape of the convolution kernel can be adjusted adaptively, and the same degree of sparsity is maintained throughout the network, which makes SSC suitable for deep CNNs.
Traditional convolution convolves every site of the input tensor, which leads to much redundant computation. Sparse convolution convolves only the non-zero sites, so a large number of multiplications in the convolution are avoided. SSC thus reduces the amount of computation and improves training and testing efficiency while making full use of the input information.
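Libraries such as MinkowskiEngine and spconv implement sparse and submanifold sparse convolution natively. As a library-agnostic illustration, the sketch below emulates SSC on dense tensors by masking the input before the convolution and re-masking the output, so that only sites whose kernel center lies on an active position produce activations; all names here are ours.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Module):
    """Dense emulation of submanifold sparse convolution: masked inputs
    contribute nothing, and outputs survive only at active centres."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x, mask):
        # mask: (B, 1, H, W) with 1 at active (visible) sites, 0 elsewhere
        y = self.conv(x * mask)   # zeroed inputs at masked positions
        return y * mask           # keep the original sparsity pattern
```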
3.1.2. Decoder
The decoder is a single ConvNeXt V2 block, shown on the blue background in Figure 3, implemented as an ordinary (dense) CNN operating on the mask tokens. The decoder restores the low-dimensional representation of the time-frequency image produced by the encoder to the original image. With this lightweight decoder, we reconstruct the original image from the latent features and the mask tokens: the decoder receives the latent features output by the encoder together with the mask tokens, restores the latent representation to the original image through deconvolution, and fills in the masked regions of the image according to the mask tokens.
Mean squared error (MSE) is the reconstruction loss function, and the error is calculated only on the masked part. The goal of the network is to minimize the reconstruction error: by reconstructing each sample as close to the original as possible, the network learns an effective representation of the original image. The loss is

$$\mathrm{MSE}=\frac{1}{3p^{2}}\sum_{i}\left(\hat{y}_{i}-y_{i}\right)^{2},$$

where $\hat{y}_{i}$ represents the predicted value and $y_{i}$ the true value. For each sample, we calculate the squared differences and sum them over all masked pixels; the sum is then divided by $3p^{2}$, where 3 is the number of channels per pixel, $p$ is the resolution of the patch (the patch size), and $p^{2}$ is the number of pixels per channel.
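A sketch of this masked reconstruction loss in PyTorch is shown below; the tensor layout is our assumption, and the normalization averages over all masked pixel values (3 per pixel), an equivalent per-value form of the normalization above.

```python
import torch

def masked_mse_loss(pred, target, mask):
    """MSE computed only where mask == 0 (the masked region).
    pred, target: (B, 3, H, W); mask: (B, 1, H, W) with 1 = visible."""
    hole = 1.0 - mask                                # 1 on reconstructed pixels
    sq_err = (pred - target) ** 2 * hole             # error on masked pixels only
    n_values = (hole.sum() * 3).clamp(min=1.0)       # 3 channel values per pixel
    return sq_err.sum() / n_values
```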
3.2. Fine-Tuning
Transfer learning: pre-train on an existing large-scale dataset, transfer the learned knowledge to a few-shot classification task, and fine-tune part of the layers to improve classification accuracy.
After pre-training, the encoder maps input data to a low-dimensional representation, which serves as the feature representation of the data for the subsequent fine-tuning. During fine-tuning, the weights are converted to the standard (dense) form, and the dense layers need no special treatment. As shown in Figure 4, we freeze the first few layers, i.e., we transfer the corresponding structural parameters of the encoder to the fine-tuning network, then add layer normalization and a linear layer and train them.
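As an illustration, a minimal sketch of this fine-tuning setup in PyTorch might look as follows; the encoder object, the feature width `dim`, and the number of classes are hypothetical placeholders, not values from the paper.

```python
import torch.nn as nn

def build_finetune_head(encoder, dim, num_classes):
    """Freeze the transferred encoder layers and attach the trainable
    LayerNorm + linear classification head described above."""
    for p in encoder.parameters():
        p.requires_grad = False          # frozen, transferred parameters
    head = nn.Sequential(
        nn.LayerNorm(dim),               # layer normalization
        nn.Linear(dim, num_classes),     # linear classifier
    )
    return head                          # only the head is trained
```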
Inspired by the ConvNeXt V2 network structure, we exploit the local connectivity and weight sharing of CNNs to obtain excellent QSAS identification performance. ConvNeXt [14] trains ResNet-50 with the same recipe as the Transformer [48]. On this basis, ConvNeXt experiments with macro design, ResNeXt-style grouped convolution, the inverted bottleneck, large kernel sizes, and various layer-wise micro designs. ConvNeXt V2 converts the supervised ConvNeXt network into an SSL network by following MAE, but all these networks were proposed and tested on ImageNet and other public datasets [49,50]. Radar signals differ greatly from these open datasets, so we adapt the network parameters to the characteristics of the radar signal. The specific structure of the ConvNeXt V2 block (small yellow blocks) and the downsampling block (small blue blocks) is shown in Figure 3. The design process and important details of the whole network are as follows.
First, the network downsamples the image through a convolution layer with a kernel size of 4×4 and a stride of 4, so the height and width of the image are reduced to 1/4 of the original. The feature map then passes successively through Stage 1, Stage 2, Stage 3, and Stage 4, each composed of a series of ConvNeXt V2 blocks. The structure of a ConvNeXt V2 block is as follows. Let the height, width, and channel count of the input feature map be h, w, and dim. The input passes through a depthwise convolution with a kernel size of 7×7, a stride of 1, and a padding of 3, followed by a LayerNorm; the output size is still h × w × dim. It then passes through a 1×1 convolution layer with a Gaussian error linear unit (GELU) activation; the height and width remain the same, and dim increases by a factor of 4. It then passes through another 1×1 convolution layer followed by a DropPath layer, and the output size is reduced back to h × w × dim. Finally, the input is added to the output as a residual connection. A final linear (fully connected) layer maps the features to the class labels.
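To make the block structure concrete, the following is a minimal PyTorch sketch following the description above and the public ConvNeXt V2 design; the GRN and DropPath modules are sketched later in this section, so identity placeholders keep the snippet self-contained.

```python
import torch
import torch.nn as nn

class ConvNeXtV2Block(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> 1x1 conv (dim -> 4*dim) -> GELU
    -> GRN -> 1x1 conv (4*dim -> dim) -> DropPath -> residual addition."""
    def __init__(self, dim, grn=None, drop_path=None):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # applied channels-last
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv as a linear layer
        self.act = nn.GELU()
        # GRN and DropPath are sketched later in this section; identity
        # placeholders keep this snippet runnable on its own.
        self.grn = grn if grn is not None else nn.Identity()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.drop_path = drop_path if drop_path is not None else nn.Identity()

    def forward(self, x):                        # x: (B, dim, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # to channels-last (B, H, W, dim)
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        x = x.permute(0, 3, 1, 2)                # back to channels-first
        return shortcut + self.drop_path(x)      # residual connection
```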
The network follows the structure of VGG [51], dividing the backbone into four stages. The number of blocks in the four stages is (3, 3, 9, 3), so the stage ratio is 1:1:3:1, consistent with the Swin-Transformer [52], and each stage has its own input channel width.
Depthwise convolution is a special case of the group convolution used in ResNeXt: each convolution kernel has a single channel and convolves with only one channel of the input, and the number of kernels equals both the input and output channel counts, as shown in Figure 5a. The channel count of the feature matrix therefore does not change, and a better balance between FLOPs and accuracy is achieved.
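In PyTorch, this corresponds to setting `groups` equal to the channel count (the channel width below is an arbitrary example):

```python
import torch.nn as nn

# Depthwise convolution: groups equals the channel count, so each kernel
# has a single channel and convolves exactly one input channel.
dim = 96                                   # example channel width (our choice)
dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
```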
The inverted bottleneck layer, narrow at both ends and wide in the middle, effectively avoids information loss, as shown in Figure 5b. In addition, the depthwise convolution is moved up and its kernel size is enlarged to 7×7, consistent with the window size in the Swin-Transformer.
We use three regularization techniques: LayerNorm (LN), Global Response Normalization (GRN), and DropPath; the details are as follows. Only one GELU activation function is used, between the two 1×1 convolutions.
LN is added only before the first convolution. LN normalizes each dimension of each sample, setting the mean to 0 and the variance to 1. It helps alleviate gradient vanishing and gradient explosion in neural networks and improves the generalization performance of the model.
GRN normalizes the features of different channels, making them comparable and thereby enhancing feature competition among channels. GRN helps avoid overfitting and improves the generalization performance of the model. For each channel $i$, GRN computes

$$X_{i}=\gamma\cdot X_{i}\cdot\frac{\lVert X_{i}\rVert_{2}}{\frac{1}{C}\sum_{j=1}^{C}\lVert X_{j}\rVert_{2}}+\beta+X_{i},$$

where $X$ denotes the input data, $\lVert X_{i}\rVert_{2}$ denotes the L2 norm of the $i$th channel, $C$ denotes the number of channels, and $\gamma$ and $\beta$ are two learnable parameters.
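For reference, a sketch of GRN on channels-last tensors, following the formula above (the small epsilon guarding against division by zero is an implementation detail):

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization on channels-last tensors (B, H, W, C)."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # ||X_i|| per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)     # divide by channel mean
        return self.gamma * (x * nx) + self.beta + x         # scale, shift, residual
```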
DropPath randomly selects some network layers for each training sample during forward propagation and sets their output to 0; since paths are deleted at random, each training sample sees a different sub-network. Another feature of DropPath is that it operates at the level of network layers, in particular on the deep structure of the network, which is often the bottleneck and tends to overfit. DropPath therefore helps regularize the deep structure, reducing overfitting.
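A common implementation of DropPath (stochastic depth) zeroes the residual branch per sample and rescales the survivors; a minimal sketch:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Randomly zero the residual branch per sample during training,
    rescaling kept samples so the expected output is unchanged."""
    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep = 1.0 - self.drop_prob
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)   # broadcast over all but batch
        mask = torch.rand(shape, device=x.device) < keep
        return x * mask / keep                         # rescale surviving paths
```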
The residual connection structure effectively avoids gradient vanishing and gradient explosion, allowing the network to learn features at greater depth.
A separate downsampling layer is used: a convolution with a stride of 2 is inserted between the stages, and an LN is added before and after the downsampling.
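A sketch of such a downsampling layer is shown below; the 2×2 kernel size is our assumption based on the standard ConvNeXt design, and only the LN before the convolution is shown.

```python
import torch.nn as nn

class Downsample(nn.Module):
    """Stride-2 convolution between stages, with LayerNorm applied
    channels-last before the convolution (the text adds LN after as well)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.norm = nn.LayerNorm(in_dim)
        self.conv = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)

    def forward(self, x):                        # x: (B, in_dim, H, W)
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.conv(x)                      # halves H and W
```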
The cross-entropy loss function is used for fine-tuning.