In this section, we introduce in detail the lightweight CNN that we propose for low-resolution SAR images. We propose an attention-based multi-stream convolutional neural network (AMS-CNN) for SAR image classification that extracts key and diverse features, as depicted in
Figure 2. AMS-CNN consists of three parts. The first part uses three convolutional blocks and pooling layers to extract low-level features, where each convolutional block contains two convolutional layers with different numbers of channels. The first part also employs attention blocks in two domains to focus on the important features, where the output of each attention module is the product of the input features and the attention features. The second part extracts diverse high-level features (i.e., maximum, average, and median features) from three streams using different strides and channel numbers. Finally, the third part combines the features extracted by the three streams for image classification, and the AMS-CNN output is the type of vehicle. Moreover, we explore different GAN variants for foliage-penetrating SAR image generation.
3.1. Basic Functions of CNN
In this section, we introduce the basic functions of CNN: the activation function, loss function, and dropout.
Activation Function. The activation function follows the convolution layer and expresses the nonlinear mapping between the input and output. Traditional nonlinear activation functions include the hyperbolic tangent and the sigmoid. However, the gradients of these functions vanish over a wide range of inputs, which hinders the training of the network. Recently, ReLU has attracted more attention for its favorable properties: it produces non-zero outputs and gradients when the input is greater than zero, reducing gradient vanishing. However, a network with the ReLU activation function produces no gradient for inputs less than zero. To make the activation function effective for all input values, LeakyReLU extends ReLU with a small slope on the negative side. The function of LeakyReLU is as follows:
$$ \mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0, \end{cases} $$
where $\alpha$ is a small positive slope.
Dropout. Dropout refers to the random omission of the hidden units from the network with a given probability on training cases, which can effectively reduce overfitting, especially with a limited dataset. It is always used in the fully connected layer, which makes the networks more lightweight and robust.
Loss Function. In network back propagation, the parameters of the network are updated according to a specific strategy driven by the loss function. The most popular loss functions are the mean square error and the cross-entropy. For our classification problem, the cross-entropy loss function more accurately expresses the relationship between the network output and the real distribution, reflecting the difference between the real and predicted labels. For $n$ samples $\{(x_i, y_i)\}_{i=1}^{n}$, the cross-entropy loss function is as follows:
$$ L(W) = -\frac{1}{n}\sum_{i=1}^{n} y_i \log \hat{y}_i, $$
where $W$ denotes the parameters of the network, $y_i$ is the real label of the sample $x_i$, and $\hat{y}_i$ is the predicted label of the sample $x_i$. Adding parameter regularization (i.e., weight decay) to the loss function can reduce model overfitting. The loss function with regularization is as follows:
$$ L_{\mathrm{reg}}(W) = -\frac{1}{n}\sum_{i=1}^{n} y_i \log \hat{y}_i + \frac{\lambda}{2}\sum_{l=1}^{L} \left\| W^{(l)} \right\|_2^2, $$
where $L$ is the total number of layers, $l$ is the layer index, $W^{(l)}$ is the parameter of the $l$th layer, and $\lambda$ is the regularization parameter.
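For illustration, a minimal PyTorch sketch of the cross-entropy loss with the L2 regularization term is given below; the network and the value of $\lambda$ are placeholders rather than the settings used for AMS-CNN:

```python
import torch
import torch.nn as nn

def regularized_loss(model, logits, labels, lam=1e-4):
    """Cross-entropy plus an L2 penalty over the parameters of every layer."""
    ce = nn.functional.cross_entropy(logits, labels)        # cross-entropy between predicted and real labels
    l2 = sum((w ** 2).sum() for w in model.parameters())    # sum of squared parameters over all layers
    return ce + 0.5 * lam * l2                              # lam plays the role of the regularization parameter
```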
3.2. Attention Mechanism
Channel attention module. Channel attention focuses on the inter-channel relationship of features, as presented in
Figure 3. The channel attention module first applies max-pooling and average-pooling to the input feature map along the spatial axes, each producing a one-dimensional vector, and then passes both vectors through a shared multi-layer perceptron (MLP) to obtain two kinds of channel information. Using both max-pooling and average-pooling yields more comprehensive channel information. The MLP is composed of three layers, with the middle layer acting as a bottleneck whose number of units is one-third of that of the first layer. The two kinds of channel information produced by the MLP are summed, and a sigmoid is then applied. The channel attention function is as follows:
$$ M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), $$
where $F \in \mathbb{R}^{C \times H \times W}$ is the input feature map of the channel attention module, AvgPool and MaxPool are the operations of taking the average and the maximum of each feature map, MLP is the multi-layer perceptron, and $\sigma$ represents the sigmoid function.
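For concreteness, a minimal PyTorch sketch of such a channel attention module is given below; the bottleneck follows the one-third reduction described above, while the hidden-layer activation (ReLU) is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention: a shared MLP over average- and max-pooled channel descriptors."""
    def __init__(self, channels):
        super().__init__()
        hidden = max(channels // 3, 1)               # bottleneck of roughly one-third the input size
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),                               # hidden activation is an assumption
            nn.Linear(hidden, channels),
        )

    def forward(self, x):                            # x: (B, C, H, W)
        avg = F.adaptive_avg_pool2d(x, 1).flatten(1) # average over the spatial axes -> (B, C)
        mx = F.adaptive_max_pool2d(x, 1).flatten(1)  # maximum over the spatial axes -> (B, C)
        attn = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        return x * attn.view(x.size(0), -1, 1, 1)    # output = input features x attention features
```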
Spatial attention module.
Figure 4 depicts the structure of the spatial attention module, which focuses on the inter-spatial relationship of features. Similar to the channel attention module, the spatial attention module applies average-pooling and max-pooling, this time along the channel axis, to highlight the key spatial information. The two resulting maps are then concatenated along the channel dimension, giving a two-channel feature map. A convolutional layer converts it to a one-channel feature map, and a sigmoid is then applied. The spatial attention function is as follows:
$$ M_s(F) = \sigma\big(f^{7\times 7}\big([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)]\big)\big), $$
where $F \in \mathbb{R}^{C \times H \times W}$ is the input feature map of the spatial attention module, $f^{7\times 7}$ is a convolution with a kernel size of 7, and $\sigma$ represents the sigmoid function.
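A matching sketch of the spatial attention module follows; the padding of 3 (chosen so that the 7 × 7 convolution preserves the spatial size) is an assumption:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: a 7x7 convolution over channel-wise average and max maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # padding preserves the spatial size

    def forward(self, x):                             # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)             # average along the channel axis -> (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)              # maximum along the channel axis -> (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                               # output = input features x attention features
```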
3.3. Structure of AMS-CNN
First part. Three convolutional blocks and two attention modules make up the first part. Each convolutional block has two convolution layers and one pooling layer, with the convolution layers being the key to extracting features. The convolution kernels of each layer are set to 3 × 3, and the stride is 1. After the convolutional layers, a pooling layer compresses the local information. The pooling layer can perform average or maximum pooling; maximum pooling is used in AMS-CNN, with pooling kernels of 2 × 2 and a stride of 2. The activation function is LeakyReLU, which introduces nonlinearity into the network. The convolution layers of the three convolutional blocks output 16, 32, and 64 feature channels, respectively, and the output of the third convolutional block is fed to the three streams of the second part. AMS-CNN employs channel and spatial attention modules in the first part, with the channel attention module placed after the first convolutional block and the spatial attention module after the second convolutional block.
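Reusing the attention modules sketched in Section 3.2, a hedged sketch of the first part could look as follows; the single-channel input, "same" padding, and LeakyReLU slope of 0.01 are assumptions not specified in the text:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions (stride 1) followed by 2x2 max pooling, as described above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.LeakyReLU(0.01),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.LeakyReLU(0.01),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

first_part = nn.Sequential(
    conv_block(1, 16),        # block 1: 16 channels (single-channel SAR input assumed)
    ChannelAttention(16),     # channel attention after the first block
    conv_block(16, 32),       # block 2: 32 channels
    SpatialAttention(),       # spatial attention after the second block
    conv_block(32, 64),       # block 3: 64 channels
)
```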
Second part. The second part includes three streams. We use $O_j^k$ to denote the output of the $k$th layer of the $j$th stream of the second part, where $j = 1, 2, 3$ corresponds to the maximum, average, and median structures, respectively. The maximum structure includes a convolutional layer with a large stride to extract local features; it extracts 64 feature channels with a convolution kernel of 3 × 3. The average and median structures instead focus on details through a convolutional layer with a small stride, in which the convolution kernel is set to 1 × 1 and the number of output channels is 512. All three structures apply a global max pool after the convolutional layer. The output of the maximum structure is given in Equation (6). The average and median structures are identical for the first two layers, so the output of their global max pool is given in Equation (7):
$$ O_1^2(i) = \max_{1 \le p \le h,\ 1 \le q \le w} O_1^1(p, q, i), \tag{6} $$
$$ O_2^2(i) = O_3^2(i) = \max_{1 \le p \le h,\ 1 \le q \le w} O_2^1(p, q, i), \tag{7} $$
where $h$ and $w$ are the height and width of the feature map, and $i$ indexes the $i$th channel. A fully connected layer with dropout forms the third layer of the maximum structure. In contrast to the maximum structure, the third layer of the average and median structures adopts group averaging and group median, with nearly fixed weights and biases, to extract features. These three structures not only acquire specific features but also increase the diversity of the features. The output of group averaging is given in Equation (8), which divides the features into groups according to the number of classes and then averages each group. As shown in Equation (9), the group median likewise groups the features and computes the median within each group. The average and median structures share all weights except for the last layer, which reduces the number of parameters of AMS-CNN:
$$ O_2^3(c) = \frac{1}{f}\sum_{i=(c-1)f+1}^{cf} O_2^2(i), \quad c = 1, \dots, \frac{a}{f}, \tag{8} $$
$$ O_3^3(c) = \mathrm{median}\big\{\, O_3^2(i) : (c-1)f < i \le cf \,\big\}, \quad c = 1, \dots, \frac{a}{f}, \tag{9} $$
where $a$ is the number of features of the second layer, $f$ is the group size, and $a/f$ is the number of classes.
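For illustration, the group averaging and group median of Equations (8) and (9) can be sketched as follows, assuming the second-layer features are arranged as consecutive groups of size $f = a$ divided by the number of classes:

```python
def group_average(features, num_classes):
    """Equation (8): split the a features into a/f groups of size f and average each group."""
    b, a = features.shape                     # features: torch.Tensor of shape (batch, a)
    return features.view(b, num_classes, a // num_classes).mean(dim=-1)

def group_median(features, num_classes):
    """Equation (9): split the a features into a/f groups of size f and take each group's median."""
    b, a = features.shape
    return features.view(b, num_classes, a // num_classes).median(dim=-1).values
```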
Third part. The third part combines the maximum, median, and average features extracted in the second part and classifies the objects. The Adam optimizer was used to update the trainable parameters with respect to the loss function. The loss function of AMS-CNN combines the cross-entropy loss functions $L_1$, $L_2$, and $L_3$ of the three streams through a weighting coefficient $\lambda$. We adopted 0.5 for the value of $\lambda$.
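As an illustrative sketch only, assuming the per-stream losses are combined as a weighted sum in which the average and median streams are scaled by $\lambda$ (the specific combination used in AMS-CNN may differ), and with `model` as a placeholder AMS-CNN that returns one logit tensor per stream, a training step could be written as:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
# model is a placeholder network returning one logit tensor per stream (assumed interface)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # learning rate is an assumption

def training_step(images, labels, lam=0.5):
    logits_max, logits_avg, logits_med = model(images)
    l1 = criterion(logits_max, labels)        # L1: maximum stream
    l2 = criterion(logits_avg, labels)        # L2: average stream
    l3 = criterion(logits_med, labels)        # L3: median stream
    loss = l1 + lam * (l2 + l3)               # assumed weighted combination, lambda = 0.5
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```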
3.4. Data Generation with GANs
Deep learning aims to train a model that approximates the true underlying model, and the amount of sample data is crucial. A large dataset helps the trained model approach the real model, resulting in greater generalization ability and less overfitting. Large datasets (e.g., ImageNet, CIFAR-10, and MNIST) are widely used in image classification and recognition research. AlexNet [27] achieved a top-5 error rate of 17% on 1000 different classes using the 1.2 million images of ImageNet. Kaiming He et al. [6] used CIFAR-10, with a data volume of 60,000 images, to verify the proposed residual network structure. The content of these datasets is common in daily life, which makes them comparatively easy to collect.
However, open-source datasets of SAR images are scarce because they are difficult to collect and laborious to label. Data augmentation uses existing data to increase the amount of data. The most prevalent data augmentation methods, such as flipping, cropping, and scaling, do not increase data diversity, resulting in models with insufficient stability. Since its release, GAN has been actively employed for dataset augmentation because of the diversity of the data it generates.
GAN is a generative modeling method that can be trained on a small amount of data. A GAN is made up of two components: the generator and the discriminator. The generator generates images from random Gaussian noise, and the discriminator determines whether the images created by the generator are real images. The goal of GAN training is for the generator to produce images that can fool the discriminator.
In the vanilla GAN, the generator G and discriminator D are fully connected networks. G is the generator network: its input is random noise of a given dimension, and its output is the generated image data. D is the discriminator network (i.e., a classifier) used to identify whether the input data represent a real image. The process of GAN training is depicted in
Figure 5. The randomly generated noise is the generator's input, and the generator's output is the discriminator's input. Simultaneously, the real data with labels are fed to the discriminator. The parameters of the generator and discriminator are updated iteratively until the images generated by the generator successfully fool the discriminator. The GAN loss function is as follows:
$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big], $$
where $x$ is the real sample, $z$ is the noise sample, $G$ is the generator network, $D$ is the discriminator network, $G(z)$ is the data generated from the noise by the generator network, $D(x)$ is the probability that the discriminator judges $x$ to be real sample data, and $D(G(z))$ is the probability that the discriminator judges the generated data to be real.
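A compact sketch of this adversarial training loop with a binary cross-entropy loss is shown below; the noise dimension, the optimizers, and the assumption that the discriminator ends with a sigmoid are placeholders for illustration:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()   # binary cross-entropy, matching the vanilla GAN objective

def gan_step(G, D, opt_G, opt_D, real, noise_dim=100):
    """One adversarial update; D is assumed to output a probability in (0, 1)."""
    b = real.size(0)
    real_labels = torch.ones(b, 1)
    fake_labels = torch.zeros(b, 1)

    # Discriminator update: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(b, noise_dim)
    fake = G(z)
    d_loss = bce(D(real), real_labels) + bce(D(fake.detach()), fake_labels)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update: push D(G(z)) toward "real" (non-saturating form)
    g_loss = bce(D(fake), real_labels)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```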
Since the vanilla GAN was proposed, many works [
25,
28,
29] successively improved it and proposed various variants, where DCGAN [
28] and LSGAN [
25] are the better-known GAN variants. The vanilla GAN tends to suffer from vanishing gradients because the Jensen-Shannon (JS) divergence is used to measure the distance between the two distributions. The JS divergence is constant when the two distributions do not intersect, and in practice the probability of an intersection between the data distribution generated by the generator and the real data distribution is negligible. Consequently, the discriminator can easily tell the two apart, and the generator receives almost no gradient. LSGAN weakens the discriminator to address this problem by replacing the sigmoid activation layer in the discriminator with a linear activation layer and adopting a least-squares loss. Even when the two distributions do not intersect, this loss still provides useful gradients, alleviating the vanishing-gradient problem.
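As a sketch, the least-squares losses of LSGAN with the common 0/1 target coding can be written as follows, where the discriminator is assumed to have a linear (non-sigmoid) output layer as described above:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def lsgan_losses(D, real, fake):
    """Least-squares losses with 0/1 target coding; D has a linear output layer."""
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)
    d_loss = 0.5 * (mse(D(real), ones) + mse(D(fake.detach()), zeros))  # real scores -> 1, fake scores -> 0
    g_loss = 0.5 * mse(D(fake), ones)                                   # generator pushes fake scores toward 1
    return d_loss, g_loss
```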
DCGAN is another variant based on GAN and is consistent with GAN in principle. Given the successful application of CNNs in image recognition, DCGAN replaces the multilayer-perceptron networks of the generator and discriminator with convolutional networks. Although GAN's elegant network design enables it to learn the data distribution in a self-supervised manner, its training is unstable and can lead to meaningless results. DCGAN is based on convolutional networks, and the details of its network structure are adjusted to speed up convergence and improve the quality of the samples. These structural adjustments include adding batch normalization to the generator and discriminator networks, using the tanh activation function in the last layer of the generator and the rectified linear unit (ReLU) activation function in its other layers, and using LeakyReLU as the activation function in the discriminator.
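These adjustments can be illustrated with a DCGAN-style sketch; the layer counts, channel widths, 100-dimensional noise, and 32 × 32 single-channel output are assumptions for illustration rather than the configuration used in this work:

```python
import torch.nn as nn

# Generator: transposed convolutions with batch normalization and ReLU,
# and tanh on the last layer, following the DCGAN guidelines.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),  # 1x1 -> 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 4x4 -> 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 8x8 -> 16x16
    nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh(),                          # 16x16 -> 32x32, range [-1, 1]
)

# Discriminator: strided convolutions with batch normalization and LeakyReLU,
# ending with a sigmoid that outputs the real/fake probability.
discriminator = nn.Sequential(
    nn.Conv2d(1, 64, 4, 2, 1), nn.LeakyReLU(0.2),                           # 32x32 -> 16x16
    nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),    # 16x16 -> 8x8
    nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),   # 8x8 -> 4x4
    nn.Conv2d(256, 1, 4, 1, 0), nn.Sigmoid(), nn.Flatten(),                 # 4x4 -> 1x1 probability
)
```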
Previous studies have proposed different metrics to evaluate GAN models. The inception score (IS) and the Fréchet inception distance (FID) [30] are two widely used evaluation indicators. IS assesses the generated images based on the KL divergence between two distributions, while FID is based on the means and covariances of the data distributions. IS evaluates GAN networks from two perspectives: the quality and the diversity of the generated images. The higher the quality of a generated image, the smaller the conditional entropy; the larger the entropy of the marginal label probability, the more diverse the generated images. IS is therefore expressed through the Kullback-Leibler (KL) divergence, where a larger KL divergence indicates higher diversity and quality of the generated images. FID is based on the Inception network, which extracts a 2048-dimensional feature. The principle of FID is that the generated data distribution should be as similar as possible to the real data distribution, and the mean and covariance are used to calculate the distance between the two distributions. The IS and FID formulas are as follows:
$$ \mathrm{IS}(G) = \exp\Big( \mathbb{E}_{x \sim p_g}\, D_{KL}\big( p(y \mid x)\, \|\, p(y) \big) \Big), $$
where $p_g$ is the data distribution of the generated images, $x \sim p_g$ is an image sampled from $p_g$, $D_{KL}(p\,\|\,q)$ is the KL divergence of the distributions $p$ and $q$, $p(y \mid x)$ is the distribution of the label $y$ under the condition of $x$, and $p(y)$ is the marginal distribution of the label $y$.
$$ \mathrm{FID} = \big\| \mu_r - \mu_g \big\|^2 + \mathrm{Tr}\Big( \Sigma_r + \Sigma_g - 2\big( \Sigma_r \Sigma_g \big)^{1/2} \Big), $$
where $\mu_r$ is the distribution mean of the real data, $\mu_g$ is the distribution mean of the generated data, $\Sigma_r$ is the covariance of the real data distribution, $\Sigma_g$ is the covariance of the generated data distribution, and $\mathrm{Tr}$ is the trace of the matrix. The lower the value of FID, the closer the two distributions and the higher the quality and diversity of the generated images. Therefore, this study combines the IS and FID scores to evaluate the images generated by different GANs. The results in
Section 4.1 demonstrate that the foliage-penetrating SAR images obtained by the DCGAN network have larger IS and smaller FID scores.
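Given 2048-dimensional Inception features for the real and generated images, FID can be computed from their means and covariances as in the following sketch (the feature extraction step is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(real_feats, gen_feats):
    """FID between two arrays of Inception features with shape (N, 2048)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)      # covariance of the real-feature distribution
    cov_g = np.cov(gen_feats, rowvar=False)       # covariance of the generated-feature distribution
    cov_mean = sqrtm(cov_r @ cov_g)               # matrix square root of the covariance product
    if np.iscomplexobj(cov_mean):                 # discard tiny imaginary parts from numerical error
        cov_mean = cov_mean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * cov_mean))
```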
3.5. CARABAS-II Dataset
In remote sensing, the absence of labeled data impedes the training of deep networks, notably in synthetic aperture radar (SAR) image interpretation, in contrast to the large-scale annotated datasets available for natural images.
Large datasets can significantly increase the accuracy and stability of deep learning networks, but they are not often available because of the cost of collection and the difficulty of labeling. SAR images, especially foliage-penetrating SAR images, are difficult to obtain. To our knowledge, CARABAS-II [
31] is the only open-source foliage-penetrating SAR image dataset, and it contains just 24 images. To improve the reliability of AMS-CNN, we utilize the CARABAS-II dataset and augment it with DCGAN.
The CARABAS-II dataset [
31], which comprises 24 VHF-band SAR images obtained during a flight campaign in northern Sweden, was utilized. The SAR system uses HH-polarized radio waves with a transmission frequency of 20–90 MHz and obtains images with a resolution of 2.5 × 2.5 m. The data were acquired over four vehicle deployments during the campaign, named Sigismund, Karl, Fredrik, and Adolf–Fredrik, with the imaged area encompassing 25 military vehicles concealed in the forest. Three types of military vehicles were present in the imaged area: TGB11, TGB30, and TGB40; their SAR and natural images are depicted in
Figure 1.
The parameters that change in gathered data include incidence angle, flight heading, radio frequency interference (RFI), forest location, and target heading, as shown in
Table 1. There are 12 different settings, each repeated, resulting in 24 images. Each captured image covers an area of size 2 × 3 km, containing 10 TGB11s with dimensions 4.4 × 1.9 × 2.2 m, 8 TGB30s with dimensions 6.8 × 2.5 × 3.0 m, and 7 TGB40s with dimensions 7.8 × 2.5 × 3.0 m. Each pixel size is 1 × 1 m, so the image data have 3000 rows and 2000 columns, and the distance between two vehicles is about 50 m. Given that the extracted images only include one type of vehicle, we obtained SAR images of each vehicle with a size of 38 × 38, corresponding to the setting in [
5].