Figure 1 shows the network architecture of DPIG-Net. Its overall layout is similar to that of [43], but it is closely tied to ship polarization. The data used in this work came from the open OpenSARShip dataset, whose samples were acquired by the Sentinel-1 [44] SAR satellite. Sentinel-1 works in dual-polarization mode, i.e., vertical–vertical (VV) and vertical–horizontal (VH). The offered data were denoted by $X_{VV}$ and $X_{VH}$, which were in the form of complex numbers. Since the VV channel has higher scattering energy of ships [7], it was selected as the source of the middle main branch guiding the other branches for feature extraction; accordingly, the input of the middle main branch was denoted by $X_{VV}$. We selected the VH channel as the source of the upper branch, since VH reflects less scattering energy of ships than VV [7], and the input of the upper branch was denoted by $X_{VH}$. See [7] for more details.
Moreover, to fully leverage the polarization information, the lower branch in PCCAF was constructed to measure the polarization channel difference for a more comprehensive description of ship characteristics, and its input was given by:

$$X_{VV\text{-}VH} = X_{VV} \odot X_{VH}^{*} \quad (1)$$

where $\odot$ denotes element-wise multiplication and $*$ denotes a complex conjugate operation. Significantly, the $X_{VV}$ and $X_{VH}$ used in our work must be complex data, rather than the previously common amplitude-based real data. To the best of our knowledge, OpenSARShip might be the only dataset that can meet this requirement. Notably, FUSAR-Ship only offers amplitude (real-valued) data, so $X_{VV\text{-}VH}$ could not be obtained by Equation (1). Moreover, images in FUSAR-Ship are not paired in the form of VV–VH or HH–HV, which prevents the application of our network.
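To make Equation (1) concrete, the sketch below computes the polarization-channel input from a pair of complex SAR chips; the function name, array shapes, and the toy data are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

def polarization_channel_input(x_vv: np.ndarray, x_vh: np.ndarray) -> np.ndarray:
    """Element-wise product of the VV channel with the conjugated VH channel (Equation (1)).

    Both inputs are assumed to be complex-valued chips of the same shape.
    """
    assert x_vv.shape == x_vh.shape
    assert np.iscomplexobj(x_vv) and np.iscomplexobj(x_vh)
    return x_vv * np.conj(x_vh)

# Toy usage with random complex chips standing in for OpenSARShip samples.
rng = np.random.default_rng(0)
x_vv = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
x_vh = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
x_vv_vh = polarization_channel_input(x_vv, x_vh)
print(x_vv_vh.dtype, x_vv_vh.shape)  # complex128 (64, 64)
```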
In particular, our current work only considered the dual-polarization case due to the limitation of available data. If full-polarization data is available in the future, one can expand DPIG-Net into four parallel branches to receive four different polarization inputs (or more branches for the cross-channel model).
PCCAF received the three types of data ($X_{VV}$, $X_{VH}$, and $X_{VV\text{-}VH}$) for feature extraction. Its output was denoted by $F_P$, which contained the high-level semantic features [45] of the three types of data. DRDLF received $F_P$ for feature fusion through several cascaded dilated residual dense blocks and global residual learning from the main branch $F_{VV}$. Finally, the 2D feature maps were flattened into 1D feature vectors and transmitted into fully-connected (FC) layers. The terminal FC layer was responsible for category prediction with the soft-max function. Significantly, the reason that we set two fully connected layers was to gradually aggregate the flattened features, which was conducive to keeping important semantic features and training the network. More fully connected layers might provide benefits, but the amount of calculation and the number of parameters would increase sharply. Therefore, we only kept two fully connected layers in DRDLF.
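As a rough illustration of the classification head described above, the PyTorch sketch below flattens the fused 2D feature maps and aggregates them through two fully connected layers before the soft-max prediction; the channel count, spatial size, hidden width, and class number are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Two-FC-layer head: flatten -> FC -> FC -> soft-max class probabilities."""

    def __init__(self, in_channels=256, spatial=8, hidden=512, num_classes=6):
        super().__init__()
        self.fc1 = nn.Linear(in_channels * spatial * spatial, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):                      # x: (B, C, H, W) fused feature maps
        x = torch.flatten(x, start_dim=1)      # 2D maps -> 1D feature vectors
        x = torch.relu(self.fc1(x))            # first FC gradually aggregates features
        logits = self.fc2(x)                   # terminal FC predicts categories
        return torch.softmax(logits, dim=1)    # soft-max class probabilities

head = ClassificationHead()
probs = head(torch.randn(2, 256, 8, 8))
print(probs.shape)  # torch.Size([2, 6])
```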
DPIG-Net showed a tendency of feature aggregation from the three input branches to the terminal feature integration. Most previous works only adopted $X_{VV}$ to predict ship categories, i.e., the middle main branch of PCCAF. In contrast, we made full use of the polarization information ($X_{VH}$ and $X_{VV\text{-}VH}$) to guide the classification prediction of $X_{VV}$. We named this paradigm dual-polarization information-guided SAR ship classification.
2.1. Polarization Channel Cross-Attention Framework (PCCAF)
PCCAF established a simple encoder $E$ to preliminarily extract features from the three types of data. The encoder structure is shown in Table 1. The encoder $E$ used standard convs to extract features, batch normalization (BN) [46] to stabilize training, and ReLU to activate neurons. A max-pooling operation was used to reduce the size of the feature maps. With network deepening, the channel width doubled at each stage. Significantly, the number of channels is known to increase as the resolution decreases in order to prevent the loss of discriminative features [47]. Moreover, our feature encoder $E$ only had four stages, rather than the usual five stages [36]. This was to avoid the loss of spatial features [48] caused by the small size [49] of SAR ships. The encoder outputs were denoted by $F_{VV}$, $F_{VH}$, and $F_{VV\text{-}VH}$ for the subsequent processing. A more advanced encoder might achieve better performance, but that was not within the scope of this research.
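A minimal PyTorch sketch of one plausible four-stage encoder matching this description (conv + BN + ReLU + max-pooling per stage, channel width doubling with depth) is given below; the base channel width and input channel count are assumptions, since Table 1 is not reproduced here.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Four-stage encoder E: each stage is conv -> BN -> ReLU -> max-pool,
    and the channel width doubles from stage to stage."""

    def __init__(self, in_channels=1, base_channels=32):
        super().__init__()
        stages, c_in = [], in_channels
        for i in range(4):                                   # four stages, not five
            c_out = base_channels * (2 ** i)                 # 32, 64, 128, 256
            stages.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2),                 # halve the spatial size
            ))
            c_in = c_out
        self.stages = nn.Sequential(*stages)

    def forward(self, x):
        return self.stages(x)

# The encoder is applied to each polarization input to obtain F_VV, F_VH, F_VV-VH.
encoder = Encoder()
f_vv = encoder(torch.randn(2, 1, 128, 128))
print(f_vv.shape)  # torch.Size([2, 256, 8, 8])
```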
To better exploit the benefit of polarization information, we designed a cross-attention subnetwork to model the correlations between different polarization branches. The design concept of the cross-attention subnetwork was that the middle main branch generated referenced feature maps to guide the other two auxiliary branches. Most existing attention networks merely refine their own feature maps in an uncrossed mode, which cannot handle the multi-branch dual-polarization-guided case. That is, their module input has only one entry, whereas our proposed cross-attention subnetwork was specially designed for dual-polarization ship missions, i.e., our module input had two entries. The cross-attention subnetwork can be summarized as:

$$A = f\left(F_{\mathrm{ref}}, F_{\mathrm{cor}}\right) \quad (2)$$

where $F_{\mathrm{ref}}$ denotes the referenced feature maps (in this paper, $F_{VV}$, i.e., the main VV branch), $F_{\mathrm{cor}}$ denotes the feature maps to be corrected (in this paper, $F_{\mathrm{cor}}$ means the VH branch $F_{VH}$ or the polarization difference branch $F_{VV\text{-}VH}$), $f(\cdot)$ denotes the learned mapping, and $A$ denotes the cross-attention map.
Figure 2a shows the network implementation. Taking $F_{VV}$ and $F_{VH}$ as an example, the same procedure was applied to $F_{VV}$ and $F_{VV\text{-}VH}$. We first concatenated the two input feature maps directly, and then three convs with a skip connection were employed to learn the inputs' interrelations. Finally, the learned knowledge was activated by a sigmoid to obtain the final cross-attention map $A$. Significantly, the reason that we selected a sigmoid as the activation function is that a sigmoid is easily differentiable for backpropagation and narrows the range of attention weights in the cross-attention map, which favors stable network training. Moreover, in comparison with other activation functions, such as Tanh and ReLU, a sigmoid maps any real number to an output between 0 and 1, which is suitable for measuring the attention level of one position in a feature map [50]. Specifically, the closer an attention weight in the cross-attention map is to 0, the less important the feature at the corresponding position in the feature map, and vice versa.
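The following PyTorch sketch shows one plausible realization of this cross-attention subnetwork: the referenced and to-be-corrected feature maps are concatenated, passed through three convs with a skip connection, and squashed by a sigmoid into the cross-attention map. The layer widths, kernel sizes, and placement of the skip connection are assumptions, since those details are specified in Figure 2a rather than in the text.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Cross-attention subnetwork: A = f(F_ref, F_cor), cf. Equation (2)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_ref, f_cor):
        x = self.relu(self.conv1(torch.cat([f_ref, f_cor], dim=1)))  # concatenate the two inputs
        y = self.relu(self.conv2(x))
        y = self.conv3(y) + x                     # skip connection over the convs
        return torch.sigmoid(y)                   # attention weights in (0, 1)

# Usage: the main VV branch guides the VH (or VV-VH) branch.
ca = CrossAttention(channels=256)
a_vh = ca(torch.randn(2, 256, 8, 8), torch.randn(2, 256, 8, 8))
print(a_vh.shape)  # torch.Size([2, 256, 8, 8])
```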
Furthermore, for better skip-connection fusion between shallow low-level features and deep high-level features, we designed a self-attention module (SA-module) to refine the preceding features. The motivation for the SA-module was also related to SAR image characteristics, e.g., speckle noise and sea clutter; it can relieve the related interferences to enhance ship saliency, as shown in Figure 2a. The SA-module could highlight important global information in space [51], suppress low-value information, and promote network information flow. The ablation studies in Section 4.1 indicated that it could offer a ~2% accuracy improvement on the six-category task. The SA-module generated a self-attention map to modify its input, and the result was then added to the raw conv branch. The above can be described as:

$$X_{i+1} = \mathrm{Conv}_{3\times 3}\left(X_i\right) + \mathrm{SA}\left(X_i\right) \quad (3)$$

where $X_i$ denotes the $i$-th conv feature map, $\mathrm{SA}(\cdot)$ denotes the SA-module operation, and $\mathrm{Conv}_{3\times 3}(\cdot)$ denotes the 3 × 3 conv.
Figure 2b shows the implementation process of the SA-module. The representation of the input at the $j$-th position was embedded by $g(\cdot)$, which was instantiated by a 1 × 1 conv. The spatial features of the $i$-th position were embedded by $\theta(\cdot)$, and the spatial features of the $j$-th position were embedded by $\phi(\cdot)$. The relationship between the $i$-th position and the $j$-th position was calculated through the relationship function $f(\cdot,\cdot)$, which was defined as:

$$f\left(x_i, x_j\right) = \frac{1}{\mathcal{C}(x)}\, e^{\left(W_{\theta} x_i\right)^{\mathrm{T}} \left(W_{\phi} x_j\right)} \quad (4)$$

where $W_{\theta}$ and $W_{\phi}$ serve as learnable weights, and $\mathcal{C}(x)$ serves as a normalization factor to normalize the relationship between two positions for stable training of the network. In practice, we instantiated $\theta(\cdot)$ and $\phi(\cdot)$ each through a 1 × 1 conv. $f(\cdot,\cdot)$ was instantiated by soft-max along dimension $j$, where the term $\left(W_{\theta} x_i\right)^{\mathrm{T}} \left(W_{\phi} x_j\right)$ was instantiated by matrix multiplication after the 1 × 1 convs were completed. The response at the $i$-th position was obtained by a matrix element-wise multiplication between the input $x$ and the self-attention map. Significantly, the reason that soft-max was selected for normalization was derived from concerns about the definition of the relationship function $f(\cdot,\cdot)$. On the one hand, $f(\cdot,\cdot)$ needs a normalization factor as the denominator, in case network training becomes unstable [52]. On the other hand, $f(\cdot,\cdot)$ should be conveniently instantiated in consideration of efficiency and operability; using existing operators such as convolution and soft-max is suitable for instantiating $f(\cdot,\cdot)$ while designing a network. Therefore, using soft-max along dimension $j$ as the instantiation of $f(\cdot,\cdot)$ was a convenient method for normalization [51].
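The SA-module described above follows the general form of a non-local/self-attention block [51]; a compact PyTorch sketch under that reading is given below. The reduced embedding width and the output 1 × 1 conv are assumptions.

```python
import torch
import torch.nn as nn

class SAModule(nn.Module):
    """Self-attention (non-local style) module: theta/phi/g are 1x1 convs,
    and soft-max over the j dimension normalizes the pairwise relationships."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, kernel_size=1)
        self.phi = nn.Conv2d(channels, reduced, kernel_size=1)
        self.g = nn.Conv2d(channels, reduced, kernel_size=1)
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')  i-positions
        k = self.phi(x).flatten(2)                     # (B, C', HW)  j-positions
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)            # f(x_i, x_j): soft-max along j, Equation (4)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return self.out(y) * x                         # element-wise modification of the input

sa = SAModule(channels=64)
print(sa(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```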
The final resulting cross-attention maps acted on the other two branches by matrix element-wise multiplication to obtain the refined polarization-guided features $\tilde{F}_{VH} = A_{VH} \odot F_{VH}$ and $\tilde{F}_{VV\text{-}VH} = A_{VV\text{-}VH} \odot F_{VV\text{-}VH}$, which were used to guide the main polarization branch.
Finally, the output of the main polarization branch was the concatenation of the three types of features:

$$F_P = \mathrm{Concat}\left(\tilde{F}_{VH},\; F_{VV},\; \tilde{F}_{VV\text{-}VH}\right) \quad (5)$$

where $F_P$ denotes the output of PCCAF. We found that feature concatenation performed better than feature addition, because the former could avoid the resistance effects between different polarization features under our subsequent feature fusion operations.
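Putting the pieces above together, a hypothetical PCCAF forward pass (reusing the Encoder and CrossAttention sketches from earlier, and assuming the same encoder is applied to all three inputs) could look like the following.

```python
import torch

def pccaf_forward(encoder, cross_attn, x_vv, x_vh, x_vv_vh):
    """Hypothetical PCCAF forward pass: encode the three inputs, let the main VV
    branch guide the two auxiliary branches, then concatenate (Equation (5))."""
    f_vv = encoder(x_vv)                       # main (referenced) branch
    f_vh = encoder(x_vh)                       # upper auxiliary branch
    f_vv_vh = encoder(x_vv_vh)                 # lower (polarization-difference) branch

    a_vh = cross_attn(f_vv, f_vh)              # cross-attention maps, Equation (2)
    a_vv_vh = cross_attn(f_vv, f_vv_vh)

    f_vh_guided = a_vh * f_vh                  # element-wise refinement
    f_vv_vh_guided = a_vv_vh * f_vv_vh
    return torch.cat([f_vh_guided, f_vv, f_vv_vh_guided], dim=1)  # F_P
```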
2.2. Dilated Residual Dense Learning Framework (DRDLF)
DRDLF used several dilated residual dense blocks (DRDBs) to fuse the extracted polarization features coming from the previous PCCAF stage. The input of DRDLF was denoted as $F_P$, which was associated with the dual-polarization information through the concatenation operation of Equation (5), where $\tilde{F}_{VH}$ denotes the feature maps of the VH information, $F_{VV}$ denotes those of the VV information, and $\tilde{F}_{VV\text{-}VH}$ denotes those of the VV–VH correlation information. $F_P$ was refined by a 3 × 3 conv for feature concentration and channel dimensionality reduction, and the result was denoted by $F_0$. Then, several cascaded DRDBs were used for feature aggregation. DRDB was motivated by RDB [53], which was designed for image super-resolution tasks. However, there is much speckle noise around SAR ship images [54,55], so we inserted a dilation rate of 2 into the standard conv for larger receptive fields.
Figure 3 shows the DRDB's implementation. Its input was the previous output $F_{d-1}$, and its output was denoted by $F_d$. A DRDB contained three 3 × 3 conv layers with a dilation rate of 2, and their results were denoted by $F_{d,1}$, $F_{d,2}$, and $F_{d,3}$, respectively. They were concatenated directly as $F_{d,c}$. To meet the requirement of the residual connection across the entire DRDB, a 1 × 1 conv was used for channel reduction. Finally, the sum of $F_{d-1}$ and the reduced result was its output $F_d$. In DRDLF, we arranged $D$ DRDBs for feature fusion, where $D$ was empirically set to the optimal value of 3. The results of the $D$ DRDBs, from $F_1$ to $F_D$, were concatenated and then processed by a 1 × 1 conv for overall channel reduction; the result was denoted by $F_{GF}$. Significantly, we did not select dilated convs with a higher dilation rate or more dilated convs for feature extraction. Even though a higher dilation rate and more dilated convs can obtain a larger receptive field, which is helpful for extracting contextual information and discriminating between the foreground and the background [56], this would deteriorate the spatial details of ships, especially in the case of low-resolution SAR images. Therefore, the chosen dilation rate and number of dilated convs were more of a trade-off in the design of the network.
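A sketch of one possible DRDB under this description is shown below: three dilated 3 × 3 convs (dilation rate 2) whose outputs are densely connected and concatenated, a 1 × 1 conv for channel reduction, and a residual connection over the whole block. The growth rate, channel counts, and placement of the activations are assumptions.

```python
import torch
import torch.nn as nn

class DRDB(nn.Module):
    """Dilated residual dense block: three dilated 3x3 convs with dense
    connections, a 1x1 conv for channel reduction, and a block-level residual."""

    def __init__(self, channels, growth=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, growth, 3, padding=2, dilation=2)
        self.conv2 = nn.Conv2d(channels + growth, growth, 3, padding=2, dilation=2)
        self.conv3 = nn.Conv2d(channels + 2 * growth, growth, 3, padding=2, dilation=2)
        self.reduce = nn.Conv2d(3 * growth, channels, kernel_size=1)  # channel reduction
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_prev):
        f1 = self.relu(self.conv1(f_prev))
        f2 = self.relu(self.conv2(torch.cat([f_prev, f1], dim=1)))
        f3 = self.relu(self.conv3(torch.cat([f_prev, f1, f2], dim=1)))
        fc = torch.cat([f1, f2, f3], dim=1)           # F_{d,c}: direct concatenation
        return f_prev + self.reduce(fc)               # residual over the whole block

drdb = DRDB(channels=64)
print(drdb(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```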
Significantly, we observed that after a series of DRDB processing steps with multiple dense connections, the details of the main VV branch might be gradually diluted, causing unstable training and deteriorating performance. Thus, inspired by [57], we proposed global residual learning to solve this problem. As shown in Figure 1, the global residual learning connected PCCAF and DRDLF, thus maintaining the dominant position of the main branch and allowing the other two branches to smoothly play an auxiliary guiding role. This was an important design aspect of our dual-polarization-guided network. The global residual learning was described by:

$$F_{\mathrm{DRDLF}} = F_{GF} + F'_{VV} \quad (6)$$

where $F_{\mathrm{DRDLF}}$ denotes the final output of DRDLF. As shown in Figure 1, we set another two 3 × 3 convs to process $F_{VV}$ to obtain more semantic features $F'_{VV}$, which was helpful for balancing spatial and semantic information.
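Under the assumptions above, the DRDLF data flow could be sketched as follows, reusing the hypothetical DRDB class; the number of DRDBs follows the empirically optimal D = 3 from the text, while all channel widths remain placeholders.

```python
import torch
import torch.nn as nn

class DRDLF(nn.Module):
    """Dilated residual dense learning framework with global residual learning."""

    def __init__(self, in_channels=768, channels=64, vv_channels=256, num_drdb=3):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, 3, padding=1)            # F_P -> F_0
        self.drdbs = nn.ModuleList([DRDB(channels) for _ in range(num_drdb)])
        self.fuse = nn.Conv2d(num_drdb * channels, channels, kernel_size=1)   # -> F_GF
        self.vv_convs = nn.Sequential(                                        # F_VV -> F'_VV
            nn.Conv2d(vv_channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, f_p, f_vv):
        f = self.head(f_p)                            # F_0
        outs = []
        for drdb in self.drdbs:
            f = drdb(f)                               # F_1, ..., F_D
            outs.append(f)
        f_gf = self.fuse(torch.cat(outs, dim=1))      # overall channel reduction
        return f_gf + self.vv_convs(f_vv)             # global residual learning, Equation (6)
```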
To sum up, combined with the above designed PCCAF and DRDLF, our proposed DPIG-Net could make full use of the polarization information ignored in previous works. The other two types of polarization data were well refined to assist in the feature extraction and feature fusion of the main branch. Finally, an effective dual-polarization information-guided SAR ship classification paradigm was realized. DPIG-Net successfully handled the problems of how to conduct polarization guidance and how to carry out more effective polarization guidance, which are of great value.