Contrastive Learning Network Based on Causal Attention for Fine-Grained Ship Classification in Remote Sensing Scenarios

Abstract: Fine-grained classification of ship targets is an important task in remote sensing, with numerous applications in military reconnaissance and sea surveillance. Due to the influence of various imaging factors, ship targets in remote sensing images exhibit considerable inter-class similarity and intra-class difference, which brings significant challenges to fine-grained classification. In response, we developed a contrastive learning network based on causal attention (C2Net) to improve the model's fine-grained identification ability from local details. An asynchronous feature learning mode of "decoupling + aggregation" is adopted to reduce the mutual influence between local features and improve their quality. In the decoupling stage, the feature vectors of each part of the ship target are de-correlated using a decoupling function to prevent feature adhesion. Considering the possibility of false associations between results and features, the decoupling stage is designed around a counterfactual causal attention network to enhance the model's predictive logic. In the aggregation stage, the local attention weights learned in the decoupling stage are used to fuse the trunk features. Then, the proposed feature re-association module is used to re-associate and integrate the local target information contained in the fused features to obtain the target feature vector. Finally, the aggregation function is used to complete the clustering of the target feature vectors, realizing fine-grained classification. Experimental results on two large-scale datasets show that the proposed C2Net achieves better fine-grained classification performance than other methods.


Introduction
Ships are an important type of remote sensing target, and their classification and recognition are of great significance in both the military and civilian fields. In the military field, ship target identification is important in the deployment of maritime forces for reconnaissance and provides important intelligence support for command and decision-making. In civilian applications, accurate ship type identification plays a vital role in maritime rescue, fishery management, and anti-smuggling. Therefore, ship target recognition has broad application prospects and value.
Ship target classification tasks can be roughly divided into target-level, coarse-grained, and fine-grained classification. Research on target-level classification [1,2] mainly focuses on distinguishing targets from the background, which is essentially a binary task. Coarse-grained classification [3] focuses on categories with significant differences, such as aircraft carriers, warships, and merchant ships. Fine-grained classification [4][5][6][7][8] further distinguishes each category into sub-categories. The inter-class similarity and intra-class difference caused by different imaging factors (e.g., illumination, perspective change, and deformation) make the fine-grained classification task more challenging, as shown in Figure 1. Therefore, the high-precision, fine-grained classification of ship targets has become a research hotspot in computer vision. Some strongly supervised methods use target local location tags [9] or key location tags [10] to train the model's localization and feature extraction abilities for local areas. For example, Zhang et al. [11] used additional label information to train attribute-guided branches to achieve multi-level feature fusion. Chen et al. [12] conducted ship recognition by training a multi-level feature extraction network. Such strong supervision methods can improve the fine-grained classification effect. However, their demanding training information requirements, such as local annotation of the dataset or target area localization and slicing, result in low efficiency and limited applicability.
In contrast, weak supervision methods do not require additional supervision information and use only image-level annotations to train the model. Lin et al. [13] proposed the use of bilinear pooling to integrate features extracted from the same position by parallel branches to obtain more representative features. Huang et al. [14] connected a convolutional neural network (CNN) structure and a self-attention transformer structure in parallel to focus on the multi-scale features of ships. The global-to-local progressive learning module (GLPM) designed by Meng et al. [15] enhances the fine-grained feature extraction capability by promoting information exchange between global and local features. In general, weak supervision methods greatly reduce the dependence on auxiliary supervision information and are more practical.
The above methods focus on the discriminative details of the target. The strong supervision methods reduce the model's learning difficulty and guide it toward targeted training by adding auxiliary training information through manual labeling. The weak supervision methods realize automatic discriminative feature learning through the design of the model structure. Both, however, are typical black-box deep learning models that ignore the real intrinsic causal relationship between the predicted results and the attended regions. This causes the prediction results to rely on false correlations between the two, hindering the model from learning the target's discriminative features.
To solve this problem, some studies have introduced causal reasoning into deep learning and explored causal reasoning networks. Early causal reasoning was mainly applied to statistical models with few variables and was not suitable for computer vision. Pearl et al. [16] pointed out that data fitting at the statistical level cannot establish a causal relationship between input and output; only models with counterfactual reasoning ability can make logical decisions similar to those of the human brain. As research has deepened, methods combining causal reasoning and deep learning have become widely applied in computer vision. Rao et al. [17] used counterfactual causality to learn multi-head attention and analyze its influence on prediction. Xiong et al. [18] focused on the predictive logic and interpretability of networks, concentrating on specific parts of ship targets by combining counterfactual causal attention with convolution filters and visualizing the decision basis. Counterfactual causal reasoning gives the model the ability to make decisions at the logical level and enhances attention to local details.
The supervision information of the above methods comes from the output, and the model's training process is guided through the output loss. Chen et al. [19] pointed out that the fine-grained classification process is a push-and-pull process in which different feature classes are separated while similar classes are aggregated. With the loss function attached only to the output end, the features can only be pushed and pulled synchronously. Features of similar subclasses cannot then be completely separated, resulting in a certain adhesion, which adversely affects feature aggregation. To address the limitations of the synchronous push-pull mode, a contrastive-learning-based network is proposed in which the separation and aggregation stages of features are decoupled in an asynchronous mode and loss supervision is carried out for each stage. Contrastive learning is an unsupervised learning paradigm [20] that uses a double-branch structure to construct homologous and non-homologous image pairs through different image processing, guiding the push-and-pull process by comparing the similarities between image features. Although current contrastive learning methods have lower requirements for the annotation of training data and are more convenient in practical applications, they focus mainly on global image features and rarely consider the local feature details crucial for fine-grained classification.
To solve these problems, a contrastive learning network based on causal attention (C2Net) is proposed that makes full use of the local detailed features of ship targets. C2Net performs asynchronous learning of local and global features to improve local feature quality.
The main contributions of this paper are as follows:
1. To improve local feature quality, a causal attention model based on feature decoupling (FD-CAM) was developed. The FD-CAM uses a decoupling function to guide the feature separation operation and eliminate the adhesion between local features, and it uses a counterfactual causal inference architecture to learn the true association between features and classification results, reducing the influence of the false association problem in data-driven deep learning.
2. A feature aggregation module (FAM) is proposed for the weighted fusion and re-association of local features. The FAM weights the features extracted from the trunk network using the local attention weights learned by the FD-CAM to obtain locally decoupled fusion features. The fusion features are input into the feature re-association module (FRM) to realize the re-association between local features, and the aggregation function is used to guide the clustering of the feature vectors.
3. Extensive experiments were conducted on two publicly available ship datasets to evaluate the performance of the proposed approach. The experimental results show that our method achieves better results than other methods, demonstrating strong fine-grained classification ability.
The remainder of this paper is organized as follows: Section 2 introduces the overall structure of C2Net and details the principles and fusion modes of the feature decoupling and aggregation branches. Section 3 describes the study datasets, evaluation metrics, implementation details, and the performance of C2Net, while Section 4 summarizes the work of this paper.

Overview of the Method
The structure of the contrastive learning network based on causal attention (C2Net) proposed in this paper is shown in Figure 2. The input image is fed into two branches, each randomly performing a different image transformation operation. The transformed images are input into the CNN backbone network to extract features and obtain the feature map F ∈ R^(W×H×C), where W, H, and C are the feature map's width, height, and number of channels, respectively. F is then input into the causal attention model based on feature decoupling (FD-CAM) to learn the local attention maps of the ship target; the details are given in Section 2.2. Finally, the feature aggregation module (FAM) uses the attention maps from the FD-CAM to fuse the features and performs the re-association and clustering of the fused features to achieve fine-grained target recognition. The feature aggregation module is described in detail in Section 2.3.
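As a toy illustration of the two-branch input stage, the sketch below builds a homologous image pair by applying two different random transformations to the same image. The flip transforms and the tiny 2 × 2 "image" are stand-ins for illustration only, not the paper's actual augmentation set:

```python
import random

def hflip(img):            # horizontal flip
    return [row[::-1] for row in img]

def vflip(img):            # vertical flip
    return [list(row) for row in img[::-1]]

def identity(img):
    return [list(row) for row in img]

TRANSFORMS = [hflip, vflip, identity]

def two_views(img, rng):
    # Each branch randomly applies a different transformation to the same image,
    # producing a homologous image pair for the two contrastive branches.
    t1, t2 = rng.sample(TRANSFORMS, 2)
    return t1(img), t2(img)

rng = random.Random(0)
img = [[1, 2], [3, 4]]
v1, v2 = two_views(img, rng)

assert v1 != v2                                     # the two views differ...
assert sorted(sum(v1, [])) == sorted(sum(v2, []))   # ...but depict the same target
```

Both views are then passed through the shared CNN backbone, so the downstream decoupling and aggregation losses can compare features of the homologous pair.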

Causal Attention Model Based on Feature Decoupling
The counterfactual attention model, based on causal reasoning [16], guides the learning process of the attention map by establishing the causal relationship between features and predicted results, making the model's prediction logic more transparent and avoiding the unexplainability of black-box deep learning models. Since the attention map represents the intensity of attention on different areas of the image, improving local attention learning is vital for accurate classification. In this study, a causal attention model based on feature decoupling (FD-CAM) is proposed. The FD-CAM is a typical counterfactual causal inference framework [17,18] consisting of two branches: a multiple attention branch and a counterfactual attention branch.
The multiple attention branch is used to learn the local attention features of the target. For the input feature map F, CNNs are first used to extract advanced features. Then, context information is collected through a pooling operation for feature aggregation to obtain the multi-head attention map A ∈ R^(W×H×M), where M is the number of attention channels. Considering the extreme aspect ratio of ship targets, the strip pooling module (SPM) [21] is selected to pool F, as shown in Figure 3. Strip pooling (SP) uses a long, narrow pooling kernel that can avoid interference from the other spatial dimension while conducting long-distance feature correlation along one spatial dimension, facilitating the extraction of local features. The SPM can account for both global and local features by deploying SP in the horizontal and vertical directions and adapts well to elongated ship targets. The feature map obtained by the SPM's pooling is copied along the corresponding dimension to restore the shape W × H × M.

To obtain the regional feature set R = {r_1, r_2, ..., r_M} ∈ R^(M×C×1), A and F are multiplied element by element and global average pooling is performed:

r_mc = GAP(A_m ⊙ F_c),

where r_m is the m-th regional feature vector; r_mc is the value of the c-th element in r_m; A_m is the m-th attention map in A; F_c is the c-th feature map in F; GAP(•) is global average pooling; ⊙ is element-by-element multiplication. The local feature of the target is obtained by summing the M regional feature vectors:

l = Σ_{m=1..M} r_m.

l is essentially a feature weighted by spatial attention, which can equivalently be obtained by summing the multiple attention maps along the channel direction, multiplying the input features element by element, and applying global average pooling. The proof is as follows. Suppose the sum of the multiple attention maps along the channel direction is A_sum:

A_sum = Σ_{m=1..M} A_m.

Further derivation, using the linearity of GAP, leads to:

l = Σ_{m=1..M} GAP(A_m ⊙ F) = GAP((Σ_{m=1..M} A_m) ⊙ F) = GAP(A_sum ⊙ F).

The fully connected layer is selected as the classifier C(•), and l is input into C(•) to obtain the predicted output of the multiple attention branch:

Y_A = C(l).

The counterfactual attention branch is used to explore the influence of the multiple attention branch on the final classification result through counterfactual intervention; its input is a randomly generated false attention map Â. The regional feature vectors and the local feature vector of the false attention map are calculated in the same way as for the multi-head attention:

r̂_m = GAP(Â_m ⊙ F),  l̂ = Σ_{m=1..M} r̂_m.

The prediction result obtained from the counterfactual local feature vector is:

Y_Â = C(l̂).

By calculating the difference between Y_A and Y_Â, we can quantify the impact of the multi-head attention features on the prediction results:

Y_effect = Y_A − Y_Â.

Y_effect can be understood as the learning objective of the local attention mechanism, that is, the intrinsic association between the multiple attention and the predicted results after eliminating the interference of false attention. The cross-entropy loss function CE(•) is used to guide the learning of this association:

L_CE = CE(Y_effect, y_label),

where y_label is the ground-truth label.
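The equivalence used in the proof above — summing the per-head pooled features r_m equals pooling once with the channel-summed attention map A_sum — follows from the linearity of GAP and can be checked numerically with a small pure-Python sketch (all dimensions and values below are arbitrary):

```python
import random

W, H, M, C = 2, 2, 3, 4
random.seed(0)
# M attention maps (W×H) and C feature maps (W×H)
A = [[[random.random() for _ in range(H)] for _ in range(W)] for _ in range(M)]
F = [[[random.random() for _ in range(H)] for _ in range(W)] for _ in range(C)]

def gap_weighted(att, feat):
    # r_c = GAP(att ⊙ feat_c): spatial mean of the attention-weighted feature map
    return [sum(att[w][h] * fc[w][h] for w in range(W) for h in range(H)) / (W * H)
            for fc in feat]

# Per-head regional vectors r_m, then l = sum_m r_m
r = [gap_weighted(Am, F) for Am in A]
l = [sum(rm[c] for rm in r) for c in range(C)]

# Equivalent computation: l = GAP(A_sum ⊙ F) with A_sum = sum_m A_m
A_sum = [[sum(A[m][w][h] for m in range(M)) for h in range(H)] for w in range(W)]
l_direct = gap_weighted(A_sum, F)

assert all(abs(a - b) < 1e-9 for a, b in zip(l, l_direct))
```

The same `gap_weighted` routine applied to a random attention map Â yields the counterfactual feature l̂ used to form Y_effect.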
The coupling between features in counterfactual causal reasoning networks is an important factor affecting local attention learning. Inspired by the work of Chen et al. [19], we draw on the feature separation idea of contrastive learning, carrying out separation through manual intervention to decouple the regional features of input images and the local features between non-homologous images, obtaining more distinctive attention features and preventing the mutual influence of local features during attention learning.
Regional feature decoupling is carried out within each feature set, as shown in Figure 4a. The feature vectors in the regional feature set represent the filtering of the input feature map F by local attention. The essence of decoupling the feature vectors is to decouple local attention, which improves the learning of local classification features by reducing the correlation between the channels of the multiple attention maps. In decoupling, the vectors in the region feature set must be combined in pairs for feature separation. To avoid double computation and ensure computational efficiency, a cyclic shift is adopted to reorder the vector set, and decoupling is performed with the vector at the corresponding position in the original set:

R̃ = CycShift(R, s),  L_R = Σ_{m=1..M} Dec(r_m, r̃_m),

where CycShift(•) is the cyclic shift; s is the shift step size; R̃ and r̃ are the regional feature set and feature vector obtained by the cyclic shift, respectively; and L_R is the decoupling loss of the regional features. Dec(•) is the decoupling function, defined as follows:

Dec(v_1, v_2) = −log[0.5 − 0.5·cos(v_1, v_2)],

where v_1 and v_2 are one-dimensional vectors and cos(•,•) is the cosine similarity. Since the input image is transformed in two different ways, the aggregation of regional feature vectors between homologous images must also be considered. Because the two homologous maps correspond to the same target, the attention feature maps of the same channel should focus on the same region, and the regional feature vectors filtered by them share the same local features. Figure 4b presents the feature aggregation process. The aggregation function is defined as follows:

Agg(v_1, v_2) = −log[0.5 + 0.5·cos(v_1, v_2)].

The loss function used to guide the aggregation process is:

L_Agg = Σ_{m=1..M} Agg(r_m^1, r_m^2),

where r^1 and r^2 represent the regional feature vectors of the two homologous maps. Algorithm 1 shows the pseudo-code.
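A minimal sketch of the cyclic-shift pairing and the two contrastive functions follows. Note that the exact Dec(•)/Agg(•) definitions are only partially legible in the source, so the −log[0.5 ∓ 0.5·cos] forms below are an assumption chosen to match the surviving log[0.5 …] fragment:

```python
import math

def cos_sim(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

def dec(v1, v2):
    # Assumed decoupling function: large penalty when vectors are similar.
    return -math.log(0.5 - 0.5 * cos_sim(v1, v2) + 1e-8)

def agg(v1, v2):
    # Assumed aggregation function: large penalty when vectors are dissimilar.
    return -math.log(0.5 + 0.5 * cos_sim(v1, v2) + 1e-8)

def cyc_shift(R, s):
    # Cyclic shift of the regional feature set by step s.
    return R[s:] + R[:s]

R = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy regional feature set, M = 3
R_shift = cyc_shift(R, 1)
L_R = sum(dec(r, r_s) for r, r_s in zip(R, R_shift))

# agg is minimized for identical vectors, dec for opposite ones
assert agg([1.0, 0.0], [1.0, 0.0]) < agg([1.0, 0.0], [0.0, 1.0])
assert dec([1.0, 0.0], [-1.0, 0.0]) < dec([1.0, 0.0], [0.0, 1.0])
```

The cyclic shift guarantees each regional vector is paired once per step without enumerating all M² pairs, which is the efficiency argument made in the text.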

Algorithm 1. Regional feature decoupling and aggregation
Input: Homologous image region feature sets R1 and R2; the number of attention channels M.
Output: Regional feature decoupling loss L_R; homologous map local feature aggregation loss L_Agg.
Initialization:
/* The cyclic shift operation is performed on the region feature set R to ensure that the region feature vectors are paired exactly once in the decoupling process. The decoupling takes place within each regional feature set. */

Unlike regional feature decoupling, local feature decoupling between non-homologous images de-correlates instance-level targets. By separating a target from its subclass cluster, we can eliminate the feature adhesion between targets, avoid feature confusion, and reduce the difficulty of learning causal attention. As shown in Figure 5, the local features of the input images are combined in pairs for the decoupling operation; the decoupling loss is:

L_L = Σ_{i=1..B} Σ_{j≠i} Dec(l_i, l_j),

where B indicates the input batch size. Note that the local features of each pairwise combination originate from the same input branch, and the corresponding input images use the same transformation operation. Algorithm 2 shows the pseudo-code for the local feature decoupling.
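The batch-wise pairwise decoupling described above can be sketched as follows; the toy local feature vectors and the cosine-based decoupling function are assumptions for illustration only:

```python
import math

def cos_sim(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))

def dec(v1, v2):
    # Assumed decoupling function: penalizes similar instance-level features.
    return -math.log(0.5 - 0.5 * cos_sim(v1, v2) + 1e-8)

# Local feature vectors l_i of B non-homologous images from the SAME branch
B = 3
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# Pairwise combination over the batch: every ordered pair i != j
pairs = [(i, j) for i in range(B) for j in range(B) if i != j]
L_L = sum(dec(feats[i], feats[j]) for i, j in pairs)

assert len(pairs) == B * (B - 1)
assert L_L > 0
```

Pushing each instance away from every other instance in the batch is what separates a target from its subclass cluster before the aggregation stage pulls homologous features back together.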

Feature Aggregation Module
In Section 2.2, based on feature separation in contrast learning, the regional feature representations of homologous maps and the local features of non-homologous maps are separated to decouple the regional features and classes.In this section, the clustering operation in contrast learning is introduced and the feature aggregation module (FAM) is used to pull the FD-CAM-separated features back into the corresponding subclass cluster.
The FAM structure is shown in Figure 2. Given the trunk feature F and the multi-head attention map A, a 1 × 1 convolution is first used to reduce the dimension of F, and the fused feature F̃ ∈ R^(W×H×M) is obtained by multiplying the dimension-reduced feature map with the multiple attention map A pixel by pixel. Each channel of F̃ focuses on a different local area, and the channels are weakly correlated with each other. To realize the re-association between the target's local features, we designed a feature re-association module (FRM) based on SP, as shown in Figure 6. Different from the SPM, the FRM applies strip pooling along the three dimensions of F̃ and expands the outputs back to the original size to obtain the re-association attention map A_r ∈ R^(W×H×M). The re-association feature map F_r is obtained by weighting F̃ with A_r. Through multi-dimensional information integration, the FRM can consider both the spatial and channel domains to achieve better feature re-association. Global average pooling is applied to F_r to obtain the feature vector of the target.

To cluster the feature vectors, the clustering centers of the categories must be defined. We adopt a proxy-loss-based method to realize feature clustering by setting learnable explicit proxies for each category and pulling similar feature vectors toward their corresponding proxy vectors; the principle is shown in Figure 7. Considering that a single explicit proxy has certain limitations for fine-grained recognition tasks, this paper follows the method of Chen et al. [19], setting multiple proxy vectors as the clustering centers of each category to reduce the clustering difficulty and improve efficiency. Specifically, let the set of proxy vectors be P = {p_1, ..., p_n}, where n is the number of proxy vectors. The aggregation function is used to pull the feature vector toward its proxy vector set. Finally, the fully connected layer is used as the classifier to complete the fine-grained classification of ship targets, yielding the predicted category y_pred.
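A simplified single-channel sketch of strip pooling follows: each spatial position receives the average of its row strip and its column strip, which are then fused into an attention map. The sum-plus-sigmoid fusion is an assumption for illustration; the actual SPM/FRM also use learned convolutions:

```python
import math

W, H = 3, 4
F = [[float(w * H + h) for h in range(H)] for w in range(W)]  # one toy channel

# Strip pooling: average along one spatial dimension at a time
col_mean = [sum(F[w][h] for w in range(W)) / W for h in range(H)]  # vertical strips
row_mean = [sum(F[w][h] for h in range(H)) / H for w in range(W)]  # horizontal strips

# Expand both back to W×H and fuse into an attention map (sketch: sum + sigmoid)
A = [[1.0 / (1.0 + math.exp(-(row_mean[w] + col_mean[h]))) for h in range(H)]
     for w in range(W)]

assert len(A) == W and len(A[0]) == H
assert all(0.0 < a < 1.0 for row in A for a in row)
```

Because each output position mixes a full row and a full column, elongated structures such as ship hulls are correlated over long distances without the dilution a large square pooling window would introduce.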

Dynamic Adjusting Label Strategy
To ensure the separability of proxies between classes and the relevance of proxies within a class, we perform separation operations on the proxies of any two different classes c1 and c2, and aggregation operations on the proxy vectors within the same class.

Loss Function
The loss function in this paper mainly includes the loss of the causal attention branch, the loss of the feature aggregation branch, and the classification loss. The classification loss uses the cross-entropy loss to penalize the final classification prediction:

L_cls = CE(y_pred, y_label).

L_CAM is used to guide the feature separation and causal attention learning of the causal attention branch. L_FAM is used to supervise the feature re-association of the feature aggregation branch and the clustering process of the proxy vectors. The total loss function is defined as:

L = L_CAM + L_FAM + L_cls.

Experiments
This section evaluates the proposed method on public datasets. First, we introduce the fine-grained classification datasets FGSC-23 [11] and FGSCR-42 [22] used in this paper. Second, we elaborate on the implementation details of the experiments, including the software versions, hardware configuration, training parameters, and dataset processing. We then discuss the ablation experiments designed to assess the contribution of each part of the model. Finally, we compare the proposed method with other fine-grained classification methods.


FGSC-23
The FGSC-23 dataset was mainly collected from panchromatic remote sensing images of Google Earth and GF-2 satellites, including about 4080 sliced ship images with a resolution of 0.

FGSCR-42
The FGSCR-42 dataset comprised 7776 images of common types of ships, ranging from 50 × 50 to 500 × 1500 pixels and obtained from Google Earth and other remote sensing datasets, such as DOTA [23] and HRSC2016 [24]. Under 10 main categories, the ship types are divided into 42 subcategories.


Evaluation Metrics
Overall accuracy (OA) and average accuracy (AA) were selected as the main metrics for the model's performance. Assessing the prediction accuracy from an overall perspective, OA is the proportion of correctly predicted samples among all samples, regardless of class, and is given by the expression: OA = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative samples, respectively. The higher the OA value, the higher the prediction accuracy and the better the model's performance.

In comparison, AA focuses on the prediction of each category and evaluates the model's performance by averaging the recall rates of all categories: AA = (1/C) Σ R_c, where C is the number of categories and R_c is the recall of class c, representing the recognition accuracy for that class. The prediction results were also visualized through confusion matrices (CM) and 2D feature distributions to provide a more intuitive assessment of the model's performance.
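The two metrics above can be computed directly from the predicted and true labels; a minimal sketch:

```python
def overall_accuracy(y_true, y_pred):
    """OA: fraction of all samples predicted correctly, regardless of class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def average_accuracy(y_true, y_pred, num_classes):
    """AA: mean per-class recall, R_c = correct in class c / samples in class c."""
    recalls = []
    for c in range(num_classes):
        idx = [i for i, t in enumerate(y_true) if t == c]
        if not idx:
            continue  # skip classes absent from the test set
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

Note that AA weights every class equally, so it is more sensitive than OA to errors on rare ship types.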

Implementation Details
The experiments were conducted on the Windows 10 operating system; the development platform was Anaconda with PyTorch 1.6.0 and CUDA 10.1. All experiments were performed on a laptop equipped with an Intel(R) Core(TM) i7-10875H @ 2.30 GHz CPU and an NVIDIA GeForce RTX 2080 Super with Max-Q Design GPU (8 GB of video memory). We used stochastic gradient descent (SGD) [25] as the optimizer and adjusted the learning rate using the cosine annealing algorithm [26]. The model was trained for 200 epochs on the FGSC-23 dataset and 100 epochs on the FGSCR-42 dataset with a batch size of 16, including 10 warm-up epochs. The backbone network was ResNet50 [27], pretrained on ImageNet [28]. The initial learning rate and weight decay were both set to 0.001 and the momentum to 0.9. The proxy vectors were initialized with Xavier Uniform [29]. To ensure that the input images were not distorted, we used edge padding to scale all images to 224 × 224 pixels and applied data augmentation operations including random horizontal flip, random vertical flip, and random rotation.
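The learning-rate schedule described above can be sketched as follows. The paper states only that cosine annealing and 10 warm-up epochs were used; the linear warm-up form and the minimum learning rate of 0 are assumptions.

```python
import math

def learning_rate(epoch, total_epochs, base_lr=0.001,
                  warmup_epochs=10, min_lr=0.0):
    """Cosine-annealed learning rate with an assumed linear warm-up.
    During warm-up the rate ramps from base_lr/warmup_epochs to base_lr;
    afterwards it follows the standard cosine annealing curve."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = epoch - warmup_epochs
    T = total_epochs - warmup_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / T))
```

In PyTorch this corresponds roughly to `torch.optim.lr_scheduler.CosineAnnealingLR` applied after a warm-up phase.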

Ablation Studies
We designed a series of ablation experiments to study the effects of the various parts of C2Net on classification accuracy, including the FD-CAM and FAM branches. The FD-CAM analysis focused on the causal attention structure and feature decoupling, while the FAM analysis focused on the FRM and feature aggregation. The experiments were conducted on the FGSC-23 and FGSCR-42 datasets; the results are shown in Tables 1 and 2. In addition, we also explored the influence of some hyperparameters and functional forms on the model's performance.

Effectiveness of the FD-CAM
The FD-CAM was used to determine the channel-decoupled local attention maps. We conducted ablation experiments on the counterfactual causal attention structure and the feature decoupling operation in the FD-CAM using the control variable method to study the influence of each part on the classification performance. As shown in Tables 1 and 2, compared with the backbone network, using only the counterfactual causal attention structure increased the OA and AA by 1.95% and 1.81%, respectively, while using only the feature decoupling operation increased them by 1.83% and 1.71%, respectively. The complete FD-CAM achieved 90.90% OA and 90.63% AA on FGSC-23 and 94.19% AA on FGSCR-42. The results in Table 2 show that when the FAM-only classifier was used as the baseline model, adding the FD-CAM increased the AA by 1.20%. The experimental results indicate that the counterfactual causal attention structure improved the learning quality of local attention by changing the model's discriminant mode, while feature decoupling eliminated feature redundancy by separating the local features, thus improving the classification performance. In addition, the classification accuracy of the combined FD-CAM and FAM was higher than that of the FAM alone, confirming the effectiveness and feasibility of the FD-CAM.
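The counterfactual intervention referenced above can be illustrated with a minimal numpy sketch in the spirit of counterfactual attention learning: the prediction obtained with random attention is subtracted from the prediction with learned attention to isolate the attention's true effect. The classifier head, shapes, and random-attention intervention here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def classify(features, attention, W):
    """Hypothetical classifier head: attention-weighted pooling of part
    features followed by a linear layer producing class logits.
    features: (M, D) part features; attention: (M,) weights; W: (C, D)."""
    pooled = (features * attention[:, None]).sum(axis=0)  # (D,)
    return W @ pooled                                     # (C,) logits

def causal_effect(features, attention, W):
    """Y_effect = Y(attention) - Y(random attention): the part of the
    prediction attributable to the learned attention, obtained by a
    counterfactual intervention that replaces it with random weights."""
    fake = rng.random(attention.shape)
    fake = fake / fake.sum()  # normalized random attention
    return classify(features, attention, W) - classify(features, fake, W)
```

Supervising Y_effect rather than the raw prediction penalizes attention maps that contribute nothing beyond what random attention would achieve.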
To assess the model's effect more directly, we visualized the confusion matrices and the two-dimensional feature distribution diagrams produced by the t-distributed stochastic neighbor embedding (t-SNE) algorithm [30], as shown in Figures 10 and 11. The horizontal axis of the confusion matrix shows the prediction results, the vertical axis presents the actual results, and the diagonal reports the accuracy of each class. From Figure 10, the confusion matrix with the most chaotic color distribution occurred when only the trunk network was used. When the FD-CAM was added, the color distribution became more concentrated on the diagonal, and the recall rate for most classes improved. Comparing the feature distributions in Figure 11, the FD-CAM had a higher aggregation degree than the trunk network, and C2Net had a higher aggregation degree than the FAM alone. The results suggest that the local feature decoupling of the FD-CAM can reduce the mutual interference between the discriminant features of different target parts, providing high-quality local features and effectively improving classification performance.
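A confusion matrix of the kind shown in Figure 10 can be built as below (a generic sketch, not the authors' plotting code); the per-class recalls on its diagonal are the values that become more concentrated as the model improves.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_recall(cm):
    """Diagonal over row sums: the per-class accuracy shown on the
    diagonal when each row of the matrix is normalized."""
    return np.diag(cm) / cm.sum(axis=1)
```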


Effectiveness of the FAM
Ablation experiments were conducted on the two parts of the FAM: the FRM and feature aggregation. In Table 1, when the baseline model was the primary trunk network, adding the FRM increased the OA from 88.83% to 91.02% and the AA from 88.66% to 90.60%. The experimental results in Table 2 also show that the classification accuracy improved after adding the FRM. When the baseline model was the FD-CAM alone, the FRM increased the AA from 94.19% to 95.01%, indicating that multi-dimensional information integration can more comprehensively establish the correlation between features.
After the FRM completes the feature re-association, the clustering of target features is implemented through the feature aggregation operation. Without the FD-CAM, the improvement in classification accuracy from feature aggregation is limited. When combined with the FD-CAM, feature aggregation re-clusters the locally decoupled features, improving the OA and AA by 2-3%. The FAM, mainly composed of the FRM and feature aggregation, realizes the asynchronous feature learning process together with the FD-CAM. When the FD-CAM was used alone, the OA decreased by 2.18% and the AA decreased by 0.5-1.5% compared with C2Net. As shown in Figure 11, the FAM can effectively pull the features of similar targets toward the corresponding clustering centers, resulting in a relatively compact feature distribution.

Effectiveness of Attention Channels
Multi-head attention maps improve the model's fine-grained recognition ability by learning the detailed local features of different parts of the ship target. To explore the influence of attention density on the model's classification ability, ablation experiments were conducted with different channel numbers; the results are shown in Table 3. When the number of channels was 16, the OA reached 92.48% and the AA reached 91.43%. When the number of channels is too small, the model cannot learn enough detailed features. Conversely, if the number of channels is too large, there is feature redundancy among the channels, resulting in considerable feature ambiguity during decoupling and impairing the learning of the model's local attention.
As the attribute agent of a target category, the explicit agent serves as a clustering center in the feature aggregation stage. In this paper, multiple proxy vectors are set for each category as a clustering center group, which enlarges the range of the clustering center and reduces the dependence on a single central point. This can improve the effect and efficiency of clustering to a certain extent. To find the optimal number of proxy vectors, the OA and AA for different numbers of proxy vectors were compared; the results are presented in Table 4. The best classification accuracy was achieved when n = 2, with an OA of 92.48% and an AA of 91.43%. The effectiveness of the clustering center group can be verified by comparison with the experimental results of a single agent. However, when the number of agents is too large, the cluster center group becomes loose, which is not conducive to feature aggregation. In addition, using a large number of agents increases the training burden, resulting in inadequate learning of the agent vectors and adversely impacting feature aggregation.
Both the decoupling and aggregation functions are compositions of logarithmic and cosine distance functions. Logarithmic loss is a common loss function in logistic regression tasks; its domain is the finite interval (0, 1). The value range of the cosine distance is also a finite interval (−1, 1), so the vector regression problem can be transformed into a binary classification problem, and the advantages of the logarithmic function in characterizing probability distributions can be used to improve feature decoupling and aggregation. To evaluate the effectiveness of the logarithmic form, a set of ablation experiments compared the influence of the two proposed compound functions and the ordinary cosine distance on the experimental results. As shown in Table 5, both Dec(·) and Agg(·) helped to improve the model's classification. When used separately, Dec(·) increased the OA by 0.85% and the AA by 1.61%, while Agg(·) increased the OA by 0.25% and the AA by 0.94%. When used together, the OA increased by 1.95% and the AA increased by 1.72%, confirming the method's effectiveness.
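One possible composition of logarithmic and cosine distance functions is sketched below. The exact forms of Dec(·) and Agg(·) are not given in this excerpt, so the mapping of the cosine similarity from (−1, 1) to (0, 1) and the specific log-loss forms are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors, in (-1, 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dec(u, v, eps=1e-7):
    """Decoupling loss sketch: map similarity into (0, 1) and penalize
    similar vectors via -log(1 - p), pushing part features apart."""
    p = (cosine(u, v) + 1) / 2
    return -math.log(max(1 - p, eps))

def agg(u, v, eps=1e-7):
    """Aggregation loss sketch: reward similar vectors via -log(p),
    pulling features toward their proxy (clustering center)."""
    p = (cosine(u, v) + 1) / 2
    return -math.log(max(p, eps))
```

Treating the rescaled similarity as a probability is what turns the vector regression into the binary classification problem described above.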

Comparisons with Other Methods
The proposed C2Net was then compared with other methods to analyze its fine-grained classification performance. The experiments were first conducted on the FGSC-23 dataset, calculating the recognition accuracy of the 23 sub-categories along with the OA and AA; the results are shown in Table 6. Inception v3 [32], DenseNet121 [33], MobileNet [35], and Xception [37] are general CNNs that perform classification by extracting high-level features from images; their main limitation is that only global features are used, while detailed features are ignored. To address the problem of limited samples, FDN [5], DCL [31], and B-CNN [34] improve the recognition accuracy of remote sensing targets through multi-feature fusion and pseudo-label training; however, they do not consider the fusion of features from different receptive fields, resulting in a low utilization rate of local information. ME-CNN [6] combines the CNN, Gabor filter, LBP operator, and other means of extracting multiple features, providing more information than the FDN. T2FTS [38] solves the long-tail recognition problem in remote sensing images using a hierarchical distillation framework. FBNet [39] uses a feature-balancing strategy to strengthen the representation of weak details and refines local features through an iterative interaction mechanism, addressing the problem of fuzzy or missing target details. PMG [36] and LGFFE [40] were proposed for fine-grained classification tasks under weak supervision, providing more discriminant features through multi-level feature learning.
In contrast, the proposed C2Net method takes into account the adhesion and redundancy among local features, improves feature quality through asynchronous feature learning, and fully utilizes the detailed local features of targets. The experimental results show that C2Net achieved the highest accuracy, with an OA of 92.48% and an AA of 91.43%, which are 3.03-10.18% and 3.36-9.89% higher than those of the other approaches. Because we did not expand the dataset to balance the number of samples per category, class 11, which has the fewest samples in the dataset, was recognized with an accuracy of 65.00%, lower than that of other methods. To verify the generalization of the proposed method, we conducted further tests on the FGSCR-42 dataset. As shown in Table 7, C2Net achieved the highest AA of 95.42%, which is 2.21-5.50% higher than the other methods, proving the model's strength in the fine-grained classification of ship targets.

Conclusions
This study explored the fine-grained classification of ship targets in optical remote sensing images and developed a contrastive learning network based on causal attention. In this network, the FD-CAM and FAM are used to decouple and aggregate features in an asynchronous manner to improve the quality of local features. The FD-CAM is designed based on the counterfactual causal attention model to eliminate false associations between the results and features by strengthening the predictive logic of the model. To prevent feature adhesion, the FD-CAM uses a decoupling function to separate features and obtain high-quality local attention weights. The local attention weights are used as the input of the FAM to weight the trunk features, and the FRM is used to achieve feature re-association. Then, the feature clustering process is guided by the proxy loss to achieve fine-grained classification. Experimental results on two common ship datasets showed that the proposed method achieved optimal accuracy and better classification performance than other methods. Its limitation is that the model's parameter count is not lightweight enough, leaving a gap before practical engineering application. In future work, we plan to study the decoupling and aggregation of local features further to improve the fine-grained classification accuracy and, on this basis, explore a more lightweight classification model.

Figure 1.
Figure 1. Therefore, the high-precision, fine-grained classification of ship targets has become a research hotspot in computer vision.

Figure 2 .
Figure 2. C2Net structure. GAP means the global average pooling operation. SPM is the strip pooling module. FRM indicates the feature re-association module. Y_effect shows the true effect of the learned attention on the prediction results. L_Agg is the aggregation loss function used to guide the feature aggregation process.


Figure 3 .
Strip pooling (SP) uses a long, narrow pooling kernel that avoids interference from the other spatial dimension while establishing long-distance feature correlations along one spatial dimension, facilitating the extraction of local features. The SPM can account for both global and local features by deploying SP in the horizontal and vertical directions and adapts well to elongated ship targets. The feature map obtained by the SPM's pooling is replicated along the corresponding dimension to restore the shape W × H × M.
Remote Sens. 2023, 15, x FOR PEER REVIEW
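The pooling-and-expansion step described above can be sketched as follows for a single-channel map; the SPM's subsequent fusion of the pooled branches is not fully specified here, so only the SP and Expand operations are shown.

```python
import numpy as np

def strip_pool(F):
    """Strip pooling on an (H, W) feature map: average along one spatial
    dimension at a time, then replicate ('Expand') back to (H, W) so the
    result can be fused with the original map."""
    H, W = F.shape
    row = F.mean(axis=1, keepdims=True)   # (H, 1): pool along each row
    col = F.mean(axis=0, keepdims=True)   # (1, W): pool along each column
    F_row = np.repeat(row, W, axis=1)     # expand back to (H, W)
    F_col = np.repeat(col, H, axis=0)
    return F_row, F_col
```

Because each strip spans an entire row or column, long, thin ship targets contribute to the pooled value along their full length, which is why SP suits elongated targets.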

m_r^c is the value of the c-th element in the region feature vector m_r; A_m is the m-th attention map in A and F_c is the c-th feature map in F; GAP is the global average pooling; ⊙ is the element-by-element multiplication.

Figure 4. Algorithm 1.
Figure 4. Schematic diagram of regional feature decoupling and aggregation. (a) Regional feature decoupling; (b) Homologous image region feature aggregation.

/* R_1 and R_2 are no longer represented separately. */
for s in range(M − 1) do
    /* Cyclic shift s steps. */
    R = CycShift(R, s)
    /* The vectors are decoupled in pairs. */
    for m in range(M) do
        L_R = L_R + Dec(r_m, r_m)
    end
end
/* Homologous map feature aggregation. */
for m in range(M) do

Algorithm 2 .
Local feature decoupling of non-homologous images
Input: Local features l_1 and l_2 of the two branches' input images; the batch size B of the input images.
Output: Regional feature decoupling loss L_l.
Initialization: L_l = 0.
/* Iterate over all local features. */
for i in range(B) do
    for j in range(B) do
        /* Determine whether the pair is from non-homologous images; if so, perform feature decoupling, otherwise skip it. */
        if i is equal to j then
            continue
        else
            L_l = L_l + Dec(l
B indicates the input batch size. Note that the local features of the pairwise combination originate from the same input branch and the corresponding input image uses the same transformation operation. Algorithm 2 shows the pseudo-code for the local feature decoupling.
SP_row(·) indicates the average pooling of the row dimension, SP_col(·) indicates the average pooling of the column dimension, and SP_channel(·) indicates the average pooling of the channel dimension. Expand(·) indicates a dimension replication operation. F_row, F_col, and F_channel are the feature maps of the row dimension, column dimension, and channel dimension after strip pooling.


Figure 10 .
Figure 10. Confusion matrix. (a-d) Confusion matrices of the main trunk network, backbone network + FD-CAM, backbone network + FAM, and C2Net on the FGSC-23 test set, respectively; (e-h) the corresponding confusion matrices on the FGSCR-42 test set.


Figure 11 .
Figure 11. 2D feature distribution. (a-d) Feature distributions of the main trunk network, backbone network + FD-CAM, backbone network + FAM, and C2Net on the FGSC-23 test set, respectively; (e-h) the corresponding feature distributions on the FGSCR-42 test set.

Table 1 .
The ablation experiments of the different parts of the C2Net on the FGSC-23 dataset.

Table 2 .
The ablation experiments of the different parts of C2Net on the FGSCR-42 dataset.

Table 3 .
The performance of attention channels M on the FGSC-23 dataset.

Table 4 .
The performance of proxy vector number n on the FGSC-23 dataset.

Table 6 .
Accuracy (%) of the different methods on the testing set of the FGSC-23 dataset.The overall accuracy (OA), average accuracy (AA), and the accuracy of each category are listed.