DSDet: A Lightweight Densely Connected Sparsely Activated Detector for Ship Target Detection in High-Resolution SAR Images

Abstract: Traditional constant false alarm rate (CFAR) based ship target detection methods do not work well in complex conditions, such as multi-scale situations or inshore ship detection. With the development of deep learning techniques, methods based on convolutional neural networks (CNN) have been applied to solve such issues and have demonstrated good performance. However, compared with optical datasets, the number of samples in SAR datasets is much smaller, thus limiting the detection performance. Moreover, most state-of-the-art CNN-based ship target detectors that focus on the detection performance ignore the computation complexity. To solve these issues, this paper proposes a lightweight densely connected sparsely activated detector (DSDet) for ship target detection. First, a style embedded ship sample data augmentation network (SEA) is constructed to augment the dataset. Then, a lightweight backbone utilizing a densely connected sparsely activated network (DSNet) is constructed, which achieves a balance between the performance and the computation complexity. Furthermore, based on the proposed backbone, a low-cost one-stage anchor-free detector is presented. Extensive experiments demonstrate that the proposed data augmentation approach can create hard SAR samples artificially. Moreover, utilizing the proposed data augmentation approach is shown to effectively improve the detection accuracy. Furthermore, the conducted experiments show that the proposed detector outperforms the state-of-the-art methods with the fewest parameters (0.7 M) and lowest computation complexity (3.7 GFLOPs).


Introduction
Synthetic aperture radar (SAR) has the unique capability of earth observation in all-weather conditions, regardless of day and night, which gives it an important place in marine exploration [1][2][3][4][5]. As a multitude of spaceborne SAR sensor platforms (e.g., RADARSAT-2, TerraSAR-X [6], and GF-III) are put into operation, high-resolution SAR images are no longer difficult to acquire [7,8], which promotes the use of SAR imaging in ocean monitoring.
Marine ship target detection in SAR images plays an important role in sustainable fishing, marine ecosystem protection, and military target strikes. The traditional SAR ship target detection framework can be mainly divided into four stages: land-sea segmentation, preprocessing, prescreening, and discrimination [9,10], for which researchers have developed a variety of methods. Since detection over land areas results in a large number of false alarms, and since dealing with these false alarms greatly increases the burden on the system, land-sea segmentation serves as an essential pretreatment. Most general land-sea segmentation methods, such as geographic information system (GIS), the snake model [11], and Otsu [12,13], use prior knowledge as well as various handcrafted features to segment the SAR image. In order to improve detection performance, researchers have proposed preprocessing methods to enhance the ship targets' characteristics. Weighted information entropy [14] and the visual attention mechanism [15,16] serve as two such methods.
Among the four detection stages, prescreening is the crucial step [17] whose primary purpose is to locate targets. Actually, many ship detection methods only contain this step [18]. Among existing detection methods, CFAR based methods [19] have been widely investigated [20][21][22][23][24][25]. This type of method determines the detection threshold based on a pre-established clutter statistical model, which has the characteristic of constant false alarms [26,27]. These statistical methods strongly rely on the statistical distribution of sea clutters. However, such models are easily affected by ocean currents, climate, and imaging systems, which reduces the robustness of CFAR [10]. To alleviate the sea clutter model's mismatch risk, researchers have explored many clutter distribution models. However, with the increasing complexity of the models, parameter estimation becomes a challenge and even constrains the practical application of CFAR technology.
Discrimination is used by an operator to eliminate non-ship targets based on the classification features extracted in the prescreening areas. The length-width ratio [10], geometric shape, scale-invariant feature transform (SIFT) [28], and histogram of oriented gradient (HOG) [29] are the commonly used features. However, these handcrafted features do not work well in complex inshore areas.
In recent years, due to the significant strides made by deep learning in the field of computer vision, e.g., image classification [30], object detection [31,32], and image segmentation [33,34], researchers have tried to introduce deep learning methods into ship detection. Deep learning methods detect the positions of ships by spontaneously learning the ships' characteristics through a labeled dataset. They do not require land-sea segmentation and have demonstrated satisfactory effects in multi-scale and inshore ship detection tasks. Faster R-CNN [35] and You Only Look Once (YOLOv1-v3) [36][37][38] are two classic algorithms that represent the two-stage and one-stage detectors, respectively, laying the foundation for the basic architecture of current mainstream detection algorithms. Recently, many SAR ship detection methods based on these architectures have been proposed [39,40]. A dense network was constructed by Jiao et al. [41] to extract additional features at different levels. Additionally, Cui et al. [42] added an attention network in a feature pyramid to solve the problem of multi-scale ship detection. Wang et al. [43] improved the original SSD method by introducing an angle regression branch and aggregating semantic information. Moreover, Lin et al. [44] improved the Faster R-CNN and concatenated three-level features to obtain multi-scale feature maps. Yang et al. [45] detected ship targets in four different level features. In addition, to further improve the detection performance and address the influence of multi-scale and complex backgrounds, Zhao et al. [46] employed receptive fields block and the convolutional block attention module (CBAM) [47] to build a top-down feature pyramid. Furthermore, Fu et al. [48] added level-based attention and spatial-based attention networks into the feature pyramid network to enhance the feature extraction ability of the detector.
Although current CNN-based ship detection methods have attained compelling results, certain problems still require further elucidation. Recent deep learning-based ship detectors mainly focus on detection accuracy. Good performance always comes with a larger number of parameters as well as a heavy computational burden. However, few studies focus on reducing the computation complexity. Accordingly, how to balance the detection performance and the computation complexity is a problem.
Moreover, most recent approaches rely on pre-defined anchor boxes, which enables them to achieve adequate performance [33,39]. However, it should be noted that anchor-based detectors suffer from some drawbacks. First, many hyper-parameters are introduced when designing these anchor boxes. To achieve good detection performance, these pre-defined anchors require complex manual calibration of the hyper-parameters. Second, ship targets have large-scale variations (e.g., in size and orientation). To adapt to this variation, several different pre-defined anchors should be designed for the detectors. However, the orientations of ship targets are arbitrary, and the corresponding bounding boxes also vary enormously. The pre-defined anchors cannot effectively cover this variation. Meanwhile, to acquire better performance, anchors are densely placed on the image. Considering the sparsity of ships, redundant pre-defined anchors increase the computational burden. Therefore, anchor-free methods, which directly determine the geometric shape by extracting the semantic information of the target, may be better suited to ship detection tasks.
Furthermore, as a data-hungry approach, deep learning demands a large number of training samples to ensure its performance and generalization ability. Compared with optical datasets, the number of samples in SAR datasets is much smaller, which limits the detection performance. Data augmentation is an efficient way to address these issues. Crop, rotation, saturation, bilateral blurring, MixUp, CutMix and Mosaic are the representative conventional data augmentation methods. However, such methods cannot improve the detection performance to a satisfactory extent. Many novel data augmentation methods have been developed to improve the SAR classification performance [49][50][51]. However, similar studies in the field of SAR ship detection have hardly been conducted.
In response to the aforementioned problems, this paper proposes DSDet for ship target detection in high-resolution SAR images, as illustrated in Figure 1. First, a style embedded ship sample data augmentation network (SEA) is constructed to augment the dataset. Then, a lightweight densely connected sparsely activated network (DSNet) is devised as the backbone. Furthermore, based on the proposed backbone, a low-cost one-stage anchor-free detector is presented, achieving a balance between performance and computation complexity. The proposed detection framework provides the following contributions:
• A new SAR ship sample data augmentation framework based on a generative adversarial network (GAN) is proposed, which can purposefully generate abundant hard samples, simulate various hard situations in marine areas, and improve detection performance. Additionally, as data augmentation is only applied in the training stage, it does not incur extra inference costs;
• A cross-dimension attention style embedded ship sample generator, as well as a max-patch discriminator, are constructed;
• A lightweight densely connected sparsely activated detector is constructed, which achieves competitive performance among state-of-the-art detection methods;
• The proposed method is proposal-free and anchor-free, thereby eliminating the complicated computation of the intersection over union (IoU) between the anchor boxes and ground truth boxes during training. As a result, this method is also completely free of the hyper-parameters related to anchor boxes, which improves its flexibility compared to its anchor-based counterparts.
The remainder of this paper is organized as follows. The style embedded ship sample data augmentation is introduced in Section 2. Section 3 presents a detailed description of the lightweight densely connected sparsely activated detection method. Then, the comparative experimental results with real SAR images are provided and analyzed in Section 4. Finally, the paper's conclusion is given in Section 5.

Style Embedded Ship Sample Data Network
Usually, a conventional object detector is trained offline. Therefore, researchers always prefer to exploit this advantage by developing better training methods that make the object detector attain better accuracy without increasing the inference cost [52]. Conventional data augmentation methods crop, rotate, or blur the original samples, whereas hard samples are not specifically augmented and still cannot be detected efficiently. To solve this issue, this section constructs a novel ship sample augmentation method. The concept of this approach is to create hard samples artificially and purposefully. Specifically, ship slices are embedded into SAR images to simulate the various hard situations encountered during detection. However, simply embedding the ship slices into SAR images cannot simulate a real SAR image, as the embedded slices are not in harmony with the surrounding environment. To address this problem, a style embedded ship sample data augmentation network is constructed. Figure 2 shows the flow chart of the proposed sample augmentation method.

Given a real SAR image I, the pre-prepared ship slice images are embedded into I so as to obtain the embedded image I_e. To improve the generated results, this paper proposes a two-channel (original SAR image and embedded SAR image) input mechanism. Next, the ship mask M is made, indicating the regions where the embedded ships are located. Mask M is only used in the training stage to help improve the final result.
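To make the embedding step concrete, a minimal NumPy sketch is given below. The function name, the naive paste (no blending), and the placement inputs are illustrative assumptions, not the paper's exact procedure; the generator is what later harmonizes the pasted slices.

```python
import numpy as np

def embed_ships(sar_image, ship_slices, top_lefts):
    """Paste ship slices into a SAR image and record their mask.

    sar_image  : 2-D array, the real SAR image I
    ship_slices: list of 2-D arrays (pre-prepared ship chips)
    top_lefts  : list of (row, col) paste positions
    Returns the embedded image I_e and the binary ship mask M.
    """
    embedded = sar_image.copy()
    mask = np.zeros_like(sar_image, dtype=np.uint8)
    for chip, (r, c) in zip(ship_slices, top_lefts):
        h, w = chip.shape
        embedded[r:r + h, c:c + w] = chip   # naive paste; G must harmonize it
        mask[r:r + h, c:c + w] = 1          # mark embedded region (training only)
    return embedded, mask
```

The pair (I, I_e) then forms the two-channel input of the generator, while M supervises only the training loss.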
The purpose of the ship sample augmentation method is to train a model that reconstructs I_g to be close to the real SAR image. To achieve this goal, a GAN framework is utilized. As shown in Figure 2, the method consists of two parts: generator G and discriminator D.
Specifically, the real SAR image and the embedded SAR image are treated as positive and negative samples, respectively. On the one hand, the discriminator is trained to distinguish positive images from negative images. On the other hand, the generator is expected to produce a harmonized image that can fool the discriminator. The discriminator and generator improve the performance during the confrontation. Details are described as follows.

Cross-Dimension Attention Style Embedded Ship Sample Generator
Based on U-Net [53], a cross-dimension attention style embedded ship sample generator is constructed in this section. The architecture of the network is shown in Figure 3. It follows the framework of the encoder-decoder. The encoder module utilizes classic convolution to extract features, while the decoder module utilizes deconvolution to expand the spatial resolution of the features and concatenates the same stage features from the encoder module. The concatenation operator in U-Net realizes the interaction between shallow features and deep features. Notably, the importance of features in the shallow and deep levels is different [54], and we hope the generator pays more attention to the embedded ships. As a result, to aggregate the features and improve the generated result, attention block and residual block are inserted, as depicted in Figure 2. The details of the network are shown in Figure 3.

The cross-dimension aggregation attention module is realized by capturing the interactions between the (C, W), (C, H), and (H, W) dimensions of the input features, respectively [54]. Figure 4 shows the structure of this module. The input feature map F_in ∈ R^(C×H×W) goes through three branches. Taking the second branch as an example, F_in is rotated through 90° clockwise along the W axis, and F̂_2 ∈ R^(H×W×C) is the rotated feature.
Then, adaptive pooling is applied to preserve a rich representation of the feature while simultaneously shrinking its depth, which is expressed as:

F̂*_2 = [MaxPool(F̂_2); AvgPool(F̂_2)]

where the max-pooled and average-pooled maps along the first dimension are concatenated, reducing that dimension to 2. Next, the pooled feature is processed through a standard convolution layer and a sigmoid activation layer. After this step, the intermediate output is subsequently rotated through 90° anticlockwise along the W axis to obtain F'_2 ∈ R^(C×1×W). Similarly, the outputs of the other two branches are F'_1 and F'_3.
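One branch of this module can be sketched as follows, assuming the adaptive pooling is the concatenated max/average pooling of the cited cross-dimension attention [54]. The sketch is dependency-free: the learned convolution is stood in for by a simple mean, and the exact rotation/shape conventions are illustrative.

```python
import numpy as np

def z_pool(x):
    """Shrink the first dimension to 2 by stacking max- and average-pooling."""
    return np.stack([x.max(axis=0), x.mean(axis=0)], axis=0)

def branch2(f_in):
    """Second attention branch: rotate, pool, conv stand-in, sigmoid.

    f_in: feature map of shape (C, H, W). The learned convolution is
    replaced by averaging the two pooled maps, purely to keep the sketch
    self-contained; a real implementation uses a k x k convolution.
    """
    rotated = np.transpose(f_in, (1, 2, 0))      # (C, H, W) -> (H, W, C)
    pooled = z_pool(rotated)                      # (2, W, C)
    logits = pooled.mean(axis=0, keepdims=True)   # stand-in for the conv layer
    attn = 1.0 / (1.0 + np.exp(-logits))          # sigmoid activation
    return np.transpose(attn, (2, 0, 1))          # rotate back: (C, 1, W)
```

The returned map broadcasts against F_in as an attention weight over the (C, W) interaction.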
Finally, the residual module is utilized to obtain the aggregation output, which is shown as follows:

F_out = F_in + (1/3)(F_in ⊗ F'_1 + F_in ⊗ F'_2 + F_in ⊗ F'_3)

where ⊗ denotes element-wise multiplication. The generated image I_g = G(I, I_e) is enforced to be close to the real SAR image via:

L_rec = ||(I_g − I) ⊙ M||_2 + ||(I_g − I) ⊙ (1 − M)||_2

where ||·||_2 is the L2-norm, region M represents the region where the embedded ships are located, and region !M (i.e., 1 − M) represents the region without ships. It should be noted that mask M is only used in the training stage to help improve the final result.
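The region-wise reconstruction term can be sketched as below; the per-region weights are illustrative knobs (the paper's exact weighting of the ship region versus the background is not stated in this text).

```python
import numpy as np

def reconstruction_loss(i_g, i_real, mask, w_ship=1.0, w_bg=1.0):
    """Region-wise L2 reconstruction loss between generated and real images.

    The embedded-ship region (mask == 1) and the background (mask == 0)
    are penalized separately, so the generator is supervised more
    explicitly where ships were pasted in.
    """
    diff = i_g - i_real
    ship_term = np.linalg.norm(diff * mask)        # region = M
    bg_term = np.linalg.norm(diff * (1 - mask))    # region = !M
    return w_ship * ship_term + w_bg * bg_term
```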

Max-Patch Discriminator
Discriminator D is designed to help generator G generate more plausible SAR images. In this section, a max-patch discriminator is constructed, which consists of seven convolution layers. After each convolution layer, LeakyReLU activation and instance normalization layers are applied. Sigmoid activation is placed after the last layer. The architecture of the network is shown in Table 1. Voting is used to determine whether the input is positive or negative. Specifically, the response values of the network are sorted, and the largest N values are selected to calculate the average, which is taken as the final discrimination result. N is set as 25 in this paper.
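The voting step reduces the discriminator's response map to a single score; a minimal sketch:

```python
import numpy as np

def max_patch_vote(response_map, n=25):
    """Average the N largest responses of the patch discriminator.

    response_map: 2-D map of per-patch sigmoid outputs.
    Sorting all responses and averaging the top N makes the decision
    depend on the most confident patches rather than the whole image.
    """
    flat = np.sort(response_map.ravel())[::-1]   # descending responses
    return float(flat[:n].mean())                # final discrimination score
```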
Cross entropy is leveraged for training, which is given by:

L_D = −log D(I) − log(1 − D(G(I, I_e)))
L_G = −log D(G(I, I_e))

where D and G denote the discriminator and the generator, respectively, and D(·) and G(·) are their outputs. L_D and L_G represent the discrimination quality loss and generation quality loss, respectively. The overall loss function of the training process is defined as a weighted sum of the losses, which is expressed as:

L_total = λ_1 G_loss + λ_2 D_loss

where λ_1 and λ_2 are set as 1 in this paper. G_loss and D_loss represent the generator loss and discriminator loss, respectively.
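The adversarial objectives follow the standard cross-entropy GAN formulation; a small sketch with scalar discriminator outputs (how DSDet combines these with the reconstruction term is a judgment call, so only the two cross-entropy pieces are shown):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross entropy for a scalar discriminator output in (0, 1)."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def discriminator_loss(d_real, d_fake):
    """D is pushed toward 1 on real images and 0 on generated ones."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    """G is rewarded when D scores the generated image as real."""
    return bce(d_fake, 1.0)
```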

Lightweight Densely Connected Sparsely Activated Detector
In this section, firstly, the convolution module is introduced. Then, the lightweight backbone network is constructed. Finally, the detection framework is described in detail.

Convolution Module
The conventional convolutional neural network is widely used in modern network architectures, such as ResNet [55], GoogLeNet [56], and Darknet [36][37][38], and is used by most object detectors to extract features. However, good feature extraction ability is always associated with a large number of parameters and high computational complexity. The group convolution and the depthwise convolution are two architectures that reduce the computational complexity by changing the convolution density between channels. The architectures of the conventional convolution, the group convolution, and the depthwise convolution are shown in Figure 5.

According to Figure 5a, the conventional convolution's filter should process each feature map to generate a new layer. For Figure 5a, the conventional convolution has 8 × 4 = 32 convolution operators. Compared with the conventional convolution, the group convolution has a lower computational complexity. In regard to Figure 5b, the group convolution has 4 × 4 = 16 convolution operators. The depthwise convolution only needs to convolute one input channel. Hence, for Figure 5c, the depthwise convolution merely has 8 × 1 = 8 convolution operators. To reduce the computational complexity, the group convolution and the depthwise convolution are adopted to construct the backbone.
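These operator counts follow from one rule: each output filter convolves only the input channels of its own group. A small helper makes the arithmetic explicit (counting, as above, one operator per filter-channel pairing):

```python
def conv_operator_count(in_channels, out_channels, groups=1):
    """Count filter-to-channel convolution pairings for a grouped convolution.

    groups=1 gives the conventional convolution; groups == in_channels with
    out_channels == in_channels gives the depthwise convolution.
    """
    return (in_channels // groups) * out_channels

# Figure 5 configuration: 8 input channels
assert conv_operator_count(8, 4, groups=1) == 32   # conventional, Figure 5a
assert conv_operator_count(8, 4, groups=2) == 16   # group convolution, Figure 5b
assert conv_operator_count(8, 8, groups=8) == 8    # depthwise, Figure 5c
```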

The Architecture of DSNet Backbone
Target detection requires a wealth of information. In deep convolution networks, the degree of information abundance varies from low to high, while the spatial resolution varies from high to low. With the increase of network depth and the decrease of spatial resolution, a single layer cannot provide enough information. How to better integrate the information between different stages and blocks of the network is a problem that needs further consideration.
Reusing features in deep networks via dense connection is an effective way to achieve high computational efficiency [57]. The counter-intuitive effect of the densely connected mode is that it requires fewer parameters than traditional convolutional neural networks, as it does not need to relearn redundant features. A densely connected mode can improve the flow of information through the network, which makes training easier. Each layer has direct access to the gradients from the loss function and to the original input, which allows for deeper supervision. Furthermore, a dense connection has a regularization effect, which reduces overfitting on smaller training datasets [57].
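Dense connectivity means each layer consumes the channel-wise concatenation of the input and all earlier outputs. A minimal sketch (the layers are stand-in callables, not real convolutions):

```python
import numpy as np

def dense_forward(x, layers):
    """Forward pass of a densely connected block.

    x      : input feature array of shape (C, H, W)
    layers : callables mapping an array of growing channel depth to a
             fixed-width output (stand-ins for conv layers)
    Every layer sees the concatenation of the input and all preceding
    outputs, so shallow features are reused instead of being relearned.
    """
    features = [x]
    for layer in layers:
        out = layer(np.concatenate(features, axis=0))
        features.append(out)
    return np.concatenate(features, axis=0)
```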
Since a densely connected model reuses shallow features, as the network depth increases, the number of network layers increases and the computational complexity also increases significantly. To solve this problem, DSNet is constructed, which adopts a densely connected mode to reuse shallow features and utilizes sparse convolution (e.g., group convolution and depthwise convolution) to activate the feature layers. Moreover, the output channels of each convolutional layer are shuffled to ensure communication between different groups. The sparse activation module of this architecture is shown in Figure 6.
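The channel shuffle that lets groups communicate is a fixed permutation, commonly implemented as a reshape-transpose-reshape; a sketch (shapes and naming are illustrative):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels so grouped convolutions can exchange information.

    x: feature array of shape (C, H, W) with C divisible by groups.
    Reshaping to (groups, C // groups, H, W), swapping the first two axes,
    and flattening back interleaves channels across the groups.
    """
    c, h, w = x.shape
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)
```

After the shuffle, each group in the next sparse convolution receives channels originating from every group of the previous layer.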


The Location of Bounding Box
The geometric shape of the ship target is a long ellipse, and its orientation is arbitrary. Therefore, the corresponding bounding boxes undergo large changes (e.g., extreme aspect ratios). Figure 8 shows typical SAR ship images. The design of the anchors is empirical and cannot fully describe the shapes of ship targets. Although the regression branch can slightly amend the anchor box, the anchor-based framework still has lower flexibility, and ships with peculiar shapes may be missed. Therefore, an anchor-free framework could be more suitable for ship detection.
Depending on how the bounding box's location is encoded, anchor-free methods can be divided into point-grouping detectors and point-vector detectors [58]. Point-grouping detectors use two individual branches to detect key points and their offset maps. These key points can then be grouped together by the offset maps. The point-vector detectors determine the bounding box of targets by the key point and its vector. The encoded location of the bounding box is illustrated in Figure 9. Considering that point-grouping methods need to cluster the detected corner points, which suffer from mismatching in the case of densely distributed conditions [57], the point-vector bounding box is adopted in this paper.
Anchor-based detectors use the pixel position on the input image as the anchor's center point to regress the bounding box, amending the preset anchors. In contrast, DSDet regards the locations of bounding boxes as training samples instead of anchor boxes and directly regresses the bounding box at the location. All the points in the ground truth boxes are regarded as positive samples. This is different from anchor-based methods which only select the high IoU score anchor boxes as the positive samples.
Anchor-based detectors use the pixel position on the input image as the anchor's center point to regress the bounding box, amending the preset anchors. In contrast, DSDet regards the locations of bounding boxes as training samples instead of anchor boxes and directly regresses the bounding box at the location. All the points in the ground truth boxes are regarded as positive samples. This is different from anchor-based methods, which only select anchor boxes with high IoU scores as the positive samples.
The predicted bounding box is encoded by a four-dimensional (4-D) vector (x_t, y_t, x_b, y_b), where (x_t, y_t) and (x_b, y_b) denote the coordinates of the top-left and bottom-right corners of the bounding box. The 4-D training target v = (l, t, r, b) is utilized to regress the bounding box, which is calculated by:

l = x − x_t, t = y − y_t, r = x_b − x, b = y_b − y,

where (x, y) is the coordinate of the pixel point.
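The training target above amounts to the four distances from a pixel to the sides of its ground truth box. A minimal sketch (the function name and argument layout are illustrative, not from the paper):

```python
def regression_target(x: float, y: float, box: tuple) -> tuple:
    """Distances from pixel (x, y) to the four sides of box (xt, yt, xb, yb)."""
    xt, yt, xb, yb = box
    l = x - xt  # distance to the left side
    t = y - yt  # distance to the top side
    r = xb - x  # distance to the right side
    b = yb - y  # distance to the bottom side
    return l, t, r, b
```

All four values are non-negative exactly when (x, y) lies inside the box, which is why every interior point can serve as a positive sample.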

Deep Feature Fusion Pyramid
Early methods only employed one stage feature map to detect targets. High-level feature maps have large receptive fields and can capture richer semantic information. However, it is hard for them to detect small-scale targets due to their low spatial resolution. In contrast, low-level feature maps have richer spatial information but provide less semantic information, enabling high localization accuracy but worse classification performance. This imbalance between different levels reduces multi-scale ship detection performance. Therefore, it is a natural choice to construct a feature pyramid using different levels of features to detect targets. Furthermore, different level features capture different context information. The important features of targets may not distribute in a single level. Hence, features at different levels should be appropriately fused.
Based on the considerations above, this paper introduces a deep feature fusion pyramid, which aims to let small proposals access the fusion pyramid exploiting more useful contextual information and large proposals acquire rich spatial information.
The structure is illustrated in Figure 10, for which DSNet is taken as the backbone of the detector. The outputs of stages 3-5 are utilized to detect targets. As shown in Figure 1, the outputs of stages 3-5 in the network are defined as {C_3, C_4, C_5}, and {N_3, N_4, N_5} denote the feature levels generated by the feature fusion pyramid. The augmented path starts from the lowest level C_3 and gradually approaches C_5. From C_3 to C_5, the spatial size is gradually down-sampled with a factor of 2. In order to reduce the computational complexity, the feature fusion pyramid is simplified by using fewer convolution operations. In particular, for feature levels 3-4, feature map C_i first passes through a 1 × 1 convolutional layer. Then, each element in this layer is added with the up-sampled higher-level feature to obtain an intermediate feature N'_i. Finally, the intermediate feature N'_i is processed via the down-top path to generate N_i.

Loss Function
At the end of the detection, a non-maximum suppression (NMS) process is adopted to select the positions of the targets. The NMS process ranks all detection results according to their classification confidence and selects the bounding box with the highest classification score as the final position of the target. This process carries the risk that bounding boxes with low classification confidence but high quality may be filtered out. To address this issue, generalized focal loss [59] is introduced in the loss function. The total loss function is constructed as:

L = (1/N_pos) Σ_{x,y} L_qua(q_{x,y}, q*_{x,y}) + (1/N_pos) Σ_{(x,y)∈N} L_reg(b_{x,y}, b*_{x,y}),

where L_qua and L_reg represent the quality loss and the regression loss, respectively. Here, N_pos is the number of positive samples, while q_{x,y} and q*_{x,y} denote the quality prediction score and the ground truth label, respectively. The ground truth label q*_{x,y} represents the IoU score of the regressed box. The quality loss adopts quality focal loss [59] to measure the difference between the predicted quality and the ground truth label. N denotes the positive region. b_{x,y} and b*_{x,y} denote the predicted location and the ground truth box.
The quality focal loss is calculated by:

L_qua = −|q*_{x,y} − q_{x,y}|^β ((1 − q*_{x,y}) log(1 − q_{x,y}) + q*_{x,y} log q_{x,y}),

where β is set as 2.
Distance-IoU loss [60] and distribution focal loss [59] are adopted to measure the distance between the predicted box and the ground truth box. The Distance-IoU loss is calculated by:

L_DIoU = 1 − IoU(B, B_gt) + ρ²(b, b_gt)/c²,

where B and B_gt represent the predicted box and the ground truth box, b and b_gt are their center points, ρ(·) is the Euclidean distance, and c is the diagonal length of C, the smallest box covering B and B_gt. The two regression terms are weighted by λ_3 and λ_4, which are set as 0.3 and 2, respectively. The relative offsets from the location to the four sides of a bounding box are adopted as the regression targets, as shown in Figure 9c. g is the regressed label. Given the range of label g with minimum g_0 and maximum g_n (g_0 < g < g_n), the range [g_0, g_n] is divided into a set {g_0, g_1, ..., g_n}. The estimated regression value ĝ can be calculated by:

ĝ = Σ_{i=0}^{n} P(g_i) g_i,

where P(·) can be easily implemented through a SoftMax layer S(·) consisting of n + 1 units, with P(g_i) denoted as S_i for simplicity. n is set as 7, and the interval is 1. Then, the distribution focal loss can be expressed as follows:

L_DFL(S_i, S_{i+1}) = −((g_{i+1} − g) log S_i + (g − g_i) log S_{i+1}),

where g_i and g_{i+1} are the labels nearest to g (g_i ≤ g ≤ g_{i+1}).
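A minimal sketch of the distribution-focal-loss idea, assuming plain Python lists of softmax probabilities over the discretized range (function names and argument defaults are ours, not from the paper):

```python
import math

def expected_regression_value(probs, g0=0.0, interval=1.0):
    """g_hat = sum_i S_i * g_i over the discretized range {g0, g0+interval, ...}."""
    return sum(p * (g0 + i * interval) for i, p in enumerate(probs))

def distribution_focal_loss(probs, g, g0=0.0, interval=1.0):
    """Concentrate probability mass on the two bins nearest the continuous label g."""
    i = int((g - g0) / interval)      # index of the left neighbor g_i
    gi = g0 + i * interval
    gi1 = gi + interval
    si, si1 = probs[i], probs[i + 1]
    # Weight the two nearest bins by their distance to g, as in L_DFL above.
    return -((gi1 - g) * math.log(si) + (g - gi) * math.log(si1))
```

The loss is minimized when the predicted distribution places all its mass on the two bins bracketing g, in proportion to their distances from it.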

Experiments and Discussions
In this section, experiments with real SAR images are carried out to assess the effectiveness of the proposed method. In the following, the dataset and evaluation metrics are introduced first. Then, the performance of the ship sample augmentation method is illustrated. Finally, detailed experiments on the proposed lightweight detection network are conducted.

Dataset
The SSDD [61] and HRSID [8] datasets are selected to evaluate the proposed method. SSDD is the first public SAR ship detection dataset, mainly provided by the RADARSAT-2, TerraSAR-X, and Sentinel-1 sensors, acquired over Yantai, China, and Visakhapatnam, India, with resolutions of 1 m to 15 m. It contains a large number of ship targets in open-sea and coastal areas. In SSDD, there are 1160 images and 2456 ships, with an average of 2.12 ships per image. The training subset contains 928 images, and the test subset contains 232 images.
HRSID is a large SAR ship detection dataset published recently. It contains multi-scale ships labeled with bounding boxes in various environments, covering different scenes, sensor types, and polarization modes. Statistically, there are 5604 cropped SAR images and 16,951 annotated ships in HRSID, with an average of 3 ships per image. Table 2 shows the main parameters of SSDD and HRSID. In the data augmentation experiment, half of the training data in SSDD are randomly selected as positive samples, and the other half are used to embed ship slices to train the generator. The training epoch is 50, and the optimizer is Adam with a learning rate of 0.0004; beta 1 and beta 2 are set as 0.9 and 0.999, respectively.
The detector model is pre-trained on the COCO dataset [62]. In the following experiments, the training epoch is 100, and the stochastic gradient descent (SGD) algorithm is used as the optimizer. The initial learning rate is set as 0.1, and it decays at the 50th and 75th epochs to 0.01 and 0.001, respectively.
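The step decay described above can be sketched as a small helper (a hypothetical function, assuming the decay takes effect exactly at epochs 50 and 75):

```python
def learning_rate(epoch: int, base_lr: float = 0.1) -> float:
    """Step schedule matching the described training setup."""
    if epoch < 50:
        return base_lr          # 0.1 for the first 50 epochs
    if epoch < 75:
        return base_lr * 0.1    # 0.01 between epochs 50 and 75
    return base_lr * 0.01       # 0.001 afterwards
```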

Evaluation Criteria
In order to quantitatively evaluate the detection performance of the network, the following evaluation criteria are used.
The detection precision and recall are the basic performance evaluation criteria of traditional detection algorithms. They are defined as:

Precision = TP/(TP + FP), Recall = TP/(TP + FN),

where TP is the number of truly detected ships, FP is the number of background regions detected as ships, and FN represents the number of ships detected as background. A truly detected ship is defined as a target whose IoU between its bounding box and the ground truth is higher than 0.5. High precision and high recall are difficult to achieve at the same time; hence, AP, shown in Equation (19), is adopted to evaluate the overall performance of the detection methods:

AP = ∫_0^1 P(R) dR, (19)

where P denotes precision and R represents recall. AP is the primary challenge metric, averaged over ten IoU thresholds distributed from 0.5 to 0.95 with a step of 0.05. AP50 is the AP score when the IoU threshold is 0.5; similarly, AP75 is the AP score when the IoU threshold is 0.75. APs, APm, and APl denote the AP for objects of small (area < 32² pixels), medium (32² < area < 64² pixels), and large (area > 64² pixels) size.

Figure 11 shows the generation performance of the proposed sample augmentation method. Figure 11a shows the original SAR images; Figure 11b shows the embedded SAR images that simulate the various states of ship targets in the inshore and offshore areas; Figure 11c illustrates the generated images. Evidently, the embedded ships are very inconsistent with the surrounding environment (as shown in Figure 11b). On the contrary, ships in the generated images are observed to be consistent with the surrounding environment, which demonstrates the effectiveness of the proposed augmentation method.
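The IoU, precision, and recall criteria defined in the Evaluation Criteria subsection can be sketched in a few lines (illustrative helper names; boxes are given as corner coordinates):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)
```

A detection counts as a true positive when `iou` between its box and a ground truth box exceeds the chosen threshold (0.5 for AP50, 0.75 for AP75).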

The Comparison of the Generated Results between the Proposed Method and U-Net
In order to verify the effectiveness of the proposed generator network, it is compared with U-Net. Figure 12a shows the embedded SAR images, while Figure 12b depicts the results of U-Net and Figure 12c illustrates the results of the proposed method. According to Figure 12, the proposed network achieves superior generation performance compared to U-Net. It can be observed that the proposed method not only preserves the details of the ship targets but also integrates the targets and the background well, which demonstrates its superiority.


The Effectiveness of the Proposed Two-Channel Input Mechanism
To verify the effectiveness of the proposed two-channel input mechanism, an experiment is conducted. Figure 13 shows the results of the single-channel input (only the embedded SAR image) and the two-channel input. Evidently, the results of the two-channel input exhibit better performance. The reason is that the two-channel input mechanism increases the information available to the generator, so that the generator can easily locate the embedded targets by comparison and make them harmonious with the surrounding environment.
On the contrary, the single-channel input mechanism is unable to specify the target areas for the generator, which increases the learning difficulty of the generator.


Accuracy
To evaluate the performance of the proposed method, comparison experiments with Faster R-CNN [35], YOLO-V3 [38], FCOS [58], SSD [63], and EfficientDet [64] are conducted. The comparison results of the detection performance are quantitatively shown in Table 3, where the results in bold signify the best result of the corresponding index. Without considering the data augmentation method, the proposed DSDet (DSNet 2.0×) obtains the highest performance in AP, AP50, and APs, with 1.2%, 3.1%, and 1.7% improvements over the best of the comparison methods. As for AP75 and APm, the performance of the proposed DSDet (DSNet 2.0×) is slightly lower than the best comparison method, with a rather small gap. In terms of overall performance, the proposed method demonstrates obvious advantages. Moreover, when using the proposed data augmentation approach, the proposed method exhibits superior performance in almost all evaluation criteria. Figure 14 shows the effect of the proposed data augmentation method on AP50 and AP75. The proposed data augmentation approach gains 1.5%, 0.6%, 1.6%, 6.3%, 0.6%, 0.8%, 1.2%, and 0.5% in terms of AP50 for Faster R-CNN, YOLO-V3, FCOS, SSD, EfficientDet, DSNet 1.0×, and DSNet 2.0×, respectively. Moreover, the AP75 values of Faster R-CNN, YOLO-V3, FCOS, SSD, DSNet 1.0×, and DSNet 2.0× also gain 0.5%, 0.2%, 3.1%, 0.6%, 1.0%, 0.8%, and 1.6% improvement, respectively.
Actually, the AP, APs, APm, and APl performances are also significantly improved, which means the small, medium, and large targets' detection accuracies are also improved, as shown in Table 3. Evidently, the proposed data augmentation approach can effectively improve the detection performance.

Computational Complexity
The visualization results under different metrics are given in Figure 15 to illustrate the complexity and accuracy of the proposed method. The heavier comparison methods require 121.1 GFLOPs, 127.8 GFLOPs, and 107.5 GFLOPs, respectively, and even the second-lightest method, YOLO-V3, is still heavier than the proposed DSDet 1.0× by 24-fold. Moreover, the accuracy of the proposed detector is also the highest. In general, the above results demonstrate that, compared with the comparison methods, the proposed detector has the highest accuracy, the lowest computational complexity, and the least number of parameters.

Here, compared with the comparison methods, the proposed detector has the least number of parameters (0.7 M). The input size is set uniformly as 800 × 800 to calculate the computation complexity.

Results of Ship Detection on HRSID
To verify the robustness and migration capacity of the proposed detector across different datasets, its detection performance is further tested on the HRSID dataset. HRSID provides abundant baselines, which are used here to verify the detection performance. Table 4 shows the comparison of different detectors on the HRSID dataset. According to Table 4, compared with the other baselines, HRSDNet with backbone HRFPN-W40 has the best overall performance, whose AP, AP50, AP75, APs, APm, and APl are 69.4%, 89.3%, 79.8%, 70.3%, 71.1%, and 28.9%, respectively. The detection accuracy of RetinaNet with backbone ResNet-101 + FPN is much lower than the other baselines: its AP, AP50, AP75, APs, APm, and APl are 59.8%, 84.8%, 67.2%, 60.4%, 62.7%, and 26.5%, respectively.
Additionally, the AP, AP50, AP75, APs, APm, and APl of DSDet with backbone DSNet 2.0× are found to be 60.5%, 90.7%, 74.6%, 66.8%, 64.0%, and 7.6%, respectively. The performance of the proposed DSDet is slightly lower than that of the highest-performing baseline, HRSDNet. Despite the slight sacrifice in accuracy, the number of parameters and the model size of the proposed detector are rather small: 1/130.2 and 1/130 of those of HRSDNet with backbone HRFPN-W40. In terms of the number of parameters and model size, the proposed detector outperforms all comparison detectors by a large margin. Moreover, the proposed detector also attains the highest accuracy in AP50. In general, the proposed detector has a competitive overall performance with the least number of parameters and the smallest model size among the state-of-the-art detectors.


Conclusions
Compared with optical datasets, the number of samples in SAR datasets is much smaller. Moreover, most state-of-the-art CNN-based ship target detectors are computationally expensive. To address these issues, this paper proposes a SAR ship sample data augmentation method as well as a lightweight densely connected sparsely activated detector. The proposed sample data augmentation framework can purposefully generate abundant hard samples, simulate various hard situations in marine areas, and improve detection performance. In addition, dense connection and sparse convolution modules are utilized to construct the backbone. Based on the proposed backbone, a low-cost one-stage anchor-free detector is presented. The validity of the proposed method is then confirmed on the public datasets SSDD and HRSID. The experimental results indicate that the proposed data augmentation method can evidently improve the detection performance. Benefiting from the lightweight design of the detection network, the proposed detector achieves competitive performance compared to other state-of-the-art detectors with the least number of parameters and lowest computation complexity.
Ship instance segmentation in SAR images under complex sea conditions is an important research topic in the field of detection. The proposed lightweight detector can be remolded to construct a low-cost SAR ship instance segmentation method. Consequently, our future studies will focus on the ship instance segmentation for high-resolution SAR images.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.