Rethinking the Random Cropping Data Augmentation Method Used in the Training of CNN-Based SAR Image Ship Detector

The random cropping data augmentation method is widely used to train convolutional neural network (CNN)-based target detectors on optical images (e.g., the COCO dataset). It can expand the effective scale of the dataset dozens of times while adding only a small computational cost to training. In addition, random cropping greatly enhances the spatial robustness of the model, because it makes the same target appear at different positions in the sample images. Nowadays, random cropping and random flipping have become the standard configuration for tasks with limited training data, which makes it natural to introduce them into the training of CNN-based synthetic aperture radar (SAR) image ship detectors. However, in this paper, we show that directly introducing the traditional random cropping method into the training of a CNN-based SAR image ship detector may generate considerable noise in the gradient during back propagation, which hurts detection performance. In order to eliminate this noise in the training gradient, a simple and effective training method based on a feature map mask is proposed. Experiments prove that the proposed method can effectively eliminate the gradient noise introduced by random cropping and significantly improve detection performance under a variety of evaluation indicators without increasing inference cost.


Introduction
Object detection is an important research direction in the field of computer vision. Thanks to the rapid development of deep learning technology, many detection models based on convolutional neural networks (CNNs) have been designed to achieve high-precision optical image target detection, such as YOLO [1], SSD [2], and Faster-RCNN [3]. At the same time, the prosperity of optical image target detection technology also brings hope for high-precision synthetic aperture radar (SAR) image ship detection tasks. Since simple morphological filtering or traditional detection methods cannot adequately solve the SAR image ship detection problem in high-resolution nearshore scenes, many researchers have introduced excellent CNN-based optical image detection models into SAR image ship detection [4][5][6][7]. These studies proved that the performance of CNN-based detection models on SAR ship detection tasks is much better than that of traditional SAR ship detection algorithms such as CFAR [8].
Since neural network models are prone to overfitting, it is necessary to perform some data augmentation operations in the training phase in order to obtain a high-precision CNN-based detection model [9]. Random cropping is one of the most effective data augmentation methods for training optical image target detection models, and it is also the basis for other more advanced data augmentation methods, such as mosaic [10] and CutMix [11]. It randomly cuts a slice from the original training image as the input of the model during the training phase, which greatly enriches the diversity of the model's training data. Random cropping ensures that the same target will not always appear in the same position of the corresponding training sample image, which effectively prevents the model from overfitting to the target's spatial position. Together with random flipping, random cropping and its variants are intensively applied in current research on optical image target detection algorithms [10][11][12].
As researchers continue to introduce excellent optical image target detection models into the SAR image ship detection task, various data augmentation methods including random cropping have also been introduced into the training process of the CNN-based SAR image ship detector [8,13]. However, directly introducing data augmentation methods for optical image datasets into SAR image ship datasets may cause some unexpected problems, which deserve further research.
In this paper, a careful analysis of the geometric characteristics of the ship targets in the SAR image ship detection dataset is performed. A training gradient noise source introduced by the traditional random cropping data augmentation method during the training process of a CNN-based SAR image ship detector is pointed out for the first time. This training gradient noise source is considered to be harmful for the detection performance of the SAR image ship detector. In order to eliminate this training gradient noise, a simple training method is proposed for CNN-based SAR image ship detector training process that utilizes random cropping as its data augmentation method. Experimental results show that removing these gradient noises can significantly improve the detection performance of the model, which in turn proves the necessity of removing these gradient noises.
The main improvements as well as the contributions of this paper are mainly reflected from the following aspects: • A hidden source of training gradient noise introduced by the traditional random cropping data augmentation method is pointed out for the first time, which can lead to inaccurate target bounding box regression results and false alarm targets. • A simple training method is proposed to suppress the gradient noise introduced by the traditional random cropping algorithm. This method uses a feature map mask to prevent pixels that generate gradient noise from participating in the calculation of training loss. The proposed method is proven to effectively improve the performance of the CNN-based SAR image ship detector, especially for high-precision bounding box regression tasks.
The remainder of the paper is organized as follows: Section 2 introduces the background of the problem, the basic network model used in this paper and the proposed training strategy for random cropping. Section 3 reports the experimental results on public dataset. Sections 4 and 5 come to a discussion and conclusion.

Basic Detection Model
This paper adopts a simplified CenterNet [12] model as the basic CNN model for testing the proposed training method, and we name our model ShipDet. ShipDet uses the DLA-34 [14] segmentation network as its basic structure and is trained with the basic loss calculation method proposed in TTFNet [15]. The detailed structure of ShipDet is shown in Figure 1. Assuming the size of the input image is X × Y, the DLA-34 segmentation network first performs feature extraction on the image and generates a feature map of size X/4 × Y/4. The feature map is then sent to two convolutional layers for target localization and bounding box regression, respectively.
The first convolution layer has a 1 × 1 convolution kernel and 1 channel, and it is followed by a sigmoid layer. Let Ĥ be the output of this sigmoid layer, where Ĥ ∈ O^(1 × X/4 × Y/4), O ∈ (0, 1). Ĥ_ij represents the probability that the pixel (i, j) belongs to the center point of a target.
The second convolution layer has a 1 × 1 convolution kernel and 4 channels. Let Ŝ be the output of the second convolution layer, where Ŝ ∈ R^(4 × X/4 × Y/4). If the pixel (i, j) is considered to be the center point of a target, Ŝ_ij indicates the distances between the pixel (i, j) and the four sides of this bounding box.
In the training phase, the localization loss is calculated with modified focal loss [12] and the regression loss is calculated with L1 loss.
Given the m-th annotated box, it is first linearly mapped to the feature map scale with a stride of 4. Then, a 2D Gaussian kernel centered on the box is used to generate the ground truth heatmap H. Given the prediction Ĥ and the ground truth H, the localization loss L_loc is calculated with the modified focal loss of [12]:

L_loc = -(1/N) Σ_ij { (1 − Ĥ_ij)^α · log(Ĥ_ij),                 if H_ij = 1
                      (1 − H_ij)^β · (Ĥ_ij)^α · log(1 − Ĥ_ij),  otherwise }

where N is the number of targets in the training image, and α and β are the hyperparameters of the modified focal loss.
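As an illustration, the Gaussian ground-truth construction above can be sketched in a few lines of NumPy. The sigma-from-box-size rule used here is an assumption for illustration (TTFNet [15] derives per-axis standard deviations from the box dimensions), and the function name is ours:

```python
import numpy as np

def gaussian_heatmap(shape, boxes, stride=4):
    """Build a CenterNet-style ground-truth heatmap H.

    shape  -- (H, W) of the feature map (input size / stride)
    boxes  -- list of (x1, y1, x2, y2) in input-image pixels
    """
    H = np.zeros(shape, dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        # map the box to the feature-map scale
        cx, cy = (x1 + x2) / 2 / stride, (y1 + y2) / 2 / stride
        w, h = (x2 - x1) / stride, (y2 - y1) / stride
        # assumed sigma rule: one sixth of the box size per axis
        sx, sy = max(w / 6, 1e-3), max(h / 6, 1e-3)
        ys, xs = np.arange(shape[0]), np.arange(shape[1])
        g = np.exp(-((xs[None, :] - cx) ** 2) / (2 * sx ** 2)
                   - ((ys[:, None] - cy) ** 2) / (2 * sy ** 2))
        H = np.maximum(H, g)  # overlapping Gaussians: keep the maximum
    return H
```

A box covering pixels 40-88 of the input lands at feature-map center (16, 16), where the heatmap peaks at 1 and decays smoothly away from the center.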
Given the m-th annotated box in the feature map scale, the ground truth of regression is given by S ∈ R^(4 × X/4 × Y/4). For a pixel (i, j) located inside the m-th annotated box in the feature map scale, S_ij can be represented as a 4-dim vector (w_l, h_t, w_r, h_b)_m, which is defined as the distances from pixel (i, j) to the four sides of the m-th box in the feature map scale with a normalization coefficient of 4. In other words, the predicted box (x_1, y_1, x_2, y_2) in the original image scale can be represented as:

(x_1, y_1, x_2, y_2) = 4 · (i − ŵ_l, j − ĥ_t, i + ŵ_r, j + ĥ_b)   (2)

Let A be the set of pixels where H_ij > 0; the regression loss is then calculated as:

L_reg = (1/N_reg) Σ_{(i,j)∈A} W_ij · |Ŝ_ij − S_ij|_1

where N_reg is the number of pixels where H_ij > 0, and W_ij is a weight used to balance the loss of bounding boxes of different sizes, which does not affect our subsequent discussion. The calculation of W_ij can be found in [15]. The final total loss can be expressed as:

L = w_loc · L_loc + w_reg · L_reg
In our setting, w_loc = 1.0 and w_reg = 5.0.
In the test phase, only the pixel corresponding to a peak point in the localization branch output feature map is considered as the center point of a predicted target bounding box, and the output of other pixels in the localization branch will be discarded.
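The box decoding at a peak point can be illustrated as follows; the axis convention (i indexing the horizontal axis, j the vertical axis) and the function name are our assumptions:

```python
def decode_box(i, j, dists, stride=4):
    """Decode a box from a peak at pixel (i, j) of the regression map.

    dists -- (w_l, h_t, w_r, h_b): predicted distances to the four box
             sides, in the feature-map scale.
    Returns (x1, y1, x2, y2) in input-image pixels, per Equation (2):
    the peak coordinates are offset by the distances, then scaled by
    the stride of 4.
    """
    wl, ht, wr, hb = dists
    return (stride * (i - wl), stride * (j - ht),
            stride * (i + wr), stride * (j + hb))
```

For example, a peak at (16, 16) with all four distances equal to 4 decodes to the box (48, 48, 80, 80) in the input image.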
Compared with traditional anchor-based detection models such as RetinaNet [16] or Faster-RCNN [3], we found that this simplified anchor-free model has a faster convergence speed (as well as a faster inference speed) in the field of SAR ship detection, so we use it as the basic model to analyze our proposed method.


Training Gradient Noise Introduced by Random Cropping
The traditional random cropping data augmentation method used for target detection first randomly selects a target in the original image before cropping, and then it randomly crops an image slice under the premise that this target is included in the cropped image. After the cropped image slice is obtained, the cropping algorithm will automatically adjust the target bounding boxes in the image slice to ensure that the range of target bounding boxes in the image slice is limited to the range of the image slice.
The traditional random cropping algorithm does not introduce obvious errors in some detection tasks. Taking the optical vehicle detection training sample shown in Figure 3 as an example, the two vehicle targets represented by the red boxes exceed the cropping range represented by the gray box, so the traditional random cropping algorithm automatically moves the edges of the red bounding boxes that lie beyond the cropping range to the edge of the image slice. As can be seen from Figure 3, the traditional random cropping algorithm can ensure that the bounding box at the edge of the image slice remains sufficiently accurate for a horizontal target such as a vehicle. However, the orientation angle of many ship targets is neither horizontal nor vertical in the SAR image ship detection task, which makes the bounding boxes of targets located at the edge of a randomly cropped image slice no longer accurate. Figure 4 shows three training samples containing ship targets of different scales. It can be seen from the red bounding boxes in the right column of Figure 4 that target bounding boxes which cross the cropping boundary are adjusted by the traditional random cropping algorithm, but they may still be inaccurate after this automatic adjustment. Part of the edge of each red bounding box should be adjusted to the red dotted line after random cropping. However, this operation cannot be done automatically by the random cropping algorithm, because the algorithm does not know the true boundary of each target.
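The geometric core of the traditional cropping step, including the lossy box-clipping adjustment, can be sketched as follows. The function name and the strategy for keeping the randomly chosen target inside the window are our assumptions, and boxes larger than the crop size are not handled in this sketch:

```python
import random

def random_crop(img_w, img_h, boxes, crop, rng=random):
    """Traditional random cropping for detection (geometry only).

    Picks a crop window that contains a randomly chosen target, then
    clips every surviving box to the window -- the adjustment step
    that can leave inaccurate boxes for oblique ship targets.
    boxes -- list of (x1, y1, x2, y2); crop -- slice side length C.
    Returns the window origin and the clipped boxes in slice coords.
    """
    x1, y1, x2, y2 = rng.choice(boxes)
    # window origin range that keeps the chosen box fully inside
    x0 = rng.randint(max(0, int(x2) - crop), min(int(x1), img_w - crop))
    y0 = rng.randint(max(0, int(y2) - crop), min(int(y1), img_h - crop))
    clipped = []
    for bx1, by1, bx2, by2 in boxes:
        # clip each box to the crop window (the lossy adjustment)
        cx1 = min(max(bx1 - x0, 0), crop)
        cy1 = min(max(by1 - y0, 0), crop)
        cx2 = min(max(bx2 - x0, 0), crop)
        cy2 = min(max(by2 - y0, 0), crop)
        if cx2 > cx1 and cy2 > cy1:  # drop boxes fully outside
            clipped.append((cx1, cy1, cx2, cy2))
    return (x0, y0), clipped
```

Note that the clipping only intersects each box with the window; it has no knowledge of the true extent of the object inside the slice.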
If these training image slices generated by the traditional random cropping algorithm are used when training the SAR image ship detection model, those incorrect target borders will cause the model to make errors when calculating the training loss. These errors will introduce noise into the gradient in the back propagation process. Obviously, this gradient noise will hurt the learning process of the model, leading to the deterioration of model performance.
Since the traditional random cropping algorithm cannot automatically correct the inaccurate target bounding boxes it produces, we can only try to eliminate the contribution of these inaccurate bounding boxes to the training loss as much as possible during model training, thereby avoiding the introduction of noise into the training gradient. We propose to generate a feature map mask to guide the loss calculation, which is explained in Figure 5. When the random cropping algorithm generates an image slice, we generate a feature mask according to the target distribution in the image slice.
First, we generate a mask M of the same size as the image slice and set the value of each pixel of the mask to 1.
Next, assume that the i-th target bounding box in the original image crosses the cropping boundary. Let a_i be the area of the i-th target bounding box in the original image, and b_i be the area of the i-th target bounding box after automatic adjustment by the random cropping algorithm. If

b_i / a_i < T_c

then all the mask pixel values corresponding to the inside of the i-th target bounding box in the image slice are set to 0. T_c is used to control the tolerance of the model to the target bounding box error introduced by random cropping. Figure 6 shows an image containing four identical targets and an example of its cropping result. It can be found in Figure 6 that a smaller b_i/a_i means a larger bounding box error for the same target. Figure 7 shows the corresponding image masks of the cropping result in Figure 6 under different T_c. In these masks, gray represents pixels with value 1, and black represents pixels with value 0. It can be seen from Figure 7 that a larger T_c means less error introduced by random cropping into the model loss.
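The mask construction above can be sketched as follows. The helper name is ours, and for simplicity only the clipped boxes and their original-image areas are passed in:

```python
import numpy as np

def build_mask(crop, orig_areas, clipped_boxes, t_c):
    """Build the per-slice mask M for one cropped training slice.

    orig_areas    -- a_i: area of each box in the original image
    clipped_boxes -- (x1, y1, x2, y2): the same boxes after the crop
                     algorithm clipped them to the slice
    Pixels inside any box with b_i / a_i < T_c are zeroed, so they
    are excluded from the training loss.
    """
    M = np.ones((crop, crop), dtype=np.float32)
    for a_i, (x1, y1, x2, y2) in zip(orig_areas, clipped_boxes):
        b_i = max(x2 - x1, 0) * max(y2 - y1, 0)  # clipped-box area
        if b_i / a_i < t_c:
            M[int(y1):int(np.ceil(y2)), int(x1):int(np.ceil(x2))] = 0.0
    return M
```

For example, a box whose area shrinks from 100 to 25 pixels (b_i/a_i = 0.25) is masked out under T_c = 0.5 but kept under T_c = 0.2.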
Finally, we downsample the mask M to match the size of the feature map output by the model.


Loss Calculation with Feature Map Mask
After downsampling the mask M to the size of the model output feature map, the mask M is added to the calculation of the model loss. Taking the simplified CenterNet model used in this paper as an example, the new localization loss L'_loc of the localization branch is:

L'_loc = -(1/N_1) Σ_ij M_ij · ℓ_ij

where ℓ_ij denotes the per-pixel term of the original modified focal loss, and N_1 is equal to the total number of targets in the image slice minus the number of targets with b_i/a_i < T_c. The new regression loss is calculated as:

L'_reg = (1/N_2) Σ_{(i,j)∈A} M_ij · W_ij · |Ŝ_ij − S_ij|_1

where N_2 is the number of pixels where H_ij > 0 and M_ij = 1. It can be seen from the new loss functions that if the i-th target in the image slice satisfies b_i/a_i < T_c, then the loss contribution of the pixels inside the i-th target bounding box in the image slice is equal to 0, which avoids introducing errors into the total loss.
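A minimal sketch of the masked regression loss, assuming NumPy arrays with the shapes defined above (the function name and the default weight map of ones are our assumptions):

```python
import numpy as np

def masked_l1_reg_loss(S_pred, S_gt, H_gt, M, W=None):
    """Regression loss with the feature-map mask applied.

    Sums W_ij * |S_pred_ij - S_gt_ij|_1 only over pixels with
    H_ij > 0 AND M_ij = 1, and normalises by
    N_2 = #{(i, j) : H_ij > 0, M_ij = 1}.
    S_pred, S_gt -- (4, h, w); H_gt, M -- (h, w).
    """
    if W is None:
        W = np.ones_like(H_gt)
    keep = (H_gt > 0) & (M == 1)      # pixels allowed to contribute
    n2 = max(int(keep.sum()), 1)      # avoid division by zero
    per_pixel = W * np.abs(S_pred - S_gt).sum(axis=0)  # L1 per pixel
    return float((per_pixel * keep).sum() / n2)
```

A masked-out pixel contributes nothing, however large its regression error, which is exactly the gradient-noise suppression described above.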

It should be noted that if the orientation angle of a ship target is vertical or horizontal and its bounding box crosses the cropping boundary, the bounding box of this target still has high accuracy after being adjusted by the random cropping algorithm. Although its contribution to the training loss will still be suppressed by the mask when this target box satisfies b_i/a_i < T_c, its impact on the model is not obvious, so we do not apply special treatment for this situation.

Data Preprocessing and Post-Processing
During the training phase, each training image is randomly cropped into C × C slices and randomly flipped horizontally. During the test phase, according to [15], a max-pooling layer with a kernel size of 3 × 3 is used to extract the peak points in the output feature map of the localization branch, and all peak points with a peak value greater than T_conf are considered positive targets. T_conf is set to 0.05 in our experiments. The output of the regression branch at the peak points is used to decode the bounding box positions of the targets according to Equation (2).
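The peak extraction step can be reproduced in pure NumPy as follows; this is a stand-in for the 3 × 3 max-pooling layer, and the function name is ours:

```python
import numpy as np

def extract_peaks(heat, t_conf=0.05):
    """Find peak points in the localization heatmap.

    A pixel is a peak if it equals the maximum of its 3x3
    neighbourhood (the max-pooling trick from [15]) and its value
    exceeds T_conf. Returns a list of (i, j, score).
    """
    h, w = heat.shape
    padded = np.pad(heat, 1, constant_values=-np.inf)
    # 3x3 neighbourhood maximum for every pixel
    neigh = np.max([padded[di:di + h, dj:dj + w]
                    for di in range(3) for dj in range(3)], axis=0)
    peaks = (heat == neigh) & (heat > t_conf)
    return [(i, j, float(heat[i, j]))
            for i, j in zip(*np.nonzero(peaks))]
```

A pixel adjacent to a stronger response is suppressed because it does not equal its neighbourhood maximum, which acts as a cheap non-maximum suppression.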

Optimizer Setting
All of the models are trained with the Stochastic Gradient Descent (SGD) algorithm on an Intel i9-9700k processor and an NVidia GTX1080Ti GPU. The mini-batch size is 4 in each iteration. A small batch size is selected here to increase the total number of iterations without adding too much training time; this increases the number of random crops drawn from each image and showed a good compromise between training time and model accuracy in our experiments. All models are trained for 150 epochs. A cosine annealing learning rate schedule is adopted with an initial learning rate of 0.001.
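The cosine annealing schedule amounts to the following rule (a standard formulation; the function name and the zero minimum learning rate are our assumptions):

```python
import math

def cosine_lr(step, total_steps, lr0=0.001, lr_min=0.0):
    """Cosine-annealed learning rate.

    Decays from lr0 to lr_min over total_steps following half a
    cosine period: full lr0 at step 0, lr_min at the final step.
    """
    return lr_min + 0.5 * (lr0 - lr_min) * (
        1 + math.cos(math.pi * step / total_steps))
```

With an initial rate of 0.001 over 150 epochs, the rate is 0.001 at epoch 0, 0.0005 at epoch 75, and decays to 0 at epoch 150.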

Experimental Data
In this paper, the HRSID [17] dataset is used to test the proposed method. HRSID is a recently published large-scale SAR ship detection dataset that contains multi-scale ships labeled with bounding boxes in various environments, including different scenes, sensor types, and polarization modes. It has more training and test samples than the classic SAR ship detection dataset SSDD [18], which helps researchers evaluate their methods more accurately. Some important parameters of HRSID are shown in Table 1. Figure 8 shows the distribution of the length and width of the target bounding boxes in HRSID and SSDD. It can be seen from the distribution that HRSID has a larger range of target scale variation, which poses a greater challenge to the robustness of the detector. More detailed information about HRSID can be found in [17].


Evaluation Criteria
In order to quantitatively evaluate the effectiveness of the proposed method, standard PASCAL VOC evaluation indicators [19] are used to compare the performance of different configurations.

For typical CNN-based detection models, a confidence threshold is used to filter out detection results with low confidence. The precision rate P_r of the model increases as the threshold increases, but the recall rate R_r of the model decreases. The recall rate R_r is the ratio of true positive targets (TP) among all ground truths, which is defined as:

R_r = TP / (TP + FN)

where FN denotes false negative targets. The precision rate P_r is the ratio of TPs among all detected targets. The definition is as follows:

P_r = TP / (TP + FP)

where FP denotes false positive targets. AP is the standard metric for target detection algorithms; it comprehensively considers the P_r and R_r of the model at different confidence levels and can be expressed as:

AP = ∫₀¹ P_r(R_r) dR_r

The AP of an ideal detector is equal to 1. Three AP indicators, mAP, AP50, and AP75, are used in this paper. The meanings of mAP, AP50, and AP75 are shown in Table 2.
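These definitions translate directly into code; the all-point AP integration below is a simplified stand-in for the full PASCAL VOC evaluator, and the function names are ours:

```python
def precision_recall(tp, fp, fn):
    """P_r and R_r from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Approximate area under the precision-recall curve.

    recalls must be sorted in ascending order; each precision value is
    applied over the recall interval it closes (all-point summation).
    """
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

For example, 8 true positives with 2 false positives and 2 false negatives give P_r = R_r = 0.8.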

Evaluation Results of the Proposed Method
In the random cropping process, the size C of the image slice is fixed to an integer multiple of the maximum downsampling factor of the model, in order to ensure that the feature map is always divisible during the downsampling operation. Models with T_c = 0.7 and C from 704 down to 416 were trained and tested in order to analyze the robustness of the proposed method under different random crop sizes. The test results of the different configurations are summarized in Table 3. Figures 9-11 show the comparison of model performance on the test set under different C; C = 800 means that the model is trained without random cropping. In order to analyze the impact of different T_c on model performance, models with different T_c under C = 512 were trained and compared. Table 4 gives a summary of the test results under different T_c. The visualization results under different metrics are given in Figures 12-14. In addition to the detection performance of the models under different T_c, we also counted, for one epoch under each T_c, the number of target bounding boxes participating in the loss calculation and the number of target bounding boxes not participating in the loss calculation (suppressed by the mask). The statistical results are shown in Figure 15.
Figure 11. AP75 of different models under different random crop sizes.

In addition to the anchor-free model proposed in this paper, we also verified the proposed method on the typical anchor-based model RetinaNet. Our modified RetinaNet uses feature maps from P2, P3, P4, P5, and P6 for prediction instead of the original P3-P7. The introduction of the P2 feature map greatly improves the detection accuracy of the model on HRSID, because the P2 feature map has a higher resolution, which is beneficial for detecting the large number of small ships in HRSID. Except for the model itself, other implementation details remain the same as in Section 2.4.2. Like ShipDet, the predicted value of each pixel in the output feature maps of RetinaNet at different scales is multiplied by the feature map mask of the corresponding scale after calculating the loss with the ground truth, which prevents the pixels inside target bounding boxes with large errors from participating in the final loss summation. The detailed loss calculation process of RetinaNet with the proposed method is shown in Figure 19. The experimental results are shown in Table 5.

Analysis of the Proposed Method under Different Metrics
At least three conclusions can be drawn from the comparison results in Figures 9-11. First, random cropping effectively improves the detection performance of the model under different metrics. Second, the proposed method significantly improves detection performance under different crop sizes, which not only proves its effectiveness but also shows that it is inappropriate to ignore the gradient noise introduced by the traditional random cropping algorithm. Third, compared with the results on AP50, eliminating the gradient error introduced by the random cropping algorithm achieves a more obvious and stable performance improvement on AP75, which indicates that the high-precision bounding box regression task is more sensitive to this gradient error. In addition, Table 5 shows that the proposed method is applicable not only to the anchor-free CNN model but also brings significant performance improvements to a typical anchor-based CNN model.

The Influence of Different Tc Values on the Performance of the Proposed Method
From the trends of the different metrics in Figures 12-14, it can be seen that model performance peaks around Tc = 0.7. In theory, a larger Tc means that the random cropping algorithm introduces less gradient noise: when Tc is large, targets at the edge of the image cannot participate in training even if their bounding box error is small, which is explained in Figures 6 and 7 and confirmed by Figure 15. However, the experimental results show that performance declines rapidly as Tc approaches 1, which indicates that target bounding boxes with low error at the edge of the image are still beneficial to the model's learning, and that simply removing the loss contribution of all targets located at the image boundary is inappropriate. Figures 16 and 17 show detection results for targets of different scales using models trained with the traditional random cropping method and with the proposed method. The bounding box regression accuracy of the model trained with the proposed method is better on many targets at the edge of the image, while the two methods differ little in the middle of the image. This may be because the gradient noise introduced by the traditional random cropping method mainly originates from the edges of the training image. In addition, the model trained with the traditional random cropping method is also prone to producing many strange false alarms at the edge of the image. This shows that the gradient noise introduced by the traditional random cropping method not only degrades the model's box regression ability but also hurts its ability to determine whether a target exists.
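One plausible reading of the Tc threshold, consistent with the discussion above, is that it bounds the visible-area ratio a cropped target must retain for its pixels to stay in the loss. The sketch below illustrates this reading; the function names and the exact criterion are our assumptions, not taken from the paper.

```python
def visible_fraction(box, crop):
    """Fraction of a ground-truth box's area that survives cropping.

    box, crop: (x1, y1, x2, y2) in the same coordinate frame.
    """
    ix1, iy1 = max(box[0], crop[0]), max(box[1], crop[1])
    ix2, iy2 = min(box[2], crop[2]), min(box[3], crop[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0

def keep_in_loss(box, crop, tc=0.7):
    """Keep a truncated target only if its visible fraction reaches Tc.

    A large tc (close to 1) drops nearly every edge target, which matches
    the observed performance decline as Tc approaches 1.
    """
    return visible_fraction(box, crop) >= tc
```

Under this reading, Tc = 0.7 keeps edge targets that are mostly visible (whose boxes carry little error) while still discarding heavily truncated ones, which is consistent with the peak observed in Figures 12-14.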

The Significance of the Proposed Method for the SAR Image Ship Detection Task
Effective data augmentation methods are essential for CNN-based models when training data is very limited. Considering that training data for the SAR image ship detection task is often scarcer than for optical image detection tasks, our method is of clear value for SAR image ship detection.

Applicability of the Proposed Method in Other Fields
The method proposed in this paper is dedicated to suppressing the gradient noise that arises when training a CNN-based SAR image ship detector with the traditional random cropping method. However, it is foreseeable that the method also applies to other detection scenarios in which most targets have extremely large aspect ratios and varying orientation angles. One of the most intuitive examples is ship detection in optical remote sensing images.

Conclusions
In this paper, the problem of gradient noise introduced by the traditional random cropping algorithm when training CNN-based SAR image ship detection models is pointed out for the first time; this noise has been shown to degrade detection performance, especially for high-precision bounding box regression. A simple and effective method is then proposed to suppress the gradient noise. The experimental results show that the proposed method effectively eliminates the gradient noise introduced by random cropping, thereby improving the model's detection performance without affecting its inference efficiency.