A New Ship Detection Algorithm in Optical Remote Sensing Images Based on Improved R 3 Det

Abstract: The task of ship target detection based on remote sensing images has attracted more and more attention because of its important value in civil and military fields. To solve the problem of low accuracy in ship target detection in optical remote sensing images caused by complex scenes and large differences in target scale, an improved R 3 Det algorithm is proposed in this paper. On the basis of R 3 Det, the feature pyramid network (FPN) structure is replaced by a search architecture-based feature pyramid network (NAS FPN) so that the network can adaptively learn and select feature combinations, updating and enriching the multiscale feature information. After the feature extraction network, a context information enhancement (COT) module is added to the shallow features to supplement the semantic information of small targets. An efficient channel attention (ECA) module is added to make the network focus on the target area. The improved algorithm is applied to the ship data in the remote sensing image data set FAIR1M. The effectiveness of the improved model in complex environments and for small target detection is verified through comparison experiments with R 3 Det and other models.

Author Contributions: Conceptualization, J.L. and Z.L.; methodology, J.L.; software, Z.L. and M.C.; validation, J.L., Z.L. and M.C.; formal analysis, Z.L. and M.C.; investigation, J.L. and Y.W.; resources, J.L. and Q.L.; data curation, Z.L. and M.C.; writing—original draft preparation, Z.L.; writing—review and editing, J.L. and Y.W.


Introduction
With the continuous development of satellite technology, the resolution and imaging quality of remote sensing optical images have also been greatly improved [1]. Compared with traditional synthetic aperture radar (SAR) imaging, optical imaging contains a great deal of color information, ship shape features, and texture structure features, which enables us to obtain more abundant sea surface information [2]. The recognition and monitoring of ship targets at sea based on optical remote sensing images have important application prospects in maritime traffic management, fishing, maritime search and rescue, border surveillance, and other civil and military fields [3].
At present, traditional methods based on segmentation and handcrafted features, as well as deep learning methods based on convolutional neural networks, are often used in ship target detection [4].
Many traditional detection methods detect ship targets through handcrafted feature extraction, such as methods based on gray-level features [5,6], template-based matching methods [7,8], methods based on shape features [9,10], and so on. Most of the traditional algorithms have achieved success in fixed-scene applications. However, when ships are in a complex environment, these methods may encounter bottlenecks. Moreover, the design of handcrafted features relies heavily on expert experience, which makes their generalization ability weak.
Compared with traditional feature extraction methods, a neural network has a deeper and more complex feature expression ability. After nonlinear transformation, the extracted feature semantic information is more abundant, and the robustness is stronger in the face of complex scenes.

It has been verified that R-DFPN has excellent performance on multiscale and high-density objects. However, there are still many false alarms and errors in model detection [12]. Chen et al. proposed an improved YOLOv3 (ImYOLOv3) based on an attention mechanism. They designed a new light attention module (DAM) to extract the identification features of ship targets. This method can accurately detect ships of different scales in different backgrounds in real time. However, this method has difficulty accurately expressing a ship based on a horizontal detection frame [13]. Zhang et al. first used a support vector machine (SVM) to divide an image into small regions of interest (ROIs) that may contain ships and then used an improved target detection framework, a fast region-based convolutional neural network (Faster RCNN), to detect the ROIs. The model was able to detect both large ships and small ships, but it also had the problem of inaccurate positioning [14].
The above ship detection algorithms have achieved good detection results, but there is still much room for improvement in dealing with complex scene interference, ship-scale differences, and other issues. Therefore, in order to improve the detection of ship targets in remote sensing images, this paper uses R 3 Det as the benchmark model and improves on the above problems.
This paper makes the following contributions:
1. Atmospheric correction of remote sensing images is performed by the dark channel prior method, and ship targets are detected and recognized based on the R 3 Det model.
2. Aiming at the problems of large-scale differences in ship targets in remote sensing images and easy interference from complex backgrounds, an improved method based on NAS FPN and the channel attention ECA is proposed.
3. Deformable convolution and dilated convolution are introduced to enrich the context information of small ship targets.

FAIR1M Dataset
FAIR1M is the world's largest satellite optical remote sensing image target recognition data set released by the China Aerospace Research Institute [15]. Its content includes image target annotation of various surfaces. Based on the needs of the subject, this paper selects the ship data. As shown in Figure 1 below, 9 types of ship targets are marked in the data set.

Atmospheric Correction
Satellite-based remote sensing imaging involves long-distance atmospheric transmission, and the radiance received by a satellite is attenuated by atmospheric absorption. In addition, particles in the atmosphere reflect light into the imaging path, reducing the contrast of the remote sensing image and producing a visible haze layer in the image [16]. In this paper, dark channel prior theory [17] is used for atmospheric correction of remote sensing images. The algorithm flow of atmospheric correction is shown in Figure 2.
The images before and after atmospheric correction are shown in Figure 3. It can be seen from the figure that the ship target and background in the corrected image are clearer, which is conducive to improving the detection accuracy of ships.
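As an illustration of the dark channel prior correction described above, the following is a minimal numpy sketch. The function names, the 15-pixel patch, and the ω = 0.95 and t0 = 0.1 values are typical choices from the dark channel prior literature [17], not necessarily the authors' exact settings.

```python
import numpy as np

def dark_channel(img, patch=15):
    """Per-pixel minimum over RGB, then a min-filter over a patch x patch window."""
    h, w, _ = img.shape
    mins = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(mins, pad, mode="edge")
    dark = np.empty_like(mins)
    for i in range(h):
        for j in range(w):
            dark[i, j] = padded[i:i + patch, j:j + patch].min()
    return dark

def dehaze(img, patch=15, omega=0.95, t0=0.1):
    """Dark channel prior dehazing; img is a float RGB image in [0, 1]."""
    dark = dark_channel(img, patch)
    # Atmospheric light A: mean color of the brightest 0.1% dark-channel pixels.
    n = max(1, int(dark.size * 0.001))
    idx = np.unravel_index(np.argsort(dark, axis=None)[-n:], dark.shape)
    A = img[idx].mean(axis=0)
    # Transmission estimate t(x) = 1 - omega * dark_channel(I / A).
    t = 1.0 - omega * dark_channel(img / A, patch)
    t = np.clip(t, t0, 1.0)[..., None]
    # Recover the scene radiance J = (I - A) / t + A.
    return np.clip((img - A) / t + A, 0.0, 1.0)
```

The lower bound t0 on the transmission prevents division blow-up in haze-free regions, which is what keeps the corrected image from over-amplifying noise.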

Training Set and Test Set
The size of the remote sensing images in the data set varies and is generally large, but due to the limitations of computing hardware, current object detection networks generally allow only input images of small size [18]. Van Etten [19] demonstrated that directly scaling remote sensing images to the sizes allowed by a network would lose many image details. Therefore, this section cuts the training images into 800 × 800 tiles with an overlap of 100 pixels, generating 6622 training images and 1126 testing images in total, as summarized in Table 1.
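The tiling procedure above can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and the border handling (shifting the last tile back so it still reaches the image edge) is our assumption.

```python
import numpy as np

def tile_positions(size, tile, overlap):
    """Start offsets covering `size` with `tile`-sized windows and given overlap."""
    stride = tile - overlap
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] + tile < size:          # make sure the last tile reaches the edge
        starts.append(size - tile)
    return starts

def cut_tiles(image, tile=800, overlap=100):
    """Cut an HxWxC image into overlapping tile x tile crops."""
    h, w = image.shape[:2]
    return [
        (y, x, image[y:y + tile, x:x + tile])
        for y in tile_positions(h, tile, overlap)
        for x in tile_positions(w, tile, overlap)
    ]
```

For example, a 1500 × 1500 image yields a 2 × 2 grid of 800 × 800 tiles with a 100-pixel overlap between neighbors.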

Methods
In this paper, a single-stage rotating target detector, R 3 Det [20], is selected to detect ship targets in remote sensing images. R 3 Det is an improvement based on the RetinaNet algorithm [21], adding an FRM (feature refinement module) and designing a loss function based on the approximate skewed intersection over union (SkewIOU) to enhance the detection of rotated targets. The structure of R 3 Det is shown in Figure 4.

The R 3 Det algorithm takes ResNet50 as the feature extraction network and performs multiscale feature fusion through an FPN (feature pyramid network) to enhance the detection of multiscale targets. The predictor is divided into two convolution networks with shared weights to realize category regression and prediction frame parameter regression, respectively, and the FRM is designed. Its structure is shown in Figure 5; it solves the problem of feature misalignment in rotated box regression.
R 3 Det combines the advantages of the high recall of horizontal anchors and the dense adaptability of rotating anchors. In the first stage, horizontal anchors are generated to improve detection accuracy. In the refinement stage, the bounding boxes are filtered, and only the bounding box with the highest score at each feature point is retained to improve detection speed.
In the FRM, the five coordinates (center and four vertices) of the feature point bounding box are bilinearly interpolated to obtain the corresponding position information, and the entire feature map is reconstructed pixel by pixel to achieve alignment between the rotation box and the target feature.
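The five-point bilinear sampling of the FRM can be illustrated with a small sketch. This is not the authors' implementation; the single-channel feature map and the rotated-box parameterization (center, width, height, angle) are simplifying assumptions.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at fractional position (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    y0, x0 = max(y0, 0), max(x0, 0)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bot = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bot

def sample_box_points(feat, cx, cy, w, h, angle):
    """Sample the center and four corners of a rotated box (FRM-style 5 points)."""
    cos, sin = np.cos(angle), np.sin(angle)
    pts = [(0.0, 0.0), (-w / 2, -h / 2), (w / 2, -h / 2),
           (w / 2, h / 2), (-w / 2, h / 2)]
    vals = []
    for px, py in pts:
        gx = cx + px * cos - py * sin    # rotate the corner offset, then translate
        gy = cy + px * sin + py * cos
        vals.append(bilinear_sample(feat, gy, gx))
    return np.array(vals)
```

Sampling at fractional positions is what lets the refined rotated box read feature values that are aligned with it rather than with the original anchor grid.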
Due to the large-scale differences among ship targets, the artificially designed top-down feature fusion path in FPN has difficulty accurately expressing multiscale features. In this paper, the feature pyramid structure of FPN in R 3 Det is replaced by the feature search fusion network structure, NAS FPN. The network updates and combines multiscale features through reinforcement learning to enrich multiscale feature information. Considering the background interference caused by the wide imaging range of remote sensing images, adding an attention mechanism can improve the significance of target features and the positioning accuracy of ship targets. This paper adds a lightweight channel attention module, ECA, to make the model focus on the target region. In order to solve the detection difficulty caused by the few available features of small ship targets, a context information enhancement module, COT, based on deformable convolution and dilated convolution, is designed to enrich the features of small ship targets by using the context information around small targets. In addition, the size of the anchors in the original R 3 Det algorithm is not suitable for ship targets with large aspect ratios; this paper modifies the anchor sizes and uses k-means cluster analysis to design the prior frame aspect ratios to improve the detection and positioning effect. The problems and improved methods of ship detection based on R 3 Det are shown in Figure 6.

The improved model structure is shown in Figure 7. The ECA modules are added to the input feature layers of FPN, and COT is added to the shallow feature layer.


Search Architecture-Based Feature Pyramid Network (NAS FPN)
The network architecture search algorithm (NAS) is a popular algorithm in the field of deep learning that can adaptively learn and modify the neural network structure based on the characteristics of the data [22]. Wang et al. improved the FPN structure by using the NAS algorithm and designed a frame-adaptive search-based FPN model [23]. Whereas FPN simply adopts a single path from the top features to the bottom features, NAS FPN selects an appropriate feature fusion path through reinforcement learning. The fusion process is shown in Figure 8. NAS FPN is composed of an RNN controller and a fusion module. First, the feature maps extracted from the backbone network are placed into the candidate feature pool. The RNN controller directs the fusion module to select two feature maps from the candidate pool as inputs, choose an output size and a fusion operation, and either return the fused feature to the candidate pool or emit it as a pyramid level, repeating until every level of the feature pyramid has been output. This replaces the manually designed fusion path and realizes multiscale feature cross fusion. The RNN controller is trained with reinforcement learning, using the AP value achieved by the detection model as the reward for its parameter updates.
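The full NAS FPN controller is a trained RNN and is beyond a short sketch, but a single merging step of the fusion module can be illustrated as follows. Nearest-neighbor resizing and sum fusion are among the merge choices used in NAS FPN; the function names and the fixed "plan" below are hypothetical stand-ins for a searched architecture.

```python
import numpy as np

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbor resize of a CxHxW feature map."""
    c, h, w = feat.shape
    ys = (np.arange(out_h) * h / out_h).astype(int)
    xs = (np.arange(out_w) * w / out_w).astype(int)
    return feat[:, ys][:, :, xs]

def sum_merge(a, b, out_hw):
    """One NAS-FPN-style binary fusion: resize both inputs, then add."""
    return resize_nearest(a, *out_hw) + resize_nearest(b, *out_hw)

def fuse(pool, plan):
    """Apply a (pretend-searched) fusion plan: each step picks two pool
    entries, merges them at a target size, and appends the result."""
    for i, j, out_hw in plan:
        pool.append(sum_merge(pool[i], pool[j], out_hw))
    return pool
```

In the real model the plan (which entries to pick, the output size, and the merge operation) is what the RNN controller searches over, rather than being fixed in advance.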

Channel Attention Module (ECA)
In recent years, attention mechanism models inspired by human visual attention have been developing continuously. Implementing attention techniques in neural networks helps the network focus on the essential parts of a problem, maximizing accuracy and efficiency [24]. The large-scale background interference in remote sensing images brings great challenges to target detection [25]. Adding an attention model can improve the robustness of the model against interference and the positioning accuracy of ship targets.
In this paper, a lightweight channel attention module, ECA [26], which avoids dimensionality reduction, is added to the model. The structure of the ECA module is shown in Figure 9. The input feature map passes through global average pooling, and the size k of the convolution kernel is determined adaptively to carry out a one-dimensional convolution, realizing cross-channel information interaction. After the sigmoid activation function, the weight of each channel is obtained. Finally, the weights are multiplied by the original feature map to improve the saliency of the target features.

There is a mapping relationship between the size k of the one-dimensional convolution kernel and the number of input channels C: C = φ(k) = 2^(γ·k − b). Therefore, given the channel dimension C, the convolution kernel size can be obtained adaptively as k = ψ(C) = |log2(C)/γ + b/γ|odd, where |t|odd denotes the odd number nearest to t, and γ and b are set to 2 and 1, respectively, following the ECA design [26].
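A minimal numpy sketch of the ECA operation described above follows. The averaging kernel stands in for the learned 1-D convolution weights, and γ = 2, b = 1 follow the ECA paper [26]; everything else is illustrative.

```python
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1-D kernel size k = |log2(C)/gamma + b/gamma|, rounded to odd."""
    t = int(abs((np.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

def eca(feat, gamma=2, b=1):
    """ECA: global average pool -> 1-D conv across channels -> sigmoid -> rescale.

    feat is a CxHxW feature map.
    """
    c = feat.shape[0]
    k = eca_kernel_size(c, gamma, b)
    pooled = feat.reshape(c, -1).mean(axis=1)        # global average pooling
    pad = k // 2
    padded = np.pad(pooled, pad, mode="constant")    # zero padding, Conv1d-style
    kernel = np.full(k, 1.0 / k)                     # stand-in for learned weights
    conv = np.array([padded[i:i + k] @ kernel for i in range(c)])
    weights = 1.0 / (1.0 + np.exp(-conv))            # sigmoid channel weights
    return feat * weights[:, None, None]
```

For instance, C = 256 gives k = 5 and C = 64 gives k = 3, so wider feature maps get a slightly larger cross-channel interaction window at negligible parameter cost.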

Context Information Enhancement Module (COT)
The difficult problem of detecting small targets has always been an important task to be solved in the task of target recognition. Usually, the pixels of small targets in an image are low, and the features available for mining are limited, which makes the model insensitive to small targets and difficult to locate accurately [27]. To solve this problem, expanding the feature receptive field and taking the background information of small targets as a supplement can effectively improve the positioning accuracy of the model for small targets [28].
Inspired by the AC-FPN [29] structure proposed by Cao et al., this paper introduces deformable convolution and dilated convolution and designs the COT context information enhancement module. Its structure is shown in Figure 10. At the same time, considering the feature volatility of small targets in the deep network and the model complexity [30], the COT module is added only to the shallow feature map extracted by the ResNet50 network to supplement and enhance the context information of small targets.

Compared with ordinary convolution, deformable convolution adds an adaptively learned offset [31] to the receptive field so that the receptive field is no longer a simple rectangle but changes with the shape of the target object, adapting to the geometric deformation of various objects, as shown in Figure 11b. Considering that a ship target has obvious length and width differences and is distributed at arbitrary angles in a remote sensing image [32], the rectangular receptive field of ordinary convolution introduces too much background interference, while deformable convolution concentrates the receptive field around the ship target to improve feature significance.
Dilated convolution [33] is based on ordinary convolution: a dilation rate is set, and the sampling points of the original convolution are spread outward, which enlarges the receptive field without changing the convolution kernel size [34]. Figure 11c shows the receptive field of a 3 × 3 dilated convolution with a dilation rate of 1. Dilated convolution is used to supplement the context information of the ship target to enrich its semantic features [35], such as near-shore information, sea surface information, the position information of ships berthed side by side, and other hidden information associated with the ship target, thereby improving detection accuracy for small ship targets.
In the COT module shown in Figure 10, the features first pass through deformable convolution to enhance the salience of the ship target features, and then the compact context information around the target is extracted through dilated convolution, avoiding the additional background interference that holes in the receptive field would otherwise introduce.
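The receptive-field growth of dilated convolution can be made concrete with a short sketch: a k × k kernel with dilation rate d covers an effective window of k + (k − 1)(d − 1) pixels. The minimal single-channel implementation below is illustrative, not the model's actual convolution layers.

```python
import numpy as np

def effective_kernel(k, dilation):
    """Effective receptive field of a k x k conv with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

def dilated_conv2d(img, kernel, dilation=1):
    """'Valid' 2-D convolution with dilated sampling points (single channel)."""
    kh, kw = kernel.shape
    eh = effective_kernel(kh, dilation)
    ew = effective_kernel(kw, dilation)
    h, w = img.shape
    out = np.zeros((h - eh + 1, w - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input every `dilation` pixels inside the effective window.
            patch = img[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = (patch * kernel).sum()
    return out
```

A 3 × 3 kernel with dilation 2 thus sees a 5 × 5 window while keeping only 9 weights, which is exactly why dilated convolution can gather surrounding context for small ships without adding parameters.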

Anchor Improvement
In view of the inaccurate target positioning described above, this section modifies the scale of the anchors. The R 3 Det model performs regression prediction of the rotated target based on horizontal anchors. The sizes of the anchors are manually set and need to be adapted to the actual detection task. The base anchors for the multiscale feature outputs of the FPN in the original model are (32, 64, 128, 256, 512). However, the size of 512 × 512 is not suitable for ship targets in remote sensing images. Considering that the size of small ship targets is generally below 20 × 20, this paper modifies the sizes of the basic anchors to (16, 32, 64, 128, 256). At the same time, the aspect ratios of the original anchors (1:1, 2:1, 1:2) are not well suited to ship targets. In this paper, the k-means method is used to perform cluster analysis on the length-width ratios of the data to obtain an optimized set of aspect ratios.
The process of the k-means clustering method is shown in Figure 12. First, K clustering centers are set, and their values are randomly selected from the data; the distance between each sample and each of the K clustering centers is then calculated. In this paper, the intersection over union (IoU) of the bounding boxes is used as the distance measure, and all boxes are divided into K groups according to this distance. The average value of each group replaces the original clustering center, and the cycle is iterated until the values of the cluster centers no longer change.
The results after clustering are shown in Table 2. Three anchors with different aspect ratios are used in the original model, so K is set to 3. Since the results of 3 cluster centers are similar, this paper uses the analysis results of 5 cluster centers to set the aspect ratio as (0.63, 1, 2.49).
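The clustering procedure above can be sketched as follows, using a Darknet-style d = 1 − IoU distance over corner-aligned box shapes; the deterministic first-k initialization is a simplification for illustration (the text uses random initialization).

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (N, 2) box shapes and (K, 2) cluster shapes, corner-aligned."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
             * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centers[:, 0] * centers[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_anchors(boxes, k=3, iters=100):
    """k-means on box (w, h) pairs with distance d = 1 - IoU."""
    centers = boxes[:k].astype(float).copy()     # simple first-k initialization
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centers), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    # Sort by aspect ratio w/h; ratios = centers[:, 0] / centers[:, 1].
    return centers[np.argsort(centers[:, 0] / centers[:, 1])]
```

The aspect ratios of the returned cluster centers play the role of the prior-frame ratios; on the paper's data this analysis yields (0.63, 1, 2.49).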

Training Process
In this paper, the R 3 Det model is trained on a Windows system. The software environment is PyTorch 1.3 and Python 3.7, and the hardware environment is an Intel(R) Core(TM) i7-8750H CPU, an NVIDIA GTX 1070 (8 GB) GPU, and 16 GB of memory. The initial learning rate is set to 0.004, the IoU threshold is set to 0.4, the number of training images is 6621, and the number of testing images is 1126. Random flipping is applied to the images when they are input to the model.

Evaluation Index
In this paper, average precision (AP) is used as the performance evaluation index of the ship detection model. TP is the number of targets correctly classified, FP is the number of background regions recognized as targets, and FN is the number of targets recognized as background. The precision p = TP / (TP + FP) represents the proportion of correct targets among all detection results. The recall r = TP / (TP + FN) indicates the proportion of detected correct targets among all ground-truth targets. The area enclosed by the curve with p as the vertical axis and r as the horizontal axis and the coordinate axes is the AP value. AP measures the detection accuracy for a single class of targets; the closer the AP value is to 1, the higher the detection accuracy. mAP is the mean of the AP values over all classes in multiclass object detection.
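The AP computation described above can be sketched as follows; this is a minimal all-point-interpolation implementation, with the function name and input format as illustrative assumptions (matching detections to ground truth by IoU is assumed to have happened already).

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP: area under the precision-recall curve (all-point interpolation).

    scores: confidence of each detection; is_tp: 1 if the detection matched a
    ground truth (e.g. IoU above threshold), else 0; num_gt: total ground truths.
    """
    order = np.argsort(-np.asarray(scores))        # rank detections by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / num_gt
    # Make precision monotonically non-increasing, then integrate over recall.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

mAP is then just the mean of this quantity over all object classes.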

Improvement Effect
The R 3 Det model can detect ship targets in remote sensing images accurately. However, the detection effect of the model for small ship targets is poor, there are many missed detections when the small ships are densely arranged, and the positioning of the ships is inaccurate. Based on the above improvements, the two types of problems have been effectively improved, as shown in Figures 13 and 14.
Ablation experiments were carried out for each improvement. The recognition effect on some ship targets that are difficult to detect is shown in Table 3. A comparison of experiments one and two shows that NAS FPN has advantages over FPN as a multiscale fusion network in the expression of multiscale target features, because NAS FPN searches for an optimized feature fusion path through reinforcement learning, and the resulting feature expression is richer, which makes the model perform better. From experiments two and three, it can be observed that the improvement of the anchors and the addition of the attention mechanism significantly improve the detection performance of the model: the attention mechanism filters background interference to obtain more significant ship characteristics, and the recognition accuracy and sensitivity of the model for small ship targets are greatly improved. Comparing experiments four and five shows that the introduced context information enhancement structure enriches the feature information of small target ships and enhances the sensitivity of the model to target ships.

As shown in Table 3, the addition of each module can effectively improve the recognition rate. In order to more accurately illustrate the effect of the model improvement, experiments four and five are taken as examples (their results are close), and each model is trained three times. The average value is taken as the final result, and the mean and standard deviation of the detection accuracy are calculated. The results are shown in Tables 4 and 5.

Figure 13. Display of test results before and after model improvement: small target missed detection improvement.

Figure 14. Display of test results before and after model improvement: inaccurate-positioning improvement.
The results of the above experiments show that the addition of each module effectively improves the detection accuracy of the model.
Due to the large differences between the scales of different ships, the multiscale feature information of the model before the improvement is limited. Figure 15 shows the detection results of the model before and after replacing FPN with NAS FPN. It can be observed that the improved model is more sensitive to ships of various scales and expresses the feature information more deeply.
The large-scale background in remote sensing images strongly interferes with the detection task. Figure 16 shows the detection results before and after adding ECA. Adding the channel attention mechanism ECA to the model preserves the effective feature information of the target region, effectively overcoming the interference caused by the complex land background in the port environment and thus improving feature saliency.
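The channel reweighting performed by ECA can be sketched as a forward pass in NumPy; the uniform 1D convolution weights below stand in for the learned weights of the real module and are an illustrative assumption:

```python
import numpy as np

def eca(feature_map, kernel_size=3):
    """Efficient Channel Attention (ECA) forward pass, sketched in NumPy.

    feature_map: array of shape (C, H, W)
    kernel_size: size of the 1D convolution across channels (odd)
    """
    # 1) Global average pooling: one descriptor per channel
    y = feature_map.mean(axis=(1, 2))                    # shape (C,)
    # 2) 1D convolution across channels (uniform weights for illustration;
    #    in the real module they are learned)
    weights = np.ones(kernel_size) / kernel_size
    pad = kernel_size // 2
    y_padded = np.pad(y, pad, mode="edge")
    conv = np.convolve(y_padded, weights, mode="valid")  # shape (C,)
    # 3) Sigmoid gate, then reweight each channel of the input
    gate = 1.0 / (1.0 + np.exp(-conv))
    return feature_map * gate[:, None, None]

x = np.random.rand(8, 16, 16)
out = eca(x)
print(out.shape)  # (8, 16, 16)
```

Because the gate lies in (0, 1), channels whose global statistics matter less are suppressed, which is how the module down-weights background-dominated channels.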
Comparing the detection results before and after adding COT in Figure 17 shows that this structure improves detection performance by mining the hidden context correlations among densely arranged ship targets. The deformable convolution effectively handles direction rotation, and the dilated convolution enriches the feature information, thus enhancing the saliency of the shallow features.

Influence of Deformable Convolution Size in COT
To verify the effect of the deformable convolution size on detection accuracy, kernel sizes of 3 × 3 and 5 × 5 were tested in COT. The experimental results are shown in Table 6: changing the kernel size from 3 × 3 to 5 × 5 increases the model parameters without improving detection accuracy, so the 3 × 3 deformable convolution is used in the improved model. Table 7 shows the AP values of the various targets detected by the improved model; it can be observed that the improved model detects the various ship targets more accurately. Figure 18 shows the detection results of the improved model in various scenarios. Even in the complex nearshore scene, the model overcomes the large-scale background interference and achieves good detection results.
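The parameter growth from 3 × 3 to 5 × 5 can be estimated with a rough count, assuming the common deformable-convolution layout in which a parallel k × k convolution predicts two offsets (dx, dy) per kernel sample; the channel width of 256 is an illustrative assumption:

```python
def deform_conv_params(c_in, c_out, k):
    """Rough parameter count for one deformable convolution layer.

    Assumes the common layout: a k x k main convolution plus a parallel
    k x k convolution predicting 2 offsets (dx, dy) per kernel sample.
    Biases are omitted for simplicity.
    """
    main = c_in * c_out * k * k          # main convolution weights
    offsets = c_in * (2 * k * k) * k * k  # offset-prediction branch
    return main + offsets

p3 = deform_conv_params(256, 256, 3)
p5 = deform_conv_params(256, 256, 5)
print(f"3x3: {p3:,} params, 5x5: {p5:,} params ({p5 / p3:.1f}x)")
```

Under these assumptions the 5 × 5 layer carries roughly three times the parameters of the 3 × 3 layer, consistent with the observation that the larger kernel costs capacity without a matching accuracy gain.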

Comparison of Other Target Detection Methods
To further verify the effect of the improved model, commonly used remote sensing target detection models were selected and tested on the training and test sets constructed in this paper. The AP and mAP values of the different models are shown in Table 8. Compared with the other models, the improved R 3 Det model achieves higher ship detection accuracy.
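The AP values compared in Table 8 are typically computed as the area under a precision-recall curve; a minimal sketch of the all-point interpolated AP (PASCAL-VOC 2010+ style), with a toy precision-recall input:

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP (PASCAL VOC 2010+ style).

    recall, precision: arrays for detections sorted by descending confidence.
    """
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the area of the stepwise curve where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy example: 4 detections, 2 of which are true positives
rec = np.array([0.5, 0.5, 1.0, 1.0])
prec = np.array([1.0, 0.5, 0.667, 0.5])
print(average_precision(rec, prec))
```

The mAP reported for each model is then simply the mean of these per-class AP values.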

Discussion
Firstly, according to the differences between ship scales, a reinforcement learning method was used to optimize the fusion of multistage features so that targets of different scales are detected better. Secondly, starting from the channel dimension, a channel weighting mechanism was introduced to self-learn the importance of each channel's semantic representation, improving the saliency of effective features and the model's ability to distinguish ships from the background. In addition, to address the difficulty of positioning small ships, the context correlation between a target and its surrounding objects was explored in the spatial dimension: the deep semantic saliency and spatial information representation were enhanced, optimizing the model's detection of small ships. The ablation experiments demonstrate the effectiveness of these methods for ship detection.
There is still a certain gap between the ship target detection model constructed in this paper and actual application scenarios. The remote sensing images acquired by satellites usually contain large areas of sea surface and land background; detecting ship targets directly in such images is inefficient and places high demands on data storage and transmission [40]. Therefore, to meet the needs of on-orbit engineering practice for ship target detection, a further optimization direction is to apply sea-land separation preprocessing to large-scale remote sensing images.

Conclusions
In this paper, an improved R 3 Det model based on attention and context information enhancement is proposed, which can handle different complex scenes and detect multiscale ship targets. Specifically, the attention module ECA was used to enhance the target features and reduce interference, and NAS FPN was used to strengthen multiscale target detection. Finally, to improve the detection accuracy of small target ships, COT was designed: under the combined effect of deformable convolution and dilated convolution, the context information around small targets is enhanced. The effectiveness of the improved model in complex environments and for small target detection was verified through comparison experiments with R 3 Det and other models. Future work will further improve detection speed and accuracy by applying sea-land separation preprocessing to large-scale remote sensing images.