Domain Adaptive Ship Detection in Optical Remote Sensing Images

Abstract: With the successful application of the convolutional neural network (CNN), significant progress has been made by CNN-based ship detection methods. However, they often face considerable difficulties when applied to a new domain where the imaging conditions change significantly. Although training with the two domains together can solve this problem to some extent, the large domain shift leads to sub-optimal feature representations and thus weakens the generalization ability on both domains. In this paper, a domain adaptive ship detection method is proposed to better detect ships between different domains. Specifically, the proposed method minimizes the domain discrepancies via both image-level adaption and instance-level adaption. In image-level adaption, we use multiple receptive field integration and channel domain attention to enhance the feature's resistance to scale and environmental changes, respectively. Moreover, a novel boundary regression module is proposed in instance-level adaption to correct the localization deviation of the ship proposals caused by the domain shift. Compared with conventional regression approaches, the proposed boundary regression module is able to make more accurate predictions via the effective extreme point features. The two adaption components are implemented by learning the corresponding domain classifiers in an adversarial training way, thereby obtaining a robust model suitable for both domains. Experiments on both supervised and unsupervised domain adaption scenarios are conducted to verify the effectiveness of the proposed method.

Following the pipeline of general object detection algorithms, these methods can be divided into two categories: region-based methods and regression-based methods. The region-based methods first generate a set of region proposals based on predefined anchor boxes via the region proposal network (RPN) [5]. Then, those candidate regions are refined to obtain the final detection results [11,12,14-16]. In contrast, regression-based methods directly regress the bounding box of each object from different locations [13,17,18]. In general, the region-based methods have higher detection accuracy, while regression-based methods run faster. With the help of the powerful feature extraction ability of CNN, these methods achieve promising improvements compared with their traditional counterparts, which rely on manually designed features for the detection.
Despite the effectiveness of these CNN-based ship detection methods, they are only trained and tested on a single domain (dataset). In practical applications, the changing environments and weather conditions often lead to a large domain shift between different domains. Such domain shift commonly exists in object detection tasks and has been observed to cause a significant performance drop when applying the trained model to a visually different new domain [19]. Therefore, to improve the detection performance on both domains, a natural idea is to train on the two domains together to cover richer situations. However, without dedicated treatments, the model will tend to fit the two domains separately with sub-optimal feature representations, which leads to inferior detection results.
To solve this problem, we propose a novel domain adaptive ship detection method based on the region-based detection pipeline. The basic idea is to minimize the distance of the feature distribution between the two domains at both the image level and the instance level. In general, this distance is optimized by maximizing the error rate of the domain classifier that predicts the domain label of the data [20][21][22]. Moreover, conventional feature extraction networks are designed for typical single-domain object detection. These networks are less capable of dealing with the large domain shift between different domains in remote sensing images. Therefore, some innovative treatments are also proposed for better feature representation between different domains.
In image-level adaption, several improvements are proposed to overcome the shortcomings of conventional feature extraction networks for the overall image feature extraction. Specifically, we reasonably integrate the feature maps with different receptive fields via the attention-based feature fusion structure. In this way, the integrated feature map is able to better perceive ships of different sizes on each pixel location. Then, the feature map is optimized in the channel direction for more suitable cross-domain feature representations based on the squeeze-and-excitation (SE) [23] mechanism, which emphasizes important features by modelling the interdependencies between different channels of the feature map. Finally, the optimized feature maps are fed into the domain classifier to obtain aligned feature representations between the two domains.
In addition to the overall image-level features, the domain shift also affects instance-level feature expression, which leads to deviations in object localization. Therefore, to obtain more accurate detection results, we propose the boundary regression module to refine the region proposals generated by RPN. The proposed module utilizes extreme point features closest to the boundary of the proposal to predict the offset of each corresponding boundary. Compared with the traditional bounding box regression approach, the boundary regression module discards redundant features useless for object localization, making the regression process more robust and accurate. Similar to image-level adaption, the domain classifier is also used to align the refined region of interest (ROI) features. Moreover, the proposed module is also suitable for the unsupervised domain adaption scenario, where the bounding box annotations are only available in one of the two domains.
The rest of this paper is organized as follows. In Section 2, we introduce the related work. Section 3 describes the details of the proposed method. In Section 4, detailed experimental analysis and comparisons are given to verify the superiority of our method. Section 5 concludes this paper.

General Object Detection
In recent years, convolutional neural networks (CNN) have shown powerful abilities in various visual tasks such as image classification [24][25][26], semantic segmentation [27][28][29] and object detection. In general, current CNN-based object detection methods can be divided into two categories. The first category first generates a series of region proposals via the region proposal network (RPN) [5] or selective search [1]. Then, on each proposal, region classification and bounding box regression are performed to obtain the final detection results [1-7,9,10]. Among all these methods, Faster R-CNN [5] is the most representative one, which integrates all these steps into a unified network for the first time.
The second category directly obtains the object bounding boxes from predefined default boxes via regression, skipping the process of proposal generation. For example, the single-shot multibox detector (SSD) [3] predicts bounding box offsets and probability scores for the densely distributed default boxes based on the feature maps of different scales. In pursuit of faster running speed, you only look once (YOLO) [2] divides the image into several grids. Then, two default boxes are defined at the center location of each grid to perform region classification and regression. Although these one-step methods usually have lower detection performance than the proposal-based ones, YOLOv3 [9] (the upgraded version of [6]) still achieves impressive performance with the help of multiple improvements.

Ship Detection in Optical Remote Sensing Images
Ship detection is a hot research topic in the field of remote sensing. Manually designed features (such as geometric elements) are widely used in early ship detection methods to identify ships from the background. For instance, Lei et al. [30] and Lin et al. [31] perform onshore ship detection via the contour feature and line segment. A hierarchical complete and operational ship detection method was proposed in [32] based on shape and texture features. Since these basic geometric features are not robust enough under complex background interference, more prominent features of the ship head are used in some other methods for preliminary localization. In [33], the regions of potential ship heads are first predicted by transforming local pixels into the polar coordinate system, based on which the saliency of directional gradient information is then employed to identify the ship body. The ship heads are detected in [34] by corner features, and then shape analysis and region growth are used to determine the complete ship region. The method in [35] first determines the potential ship regions by saliency segmentation, and then the structure-LBP features are used to identify the real ships. However, because these manually designed features can only utilize low-level information with poor generalization ability, these methods often suffer from the influence of complex backgrounds, resulting in either false or missed detections.
In recent years, based on the successful CNN-based object detection methods, many similar ship detection algorithms in optical remote sensing images are proposed and achieve good results [11][12][13][14][15][16][17][18]. Li et al. [11] detect ships of various scales via the multiscale feature mapping. Nie et al. [12] improve the post-processing process of the Mask R-CNN [10] algorithm and use it for ship detection. Zhang et al. [15] rotate the ship candidate regions to achieve arbitrarily oriented ship detection. Li et al. [16] propose a novel dual-branch regression network to more effectively predict the ship orientation and other variables independently. For the regression-based methods, Liu et al. [13] pass through shallow layer feature maps to deeper ones to utilize fine-grained features for small ship detection. Based on the advanced YOLOv3, Chen et al. [18] propose a lightweight dilated attention module to achieve a trade-off between detection accuracy and speed. However, these methods are all trained and tested on a single dataset with simple scenarios, which have limited generalization abilities in practical applications.

Domain Adaption
The purpose of domain adaption is to find a domain-invariant feature representation to achieve information transfer. Various methods have been proposed to solve this problem in computer vision tasks such as image classification and semantic segmentation [36][37][38][39]. Recently, inspired by the powerful generative adversarial network (GAN) [40], adversarial learning is widely used to align features between different domains. For example, Ganin et al. [37] effectively improve the performance of the domain adaptive classification task via the domain classifier and a gradient reversal layer. Tzeng et al. [38] propose an adversarial discriminative domain adaption model, which combines the discriminative model, untied weight sharing, and GAN loss for classification. With the help of adversarial learning, Tsai et al. [39] achieve superior performance in domain adaptive semantic segmentation.
Recently, some domain adaptive object detection methods have been proposed to solve the problem of cross-domain object detection. Chen et al. [20] optimize the detection model together with the domain classifier by adversarial training to minimize the distance between the features of the two domains. Following the same principle, Saito et al. [21] focus on aligning low-level image features during training, but weaken the alignment of high-level features. To achieve multi-scale feature alignment, Xie et al. [22] attach multiple domain classifiers to the middle layers of the feature extraction network for training.

The Proposed Method
Following the region-based detection pipeline, the overall framework of our method is shown in Figure 1. To achieve the domain adaption at both the image level and the instance level, some dedicated and innovative structures are used to improve the generalization ability of the detection model. In the following, we will describe them as well as other relevant information of the proposed method in detail.

[Figure 1. Overall framework of the proposed method. Diagram labels recovered from the figure: ROI feature, boundary offsets.]

Image-Level Adaption
The image-level adaption focuses on coping with the overall image differences through feature alignment. Existing ship detection methods adopt conventional backbone networks for feature extraction. However, the structures of these networks are all designed for a single domain and are less capable of dealing with the large domain shift. Therefore, we adopt two successive feature optimization modules to adapt to the domain shift in both the width (the receptive field) and depth (the feature channels) directions of the feature map generated by the backbone network. Meanwhile, two independent classifiers are set to align the foreground and background features separately for more effective training.

Multiple Receptive Field Feature Integration
For region-based object detection methods, RPN slides on the feature map via 3 × 3 convolutional layers, and predicts classification scores together with bounding box offsets based on feature vectors that have a fixed receptive field. However, the size of the ship can be arbitrary, so it is sub-optimal to predict ships of different sizes with a fixed receptive field. Especially when a domain shift exists, the mismatch between the scale and the receptive field further increases the difficulty of the cross-domain feature representation. To solve this problem, the first step is to enlarge the receptive field of the feature map. A larger receptive field can cover larger objects and contain more context information [41], which is conducive to extracting important features between different domains.
Conventional backbone networks mainly expand the receptive field through the pooling operation [42]. However, pooling is actually a down-sampling process. The loss of the spatial information blurs the boundary of objects, which compromises the object location ability. Therefore, we adopt the dilated convolution to expand the receptive field without down-sampling [43]. Specifically, as shown in Figures 2 and 3, we add the dilated convolution layer to two different types of residual modules, and combine them repeatedly to obtain new feature maps with a larger receptive field [44]. Then, these feature maps should be integrated together to contain multiple pieces of receptive field information. The most common way to merge different feature maps is directly splicing them along the channel. However, due to the semantic gap between different features, this rough combination approach is not conducive to feature learning. Therefore, we gradually integrate the two adjacent feature maps with different receptive fields via a novel attention-based feature fusion module to alleviate the negative impact of the semantic gap (see Figure 2).
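To make the trade-off between pooling and dilation concrete, the following minimal sketch computes the theoretical receptive field of a stack of layers. The three layer configurations are illustrative assumptions only; they are not the configurations of the residual modules used in the proposed method.

```python
def receptive_field(layers):
    """Theoretical receptive field of a layer stack.

    Each layer is a tuple (kernel_size, stride, dilation).
    Returns (receptive_field, cumulative_stride).
    """
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for k, s, d in layers:
        rf += (k - 1) * d * jump
        jump *= s
    return rf, jump


# Hypothetical stacks, chosen only for illustration.
plain = [(3, 1, 1)] * 3                                            # three 3x3 convs
pooled = [(3, 1, 1), (2, 2, 1), (3, 1, 1), (2, 2, 1), (3, 1, 1)]   # convs with 2x2 pooling
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]                        # dilation 1, 2, 4

for name, stack in [("plain", plain), ("pooled", pooled), ("dilated", dilated)]:
    print(name, receptive_field(stack))
```

In this toy setting the dilated stack reaches a receptive field of 15 pixels at full resolution (cumulative stride 1), whereas the pooled stack reaches 18 pixels only by down-sampling the feature map by a factor of 4.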

Figure 2. Illustration of the multiple receptive field feature integration. A and B represent two different types of residual modules, while Att represents the attention-based feature fusion.

Inspired by [45], the attention-based feature fusion module combines global and local information to predict the weight of the integration. The structure of the attention-based feature fusion is shown in Figure 4, where ⊕ and ⊗ denote element-wise addition and multiplication, respectively. Denoting the output of the connected ⊗ as P, the dotted arrow indicates the operation (1 − P). The two inputs are first merged together by a convolutional layer after splicing. Then, two parallel convolution branches are used to extract global and local information, respectively. The upper branch is directly calculated on the input feature map to obtain local information, while the lower branch first obtains global information by the global pooling operation. Finally, the outputs of the two branches are element-wise added and passed through the sigmoid activation function to obtain the final fusion weight. Let Ψ denote the global-local weight prediction (the dashed box in Figure 4), which is applied after the convolution that merges the spliced inputs. The attention-based feature fusion module then maps the two inputs X and Y to the output Z. With the help of the attention-based feature fusion, features with different receptive fields are reasonably integrated to adapt to ship targets of different sizes.

Channel Domain Attention
The multiple receptive field feature integration introduced in Section 3.1.1 enables the integrated features to better adapt to the scale changes of ships. However, such location-related optimization is unable to cope with the domain shift caused by environmental and weather changes, since these differences are encoded into different feature channels via convolutional layers. One possible solution is to optimize the feature channels to better emphasize the important features of the two domains, while suppressing those that are useless for the feature alignment. Therefore, inspired by the widely used SE mechanism, we attach the channel domain attention module to the multiple receptive field feature integration. The channel domain attention module extends the SE mechanism to different domains, aiming at achieving more effective feature extraction by focusing on individual domains.
The structure of the channel domain attention module is shown in Figure 5. It consists of multiple SE adapters and an attention assignment branch. The input feature map is first pooled by a global pooling layer to aggregate the spatial information. Then, following the SE mechanism, N SE adapters produce independent channel weights for the C channels of the input feature map, selectively emphasizing informative features and suppressing less useful ones [46]. Next, the attention assignment branch generates domain-specific activations for each SE adapter to obtain the final channel weights. Finally, the input feature map is scaled by the final channel weights through channel-wise multiplication. The benefit of this module is two-fold: first, the channel-related information is encoded into the output of multiple SE adapters, making the model more sensitive to input changes and better able to capture useful information; second, the attention assignment branch dynamically integrates this channel-related information for different domains, which helps to obtain robust features suitable for both domains.
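The data flow of the module can be sketched as follows. This is a minimal NumPy illustration under several assumptions that the text does not fix: each SE adapter is a two-layer bottleneck (FC, ReLU, FC), the attention assignment branch is a single linear layer followed by a softmax over the adapters, and the sigmoid is applied after the adapters are combined. All names and shapes are hypothetical.

```python
import numpy as np


def channel_domain_attention(feat, adapters, w_assign):
    """feat: (C, H, W) feature map.
    adapters: list of N pairs (w1, w2), w1 of shape (C // r, C), w2 of shape (C, C // r).
    w_assign: (N, C) weights of the attention assignment branch.
    """
    s = feat.mean(axis=(1, 2))                                  # global pooling -> (C,)
    bank = np.stack(
        [w2 @ np.maximum(w1 @ s, 0.0) for w1, w2 in adapters],  # one response per adapter
        axis=1,
    )                                                           # (C, N)
    logits = w_assign @ s                                       # (N,)
    assign = np.exp(logits - logits.max())
    assign /= assign.sum()                                      # softmax over adapters
    weights = 1.0 / (1.0 + np.exp(-(bank @ assign)))            # final channel weights (C,)
    return feat * weights[:, None, None], weights, assign


rng = np.random.default_rng(0)
C, r, N = 8, 4, 3
feat = rng.normal(size=(C, 6, 6))
adapters = [(rng.normal(size=(C // r, C)), rng.normal(size=(C, C // r))) for _ in range(N)]
out, weights, assign = channel_domain_attention(feat, adapters, rng.normal(size=(N, C)))

# With a single adapter the assignment is trivially 1, i.e. a plain SE block.
_, _, assign_single = channel_domain_attention(feat, adapters[:1], rng.normal(size=(1, C)))
```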

Dual Supervision Adaption Approach
Besides the multiple receptive field feature integration and channel domain attention for feature optimization, the domain classifier is also required to achieve cross-domain feature alignment. During training, the domain classifier adjusts its weights to discriminate the domain label of the input features produced by the feature extraction network, while the feature extraction network tries to generate domain-invariant features that can deceive the domain classifier. Therefore, if the classification error is high even for a well-trained domain classifier, the features of the two domains are close to each other, that is, they are already aligned.
In theory, the image-level adaption should take the output feature map of the feature extraction network as a whole for global feature alignment. However, during training, each activation on the feature map (which represents a fixed-size image patch) is regarded as an independent sample as the input of the domain classifier. The benefits of using image patches instead of the entire image for domain classification are two-fold: (1) Limited by the computing power, the batch size is usually set to a small value during training. Classifying image patches instead of the whole image can generate more training samples (e.g., 128 per image in our implementation).
(2) Since the classifier requires fixed-size input while the size of the image is changeable, this patch-based classification strategy can avoid the scaling or sampling operation, which will inevitably cause the information distortion.
However, despite the effectiveness of this patch-based approach, the training samples are dominated by the background, since the ship regions in remote sensing images only occupy a small part of the image. Therefore, the ship samples will not be fully trained, which is not conducive to obtaining a proper feature representation for the ships between the two domains.
To solve this problem and obtain more effectively aligned features, we set two independent domain classifiers, the foreground domain classifier and the background domain classifier, to identify the ship samples and background samples from the two domains, respectively (see Figure 1). During training, the classification scores predicted by RPN determine whether the current location on the feature map is a ship sample. Specifically, locations with a classification score larger than 0.5 are ship regions, and those with a classification score less than 0.2 are background regions. It is worth noting that, to cover different scales and aspect ratios, multiple anchors with different sizes are centred on each location of the feature map. Since each anchor has a corresponding classification score, we take the maximum value as the score of the current location. The image-level adaption loss is the sum of the foreground domain classifier loss $L_{fg}$ and the background domain classifier loss $L_{bg}$:

$$L_{img} = L_{fg} + L_{bg} \quad (2)$$

We train the domain classifier on each activation located at $(u, v)$ of the feature map. Denoting the output probability of the domain classifier as $p_i^{(u,v)}$, $L_{fg}$ and $L_{bg}$ are defined as follows:

$$L_{fg} = -\sum_{i,u,v} \alpha \left[ D_i \log p_i^{(u,v)} + (1 - D_i) \log\left(1 - p_i^{(u,v)}\right) \right] \quad (3)$$

$$L_{bg} = -\sum_{i,u,v} \beta \left[ D_i \log p_i^{(u,v)} + (1 - D_i) \log\left(1 - p_i^{(u,v)}\right) \right] \quad (4)$$

where $D_i$ denotes the domain label of the $i$-th training image ($D_i = 0$ for the first domain and $D_i = 1$ for the second domain). $\alpha = 1$ for ship samples and $\alpha = 0$ otherwise; $\beta = 1$ for background samples and $\beta = 0$ otherwise.
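The sample selection and the two domain classification losses described above can be sketched for a single image as follows. The thresholds 0.5 and 0.2 and the maximum over anchors come from the text; the cross-entropy form, the unnormalized sum over locations, and all function and variable names are assumptions of this sketch.

```python
import numpy as np


def domain_ce(p, d, eps=1e-7):
    """Binary cross-entropy between domain probabilities p and domain label d (0 or 1)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(d * np.log(p) + (1 - d) * np.log(1.0 - p))


def image_level_loss(p_fg, p_bg, rpn_scores, d, fg_thr=0.5, bg_thr=0.2):
    """p_fg, p_bg: (H, W) outputs of the foreground / background domain classifiers.
    rpn_scores: (A, H, W) RPN classification scores of the A anchors per location.
    """
    score = rpn_scores.max(axis=0)                # score of a location = best anchor score
    alpha = (score > fg_thr).astype(float)        # ship samples
    beta = (score < bg_thr).astype(float)         # background samples
    l_fg = float((alpha * domain_ce(p_fg, d)).sum())
    l_bg = float((beta * domain_ce(p_bg, d)).sum())
    return l_fg + l_bg, alpha, beta


# Toy 2 x 2 feature map with two anchors per location.
rpn_scores = np.array([[[0.9, 0.10], [0.3, 0.05]],
                       [[0.2, 0.15], [0.1, 0.60]]])
p = np.full((2, 2), 0.5)                          # an undecided domain classifier
l_img, alpha, beta = image_level_loss(p, p, rpn_scores, d=1)
```

In this toy example two locations are treated as ship samples, one as background, and the location whose best score is 0.3 is used by neither classifier.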

Instance-Level Adaption
Section 3.1 minimizes the impact of domain shift in terms of the image-level global feature representations. However, images in different domains may also show significant differences in local regions that may contain ships. The features of these regions also affect the localization accuracy of the detection. Therefore, to achieve local region feature alignment, we design a novel boundary regression module to correct the size and location of the proposals, and we align the corrected ROI features between different domains via the domain classifier.
Due to the existence of the domain shift, more robust features are required to achieve accurate regression. Although the features in the inner part of the region proposal are helpful for the classification task, they are redundant for the boundary regression and make the regressor easily affected by background interference. Therefore, to reduce the interference as much as possible, we only utilize the features closest to the boundary of the proposal for the regression. Such features are the extreme point features with a maximum response value on the boundary of the ROI features.
Specifically, as shown in Figure 6, the boundary regression module takes as input the 7 × 7 × 512 feature map produced by the ROI Align [10] layer for each proposal. Then, since the convolution operation applies shared transformations that are more robust for regressing the ship boundary, the number of channels is increased through two convolution branches to obtain a 7 × 7 × 1024 feature map. Finally, four 1024-d vectors are obtained via extreme point pooling, each of which is used to regress one offset of the proposal. The structure of extreme point pooling is shown in Figure 7. Features of the outermost circle are first divided into four parts: upper, lower, left and right (identified by different colors). Then, max pooling is performed to convert these four parts into 1024-d vectors, each of which is responsible for regressing the corresponding offset.

In object detection algorithms, a proposal is usually described by $x, y, w, h$ ($(x, y)$ represents the center point, while $w$ and $h$ represent the width and height). However, in the boundary regression module, we instead predict the offset of each side of the proposal (represented by $l, r, u, d$ for the left, right, upper and down sides, respectively) via the corresponding feature vector. In this way, each feature is directly linked to the value it predicts, which avoids interference from irrelevant information and thus improves the adaptability to the domain shift. Given a proposal represented by $x, y, w, h$, the normalized offsets $t = (t_l, t_r, t_u, t_d)$ are defined from the original proposal ($x, y, w, h$) and the corrected proposal ($x_a, y_a, w_a, h_a$). Given the ground-truth offsets $t^* = (t^*_l, t^*_r, t^*_u, t^*_d)$, computed in the same way from the ground-truth values $x^*, y^*, w^*$ and $h^*$, we employ the smooth-$L_1$ loss [1] for the regression loss $L_{reg}$.
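The extreme point pooling step can be sketched with NumPy as below. The sketch assumes that each side of the outermost ring is one full row or column of the ROI feature, so that corner cells are shared by two adjacent sides; the exact partition is defined by Figure 7 and may differ. The function name and the dictionary keys are hypothetical.

```python
import numpy as np


def extreme_point_pooling(roi_feat):
    """roi_feat: (C, S, S) ROI feature map. Returns one (C,) vector per side."""
    return {
        "u": roi_feat[:, 0, :].max(axis=1),    # upper side: first row
        "d": roi_feat[:, -1, :].max(axis=1),   # lower side: last row
        "l": roi_feat[:, :, 0].max(axis=1),    # left side: first column
        "r": roi_feat[:, :, -1].max(axis=1),   # right side: last column
    }


# Toy ROI feature with 2 channels instead of 1024, filled with increasing values.
roi = np.arange(2 * 7 * 7).reshape(2, 7, 7)
vecs = extreme_point_pooling(roi)
```

Each of the four vectors would then be passed to its own regressor to predict one of the side offsets.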
Given the predicted offsets, the corrected proposal can be calculated from Equation (5), and then the second ROI pooling process is performed to obtain the corrected ROI feature for further identification. To eliminate the domain shift at the instance level, a domain classifier is attached to the second ROI Align layer to discriminate the corrected ROI features. Similar to Equations (3) and (4), the loss of the region-level domain classifier is computed over the corrected region proposals, indexed by $j$. In summary, the instance-level adaption loss $L_{ins}$ is composed of the loss of the domain classifier and the loss of the offset regression. Moreover, the proposed method is also applicable to the unsupervised domain adaption scenario, where only one of the two domains has bounding box annotations during training (see Section 3.4 for detail). In this situation, $L_{reg}$ is computed only for the proposals of the annotated domain.
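A side-based parameterization of this kind can be sketched as follows. The sketch assumes that each side offset is the displacement of that side normalized by the proposal width (left, right) or height (upper, down), and that the smooth-$L_1$ loss is summed over the four sides; the exact normalization used in the paper's offset definition may differ. All function names are hypothetical.

```python
def to_sides(box):
    """(x, y, w, h) with center (x, y) -> (l, r, u, d) side coordinates."""
    x, y, w, h = box
    return (x - w / 2, x + w / 2, y - h / 2, y + h / 2)


def encode(proposal, target):
    """Normalized side offsets that move `proposal` onto `target`."""
    _, _, w, h = proposal
    l, r, u, d = to_sides(proposal)
    tl, tr, tu, td = to_sides(target)
    return ((tl - l) / w, (tr - r) / w, (tu - u) / h, (td - d) / h)


def decode(proposal, t):
    """Apply side offsets t = (t_l, t_r, t_u, t_d) and return the corrected (x, y, w, h)."""
    _, _, w, h = proposal
    l, r, u, d = to_sides(proposal)
    l, r, u, d = l + t[0] * w, r + t[1] * w, u + t[2] * h, d + t[3] * h
    return ((l + r) / 2, (u + d) / 2, r - l, d - u)


def smooth_l1(t, t_star):
    """Smooth-L1 loss summed over the four side offsets."""
    total = 0.0
    for a, b in zip(t, t_star):
        z = abs(a - b)
        total += 0.5 * z * z if z < 1.0 else z - 0.5
    return total


proposal = (50.0, 40.0, 20.0, 10.0)
ground_truth = (52.0, 41.0, 28.0, 12.0)
t_star = encode(proposal, ground_truth)      # approximately (-0.1, 0.3, 0.0, 0.2)
corrected = decode(proposal, t_star)         # recovers the ground-truth box
loss_if_no_correction = smooth_l1((0.0, 0.0, 0.0, 0.0), t_star)
```

Because every side has its own offset, moving one boundary does not change the other three, which is the property the boundary regression module relies on.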

Training
Since the proposed method is built on a typical object detection model, the final training loss $L$ includes the original detection loss $L_{det}$ in addition to the domain adaption losses $L_{img}$ and $L_{ins}$ introduced above. It is worth noting that the optimization goals of the proposed model are contradictory. During training, the detection network tries to generate similar features to deceive the domain classifiers, thus maximizing $L_{img}$ and $L_{ins}$. In contrast, to obtain a powerful domain discriminator, the optimization goal of the domain classifiers is to minimize $L_{img}$ and $L_{ins}$. Therefore, it is natural to perform two-stage adversarial training. In the first stage, all loss terms are minimized to optimize the detection network and the discriminative ability of the domain classifiers. The second stage keeps the parameters of the domain classifiers fixed and continues to minimize $L_{det}$ while maximizing $L_{img}$ and $L_{ins}$. At this time, the detection network tends to generate domain-invariant features to deceive the domain classifiers. The two stages are performed alternately until the network parameters are optimized.
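The alternation between the two stages can be illustrated with a deliberately tiny toy problem. In the sketch below, the "feature extractor" is reduced to a single scalar shift `b` applied to one-dimensional target features, the domain classifier is a logistic model, and the detection loss is omitted entirely. The data, learning rates and number of rounds are arbitrary assumptions; this is not the training procedure of the proposed network, only an illustration of the min/max roles.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


source = [-0.5, 0.0, 0.5]       # toy source-domain features (domain label 0)
target_raw = [1.5, 2.0, 2.5]    # toy target-domain features before adaptation (label 1)

w, c = 0.0, 0.0                 # domain classifier parameters
b = 0.0                         # stand-in for the feature extractor: shift of target features


def mean_gap(shift):
    """Distance between the mean target feature and the mean source feature."""
    return sum(t + shift for t in target_raw) / len(target_raw) - sum(source) / len(source)


gap_before = mean_gap(b)
for _ in range(20):
    # Stage 1: update the domain classifier to minimize its cross-entropy loss.
    samples = [(x, 0.0) for x in source] + [(t + b, 1.0) for t in target_raw]
    grad_w = sum((sigmoid(w * x + c) - d) * x for x, d in samples) / len(samples)
    grad_c = sum(sigmoid(w * x + c) - d for x, d in samples) / len(samples)
    w, c = w - 0.5 * grad_w, c - 0.5 * grad_c

    # Stage 2: keep (w, c) fixed and move the features to maximize the same loss.
    grad_b = sum(-(1.0 - sigmoid(w * (t + b) + c)) * w for t in target_raw) / len(target_raw)
    b += 0.05 * grad_b          # gradient ascent on the domain classification loss
gap_after = mean_gap(b)
print(gap_before, gap_after)
```

After a few rounds the target features have moved toward the source features, which is the behaviour the second stage is meant to induce in the detection network.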

Unsupervised Domain Adaption
Since annotating the training samples is labor-intensive, it is also meaningful to consider the case of unsupervised domain adaption. In the unsupervised domain adaption scenario, no bounding box annotations are available for the target domain during training. Therefore, the detection loss and offset regression loss of the target training data must be ignored. However, due to the domain shift and the lack of bounding box annotations, proposals of the target domain usually have poor localization accuracy even if the features are well aligned. Such inaccurate proposals directly lead to poor detection results.
This problem can be alleviated by the proposed boundary regression module and instance-level domain classifier. Since the accuracy of the proposals is faithfully reflected in their ROI features via the ROI pooling process, the domain classifier can also be used to supervise the update of the offset t through feature alignment. That is, when the corrected ROI features are close enough to each other, the localization accuracy of the proposals corrected by t is also almost the same. Therefore, even if no bounding box annotations are available for the target domain, an effective offset can still be predicted via the feature alignment to reduce the localization deviation between the proposals of the two domains.
The boundary regression together with other structures improves the generalization ability of the model to different domains, making the proposed method achieve leading results in the unsupervised domain adaption task. The comparative experiments with other methods are shown in Section 4.3.3.
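The loss masking used in the unsupervised scenario can be sketched as follows: the domain classifier terms are accumulated for images of both domains, while the box-supervised terms are accumulated only for annotated images. The dictionary keys and the unweighted sum are assumptions of this sketch.

```python
def training_loss(images):
    """images: list of dicts holding per-image loss terms and an 'annotated' flag."""
    total = 0.0
    for im in images:
        total += im["l_img"] + im["l_dc"]       # domain classifier terms: both domains
        if im["annotated"]:                     # box-supervised terms: annotated domain only
            total += im["l_det"] + im["l_reg"]
    return total


batch = [
    {"annotated": True, "l_det": 1.0, "l_reg": 0.5, "l_img": 0.2, "l_dc": 0.1},
    # Target-domain image: its detection and regression terms are ignored.
    {"annotated": False, "l_det": 9.0, "l_reg": 9.0, "l_img": 0.3, "l_dc": 0.4},
]
loss = training_loss(batch)
```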

Datasets and Implementation Details
HRSC2016 and Airbus ship detection dataset. The proposed method is trained and tested on a collection of the HRSC2016 [47] dataset and the dataset for the Kaggle Airbus ship detection challenge (https://www.kaggle.com/c/airbus-ship-detection, date accessed: 9 August 2021). The HRSC2016 dataset contains 1055 ship images, while the Airbus dataset has a total of 3000 images. As shown in Figure 8, the two datasets have obvious differences. Images in HRSC2016 are mainly taken from the port environment and contain a large number of military ships. In contrast, most of the ships in the Airbus dataset are civil ships in the sea environment.

Synthetic remote sensing ship datasets. To verify the superiority of the proposed method in more demanding domain adaption scenarios, we also adopt the synthetic remote sensing image data from the maritime ship detection competition (https://www.datafountain.cn/competitions/275/, date accessed: 9 August 2021) held by the China Computer Federation in the comparison experiments. As shown in Figure 9, the data are divided into two independent datasets (each of which contains 1500 images) according to different weather conditions, including normal weather (Normal) and cloudy weather (Cloudy).

The network is trained with an Adam optimizer on a GTX1080ti GPU with 2 images from different datasets per mini-batch. The input image is resized such that its shorter side has 600 pixels. The backbone network is ResNet-50 [25]. During training, we use a momentum of 0.9 and a weight decay of 0.0005 for optimization. The collection of the two datasets is randomly divided into three parts (training set, validation set, and test set) with a ratio of 6:1:3. We flip each image horizontally to double the number of images for data augmentation. The other parameter settings are consistent with those in [5].
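The preprocessing steps listed above (shorter-side resizing, the 6:1:3 split and horizontal flipping) can be sketched as follows. The function names, the random seed, the box format (x_min, y_min, x_max, y_max) and the example image size are assumptions for illustration; the count of 4055 images is simply 1055 + 3000 from the dataset description.

```python
import random


def resize_scale(width, height, shorter_side=600):
    """Scale factor that makes the shorter image side equal to `shorter_side` pixels."""
    return shorter_side / min(width, height)


def split_6_1_3(items, seed=0):
    """Random split into training, validation and test sets with a 6:1:3 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def hflip_box(box, image_width):
    """Horizontally flip a box given as (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


train, val, test = split_6_1_3(range(4055))   # 1055 HRSC2016 + 3000 Airbus images
scale = resize_scale(800, 1000)               # hypothetical 800 x 1000 image
flipped = hflip_box((10, 20, 110, 60), image_width=800)
```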
We use the average precision (AP) calculated in accordance with the PASCAL visual object classes challenge 2007 (VOC2007) [48] as the evaluation metric.
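For reference, the VOC2007 protocol computes AP by 11-point interpolation: the precision is interpolated at the recall levels 0, 0.1, ..., 1.0 and averaged. A minimal sketch is given below; the function name and the toy detections are illustrative, and the matching of detections to ground-truth boxes is assumed to have been done already.

```python
def voc07_ap(recalls, precisions):
    """11-point interpolated average precision.

    recalls and precisions are cumulative values after each detection,
    with detections sorted by descending confidence.
    """
    ap = 0.0
    for k in range(11):
        threshold = k / 10.0
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap


# Toy example: two ground-truth ships; detections are TP, FP, TP in score order.
recalls = [0.5, 0.5, 1.0]
precisions = [1.0, 0.5, 2.0 / 3.0]
ap = voc07_ap(recalls, precisions)
```

In this toy case the interpolated precision is 1 for the six recall levels up to 0.5 and 2/3 for the remaining five, giving an AP of 28/33.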

Evaluation of the Proposed Modules
To verify the superiority of the proposed structures, we conduct several comparisons in Table 1 with a baseline model from which all of the proposed components are removed. For a fair comparison, all other parameter settings of the baseline model are consistent with the proposed method. In Table 1, MRFI and CDA indicate the multiple receptive field feature integration and the channel domain attention module in image-level adaption, respectively. BR indicates the boundary regression in instance-level adaption.

From Table 1, we can see that the multiple receptive field feature integration can effectively improve the performance of the baseline model on the test set. By combining it with the domain attention module, the AP achieves a 1.8-point improvement. In addition, the extreme point regression brings an AP increase of 2.0 points on its own. Experimental results show that the instance-level adaption achieves a higher improvement than the image-level adaption. The reasons may be as follows: (1) for the two datasets, the difference between ship objects is more significant than the difference between background regions. (2) The instance-level adaption includes the process of proposal correction, and more precise object localization directly improves the performance of the detection. Finally, by assembling all these structures together, the proposed model improves on the baseline model by 3.6 points and achieves an AP of 89.0%.

Evaluation of the Multiple Receptive Field Feature Integration
The key to multiple receptive field feature integration is the fusion strategy for the feature maps with different receptive fields. Since there are semantic gaps between different feature maps, a simple fusion strategy may not achieve good results. Therefore, we gradually fuse adjacent feature maps, and obtain appropriate fusion weights based on the attention mechanism. As shown in Table 2, we evaluate the performance of the proposed approach with different fusion strategies. In Table 2, direct means directly concatenating the feature maps with different receptive fields along the channel dimension, while gradual means fusing the adjacent feature maps in sequence. The attention-based feature fusion proposed in this paper is indicated by *. The experimental results show that gradual fusion is better than direct fusion, and that the weights obtained by the attention-based feature fusion further improve the performance of multiple receptive field feature integration.
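The difference between the direct and gradual strategies can be illustrated with a small sketch. The exact form of the attention used to compute the fusion weights is not specified in this section, so the weighting below (a softmax over the global average of each map) is purely a hypothetical stand-in, as are the function names and the nested-list representation of a feature map (a list of channels, each a flat list of values).

```python
import math


def direct_fusion(maps):
    """Direct strategy: concatenate all feature maps along the channel axis."""
    return [channel for fmap in maps for channel in fmap]


def attention_weights(fmap_a, fmap_b):
    """Hypothetical attention: softmax over the global mean of each map."""
    def global_mean(fmap):
        values = [v for channel in fmap for v in channel]
        return sum(values) / len(values)

    ma, mb = global_mean(fmap_a), global_mean(fmap_b)
    top = max(ma, mb)  # subtract the max for numerical stability
    ea, eb = math.exp(ma - top), math.exp(mb - top)
    return ea / (ea + eb), eb / (ea + eb)


def gradual_fusion(maps):
    """Gradual strategy: fuse adjacent maps one after another.

    Each step blends the running result with the next map using the
    (assumed) attention weights, so the channel count stays unchanged.
    """
    fused = maps[0]
    for nxt in maps[1:]:
        wa, wb = attention_weights(fused, nxt)
        fused = [[wa * x + wb * y for x, y in zip(ca, cb)]
                 for ca, cb in zip(fused, nxt)]
    return fused


if __name__ == "__main__":
    # Three toy maps with 2 channels of 2 values each.
    maps = [[[1.0, 2.0], [3.0, 4.0]],
            [[2.0, 2.0], [2.0, 2.0]],
            [[0.0, 1.0], [1.0, 0.0]]]
    print(len(direct_fusion(maps)))   # channels after concatenation
    print(len(gradual_fusion(maps)))  # channels after gradual fusion
```

Note that direct concatenation multiplies the channel count by the number of maps, whereas the gradual variant keeps it fixed and only changes how much each receptive field contributes.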

Evaluation of the Domain Attention Module
In this section, we study the influence of the number of SE adapters in the channel domain attention module. As shown in Figure 10, we set a different number of SE adapters (denoted by n) for the channel domain attention module to evaluate their performance. It is worth noting that when n = 1, the channel domain attention module degenerates to the standard SE module. It can be seen from the experimental results that a single SE adapter has the worst performance, since only the channel attention works in this case. Once sufficient SE adapters are used (n = 3 in our experiment), adding more SE adapters does not bring further performance improvement. This can probably be explained as follows. Although a larger n provides a larger parameter space, which helps to distribute the activation sensitively between the two domains, more parameters also increase the risk of over-fitting, thus resulting in a performance decrease.
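To make the role of n concrete, the following sketch shows one plausible form of a channel attention module built from a bank of n SE adapters whose responses are mixed by softmax domain-assignment weights. The module's actual structure is defined earlier in the paper and may differ; all names, weight shapes, and the placement of the sigmoid here are assumptions. The sketch does reproduce the property stated above: with n = 1 the assignment weight is 1 and the module reduces to a standard SE block.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def se_adapter(pooled, w1, w2):
    """One SE adapter: FC -> ReLU -> FC on the globally pooled channel vector."""
    hidden = [max(0.0, h) for h in matvec(w1, pooled)]
    return matvec(w2, hidden)


def channel_domain_attention(pooled, adapters, assign_w):
    """Mix n SE adapter responses with softmax domain-assignment weights.

    pooled: globally average-pooled feature, one value per channel.
    adapters: list of (w1, w2) weight pairs, one per SE adapter.
    assign_w: matrix with one row per adapter (hypothetical assignment layer).
    Returns one scaling factor in (0, 1) per channel.
    """
    responses = [se_adapter(pooled, w1, w2) for w1, w2 in adapters]
    assignment = softmax(matvec(assign_w, pooled))
    mixed = [sum(a * r[c] for a, r in zip(assignment, responses))
             for c in range(len(pooled))]
    return [sigmoid(m) for m in mixed]


if __name__ == "__main__":
    pooled = [2.0, 1.0]
    adapter = ([[1.0, -1.0]], [[0.5], [-0.5]])  # toy weights, 2 channels
    # n = 1: identical to a plain SE block applied to the pooled vector.
    print(channel_domain_attention(pooled, [adapter], [[0.3, 0.7]]))
```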

Evaluation of the Boundary Regression Module
The boundary regression module consists of two major components: a convolution head for feature preprocessing, and the extreme point pooling to obtain the features of extreme points for the regression. In this section, we will study the performance of these two components separately.
The experimental results are shown in Table 3, where normal means that the feature obtained from the previous ROI pooling process is used as the pooling input, and conv indicates that the structure shown in Figure 6 is used to process the feature map. For the pooling method, traditional represents the conventional way of dividing the proposal into 7 × 7 bins for pooling (a feature map of size 7 × 7 × 512 is obtained after pooling), while point-based indicates that the extreme point pooling shown in Figure 7 is performed to obtain four 1024-d vectors for regression. To better evaluate the localization performance, the mean intersection-over-union (IoU) is also calculated. The experimental results show that the conv head boosts both AP and IoU, which suggests that the convolution operation is more suitable for bounding box regression. In addition, although the AP of extreme point pooling is almost the same as that of the traditional pooling method, we can still see a significant improvement in terms of IoU. This shows that extreme point pooling effectively improves the localization accuracy of the detection.
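The IoU used as the localization index above is the ratio between the intersection area and the union area of a predicted box and its ground-truth box. A minimal sketch is given below; it assumes axis-aligned boxes in (x1, y1, x2, y2) format and that the mean is taken over already matched prediction and ground-truth pairs, neither of which is stated explicitly in this section.

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average IoU over (predicted_box, ground_truth_box) pairs."""
    return sum(iou(pred, gt) for pred, gt in pairs) / len(pairs)


if __name__ == "__main__":
    pairs = [((0, 0, 2, 2), (1, 1, 3, 3)),
             ((0, 0, 4, 4), (0, 0, 4, 4))]
    print(mean_iou(pairs))
```

Because AP at a fixed IoU threshold only checks whether a detection passes the threshold, two methods can have nearly the same AP while differing in mean IoU, which is consistent with the observation above.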

Domain Adaption on Real Remote Sensing Data
The proposed method is compared with four other representative methods which are Faster R-CNN [5], RetinaNet [8], YOLOv3 [9], and Mask R-CNN [10]. Some examples of the detection results from HRSC2016 and Airbus by different methods are shown in Figure 11 and Figure 12, respectively. Examples of some failed cases are shown in Figure 13.
The three images shown in Figure 11 all contain complex port background interference. It can be seen that the other four methods all have missed detections to some extent, while the proposed method successfully detects all the ships at various scales. In contrast, the images in Figure 12 all have sea backgrounds. Although this environment is simpler than the ports, the sea clutter and complicated weather conditions still make it difficult to detect small ships. For example, none of the other algorithms detects both of the tiny moving ships in the first image. Because of the bad weather, the ships in the other two images are less clear. It can be seen from the results that such blurry ships easily cause missed detections, while the proposed method successfully detects all of them. In summary, compared with general detection methods, the proposed method can effectively deal with the domain shift between different datasets, achieving accurate ship detection in both port and sea environments.
Quantitative comparisons on the test set are provided in Table 4. From Table 4, it can be seen that our method obtains the highest scores on all the indexes of precision, recall, and AP.

Domain Adaption on Synthetic Remote Sensing Data
In addition to adapting between real remote sensing data, synthetic data are also used for the comparison experiments. Specifically, the experiments with synthetic data include HRSC2016 & Normal, HRSC2016 & Cloudy, and Normal & Cloudy. The evaluation results are shown in Table 5. The best results are marked in bold.
The experimental results show that, compared with other methods, our proposed method still maintains obvious advantages in all three adaption scenarios with synthetic data. It should be noted that HRSC2016 and Cloudy differ in both data type and weather conditions. Therefore, there is a large domain shift between the two datasets, which limits the performance of the detection methods. In contrast, the Normal and Cloudy datasets are both synthetic and differ only in weather conditions, so the domain shift between them is relatively small. In this case, all methods achieve their best performance among the three sets of experiments.

Unsupervised Domain Adaption
Since it is difficult to acquire and annotate remote sensing images, we also consider the unsupervised domain adaption scenario; the experimental results are shown in Table 6. We perform the adaption experiments on real remote sensing data and on synthetic remote sensing data, respectively. Specifically, in the experiment on real data, we use the HRSC2016 dataset as the source domain, for which images and their bounding box annotations are provided, and the Airbus dataset as the target domain, for which only unlabeled images are available. All images from the source domain are used for training, while 30% of the images from the target domain are reserved to evaluate the performance of the trained model. In the experiment on synthetic data, the Normal dataset is used as the source domain, while the Cloudy dataset is used as the target domain. The best results are marked in bold.
The proposed method is compared with three other representative domain adaption methods, namely those of [20], [21], and [22]. We also present the evaluation results of the baseline model introduced in Section 4.2.1 to verify the effectiveness of the domain adaption technology. As shown in Table 6, the proposed method outperforms all the other methods in both unsupervised domain adaption scenarios.

Conclusions
In this paper, we propose a novel CNN-based domain adaptive ship detection method for cross-domain ship detection in optical remote sensing images. The proposed method alleviates the performance drop caused by domain shift via both image-level adaption and instance-level adaption. In image-level adaption, we utilize multiple receptive field feature integration and channel domain attention to improve the feature representation ability of the network across the two domains. In instance-level adaption, a novel boundary regression module is proposed to correct the region proposals with the corresponding extreme point features. During training, the network learns suitable feature representations for both domains with the help of the domain classifiers, thereby improving the generalization ability of the trained model. In addition, the proposed method is also suitable for the unsupervised domain adaption scenario. Detailed ablation studies and comparisons with other algorithms verify the superiority of our method.

Conflicts of Interest:
The authors declare no conflict of interest.