1. Introduction
Tree crowns, the main site of photosynthesis, are an indispensable part of trees. An accurate, large-scale assessment of tree plantations is therefore valuable in both scientific research and production, for example in growing-status observation, pest control, and biomass prediction.
While many methods based on machine learning and deep learning have been applied to detect tree crowns in remote sensing images and have obtained good results [1,2,3,4], the premise for their effectiveness is that the training and testing data follow the same distribution. However, images in large-scale tree-crown detection and counting are usually collected in different regions with different sensors in practice, resulting in a distribution diversity in the feature space, i.e., a domain shift. In this case, the performance of traditional deep-learning-based methods degrades dramatically if they are applied directly to images taken under different conditions. The most straightforward remedy is to manually annotate data from the target (new) domain, but this is expensive and labor-intensive, and sufficient training data may not always be available.
Domain Adaptation (DA) is a transfer learning paradigm that aligns the data distributions of the source and target domains by learning new feature representations, so that a model trained on the labeled source domain can be transferred to a target domain that is completely unlabeled or contains only a few labeled data, without significant performance loss [5]. According to the availability and quality of data labels in the source and target domains, DA can be classified into supervised DA (SDA), semi-supervised DA (SSDA), and unsupervised DA (UDA) [6]. Our proposed method concentrates on the scenario in which there is a massive quantity of clean data in the source domain and the labels of the target data are totally unobtainable, i.e., UDA.
However, most of the existing DA models are proposed and optimized on detection benchmarks that contain various categories of objects. For example, there are 20 classes of labeled objects in PASCAL VOC [7] and 80 in MS COCO [8]. Tree-crown detection is quite the opposite. Compared with these benchmarks, the tree crowns in remote sensing images are characterized by a similar appearance, a uniform and dense arrangement, and a poor abundance of classes, which are unfavorable for CNN-based classifiers to learn discriminative features.
In addition, most of the existing object detectors [9,10,11,12] generate a large number of anchor boxes with sliding windows. To speed up the training procedure, Ren et al. [9] proposed a region proposal network (RPN) that generates region proposals with a wide range of scales and aspect ratios in the first stage of the detector and predicts an objectness score and the region coordinates. However, the number of negative samples (background) is much greater than the number of positive samples, resulting in an imbalance between positive and negative samples and a decline in classifier performance. Minibatch biased sampling is widely used in two-stage approaches; it randomly selects examples according to a predefined foreground-to-background ratio and a required number of examples. However, it makes the number of simple samples much larger than that of difficult samples, leading to a decrease in model performance when similar semantic interference occurs, such as vegetation or impervious background resembling the tree crown, which often appears in forestry land. Instead, hard samples are more conducive to the effective training of the detector. Classical hard-sample mining methods, e.g., OHEM [13], can help the model focus on hard samples under the guidance of the classification confidence and discard easy samples directly. However, OHEM requires iterative training, which makes it difficult to integrate with end-to-end detectors, and its high computational cost also greatly limits its application. From another perspective, Lin et al. [14] proposed a focal loss to alleviate the extreme imbalance between foreground and background samples in one-stage detectors. Instead of discarding easy samples directly, it dynamically reduces the weight of easy samples by modifying the original cross-entropy loss. However, this method has a very limited effect on two-stage detectors, since most easy negatives are filtered by the two-stage process. Notably, both of the above methods introduce extra hyperparameters that need additional configuration, increasing the optimization difficulty [15]. For two-stage detectors, improving the region-proposal quality is crucial for detection performance. Many approaches [16,17,18,19] have been proposed to improve the performance of the RPN, and most of them achieve accurate positioning by fine-tuning and aligning the features to the anchors in multiple stages. To alleviate the sample-level imbalance and train an RPN on hard candidate regions, Cho et al. [20] proposed a negative region proposal network (nRPN) that is trained with the false positives classified by the RPN, while the RPN is simultaneously trained with the hard negatives proposed by the nRPN. In this way, they provide more difficult positive or negative proposals to each other. However, before they can be trained simultaneously, it is necessary to first train the RPN alone for a few epochs to generate the false positives (FP) used in the nRPN.
Observing the above situation, we propose a DA cascade tree-crown detection framework with multiple region proposal networks (RPNs), referred to as CAS-DA, to improve the performance of cross-regional tree-crown detection and counting using remote sensing images. During training, the RPNs of CAS-DA discard easy samples stage by stage, and only hard samples participate in the training of classifiers at the deeper stages. In addition, to further improve the precision of detection, we propose a filtering strategy based on the empirical planting rules of tree crowns to remove false positives from the detection results of CAS-DA. Accordingly, our contributions are as follows.
(1) A cascade of region-proposal networks for tree-crown detection is proposed. It takes features from different convolutional layers stage by stage, filters easy samples, and mines hard samples to alleviate the data imbalance and enhance the classification capacity of RPNs.
(2) We integrate the proposed cascade network and Strong Weak Faster R-CNN into CAS-DA and construct the loss function of multiple stages so that the CAS-DA can be trained in an end-to-end manner.
(3) A practical filtering strategy based on planting rules is designed to further eliminate the wrongly detected trees effectively.
Extensive cross-regional experiments are conducted on three datasets collected by satellites and UAVs, including adaptation between satellite images and drone images and adaptation between different satellite images. The experimental results show that our method achieves an average F1-score of 68.95% and 88.83% in the two series of experiments, outperforming the other existing DA approaches by an obvious margin of 11.88%~40% and 0.50%~5.02%, respectively. This proves the effectiveness of our proposed method in cross-domain tree-crown detection and counting from multisource remote sensing images.
3. Method
Figure 1 shows the framework of the proposed CAS-DA (DA detector with cascade RPNs) for tree-crown detection and counting, including a Strong Weak Faster R-CNN and the cascade RPNs. We summarize these modules in our framework as follows.
(a) Strong Weak Faster R-CNN: This method is inherited from Faster R-CNN; it narrows the domain shift at the local and global levels using a local discriminator $D_l$ and a global discriminator $D_g$. A gradient reversal layer (GRL) connects each domain discriminator with the feature extraction network to achieve adversarial training. We focus on the process of feature extraction, represented by a series of blue cuboids in Figure 1. For clarity, we briefly review the structure of the feature extraction network (taking ResNet-101 [56] as an example) here. As shown in Table 1, the blocks consist of a series of convolutional layers successively named Conv1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x. Conv1~Conv4_x are responsible for extracting convolutional features of images from the source and target domains, while Conv5_x serves as a region-of-interest (ROI)-based classifier. The loss function of the Strong Weak Faster R-CNN is composed of the local-level adaptation loss $\mathcal{L}_{loc}$, the global-level adaptation loss $\mathcal{L}_{global}$, and the detection loss of Faster R-CNN $\mathcal{L}_{det}$.
(b) Cascade RPNs: As shown in Figure 1, this module consists of three RPNs with the same structure; they are attached sequentially to Conv2_x, Conv3_x, and Conv4_x of the Strong Weak Faster R-CNN to obtain multilevel features. Since the feature map from Conv4_x has a different size from those from Conv2_x and Conv3_x, convolutional blocks composed of a convolutional layer and an average pooling layer are deployed between the RPNs and the Strong Weak Faster R-CNN pipeline for integration convenience. These multilevel features are fed into the corresponding RPNs to generate proposals. Afterwards, each RPN classifier outputs a binary classification probability for each proposal; the higher the probability value, the more likely a tree crown is present in the proposal. We then set a threshold value at each stage: easy samples whose classification score is greater than or equal to the threshold are rejected at that stage and do not participate in the training of later stages. The loss of the cascade RPNs consists of the classification losses of all three RPNs and the regression loss of RPN3.
In this paper, we concentrate on unsupervised DA tree-crown detection across two different regions. Specifically, we define the images from the labeled source domain as $\{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ and the images from the unlabeled target domain as $\{x_j^t\}_{j=1}^{N_t}$, where $y_i^s$ denotes the labels of $x_i^s$, and $N_s$ and $N_t$ represent the number of images from the source domain and the target domain, respectively.
In the following sections, we explain the implementation of the cascade RPNs, the details of their integration with the Strong Weak Faster R-CNN, and the filtering strategy in turn.
3.1. Cascade Region Proposal Networks
As shown in Figure 1, images from the source domain and the target domain are sequentially input into the feature extraction network, and the resulting features are fed into the cascade-RPN branches to generate sets of candidate region proposals; during training, the last layer of each branch outputs, via a softmax function, a classification probability between 0 and 1 for each source proposal. Easy samples from the source domain are then filtered out if their probability exceeds a threshold, and the remaining samples go on to the training of the next cascade stage.
Following the green dashed line in Figure 1, for example, the source image features extracted by Conv1 and Conv2_x are resized and fed into RPN1, which generates region proposals with corresponding binary classification scores. For training efficiency, 1024 of these region proposals are selected and used to train RPN1. If a proposal's features differ significantly from those of a real tree crown, it obtains a very high classification score for the background category in RPN1 and is then marked as one of the easiest negative samples and rejected. Similarly, the easiest positive samples are also filtered at this stage. Then, the same number of region proposals is generated on the deeper feature map further extracted by Conv3_x, but only 512 of the remaining harder samples are used to train RPN2, and the comparatively easier samples are rejected at that stage. A similar process is applied in RPN3, with an extra bounding-box regression. Finally, the remaining proposals join the subsequent processing.
Such a cascade structure brings two advantages:
(1) A large quantity of easy samples are detected and rejected at an early stage, which reduces the number of samples to be trained in the subsequent network and improves the computing efficiency of the network.
(2) In cascade RPNs, the classifiers perform a stage-by-stage hard sampling. Each RPN can be trained to detect tree crowns at a different level of difficulty. Consequently, the classifiers of the RPNs become sequentially adept at distinguishing more difficult distractors, and the distribution of training samples becomes progressively more balanced.
As in Faster R-CNN, in the Strong Weak Faster R-CNN, the RPN loss function $L_{RPN}$ is composed of a binary classification loss $L_{cls}$ and a regression loss $L_{reg}$, as in Equation (1):

$$L_{RPN}(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*}) + \lambda \frac{1}{N_{reg}} \sum_i p_i^{*} L_{reg}(t_i, t_i^{*}) \quad (1)$$

Here, $p_i$ represents the probability that the $i$th anchor contains an object ($p_i^{*} = 1$ when the $i$th anchor box is positive, otherwise $p_i^{*} = 0$). $t_i$ and $t_i^{*}$ are the predicted coordinates of the anchor box and the coordinates of the ground-truth bounding box, respectively. $N_{cls}$ is the proposal minibatch size participating in the RPN training, and $N_{reg}$ is the number of anchor locations. Specifically, $L_{reg}(t_i, t_i^{*}) = R(t_i - t_i^{*})$, where $R$ is a smooth L1 loss function.
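For concreteness, the following is a minimal PyTorch sketch of Equation (1); the tensor names (`cls_scores`, `labels`, `bbox_pred`, `bbox_targets`) and the way samples are gathered are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_scores, labels, bbox_pred, bbox_targets, n_reg, lam=1.0):
    """Illustrative version of Equation (1).

    cls_scores:   (N_cls, 2) class logits for the sampled proposals.
    labels:       (N_cls,)   long tensor, 1 for positive anchors, 0 for negatives.
    bbox_pred:    (N_cls, 4) predicted box offsets.
    bbox_targets: (N_cls, 4) ground-truth box offsets.
    n_reg:        number of anchor locations used for normalization.
    """
    n_cls = cls_scores.shape[0]
    # Binary classification term, normalized by the minibatch size N_cls.
    l_cls = F.cross_entropy(cls_scores, labels, reduction="sum") / n_cls
    # Smooth L1 regression term, applied to positive anchors only (p_i* = 1)
    # and normalized by the number of anchor locations N_reg.
    pos = labels == 1
    l_reg = F.smooth_l1_loss(bbox_pred[pos], bbox_targets[pos],
                             reduction="sum") / n_reg
    return l_cls + lam * l_reg
```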
Similar to the Strong Weak Faster R-CNN, the cascade RPNs in CAS-DA adopt a multitask loss, including a binary classification loss $L_{cls}^{r}$ at the $r$th ($r = 1, \ldots, R$) stage and a regression loss $L_{reg}^{R}$ at the final stage. For each stage, the binary classification loss is computed as in Equation (2), and the regression loss is computed as in Equation (3). The loss function of the cascade RPNs is expressed as follows:

$$L_{cls}^{r} = \frac{\alpha_r}{N_{cls}^{r}} \sum_i \Big(\prod_{k=1}^{r-1} m_i^{k}\Big) L_{cls}(p_i^{r}, p_i^{*}) \quad (2)$$

$$L_{reg}^{R} = \frac{\lambda}{N_{reg}} \sum_i p_i^{*} L_{reg}(t_i, t_i^{*}) \quad (3)$$

$$L_{cas} = \sum_{r=1}^{R} L_{cls}^{r} + L_{reg}^{R}$$

where $\lambda = 1$, $R = 3$, and $r \in \{1,2,3\}$ in this paper. Similar to the Strong Weak Faster R-CNN, $p_i^{r}$ is a vector representing the classification scores at stage $r$ for the background and the object. $N_{cls}^{r}$ is the minibatch size of the $r$th RPN, and $N_{reg}$ is the number of anchor locations at the final stage. Additionally, a rejection mask $m^{r}$ is introduced in CAS-DA to evaluate whether a sample is simple or not; it is a one-dimensional binary tensor (0 represents an easy sample, 1 represents a nonsimple sample) whose length equals the number of anchor locations. Specifically, we set a threshold value (e.g., 0.99) at each stage; the $m_i^{r}$'s are all initialized to one, and $m_i^{r}$ is set to 0 if the classification score is greater than the threshold, otherwise it remains 1. The RPN randomly selects minibatch samples from the unrejected set for training. Writing the mask as the successive product $\prod_{k=1}^{r-1} m_i^{k}$ means that, once a sample is rejected at any cascade stage, it no longer has the opportunity to participate in the training of later stages. Intuitively, the classification score of deep features counts more than that of shallow features, so the weight $\alpha_r$ is introduced to control the weight of the loss at different cascade stages and ensure that losses from deeper stages receive more weight; specifically, the weight of the loss at a given stage is one tenth of that at the following stage. It can be seen that if $R = m_i^{k} = \alpha_r = 1$, the $L_{cls}^{r}$ term reduces to a classical cross-entropy loss. Moreover, it is worth noting that the above procedure filters both simple positive and simple negative samples simultaneously; thus, a few simple positive samples are also filtered at the early stages.
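To make the rejection mechanism concrete, the sketch below implements the stage-weighted classification loss of Equation (2) together with the mask update described above. It is a minimal sketch under stated assumptions: the per-stage scores are assumed to be foreground probabilities for every anchor location, the stage weights follow the one-tenth rule, and the random minibatch sampling from the unrejected set is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cascade_cls_loss(stage_probs, labels, threshold=0.99, num_stages=3):
    """Sketch of Equation (2) with the rejection mask m.

    stage_probs: list of (A,) tensors, per-anchor foreground probabilities
                 produced by RPN1..RPN3 (A = number of anchor locations);
                 names and shapes are assumptions made for clarity.
    labels:      (A,) binary ground-truth labels (1 = tree crown, 0 = background).
    """
    A = labels.shape[0]
    m = torch.ones(A)                      # rejection mask, initialized to one
    total = 0.0
    for r, probs in enumerate(stage_probs, start=1):
        alpha = 10.0 ** (r - num_stages)   # earlier stages weighted 1/10 of later ones
        # Per-anchor cross entropy, kept only for samples not yet rejected.
        # (The paper additionally samples a fixed-size minibatch from this set.)
        ce = F.binary_cross_entropy(probs, labels.float(), reduction="none")
        kept = m > 0
        if kept.any():
            total = total + alpha * ce[kept].mean()
        # Update the mask: confident (easy) positives and negatives are
        # rejected and excluded from the training of later stages.
        confident = torch.maximum(probs, 1.0 - probs) > threshold
        m = m * (~confident).float()
    return total
```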
3.2. Integration with Strong Weak Faster R-CNN
We took the state-of-the-art Strong Weak Faster R-CNN as our baseline detector. Since CAS-DA focuses on strengthening the learning ability of the RPN classifier, we only needed to add extra RPNs without making any changes to the other processes. The implementation details of CAS-DA are as follows. First, the structure and the data flow of RPN1 and RPN2 were designed to be similar to those of the RPN in the Strong Weak Faster R-CNN, i.e., RPN3 in Figure 1. To make sure that feature maps of the same size were available for the three RPNs, we added convolutional blocks composed of a 1 × 1 convolutional layer and an average pooling layer between the first two cascade RPNs and the feature extractor. Specifically, the convolutional layer output 1024 channels, the pooling layer output a 38 × 38 feature map, and finally a feature map of 38 × 38 × 1024 was fed to each RPN. For the convenience of joint training, the batch size of each stage, i.e., $N_{cls}^{r}$ ($r \in \{1,2,3\}$), was set to 1024, 512, and 256, respectively, the batch size of RPN3 being the same as that of the RPN in the Strong Weak Faster R-CNN. Furthermore, a one-dimensional tensor, i.e., the rejection mask $m$, was introduced to evaluate whether a sample had been rejected in previous stages.
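A possible shape of the adapter block is sketched below. The exact stride/kernel of the average pooling layer is not specified in the text, so the sketch uses adaptive average pooling to reach the 38 × 38 spatial size; the module name and the ResNet-101 channel counts (256 for Conv2_x, 512 for Conv3_x) are assumptions based on the standard architecture.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Maps an intermediate feature map to the 38 x 38 x 1024 shape expected
    by the cascade RPNs (illustrative sketch; names are assumptions)."""

    def __init__(self, in_channels, out_channels=1024, out_size=38):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # 1 x 1 conv
        self.pool = nn.AdaptiveAvgPool2d(out_size)                       # average pool to 38 x 38

    def forward(self, x):
        return self.pool(self.conv(x))

# For a standard ResNet-101, Conv2_x outputs 256 channels and Conv3_x outputs 512.
adapter_rpn1 = FeatureAdapter(in_channels=256)
adapter_rpn2 = FeatureAdapter(in_channels=512)
```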
Finally, we can train our CAS-DA in an end-to-end manner by backpropagation. The training loss $\mathcal{L}$ is designed to combine the detection loss and the domain adaptation loss, balanced by the trade-off parameter $\lambda_{da}$. Specifically:

$$\mathcal{L} = \mathcal{L}_{det} + L_{cas} + \lambda_{da}\,(\mathcal{L}_{loc} + \mathcal{L}_{global})$$

where $\mathcal{L}_{det}$ and $L_{cas}$ are both composed of a classification loss and a regression loss.
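Assembling the terms is then straightforward; the sketch below only shows how the pieces could be combined, with all individual loss tensors assumed to be produced elsewhere by the detector, the cascade RPNs, and the two domain discriminators.

```python
def cas_da_total_loss(loss_det, loss_cascade, loss_local, loss_global, lam_da=0.1):
    """Overall CAS-DA objective (a sketch); argument names are assumptions.
    loss_det:     Faster R-CNN detection loss (classification + regression).
    loss_cascade: cascade RPN loss, Equations (2)-(3).
    loss_local, loss_global: adversarial adaptation losses from the discriminators.
    """
    return loss_det + loss_cascade + lam_da * (loss_local + loss_global)
```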
3.3. A Filtering Strategy for Wrongly Detected Trees Based on Planting Rules
To further improve the precision of detection, we propose a filtering strategy based on the empirical planting rules of tree crowns, which can be applied as postprocessing during validation to effectively filter out trees wrongly detected by CAS-DA (false positives).
In large-scale tree-crown detection, we observed that most trees are planted together and distributed densely; trees planted individually or in clusters of two or three are rarely found. In other words, there should be several (at least two) other tree crowns around a tree crown within a certain range. Based on this prerequisite, we propose a tree-crown filtration strategy.
First, we calculate the center of each bounding box classified as a tree crown by CAS-DA. If there are C bounding boxes, a matrix with the shape of C × C is generated, whose entries represent the distances between pairs of centers. It is stipulated that a detected tree crown is marked as a false positive and dropped if fewer than two other detected trees lie within m pixels of it. As described above, tree crowns are arranged densely in remote sensing images, so m was set to the average size of a tree crown to obtain the best filtering effect. Specifically, m = 64 in datasets A and B (given that the size of a tree crown was about 64 × 64), and m = 75 in dataset C (given that the size of a tree crown was about 75 × 75). In Section 5.1.1, we explore the significance of this filtering strategy.
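The strategy reduces to a pairwise-distance computation followed by a neighbor count, as in the short sketch below; the function name, the box format, and the use of torch.cdist are assumptions, and the neighbor threshold follows the rule described above.

```python
import torch

def filter_isolated_crowns(boxes, radius=64.0, min_neighbors=2):
    """Drop detections with too few neighboring crowns (Section 3.3 sketch).

    boxes:         (C, 4) float tensor of [x1, y1, x2, y2] detections.
    radius:        neighborhood radius m in pixels (the average crown size).
    min_neighbors: a detection is kept only if at least this many OTHER
                   detections lie within `radius` pixels of its center.
    """
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0   # (C, 2) box centers
    dist = torch.cdist(centers, centers)            # (C, C) pairwise distances
    # Count neighbors within the radius, excluding the detection itself.
    neighbors = (dist <= radius).sum(dim=1) - 1
    keep = neighbors >= min_neighbors
    return boxes[keep], keep
```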
4. Experiments
4.1. Study Area and Dataset
In this research, the proposed method was applied to detect oil palms in remote sensing images from three different regions. As oil palm is one of the major tropical cash crops in the world, its detection and counting are of great significance to both the economy and ecology.
As shown in Figure 2, we obtained two high-resolution satellite images (i.e., images A and B) in Peninsular Malaysia [44]. These two images were acquired at different times and places with different equipment, and therefore they differ significantly in environmental conditions and resolution. Another remote sensing image (image C) was taken by a UAV in South Kalimantan, Indonesia [57]. Table 2 shows detailed information on the three images.
In images A and B, there were four kinds of samples: background, oil palm, other vegetation, and the impervious background. The training datasets were collected from four regions of these two images, respectively. To evaluate the performance of our proposed method, we chose another representative region in images A and B as the validation datasets and compared the detected results with the ground truth collected by manual annotation. For images from training and validation regions, a bilinear interpolation was first applied to resize them to 2400 × 2400 pixels, and then the enlarged images were cropped randomly to 500 × 500 pixels. Finally, we obtained 4718 samples in image A and 3782 samples in image B.
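As a concrete illustration of this preprocessing, the sketch below resizes an image with bilinear interpolation and cuts random 500 × 500 tiles; the function name and the number of tiles per image are assumptions, not the paper's exact pipeline.

```python
import random
from PIL import Image

def make_training_tiles(path, upsampled=2400, tile=500, n_tiles=10):
    """Bilinear upsampling followed by random cropping, mirroring the
    preprocessing described for images A and B (illustrative sketch)."""
    img = Image.open(path).resize((upsampled, upsampled), Image.BILINEAR)
    tiles = []
    for _ in range(n_tiles):
        x = random.randint(0, upsampled - tile)
        y = random.randint(0, upsampled - tile)
        tiles.append(img.crop((x, y, x + tile, y + tile)))
    return tiles
```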
In image C, in addition to the tree crowns, there were also rivers, buildings, and other vegetation. The training datasets were from one region, and the validation datasets were from another. Since the images were originally collected for growing-status observation by Jz et al. [57], there was more variety in crown size and appearance compared with images A and B, such as small palms and yellowed palms. We first unified their labels into oil palm and then split the training region into 3148 images and the validation region into 851 images of 1024 × 1024 pixels.
Figure 2 shows the location of our study area and examples of the training regions from the three locations. We can easily observe that the tree crowns are densely distributed and have low category diversity.
Table 3 shows information on the three datasets.
4.2. Experimental Setup and Evaluation Metrics
We applied the proposed method (CAS-DA + filtering) in two different domain-shift scenarios: (1) adaptations between images collected by satellites and UAVs, including dataset B → dataset C (B → C), dataset C → dataset B (C → B), dataset A → dataset C (A → C), and dataset C → dataset A (C → A); (2) adaptations between images collected by different satellites, including dataset A → dataset B (A → B) and dataset B → dataset A (B → A). Note that s → t above means adaptation from source domain to target domain.
We implemented our method based on PyTorch [58] on Ubuntu 18.04 using a GeForce GTX 1080 Ti. Our backbone network was ResNet-101 pretrained on ImageNet [59]. The network was trained with the backpropagation algorithm. The initial learning rate was 0.001, and it was divided by 10 every 50k iterations. Each batch included one image from the source domain and one image from the target domain. Conventionally, all images were resized to 600 × 600 after preprocessing. With this design, the feature maps from the three stages were resized to 38 × 38 × 1024 by the convolutional and pooling layers introduced in Section 3.2. For the cascade RPNs, the threshold for an easy sample was set to 0.99, and the loss-balancing weights in Equations (1) and (3) were both set to 1. According to the size of the tree crowns in our images, we defined the anchors to have widths of {64, 80} pixels and assigned only a 1:1 aspect ratio, considering that the shape of a tree crown is close to a square; this fixed the number of anchor locations in an image (i.e., the length of the rejection mask $m$) to 2888. For the loss function, we set the trade-off parameter $\lambda_{da}$ = 0.1. Readers can refer to [40] for further details of the implementation of the Strong Weak Faster R-CNN.
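The anchor count of 2888 follows directly from the 38 × 38 feature map, the two anchor widths, and the single 1:1 aspect ratio (38 × 38 × 2 = 2888). The sketch below generates such a square-anchor grid; the stride value is an assumption derived from the 600-pixel input size.

```python
import torch

def make_anchors(feat_size=38, stride=600 // 38, widths=(64, 80)):
    """Square anchors (1:1 aspect ratio) at every feature-map location.
    38 * 38 locations x 2 widths = 2888 anchors, matching the text."""
    ys, xs = torch.meshgrid(torch.arange(feat_size), torch.arange(feat_size),
                            indexing="ij")
    cx = (xs.flatten().float() + 0.5) * stride   # anchor centers in image coords
    cy = (ys.flatten().float() + 0.5) * stride
    anchors = []
    for w in widths:
        anchors.append(torch.stack([cx - w / 2, cy - w / 2,
                                    cx + w / 2, cy + w / 2], dim=1))
    return torch.cat(anchors, dim=0)

print(make_anchors().shape)   # torch.Size([2888, 4])
```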
In validation, only the images from the target domain were used. We set the maximum number of tree crowns in an image to 100 and used NMS as the postprocessing method before the proposed filtering strategy.
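The validation-time postprocessing can be sketched as follows; the NMS IoU threshold is an assumption (the text only specifies the score threshold of 0.5 and the cap of 100 detections), and the planting-rule filter from Section 3.3 would be applied to the boxes this function returns.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, score_thr=0.5, iou_thr=0.5, max_det=100):
    """Score filtering + NMS + top-100 cap, as used before the planting-rule
    filter (illustrative sketch; the NMS IoU threshold is an assumption)."""
    keep = scores > score_thr
    boxes, scores = boxes[keep], scores[keep]
    order = nms(boxes, scores, iou_thr)[:max_det]   # indices sorted by score
    return boxes[order], scores[order]
```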
We used true positives (TP), false positives (FP), false negatives (FN), precision, recall, and F1-score as evaluation indicators. TP is the number of crowns correctly detected. FP refers to other objects mistaken for crowns. FN denotes the number of crowns left undetected. In our validation, a detection with a probability score over 0.5 and an IoU with the ground truth higher than 0.5 was considered a TP. Precision indicates the proportion of correctly detected tree crowns among all detected tree crowns. Recall describes the proportion of correctly detected tree crowns among all ground-truth crowns. The F1-score balances these two indicators by computing their harmonic mean.
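The sketch below counts TP/FP/FN by matching detections to ground truth at IoU ≥ 0.5 and then computes the three metrics; the greedy matching scheme is an assumption and may differ in detail from the protocol actually used.

```python
import torch
from torchvision.ops import box_iou

def evaluate(pred_boxes, gt_boxes, iou_thr=0.5):
    """Greedy IoU matching -> TP/FP/FN -> precision, recall, F1 (sketch)."""
    matched_gt = set()
    tp = 0
    ious = box_iou(pred_boxes, gt_boxes) if len(gt_boxes) else None
    for i in range(len(pred_boxes)):
        if ious is None:
            break
        iou, j = ious[i].max(dim=0)          # best ground-truth match for detection i
        if iou >= iou_thr and j.item() not in matched_gt:
            matched_gt.add(j.item())
            tp += 1
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```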
4.3. Experiment Results
4.3.1. Experiments between Images Collected by Satellites and UAVs
Table 4 displays the results of our proposed method (CAS-DA + filtering) in the experiments between satellite images and UAV images. We can observe that our proposed method achieved 69.48%, 51.24%, 68.01%, and 87.06% in terms of F1-score on the four transfer tasks; the average F1-score was 68.95%.
We also compared our proposed method with other methods, including F-RCNN (source) [9], DA Faster R-CNN [39], Strong Weak Faster R-CNN [40], CROPTD [45], and SW-ICR-CCR [41]. The results are shown in Table 5 and Table 6. F-RCNN (source) denotes Faster R-CNN trained only with source images and tested on the target images. Intuitively, a better performance of F-RCNN (source) implies a smaller gap in the feature distribution from the source domain to the target domain, i.e., a smaller domain shift. The other four are representative DA methods. For fairness of comparison, the experimental setups of these methods were all as described in Section 4.2.
In this paper, the bold values in the table indicate the maximum value in the corresponding evaluation index unless noted otherwise.
Comparing the performance of these methods, a noteworthy phenomenon was the dramatic drop in performance, especially in recall, of Strong Weak Faster R-CNN and CROPTD in B → C, C → B, and C → A, with average F1-scores of only 19.13% and 35.67%, respectively. There are three possible key reasons for this: (1) As mentioned in Section 4.1, dataset C was the only dataset collected by UAV, and it exhibited a clear visual difference from the other two datasets in terms of image quality/resolution, texture, etc. The knowledge learned on image C could not be well generalized to images A and B, as evidenced by the poor results of Faster R-CNN (source) on C → B and C → A. On the contrary, Faster R-CNN (source) achieved good performance (even better than DA F-RCNN) on A → C, indicating that the knowledge learned in dataset A could be directly transferred to dataset C to a certain extent. (2) In image C, there were palms in unhealthy growth status, such as dead and yellowed ones, that did not exist in images A and B. Consequently, when C was taken as the target domain, these unhealthy palms were missed by the detector because of their difference from the healthy palms in the source domain in terms of texture, size, etc., resulting in a decrease in recall. When C was taken as the source domain, these outlier features were prone to negative transfer, hurting the transferability of the model. (3) CROPTD is inherited from Strong Weak Faster R-CNN. Compared with the other DA methods, both eliminate instance-level alignment, which also causes a performance decline, especially when there are many background samples similar to the object.
Our proposed method ranked at or near the top of the listed DA methods in terms of both precision and recall, thus achieving the highest F1-score in all four transfer experiments. Compared with Strong Weak Faster R-CNN (baseline), our method showed a greater robustness and improved the F1-score on the four experiments by 5.61%~58.75%.
To present the experimental results more clearly, Table 7 shows the F1-scores of all methods in the four experiments. Our proposed method achieved the best performance, with 68.95% in terms of average F1-score, outperforming DA F-RCNN, Strong Weak Faster R-CNN, CROPTD, and SW-ICR-CCR by an obvious margin of 11.88%, 40.00%, 12.36%, and 26.01%, respectively.
We divided the experimental results from Table 7 into two groups: adaptation between dataset B and dataset C (i.e., B ↔ C, including B → C and C → B), and adaptation between dataset A and dataset C (i.e., A ↔ C, including A → C and C → A). We noticed that the performance of Faster R-CNN (source) in B ↔ C was clearly worse than that in A ↔ C. Combined with the previously mentioned results, we can conclude that the domain shift between B and C was more significant than that between A and C. This was also confirmed by the experimental results of the DA methods with C as the source domain: all of the DA methods obtained their worst performance in C → B, where our method achieved an F1-score of only 51.24%, whereas in C → A they performed much better, and our method achieved an F1-score of 87.06%.
Analyzing these two groups of experiments from another perspective, it can be seen that our method brought a more remarkable improvement to the adaptation with the larger domain shift. Specifically, compared with the other four DA methods, our method improved the average F1-score on B ↔ C by 11.32%~45.82%, versus 10.97%~32.18% on A ↔ C. This suggests that our method, which focuses on improving the classification capability on the source domain, is more effective than the other DA methods, which focus on diminishing the domain shift across common benchmarks, especially in the presence of a large domain shift.
4.3.2. Experiments between Images Collected by Different Satellites
To further verify the effectiveness of the proposed method, we performed adaptation between the two high-resolution satellite images. The experimental results are shown in Table 8, Table 9 and Table 10.
The TP, FP, FN, precision, recall, and F1-score for A → B and B → A are listed in Table 8. Our proposed method achieved an F1-score of 82.95% in A → B and 94.70% in B → A.
As shown in Table 9 and Table 10, not all DA methods outperformed F-RCNN (source); for example, the performance of DA F-RCNN on A → B was worse than that of F-RCNN (source). Our proposed method achieved the best performance on both transfer tasks, with F1-scores 3.29 and 2.16 percentage points higher than Strong Weak Faster R-CNN (baseline) in the two experiments. The average F1-score was 88.83%, outperforming the other DA methods by 5.02%, 2.73%, 0.80%, and 0.50%. Furthermore, the good performance of F-RCNN (source) implied a smaller domain shift between A and B than in the previous adaptation scenario.
To summarize Section 4.3.1 and Section 4.3.2: (1) The performance of all DA methods between different satellite images (as shown in Table 10) was superior to that between satellite images and UAV images (as shown in Table 7), which might be because the significant domain shift in the latter made it more difficult for the DA detector to learn domain-invariant features. (2) Compared with the other DA methods, which focus on diminishing the domain shift, our method obtained the highest F1-score in two different adaptation scenarios, which demonstrates the importance and effectiveness of improving detection performance on the source domain to enhance the DA detector. Moreover, it is worth noting that our method brought a greater improvement for the adaptation with the larger domain shift.
6. Conclusions
In this paper, we proposed an end-to-end DA detector with cascade RPNs (i.e., CAS-DA) to realize cross-regional tree-crown detection and counting. To deal with the problems of a poor abundance of object classes and the data imbalance in tree-crown detection, the cascade RPNs were designed to adopt multiple region proposal networks to filter out easy samples so that the learning ability of the deeper classifiers is gradually enhanced. Then, the adaptation components and the detector were integrated to form an end-to-end framework for cross-regional detection. In addition, a practical filtering method was proposed according to the observed tree-crown distribution rules to effectively eliminate wrongly detected trees. Experiments in two different adaptation scenarios showed that our method achieved average F1-scores of 68.95% and 88.83%, respectively, significantly outperforming the other DA approaches that focus on diminishing the domain shift across common benchmarks, showing its effectiveness in cross-domain tree-crown detection using remote sensing images. In particular, our method obtained a greater performance boost for the adaptation with the larger domain shift. From the experimental results, we can conclude that, in tree-crown detection, it is more effective to improve the detection performance on the source domain than to diminish the domain shift, especially when confronted with a significant domain shift.
Moreover, our method has the potential to be applied to other DA object detection tasks with similar characteristics, overcoming the scarcity of labeled data and the difficulty traditional deep learning methods have in transferring across different domains. For example, in the remote sensing field, it can be directly applied to the cross-regional growing-status observation of the same tree species after labeling the source domain with fine-grained labels. Furthermore, fault detection in different environments, such as the fault detection of mechanical equipment, high-voltage lines, etc., is also a possible scenario, where the forms of the fault are few and the image background is very complex, limiting the performance of traditional RPN-based detectors. Introducing DA methods with cascade RPNs could be beneficial by saving labeling costs and locating faults more precisely.
Nevertheless, there is still much room for improving the performance of cross-domain detection between satellite images and UAV images due to the great difference in styles, textures, etc. Therefore, we hope to build a more effective cross-domain tree-crown detector for this adaptation in the future.