Learning Lightweight and Superior Detectors with Feature Distillation for Onboard Remote Sensing Object Detection

Abstract: CubeSats provide a low-cost, convenient, and effective way of acquiring remote sensing data, and have great potential for remote sensing object detection. Although deep learning-based models have achieved excellent performance in object detection, they suffer from the problem of numerous parameters, making them difficult to deploy on CubeSats with limited memory and computational power. Existing approaches attempt to prune redundant parameters, but this inevitably causes a degradation in detection accuracy. In this paper, the novel Context-aware Dense Feature Distillation (CDFD) is proposed, guiding a small student network to integrate features extracted from multi-teacher networks to train a lightweight and superior detector for onboard remote sensing object detection. Specifically, a Contextual Feature Generation Module (CFGM) is designed to rebuild the non-local relationships between different pixels and transfer them from teacher to student, thus guiding the student to extract rich contextual features to assist in remote sensing object detection. In addition, an Adaptive Dense Multi-teacher Distillation (ADMD) strategy is proposed, which performs adaptive weighted loss fusion between the student and multiple well-trained teachers, guiding the student to integrate helpful knowledge from multiple teachers. Extensive experiments were conducted on two large-scale remote sensing object detection datasets with various network structures; the results demonstrate that the trained lightweight network achieves promising performance. Our approach also shows good generality for existing state-of-the-art remote sensing object detectors. Furthermore, through experiments on a large general object dataset, we demonstrate that our approach is equally practical for general object detection distillation.


Introduction
With the launch of numerous CubeSats, nanosatellites, and microsatellites, the cost of obtaining and processing remote sensing data has decreased dramatically [1], which has greatly aided the advancement of Earth observation technology. Remote sensing object detection is an important task in Earth observation, referring to identifying and locating specific objects in satellite images. It is essential in various missions such as ocean monitoring, disaster prevention, and environmental monitoring [2].
In recent years, deep learning-based object detection approaches have shown impressive performance in general scenes [3][4][5]. However, compared to ordinary scenes on the ground, remote sensing scenes include complicated backgrounds, dramatic changes in shooting angle and illumination, sharp scale changes, and small object sizes, making it difficult to obtain sufficient information about targets to perform accurate detection.
In order to obtain better performance, more complex network architectures are applied to remote sensing object detection tasks; however, these suffer from a massive number of parameters and excessive consumption of computational resources. Due to the limited memory and computational power of CubeSat onboard platforms, it is difficult to deploy large-scale deep networks on them [6].
To deploy neural networks on small satellites, many studies have been conducted in both the software and hardware dimensions. For the hardware dimension, Manning et al. [7] deployed Convolutional Neural Networks (CNNs) on FPGAs to classify images captured by the ISS SHREC platform. Arechiga et al. [8] implemented CNNs on the Nvidia Jetson TX1 for onboard image processing on small satellites. Bappy et al. [9] proposed onboard deep neural computation and machine learning models to analyse and process multiple spectral images for the 3U CubeSat. Giuffrida et al. [10] deployed the CloudScout CNN on the Myriad 2 vision processing unit for cloud detection on hyperspectral images. These studies explore the possibility of deploying CNNs on small satellites.
For the software dimension, previous work has focused on developing efficient and lightweight deep models. For example, MobileNets [11] designs depthwise separable convolutions with few channels and small convolution kernels to reduce the number of network parameters. Parameter pruning [12] reduces the model size by removing unnecessary parameters from the deep neural network. Although the above approaches are able to compress the model and improve detection speed, they more or less lead to a reduction in model accuracy.
In recent years, knowledge distillation has received increasing attention. It refers to transferring information from a large teacher network into a lightweight student network, thereby enhancing the performance of the lightweight network. Depending on the location of the distillation, it can be divided into two categories. The first is logits-based distillation [13,14], where teachers are distilled at the output level, while the second is feature-based distillation [15,16], where teachers are distilled at the intermediate feature layers. Compared to the logits-based approaches, feature-based distillation has demonstrated advantages for various tasks.
In general object detection scenes, most feature-based distillation forces students to mimic the teacher's output as closely as possible, as the teacher's features are representative. However, previous work [2,17] has demonstrated that contextual features can compensate for the limited information available on remote sensing objects and effectively assist in detecting small objects, thus playing an important role in remote sensing detection tasks. Therefore, this paper explores distilling both the pixel distribution and the contextual relationships in the teacher's feature maps.
To explore the differences in student instruction with different teachers, we visualised the output of the middle feature layer of three teacher networks. As shown in Figure 1, it is clear that different teachers have different regions of interest. Therefore, learning the features from different teachers is vital for improving detection accuracy.
Based on the above observations, we propose a novel feature distillation approach, which considers contextual features and integrates knowledge from multiple teachers, to train lightweight and advanced detectors for onboard remote sensing object detection. We name this approach Context-aware Dense Feature Distillation (CDFD).
Specifically, multiple two-stage teachers are first trained offline and their weights are frozen at the end of training; the teachers' neck and head parameters are then inherited by the student. Then, the student's training starts. To address the problem of scarce information on small objects in remote sensing images, a Contextual Feature Generation Module (CFGM) is proposed to learn context from the teacher's intermediate layer output as complementary information to enhance the student's capability to detect small objects. In addition, an Adaptive Dense Multi-teacher Distillation (ADMD) strategy is proposed, which performs adaptive weighted loss fusion between the student and multiple well-trained teachers to combine knowledge from multiple teachers and improve the student's feature representation. Extensive experiments were conducted on remote sensing datasets; the results show that our CDFD is able to reproduce or even exceed the performance of large models with a small model without any additional computation. In summary, the main contributions of this paper are the proposed CFGM, the proposed ADMD strategy, and extensive experiments demonstrating the effectiveness and generality of CDFD.

Related Works

CubeSats
CubeSats are small, low-mass satellites that can provide data and experimental platforms for scientific research at low cost. CubeSats were first proposed by Stanford University in 1999 and, over the past 20 years, have been developed considerably, with a wide range of applications in Earth remote sensing [18]. For example, Spire has deployed a constellation of LEO CubeSats called Lemur-2 for weather prediction and ship tracking missions [18]. Planet Labs has employed about 300 CubeSats to collect images at 3-5 m resolution for studies such as water tracking [19], vegetation monitoring [20], glacier investigation [21], permafrost monitoring [22], etc. The Dove Cluster [20,23] carries payloads including optical telescopes and high-resolution cameras to conduct Earth surface imaging, and the data obtained can be applied in the field of machine learning. Typical CubeSats have between 32 KB and 8 MB of onboard memory [6]; some CubeSats can carry up to 8 GB of additional flash memory [24], but still cannot store excessive amounts of data. In addition, CubeSats have limited downlink capability, with most of them having a data transfer rate of 9600 bps [6]. Deploying lightweight object detection models on CubeSats for onboard real-time object recognition can alleviate these memory and communication problems.

Remote Sensing Object Detection
Most existing deep learning-based remote sensing object detection approaches are transferred from approaches designed for natural scene images. However, remote sensing images are very different from natural scene images, especially in terms of small objects, scale variations, and complex backgrounds. In order to obtain better detection performance, most studies use region proposal-based approaches [5,25,26] (also known as two-stage approaches) to detect remote sensing objects [27]. For example, the cross-scale fusion strategy [28,29] enhances the feature representation of objects by fusing features across layers and improves the detection of small objects. The attention module [17,30,31] models long-range dependence and is used to improve the model's representation of spatially non-local features. The frequency-domain convolution module [2] is used to extract global features as additional information to assist in the detection of remote sensing objects. Although the above approaches achieve excellent detection accuracy, they are all built on two-stage networks, which have the disadvantages of high computational complexity and high computational resource consumption.

Knowledge Distillation
The core idea of knowledge distillation is to learn small student models from large teacher models and thus achieve competitive performance [32]. In general, a knowledge distillation system consists of three key components [33]: knowledge types, distillation strategies, and teacher-student architectures. Depending on the type of knowledge, knowledge distillation approaches can be divided into logits-based knowledge distillation [13,14] and feature-based knowledge distillation [15,16]. Logits-based knowledge distillation enables students to directly imitate the final predictions of the teacher model, and the student model trained by this approach generally relies on the output of the last layer. Feature-based knowledge distillation uses the intermediate network layer features of the teacher model as knowledge to supervise the training of the student model; this approach solves the supervision problem of the intermediate layers of the teacher model. Clearly, the feature-based distillation strategy enables students to learn a multi-layered feature representation. Therefore, this strategy is widely used in computer vision tasks, such as image classification [34,35], image segmentation [36,37], action recognition [38,39], and object detection [40,41]. Different teacher networks are able to provide their own useful knowledge to student networks, and the multi-teacher distillation strategy is effective for training student models [39,42].

Context-Aware Dense Feature Distillation
In order to fully exploit multi-scale information in remote sensing images, detectors typically utilise two-stage networks and feature pyramid networks [43], since object detection in remote sensing tasks demands excellent detection accuracy [2,29]. Based on this premise, we propose a novel approach named Context-aware Dense Feature Distillation (CDFD) for such detectors, as illustrated in Figure 2. First, the input images pass through the teacher and student backbone networks for feature extraction, and multi-scale features P1-P4 are then obtained through the neck. Then, the student network is divided into two branches. In one branch, the student features pass through the head network, which then computes the detection loss against the ground truth.
In the other branch, the alignment layer is applied to align the student feature maps with the teacher feature maps; then, the Contextual Feature Generation Module (CFGM) is employed to rebuild the non-local relationships between different pixels and transfer them from teacher to student. Finally, Adaptive Dense Multi-teacher Distillation (ADMD) uses adaptive weights to balance the loss terms of each student-teacher pair.
The two main innovations of our approach are the Contextual Feature Generation Module (CFGM) and the Adaptive Dense Multi-teacher Distillation (ADMD) strategy, which are described in detail in Sections 3.2 and 3.3, respectively. In Section 3.4, the loss function of CDFD is described in detail.

Contextual Feature Generation Module
In remote sensing object detection tasks, small networks usually struggle to extract sufficient features. Non-local operations [30] have been shown to be significant for remote sensing object detection tasks by modelling the long-range dependence of different spatial locations [27]. As a result, existing detectors [2,29] introduce surrounding context as additional information about the remote sensing object to improve detection performance. Inspired by this, we designed a Contextual Feature Generation Module (CFGM) to help the student network extract global features, the framework of which is shown in Figure 3. Specifically, for the original student feature map S_i ∈ R^(H×W×C) in the i-th stage, the query and key are defined as Q = S_i and K = S_i, and S_i is transformed into the value V by the embedding matrix W_v. A k × k group convolution is then employed on all the neighbouring keys within each k × k grid to achieve a contextual representation of each key; thus, the contextual key K^1 incorporates the contextual information of each k × k grid and is called the local contextual representation:

K^1 = W_Bδ(K),

where W_Bδ represents the convolution-batch normalisation-activation layer. In this paper, the convolution kernel is 3 × 3 and the activation function is ReLU. Then, K^1 and Q are concatenated in the channel dimension and downscaled by a 1 × 1 convolution-normalisation-activation layer:

M = W_Bδ([K^1, Q]),

where M ∈ R^(H×W×D) and D = 2C/factor; factor equals 4 in this paper. An attention matrix is then obtained by an activation-free convolution operation:

A = W_θ(M),

where W_θ represents the convolution without the activation function, A ∈ R^(H×W×(k×k×C_h)) refers to the enhanced spatial-aware local relationship matrix, and C_h is the head number. Thus, each attention matrix A_j represents the j-th two-dimensional relative position embedding within each k × k grid and is shared among all C_h heads. For each head, the local attention matrix at each spatial location of A is learned based on the query features and the key features of the context, rather than isolated query-key pairs. Next, the attention matrix B is obtained by normalising A and performing a softmax operation on each head along the channel dimension.
Then, we obtain the global context G by aggregating all the values V based on the contextual attention matrix B:

G = V ⊛ B,

where ⊛ denotes the local matrix multiplication operation that measures the pair-wise relations between each query and the corresponding keys within the local k × k grid in space; thus, global context information is obtained. Then, the local context K^1 and the global context G are fused to obtain the student feature map S'_i. Next, the student feature map attempts to generate the teacher's feature map by two convolution operations, which can be formulated as:

T̂_i = W_i2(σ(W_i1(f_align(S'_i)))),

where S'_i is the learned student feature map in the i-th stage and f_align denotes the align layer, which is a 1 × 1 convolutional layer with the same number of input channels as the student feature map and the same number of output channels as the teacher feature map. σ is the ReLU activation function and W_i1 and W_i2 are 3 × 3 convolution layers. T̂_i approximates T_i, the teacher feature map in the i-th stage.
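The pipeline above (local contextual key, concatenated query-key reduction, activation-free attention, local aggregation of values, and teacher-map generation) might be sketched in PyTorch as follows. This is a simplified single-head interpretation under our own assumptions — the layer names, the group count of 4, and the unfold-based local matrix multiplication are ours, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFGM(nn.Module):
    """Sketch of a CFGM-style block (k=3, factor=4 as in the paper)."""
    def __init__(self, c_student, c_teacher, k=3, factor=4):
        super().__init__()
        self.k = k
        # value embedding W_v (1x1 conv)
        self.w_v = nn.Conv2d(c_student, c_student, 1, bias=False)
        # local contextual key K1: k x k group conv + BN + ReLU (W_Bd)
        self.key_embed = nn.Sequential(
            nn.Conv2d(c_student, c_student, k, padding=k // 2,
                      groups=4, bias=False),
            nn.BatchNorm2d(c_student), nn.ReLU(inplace=True))
        # downscale concat([K1, Q]) with a 1x1 conv-BN-ReLU
        d = 2 * c_student // factor
        self.reduce = nn.Sequential(
            nn.Conv2d(2 * c_student, d, 1, bias=False),
            nn.BatchNorm2d(d), nn.ReLU(inplace=True))
        # activation-free conv producing k*k attention logits (W_theta)
        self.attn = nn.Conv2d(d, k * k, 1)
        # 1x1 alignment to the teacher's channel width, then two 3x3 convs
        self.align = nn.Conv2d(c_student, c_teacher, 1)
        self.gen = nn.Sequential(
            nn.Conv2d(c_teacher, c_teacher, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_teacher, c_teacher, 3, padding=1))

    def forward(self, s):
        b, c, h, w = s.shape
        k1 = self.key_embed(s)                       # local context K1
        v = self.w_v(s)                              # values V
        a = self.attn(self.reduce(torch.cat([k1, s], dim=1)))
        b_attn = F.softmax(a, dim=1)                 # normalise over k*k
        # aggregate each k x k neighbourhood of V with the attention map
        v_unf = F.unfold(v, self.k, padding=self.k // 2)  # (b, c*k*k, h*w)
        v_unf = v_unf.view(b, c, self.k * self.k, h * w)
        g = (v_unf * b_attn.view(b, 1, self.k * self.k, h * w)).sum(2)
        g = g.view(b, c, h, w)                       # global context G
        fused = k1 + g                               # fuse local + global
        return self.gen(self.align(fused))           # mimic teacher map
```

The output has the teacher's channel width, so it can be compared directly to the teacher feature map with an MSE-style loss during distillation.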

Adaptive Dense Multi-Teacher Distillation Strategy
As shown in Figure 1, different teacher networks have different regions of interest; therefore, compared to mono-teacher models, multi-teacher models can contribute knowledge from multiple teachers to students and provide more useful knowledge. Figure 4a,b display two generic frameworks for sparse mono-teacher distillation and sparse multi-teacher distillation. Figure 4c shows dense mono-teacher distillation, while Figure 4d illustrates our proposed dense multi-teacher distillation scheme. It expands upon the original multi-teacher scheme by introducing the idea of dense distillation, i.e., knowledge transfer at different feature levels.
In the field of object detection, there is wide agreement that features at different levels have distinct advantages [43,44]. Deep features are more important for recognising large objects, since they have wider receptive fields and more semantic information than shallow features. On the other hand, shallow features are better at localising small objects because they retain more spatial information. Guided by this view, our proposed approach transfers knowledge from different teachers and different stages to the student model through a dense distillation scheme, which ultimately improves the detection accuracy of the student detector.
The straightforward approach to distilling knowledge from multiple teachers is to use the average response of all teachers as the supervision signal [32]. However, this approach is clearly oversimplified and cannot effectively utilise the unique knowledge of different teachers. To perform knowledge transfer from multiple teachers more efficiently, an adaptive dense multi-teacher distillation strategy based on an adaptive weighted loss is proposed. Specifically, the feature representation learning of the student network is achieved through a well-designed distillation function.
For feature-based knowledge transfer, the distillation loss L_Dis can be formulated as

L_Dis = L_F(f_t(x), f_s(x)),

where f_t(x) and f_s(x) are the feature maps of the intermediate layers of the teacher and student models, respectively, and L_F(·) is a similarity function that matches the feature maps of the teacher and student models. The similarity function L_F(·) adopts the weighted Mean Squared Error (MSE) loss in this paper. Figure 5 depicts the procedure of our adaptive dense multi-teacher distillation strategy. It is important to note that the weight values are dynamically adjusted by network learning rather than being fixed. However, since the weights are unbounded, they increase the possibility of training instability. Therefore, to constrain the value range of each weight, we resort to weight normalisation. Specifically, a softmax-based weighted loss function is designed for L_F(·), which can be formalised as:

L_Dis = Σ_{i=1}^{M} Σ_{j=1}^{N} softmax(w_ij) · MSE(f_t^{j,i}(x), f_s^{i}(x)),

where i indexes the i-th stage feature and M is the number of stages, j indexes the j-th teacher model and N is the number of teachers, and w_ij is the learnable weight assigned to the j-th teacher at the i-th stage.
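The softmax-normalised weighted loss described above might be implemented as follows. This is a sketch: the choice of normalising over all stage-teacher pairs jointly (rather than per stage) is our assumption, and the feature maps are assumed to be already aligned in shape:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ADMDLoss(nn.Module):
    """Sketch of the adaptive dense multi-teacher distillation loss:
    one learnable weight per (stage, teacher) pair, normalised with a
    softmax so the weights stay bounded during training."""
    def __init__(self, n_stages, n_teachers):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_stages, n_teachers))

    def forward(self, student_feats, teacher_feats):
        # student_feats: list of M stage features (already aligned)
        # teacher_feats: teacher_feats[j][i] = stage i of teacher j
        weights = F.softmax(self.w.view(-1), dim=0).view_as(self.w)
        loss = 0.0
        for i, s in enumerate(student_feats):
            for j, t_stages in enumerate(teacher_feats):
                loss = loss + weights[i, j] * F.mse_loss(s, t_stages[i])
        return loss
```

Because the weights are produced by a softmax over learnable logits, the network can emphasise the most helpful teacher at each stage while the total weight mass stays fixed.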

Overall Loss
With the proposed distillation loss L_Dis for CDFD, all the models are trained with the total loss as follows:

L = L_Det + α · L_Dis,

where L_Det denotes the detection loss for the original student models and α is a hyper-parameter to balance the two losses. In this paper, the hyper-parameter α is equal to 5 × 10^-7, and the detection loss includes the classification loss and the regression loss [5]:

L_Det({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p*_i) + λ (1/N_reg) Σ_i p*_i L_reg(t_i, t*_i),

where i refers to the index of an anchor in a mini-batch, p_i denotes the predicted probability of anchor i being an object, and p*_i is the ground truth label. t_i denotes a vector representing the four parameterised coordinates of the predicted bounding box, while t*_i is the ground truth box associated with a positive anchor. N_cls and N_reg are normalisation terms and λ is a balancing weight, following [5]. L_cls is the classification loss and L_reg is the regression loss. For the classification loss, the cross-entropy loss function is used:

L_cls(p_i, p*_i) = -[p*_i log(p_i) + (1 - p*_i) log(1 - p_i)].

For the regression loss, the smooth L1 function is used:

L_reg(t_i, t*_i) = smooth_L1(t_i - t*_i), where smooth_L1(x) = 0.5x² if |x| < 1 and |x| - 0.5 otherwise.

In this paper, the calculation of the detection loss is unchanged and the distillation loss is only calculated on the feature maps, which can be obtained from the neck of the detector. Therefore, our CDFD is applicable to different detectors.
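A sketch of how the overall objective could be assembled, with α = 5 × 10^-7 as stated above. The anchor sampling and normalisation details of the actual Faster RCNN loss are omitted for brevity, so this is an illustration of the loss structure rather than the full detector loss:

```python
import torch
import torch.nn.functional as F

ALPHA = 5e-7  # balancing hyper-parameter alpha from the paper

def total_loss(cls_logits, cls_targets, box_preds, box_targets,
               pos_mask, distill_loss):
    """L = L_Det + alpha * L_Dis, where L_Det combines cross-entropy
    classification and smooth-L1 regression on positive anchors."""
    l_cls = F.cross_entropy(cls_logits, cls_targets)
    if pos_mask.any():
        l_reg = F.smooth_l1_loss(box_preds[pos_mask],
                                 box_targets[pos_mask])
    else:
        # no positive anchors: zero regression term, keep the graph
        l_reg = box_preds.sum() * 0.0
    return l_cls + l_reg + ALPHA * distill_loss
```

Because the distillation term only touches neck feature maps, swapping in a different head leaves this combination unchanged, which is what makes CDFD detector-agnostic.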

Dataset
Since remote sensing datasets from CubeSats have not yet been released, our experiments were conducted on two public remote sensing datasets from regular satellites, DIOR [27] and DOTA [45]. In addition, to verify the effectiveness of CDFD on general object detection tasks, we conducted distillation experiments on the classical general object dataset COCO [46]. DIOR [27] is a large publicly available dataset for remote sensing image object detection, derived from Google Earth, with an image size of 800 × 800 pixels and a spatial resolution range of 0.5 to 30 m. Each instance is labelled using a horizontal bounding box. The DIOR dataset contains 23,463 images and 192,472 annotated instances, covering 20 object categories: airplane, airport, baseball field, basketball court, bridge, chimney, dam, expressway service area, expressway toll station, golf field, ground track field, harbor, overpass, ship, stadium, storage tank, tennis court, train station, vehicle, and windmill. The official guideline for splitting the DIOR dataset was followed in this paper, i.e., 11,725 remote sensing images as the training set and the remaining 11,738 images as the test set.
DOTA [45] is another large multi-scale optical remote sensing dataset. The data sources are mainly Google Earth and the JL-1 and GF-2 satellites of the China Resources Satellite Data and Applications Centre, with image sizes ranging from 800 × 800 to 4000 × 4000 pixels. In this paper, the DOTA-v1.5 dataset with horizontal bounding box annotations was used. Its training set consists of 1411 remote sensing images and its validation set consists of 458 remote sensing images. There are more than 400,000 object instances with 16 object classes: container crane, baseball field, basketball court, bridge, surface runway, harbour, helicopter, large vehicle, aircraft, roundabout, ship, small vehicle, football field, storage tank, swimming pool, and tennis court. Before training and testing, all images were cropped by sliding windows with a size of 800 × 800 and an overlap of 200. After the crop operation, our training set included 15,749 images and the test set included 5297 images.
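The 800 × 800 sliding-window crop with an overlap of 200 can be sketched as follows. This is a simplified tiling that clamps the last window to the image border; the exact border handling of the authors' pre-processing is an assumption:

```python
def sliding_windows(width, height, size=800, overlap=200):
    """Tile a large image into size x size crops with the given
    overlap, adding a final window flush with each border so the
    whole image is covered."""
    step = size - overlap
    xs = list(range(0, max(width - size, 0) + 1, step))
    ys = list(range(0, max(height - size, 0) + 1, step))
    if xs[-1] + size < width:
        xs.append(width - size)   # clamp last column to the border
    if ys[-1] + size < height:
        ys.append(height - size)  # clamp last row to the border
    return [(x, y, x + size, y + size) for y in ys for x in xs]
```

For a 4000 × 4000 DOTA image this yields a 7 × 7 grid of crops whose union covers every pixel, with 200-pixel overlaps so objects cut by one window boundary appear whole in a neighbour.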
COCO [46] is one of the most widely used object detection datasets. It was created by collecting images of complex, everyday scenes containing common objects in their natural environment. Objects are labelled using per-instance segmentations to aid accurate object localisation. The COCO object detection dataset contains 80 object categories, with a total of 200,000 images. Following the official guideline, we used the default 120,000 images for training and 80,000 images for testing.

Evaluation Metrics
In order to quantitatively analyse the object detection results of the proposed approach, the final models were evaluated along two dimensions: detection accuracy and computational complexity. Detection accuracy uses the mean average precision over all categories (mAP) as a metric, which is calculated from the precision and recall:

P = TP / (TP + FP),  R = TP / (TP + FN),

where TP, FP, and FN refer to true positives, false positives, and false negatives, respectively. The precision describes the ability of the model to detect the correct objects, while the recall describes the ability of the model to find all objects. The precision-recall curve (PRC) can be drawn from the recall and precision, and the average precision is the area under the PRC. The formulae for calculating average precision and mean average precision are as follows:

AP = ∫_0^1 P(R) dR,  mAP = (1/N) Σ_{i=1}^{N} AP_i,

where AP, mAP, P, and R denote average precision, mean average precision, precision, and recall, respectively, while N is the number of object categories. Model size and giga floating-point operations (GFLOPs) were also used as metrics to evaluate the computational complexity of the model.
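The metric definitions above can be sketched in a few lines. The AP helper uses the all-points interpolation that is common in detection benchmarks; whether the paper's evaluation uses all-points or 11-point interpolation is not stated, so this is an illustrative choice:

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-points
    interpolation): make precision monotonically decreasing from
    the right, then sum rectangle areas between recall points."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class AP over the N categories."""
    return sum(ap_per_class) / len(ap_per_class)
```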

Implementation Details
All experiments were conducted on four RTX TITAN GPUs (24 GB) with CUDA 10.2 and CuDNN 7.6.5 acceleration. The SGD optimiser was used, with an initial learning rate of 0.02, momentum of 0.9, and weight decay of 0.0001.
For the remote sensing datasets DIOR and DOTA, training lasted for 12 epochs; in the first 500 iterations, we used a warm-up approach to adjust the learning rate from 0.001 to 0.02. In the eighth and eleventh epochs, the learning rate was set to 0.1 times its previous value. The general object dataset COCO was trained on for a total of 24 epochs; the learning rate was set to 0.1 times its previous value in the 16th and 22nd epochs, and the other settings were the same as for the remote sensing datasets. The three teachers were trained using pre-trained weights from ImageNet, and all new layers were initialised using Kaiming normal initialisation. The three teacher networks were trained first, and their weights were frozen at the end of training. When training the student network, an inheritance strategy [47,48] was used to initialise the student with the teacher's neck and head parameters, so that students are trained with the same head structure. All experiments were based on the MMDetection toolbox [49] without any modifications.
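The learning-rate schedule described above (linear warm-up over the first 500 iterations, then ×0.1 step decay at fixed epochs) can be sketched as a plain function; linear interpolation during warm-up is an assumption, since the paper does not state the warm-up shape:

```python
def learning_rate(iteration, epoch, base_lr=0.02, warmup_iters=500,
                  warmup_start=0.001, decay_epochs=(8, 11), gamma=0.1):
    """Warm up linearly from 0.001 to 0.02 over the first 500
    iterations, then multiply by 0.1 at each listed epoch
    (8 and 11 for DIOR/DOTA; pass (16, 22) for COCO)."""
    if iteration < warmup_iters:
        frac = iteration / warmup_iters
        return warmup_start + (base_lr - warmup_start) * frac
    lr = base_lr
    for e in decay_epochs:
        if epoch >= e:
            lr *= gamma  # cumulative step decay
    return lr
```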

Comparison of Object Detection Results
Table 1 compares the performance of our student networks with other detectors on the DIOR and DOTA datasets. The classical detectors, i.e., Faster RCNN [5], YOLOv3 [50], RetinaNet [51], and the three teacher models are compared in this experiment. Both the students and the three teacher networks used Faster RCNN with feature pyramid networks [43]. The backbones of the two students were ResNet18 and ResNet50 [52]; the backbones of the three teacher networks were ResNet101 [52], ResNext101 [53] with 32 groups (ResNext32), and ResNext101 with 64 groups (ResNext64). The backbone of YOLOv3 was Darknet53, and the backbone of RetinaNet was ResNet101. The experimental results show that the performance of the student networks was significantly improved with the use of CDFD across datasets and backbones. For example, Faster RCNN-ResNet50 obtained a state-of-the-art performance of 73.1% mAP on the DIOR dataset, which not only improved the mAP by 2.5% compared to the original network, but also outperformed all the teacher networks. Furthermore, in the Faster RCNN-ResNet50 setting on the DOTA dataset, the student detector trained with CDFD achieved performance (58.0%) comparable to the best teacher detector (58.1%), while greatly reducing model parameters and computation time. Faster RCNN-ResNet18 was the most lightweight model of all the detectors; with CDFD training, this student detector obtained mAPs of 70.9% and 56.9% on the DIOR and DOTA datasets, respectively, improving the mAP by 2.5% and 8.5% over the original network.
Notably, with the CDFD training, the student detector even outperformed most of the teacher detectors, demonstrating that the student detectors obtained better features by learning the teacher's context and comprehensive knowledge.

Ablation Studies
To further demonstrate the effectiveness of the proposed CFGM and ADMD, ablation experiments were conducted on the DIOR and DOTA datasets. Both the teacher and student networks were Faster RCNNs with feature pyramid networks, and the three teacher backbones were ResNet101, ResNext32, and ResNext64, respectively. The student backbones were ResNet50 and ResNet18. The results are shown in Table 2. When the CFGM was introduced on the DIOR dataset, the mAP of the student networks Faster RCNN-Res50 and Faster RCNN-Res18 improved from the original 70.6% and 68.4% to 72.3% and 70.6%, respectively. As for the DOTA dataset, the mAPs of Faster RCNN-Res50 and Faster RCNN-Res18 improved from the original 56.6% and 48.4% to 57.0% and 56.7%, respectively. These results show that the student detectors obtained better features by learning the context of the teacher detectors, thus improving detection performance for remote sensing objects.
The introduction of the ADMD strategy resulted in mAP scores 1.9% and 2.3% higher than the baseline on the DIOR dataset for Faster RCNN-Res50 and Faster RCNN-Res18, respectively. On the DOTA dataset, the mAP was 0.6% and 7.9% higher than the baseline for Faster RCNN-Res50 and Faster RCNN-Res18, respectively. This demonstrates that our ADMD strategy guides the student to integrate the detection strengths of multiple teachers in order to learn better features, thereby achieving excellent performance in object detection.
Each component (CFGM and ADMD) provided additional significant gains for the different student detectors across datasets and backbones. The joint use of CFGM and ADMD in CDFD provided the best performance for both Faster RCNN-Res50 and Faster RCNN-Res18 on the DIOR (73.1% and 70.9%) and DOTA (58.0% and 56.9%) datasets. From Table 2, it can be seen that the two modules proposed in this paper, CFGM and ADMD, are both effective in enhancing the performance of the student detector, and combining both modules further improves detection performance.
By comparing the performance of CFGM and CDFD on students, we noticed that CDFD further improved student performance with Faster RCNN-ResNet50, while the performance difference was negligible with Faster RCNN-ResNet18. This is because, compared to ResNet18, the student backbone ResNet50 is more similar to the teacher backbones ResNext32 and ResNext64, and it is easier to learn useful knowledge from similar teachers. Four sets of comparison experiments were conducted with ResNet50 and ResNet18 on the DIOR and DOTA datasets to compare our CDFD with several state-of-the-art distillation approaches. The framework for both students and teachers was Faster RCNN, and the student backbones were ResNet50 and ResNet18. For the distillation approaches FGD [48], MGD [54], and our approach without the multi-teacher strategy, the teacher backbone was ResNet101, while our approach with the multi-teacher strategy employed three teachers with backbones ResNet101, ResNext32, and ResNext64.
As shown in Table 3, our approach achieved state-of-the-art performance on all datasets and backbone networks, with all students achieving significant accuracy improvements with our CDFD. With Faster RCNN-ResNet50 without the dense multi-teacher training strategy, our approach performed comparably to the high-performance distillation approach MGD; when the dense multi-teacher training strategy was introduced, it outperformed MGD on both the DIOR and DOTA datasets. With Faster RCNN-ResNet18 without the dense multi-teacher training strategy, our approach was slightly weaker than MGD, while with the dense multi-teacher training strategy, its performance was comparable to MGD on both datasets. In order to verify the generality of CDFD, we conducted experiments on more detectors with stronger students and teachers. We tested two state-of-the-art remote sensing detectors, AFPN [29] and FFPF [2], which are two representative object detection networks. AFPN uses an aware feature pyramid network, while FFPF uses a frequency domain-aware backbone network and a bilateral frequency domain-aware feature pyramid network.
Table 4 shows the experimental results. Without any additional computational effort, the mAP of FFPF increased from 72.8% to 73.6% and the mAP of AFPN increased from 71.8% to 72.6% with CDFD. These results show that student detectors obtain better features from stronger teacher detectors, demonstrating that CDFD generalises well and is applicable to various SOTA remote sensing detectors. To verify that CDFD is also effective for general object detection tasks, we compared CDFD with several state-of-the-art detection distillation approaches on the COCO dataset. Faster RCNN frameworks were used for both students and teachers, with ResNet50 as the student backbone and ResNet101 as the teacher backbone for FGFI [55], GID [56], and FGD [48], while ResNet101, ResNext32, and ResNext64 were used as the teacher backbones for CDFD.
Table 5 shows that our approach significantly outperformed previous SOTA approaches, with students obtaining substantial performance gains from the three teachers using the proposed CDFD. The student model improved from 38.4% to 40.9% mAP, completely eliminating the performance decrease caused by the lightweight backbone. These experimental results confirm that the proposed CDFD approach is equally applicable to general object detection tasks.

Visualisation Results
To understand the role of our proposed CDFD more intuitively, we visualised detection heat maps of the original model and of the distilled student model on the DIOR dataset; the warm-coloured regions are those of greater interest to the model. Both the original and student models used the Faster RCNN-ResNet50 detector. As shown in Figure 6, the student models trained with CDFD focus more precisely on the regions where the objects are located, avoiding interference from the background and therefore achieving better detection performance.
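A common way to produce such feature heat maps, assumed here rather than taken from the paper, is to average a backbone feature map over its channels and min-max normalise the result before overlaying it on the input image:

```python
def feature_heatmap(feat):
    """Sketch of a detection heat map from a C x H x W feature map
    (nested lists). Averages over channels, then min-max normalises
    to [0, 1]; warmer (higher) values mark stronger responses. This
    is a generic visualisation recipe, not the paper's exact code.
    """
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # Channel-wise mean at each spatial position
    heat = [[sum(feat[c][i][j] for c in range(C)) / C for j in range(W)]
            for i in range(H)]
    lo = min(min(row) for row in heat)
    hi = max(max(row) for row in heat)
    span = (hi - lo) or 1.0              # avoid division by zero on flat maps
    return [[(v - lo) / span for v in row] for row in heat]

# Tiny example: C=2 channels over a 2x2 spatial grid
heat = feature_heatmap([[[0, 1], [2, 3]],
                        [[4, 5], [6, 7]]])
# heat[0][0] == 0.0 and heat[1][1] == 1.0 after normalisation
```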

Conclusions
To create lightweight and superior detectors for onboard remote sensing object detection, we propose the novel Context-aware Dense Feature Distillation (CDFD), which guides a lightweight student detector to fully learn from multiple large teacher detectors. Our approach includes two novel components: a Contextual Feature Generation Module (CFGM) for learning teacher context, and an Adaptive Dense Multi-teacher Distillation (ADMD) strategy for learning multi-teacher feature maps, allowing lightweight detectors to obtain better features without additional computation and resulting in considerable performance gains. Extensive experiments on different datasets and network structures demonstrate that the proposed CDFD effectively improves the performance of lightweight detectors with good generalisation. Furthermore, experiments on a large general object dataset demonstrate that CDFD is equally effective for general object detection distillation. However, CDFD has a limitation: it is not suitable for detectors without a feature pyramid network. In future research, we will therefore investigate distillation methods that are more general and applicable to all detectors.
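The non-local pixel relationships that CFGM transfers can be illustrated in simplified form. The sketch below uses a generic row-softmax dot-product affinity between flattened pixel features and matches the student's relation matrix to the teacher's; both the affinity and the squared-error matching are illustrative stand-ins, not the paper's exact module.

```python
import math

def relation_matrix(feat):
    """feat: list of N pixel feature vectors (flattened H*W positions).
    Returns the N x N row-softmax of dot-product affinities — a generic
    non-local relation, used here as a stand-in for CFGM's rebuilt
    pixel-to-pixel relationships."""
    n = len(feat)
    rel = []
    for i in range(n):
        logits = [sum(a * b for a, b in zip(feat[i], feat[j])) for j in range(n)]
        m = max(logits)                  # stabilise the softmax
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        rel.append([e / z for e in exps])
    return rel

def relation_distill_loss(student_feat, teacher_feat):
    """Mean squared difference between student and teacher relation
    matrices, pushing the student to mimic the teacher's context."""
    rs, rt = relation_matrix(student_feat), relation_matrix(teacher_feat)
    n = len(rs)
    return sum((rs[i][j] - rt[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

s = [[1.0, 0.0], [0.0, 1.0]]
t = [[1.0, 1.0], [0.0, 2.0]]
loss = relation_distill_loss(s, t)
```

Because only the relation matrices are matched, the student and teacher feature dimensions do not need to agree spatially channel by channel, which is one reason relation-based losses are popular for heterogeneous backbones.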

Figure 1. Visualisation of middle-layer feature maps for different teacher networks. (a) Input image, (b) feature map of ResNet101, (c) feature map of ResNext32, (d) feature map of ResNext64.

Figure 2. The overall framework of context-aware dense feature distillation.

Figure 3. Description of Contextual Feature Generation Module.

Figure 4. Four different distillation strategies. Due to the varied knowledge from different teachers and different stages, dense multi-teacher knowledge distillation can typically offer rich knowledge and fine-tune a better student model. (a) Sparse mono-teacher distillation, (b) sparse multi-teacher distillation, (c) dense mono-teacher distillation, (d) dense multi-teacher distillation.

Figure 5. Homogeneous feature representation learning.

4.4.3. Comparison with State-of-the-Art Distillation Approaches on RS Datasets

Figure 6. Examples of heat map visualisations on the DIOR dataset. (a) Input image, (b) heat map of original model, (c) heat map of student model.

Table 1. Comparison of performance on the DIOR dataset and the DOTA dataset. The best results are marked in bold. T indicates the teacher network.

Table 2. Comparison of the impact of the CFGM and the ADMD. The best results are marked in bold.

Table 3. Comparison with state-of-the-art distillation approaches. The best results are marked in bold. † means without the multi-teacher strategy.

Table 4. Results of more detectors with stronger student and teacher detectors on the DIOR dataset. The best results are marked in bold.

Table 5. Comparison with state-of-the-art distillation approaches on the COCO dataset.