Multi-Scale Object Detection in Remote Sensing Images Based on Feature Interaction and Gaussian Distribution

Abstract: Remote sensing images are usually obtained from high-altitude observation. The spatial resolution of the images varies greatly, and there are scale differences both between and within object classes, resulting in a diversified distribution of object scales. To address these problems, we propose a novel object detection algorithm for remote sensing images that adapts to multi-scale objects through feature interaction and Gaussian distribution modeling. The proposed multi-scale feature interaction model constructs feature interaction modules in the feature layer and the spatial domain and combines them to fully utilize the spatial and semantic information of multi-level features. The proposed regression loss algorithm based on Gaussian distribution takes the normalized generalized Jensen–Shannon divergence with Gaussian angle loss as the regression loss function to ensure the scale invariance of the model. The experimental results demonstrate that our method achieves 77.29% mAP on the DOTA-v1.0 dataset and 97.95% mAP on the HRSC2016 dataset, which are, respectively, 1.12% and 1.41% higher than those of the baseline. These results indicate the effectiveness of our method for object detection in remote sensing images.


Introduction
With the rapid development of remote sensing technology, the quantity and quality of remote sensing images have substantially improved. The objects of interest in remote sensing image object detection typically include harbors, bridges, airplanes, ships, vehicles, and other multi-type objects. The main tasks of object detection in remote sensing images are to localize and classify the objects of interest [1]. This is important in intelligence reconnaissance, object surveillance, disaster rescue, industrial applications, and daily life. It also serves as the basis for the subsequent work of object tracking, scene classification, and image understanding.
Most deep learning-based object detection algorithms for remote sensing images are derived from general object detection algorithms [2]. Owing to the outstanding performance of deep learning algorithms in natural image processing, object detection in remote sensing images has also moved from traditional manual feature extraction to the current deep learning-based approach, and its performance has greatly improved. Compared with natural images, remote sensing images are usually obtained from high-altitude observations. They exhibit drastic object scale changes, diverse object orientations, small and densely distributed objects, and complex backgrounds. Therefore, object detection in remote sensing scenes still faces many challenges.
Compared to natural images, remote sensing images feature complex object characteristics and a significant amount of background interference; the visual information of objects is insufficient, and discriminative object features are lacking [3]. It is difficult to achieve good detection results by directly adopting detection schemes designed for natural scenes. Most current detection methods are designed around the intrinsic characteristics of remote sensing images. As shown in Figure 1, in remote sensing scenes, it is common to see objects of different classes and scales, or objects of the same class but different sizes, in the same field of view [4]. The scale difference among the objects is significant, which makes it difficult for the features of different objects to be synchronously transmitted to deep networks. As a result, the deep convolutional network, which relies on feature map representations, is unable to effectively capture the features of multi-scale objects. The detection model lacks good scale invariance, which in turn affects the detection accuracy of multi-scale objects [5]. Therefore, object detection in remote sensing images with large scale variations is still a challenging problem.
To address the aforementioned challenges, we propose a multi-scale feature interaction (MSFI) network. At the feature layer, a cross-level feature interaction (CFI) module is constructed to facilitate interactions between feature layers at different levels of the feature pyramid. This module is based on the focal modulation network [6], which can efficiently process multi-scale features while avoiding redundant and dense connections and mitigating the loss of small object features caused by multiple down-samplings. In the spatial domain, a spatial feature interaction (SFI) network, based on context aggregation [3] with a residual connection, is adopted to enhance features and suppress background noise. The proposed multi-scale feature interaction algorithm enhances the information recognition and expression capabilities of the detector, enabling multi-scale detection.
Meanwhile, commonly used loss functions in object detection, such as L-norm loss, lack scale invariance [7]. To adapt the model for the multi-scale detection task, we propose a regression loss based on Gaussian distribution (RLGD). We transform the regression problem in object detection into a problem of measuring the similarity between Gaussian distributions. Then, the generalized Jensen-Shannon divergence [8], which possesses scale invariance, is used to measure the similarity between Gaussian distributions, and it is normalized to serve as the regression loss function. On this basis, the Gaussian angle loss [9] is introduced to overcome the angle confusion problem of square bounding boxes in the rotated box representation based on Gaussian distribution.
In this paper, we follow the Oriented R-CNN [10] object detector framework and apply the proposed algorithm to the object detection task in optical remote sensing images. The results are evaluated on the DOTA-v1.0 [4] and HRSC2016 [11] datasets. Compared with other detectors, the proposed method achieves better detection results.
The main contributions of this paper are as follows:
(1) We propose a new multi-scale feature interaction (MSFI) network that constructs a cross-level feature interaction (CFI) module at the feature layer and introduces a spatial feature interaction (SFI) module in the spatial domain. By combining the feature interaction in the feature layer and spatial domain, the network makes full use of the spatial and semantic information of multi-level features to effectively localize and identify objects with drastic scale changes.
(2) We propose a new regression loss based on Gaussian distribution (RLGD), which uses the normalized generalized Jensen-Shannon divergence between Gaussian distributions as the regression loss function to make the model scale-invariant. The Gaussian angle loss is introduced to overcome the angle confusion problem in Gaussian modeling.
(3) The experimental results show that the proposed method has superior detection capabilities compared to the baseline and other state-of-the-art detectors. Our method obtains an mAP of 77.29% on the DOTA-v1.0 test set and an mAP of 97.95% on the HRSC2016 test set.

Object Detection in Remote Sensing Images
Advanced remote sensing image object detection algorithms usually rely on the two-stage R-CNN [12] framework, which consists of a region proposal network and a region CNN detection head. In recent years, a large amount of research has focused on oriented bounding box representations for remote sensing image object detection, and several variations of the R-CNN framework have been proposed. The RoI Transformer [13] applies fully connected layers to learn rotated anchor boxes from horizontal anchor boxes and extracts the features in these boxes for further regression and classification. ReDet [14] provides an explicit encoding for equivariance and rotation invariance and proposes a rotation-invariant RoI alignment. Oriented R-CNN [10] introduces a new bounding box encoding system and proposes an oriented region proposal network (oriented RPN) that directly generates high-quality oriented proposals. It also introduces the Oriented R-CNN head for refining and recognizing oriented regions of interest (oriented RoIs). Gliding Vertex [15] adds four sliding offset variables to the classical horizontal bounding box representation to address the instability in training loss caused by the periodicity of the rotation angle.
Different from the two-stage detection framework, the one-stage detection framework directly employs a network to transform the object detection problem into a regression problem. For remote sensing detection tasks, S2ANet [16] extracts robust object features by means of a feature alignment module and an orientation detection module. Some approaches [17][18][19][20] consider remote sensing detection as a point detection task. In addition, DRN [21] uses the attention mechanism to dynamically refine the features extracted by the backbone for more accurate prediction. RSDet [22] approaches the problem from the perspective of the loss function, mitigating the discontinuity of the regression loss by reducing the loss value of training samples in boundary cases. R3Det [23] detects objects quickly and accurately by using a stepwise regression approach from coarse-grained to fine-grained. AOPG [24] uses a progressive regression method to refine the bounding box from coarse-grained to fine-grained. LSKNet [25] can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. ARC [26] proposes a convolution kernel that rotates adaptively to extract object features with varying orientations in different images. In addition to CNN-based frameworks, O2DETR [27] and AO2-DETR [28] introduce the Transformer-based detection framework DETR [29] to remote sensing detection tasks, which provides another way to solve the object detection problem in remote sensing images.
In recent years, the Transformer architecture has also promoted the advancement of remote sensing image object detection. Such algorithms are not directly equivalent to single-stage or two-stage algorithms; when applied to object detection, they can realize either single-stage or two-stage detection in different ways. The core idea of the Transformer is to focus on different parts of the image through attention mechanisms, which does not in itself determine whether the detection process is single-stage or two-stage. Attention-based Transformers [30], initially designed for natural language processing, have had a considerable impact on multiple vision tasks, including object detection. The core principle behind Transformers is attention: determining which part of the image the model should focus on at a given time [31]. This attention allocation can be highly beneficial for object detection, where spatial information and an understanding of the object hierarchy in the image are often crucial. Furthermore, several studies have highlighted the feasibility and efficiency of hybrid models that combine the spatial strengths of CNNs with the dynamic attention capabilities of Transformers, referred to herein as CNN-Transformer hybrids. Diverse experiments, ranging from DETR [29], a pioneering model applying Transformers to object detection, to Swin Transformer [32], a CNN-Transformer hybrid, have illustrated the potential of this combination.

Small Infrared Object Detection
This section reviews recent progress in small infrared object detection, with a particular focus on the contributions of several representative studies. Firstly, the ISNet model [33] emphasizes the significance of "shape" in detecting infrared small targets and integrates a shape-aware feature extraction module within a convolutional neural network framework. This approach has proven effective in enhancing the recognition of small targets, which often exhibit irregular and amorphous shapes against complex backgrounds.
Following this, Rkformer [34] introduces a novel application of the Runge-Kutta Transformer coupled with random-connection attention mechanisms. This model adeptly captures the dynamic and transient characteristics of small targets, facilitating robust detection even within highly intricate scenes. Further exploration into feature analysis is presented in [35], where a method for feature compensation and cross-level correlation is proposed. This technique aims to offset the loss of critical target details and leverages the intrinsic correlation between different feature levels, showing marked effectiveness in distinguishing small targets from background noise.
Similarly, Chfnet [36] proposes a curvature half-level fusion network that merges curvature-driven contour features with conventional intensity features, significantly boosting the discriminative capability for single-frame infrared small target detection. Drawing inspiration from thermodynamic principles, [37] introduces a multi-feature network designed to effectively segregate small targets from their surroundings. This thermodynamics-inspired approach has shown exceptional proficiency, especially in scenarios where traditional intensity-based methods fall short. DMEF-Net [38] offers a lightweight solution tailored to limited sample scenarios, further underscoring the diverse approaches being developed to tackle the challenges of infrared small target detection. Addressing the computational efficiency of infrared small target detection, IRPruneDet [39] incorporates wavelet structure-regularized soft channel pruning. This method significantly reduces computational overhead without compromising detection accuracy, presenting an ideal solution for real-time applications.

Multi-Scale Object Detection

Multi-Scale Object Detection in Remote Sensing Images
Remote sensing images are usually obtained from high-altitude observation. The spatial resolution of the images varies widely, and there are scale differences between and within object classes. Because the object scale in remote sensing images changes drastically, the designed model must have good scale adaptability. This means the model should retain high recognition ability across various scales, even with drastic changes in multiple classes of objects. Multi-scale feature fusion can effectively alleviate the challenging problem of large object scale changes.
The most typical module of multi-scale feature fusion is the feature pyramid [40], which integrates deep semantic information with shallow contour and position information and uses different feature maps to uniformly locate the object to be detected at multiple scales, to achieve accurate object detection [41]. CATNet [3] proposes a dense feature pyramid network with spatial context pyramids to improve the multi-scale object feature extraction process, thus improving object detection accuracy. Liu et al. [42] propose a hybrid network named TransConvNet. The model pays more attention to the aggregation of global and local information to improve the information representation ability of feature maps with different resolutions, while an adaptive feature fusion network is designed to mitigate the problem of drastic changes in object scales. Li et al. [43] aggregate the features of global spatial locations at multiple scales on the Transformer and model the interaction between pairwise instances to improve the detection performance of objects at different scales.
In addition to multi-scale feature fusion, Yang et al. [44] propose a measurement method based on Kullback-Leibler divergence (KLD), which is scale-invariant and can localize the object to be detected well. Zhu et al. [45] propose Transformer prediction head YOLOv5 (TPH-YOLOv5), which replaces some convolutional and cross-stage partial (CSP) structures in the original YOLOv5 with a Transformer encoder and a Transformer prediction head to mitigate the problem of drastic scale changes of the object. Xu et al. [46] propose the adaptive zoom network (AdaZoom), which freely scales the object-focusing region with flexible aspect ratios and scales for multi-scale object detection. They also use co-training to promote the coordination between AdaZoom and the object detector to further improve the detection performance. Chalavadi et al. [47] propose a multi-scale object detection network with hierarchical extended convolution. Their model employs parallel extended convolution to learn contextual information about different classes of objects across multiple scales and fields of view, effectively enhancing its ability to detect objects at various scales.

Multi-Scale Features
Multi-scale features have been widely used in the field of object detection. Higher-level feature maps emphasize advanced semantic information, while lower-level feature maps focus on texture and localization information. The fusion of multi-scale information can effectively combine high-resolution and high-semantic features, enhancing the recognition and expression of object information. Effectively fusing them is the key to improving the detection model [42], and some current studies have achieved promising results [41]. However, in the process of multi-scale feature fusion, the feature map is downsampled several times to obtain higher-level semantic information, which causes the features of small objects to disappear. Additionally, various kinds of noise are amplified during feature fusion. How to deal with these effects is a critical open problem [48].
The multi-scale feature propagation strategy [49] propagates features along a fixed path, while flexible information flow reduces information confusion and better aggregates multi-scale features to avoid the problem of small object features disappearing due to multiple down-sampling. Cross-scale feature fusion pre-fuses features at a uniform scale and subsequently generates feature maps of different specifications for subsequent detection. This approach samples a larger span on feature layers that deviate from the center layer, whereas feature fusion at the corresponding scales separately avoids loss of information [50]. Furthermore, the attention mechanism introduced by Transformer [30] has demonstrated effectiveness in multi-scale object detection. The attention mechanism focuses on regions of interest within a large amount of input information, reducing focus on other details and filtering out irrelevant data. This can enhance both the efficiency and accuracy of object detection tasks. TPH-YOLOv5 [45] leverages the Transformer architecture to enhance YOLOv5 and incorporates a self-attention mechanism to effectively detect multi-scale objects.

Scale Invariance of Regression Loss
In the field of object detection, commonly used loss functions such as the L1-norm do not have scale invariance, whereas regression losses based on intersection over union (IoU) [51,52] train the network parameters by optimizing the overlap of the bounding boxes and can therefore be applied to object detection tasks with different scales and shapes. However, the IoU loss is non-differentiable for oriented bounding boxes and cannot be directly applied to rotated object detection in remote sensing images. Additionally, existing deep learning frameworks do not readily provide operators for the skew IoU loss, so the IoU loss is not widely used in object detection based on oriented bounding boxes.
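The contrast between the two families of losses can be checked numerically. The following is a minimal sketch that uses axis-aligned boxes in (x1, y1, x2, y2) form, which is a simplification of the oriented case discussed here; the helper names `l1_loss`, `iou`, and `scale_box` are hypothetical and introduced only for illustration.

```python
def l1_loss(pred, target):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def scale_box(box, k):
    """Scale every coordinate by k, as if the image were resampled."""
    return tuple(k * v for v in box)


pred, target = (0.0, 0.0, 10.0, 10.0), (2.0, 2.0, 12.0, 12.0)
for k in (1.0, 10.0):
    p, t = scale_box(pred, k), scale_box(target, k)
    # The L1 value grows with k, while 1 - IoU stays the same.
    print(k, l1_loss(p, t), 1.0 - iou(p, t))
```

The same relative localization error is penalized ten times more by the L1 term when the objects are ten times larger, which is the scale dependence that IoU-style and distribution-based losses avoid.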
The Gaussian Wasserstein distance model (GWD) [7] transforms a rectangular box representation with arbitrary orientation into the form of a two-dimensional Gaussian distribution. This transformation reframes the bounding box parameter regression problem as a similarity measurement between two Gaussian distributions. It replaces the traditional method of calculating differences between bounding box parameters at various scales and units, used in L1 loss calculations. The KLD model [44] computes the Kullback-Leibler divergence (KLD) between Gaussian distributions, which can approximate the degree of overlap between the predicted bounding boxes and the true bounding boxes and is suitable for object detection tasks with various scales and shapes.

Overall Architecture
To improve the multi-scale detection capability in remote sensing images, we propose a multi-scale feature interaction (MSFI) network that takes full advantage of the spatial and semantic information of multi-level features, together with a regression loss based on Gaussian distribution (RLGD) that provides a scale-invariant regression loss. In this paper, we follow the Oriented R-CNN [10] object detector framework and apply the proposed method to the object detection task in remote sensing images. Oriented R-CNN is a two-stage object detection framework based on a five-parameter rotated bounding box with good accuracy and efficiency, which mainly consists of an oriented region proposal network (oriented RPN) and an Oriented R-CNN head. Specifically, the oriented RPN is employed in the first stage to directly generate high-quality oriented proposals in a nearly cost-free manner. The Oriented R-CNN head, equipped with rotated RoIAlign and a fully connected (FC) layer, is used to classify and regress the proposals from the oriented RPN. The overall architecture of the proposed multi-scale remote sensing image object detection model is shown in Figure 2. The structural details of each part of our method are described in the following.

Multi-Scale Feature Interaction Network
This section proposes a multi-scale feature interaction (MSFI) network, including a cross-level feature interaction (CLFI) module composed of a top-down grounding feature interaction (GFI) module and a bottom-up rendering feature interaction (RFI) module, as well as a spatial feature interaction (SFI) module. The overall architecture is shown in Figure 3a. The MSFI network proposed in this section carries out feature interactions between different layers of features and within the same scale. This multi-scale feature interaction strategy can help the model integrate deep and shallow features more efficiently, thus capturing more comprehensive information about the object.

Cross-Level Feature Interaction Network
In this section, we propose a cross-level feature interaction network (CLFI) that interacts with the features of adjacent layers in the feature pyramid. The feature pyramid obtained after feature interaction is constant in size and has richer contextual information. In the CLFI network, adjacent feature maps are scaled to perform grounding feature interaction (GFI) and rendering feature interaction (RFI), respectively. Among these, the GFI module facilitates the interaction of lower-level feature maps with higher-level feature maps, whereas the RFI module enables the interaction of higher-level feature maps with lower-level ones. These feature interactions occur between adjacent feature layers and are based on the focal modulation network.
As shown in Figure 3b, the GFI module is a kind of top-down feature interaction [53], which grounds the "concept" in the higher-level feature map to the "pixels" in the lower-level feature map. The size of the output is the same as that of the lower-level feature map. Specifically, in the GFI module, the lower-level feature map is $C'_i \in \mathbb{R}^{H \times W \times C}$, the higher-level feature map is $C'_{i+1}$, and the feature map obtained after feature interaction is $P_{Gi} \in \mathbb{R}^{H \times W \times C}$.
In the GFI module, we set $Q = C'_i$. The bilinear interpolation algorithm is used to upsample $C'_{i+1}$ to obtain $X$, keeping its resolution consistent with $Q$. We adopt the focal modulation network [6] instead of the attention mechanism [54], and feature interaction is performed on $Q$ and $X$ to obtain $Y$:
$$y_i = f(q_i) \odot m(i, X) \tag{1}$$
where $q_i$ and $y_i$ are the features of $Q$ and $Y$ at location $i$, $f(q_i)$ is a query projection function, $\odot$ denotes element-wise multiplication, and $m(\cdot)$ is a context aggregation function whose output is a modulator. The focal modulation network [6] is shown in Figure 4. The feature map $P_{Gi} = Y$ obtained after feature interaction is the output of the GFI module.
As shown in Figure 3c, the RFI module adopts a bottom-up approach [53] and uses the visual attributes of "pixels" in lower-level feature maps to render "concepts" in higher-level feature maps. The size of the output is the same as that of the higher-level feature map. Specifically, in the RFI module, the higher-level feature map is $C'_i \in \mathbb{R}^{H \times W \times C}$, the lower-level feature map is $C'_{i-1}$, and the feature map obtained after feature interaction is $P_{Ri} \in \mathbb{R}^{H \times W \times C}$.
In the RFI module, similar to the GFI module, let $Q = C'_i$. The max pooling algorithm is applied to reduce the resolution of the feature map $C'_{i-1}$ to obtain $X$, whose resolution is consistent with $Q \in \mathbb{R}^{H \times W \times C}$. As in the GFI module, the feature map $Y$ is obtained by combining the features of $Q$ and $X$ based on the focal modulation network. The feature map $P_{Ri} = Y$ obtained after feature interaction is the output of the RFI module.
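The resolution alignment that precedes the interaction in both modules can be sketched on a single-channel map. This is a minimal pure-Python illustration under several assumptions that the text does not state: a factor of 2 between adjacent levels, 2 × 2 max pooling with stride 2, and bilinear interpolation with corner-aligned sampling. The helpers `max_pool_2x2` and `upsample_bilinear_2x` are hypothetical names, and the focal modulation step itself is not reproduced.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 (RFI: shrink the lower-level map)."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]


def upsample_bilinear_2x(fmap):
    """Bilinear 2x upsampling (GFI: enlarge the higher-level map).

    Assumes corner-aligned sampling: output corners map onto input corners.
    """
    h, w = len(fmap), len(fmap[0])
    out_h, out_w = 2 * h, 2 * w
    out = []
    for r in range(out_h):
        y = r * (h - 1) / (out_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        wy = y - y0
        row = []
        for c in range(out_w):
            x = c * (w - 1) / (out_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            wx = x - x0
            top = (1 - wx) * fmap[y0][x0] + wx * fmap[y0][x1]
            bottom = (1 - wx) * fmap[y1][x0] + wx * fmap[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


higher = [[1.0, 2.0], [3.0, 4.0]]               # stands in for C'_{i+1}
lower = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]

x_gfi = upsample_bilinear_2x(higher)            # 4x4, matches a 4x4 query map
x_rfi = max_pool_2x2(lower)                     # 2x2, matches a 2x2 query map
print(len(x_gfi), len(x_gfi[0]), x_rfi)
```

In a real detector these operations would act on multi-channel tensors through a deep learning framework; the sketch only shows how the context map X is brought to the resolution of the query map Q in each direction.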

Spatial Feature Interaction Network
The cross-level feature interaction network performs feature interactions for different levels of feature maps. However, the feature pyramid still contains spatial information. Therefore, this section introduces the spatial feature interaction (SFI) network to further enhance features by learning the global spatial context within each level. The structure of the SFI network is shown in Figure 3d; it is composed of a residual connection and a context aggregation block (CAB) [3]. Given the input feature map $C'_i \in \mathbb{R}^{H \times W \times C}$, the output feature map $P_{Si} \in \mathbb{R}^{H \times W \times C}$ is obtained after SFI, which has the same size as the input feature map $C'_i$. In the SFI module, let $X = C'_i$ and perform the pixel-by-pixel spatial feature interaction:
$$Y = X + \mathrm{CAB}(X) \tag{2}$$
where $Y$ is the output feature map after the spatial feature interaction and $\mathrm{CAB}(\cdot)$ [3] denotes the context aggregation block, as shown in Figure 5. In the formulation of the context aggregation block [3], $X$ is the feature map containing $N_i$ pixels, $j, m, k \in \{1, \ldots, N_i\}$ denote the index of each pixel, $\odot$ denotes element-wise multiplication, and $W_{v1}$, $W_{v2}$, $W_{v3}$, $W_{v4}$ are the linear transformation matrices used to project the feature map, each implemented as a 1 × 1 convolution. Let the feature map $P_{Si} = Y$ obtained after spatial feature interaction be the output of the SFI module.

Multi-Scale Feature Interaction Network
The overall architecture of the multi-scale feature interaction network is shown in Figure 3. Firstly, features are extracted from the input image by the backbone network and denoted as $C = \{C_2, \ldots, C_5\}$, where $C_i$ represents the feature map at level $i$. Then, a pyramid-shaped feature map $C' = \{C'_2, \ldots, C'_5\}$ with different layers is obtained by enhancing the features through the lateral connection and top-down pathways [40], as shown in Figure 6a. Specifically, a 1 × 1 convolution is used on each feature layer for lateral connection to reduce and unify its channel dimensions to 256. A bilinear interpolation algorithm is used to upsample the spatial resolution of higher-level feature maps by a factor of 2, which is then summed element-by-element with the lower-level feature maps. And C The resolution of the feature maps $C'_i$ and $P_i$ is $1/2^i$ of that of the input image. The multi-scale feature interaction from $C'_i$ to $P_i$ at the $i$-th layer is shown in Figure 6b.
For the multi-scale feature interaction network, $P_{Gi}$ and $P_{Ri}$ are obtained from cross-level feature interactions, and $P_{Si}$ is obtained from spatial feature interactions. Let $P_i = P_{Gi} + P_{Ri} + P_{Si}$, where $P_i \in \{P_2, \ldots, P_6\}$ is the feature map obtained by combining feature interactions in the feature layer and the spatial domain. Then P i = P

In this section, we propose a regression loss algorithm based on Gaussian distribution. Firstly, the rectangular box representation with arbitrary orientation is transformed into the form of a two-dimensional Gaussian distribution. Replacing the oriented box representation with a Gaussian distribution can solve the boundary problem, alleviate the square-like problem, and avoid the inconsistency between the metric and the loss in the detection of rotated objects [7]. Specifically, the five-parameter oriented bounding box $B(x, y, w, h, \theta)$ is transformed into a two-dimensional Gaussian distribution $N(\mu, \Sigma)$ [7]:
$$\mu = (x, y)^\top, \qquad \Sigma^{1/2} = R \Lambda R^\top = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} w/2 & 0 \\ 0 & h/2 \end{pmatrix} \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \tag{4}$$
where $R$ denotes the rotation matrix and $\Lambda$ denotes the diagonal matrix of eigenvalues. According to the above equation, the predicted bounding box and the ground-truth bounding box are transformed into two-dimensional Gaussian distributions $X_p \sim N_p(\mu_p, \Sigma_p)$ and $X_t \sim N_t(\mu_t, \Sigma_t)$, respectively. This transformation allows the regression problem in object detection to be converted into a problem of measuring the similarity between two Gaussian distributions.
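The box-to-Gaussian conversion can be sketched in a few lines. The following is a minimal illustration that assumes the convention of [7], in which the square root of the covariance is $R\,\mathrm{diag}(w/2, h/2)\,R^\top$ and the angle is given in radians; the function name `box_to_gaussian` is hypothetical.

```python
import math


def box_to_gaussian(x, y, w, h, theta):
    """Map an oriented box (x, y, w, h, theta in radians) to (mu, Sigma).

    Assumes Sigma^(1/2) = R diag(w/2, h/2) R^T, hence
    Sigma = R diag(w^2/4, h^2/4) R^T, written out entry by entry.
    """
    c, s = math.cos(theta), math.sin(theta)
    a, b = w * w / 4.0, h * h / 4.0
    sigma = [[a * c * c + b * s * s, (a - b) * c * s],
             [(a - b) * c * s, a * s * s + b * c * c]]
    return (x, y), sigma


# An axis-aligned 20 x 10 box and the same box rotated by 90 degrees.
mu, sigma = box_to_gaussian(50.0, 40.0, 20.0, 10.0, 0.0)
_, sigma_rot = box_to_gaussian(50.0, 40.0, 20.0, 10.0, math.pi / 2)
print(mu, sigma)
print(sigma_rot)
```

The center becomes the mean, while width, height, and angle are absorbed into a single symmetric matrix, so the regression target no longer mixes parameters with different units.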

Regression Loss of Normalized GJSD
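Due to the significant size differences among various categories of objects, or even within the same category at different acquisition resolutions, it is crucial for a detection algorithm for remote sensing images to possess excellent scale adaptability. To address these challenges, we propose a regression loss model based on the generalized Jensen-Shannon divergence (GJSD) [8]. Specifically, GJSD is utilized to measure the similarity between two Gaussian distributions, $X_p \sim N_p(\mu_p, \Sigma_p)$ and $X_t \sim N_t(\mu_t, \Sigma_t)$. The calculations are detailed below:
$$D_{GJS}(N_p \| N_t) = (1-\alpha)\, D_{KL}(N_\alpha \| N_p) + \alpha\, D_{KL}(N_\alpha \| N_t) \tag{5}$$
where $D_{KL}$ represents the Kullback-Leibler divergence (KLD) [44] and $D_{KL}(N_\alpha \| N_p)$ is denoted as follows:
$$D_{KL}(N_\alpha \| N_p) = \frac{1}{2}(\mu_p - \mu_\alpha)^\top \Sigma_p^{-1} (\mu_p - \mu_\alpha) + \frac{1}{2}\mathrm{Tr}\left(\Sigma_p^{-1}\Sigma_\alpha\right) + \frac{1}{2}\ln\frac{|\Sigma_p|}{|\Sigma_\alpha|} - 1 \tag{6}$$
The definition of $D_{KL}(N_\alpha \| N_t)$ is the same as above, which gives $D_{GJS}(N_p \| N_t)$, denoted by the following:
$$D_{GJS}(N_p \| N_t) = \frac{1-\alpha}{2}\left[(\mu_p - \mu_\alpha)^\top \Sigma_p^{-1} (\mu_p - \mu_\alpha) + \mathrm{Tr}\left(\Sigma_p^{-1}\Sigma_\alpha\right) + \ln\frac{|\Sigma_p|}{|\Sigma_\alpha|}\right] + \frac{\alpha}{2}\left[(\mu_t - \mu_\alpha)^\top \Sigma_t^{-1} (\mu_t - \mu_\alpha) + \mathrm{Tr}\left(\Sigma_t^{-1}\Sigma_\alpha\right) + \ln\frac{|\Sigma_t|}{|\Sigma_\alpha|}\right] - 1 \tag{7}$$
$N_\alpha(\mu_\alpha, \Sigma_\alpha)$ is given by the following formula:
$$\Sigma_\alpha = \left((1-\alpha)\Sigma_p^{-1} + \alpha\Sigma_t^{-1}\right)^{-1}, \qquad \mu_\alpha = \Sigma_\alpha\left((1-\alpha)\Sigma_p^{-1}\mu_p + \alpha\Sigma_t^{-1}\mu_t\right) \tag{8}$$
The parameter $\alpha$ controls the weights of the two distributions, and its value is set to 0.5. Then, normalizing the distance function $D_{GJS}(N_p \| N_t)$ yields the following regression loss:
$$L_{GJSD} = 1 - \frac{1}{\tau + f\left(D_{GJS}(N_p \| N_t)\right)} \tag{9}$$
The symbol $f(\cdot)$ represents a nonlinear function, typically $\mathrm{sqrt}(\cdot)$ or $\log(\cdot)$. The symbol $\tau$ is a hyperparameter that modulates the entire loss function. In this case, $f(\cdot)$ is set to $\mathrm{sqrt}(\cdot)$ and $\tau$ is set to 2 [55].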
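The divergence and its normalization can be sketched for two-dimensional Gaussians with plain Python. This is an illustrative sketch rather than the authors' implementation. It assumes that $N_\alpha$ is the weighted-precision (skew-geometric) combination of the two Gaussians, that the weight $1-\alpha$ is attached to the prediction, and that the normalized loss has the form $1 - 1/(\tau + \sqrt{D})$ commonly used with Gaussian-based losses; with $\alpha = 0.5$ the weighting assumption has no effect. All function names are hypothetical.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def kld(mu0, s0, mu1, s1):
    """KL(N(mu0, s0) || N(mu1, s1)) for two-dimensional Gaussians."""
    s1_inv = inv2(s1)
    d = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    sd = matvec(s1_inv, d)
    maha = d[0] * sd[0] + d[1] * sd[1]
    trace = (s1_inv[0][0] * s0[0][0] + s1_inv[0][1] * s0[1][0]
             + s1_inv[1][0] * s0[0][1] + s1_inv[1][1] * s0[1][1])
    return 0.5 * (maha + trace + math.log(det2(s1) / det2(s0))) - 1.0


def gjsd(mu_p, s_p, mu_t, s_t, alpha=0.5):
    """Weighted sum of KL terms measured from the intermediate Gaussian."""
    sp_inv, st_inv = inv2(s_p), inv2(s_t)
    precision = [[(1 - alpha) * sp_inv[i][j] + alpha * st_inv[i][j]
                  for j in range(2)] for i in range(2)]
    s_a = inv2(precision)
    vp, vt = matvec(sp_inv, mu_p), matvec(st_inv, mu_t)
    mu_a = matvec(s_a, [(1 - alpha) * vp[k] + alpha * vt[k] for k in range(2)])
    return ((1 - alpha) * kld(mu_a, s_a, mu_p, s_p)
            + alpha * kld(mu_a, s_a, mu_t, s_t))


def gjsd_loss(mu_p, s_p, mu_t, s_t, tau=2.0):
    """Normalized loss with f = sqrt; the max() guards against round-off."""
    d = max(gjsd(mu_p, s_p, mu_t, s_t), 0.0)
    return 1.0 - 1.0 / (tau + math.sqrt(d))


# Gaussians of a 20 x 10 box and a slightly shifted 24 x 12 box (theta = 0).
mu_p, s_p = [50.0, 40.0], [[100.0, 0.0], [0.0, 25.0]]
mu_t, s_t = [52.0, 41.0], [[144.0, 0.0], [0.0, 36.0]]
print(gjsd(mu_p, s_p, mu_t, s_t), gjsd_loss(mu_p, s_p, mu_t, s_t))
```

With $\alpha = 0.5$ the value is the same when prediction and target are swapped, and multiplying both boxes by a common factor leaves it unchanged, which are the two properties the loss relies on.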
From Equations (7) and (8), it can be seen that each term in $D_{GJS}(N_p \| N_t)$ consists of a partial parameter coupling, which makes all parameters form a chain coupling relationship. During the optimization process of the GJSD-based regression loss, the parameters interact and co-optimize with each other, resulting in a self-modulating optimization mechanism. GJSD is capable of measuring the similarity between the Gaussian distributions of the ground-truth bounding box and the predicted bounding box. It inherits the scale invariance of KLD while overcoming its asymmetry. The scale invariance of GJSD is analyzed as follows. If there is a full-rank matrix $M$, then
$$X'_p = M X_p \sim N'_p\left(M\mu_p, M\Sigma_p M^\top\right), \qquad X'_t = M X_t \sim N'_t\left(M\mu_t, M\Sigma_t M^\top\right)$$
From Equations (5) to (8), the following can be obtained:
$$D_{GJS}(N'_p \| N'_t) = D_{GJS}(N_p \| N_t)$$
When $M = kI$ ($I$ is the identity matrix and $k \neq 0$), the scale invariance of GJSD is proved (see the detailed proof in Appendix A). Therefore, the GJSD-based regression loss model, which employs the Gaussian representation of the oriented bounding box, can mitigate the issue of large object scale variation in remote sensing images and enhance the effectiveness of object detection.
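The invariance claim can be traced in a few lines. The outline below assumes that the intermediate Gaussian is the weighted-precision combination $\Sigma_\alpha = \left((1-\alpha)\Sigma_p^{-1} + \alpha\Sigma_t^{-1}\right)^{-1}$, $\mu_\alpha = \Sigma_\alpha\left((1-\alpha)\Sigma_p^{-1}\mu_p + \alpha\Sigma_t^{-1}\mu_t\right)$, and it uses primes for quantities computed after applying $M$. It is a sketch of the argument, not a substitute for the proof in Appendix A.

```latex
\begin{aligned}
\Sigma'_\alpha
  &= \left((1-\alpha)(M\Sigma_p M^\top)^{-1} + \alpha (M\Sigma_t M^\top)^{-1}\right)^{-1}
   = \left(M^{-\top}\left[(1-\alpha)\Sigma_p^{-1} + \alpha\Sigma_t^{-1}\right]M^{-1}\right)^{-1}
   = M\Sigma_\alpha M^\top, \\
\mu'_\alpha
  &= \Sigma'_\alpha\left((1-\alpha)(M\Sigma_p M^\top)^{-1}M\mu_p + \alpha (M\Sigma_t M^\top)^{-1}M\mu_t\right)
   = M\Sigma_\alpha M^\top M^{-\top}\left[(1-\alpha)\Sigma_p^{-1}\mu_p + \alpha\Sigma_t^{-1}\mu_t\right]
   = M\mu_\alpha, \\
(\mu'_p - \mu'_\alpha)^\top \Sigma_p'^{-1} (\mu'_p - \mu'_\alpha)
  &= (\mu_p - \mu_\alpha)^\top M^\top M^{-\top}\Sigma_p^{-1}M^{-1}M (\mu_p - \mu_\alpha)
   = (\mu_p - \mu_\alpha)^\top \Sigma_p^{-1} (\mu_p - \mu_\alpha), \\
\mathrm{Tr}\left(\Sigma_p'^{-1}\Sigma'_\alpha\right)
  &= \mathrm{Tr}\left(M^{-\top}\Sigma_p^{-1}\Sigma_\alpha M^\top\right)
   = \mathrm{Tr}\left(\Sigma_p^{-1}\Sigma_\alpha\right), \\
\ln\frac{|\Sigma'_p|}{|\Sigma'_\alpha|}
  &= \ln\frac{|M|^2\,|\Sigma_p|}{|M|^2\,|\Sigma_\alpha|}
   = \ln\frac{|\Sigma_p|}{|\Sigma_\alpha|}.
\end{aligned}
```

Every term of the KL divergence between the intermediate Gaussian and the prediction is therefore unchanged, and the same steps apply with $t$ in place of $p$, so the weighted sum is unchanged as well. Choosing $M = kI$ corresponds to uniformly rescaling both boxes.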

Gaussian Angle Loss for the Problem of Angle Confusion
Although the regression loss based on GJSD has scale invariance, when the bounding box $B(x, y, w, h, \theta)$ is square, its two-dimensional Gaussian distribution $N(\mu, \Sigma)$ is a circle that cannot accurately represent the direction. Specifically, when $w = h$, the following can be obtained from Equation (4):
$$\Sigma^{1/2} = R \Lambda R^\top = \frac{w}{2} R R^\top = \frac{w}{2} I$$
It can be seen that the angular information is lost, which can lead to angular confusion for square objects, thus affecting the detection performance.
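The loss of angular information for square boxes can be observed directly. The sketch below assumes the convention $\Sigma^{1/2} = R\,\mathrm{diag}(w/2, h/2)\,R^\top$ from [7] and compares the resulting matrix across several angles; `sqrt_covariance` and `max_abs_diff` are hypothetical helper names.

```python
import math


def sqrt_covariance(w, h, theta):
    """Entries of Sigma^(1/2) = R diag(w/2, h/2) R^T for theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    a, b = w / 2.0, h / 2.0
    return [[a * c * c + b * s * s, (a - b) * c * s],
            [(a - b) * c * s, a * s * s + b * c * c]]


def max_abs_diff(m1, m2):
    """Largest entry-wise difference between two 2x2 matrices."""
    return max(abs(m1[i][j] - m2[i][j]) for i in range(2) for j in range(2))


angles = [math.radians(d) for d in (0, 30, 45, 60)]
square = [sqrt_covariance(10.0, 10.0, t) for t in angles]
rect = [sqrt_covariance(20.0, 10.0, t) for t in angles]

# Square box: every angle gives the same matrix (up to round-off).
print(max(max_abs_diff(square[0], m) for m in square))
# Elongated box: the matrix changes with the angle.
print(max(max_abs_diff(rect[0], m) for m in rect))
```

For the square box the Gaussian carries no trace of the angle, so any loss defined only on the two Gaussians cannot tell the orientations apart; this is why a separate angle term is needed.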
To address this issue, the Gaussian angle loss function [9] is introduced as a solution to the angle confusion problem of square bounding boxes in the rotated box representation based on Gaussian distribution. The Gaussian angle loss $L_{GA}$ is a function of the angle deviation $\Delta\theta = \theta_p - \theta_t$, with hyperparameters $\lambda$ and $\beta$. To reduce the effect on non-square objects, the values of $\lambda$ and $\beta$ are set to 3 and 0.25, respectively. Further analysis of its gradient shows that when $w_t = h_t$, $\partial L_{GA}/\partial\theta_p = 2\sin(4\Delta\theta)$. As the aspect ratio approaches 1, $L_{GA}$ varies with a period of 90° as the angle deviation $\Delta\theta$ increases, which is consistent with the angular period of a square. As the aspect ratio increases, $\partial L_{GA}/\partial\theta_p$ approaches 0 and $L_{GA}$ no longer takes effect.
The relationships of the GJSD-based regression loss ($L_{GJSD}$) and the Gaussian angle loss ($L_{GA}$) to the bounding box aspect ratio and angle deviation, respectively, are shown in Figure 7, which visualizes the above properties.

Regression Loss Based on Gaussian Distribution
Based on the discussion of the above issues, we propose the regression loss based on Gaussian distribution (RLGD), which can be expressed as follows: Here, $L_{GJSD}$ is the regression loss based on GJSD, and $L_{GA}$ is the Gaussian angle loss, which addresses the angle confusion in square bounding boxes caused by the rotated box representation based on Gaussian distribution. Combining $L_{GJSD}$ and $L_{GA}$ can enhance the accuracy and performance of oriented object detection. When the aspect ratio of the bounding box approaches 1, $L_{GA}$ compensates for $L_{GJSD}$ and improves the detection accuracy for objects with square bounding boxes. Moreover, when the aspect ratio of the bounding box is far from 1, $L_{GA}$ approaches 0 and does not affect the value of the original loss function.

Datasets
In recent years, with the widespread application of remote sensing image object detection techniques across various fields, several institutions have collected and organized a wide variety of remote sensing datasets. In this paper, we conduct extensive experimental validation of the proposed method on two publicly available remote sensing image datasets, namely, the DOTA-v1.0 dataset [4] and the HRSC2016 dataset [11].

DOTA-v1.0
DOTA-v1.0 [4] is a dataset specifically designed for object detection in remote sensing images. The images in this dataset originate from a diverse range of sources, including aerial images collected by various sensors and platforms, which contribute to the rich scene variations and practical application value of the dataset. Furthermore, the objects within the images exhibit extensive variations in scale, orientation, and shape. It comprises 2806 images with a total of 188,282 object instances labeled using quadrilateral bounding boxes. The dataset contains 15 categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). The images in the DOTA-v1.0 dataset vary in size from 800 × 800 to 4000 × 4000 pixels. The dataset is divided into training, validation, and test sets (1411, 458, and 937 images, respectively). The training and validation sets provide the object annotations, while the test set does not provide ground-truth labels. After testing, the predicted bounding boxes are submitted to the official DOTA-v1.0 evaluation website to obtain the final detection results.

HRSC2016
HRSC2016 [11] (High-Resolution Ship Collection 2016) is a dataset of high-resolution remote sensing ship images. We chose this dataset to test our model because of the complex backgrounds of its images, the strong similarity between ships and coastal textures, and the diverse scales of the ships present in the same image. It consists of 1070 images containing 2976 ship objects annotated in the rectangular box format. The ships vary in size, appearance, and orientation, with over 20 different types represented. The image sizes range from 300 × 300 to 1500 × 900 pixels, with the majority exceeding 1000 × 600 pixels. The spatial resolution of the images is between 0.3 and 3 m. The dataset includes 626 images in the training set and 444 images in the test set, all of which are accompanied by ground-truth labels.

Implementation Details
Our experiments were conducted on the Ubuntu 18.04 operating system with a single NVIDIA Tesla T4 GPU. We used PyTorch 1.9.0 as the deep learning framework and Python 3.8 as the development language. Oriented R-CNN [10] was used as the object detection framework. ResNet50 and ResNet101 [56], pre-trained on ImageNet [57], were used as the backbones. Horizontal and vertical flips were used as data augmentation during training. The batch size was set to 2. We optimized the overall network using the stochastic gradient descent (SGD) algorithm, with a momentum of 0.9 and a weight decay of 0.0001. The remote sensing image object detection algorithms were evaluated using the mean average precision (mAP), the number of parameters (Params), and the floating-point operations (FLOPs).
On the DOTA-v1.0 dataset, the original images were cropped into 1024 × 1024 patches with a cropping stride of 824, resulting in a 200-pixel overlap between two adjacent patches. The number of epochs was set to 12 for the DOTA-v1.0 dataset. The initial learning rate was set to 0.005 and divided by 10 at epochs 8 and 11.
On the HRSC2016 dataset, the short side of the original images was resized to 800 pixels, while the long side was kept to a maximum of 1333 pixels, preserving the original aspect ratio of the images. The number of epochs was set to 36 for the HRSC2016 dataset. The initial learning rate was set to 0.005 and divided by 10 at epochs 24 and 33.
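The results on HRSC2016 are later reported as mAP(07) and mAP(12). Assuming these refer to the PASCAL VOC 2007 (11-point interpolation) and PASCAL VOC 2012 (area under the precision envelope) definitions of average precision, as is customary for this dataset, the per-class computation can be sketched as follows. The function name and the toy precision-recall values are illustrative only.

```python
def voc_ap(recalls, precisions, use_07_metric=False):
    """Average precision from a precision-recall curve sorted by score.

    use_07_metric=True  -> 11-point interpolation (VOC 2007 style)
    use_07_metric=False -> area under the monotone precision envelope
    """
    if use_07_metric:
        ap = 0.0
        for i in range(11):
            threshold = i / 10.0
            # Best precision among points whose recall reaches the threshold.
            best = max((p for r, p in zip(recalls, precisions) if r >= threshold),
                       default=0.0)
            ap += best / 11.0
        return ap

    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left (the envelope).
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))

# Toy curve: two detections, the first correct at recall 0.5, the second at recall 1.0.
recalls = [0.5, 1.0]
precisions = [1.0, 0.5]
ap12 = voc_ap(recalls, precisions)
ap07 = voc_ap(recalls, precisions, use_07_metric=True)
print(ap12, ap07)
```

The mAP is then the mean of the per-class AP values. The two definitions generally give different numbers for the same detections, which is why both are reported.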
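The cropping described above can be sketched as a sliding window along each image axis. This is an illustration, not the exact preprocessing tool used in the experiments: the handling of the image border (shifting the last window back so that it ends at the border) is an assumption, and `window_starts` is a hypothetical helper.

```python
def window_starts(length, patch=1024, stride=824):
    """Start offsets of the crop windows along one image axis.

    Adjacent windows overlap by patch - stride pixels (200 by default).
    Assumption: if the regular grid does not reach the border, one extra
    window is added that ends exactly at the border.
    """
    if length <= patch:
        return [0]                      # the image fits into a single patch
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:
        starts.append(length - patch)   # cover the remaining border strip
    return starts

def crop_boxes(width, height, patch=1024, stride=824):
    """All (left, top, right, bottom) crop rectangles for one image."""
    return [(x, y, min(x + patch, width), min(y + patch, height))
            for y in window_starts(height, patch, stride)
            for x in window_starts(width, patch, stride)]

print(window_starts(4000))           # offsets along a 4000-pixel side
print(len(crop_boxes(4000, 4000)))   # number of patches for a 4000 x 4000 image
```

With these settings a 4000 × 4000 image yields five offsets per axis and therefore 25 patches, while an 800 × 800 image is kept as a single patch.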
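The resizing rule and the step learning-rate schedule can be written down compactly. The sketch below is illustrative: the rounding convention and the epoch indexing (decay applied from the listed epoch index onward, counting from 0) are assumptions, and both function names are hypothetical.

```python
def rescale_size(width, height, short_side=800, long_side_max=1333):
    """Resize so the short side becomes `short_side`, unless that would push
    the long side beyond `long_side_max`; the aspect ratio is preserved."""
    scale = min(short_side / min(width, height), long_side_max / max(width, height))
    return int(width * scale + 0.5), int(height * scale + 0.5)

def step_lr(epoch, base_lr=0.005, milestones=(24, 33), gamma=0.1):
    """Learning rate at a given epoch: multiplied by gamma at each milestone."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# HRSC2016 image sizes mentioned in the text.
print(rescale_size(300, 300))     # short side reaches 800
print(rescale_size(1500, 900))    # long side capped at 1333

# HRSC2016 schedule (36 epochs) and DOTA-v1.0 schedule (12 epochs).
print([step_lr(e) for e in (0, 24, 33)])
print([step_lr(e, milestones=(8, 11)) for e in (0, 8, 11)])
```

For wide images such as 1500 × 900, the cap on the long side is the binding constraint, so the short side ends up at or slightly below 800 pixels.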

Results on the DOTA-v1.0 Dataset
We compare the proposed model with other advanced remote sensing image object detection algorithms on the DOTA-v1.0 dataset. Table 1 shows the experimental results, including the average precision (AP) of each of the 15 classes and the mean average precision (mAP) over all classes. R-50 and R-101 denote the ResNet-50 and ResNet-101 backbones, respectively. Among the compared algorithms, our method achieves the highest detection accuracy of 77.29% mAP. Compared to the baseline, our method improves the mAP by 1.15% and 1.12% when using ResNet-50 and ResNet-101 as the backbone, respectively. These results demonstrate the effectiveness of our method in addressing the problem of large scale variation in remote sensing image object detection and in improving detection accuracy. The comparison of the algorithms on the DOTA-v1.0 dataset is visualized in Figure 8. Our method outperforms the other models in terms of AP for eight classes (PL, BR, SV, LV, SH, TC, BC, and ST), and its AP values for the remaining classes are generally higher than those of the other models. This suggests that our approach is effective at enhancing object detection in remote sensing images.

Visualization Results
The detection results of the proposed model are visualized in Figure 10. As can be seen, there are significant size differences between objects of different categories and between objects of the same category in remote sensing images. The detection algorithm must therefore be highly adaptable to significant scale changes, and the model should be capable of detecting both tiny and very large objects. Figure 10 illustrates the detection results of our method in various multi-scale scenarios. The proposed approach can effectively detect objects with significant differences in scale, including objects of the same class that vary in size, such as planes and ships, as well as objects of different classes that vary significantly in size, such as vehicles and baseball diamonds or ships and harbors. These results demonstrate that the proposed model has a strong ability to detect multi-scale objects.
The proposed multi-scale feature interaction (MSFI) method achieves a detection accuracy of 76.81% mAP on the DOTA-v1.0 dataset, which is an improvement of 0.98% over the baseline. In addition, on the HRSC2016 dataset, mAP(07) and mAP(12) reach 90.50% and 97.52%, respectively, both higher than the baseline. This indicates that our method enables the network to effectively locate and identify objects with drastic scale changes by utilizing the spatial and semantic information of multi-level features.
The regression loss based on Gaussian distribution (RLGD) achieves a detection accuracy of 76.70% mAP on the DOTA-v1.0 dataset, a 0.87% improvement over the baseline. Moreover, the model outperforms the baseline on the HRSC2016 dataset, with 90.35% mAP(07) and 96.62% mAP(12). These results demonstrate that the proposed algorithm, by adopting the normalized generalized Jensen-Shannon divergence with Gaussian angle loss as the regression loss function, makes the model scale-invariant and improves object detection in remote sensing images across multiple scales.
By combining MSFI and RLGD, the mAP on the DOTA-v1.0 dataset improves to 76.98%, an increase of 1.15 percentage points over the baseline. Meanwhile, mAP(07) and mAP(12) on the HRSC2016 dataset are 90.55% and 97.67%, improvements of 0.32% and 1.56% over the baseline, respectively. These results show that the proposed improvements effectively enhance object detection in remote sensing images. To further explore the benefits of each component of the proposed algorithm, more detailed experiments and discussions were carried out.

Effect of the MSFI
In this paper, we propose a multi-scale feature interaction (MSFI) network that includes a cross-layer feature interaction (CLFI) module, consisting of a grounding feature interaction (GFI) module and a rendering feature interaction (RFI) module, as well as a spatial feature interaction (SFI) module. FLOPs are computed for an input size of 1024 × 1024. Table 4 presents the results of the ablation experiments. The detection accuracy improves with the addition of the GFI, RFI, and CLFI modules, which achieve mAP values of 76.42%, 76.07%, and 76.51%, respectively, all exceeding the baseline. This shows that the cross-layer feature interaction method effectively integrates deep and shallow features, capturing comprehensive information about the object. Additionally, the SFI module achieves a detection accuracy of 76.39% mAP, which is also higher than the baseline. This indicates that the method effectively utilizes spatial domain information to enhance features and suppress background noise.
Therefore, each sub-module of the MSFI algorithm improves object detection in remote sensing images. The MSFI effectively combines these sub-modules and improves the mAP with only a slight increase in computation.

Effect of the RLGD
This paper proposes a regression loss algorithm based on Gaussian distribution (RLGD), which includes an oriented bounding box representation based on the Gaussian distribution (GD), a regression loss function based on the generalized Jensen-Shannon divergence (GJSD), and a Gaussian angle loss for the angle confusion problem (GA). The ablation experiment results are presented in Table 5. The baseline model uses the smooth-L1 regression loss function [10]. GWD denotes the oriented bounding box representation based on the Gaussian distribution with the normalized Wasserstein distance as the regression loss function [7]. The GWD method achieves 75.91% mAP, which is higher than the baseline model. Transforming the regression problem into a measure of similarity between Gaussian distributions helps improve detection accuracy by mitigating the negative effects of scale and unit differences between parameters. Moreover, the GJSD method achieves 76.12% mAP, which is also an improvement over the baseline model. This suggests that the scale-invariant GJSD-based regression loss can enhance the scale adaptation ability of the model. In addition, when GA is introduced, the mAP of the RLGD model improves to 76.70%, which is 0.87% higher than the baseline model. This indicates that the Gaussian angle loss effectively addresses the angle confusion associated with the Gaussian representation of square bounding boxes, further enhancing object detection.
Consequently, all components of the RLGD algorithm improve object detection in remote sensing images, and their combination yields the largest improvement in detection accuracy.

Visualization Analysis
In this section, we compare the detection results of our method with those of the baseline Oriented R-CNN, as shown in Figure 11. Figure 11 displays three groups of detection results, each containing several ship objects. The objects within the same category vary significantly in size, reflecting the large scale changes of objects in remote sensing images. The first row displays the baseline results, where the small ships in the selected and enlarged boxes are not detected. In contrast, the second row shows the results of our method, which accurately detects the small ships in the selected and enlarged boxes and thus reduces the number of missed detections. Our method detects several small ships in these figures while still detecting the large ships, making it more effective in multi-scale object detection than the baseline.

Limitations
This paper explores the issue of multi-scale object detection in remote sensing images. While some progress has been made in improving the performance of detection algorithms, there are still open issues.
One such issue is that the multi-scale feature interaction model achieves feature interaction through the attention mechanism. This enhances detection accuracy but has a large impact on model speed due to the large number of parameters and high computational complexity of the mechanism. In future work, the attention mechanism can be improved to optimize the computation while maintaining detection accuracy.
Additionally, the algorithm for regression loss based on Gaussian distribution uses Gaussian representation, which requires additional matrix transformations and may increase the computational load. Moreover, this method is unsuitable for detecting objects with large aspect ratios. Therefore, further optimization is necessary to reduce the computational effort caused by matrix transformations and improve the effect of object detection with large aspect ratios.

Conclusions
In this paper, we propose a multi-scale object detection algorithm based on feature interaction and Gaussian distribution to address the problem of scale variation in objects within remote sensing images. In the proposed method, the multi-scale feature interaction (MSFI) network combines feature interactions at the feature layer and in the spatial domain so that the network can make full use of the spatial and semantic information of multi-level features. The proposed regression loss (RLGD) algorithm replaces the rotated bounding box representation with a representation based on Gaussian distribution and improves it with a scale-invariant regression loss. Experimental results demonstrate that our approach can effectively improve multi-scale object detection. Compared to other advanced algorithms, our method shows superior detection results on remote sensing image datasets.

Figure 1. Large-scale variation in objects within remote sensing images.

Figure 2. Overall framework of the proposed algorithm.

Figure 3. Overall architecture of the multi-scale feature interaction network. (a) Multi-scale feature interaction network; (b) grounding feature interaction module; (c) rendering feature interaction module; (d) spatial feature interaction module.

Figure 4. The foundation for feature interaction in the CLFI.

Figure 5. The foundation for feature interaction in the SFI.

$C'_5$ is convolved with a 3 × 3 kernel and a stride of 2 to obtain $C'_6$. $C'_2 \sim C'_6$ have the same number of channels but different resolutions.

Figure 6. Multi-scale feature interaction process. (a) The lateral connection and top-down pathways; (b) multi-scale feature interaction from $C'_i$ to $P_i$ at the i-th layer. The multi-scale feature interaction is performed on the feature map pyramid $C'$, whose levels have the same number of channels but different resolutions.

Figure 9. Detection results of different algorithms on the HRSC2016 dataset.

Figure 10. Visualization of remote sensing image object detection.
In order to verify the effectiveness of each component of the proposed algorithm, ablation experiments were conducted on the DOTA-v1.0 dataset and HRSC2016 dataset. The baseline used was Oriented R-CNN with ResNet-50 serving as the backbone for feature extraction. The results of the ablation experiments are presented in Table 3.

Figure 11. Comparison of detection results between the baseline and our method.

Table 1. Detection results of different algorithms on the DOTA-v1.0 dataset (%). Bold indicates the highest detection accuracy for a class or over all classes.

Table 2. Detection results of different algorithms on the HRSC2016 dataset (%). Bold indicates the highest detection accuracy.

Table 3. Ablation experimental results of our method on the DOTA-v1.0 dataset and HRSC2016 dataset (%). Bold indicates the highest detection accuracy. ✓ indicates that the algorithm adopts this module.

Table 4. Ablation experimental results of the MSFI network on the DOTA-v1.0 dataset. Bold indicates the highest detection accuracy. ✓ indicates that the algorithm adopts this module.

Table 5. Ablation experimental results of the RLGD algorithm on the DOTA-v1.0 dataset and HRSC2016 dataset (%). Bold indicates the highest detection accuracy. ✓ indicates that the algorithm adopts this module.