Reverse Difference Network for Highlighting Small Objects in Aerial Images

Abstract: The large-scale variation issue in high-resolution aerial images significantly lowers the accuracy of segmenting small objects. For a deep-learning-based semantic segmentation model, the main reason is that the deeper layers generate high-level semantics over considerably large receptive fields, thus improving the accuracy for large objects but ignoring small objects. Although the low-level features extracted by shallow layers contain small-object information, large-object information has predominant effects. When the model, using low-level features, is trained, the large objects push the small objects aside. This observation motivates us to propose a novel reverse difference mechanism (RDM). The RDM eliminates the predominant effects of large objects and highlights small objects from low-level features. Based on the RDM, a novel semantic segmentation method called the reverse difference network (RDNet) is designed. In the RDNet, a detailed stream is proposed to produce small-object semantics by enhancing the output of the RDM. A contextual stream for generating high-level semantics is designed by fully accumulating contextual information to ensure the accuracy of the segmentation of large objects. Both high-level and small-object semantics are concatenated when the RDNet performs predictions. Thus, both small- and large-object information is depicted well. Two semantic segmentation benchmarks containing vital small objects are used to fully evaluate the performance of the RDNet. Compared with existing methods that exhibit good performance in segmenting small objects, the RDNet has lower computational complexity and achieves 3.9–18.9% higher accuracy in segmenting small objects.


Introduction
With the improvement of the spatial resolution of aerial images, footpath (or cycle track)-level objects are recorded well. As shown in Figure 1, the bikes, drones, and pedestrians are of footpath-level sizes. Some applications, including urban monitoring, military reconnaissance, and national security, have urgent needs in terms of identifying small targets [1]. For example, pedestrian information is not only a data source for constructing urban human-flow patterns [2] but also useful for safe landing [3]. However, the identification of footpath-level small targets encounters large-scale variation problems. Figure 1 shows the large-scale variations in aerial image datasets, including UAVid [4] and Aeroscapes [5]. In UAVid, pedestrians (Figure 1a,b) are considerably smaller than trees and roads. In Aeroscapes, bicycles (Figure 1c) and drones (Figure 1d) are considerably smaller than roads and cars. This large-scale variation significantly lowers the accuracy of segmenting smaller objects. For example, in the published evaluations of Aeroscapes, the intersection-over-union (IoU) score of the bike category with the smallest size is only 15%, whereas that for the sky category with large objects is 94% [5].
The reason why the large-scale variation issue results in low accuracy of the segmentation of small objects has been studied by [6]. That is, most state-of-the-art methods, such as the pyramid scene parsing network (PSPNet) [7] and point-wise spatial attention network (PSANet) [8], focus on the accumulation of contextual information over significantly large receptive fields to generate high-level semantics. The high-level semantics (Figure 2c,e) extracted using deep convolutional neural network (CNN) layers mainly depict the holistic information of the large objects and ignore small objects [9,10]. Therefore, large objects achieve high accuracy; however, small objects have extremely low accuracy.

Figure 2. The saliency maps of the low-level features and the high-level features (semantics). The low-level features (ResNet18 [11]) and the high-level features (ResNet18) are extracted by the first and fourth inner layers of ResNet18, respectively. The low-level features (BiSeNetV2) and the high-level semantics (BiSeNetV2) are extracted by the detail and semantic branches of BiSeNetV2, respectively.

Fortunately, the low-level features generated by the shallow layers contain small-object information, as discovered by [9,10]. Consequently, several methods, such as the bilateral segmentation network (BiSeNet) [12], BiSeNetV2 [13], and context aggregation network (CAgNet) [14] with dual branches, have been proposed and applied to remote sensing tasks. These methods set up a branch with shallow layers to extract low-level features that contain small-object information. The low-level features and the high-level semantics extracted by the other branch are fused to make the final prediction. These methods improve the accuracy of the segmentation of small objects to some extent; however, their effects are limited. This is because the low-level features (Figure 2b,d) are a mixture of both the large and small objects.
Specifically, the object details presented in Figure 2d are mainly for large objects, and we can hardly find the details for pedestrians. The low-level features extracted by shallow layers do not eliminate the predominant effects of large objects. Consequently, large objects push small objects aside when these models are trained.
Different from the dual-branch networks, the study in [15] uses holistically nested edge detection (HED) [16] to produce closed contours with deep supervision. Semantic segmentation is obtained using SegNet [17] with the help of contours. This study achieves acceptable accuracy in the segmentation of relatively small objects (cars) in the ISPRS two-dimensional semantic labeling datasets (the Potsdam and Vaihingen datasets). However, with the improvement of spatial resolution, cars are no longer small objects in aerial images. For example, in the UAVid [4] and Aeroscapes [5] datasets, a range of objects (bikes, drones, and obstacles) that are considerably smaller than cars exist. Moreover, the contours extracted by HED exist for both large and small objects. Thus, the use of HED does not change the predominant relationship between large and small objects. Furthermore, the study in [15] uses a normalized digital surface model (nDSM) and DSM as the input of one of its model branches. However, nDSM and DSM are not always provided; thus, its application is limited.
To the best of our knowledge, the aforementioned small-object problem remains unsolved. We propose a reverse difference mechanism (RDM) to highlight small objects to address this issue. RDM can alter the predominant relationship between large and small objects. Thus, when the model is trained, small objects will not be pushed by large objects. RDM excludes large-object information from low-level features via the guidance of high-level semantics. The low-level features, which are a mixture of both large and small objects, can be the features produced by any shallow layer. The high-level semantics can be the features generated by any deep layer with large receptive fields. We design a novel neural architecture called a reverse difference network (RDNet) based on RDM. In RDNet, a detailed stream (DS) followed by RDM is proposed to obtain small-object semantics. Furthermore, a contextual stream (CS) is designed to generate high-level semantics to ensure sufficient accuracy in the segmentation of large objects. Both the small-object and high-level semantics are concatenated to make a prediction. The code of the RDNet will be available at https://github.com/yu-ni1989/RDNet. The contributions of this study are as follows.

• A reverse difference mechanism (RDM) is proposed to highlight small objects. RDM aligns the low-level features and high-level semantics and excludes the large-object information from the low-level features via the guidance of high-level semantics. Small objects are preferentially learned during training via RDM.
• Based on the RDM, a new semantic segmentation framework called RDNet is proposed. The RDNet significantly improves the accuracy of the segmentation of small objects. The inference speed and computational complexity of RDNet are acceptable for a resource-constrained GPU facility.
• In RDNet, the DS and CS are designed. The DS obtains more semantics for the outputs of RDM by modeling both spatial and channel correlations. The CS, which ensures sufficient accuracy in the segmentation of large objects, produces high-level semantics by enlarging the receptive field. Consequently, higher accuracy in the segmentation of both small and large objects is achieved.

Related Work
Semantic segmentation can be considered a dense prediction problem, which was first solved by a fully convolutional network (FCN) [18]. Based on the FCN framework, a wide range of semantic segmentation networks has been proposed. In this study, we categorize these semantic segmentation networks into two groups: traditional and small-object-oriented methods. Additionally, semantic segmentation networks specific to remote sensing are reviewed.

Traditional Methods
Traditional semantic segmentation networks focus on overall accuracy improvement and do not consider the scale variation issue. These methods satisfy this objective primarily by solving the lack of contextual information in the FCN, which is caused by local receptive fields and short-range contextual information [19]. PSPNet [7], DeepLabv3 [20], and DeepLabv3+ [21] are designed with pyramid structures that pool the feature map into multiple resolutions to enlarge receptive fields. Furthermore, connected component labeling (CCL) [22] and a dual-graph convolutional network (DGCNet) [23] satisfy the same objectives in other ways. CCL designs a novel context that contrasts local features, and DGCNet uses two orthogonal graphs to model both spatial and channel relationships. To generate dense contextual information, an attention mechanism which facilitates the modeling of long-range dependency is commonly used. Representative methods include nonlocal networks [24], CCNet [25], dual attention network (DANet) [26], expectation-maximization attention network (EMANet) [27], and squeeze-and-attention network (SANet) [28]. We do not review them in detail because they share similar self-attention mechanisms. Recently, vision transformers (ViTs) [29] extended the attention mechanism. Then, a range of networks based on the transformer mechanism, such as MobileViT [30], BANet [31], and semantic segmentation by early region proxy [32], were proposed. The transformer-based methods have fewer parameters, but they consume more GPU memory and rely on powerful GPU facilities.
Furthermore, numerous networks have focused on reducing the complexity of the model and accelerating the inference speed. To meet this objective, existing studies have replaced time-consuming backbones with lightweight backbones. For example, PSPNet [7] uses ResNet50 or ResNet101 [11] as its backbone, so both the training and inference of PSPNet are time-consuming. Recent real-time semantic segmentation studies, such as BiSeNet [12] and SFNet [33], replace ResNet101 with ResNet18 and obtain a better trade-off between inference speed and accuracy. In addition to ResNet18, recent studies have proposed a range of novel backbone networks to address real-time issues. The representative methods include the MobileNet series [34,35], GhostNet [36], Xception [37], the EfficientNet series [38,39], and STDC (Short-Term Dense Concatenate) [40]. Specifically, the Xception and MobileNet series use shortcut connections and depth-wise separable convolutions. The EfficientNet series also use depth-wise separable convolutions; however, they focus on model scaling and achieve a good balance among the network depth, width, and resolution. GhostNet generates more feature maps from inexpensive operations with a series of linear transformations. STDC gradually reduces the dimensions of the feature maps and uses their aggregation for image representation. In this study, representative methods among them are tested in the experiments.

Small-Object-Oriented Methods
Recently, small-object-oriented methods, such as the gated fully fusion network (GFFNet) [6], have been introduced in several studies. GFFNet improves the accuracy of segmenting small/thin objects using a gated fully fusion (GFF) module to fuse multiple-level features. Then, multiple-level features can contribute to the segmentation simultaneously. Similar to GFFNet, SFNet [33] fuses the multiple-level features using an optical flow idea. It proposes a flow alignment module (FAM) to learn semantic flow between feature maps of adjacent levels. However, GFFNet and SFNet do not change the predominant effect of the large objects. Therefore, the improvements in the segmentation of small objects are limited.
Except for GFFNet and SFNet, most methods, such as BiSeNet [12], BiSeNetV2 [13], CAgNet [14], attentive bilateral contextual networks (ABCNet) [41], and the classification with an edge [15], improve the accuracy of the segmentation of small objects by integrating object details into the convolutions. BiSeNet, BiSeNetV2, CAgNet, and ABCNet use dual branches to extract both object details and high-level semantics. Specifically, ABCNet integrates the self-attention mechanism into the dual-branch framework; thus, it is also a self-attention-based method. Classification with an edge [15] employs HED [16] to extract closed contours of dominant objects and makes predictions with the help of nDSM and DSM. Because the method in [15] requires diverse data inputs, it is difficult to compare it with other methods.

Semantic Segmentation Networks in Remote Sensing
In the remote sensing field, the studies by [42,43] lay the foundation for semantic segmentation. Subsequently, several excellent methods were proposed [44,45]. Among these, the study by [46] proposes a set of distance maps to improve the performance of deep CNNs. The local attention-embedding network (LANet) [47], relation-augmented FCN (RA-FCN) [48], and cross-fusion net [49] further develop the attention mechanism. The study by [50] argues that existing methods lack foreground modeling and proposes a foreground-aware relation mechanism.
In terms of high-resolution aerial image datasets, Aeroscapes [5] and UAVid [4], based on unmanned aerial vehicle (UAV) images that provide complex scenes with oblique views, include a large range of small objects. Aeroscapes contains a large range of scenes, including rural landscapes, zoos, human residences, and the sea; however, scenes near urban streets are not included. The UAVid [4] dataset was published to address this issue. Coupled with the UAVid dataset, the study of [4] proposes a multiscale-dilation net. Experiments demonstrate its superiority over existing algorithms. However, the accuracy of segmenting UAV images is relatively low compared with that of segmenting natural images. This issue remains unsolved.

RDNet
The RDNet is constructed as shown in Figure 3. Because Figure 3 shows the neural architecture, only the RDM, DS, CS, and the loss functions need to be presented in detail. For the backbone, although the experimental tests (Section 5.3) demonstrate the superior performance of ResNet18 in RDNet, we do not specify the backbone network. Any general-purpose network, such as Xception [37], MobileNetV3 [35], GhostNet [36], EfficientNetV2 [39], and STDC [40], can be integrated into RDNet, and we cannot ensure that ResNet18 achieves superior results compared with future work. The aforementioned general-purpose networks can be easily divided into four main layers (LY_1, LY_2, LY_3, and LY_4). If the input image patch has H × W pixels, LY_1, LY_2, LY_3, and LY_4 extract features with reduced resolutions of 1/4, 1/8, 1/16, and 1/32 of H × W, respectively, as shown in Figure 3. Notably, each of LY_1, LY_2, LY_3, and LY_4 contains several sublayers. For example, ResNet18 is well known to have four main layers. If we combine the convolution, batch norm, ReLU, and max-pooling operators at the beginning of ResNet18 with the first layer, the inner layers of ResNet18 can be divided into LY_1, LY_2, LY_3, and LY_4, which produce features with the reduced resolutions above. Given an input image patch I ∈ R^(3×H×W), the backbone network is performed, and a set of features is generated. We select the features f^b_1 and f^b_2 generated by LY_1 and LY_2 as the low-level features. We present the saliency maps of the features extracted by LY_1, LY_2, LY_3, and LY_4 in ResNet18 in Figure 4 to demonstrate the rationale behind the selection of low-level features. From Figure 4b,c, f^b_1 and f^b_2 produced by ResNet18 are mixtures of both the large and small objects. Meanwhile, the features produced by the deeper layers mainly depict large objects.

RDM
The low-level features f^l ∈ R^(C_l×H_l×W_l) (such as f^b_2 in Figure 4) extracted by shallow layers are a mixture of both the large and small objects. The high-level semantics f^h ∈ R^(C_h×H_h×W_h) generated by the deep layers primarily depict large objects (see the saliency map of f^h in Figure 4). RDM attempts to exclude large objects from f^l via the guidance of f^h; thus, the predominant effects of large objects are eliminated. The principle of RDM differs from the dual-branch framework and skip connection, which have some positive effects on the segmentation of small objects. However, they do not eliminate the predominant effects of large objects. Consequently, when these models are trained, large objects push small objects aside. RDM eliminates the predominant effects of large objects using the idea of alignment and difference, as shown in Figure 5. The key innovations are two-fold: (1) the reverse difference concept and (2) the semantic alignment between f^l and f^h. The reverse difference concept lays the foundation for RDM. Considering the cosine alignment branch (see Figure 5) as an instance, after f^l and f^h are aligned, the difference feature

ReLU(S(f^l) − S(f^h_cos))

is produced by a difference operator. Here, S(·) is the sigmoid function, which transforms the intensity values in f^l and f^h_cos into the interval from 0 to 1. Notably, the difference must subtract f^h_cos from f^l. Only in this manner can numerous intensity values at the positions of large objects in S(f^l) − S(f^h_cos) be negative. Subsequently, we use the ReLU function to set all the negative values to zero. Consequently, the large-object information is washed out, and the small-object information is highlighted.
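To make the mechanism concrete, the cosine-branch difference above can be sketched in a few lines of NumPy; the toy features, sizes, and values below are ours, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reverse_difference(f_l, f_h_cos):
    """ReLU(S(f_l) - S(f_h_cos)): positions dominated by large objects
    (strong high-level response) become negative and are zeroed out."""
    diff = sigmoid(f_l) - sigmoid(f_h_cos)
    return np.maximum(diff, 0.0)  # ReLU

# Toy example: one channel on a 4x4 plane. The top-left block is a "large
# object" that responds strongly in both features; the single bright pixel
# at (3, 3) is a "small object" visible only in the low-level feature map.
f_l = np.zeros((1, 4, 4)); f_l[0, :2, :2] = 3.0; f_l[0, 3, 3] = 3.0
f_h = np.zeros((1, 4, 4)); f_h[0, :2, :2] = 3.0  # large object only

out = reverse_difference(f_l, f_h)
print(out[0, 0, 0])      # large-object position is washed out -> 0.0
print(out[0, 3, 3] > 0)  # small-object position survives -> True
```

The subtraction order matters: computing S(f^h_cos) − S(f^l) instead would highlight large objects and discard small ones.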
For semantic alignment between f^l and f^h, f^h typically has more channels and lower resolution than f^l. Up-sampling and down-sampling can change the resolution but fail to change the number of channels. Even if they have the same number of channels, we cannot ensure semantic alignment between f^l and f^h. For example, the i-th channel in f^h is more likely to contain a specific category of large objects. Does the i-th channel in f^l contain similar information of the same categories as the i-th channel in f^h? If not, how do we ensure that the reverse difference mechanism is in effect? RDM provides two alignment modules to align the semantics from different perspectives by fully modeling the relationship between f^h and f^l, as shown in Figure 5. The aligned high-level semantics produced by the cosine and neural alignments are f^h_cos and f^h_neu, respectively. The difference features f^d ∈ R^(2C_l×H_l×W_l) produced by RDM are computed as

f^d = Cat(ReLU(S(f^l) − S(f^h_cos)), ReLU(S(f^l) − S(f^h_neu))),

where Cat(·) denotes the concatenation. In the following sections, the details of the cosine and neural alignments are presented.

Cosine Alignment
The cosine alignment presented in Algorithm 1 determines the relationship between f^h and f^l without learning from the data. It has a mathematical explanation based on cosine similarity. First, we down-sample f^l as f^l_down ∈ R^(C_l×H_h×W_h), which has the same resolution as f^h. Subsequently, f^h ∈ R^(C_h×H_h×W_h) and f^l_down are converted into vector forms in R^(C_h×N_h) and R^(C_l×N_h), where N_h = H_h × W_h. The cosine similarity between each pair of channels in f^h and f^l_down is then computed as

sim(i, j) = (f^l_down[i] · f^h[j]) / (||f^l_down[i]|| × ||f^h[j]||),   (4)

where f^l_down[i] and f^h[j] are vectors that belong to the i-th and j-th channels in f^l_down and f^h, respectively; "·" is the dot product between the vectors, and "×" is the product between the scalars. After Equation (4) is performed for all pairs of vectors in f^h and f^l_down, a similarity matrix M_sim ∈ R^(C_l×C_h) is constructed. To facilitate the implementation, we do not compute the cosine similarity per element. The matrix multiplication "*" can be used to obtain M_sim as

M_sim = Norm(f^l_down) * Norm(f^h)^T,   (5)

where Norm(·) is the l2 normalization of each channel in f^l_down and f^h, and f^h^T denotes the transpose of f^h. The l2 normalization is essential because it ensures that the matrix multiplication is equal to the cosine similarity.

Algorithm 1 Cosine alignment
Require: The low-level features f^l ∈ R^(C_l×H_l×W_l) and the high-level semantics f^h ∈ R^(C_h×H_h×W_h)
Ensure: The aligned f^h (denoted by f^h_cos ∈ R^(C_l×H_l×W_l))
1: Down-sample f^l as f^l_down ∈ R^(C_l×H_h×W_h).
2: Convert f^l_down and f^h into vector forms in R^(C_l×N_h) and R^(C_h×N_h).
3: Produce the similarity matrix M_sim ∈ R^(C_l×C_h) using f^l_down and f^h as Equation (5).
4: Obtain f^h_cos ∈ R^(C_l×N_h) using M_sim and f^h as Equation (6).
5: Transform and up-sample f^h_cos ∈ R^(C_l×N_h) such that f^h_cos ∈ R^(C_l×H_l×W_l).
6: return f^h_cos

A Softmax function is performed along the C_h dimension to render M_sim more representable. This transforms the elements in M_sim into a set of normalized weights for the channels of f^h. Finally, M_sim and f^h are multiplied as follows:
f^h_cos = M_sim * f^h,   (6)

where f^h_cos ∈ R^(C_l×N_h) is the aligned f^h. The multiplication * reduces the number of channels in f^h such that f^h_cos has the same number of channels as f^l. Furthermore, using Equation (6), we can obtain an alignment effect. As shown in Figure 6, taking the value f^h_cos(i, k) on the i-th row and k-th column of f^h_cos as an example, f^h_cos(i, k) is computed as

f^h_cos(i, k) = Σ_{j=1}^{C_h} M_sim(i, j) f^h(j, k),   (7)

where f^h(j, k) is the value on the j-th row and k-th column in f^h, and M_sim(i, j) is the value on the i-th row and j-th column in M_sim. Consequently, f^h_cos(i, k) is a weighted average of f^h(j, k), j = 1, ..., C_h. If the j-th channel in f^h is more similar to the i-th channel in f^l_down, M_sim(i, j) will be larger. Therefore, the equation gives larger weights to the channels in f^h that are more similar to the i-th channel in f^l_down. As a result, the cosine alignment tries its best to meet the objective that the i-th channel in f^l_down and the i-th channel in f^h_cos contain similar large-object information. Finally, f^h_cos is transformed back into the image plane and up-sampled such that f^h_cos ∈ R^(C_l×H_l×W_l). f^h_cos is the aligned high-level semantics produced by the cosine alignment.
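The whole cosine alignment reduces to two matrix products; a minimal NumPy sketch, with randomly generated features standing in for f^l_down and f^h in vector form, is:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cosine_alignment(f_l_down, f_h):
    """Align f_h (C_h x N_h) to the channels of f_l_down (C_l x N_h);
    returns f_h_cos with shape (C_l, N_h)."""
    # l2-normalize each channel so the matrix product equals cosine similarity
    norm_l = f_l_down / np.linalg.norm(f_l_down, axis=1, keepdims=True)
    norm_h = f_h / np.linalg.norm(f_h, axis=1, keepdims=True)
    m_sim = norm_l @ norm_h.T          # (C_l, C_h) similarity matrix
    m_sim = softmax(m_sim, axis=1)     # normalized weights over C_h
    return m_sim @ f_h                 # weighted averages of f_h channels

rng = np.random.default_rng(0)
f_l_down = rng.standard_normal((8, 64))   # C_l = 8 channels, N_h = 64
f_h = rng.standard_normal((32, 64))       # C_h = 32 channels

f_h_cos = cosine_alignment(f_l_down, f_h)
print(f_h_cos.shape)  # (8, 64): same channel count as f_l_down
```

Because the Softmax rows are convex weights, every value f_h_cos(i, k) stays within the range of f_h(·, k), which matches the weighted-average reading of Equation (7).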

Neural Alignment
The neural alignment, which is shown in Algorithm 2, aligns the semantics between f^h and f^l using convolutions. First, a convolution layer with 1 × 1 kernels followed by an up-sampling operator is performed to compress f^h:

f^h_red = Up(W_neu[C_h, C_l, 1 × 1] ⊗ f^h),   (8)

where W_neu[C_h, C_l, 1 × 1] is the weight matrix in the convolution layer, ⊗ is the convolution, Up(·) is the up-sampling, and f^h_red ∈ R^(C_l×H_l×W_l) is the compressed f^h. Via Equation (8), f^h_red has the same number of channels as f^l. In fact, W_neu[C_h, C_l, 1 × 1] is a set of learnable projection bases that project f^h into the space we prefer. This procedure has a semantic alignment effect because the projection bases are iteratively optimized during training.

Algorithm 2 Neural alignment
Require: The low-level features f^l ∈ R^(C_l×H_l×W_l) and the high-level semantics f^h ∈ R^(C_h×H_h×W_h)
Ensure: The aligned f^h (denoted by f^h_neu ∈ R^(C_l×H_l×W_l))
1: Project and up-sample f^h as Equation (8) to generate f^h_red ∈ R^(C_l×H_l×W_l).
2: Generate the channel attention vector v_a ∈ R^(2C_l×1) via average pooling using both f^l and f^h_red.
3: Transform v_a into R^(C_l×1) and model the correlation between the channels of f^l and f^h_red based on v_a.
4: Obtain f^h_neu using v_a as Equation (9).
5: return f^h_neu

Then, a mutual channel attention is proposed to enhance the alignment. Traditional channel attention has been used to select informative features by introducing global weights for the channels of the input features [49,51]. However, traditional channel attention is a self-attention mechanism that has no alignment effect between features generated by different layers. Here, to consider the information in both f^l ∈ R^(C_l×H_l×W_l) and f^h_red ∈ R^(C_l×H_l×W_l), we concatenate f^l and f^h_red. Based on this, an average pooling layer is performed to generate a channel attention vector v_a ∈ R^(2C_l×1). Subsequently, a convolution layer with 1 × 1 kernels, a batch normalization, and a sigmoid activation are performed on v_a to model the correlations of the channels in f^l and f^h_red. As a result, v_a is transformed into v_a ∈ R^(C_l×1), which can be used directly to select the information of f^h_red by an element-by-element multiplication along the C_l channels. Concretely, using v_a, the aligned high-level semantics f^h_neu ∈ R^(C_l×H_l×W_l) are formulated as

f^h_neu = v_a ⊙ f^h_red,   (9)

where ⊙ is the element-by-element product along the C_l channels. Finally, f^h_neu is the output of the neural alignment.

DS
The f^d generated by RDM is the input to the DS. In RDNet, two RDM components produce the difference features f^d_1 and f^d_2, whose spatial sizes are H_l × W_l with H_l = (1/8)H and W_l = (1/8)W. f^d includes only small-object information with fewer semantics. The DS is proposed to obtain more semantics to facilitate prediction without adding excessive parameters.
The commonly used approach adds several convolutional layers with strides, as in ResNet18, to generate semantics. This approach significantly increases the number of parameters and the computational complexity. Here, we use depth-wise convolution to address these issues. Furthermore, when the stride is set to two or larger, the convolution enlarges the receptive field; however, some extremely small objects, such as pedestrians in UAVid, are ignored. Therefore, we do not use any strides. Additionally, we do not know which kernel size is most suitable. A larger kernel size suppresses small objects, whereas a smaller kernel size generates less semantic information. Here, the DS designs two convolutional branches using 1 × 1 and 3 × 3 kernels to alleviate the negative effects produced by a fixed kernel size, as shown in Figure 7. The first branch contains only a single convolutional layer with 1 × 1 kernels and is simple and lightweight. It convolves f^d using a small kernel size and creates a correlation between all the channels. We use f^d_1×1 ∈ R^(C_d×H_l×W_l) to denote the output of the first branch. The second branch convolves f^d using a larger kernel size. It makes spatial correlations using a depth-wise convolution with 3 × 3 kernels and makes a cross-channel correlation using shared kernels. Specifically, f^d is first processed using a depth-wise convolutional layer to obtain a spatially correlated feature map f_sp ∈ R^(C_d×H_l×W_l). Subsequently, a cross-channel correlation is performed. This adaptively pools f_sp into a feature vector v_sp ∈ R^(C_d×1). A two-dimensional convolutional layer with padding and 3 × 3 kernels, followed by a sigmoid activation, is then used to make correlations along the channels:

v_cc = S(W_c[2C_l, 2C_l, 3 × 3] ⊗ P(v_sp)),

where v_cc ∈ R^(C_d×1) is the derived cross-channel correlated feature vector, P(·) is the padding, S(·) is the sigmoid activation, and W_c[2C_l, 2C_l, 3 × 3] is the weight matrix of the convolution.
Subsequently, the features f^d_3×3 ∈ R^(C_d×H_l×W_l) extracted by the second branch are

f^d_3×3 = Expand(v_cc) • f_sp,

where "•" is the element-by-element product on the image plane, and the expanding operator Expand(·) expands the activated v_cc to the same dimension as f_sp to perform the multiplication. After both branches are performed, the small-object semantics f^s ∈ R^(C_d×H_l×W_l) that fully depict small objects are generated from f^d_1×1 and f^d_3×3.
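The two DS branches can be sketched as follows (NumPy, random stand-in weights; the small convolution on the pooled vector is simplified to a channel-mixing matrix, and we fuse the two branches by element-wise addition, which matches the stated dimensions but is our assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detailed_stream(f_d, w_1x1, w_dw, w_cc):
    """Simplified sketch of the two DS branches.
    f_d: (C_d, H, W). w_1x1: (C_d, C_d) stands in for the 1x1 conv branch;
    w_dw: (C_d, 3, 3) depth-wise 3x3 kernels; w_cc: (C_d, C_d) stands in
    for the convolution that correlates the pooled channel vector."""
    c_d, h, w = f_d.shape
    # Branch 1: 1x1 convolution -> per-pixel cross-channel correlation
    f_1x1 = np.einsum('oc,chw->ohw', w_1x1, f_d)
    # Branch 2: depth-wise 3x3 convolution (stride 1, zero padding)
    pad = np.pad(f_d, ((0, 0), (1, 1), (1, 1)))
    f_sp = np.zeros_like(f_d)
    for c in range(c_d):
        for i in range(h):
            for j in range(w):
                f_sp[c, i, j] = np.sum(pad[c, i:i+3, j:j+3] * w_dw[c])
    # Cross-channel correlation on the pooled vector, then channel gating
    v_sp = f_sp.mean(axis=(1, 2))          # adaptive pooling -> (C_d,)
    v_cc = sigmoid(w_cc @ v_sp)            # correlated gate vector
    f_3x3 = v_cc[:, None, None] * f_sp     # Expand(v_cc) applied to f_sp
    return f_1x1 + f_3x3                   # fused small-object semantics f_s

rng = np.random.default_rng(2)
f_d = rng.standard_normal((4, 8, 8))
f_s = detailed_stream(f_d,
                      rng.standard_normal((4, 4)) * 0.1,
                      rng.standard_normal((4, 3, 3)) * 0.1,
                      rng.standard_normal((4, 4)) * 0.1)
print(f_s.shape)  # (4, 8, 8)
```

Because neither branch uses strides, the spatial resolution of f^d is preserved end to end, which is what keeps extremely small objects from being pooled away.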

CS
The CS is designed to process the output f^b_4 ∈ R^(C_b×(1/32)H×(1/32)W) of LY_4 in the backbone network. Here, we generate the high-level semantics f^h by extending the receptive field. This compels f^h to further focus on large objects. The average pooling pyramid (APP) [7] is utilized to enlarge the receptive field. Unlike the strategy that directly enlarges the kernel size, APP does not increase the computational complexity [7]. The structure of the CS is shown in Figure 8 (PAM, position attention module [26]; CAM, channel attention module [26]).
Although the receptive field is enlarged, it is difficult to generate continuous contextual information that washes out heterogeneity in large-object features. Dual attention [26] is used to fully model long-range dependencies to address this issue. Dual attention, which includes a position attention module (PAM) and channel attention module (CAM), considers both spatial and channel information. The PAM computes the self-attention map for each position on the feature-image plane, and the CAM computes the self-attention map along the channels. They share similar procedures, which are detailed in [26], for producing these self-attention maps using matrix multiplication. Then, the input features are enhanced by these self-attention maps. Thus, the enhanced features produced by PAM and CAM contain continuous spatial dependency and channel dependency information, respectively.
In the i-th layer of Figure 8, f^b_4 is average-pooled into the down-sampled feature f^cs_i ∈ R^(C_b×H_i×W_i). Two reduced feature maps f̃^cs_i ∈ R^(C_r×H_i×W_i) and f̂^cs_i ∈ R^(C_r×H_i×W_i) are first generated by two convolutions with 1 × 1 kernels based on f^cs_i. Here, C_r is set to C_b/12. f̃^cs_i is processed by the PAM to obtain continuous spatial dependence information. Subsequently, a depth-wise convolutional layer with 3 × 3 kernels is used to generate the semantics f̃^sem_i ∈ R^(C_r×H_i×W_i). Similarly, f̂^cs_i is processed by the CAM to obtain dependence information along the channels, and again, a depth-wise convolutional layer with 3 × 3 kernels is used to generate the semantics f̂^sem_i ∈ R^(C_r×H_i×W_i).
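A minimal sketch of the position-attention idea used here (a simplified PAM: DANet's learned query/key/value projections are omitted for brevity, so the affinity is computed on the raw features) is:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(f):
    """Each position is re-expressed as a weighted sum over all positions,
    with weights derived from pairwise feature similarity.
    f: (C, H, W) -> (C, H, W)."""
    c, h, w = f.shape
    x = f.reshape(c, h * w)            # (C, N), N = H * W positions
    attn = softmax(x.T @ x, axis=1)    # (N, N) position-to-position map
    out = x @ attn.T                   # aggregate features over positions
    return f + out.reshape(c, h, w)    # residual enhancement

rng = np.random.default_rng(3)
f = rng.standard_normal((6, 5, 5))
out = position_attention(f)
print(out.shape)  # (6, 5, 5)
```

The CAM follows the same pattern with the roles of channels and positions swapped, producing a (C, C) affinity matrix instead of an (N, N) one.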

The Loss Function
As shown in Figure 3, the loss function L is composed of two terms:

L = L_main(T, P_f) + L_aux(T, P_a),

where L_main(T, P_f) is the selected main loss, which penalizes the errors in the final prediction P_f, and L_aux(T, P_a) is the auxiliary loss, which penalizes the errors in the auxiliary prediction P_a produced by the high-level semantics f^h. P_f and P_a are produced using similar prediction layers. First, the prediction layer convolves the input features using 3 × 3 kernels. Subsequently, batch normalization and ReLU activation are performed. Finally, the prediction is produced using a convolutional layer with 1 × 1 kernels. The prediction layers for P_f and P_a differ only in the parameters of the two inner convolutions. For P_f, we set the input and output numbers of channels of the first convolution to C_b + 2C_b1 + 2C_b2 and 128, where C_b1, C_b2, and C_b are the numbers of channels of the small-object semantics f^s_1 and f^s_2 and the high-level semantics f^h, respectively. The input and output numbers of channels of the remaining convolution are set to 128 and K, where K is the number of classes. For P_a, the input and output numbers of channels of the first convolution are set to C_b and 64, respectively, and the input and output numbers of channels of the remaining convolution are set to 64 and K.
To further highlight small objects without adding prior information about the data, we integrate a threshold Tr_L into the standard cross-entropy loss in L_main(T, P_f). In more detail, we compute the individual main loss L_main(T, P_f)[i] of each pixel in P_f as

L_main(T, P_f)[i] = − Σ_{j=1}^{K} 1[T(i) = j] log(P_f(i)[j]),

where T(i) is the ground-truth label of the i-th pixel and K is the number of categories. P_f(i) is the i-th vector in P_f, containing the probabilistic values of the i-th pixel belonging to all categories, and P_f(i)[j] is the probability that the i-th pixel belongs to the j-th category. Then, an indicator function I is defined as

I[i] = 1 if P_f(i)[T(i)] < Tr_L, and I[i] = 0 otherwise,

where I[i] is the indicator of the i-th pixel in the input image I, and Tr_L = 0.7. L_main(T, P_f) is computed based on I and L_main(T, P_f)[i] as

L_main(T, P_f) = (1/N) Σ_{i=1}^{N} I[i] · L_main(T, P_f)[i],

where N denotes the number of pixels in I. CNN-based models are advantageous for the recognition of large objects and the elimination of heterogeneity in large objects. After several training iterations, the losses in the prediction of large objects will be smaller than those of small objects, and larger losses are more likely to occur on pixels in small objects. Therefore, apart from the RDM, L_main(T, P_f) further highlights small objects during the backward procedure to some extent.
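The thresholded main loss can be sketched as follows (NumPy); the exact form of the indicator is our reading of the description above, with confidently predicted pixels masked out so the loss concentrates on hard pixels:

```python
import numpy as np

def thresholded_ce_loss(probs, labels, tr_l=0.7):
    """Sketch of the thresholded main loss. probs: (N, K) per-pixel class
    probabilities; labels: (N,) ground-truth class indices. Pixels whose
    true-class probability already exceeds tr_l contribute nothing, so the
    remaining gradient flows mostly to hard pixels (often small objects)."""
    n = probs.shape[0]
    p_true = probs[np.arange(n), labels]               # P_f(i)[T(i)]
    per_pixel = -np.log(np.clip(p_true, 1e-12, 1.0))   # per-pixel CE
    indicator = (p_true < tr_l).astype(float)          # I[i]
    return (indicator * per_pixel).sum() / n

probs = np.array([[0.9, 0.1],    # confident and correct -> masked out
                  [0.6, 0.4],    # below the threshold -> contributes
                  [0.2, 0.8]])   # confident for class 1 -> masked out
labels = np.array([0, 0, 1])
loss = thresholded_ce_loss(probs, labels)
print(loss > 0)  # only the second pixel contributes -> True
```

With Tr_L = 0.7, well-learned large-object pixels quickly fall below the loss, while pixels of small objects keep driving the optimization.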
L_aux(T, P_a) uses the standard cross-entropy loss function without changing the optimization preference for large (dominant) objects. Thus, L_aux(T, P_a) favors the preference of f_h for depicting large objects, which ensures the accuracy of large-object recognition. L_aux(T, P_a) is formulated as

L_aux(T, P_a) = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{K} 1[T(i) = j] · log P_a(i)[j].

Datasets
In this paper, the UAVid and Aeroscapes datasets are employed to evaluate performance in the segmentation of small objects. Both datasets contain objects with large-scale variations. Notably, we do not employ the ISPRS Potsdam and Vaihingen datasets: as discussed in Section 1, cars are the only comparatively small objects in Potsdam and Vaihingen, whereas in current UAV images cars are no longer small, and a wide range of objects are much smaller than cars.

The UAVid Benchmark
The UAVid benchmark [4] is a recently published aerial imagery dataset. Detailed information is given on its official website (https://uavid.nl/ (accessed on 1 May 2020)). UAVid contains 8 classes: "Clutter", "Building", "Road", "Tree", "Vegetation", "Moving car", "Static car", and "Human". UAVid focuses on scenes near streets but covers diverse landscapes, including downtown areas (Figure 9a), villa areas (Figure 9b), and outskirts (Figure 9c). To measure object scale accurately, we calculate the average number of pixels per single object in each category in Table 1. To simplify the notation, we use "Build.", "Veg.", "Mov.c.", and "Stat.c." to denote the Building, Vegetation, Moving car, and Static car classes. Clearly, the instances of the Human class have the smallest average size (555.0 pixels). More importantly, our objective is the recognition of footpath-level objects; note that this is "footpath", not "road" or "street", and footpaths are much narrower than roads or streets. In UAVid, only the Human class contains footpath-level objects, so we consider only the Human class to have a small size. In addition, although cars are much larger than pedestrians, they are not as large as the other objects; therefore, we consider the Stat.c. and Mov.c. classes to have medium sizes. In this way, the existing categories are divided into three groups, as shown in Table 1.
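The per-category average object size reported in Table 1 could be reproduced roughly as below; we assume a connected-components count over each class mask, which the text does not spell out, so this is a sketch of one plausible measurement:

```python
import numpy as np

def avg_object_size(label_map: np.ndarray, class_id: int) -> float:
    """Average pixels per connected instance of one class (4-connectivity)."""
    mask = (label_map == class_id)
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    num_objects = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                num_objects += 1                      # new connected component
                stack = [(sy, sx)]
                seen[sy, sx] = True
                while stack:                          # iterative flood fill
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
    return mask.sum() / num_objects if num_objects else 0.0
```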

The Aeroscapes Benchmark
The Aeroscapes benchmark [5] is another aerial dataset annotated for dense semantic segmentation. Details are given on its official website (https://github.com/ishann/aeroscapes (accessed on 29 May 2020)). As shown in Figure 10, Aeroscapes contains a wide range of scenes, including countryside, playgrounds, farmlands, roads, animal zoos, human settlements, and seascapes, among others. There are 12 classes: "Background", "Person", "Bike", "Car", "Drone", "Boat", "Animal", "Obstacle", "Construction", "Vegetation", "Road", and "Sky". For brevity, we use "b.g.", "Constr.", and "Veg." as simplified notations for the "Background", "Construction", and "Vegetation" classes, respectively. Among these categories, the Person, Bike, Drone, Obstacle, and Animal classes contain footpath-level objects from a human perspective. However, the UAV platform views the scene obliquely and its flight altitude varies, so for some scenes the spatial resolution of the acquired images is very high; in such cases, footpath-level objects may not appear small. To exclude the footpath-level classes enlarged in this way, we calculate the average number of pixels per single object in each category in Table 2. According to Table 2, the oblique view and flight altitude do not enlarge the instances of the Person, Bike, Drone, and Obstacle classes; therefore, we consider these classes to have small sizes. The Animal class, however, is affected: the animal zoo scene is smaller than the other scenes, so the UAV platform must fly nearer to the targets when collecting data. This leads to very high spatial resolution, and thus the animals in the images are not small (see Figure 10f). Therefore, we consider the Animal class to have medium sizes. Accordingly, the existing categories in Aeroscapes are divided into three groups, as shown in Table 2.

Experiments
Firstly, we introduce the evaluation metrics used in this study and the implementation details of the training procedure. Subsequently, the experimental settings are discussed.
Then, the results of RDNet and the comparisons with existing methods for both the UAVid and Aeroscapes datasets are presented. Finally, we discuss some issues of interest.

Evaluation Metrics
To maintain consistency with existing public evaluations, such as the studies [4,5,14], we mainly use IoU scores as the evaluation metrics. Because each dataset contains classes spanning small, medium, and large objects, which makes performance across scales difficult to analyze, we divide the existing classes into small, medium, and large groups based on the scale information presented in Tables 1 and 2, and compute the mean IoU score of the classes in each group. Thus, three mean IoU scores, mIoU_s, mIoU_m, and mIoU_l, are defined for the small, medium, and large groups, respectively. As these metrics are well known and easily understood, we do not present them in further detail here.
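Computing the grouped mean IoU scores from per-class IoUs is straightforward; a minimal sketch, with the class names and values in the usage example purely illustrative:

```python
def grouped_miou(iou_by_class: dict, groups: dict) -> dict:
    """Mean IoU per size group (e.g. mIoU_s, mIoU_m, mIoU_l)."""
    return {name: sum(iou_by_class[c] for c in classes) / len(classes)
            for name, classes in groups.items()}

# illustrative UAVid-style grouping (values are made up)
ious = {"Human": 0.2, "Mov.c.": 0.6, "Stat.c.": 0.5, "Road": 0.8}
groups = {"mIoU_s": ["Human"],
          "mIoU_m": ["Mov.c.", "Stat.c."],
          "mIoU_l": ["Road"]}
scores = grouped_miou(ious, groups)
```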

Implementation Details
The proposed RDNet is implemented in PyTorch. To train the network, we use stochastic gradient descent as the optimizer. The commonly used poly learning rate policy is employed, where the initial learning rate is multiplied by (1 − iter/max_iter)^power at each iteration. Here, power and the initial learning rate are set to 0.9 and 10^−4, respectively. We train the RDNet on the training set for 10 epochs. In each epoch, we randomly select 30,000 image patches with a batch size of three; each patch is 704 × 704 pixels. In more detail, to select each image patch, we randomly generate two values, h_r and w_r, indicating the top-left corner; the patch is then copied from the original image based on the coordinates (h_r, w_r) and (h_r + 704, w_r + 704). The RDNet models for both UAVid and Aeroscapes are trained with these settings. The training procedure is iterative, so each iteration yields a trained model. As we focus on small objects, we use the trained model with the highest accuracy scores for small objects to obtain the final segmentation results; the trained models of the comparison methods are selected for testing in the same way.
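The poly learning-rate schedule and random patch sampling described above can be sketched as follows (the helper names are ours):

```python
import random

def poly_lr(base_lr: float, it: int, max_iter: int, power: float = 0.9) -> float:
    """Poly schedule: base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1 - it / max_iter) ** power

def random_crop_coords(height: int, width: int, patch: int = 704):
    """Random top-left corner (h_r, w_r) and bottom-right corner of a patch."""
    h_r = random.randint(0, height - patch)
    w_r = random.randint(0, width - patch)
    return (h_r, w_r), (h_r + patch, w_r + patch)
```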

The Backbone Selection
Currently, a wide range of general-purpose networks can be used as the backbone. Our objective is to select a backbone that offers a good trade-off between computational complexity and accuracy. Table 3 lists the integrations of RDNet with representative real-time general-purpose networks. MobileNetV3, GhostNet, STDC, EfficientNetV2, Xception, and ResNet18, which are commonly used CNN-based networks, can be integrated into RDNet as shown in Figure 3. MobileViT is a new general-purpose network that combines the strengths of CNNs and ViTs [29]. Although the architecture of MobileViT differs from that of purely CNN-based networks, it integrates easily with the RDNet framework. At the beginning of MobileViT, a MobileNetV2 [34] block sequence extracts a set of low-level features; these low-level features are then fed into the ViT sequence to generate high-level semantics. We select the low-level feature maps with (1/4)H × (1/4)W and (1/8)H × (1/8)W pixels as f_b1 and f_b2, and the high-level semantics produced by the ViT sequence as f_h. The RDNet based on MobileViT can then be constructed as shown in Figure 3.
Table 3 presents the number of parameters (Pars.), the computational complexity, the GPU memory consumption (Mem.), and the mIoU scores. The computational complexity is measured by floating-point operations (FLOPs) and frames per second (FPS), both computed on an image patch of 3 × 1536 × 1536 pixels. The accuracy scores are obtained on the validation set of UAVid; notably, the validation set is not used to train the model. RDNet(ResNet18), the RDNet version using ResNet18 as the backbone, obtains the highest accuracy. RDNet(Xception) and RDNet(EfficientNetV2) come close but do not surpass it. Xception and EfficientNetV2 are built on depth-wise separable convolutions, which remove many weights for cross-channel modeling, whereas ResNet18 fully models cross-channel relationships using full convolutional operators. This suggests that cross-channel modeling in the backbone benefits RDNet. RDNet(MobileViT) has potential for accuracy improvements because it contains only 2.08 M parameters; unfortunately, its GPU memory consumption is very large (28.5 G), which limits its application. RDNet(MobileNetV3) and RDNet(GhostNet) are lightweight and fast but achieve much lower accuracy than RDNet(ResNet18). From this comparison, RDNet(ResNet18) achieves the best trade-off among parameters, complexity, speed, GPU memory consumption, and accuracy; therefore, we select ResNet18 as the default backbone.
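A rough way to obtain the parameter count and FPS figures of Table 3 is sketched below; this is not the authors' benchmarking script, and FLOPs would additionally require a profiler such as fvcore or thop:

```python
import time
import torch

def measure(model: torch.nn.Module, size=(1, 3, 1536, 1536),
            warmup: int = 5, runs: int = 20):
    """Return (parameters in millions, rough frames per second) for a model."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(size)
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):     # warm-up iterations excluded from timing
            model(x)
        t0 = time.time()
        for _ in range(runs):
            model(x)
    fps = runs / (time.time() - t0)
    return params_m, fps
```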

The Selection of Comparison Methods
There are currently hundreds of networks for semantic segmentation tasks, and we cannot compare all of them with our RDNet. As our RDNet focuses on small objects, we test the existing small-object-oriented methods for which official code has been released: BiSeNet, ABCNet, SFNet, BiSeNetV2, and GFFNet, which were introduced in Section 2.2. As there is also a large range of backbone networks and we cannot test every backbone for each method, we use the backbone provided in each method's official code. Table 3 shows the properties and accuracies of these methods; most of the small-object-oriented methods have the real-time property and prefer ResNet18. In addition, we select three representative traditional networks (see Section 2.1): DANet [26], a self-attention-based method; DeepLabV3 [21], a method using atrous convolutions and pyramids; and MobileViT [30], a representative method using the vision transformer [29]. For the semantic segmentation task, MobileViT [30] adopts the DeepLabV3 architecture and replaces the ResNet101 backbone with MobileViT.
In Table 3, DANet and DeepLabV3 achieve higher mIoU scores than BiSeNet only when they use ResNet101. In most cases, they receive lower accuracy scores than the small-object-oriented methods, even though they use much more complex backbones (ResNet101) by default. More importantly, based on the mIoU_s scores, their performance in segmenting small objects is even weaker. Although MobileViT achieves a higher mIoU_s score than DANet and DeepLabV3, it still cannot surpass the existing small-object-oriented methods; furthermore, its GPU memory consumption is very large (27.8 G). Therefore, in the following we mainly compare RDNet with the small-object-oriented methods in more detail. Meanwhile, to fully compare RDNet with the state of the art, we report the published results for each dataset.

Results on UAVid
This section is divided into two parts: quantitative and visual results. The details of the UAVid dataset are presented in Section 4.1; specifically, the Human class has a small size. Table 4 shows the quantitative results for the UAVid test set, where the comparisons are divided into two groups. The first group shows state-of-the-art results obtained by MSD [4], CAgNet [14], ABCNet [41], and BANet [31]; these methods officially published their results on UAVid, and we quote their accuracy scores directly. The second group shows the accuracy scores obtained by the small-object-oriented methods not included in the state-of-the-art results: BiSeNet [12], GFFNet [6], SFNet [33], and BiSeNetV2 [13]. We trained these four methods ourselves and used the official server of UAVid (https://uavid.nl/) to obtain the accuracy scores. In Table 4, RDNet obtains the highest mIoU score among these methods. More importantly, for each of the small, medium, and large groups, our RDNet obtains the highest accuracy (denoted by the mIoU_s, mIoU_m, and mIoU_l scores). This means that the performance of segmenting large and medium objects does not degrade when our RDNet improves the accuracy of segmenting small objects; on the contrary, the mIoU_m and mIoU_l scores for medium and large objects are further improved. For the segmentation of small objects, our RDNet achieves 4.5–18.9% higher mIoU_s scores than the existing methods. Notably, BiSeNetV2, GFFNet, and SFNet, which are powerful small-object-oriented methods, already perform well in segmenting small objects, as noted in Section 2.2. Therefore, our RDNet, which obtains higher mIoU_s scores than BiSeNetV2, GFFNet, and SFNet, has a stronger ability to segment small objects.
Moreover, compared with state-of-the-art results, our RDNet achieves 11.8-18.9% higher mIoU s scores, which is a significant improvement. Therefore, based on the results in Table 4, the superiority of our RDNet, especially for the segmentation of small objects, is validated.

Visual Results
To facilitate the visual comparison, we use the prediction maps of the validation set, for which ground truth is available. Notably, the validation set is not used for training. Figure 11 shows the visual results for all classes in UAVid; scenes that do not exhibit the scale-variation issue are not presented. Furthermore, SFNet and BiSeNetV2 achieve higher mIoU and mIoU_s scores than the other existing methods, as shown in Table 4; therefore, owing to page limits, we visually compare the results of RDNet only with those of SFNet and BiSeNetV2. Based on the ground truth (Figure 11, second column), RDNet provides more reliable results than the other methods: the small objects extracted by RDNet are more complete, and the segmented large and medium objects are better.
In terms of small objects, we highlight with red rectangles the subareas that challenge the segmentation of small objects in the first and second images (Figure 11, first and second rows). The sanitation worker in the first image and the man near a car in the second image are set against particularly complex backgrounds, and the pedestrians in the second image have spectral features similar to the shadows of the buildings. Consequently, both SFNet and BiSeNetV2 miss these small targets, whereas RDNet identifies all of them (Figure 11, last column). In terms of large objects, we highlight with black rectangles the subareas that challenge the segmentation of large objects in the second and third images (Figure 11, second and last rows). In the second image, the subarea contains trees, vegetation, and roads; these different types of large objects shade each other, which makes recognition difficult. Neither SFNet nor BiSeNetV2 segments this subarea well, whereas RDNet succeeds. In the last image, the subarea contains a building surrounded by vegetation; SFNet and BiSeNetV2 misclassify the building, whereas RDNet produces a correct result. The visual results show that RDNet obtains superior results for both the small and large objects in UAVid. The accuracy of segmenting large objects is not reduced when RDNet highlights the small objects; on the contrary, it is further improved.

Results on Aeroscapes
The presentation of the results on the Aeroscapes dataset follows the same structure as Section 5.4. The details of the Aeroscapes dataset are presented in Section 4.2; specifically, the Person, Bike, Drone, and Obstacle classes have small sizes.

Quantitative Results
We compare the results of RDNet with the state-of-the-art results reported for Aeroscapes and those obtained using BiSeNet, ABCNet, GFFNet, SFNet, and BiSeNetV2. The accuracy values are presented in Table 5. Regarding the state-of-the-art result, EKT-Ensemble [5], a specific issue must be noted: apart from the precise mIoU score, the per-class IoU evaluation is provided only as a bar chart with an accuracy axis (see details in [5]). We cite this evaluation by carefully reading the values along the accuracy axis; these values are therefore marked as approximate (≈) in Table 5. The authors of [5] assembled different datasets, including PASCAL [52], CityScapes [53], ADE20k [54], and ImageNet [55], to improve accuracy, owing to the large range of scenes contained in Aeroscapes; the resulting model, denoted EKT-Ensemble, achieves a 57.1% mIoU score. As shown in Table 5, RDNet obtains the highest mIoU score among all the methods. More importantly, the mIoU_s score for small objects, the mIoU_m score for medium objects, and the mIoU_l score for large objects obtained by RDNet are all higher than those of the other methods. This validates that RDNet further improves the accuracy of segmenting large objects while improving the accuracy of segmenting small objects. Specifically, the mIoU_s score is 3.9% higher than that of BiSeNetV2, which ranks second. For the Bike class, which has the smallest objects, our RDNet obtains at least a 7.8% higher IoU score than the other methods; for the Person class, RDNet obtains at least 5.0% higher accuracy. SFNet obtains a higher IoU score for the Obstacle class with small objects, which confirms that SFNet is strong at segmenting small objects.
However, compared with our RDNet, SFNet obtains lower IoU scores for the other classes with small objects, including the Person, Bike, and Drone classes. This validates the superiority of RDNet.

Visual Results
Aeroscapes contains numerous scenes with large-scale variations between classes. Here, we select four images that cover the classes with small objects and the objects that challenge segmentation in Figure 12. From these visual results, we observe mistakes in every prediction, which demonstrates the difficulty of the Aeroscapes dataset. RDNet obtains reasonable segmentation results in which all objects are recognized; compared with SFNet and BiSeNetV2, RDNet shows less misclassification in each image. Specifically, the first row of Figure 12 presents the smallest object, a bike; the results obtained by RDNet are much better than those of SFNet and BiSeNetV2. In the second row, RDNet obtains superior results for the Person and Drone classes. Although SFNet obtains a higher IoU score for segmenting obstacles than RDNet in Table 5, the visual results show that RDNet achieves better segmentation in a range of scenes, such as the third row of Figure 12. In the last row, we show the segmentation results for the Animal class; RDNet obtains the best results for segmenting the animals in the image.

Ablation Study
In RDNet, we propose three modules: RDM, DS, and CS. RDM has two alignment branches: cosine alignment (RDM_C) and neural alignment (RDM_N). Here, we analyze the influence of each of these factors by comparing the results of RDNet without each component against those of the complete RDNet. We use RDNet-RDM as the simplified form of RDNet without the RDM; the other variants are denoted similarly. Table 6 shows the ablation study on the Aeroscapes dataset. RDNet-RDM shows the lowest accuracy: compared with RDNet, its mIoU score decreases by more than 6%, and its mIoU_s score is only 31.9%, considerably lower than those of RDNet and the other variants. This demonstrates that RDM has a significant influence on the accuracy of segmenting small objects. DS plays a positive role for small and medium objects, based on the changes in the mIoU_s and mIoU_m scores. CS generates high-level semantics that primarily depict large objects, and Table 6 confirms that CS satisfies this objective. When the CS is removed, the high-level semantics become f_b4 directly. When both CS and DS are removed from RDNet, the network contains only ResNet18 and RDM, shown in the second row of Table 6; here, the high-level semantics f_h are directly the output of LY4 in ResNet18. ResNet18+RDM is very simple: its FLOPs score is 70 G less than that of BiSeNetV2. Nevertheless, its mIoU_s score for small objects reaches 40.7%, which is 2.4% higher than that of BiSeNetV2. As shown in Table 5, BiSeNetV2 obtains a mIoU_s score 3.9% lower than our RDNet but higher than the other existing methods. In other words, using only RDM and the simple ResNet18 backbone, a higher accuracy in segmenting small objects than that of the other existing methods can be obtained.
This again validates the great power of RDM for segmenting small objects.

The Output of Each Module in RDNet
As shown in Figure 2, the high-level semantics extracted by deep layers primarily depict large objects, whereas the low-level features are a mixture of large and small objects. Although small-object information is recorded in the low-level features, large-object information plays a dominant role. This is why the existing deep-learning-based methods achieve high accuracy for large objects but low accuracy for small objects (see Table 4). To fully analyze the positive influence of each module in RDNet, we present the saliency map of the features produced by each module in Figure 13. The features f_b1 and f_b2 generated by LY1 and LY2 in ResNet18 have the same properties as the low-level features in Figure 2. The high-level semantics f_h extracted by CS and f_b4 generated by the deep layer LY4 in ResNet18 have the same properties as the high-level semantics in Figure 2. However, comparing f_b4 and f_h in Figure 13, the contextual information in f_h is more homogeneous, which facilitates the representation of large objects. This validates the positive effect of the CS from a visual perspective.
In Figure 13, the difference features f_d1 and f_d2 produced by the proposed RDM fully highlight small objects and exclude the predominant effect of large objects. This proves that RDM meets the objective of reversing the predominance relationship between large and small objects. However, f_d1 and f_d2 only locate the small objects; they cannot depict them well. The DS is proposed to generate more semantics to address this issue. After the DS is applied to f_d1 and f_d2, the generated small-object semantics f_s contain the semantics necessary to depict more properties of a small object. Coupled with the quantitative analysis in Table 6, we conclude that RDM, DS, and CS are all essential for RDNet, with RDM in particular playing a vital role in improving the representation ability for small objects.
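One common recipe for rendering such feature saliency maps is the channel-averaged absolute activation, normalized per image; the paper does not specify its exact visualization method, so the following is an assumed sketch:

```python
import torch

def feature_saliency(feat: torch.Tensor) -> torch.Tensor:
    """Channel-averaged absolute activation, min-max normalized to [0, 1].

    feat: (B, C, H, W) feature map; returns a (B, H, W) saliency map."""
    s = feat.abs().mean(dim=1)                              # average over channels
    s = s - s.amin(dim=(1, 2), keepdim=True)                # shift min to 0
    rng = s.amax(dim=(1, 2), keepdim=True).clamp_min(1e-12) # avoid divide-by-zero
    return s / rng
```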

The Usage of Complex Backbones
In Section 5.3.1, we selected ResNet18 for a good trade-off between computational complexity and accuracy. Here, we discuss the use of more complex backbones via segmentation on Aeroscapes. In Table 7, ResNet50 and ResNet101 are integrated; for comparison, the small-object-oriented methods are tested with the same backbone networks. For the small-object-oriented methods, including BiSeNet, ABCNet, GFFNet, and SFNet, the complex backbones cannot improve the mIoU_s score for small objects, although the mIoU_m and mIoU_l scores improve to some extent compared with those in Table 5. The main reason is that these methods do not change the predominant effects of large objects. For our RDNet, the complex backbones improve not only the mIoU_m and mIoU_l scores for medium and large objects but also the mIoU_s score for small objects.
In Table 7, RDNet obtains the highest mIoU_s, mIoU_m, and mIoU_l scores. Specifically, with ResNet50, RDNet achieves a 7.4% higher mIoU_s score than SFNet, which ranks second; with ResNet101, RDNet obtains a 7.9% higher mIoU_s score than SFNet. In conclusion, the use of complex backbones does not alter the superiority of RDNet.

Conclusions
In this study, a novel semantic segmentation network called RDNet is proposed for aerial imagery. In RDNet, the RDM is first proposed to highlight small objects. RDM develops a reverse difference concept and aligns the semantics of the high-level and low-level features; consequently, it eliminates the predominant effect of large objects in the low-level features. Then, the DS, which models spatial and cross-channel correlations, is proposed to generate small-object semantics from the output of RDM. Additionally, the CS is designed using an average pooling pyramid and dual attention to generate high-level semantics. Finally, the small-object semantics bearing the small objects and the high-level semantics focusing on the large objects are combined to make a prediction.
Two aerial datasets, UAVid and Aeroscapes, are used to fully analyze the performance of RDNet. Based on the experimental results, RDNet obtains superior results compared with the existing methods on both datasets. The accuracy scores for segmenting small, medium, and large objects are all improved; most importantly, the accuracy improvement for small objects is prominent. According to Table 3, RDNet has lower computational complexity and uses less GPU memory than the existing small-object-oriented methods, showing that RDNet achieves superior results with fewer computing resources. The ablation study demonstrates that the proposed RDM plays a vital role in improving the accuracy of segmenting small objects, that the DS further enhances the output of RDM, and that the CS ensures good performance in segmenting large objects. Meanwhile, the visualization of each module's output vividly shows its positive effect.
In the future, the resolution of remote sensing images will improve further, even though the current resolution is already high enough to record footpath-level objects well, and small-object recognition will become increasingly important. The RDNet architecture, which highlights small objects for deep-learning-based models, can therefore serve more applications in the field of remote sensing.