Global Multi-Scale Information Fusion for Multi-Class Object Counting in Remote Sensing Images

Abstract: In recent years, object counting has been widely investigated and has made significant progress under the surveillance view. However, only a few works focus on object density estimation in remote sensing, and the performance of existing methods is not promising. On the one hand, due to the imbalanced distribution of targets in remote sensing images, the model may collapse, leading to severe performance degradation. On the other hand, the scale of targets in remote sensing images varies greatly in real scenarios, which remains a challenge for counting objects accurately. To remedy these problems, we propose an approach named "SwinCounter" for object counting in remote sensing. We introduce a Balanced MSE Loss that pays more attention to rare samples, which alleviates the problem of imbalanced object labels. In addition, the attention mechanism in our SwinCounter precisely captures multi-scale information, so the model is aware of objects at different scales and counts small and dense targets more accurately. We conduct experiments on the RSOC dataset, achieving MAEs of 7.2, 151.5, 14.38, and 52.88 and MSEs of 10.1, 436.0, 22.7, and 74.82 on the Building, Small-Vehicle, Large-Vehicle, and Ship sub-datasets, respectively, which demonstrates the competitiveness and superiority of the proposed method.


Introduction
With the rapid development of remote sensing technology, the quantity and quality of remote sensing images used to characterize various objects on the earth's surface (such as airports, aircraft, and buildings) have been greatly improved. Thus, research on intelligent earth observation for analyzing and understanding these satellite and aerial images has attracted extensive attention. The aim of the object counting task is to estimate the number of objects of a certain class in a given image, which plays a crucial role in image and video analysis for applications such as crowd behavior analysis [1-3], public security, traffic monitoring, and urban planning. Benefiting from the development of deep learning, object counting has achieved great performance on natural images, and a few works have applied object counting to remote sensing images. However, due to the gap between remote sensing and natural images, there is still great room for improvement in remote sensing object counting.
Existing object counting methods can be divided into detection-based and regression-based approaches. Detection-based methods obtain the number of objects by counting the detected bounding boxes. Dijkstra et al. [4] propose a network named CentroidNet, which obtains detection results by locating centroids. Detection-based methods work well for sparse and large targets but suffer a severe performance drop when dealing with dense and tiny objects. Compared to detection-based methods, regression-based methods require less annotation and are more stable. In the early stage, researchers extracted manually designed features from the input image [5] and estimated the target count with general regression methods, an approach also known as direct regression. A significant milestone in the development of object counting is the conversion of object counting into density map estimation, proposed by Lempitsky et al. [6]. Following this work, Zhang et al. propose a multi-branch network in which each branch uses a different convolution kernel size to capture objects at different scales. Sam et al. [7] propose a switching convolutional neural network that trains a switch classifier to select the final regressor. Liu et al. [8] blend the perspective and density maps by adding an extra branch. Jiang et al. [9] propose a novel encoder-decoder structure that effectively facilitates the fusion of feature maps at multiple scales. All the approaches mentioned above obtain relatively good results on uniformly distributed objects, but they fail when dealing with unevenly distributed targets with large scale variations.
As shown in Figure 1, compared with object counting in natural images, remote sensing object counting is a more challenging task. Unlike natural images, remote sensing images are typically captured from a high-altitude, near-vertical viewpoint by the onboard camera of a satellite or aircraft. Therefore, remote sensing object counting mainly faces the following challenges:

1. As remote sensing images are collected by various space-borne or airborne cameras at different locations under diverse conditions, the scale of objects of different categories varies greatly, and even the same target can undergo large scale variation between acquisitions.

2. Remote sensing images have more complex backgrounds than natural images, and the acquisition process is prone to varying degrees of occlusion caused by light, weather, and other factors. These complex backgrounds and occlusions can easily mislead the model into wrong predictions.

3. Remote sensing images are large in scale and cover a wide area. They may contain extremely dense small objects, such as vehicles, ships, and aircraft, so it is difficult to count all these tiny objects accurately within one model.

To address the problems mentioned above, the vision transformer is introduced here to suppress background interference while capturing multi-scale information, alleviating the disturbance caused by scale variation and complex backgrounds. To further alleviate the imbalance within the training samples, the Balanced MSE loss function is introduced to tackle label imbalance from a statistical perspective. Overall, the three main contributions of this paper are as follows:

1. To the best of our knowledge, this is the first time the transformer has been introduced into remote sensing object counting. Owing to its attention mechanism, the model can locate finer objects than traditional methods, which increases its robustness to objects of different scales.

2. We achieve competitive results on the widely used Remote Sensing Object Counting (RSOC) dataset, which contains the Large-vehicle, Small-vehicle, Building, and Ship sub-datasets.

3. To tackle the sample imbalance problem for targets within remote sensing images, we replace the standard MSE loss with the Balanced MSE loss. The experiments demonstrate that this loss function effectively improves the overall performance of the model.

Related Work
This section introduces related work in four areas, namely remote sensing object counting, object counting, imbalanced data distribution, and tiny object detection.

Remote Sensing Object Counting
Remote sensing object counting aims to estimate the number of objects in remote sensing images. It is a hot issue in the fields of computer vision and remote sensing image analysis [10][11][12]. Remote sensing images mainly include satellite images and aerial images; the former are obtained by scanners that adjust the angle of the satellite, while the latter are obtained by optical lenses carried by aircraft. Remote sensing images present more challenges for the object counting task than natural scenes, because long-distance aerial photography involves a wider field of view, more complex scenes, and larger scale variation.
Previously, remote sensing object counting was tackled by combining classification and detection [13][14][15][16][17][18]. Tan et al. [19] use image segmentation and edge detection algorithms to detect and classify road vehicles. Bazi et al. [20] propose an automatic counting method using a Gaussian Process Classifier (GPC) to calculate the number of olive trees in remote sensing images. Santoro et al. [21] adopt a four-step algorithm that mainly uses sensor data to calculate the number of fruit trees. Xue et al. [22] propose a semi-supervised animal counting method. Salamí et al. [23] use a parallel architecture to connect the UAV and a computer, finally obtaining real-time processing of images on the UAV for counting.
Recently, deep learning-based remote sensing object counting has become the mainstream, benefiting from the strong representation ability of convolutional neural networks. Mubin et al. [24] use a basic LeNet to realize the detection and counting of oil palms. Shao et al. [25] construct a YOLO-based CNN framework to detect and count cattle in images collected by UAVs, finally realizing the management of grazing cattle. Wan et al. [26] propose an end-to-end framework to learn the density map for counting and achieve good performance. Gao et al. [27] propose a new remote sensing object counting network, PSGCNet, which includes a Pyramid Scale Module (PSM) and a Global Context Module (GCM).
Object Counting

Detection-based object counting methods realize counting by summing detection results. They can be divided into instance-based [31] and part-based [32] types. The former uses the complete object instance as the detection target, while the latter uses part of the instance to alleviate the interference caused by occlusion. However, all detection-based counting fails when dealing with extremely crowded scenes.
Regression-based methods treat counting as a global density prediction task to avoid the occlusion problem faced by detection-based methods, directly estimating the number of targets from the given input image and the generated features. Tan et al. [33] design a semi-supervised elastic net regression method. Chan et al. [34] design a new regression method for crowd counting that links approximate Bayesian Poisson regression with the Gaussian process. However, regression-based solutions only focus on global image features, which may lead to large errors and a low fault tolerance rate.
Density map estimation-based methods disperse the target point distribution information to the surrounding positions to generate an intermediate variable called the density map, and achieve the counting task by estimating the density map of the given input. A Multi-Column Neural Network (MCNN) is proposed by Zhang et al. [35], which accepts images of any resolution and learns multi-scale features through each column, obtaining a predicted density map and, finally, the number of people. A Congested Scene Recognition Network (CSRNet) is developed by Li et al. [36], which uses dilated convolutions to understand highly crowded scenes. Sindagi et al. [37] combine a high-level prior with the network to learn a model that covers the various density levels in the dataset. Ma et al. [38] design a Bayesian loss function for the point-supervised crowd counting task. Gao et al. [2] design a Domain Adaptive Crowd Counting (DACC) framework that generates pseudo labels in real scenes to improve prediction quality.

Imbalanced Data Distribution
Imbalanced data distribution is a common problem in the real world. Recently, researchers have begun to pay attention to the problem of imbalanced regression in computer vision. Branco et al. [39] use a Gaussian noise augmentation method, and in the following year they design REBAGG [40], a resampling algorithm based on bagging. Yang et al. [41] propose the new methods of Label Distribution Smoothing (LDS) and Feature Distribution Smoothing (FDS). A common problem in visual regression, however, is label imbalance: the performance of the commonly used MSE loss on rare samples is often unsatisfactory. As a result, Ren et al. [42] propose Balanced MSE for imbalanced visual regression. Balanced MSE uses a probabilistic method to solve label imbalance from a statistical perspective and removes the impact of imbalanced label distribution on MSE. We introduce Balanced MSE into our work to deal with the label imbalance problem.

Tiny Object Detection
Tiny object detection [43][44][45] is a hot issue in the field of computer vision. It faces the problems of image blur, limited information, high-level noise, and low resolution. Corresponding solutions include the Feature Pyramid Network (FPN) proposed in [46], which adopts multi-scale feature fusion and uses the fused features at different levels for detection. Other works [47,48] follow the idea of FPN and add insights to improve performance. Kisantal et al. [49] propose oversampling and copy-paste operations on images containing tiny objects (without obscuring the original targets). In addition, more accurate results can be obtained with appropriate training strategies, such as SNIP [50], SNIPER [51], and SAN [52]. For tiny objects, dense anchor sampling and matching strategies such as S³FD [53] and FaceBoxes [54] are also common methods.

Methodology
As shown in Figure 2, the pipeline of our method can be divided into the following stages. Firstly, the input remote sensing image is cut into a set of image patches to form the linear patch embedding. Then, Swin Transformer blocks [55], the core of a powerful backbone for computer vision tasks, are used to extract features at different scales. These multi-scale features are fused by a Multi-Scale Feature Fusion (MSFF) module, which effectively aggregates features from coarse to fine resolution. Finally, the prediction block generates the predicted density map from the concatenated multi-scale fused features. The Balanced MSE loss is used to calculate the distance between the ground truth density map and the predicted one, while alleviating the impact of data imbalance on the training process. The details of each step of the framework and of the Balanced MSE loss function are given in the following subsections.

Multi-Scale Feature Extraction
There are many challenges in remote sensing object counting, such as large scale variation and complex backgrounds. Most existing methods use convolutional neural networks to extract image features. However, due to the physical properties of CNNs, it is difficult to model the dependencies between global pixels, where "global pixels" refers to the global information of a given input image. Some works use multi-layer CNNs to enlarge the receptive field, but this is inefficient and still cannot deal with tiny objects in remote sensing images. The self-attention mechanism provides an effective way to capture global context information. To achieve better performance on the remote sensing counting task, we introduce the Swin Transformer [55], based on self-attention, for richer feature extraction. The Swin Transformer uses a shifted window mechanism with a hierarchical design to divide the image into non-overlapping local windows and calculates self-attention within each window. In this way, background noise can be suppressed and object features can be highlighted, even for tiny targets.
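As an illustration of the windowing idea, the partition of a feature map into non-overlapping local windows can be sketched as follows. This is a minimal NumPy sketch; the function name `window_partition` and the layout conventions are illustrative, not taken from the official implementation:

```python
import numpy as np

def window_partition(x, ws):
    """Split a (H, W, C) feature map into non-overlapping ws x ws windows.

    Returns an array of shape (num_windows, ws * ws, C); each group of rows
    holds the tokens of one local window, and attention is later computed
    independently inside each window.
    """
    H, W, C = x.shape
    x = x[:H - H % ws if H % ws else H, :W - W % ws if W % ws else W]  # crop to a multiple of ws
    h, w = x.shape[0] // ws, x.shape[1] // ws
    x = x.reshape(h, ws, w, ws, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(h * w, ws * ws, C)
```

With the default window size of 7, a 14 × 14 feature map yields 4 windows of 49 tokens each, so the attention cost stays bounded per window instead of growing quadratically with image size.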
Given an RGB image I ∈ R^{H×W×3} as input, we first use a patch partition operation to divide I into 4 × 4 patches, implemented as a simple convolutional layer whose kernel size and stride are both set to 4. Then, all patches are projected to a pre-defined dimension (denoted as C) through a linear embedding operation.
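The patch partition and linear embedding can be sketched together: a stride-4 convolution with a 4 × 4 kernel is equivalent to cutting the image into 4 × 4 patches, flattening each one, and applying a shared linear projection. A minimal NumPy sketch under that equivalence (the function name and weight layout are our own choices):

```python
import numpy as np

def patch_embed(img, weight, patch=4):
    """Cut img (H, W, 3) into patch x patch tiles and project each to C dims.

    weight has shape (patch * patch * 3, C); the result is a token matrix of
    shape (H//patch * W//patch, C), matching a conv with kernel = stride = patch.
    """
    H, W, _ = img.shape
    h, w = H // patch, W // patch
    tiles = img[:h * patch, :w * patch].reshape(h, patch, w, patch, 3)
    tokens = tiles.transpose(0, 2, 1, 3, 4).reshape(h * w, patch * patch * 3)
    return tokens @ weight
```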
After that, each patch is flattened and fed into the Swin Transformer blocks as a "token". As shown in Figure 3, different from the original ViT [56], the Swin Transformer uses Window-based Multi-head Self-Attention (W-MSA) to calculate attention within a window of limited size (set to 7 by default). At the same time, it enhances the connection between different windows through a Shifted Window-based Multi-head Self-Attention (SW-MSA) operation. Specifically, the Swin Transformer block uses a linear layer to project the input tokens onto three elements: Q (query), K (key), and V (value), where d is the query/key dimension. The three elements are used to calculate attention within the fixed window. The process can be formulated as follows:

Attention(Q, K, V) = SoftMax(QK^T / √d)V

As shown in Figure 2, the proposed method uses a 3-layer Swin Transformer to extract features at 3 different scales. Specifically, 2 Swin Transformer blocks are first used to extract shallow features, and 18 Swin Transformer blocks are then utilized to extract more elaborate features. The process of multi-scale feature extraction can be expressed as

F_1 = T_1(I), F_2 = T_2(F_1), F_3 = T_3(F_2)

where I represents the original image, T_i denotes the i-th Swin Transformer stage, and F_1, F_2, and F_3 represent the feature maps at the three scales.
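The window attention above can be sketched numerically. The following single-head NumPy version implements SoftMax(QK^T/√d)V per window; the multi-head split, relative position bias, and window shift of the full Swin block are omitted for brevity:

```python
import numpy as np

def window_attention(Q, K, V):
    """Scaled dot-product attention computed independently in each window.

    Q, K, V: (num_windows, tokens_per_window, d). Attention never crosses
    window boundaries, which keeps the cost linear in the number of windows.
    """
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)   # (nW, T, T) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```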

Multi-Scale Feature Fusion Module
As mentioned above, the proposed method uses three Swin Transformer layers to extract features at different scales, from coarse to fine. To make further use of the information contained in features at different scales, we propose the Multi-Scale Feature Fusion (MSFF) module, inspired by FPN. As shown in Figure 4, the feature vector at each scale is applied to a fusion block and element-wise added to the feature vectors at the other scales, which transmits feature information repeatedly between different scales. In the last stage, each fused vector is added to the original one and passes through a fusion block to form the final output. Each fusion block contains a depth-wise convolutional layer and a 1 × 1 convolutional layer, followed by batch normalization and ReLU activation.
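The cross-scale exchange can be illustrated with a deliberately simplified sketch. The version below keeps only a top-down pass with nearest-neighbour upsampling and element-wise addition, and assumes all three feature maps share the same channel dimension; the actual MSFF also propagates information bottom-up and applies the depth-wise/1 × 1 fusion blocks described above:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling: (H, W, C) -> (2H, 2W, C).
    return x.repeat(2, axis=0).repeat(2, axis=1)

def msff_topdown(f1, f2, f3):
    """Top-down feature fusion: coarse maps are upsampled and added to finer ones.

    f1 is the finest map, f3 the coarsest (each half the spatial size of the
    previous one). Returns the three fused maps (s1, s2, s3).
    """
    s3 = f3
    s2 = f2 + upsample2x(s3)
    s1 = f1 + upsample2x(s2)
    return s1, s2, s3
```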
In general, the feature maps F_1, F_2, and F_3 extracted by the backbone are sent to the feature fusion module, and 1 × 1 and 3 × 3 convolutional layers are used to obtain the fused feature layers, which can be formulated as

(S_1, S_2, S_3) = MSFF(F_1, F_2, F_3)

where S_1, S_2, and S_3 represent the feature maps generated by the MSFF at the corresponding scales. These feature maps are then upsampled and concatenated.

Density Map Estimation
The final step is to generate density maps from the output features of the MSFF. The prediction block of the SwinCounter contains a 1 × 1 convolution layer to further integrate the input features; two subsequent transposed convolutions align the output size while reducing the channel dimension to 1. The ground truth density map is generated by Gaussian filtering of the dot map, which can be formulated as

g = G(P) = Σ_{i=1}^{N} G_σ(x − x_i)

where P = {x_1, …, x_N} denotes the annotated points, G represents the process of mapping the initial labels to density labels, G_σ is a Gaussian kernel, and g represents the final density map used in training.
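The ground-truth generation can be sketched directly: each annotated point contributes a normalized Gaussian bump, so the density map integrates to the object count. A small NumPy sketch with a truncated kernel and fixed σ (real pipelines may instead use an adaptive, geometry-aware σ):

```python
import numpy as np

def density_map(points, shape, sigma=1.0, radius=3):
    """Place a normalized Gaussian at each annotated point (row, col).

    As long as the bumps stay inside the image, the map sums to len(points),
    so integrating the density map recovers the object count.
    """
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()                      # each point contributes mass 1
    dm = np.zeros(shape)
    for r, c in points:
        for dr in ax:
            for dc in ax:
                rr, cc = r + dr, c + dc
                if 0 <= rr < shape[0] and 0 <= cc < shape[1]:
                    dm[rr, cc] += kernel[dr + radius, dc + radius]
    return dm
```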

Balanced MSE Loss Function
As shown in Figure 5, we statistically analyze the sample distribution among the four categories of the RSOC dataset [57]. There is a serious sample imbalance in the dataset. Most remote sensing object counting methods use the MSE loss function, which can be formulated as

L_MSE = (1/n) Σ_{i=1}^{n} (x_gt,i − x_pred,i)²

where x_gt represents the ground truth label and x_pred represents the value predicted by the model. However, the imbalance of target labels, a major problem in computer vision research, also exists in object counting. We can observe from Figure 5 that the distribution of positive and negative samples within this dataset is extremely unbalanced. Therefore, we introduce the Balanced MSE Loss [42] as the loss function of our model, which has obvious advantages in theory and practice over the traditional MSE loss. We use p_train(x | z) to represent the distribution of the data during training and p_test(x | z) to represent the distribution during testing; they are related by

p_train(x | z) ∝ p_test(x | z) · p_train(x)

The Balanced MSE loss estimates p_test(x | z) by minimizing the Negative Log Likelihood (NLL) of p_train(x | z):

L = −log p_train(x_gt | z)

Accordingly, the Balanced MSE loss MSE_B can be expressed as

MSE_B = −log [ exp(−(x_pred − x_gt)² / (2σ²)) / ∫ exp(−(x_pred − x')² / (2σ²)) p_train(x') dx' ]

For more details, please refer to the original work [42].
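In practice, Ren et al. approximate the integral in MSE_B with the other targets in the training batch (their batch-based Monte-Carlo variant), which turns the loss into a cross-entropy over squared distances. A NumPy sketch of that estimator for scalar targets follows; it is illustrative only, and unlike the official implementation it treats σ as a fixed hyperparameter rather than a learned one:

```python
import numpy as np

def balanced_mse(pred, target, sigma=1.0):
    """Batch Monte-Carlo estimate of Balanced MSE for scalar regression.

    logits[i, j] = -(pred_i - target_j)^2 / (2 sigma^2); the loss is a
    cross-entropy asking pred_i to be closest to its own target_i relative
    to the rest of the batch, which counteracts over-represented labels.
    """
    logits = -(pred[:, None] - target[None, :]) ** 2 / (2 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))
```

Note that, unlike plain MSE, the loss is nonzero even for perfect predictions, since each prediction must also be pushed away from the other targets in the batch.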

Experiment
In this section, we first introduce the dataset and evaluation metrics used in the experiments. Then, we compare the proposed SwinCounter with other state-of-the-art methods on the RSOC dataset. Finally, we present ablation experiments that validate the effectiveness of the proposed method.

Remote Sensing Object Counting (RSOC) Dataset
The RSOC dataset [57] contains a total of 3057 images and 286,539 target instances and is currently the largest dataset in the field of remote sensing counting. It is divided into four sub-datasets, namely Building, Small-vehicle, Large-vehicle, and Ship; their training sets contain 1205, 222, 108, and 97 images, respectively, and their test sets contain 1263, 58, 64, and 60 images, respectively. More detailed information is presented in Table 1.

Evaluation Metrics
For counting tasks, the most widely used evaluation metrics are MAE and MSE, which we also adopt here to compare the proposed SwinCounter with other state-of-the-art methods. They are defined as follows:

MAE = (1/n) Σ_{i=1}^{n} |X_i − X̂_i|

MSE = sqrt( (1/n) Σ_{i=1}^{n} (X_i − X̂_i)² )

where n represents the number of test images, X_i is the ground truth count of the ith image, and X̂_i represents the count predicted by the counting model for the ith image.
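Both metrics can be computed directly from per-image counts. Note that, following the convention in the counting literature, the quantity reported as "MSE" is actually a root mean squared error. A short NumPy sketch:

```python
import numpy as np

def mae(gt, pred):
    # Mean absolute error between ground truth and predicted counts.
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    return float(np.mean(np.abs(gt - pred)))

def mse(gt, pred):
    # Root mean squared error, conventionally reported as "MSE" in counting papers.
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((gt - pred) ** 2)))
```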

Object Counting Result on the RSOC Dataset
To demonstrate the effectiveness of the proposed method, we perform the validation experiments on the RSOC dataset.
The performance of the proposed SwinCounter on each sub-dataset is compared with other SOTA methods in Table 2. SwinCounter achieves the best results on the Building, Small-vehicle, and Ship sub-datasets and the second best on the Large-vehicle sub-dataset. The Swin Transformer-based feature extractor and the MSFF provide more effective feature vectors and thus improve counting performance, while the Balanced MSE loss alleviates the problem of unbalanced sample labels to a certain extent. In particular, compared with the suboptimal approach ASPDNet [57], SwinCounter improves by 0.39 MAE and 0.56 MSE on the RSOC_building subset, 22.03 MAE and 37.29 MSE on the RSOC_ship subset, and 5.95 MAE on the RSOC_small_vehicle subset. Our method thus achieves state-of-the-art performance on multiple sub-datasets; the performance is strong for larger objects, but for dense and small object counting the method still leaves large room for improvement. Moreover, our method achieves good results for objects of different scales across multiple sub-datasets, which demonstrates its robustness.
The visualization results of the input images, ground truth, and predicted density maps for the four sub-datasets are shown in Figure 6. From these visualizations, it can be seen that our method achieves good results.
To verify the effectiveness of each proposed block, we conducted ablation experiments on the Small-Vehicle dataset with four different network combinations (baseline, baseline + feature addition, baseline + MSFF, and baseline + MSFF + Balanced MSE loss). The detailed settings are introduced as follows: • Baseline: We set the baseline as the original Swin Transformer backbone and directly output the prediction by upsampling the last feature layer.
The comparison results are shown in Table 3. Together with Figure 7, we see that the MAE under the Balanced MSE loss function is much higher than under the normal MSE loss at the beginning of training, but it also drops faster and finally performs better. After introducing each block (MSFF and Balanced MSE loss), the MAE and MSE improve over the baseline model, which shows the effectiveness of the proposed approaches. The predicted density maps of each group are visualized in Figure 8, where the first and third columns are scenarios with large-scale dense objects; the visualizations confirm the effectiveness of the proposed SwinCounter. The other images show less dense scenes, where the benefit of the proposed approach is even more obvious: the counting result of the proposed SwinCounter is closer to the ground truth, and it generates density maps with better details.

Ablation Study
We also conducted comparative experiments with different window sizes of the Swin Transformer block; the results are given in Table 4. Through these experiments, we found that the best result is obtained when the window size is set to 7 × 7, which is also the final setting of the SwinCounter.

Conclusions
In this paper, we propose the SwinCounter based on the Swin Transformer and achieve state-of-the-art remote sensing object counting performance on several mainstream datasets. To the best of our knowledge, this is the first time the vision transformer has been introduced into the remote sensing object counting task. The multi-level Swin Transformer encoder provides stronger representation ability without introducing too many parameters, while the MSFF further optimizes the fusion and transmission of multi-scale information. Finally, we use the Balanced MSE loss function to alleviate the problems caused by label imbalance. The proposed method achieves considerable performance for relatively large objects; however, for extremely dense small targets, there is still room for improvement, which we leave as future work.

Figure 1.
Figure 1. Remote sensing image object counting. For a remote sensing image with manually annotated point labels, Gaussian smoothing of the dot map is used to generate the corresponding density map. Unlike natural images, remote sensing images suffer from large target scale variation and unbalanced target distribution.

Figure 2.
Figure 2. Pipeline of the proposed remote sensing object counting framework (SwinCounter). Firstly, the Swin Transformer is used to extract multi-scale features, which are sent into the MSFF module to fuse multi-scale information. Finally, the fused features are used to predict the density map.

Figure 5.
Figure 5. The sample distribution of the four objects (i.e., Small Vehicle, Large Vehicle, Building, and Ship) in the RSOC dataset [57]. The ordinate represents the number of samples, and the abscissa represents the proportion of pixels belonging to the target object within the sample.

Figure 6.
Figure 6. The first row shows input images from the four sub-datasets, the second row shows the corresponding ground truth, and the last row shows the predicted density maps of the proposed SwinCounter.

Figure 7.
Figure 7. The MAE convergence curve during training on the RSOC Small-vehicle dataset under the MSE and Balanced MSE loss functions. The orange curve corresponds to the MSE loss, and the blue curve to the Balanced MSE loss.

Figure 8.
Figure 8. The images in the first row are original inputs from the Small-vehicle sub-dataset, and the second row shows the ground truth. The last row is the final density estimation of the complete SwinCounter, while the remaining rows are the density maps generated by the incomplete models for comparison.

Table 1.
Relevant statistics for the four sub-datasets of the RSOC dataset.

Table 3
• Baseline + feature addition: This group directly adds the feature layers extracted by the Swin Transformer to obtain the final feature layer and then obtains the density prediction through a simple convolution layer. According to the results in Table 3, simple feature addition already brings a certain improvement. • Baseline + MSFF: The third group uses the newly proposed MSFF as the feature fusion block to perform pyramid-style fusion on the multi-scale features provided by the baseline, using the normal MSE loss function during training.
• Baseline + MSFF + Balanced MSE loss: The last group adopts the Balanced MSE loss function to tackle the label imbalance problem; this is the full model, namely SwinCounter.

Table 2.
Comparison with other state-of-the-art methods on the Large-vehicle and Ship sub-datasets of the RSOC dataset, with MAE and MSE as evaluation metrics. The best-performing method is marked in red, and the second-best in blue.

Table 3.
Experimental results of the SwinCounter with different network blocks, reported in terms of MAE and MSE.

Table 4.
The influence of the window size; the measurement indicators are MAE and MSE.