Small Object Detection Method Based on Adaptive Spatial Parallel Convolution and Fast Multi-Scale Fusion

Abstract: As one type of object detection, small object detection has been widely used in daily-life applications with real-time requirements, such as autopilot and navigation. Although deep-learning-based object detection methods have achieved great success in recent years, they are not effective in small object detection and most of them cannot achieve real-time processing. Therefore, this paper proposes a single-stage small object detection network (SODNet) that integrates specialized feature extraction and information fusion techniques. An adaptively spatial parallel convolution module (ASPConv) is proposed to alleviate the lack of spatial information for target objects and adaptively obtain the corresponding spatial information through multi-scale receptive fields, thereby improving the feature extraction ability. Additionally, a split-fusion sub-module (SF) is proposed to effectively reduce the time complexity of ASPConv. A fast multi-scale fusion module (FMF) is proposed to alleviate the insufficient fusion of both semantic and spatial information. FMF uses two fast upsampling operators to first unify the resolution of the multi-scale feature maps extracted by the network and then fuse them, thereby effectively improving the small object detection ability. Comparative experimental results show that the proposed method considerably improves the accuracy of small object detection on multiple benchmark datasets and achieves high real-time performance.


Introduction
Since the advent of deep convolutional neural networks, the performance of object detection methods has been rapidly improving. At present, the representative object detectors, as the core components of various object detection methods, are mainly divided into two categories: (1) two-stage proposal-based detectors with the advantage of accuracy [1,2]; (2) single-stage proposal-free detectors with the advantage of speed [3,4]. Many recently proposed two-stage detectors [5][6][7] focus on improving the accuracy of object detection. Some single-stage detection frameworks, such as YOLO [8,9] and its improved variants, have been applied to datasets such as MS COCO [10] and PASCAL VOC [11], where they outperform some two-stage detectors. Additionally, these single-stage detectors run faster than two-stage detectors. As an objective evaluation indicator of real-time performance, a frame rate of at least 30 frames per second (FPS) is generally required [2,12]. Therefore, single-stage detectors [4,9,13] have been widely used in scenes with high real-time requirements.
Most of the current mainstream object detection frameworks have not made special improvements for small objects. However, a large number of cases involve small objects in actual scenes, such as recognizing a disaster victim in an unmanned aerial vehicle (UAV) search-and-rescue, and recognizing distant traffic signs and vehicles using autopilot. In this paper, both the training and testing stages of small object detection are implemented on images with a resolution range 51 × 72 ≤ size ≤ 4064 × 6354. According to the image resolution range used in this paper, the objects with a resolution of 32 × 32 or lower are generally called small objects. When the objects' resolution is 20 × 20 or lower, the corresponding objects are specifically called tiny objects. As shown in Figure 1, the absolute size represents the actual pixel size of the object in the image, and the relative size represents the ratio of the pixel size of the object in the image to the entire image. As shown in Figure 1a-c, TinyPerson [14], Tsinghua-Tencent 100K [15], and unmanned aerial vehicles' detection and tracking (UAVDT) [16] are three small object datasets, which contain a high number of UAV and autopilot object detection scenes, respectively. The resolution range of all the images in TinyPerson is 497 × 700 ≤ size ≤ 4064 × 6354. The resolution of all the images in both Tsinghua-Tencent 100K and UAVDT is 2048 × 2048 and 1024 × 540, respectively. The resolution range of all the images in MS COCO is 51 × 72 ≤ size ≤ 640 × 640. As shown in Figure 1d,e, when objects have a small absolute or relative size, the object detection performance of existing detectors decreases to a certain extent. Many small object detection methods have been proposed to meet the needs of practical applications. Most of them were developed based on the improvement of existing object detection methods. Additionally, these developed methods mainly focus on improving the accuracy of small object detection. 
However, a high real-time performance of detectors is usually necessary in small object detection scenes. Positioning and classification are two main object detection subtasks [17]. Therefore, object detection should not only accurately locate all objects in an image, but also correctly identify their categories. Object detection tasks usually require the spatial and semantic information extracted from neural networks to assist in object positioning and classification [18,19]. However, due to the inconspicuous/weak features of small objects, small object detection needs to be optimized for the following two aspects. First, the existing research results show that the surrounding environment is essential for humans to recognize small objects [20]. In object detection, local context information represents the visual information of the area around the object to be detected [21]. Additionally, the experimental results of existing computer vision research show that the proper modeling of the spatial background can improve the accuracy of object detection [22]. Therefore, the existing methods capture the local context information of an object through a relatively large receptive field, thereby trying to obtain the abundant fine-grained spatial information of the object [23][24][25][26]. However, the excessive use of large-scale convolution kernels with large receptive fields causes an increase in the time and space complexity of detection models, which is not conducive for single-stage detectors to achieve real-time performance. Second, due to their small size, the spatial information of small objects usually disappears in the feature transmission process. In neural networks, image features are gradually transmitted to deep layers. Additionally, the corresponding image size simultaneously decreases. If any relevant processing is not applied to the small objects shown in the image, the related small objects disappear in the feature transmission process. 
The multi-scale fusion of feature maps between different network levels is an effective way to solve the above issue [21]. For example, some existing solutions, such as [6,27,28,29], generally adopt a top-down path to construct a feature pyramid [27], thereby alleviating feature disappearance to a certain extent. Additionally, a feature pyramid can be used to fuse both the spatial and semantic feature information, which can optimize small object detection. This paper proposes a SODNet composed of an adaptively spatial parallel convolution module (ASPConv) and a fast multi-scale fusion module (FMF) to optimize both the extraction of spatial information and the fusion of spatial and semantic information, while still achieving real-time processing. ASPConv is used to adaptively extract features by using multi-scale receptive fields. FMF optimizes both the semantic and spatial information of output features to achieve feature map upsampling and multi-scale feature fusion. Additionally, the real-time-related factors are considered in both modules to ensure the high real-time performance of the proposed SODNet.
The proposed SODNet is applied to four public datasets, TinyPerson, Tsinghua-Tencent 100K, UAVDT and MS COCO. According to the comparative experimental results, the proposed SODNet can effectively improve the accuracy of small object detection in real-time. This paper has three main contributions, as follows:
• This paper proposes an adaptive feature extraction method using multi-scale receptive fields. Due to the small proportion in the image and inconspicuous features, the spatial information of small objects is always missing. The proposed method divides the input feature map equally among the channels and performs feature extraction on the separated feature channels in parallel. Additionally, the cascading relationship of multiple convolution kernels is used to achieve the effective extraction of local context information for different channels. Therefore, the features related to small objects with multi-scale spatial environmental information can be obtained by fusing the extracted information.
• This paper proposes a new feature map upsampling and multi-scale feature fusion method. This method uses both nearest-neighbor interpolation and the sub-pixel convolution algorithm to map a low-resolution feature map with rich semantic information to a high-resolution space, thereby constructing a high-resolution feature map with rich semantic features. A feature map with sufficient spatial and semantic information is obtained by the fusion of the constructed feature map and a feature map with rich spatial information, thereby improving the detection ability of small objects.
• This paper designs a one-stage, real-time detection framework of small objects. The ASPConv module is proposed to extract image features from multiple channels in parallel, which effectively reduces the time complexity of feature extraction to achieve real-time small object detection. The FMF module is proposed to apply both nearest-neighbor interpolation and sub-pixel convolution to achieve a fast upsampling. The processing time of multi-scale feature map fusion is reduced by improving upsampling efficiency to ensure real-time small object detection.
The rest of this paper is organized as follows: Section 2 discusses the related work; Section 3 describes the details of the proposed method, including the implementation of both ASPConv and FMF modules; Section 4 presents the experimental results, a comparative analysis, and an ablation study on TinyPerson, Tsinghua-Tencent 100K, UAVDT and MS COCO datasets to verify the effectiveness of the proposed method; and Section 5 concludes the paper.

Related Work
With the rapid development of deep learning, the performance of object detection methods was accordingly improved. Object detectors are usually classified into two categories: two-stage detectors [2,5,6,7] and single-stage detectors [4,8,9,13]. Although most of the existing methods achieved a relatively good object detection performance, they do not have any specialized optimization for small objects. However, most existing small object detection methods were proposed based on the optimization of conventional object detection methods. Generally speaking, small object detection methods usually optimize the following aspects:
Feature extraction. Well-designed convolution modules can adaptively extract the rich feature information of small objects in complex scenes. Dilated convolution [24,30] controls the size of receptive fields by changing the sampling center distance. Receptive field block net (RFB) [23] introduces a dilated convolution on the basis of inception [25], and strengthens the network extraction ability by simulating the receptive fields of human vision. Deformable convolution [31] adaptively learns the unique resolution of a single object to adapt it to multi-scale features. Selective kernel networks (SKNet) [32] design a selection module to adaptively adjust the size of receptive fields according to the multi-scale input information. These methods use a specifically designed convolution module to make receptive fields rich enough to adjust for the inconsistency in object size. However, they do not focus on improving the detection performance of small objects and ignore the importance of spatial information.
High-resolution features. Since small objects are difficult to find and locate, spatial information is necessary, which can easily be obtained from high-resolution feature maps. Li et al. [33] proposed a feature-level super-resolution method, specialized for small object detection, which used the features of large objects to enhance the features of small objects through Perceptual GAN. Noh et al. [34] first applied super-resolution techniques to enhance the region of interest (RoI) features of small objects. Then, appropriate high-resolution object features were used as supervision signals in the model training process to enhance the model's small object learning ability. The efficient sub-pixel convolutional neural network (ESPCN) [35] performs super-resolution reconstruction through sub-pixel convolution and uses a series of convolution operations to reconstruct low-resolution features into high-resolution features, to achieve the purpose of upsampling. A reference-based method proposed by Zhang et al. [36] uses the rich texture information of high-resolution reference images to compensate for the missing details in low-resolution images. Although these methods are helpful when obtaining high-resolution feature maps, they do not fully fuse the spatial and semantic information in the features. Therefore, it is still difficult to detect small objects.
Multi-scale feature maps. Many studies have proved that the fusion of multi-scale feature maps is also conducive to small object detection. Due to the slow speed of image pyramids and high memory consumption, the current mainstream object detection methods use feature pyramid structures to achieve cross-scale connections and fusion. Feature pyramid networks (FPN) [27,37] use a top-down method to build one feature pyramid and fuse the features of different scales to improve the multi-scale detection performance. The path aggregation network (PANet) [28] adds a bottom-up path based on FPN to transmit detailed spatial information. Gated feedback refinement network (G-FRNet) [38] is a gated feedback optimization network. It adds a gated structure based on FPN. It uses the feature layer with rich semantic information to filter the fuzzy information and ambiguity in the feature layer with rich spatial information. Therefore, the transmission of key information is achieved to block ambiguous information. The bi-directional feature pyramid network (BiFPN) structure used in EfficientDet [6] overlaps the optimized FPN in a repeated form to improve feature richness. Although these feature pyramid-like structures improve the performance of multi-scale detection, they cannot improve the detection performance on small objects. Therefore, this paper proposes the ASPConv module to perceive the local context of the target through multi-scale receptive fields and enrich the spatial information. In addition, this paper proposes the FMF module to integrate high-resolution feature map generation and multi-scale feature map fusion. Both spatial and semantic feature information is fused. The suitable high-resolution feature maps are generated for small object detection.

The Proposed Method
Figure 2 shows the structure of the proposed SODNet. In the feature extraction stage, the detailed spatial information of small objects is first extracted by the ASPConv module. Then, further feature extraction is performed to obtain the feature maps $C_i$, $i = 1, 2, 3, 4$, which are downsampled by the sub-backbone in the proposed backbone module. The obtained feature maps contain features from different network levels and can be used in the subsequent feature fusion (shown in Table 1). In the feature fusion stage, the element-by-element addition in the lateral connections of FPN is replaced by a concatenation operation, and the fine-tuned FPN structure is applied to the fusion of feature maps to obtain the new feature maps $P_2$, $P_3$, and $P_4$. Specifically, the feature map $P_4$ is obtained by the convolution of the feature map $C_4$, the feature map $P_3$ is obtained by fusing the feature maps $P_4$ and $C_3$, and the feature map $P_2$ is obtained by fusing the feature maps $P_3$ and $C_2$. Additionally, the FMF module maps the low-resolution feature maps ($P_2$, $P_3$, and $P_4$) with rich semantic information to the high-resolution space, and integrates the feature map $C_1$ with rich spatial information from the ASPConv module to generate the high-resolution feature map $P_1$ with rich semantic and spatial information. Therefore, the obtained feature maps $P_j$, $j = 1, 2, 3, 4$, can be used to improve the detection ability of small objects.
As shown at the bottom of Figure 2, the predictor uses four independent convolution units to perform positioning and classification. The feature maps $P_j$, $j = 1, 2, 3, 4$, are processed to obtain the detection results [9]. The kernel size and stride of each convolution unit are 1 × 1 and 1, respectively. The predictor predicts the categories and bounding boxes of all objects in the corresponding feature maps through these convolution units. In the training stage, the categories and bounding boxes obtained by the predictor are classified and regressed against the ground truth. The loss of each image is calculated, and the network weights are continuously updated through backpropagation until the model converges. In the inference stage, threshold filtering and non-maximum suppression are applied to the classification and positioning results obtained by the predictor to eliminate overlapping or abnormal bounding boxes and obtain the final detection result.

Adaptively Spatial Parallel Convolution Module
As shown in Figure 3, this paper proposes an adaptively spatial parallel convolution module to enable neurons to adaptively learn the spatial information of objects in the original image. This module can associate with local context to obtain rich spatial information. The Conv module consists of Conv2d, Batch Normalization [39] and the Hardswish function [40]. Conv2d represents a standard convolution operation, $k$ represents the size of the convolution kernel, and $s$ is the stride (unless specified, the default size of the convolution kernel is 3 × 3 and the default stride is 1). $C$ represents the channel number of a feature map. The ASPConv module first downsamples the original image to obtain the feature map $X$. Then, the obtained feature map $X$ is equally divided along the channel dimension to obtain feature maps $X_1$ and $X_2$. Next, the obtained feature maps $X_1$ and $X_2$ are convolved in a parallel manner. Additionally, the cascading relationship of multiple convolutions is used to realize the effective extraction of the local context of the objects in the feature maps $X_1$ and $X_2$. Subsequently, all the information is fused by a skip connection [41] to obtain the detailed spatial information that is conducive to the detection of small objects. Finally, downsampling and further feature extraction are performed on the fused feature map to obtain the feature map $C_1$.
Local context information. The size of receptive fields (RF) is important for the recognition of small objects. The receptive field is the size of the area on the original image onto which a pixel of the feature map output by a convolutional neural network layer is mapped [42], and it is defined as follows:

$$RF_{i+1} = RF_i + (k - 1) \times Stride_i \tag{1}$$

where $RF_{i+1}$ represents the receptive field of the $(i+1)$-th layer (the current layer), $RF_i$ represents the receptive field of the $i$-th layer (the previous layer), $Stride_i$ represents the stride in the $i$-th layer of convolution or pooling operation, and $k$ represents the convolution kernel size of the $(i+1)$-th layer (the current layer).
According to the calculation of Equation (1), the pixel points on the feature map have a mapping area size of 3, 7, and 11 on the original image, respectively, after the convolution of $X$, $X_1$, and $X_2$. Therefore, the parallel convolution operations in the ASPConv module can capture the multi-scale local context information. In addition, unlike some existing structures [24,25], the ASPConv module captures rich local context information through the multi-convolution cascading relationship of the SF sub-module. Additionally, the reduction in feature maps in the SF sub-module also effectively reduces the time complexity of ASPConv.
Split. Given any feature map $X \in \mathbb{R}^{C \times H \times W}$, two transformations are first performed on the feature map, and then the feature map is divided into two components, $F_1: X \rightarrow X_1$ and $F_2: X \rightarrow X_2$. The transformation $F_1$ is composed of Conv2d, Batch Normalization [39], and the Hardswish function [40]. The transformation $F_2$ is composed of Conv2d. To further improve the efficiency, two convolutions with 3 × 3 kernel size are used to replace the convolution with 5 × 5 kernel size on the channel of the feature map $X_1$, so the amount of both the calculation and parameters can be reduced under the same receptive field conditions [43].
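The receptive-field sizes quoted above and the saving from replacing a 5 × 5 kernel can be checked with a few lines of arithmetic. The following is a minimal sketch rather than the authors' code: it assumes that the stride term of Equation (1) is taken as the product of the strides of all earlier layers (the usual convention), which is the reading that reproduces the values 3, 7, and 11. The function names `receptive_fields` and `conv_weights` and the channel count of 32 are hypothetical.

```python
def receptive_fields(layers):
    """Return the receptive field after each (kernel, stride) layer.

    Assumption: the stride factor is the cumulative stride of all
    earlier layers, so each layer adds (kernel - 1) * cumulative_stride.
    """
    rf, jump, sizes = 1, 1, []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth contributed by this layer
        jump *= stride              # spacing of this layer's outputs, in input pixels
        sizes.append(rf)
    return sizes


def conv_weights(kernel, c_in, c_out, count=1):
    """Weights (bias ignored) of `count` stacked kernel x kernel convolutions."""
    return count * kernel * kernel * c_in * c_out


# Downsampling convolution (k=3, s=2) followed by two 3x3, stride-1 convolutions.
print(receptive_fields([(3, 2), (3, 1), (3, 1)]))   # [3, 7, 11]

# A single 5x5 convolution after the same downsampling reaches the same field...
print(receptive_fields([(3, 2), (5, 1)]))           # [3, 11]

# ...but two 3x3 convolutions need fewer weights (32 channels is illustrative).
print(conv_weights(3, 32, 32, count=2), conv_weights(5, 32, 32))   # 18432 25600
```

Per input-output channel pair, the two stacked 3 × 3 kernels use 18 weights against 25 for one 5 × 5 kernel, which is the reduction the Split step relies on.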
Fuse. The multi-scale local context information of objects is adaptively learned and fused in a different receptive field size to construct the spatial position relationship between objects and environment. Therefore, the spatial information that is conducive to the detection of small objects can be obtained. The calculation process of the fused feature map X 3 is given as follows: where δ(·) and B(·) represent Hardswish function and Batch Normalization, respectively, and X 3 ∈ R C×H×W . X 3 is the intermediate result of the feature maps X 1 and X 2 after parallel convolution and concatenation operation, which is defined as follows.
where Cat[·] represents a concatenation operation, C(·) represents a standard convolution operation with a 3 × 3 kernel size and a stride of 1, and G(·) = δ(B(C(·))). Extract. After the Split and Fuse operations, the ASPConv module effectively fuses the multi-scale local context information from the same layer, enriches the spatial information and obtains the fused feature map X 3 . Subsequently, the feature map X 3 is downsampled to reduce its resolution and a further feature extraction is performed to obtain the output feature map X out of the ASPConv module, which is defined as follows: Finally, the ASPConv module effectively extracts and fuses the multi-scale local context information through operations Split, Fuse, and Extract, and obtains a feature map X out with rich spatial information, which corresponds to the feature map C 1 in Figure 2. In addition, the feature map C 1 is applied to the multi-scale feature map fusion used in the FMF module (Section 3.3) to improve the small object detection ability of the proposed model.

Proposed Backbone Module
As shown in Figure 4, the proposed backbone module consists of a sub-backbone and FPN. The sub-backbone is mainly an improvement of the cross-stage partial darknet (CSPDarkNet) [9], chosen to balance speed and accuracy. The original structure of CSPDarkNet corresponds to $C_i$, $i = 1, 2, 3, 4$, in Table 1. In the sub-backbone, the structure that produces the feature map $C_1$ is replaced by the proposed ASPConv module. Therefore, feature maps with rich spatial information can be obtained, which are conducive to small object detection. In addition, as shown in Table 1, the original PANet [28] is replaced by FPN [27] in the multi-scale feature map ($P_j$, $j = 2, 3, 4$) generation stage. The specific implementation details of FPN are shown in Figure 4.

Table 1. Layer names and layer components of Yolov5s [9] and the proposed SODNet. $C_i$ and $P_j$ are consistent with the definitions in Figure 2. $CBTC_1$ and CSBTC are consistent with the definitions in Figure 4.

In Figure 4, the sub-backbone is composed of the $CBTC_i$ and CSBTC modules. They all consist of convolution units and feature extraction units, in which the $BTC_i$ module obtained after CSP-related operations is the basic component of the $CBTC_i$ and CSBTC modules. $N$ feature extraction units $BT_i$ are integrated into the $BTC_i$ module. The number of $BT_i$ units in $C_2$, $C_3$, and $C_4$ is adjusted to 3, 3, and 2, respectively. In the middle layer of the $CBTC_1$ module, the Conv module is set to $k = 3$, $s = 2$ to achieve downsampling, following the parameter settings in [9]. The Conv module of $CBTC_2$ in FPN is set to $k = 3$, $s = 1$ to further extract features and eliminate the aliasing effects that may be caused by feature fusion. In addition, the CSBTC module consists of convolution units and SPP [44]. SPP is applied to pool and cascade multi-scale local area features, so that global and local multi-scale features are used together to improve the detection accuracy. The parameter settings of max pooling in SPP are the same as those used in [9]: the kernel size of max pooling is set to 1 × 1, 5 × 5, 9 × 9, and 13 × 13, respectively, in the following experiments in this paper. Finally, the proposed backbone module improves the sub-backbone with a low computational cost to increase the inference speed of the overall model and enhance the feature extraction ability. Meanwhile, the feature maps $P_2$, $P_3$, and $P_4$ required for fusion and detection are generated through FPN.
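The multi-scale pooling performed by SPP can be illustrated on a single channel. This is a minimal sketch under stated assumptions, not the implementation of [9] or [44]: it assumes stride-1 max pooling whose window is clipped at the border so that every pooled map keeps the input resolution, and the helper names `max_pool_same` and `spp` are hypothetical. Only the kernel sizes 1, 5, 9, and 13 come from the text.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling; the window is clipped at the border (size preserved)."""
    h, w, p = len(fmap), len(fmap[0]), k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [fmap[j][i]
                      for j in range(max(0, y - p), min(h, y + p + 1))
                      for i in range(max(0, x - p), min(w, x + p + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp(fmap, kernels=(1, 5, 9, 13)):
    """Pool one channel at several scales; the maps would be concatenated on channels."""
    return [max_pool_same(fmap, k) for k in kernels]


fmap = [[0] * 7 for _ in range(7)]
fmap[3][3] = 1                               # a single activation in the centre
pooled = spp(fmap)
print([sum(map(sum, m)) for m in pooled])    # [1, 25, 49, 49]
```

The single activation spreads over a 5 × 5 neighbourhood at kernel size 5 and over the whole 7 × 7 map at kernel sizes 9 and 13, which shows how the larger kernels contribute increasingly global context at the same resolution.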

Fast Multi-Scale Fusion Module
The fusion of multi-scale feature maps is conducive to the detection of small objects. In addition, the effective spatial information of small objects usually exists in the feature map C 1 [45]. A fast multi-scale fusion module is proposed to fuse multi-scale feature maps and generate high-resolution feature maps with rich semantic and spatial information, thereby improving the detection ability of small objects.
As shown in Figure 5, the FMF module uses the sub-pixel convolutional layer [35] to learn an array of upscaling filters, upscales the feature map $P_3$ to a high-resolution space, and automatically learns the interpolation function for the transformation from low resolution to high resolution through the preceding convolutional layers. The sub-pixel convolution is highly efficient: it disperses pixels from the channel dimension and adds them to the width and height dimensions. The input of the sub-pixel convolutional layer is $L \in \mathbb{R}^{Cr^2 \times H \times W}$, and the output after rearrangement is $H \in \mathbb{R}^{C \times rH \times rW}$. The corresponding operation can be formalized as follows.

$$SPC(T)_{x,y,c} = T_{\lfloor x/r \rfloor,\ \lfloor y/r \rfloor,\ C \cdot r \cdot \mathrm{mod}(y,r) + C \cdot \mathrm{mod}(x,r) + c} \tag{6}$$

where $SPC(T)_{x,y,c}$ is the output pixel value at the coordinates $(x, y, c)$ after the pixel scatter operation $SPC(\cdot)$, and $r$ is the upscaling ratio. The proposed method uses $r = 2$ to double the spatial scale of feature map $P_3$ for fusion with the feature map $P_2$.

Figure 5. The structure of FMF. The low-resolution feature maps $P_2$ and $P_3$ with rich semantic information are mapped to the high-resolution space and fused with the feature map $C_1$ with rich spatial information to generate a high-resolution feature map $P_1$ with rich semantic and spatial information. $P_2$ and $P_3$ are generated by FPN shown in Figure 4. $C_1$ is the output feature map of the ASPConv module.
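The index mapping of Equation (6) can be made concrete with a small pure-Python rearrangement. This is an illustrative sketch, not the authors' implementation: it reads the first two indices of Equation (6) as the integer divisions ⌊x/r⌋ and ⌊y/r⌋, stores the tensor as nested lists indexed `[x][y][channel]`, and uses the hypothetical function name `sub_pixel_shuffle`.

```python
def sub_pixel_shuffle(t, r):
    """Rearrange a [X][Y][C*r*r] tensor into [r*X][r*Y][C] following Equation (6)."""
    size_x, size_y, ch_in = len(t), len(t[0]), len(t[0][0])
    assert ch_in % (r * r) == 0, "channel count must be divisible by r^2"
    c_out = ch_in // (r * r)
    out = [[[0] * c_out for _ in range(size_y * r)] for _ in range(size_x * r)]
    for x in range(size_x * r):
        for y in range(size_y * r):
            for c in range(c_out):
                # Source channel: C*r*mod(y, r) + C*mod(x, r) + c
                src = c_out * r * (y % r) + c_out * (x % r) + c
                out[x][y][c] = t[x // r][y // r][src]
    return out


# One low-resolution position with 4 channels becomes a 2x2 patch with 1 channel.
low = [[[10, 11, 12, 13]]]
high = sub_pixel_shuffle(low, r=2)
print(high)   # [[[10], [12]], [[11], [13]]]
```

No arithmetic is performed on the values; the operation only moves them from the channel dimension into the spatial dimensions, which is why it is cheap compared with a learned transposed convolution.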
The FMF module first concatenates the feature map P 2 with H on the channels, then passes through the BTC 2 module to eliminate the aliasing effects that may be caused by concatenation, and further extracts the fused feature information. At the same time, an element-wise addition method is adopted to ensure that the outputĤ fuses the semantic and regional information from feature maps P 2 and P 3 as follows.
where f Conv and f BTC 2 represent the Conv module in Figure 3 and BTC 2 module in Figure 4 respectively, andĤ ∈ R C×rH×rW . SP C is first used to perform the channel-to-space transformation, and then f BTC 2 is used to enhance the transformation in the spatial range. Finally, the high-resolution feature map P 1 is generated by fusing semantic and spatial information, as follows: where N N I(·) represents the nearest neighborhood interpolation operation, and P 1 ∈ R C 2 ×2rH×2rW . Similarly, N N I is first used to transform the input features in space, and then f BTC 2 is used to further spread their spatial influence.
In the FMF module, SPC and NNI are alternately used in upsampling to achieve the fusion of semantic and spatial information. Therefore, the loss of detailed information on small objects caused by an excessively high upsampling rate can effectively be avoided. Only convolution kernels of 1 × 1 or 3 × 3 size are used in the FMF module, so its time complexity is lower than it would be with large-size convolution kernels.
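Nearest-neighbor interpolation, the second upsampling operator used by FMF, has no learnable parameters: each value is simply repeated. The following is a minimal single-channel sketch with the hypothetical name `nearest_neighbor_upsample`; the scale factor of 2 is an assumption chosen to match the doubling of resolution described for this module.

```python
def nearest_neighbor_upsample(fmap, scale=2):
    """Upsample a 2D map [H][W] by repeating every value scale x scale times."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(scale)]     # repeat along the width
        out.extend(list(wide) for _ in range(scale))      # repeat along the height
    return out


up = nearest_neighbor_upsample([[1, 2], [3, 4]])
print(up)
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Because the operator only copies values, it changes the resolution without changing the channel count, whereas the sub-pixel convolution trades channels for resolution; alternating the two is what lets the module reach the target resolution in two moderate steps.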

Predictor
SODNet finally inputs the four feature maps $P_1$, $P_2$, $P_3$, and $P_4$ to the predictor to perform classification and positioning. According to the existing research [3,9,14], the loss function for classification and positioning in SODNet mainly includes three components: position loss, confidence loss, and classification loss.
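The position loss is the error between the predicted bounding boxes and the ground truth, which is calculated using the generalized intersection over union (GIoU) [46] loss function. Assuming that the coordinates of a predicted bounding box and its ground truth are

$$B^p = (x_1^p, y_1^p, x_2^p, y_2^p), \qquad B^g = (x_1^g, y_1^g, x_2^g, y_2^g),$$

the areas of the two boxes are

$$A^p = (\hat{x}_2^p - \hat{x}_1^p)(\hat{y}_2^p - \hat{y}_1^p), \qquad A^g = (x_2^g - x_1^g)(y_2^g - y_1^g),$$

where $A^p$ and $A^g$ represent the area of $B^p$ and $B^g$, respectively, and $\hat{x}_1^p = \min(x_1^p, x_2^p)$, $\hat{x}_2^p = \max(x_1^p, x_2^p)$, $\hat{y}_1^p = \min(y_1^p, y_2^p)$, and $\hat{y}_2^p = \max(y_1^p, y_2^p)$ are the ordered coordinates of the predicted box. The intersection area is

$$I = \begin{cases} (x_2^I - x_1^I)(y_2^I - y_1^I), & x_2^I > x_1^I \text{ and } y_2^I > y_1^I \\ 0, & \text{otherwise} \end{cases} \tag{12}$$

with $x_1^I = \max(\hat{x}_1^p, x_1^g)$, $x_2^I = \min(\hat{x}_2^p, x_2^g)$, $y_1^I = \max(\hat{y}_1^p, y_1^g)$, and $y_2^I = \min(\hat{y}_2^p, y_2^g)$. The area of the smallest box enclosing $B^p$ and $B^g$ is

$$A^c = (x_2^c - x_1^c)(y_2^c - y_1^c),$$

where $x_1^c = \min(\hat{x}_1^p, x_1^g)$, $x_2^c = \max(\hat{x}_2^p, x_2^g)$, $y_1^c = \min(\hat{y}_1^p, y_1^g)$, and $y_2^c = \max(\hat{y}_2^p, y_2^g)$. The intersection over union is

$$IoU = \frac{I}{U},$$

where $I$, the area of the intersection of $B^p$ and $B^g$, is obtained by Equation (12), $U$ is the area of the union of $B^p$ and $B^g$, and $U = A^p + A^g - I$. So, GIoU can be calculated as follows:

$$GIoU = IoU - \frac{A^c - U}{A^c}.$$

The final position loss $L_{GIoU}$ can be obtained by GIoU as follows:

$$L_{GIoU} = 1 - GIoU.$$

Confidence loss $L_{conf}$ is the relative error of the object confidence score prediction [3], which can be calculated as follows: where $S$ is the side length of the feature map that is input into the predictor; $B$ is the number of anchors in each cell of the feature map; $\lambda_{noobj}$ as the balance coefficient is set to 0. are opposite to that of $I_{ij}^{obj}$; $C_{ij}$ is the confidence score of the $j$-th anchor in the $i$-th cell predicted by the proposed SODNet, and $C_{ij} \in [0, 1]$.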
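The GIoU-based position loss can be sketched for a single pair of boxes. This follows the standard GIoU definition of [46] (IoU minus the fraction of the smallest enclosing box not covered by the union, with the loss taken as 1 − GIoU); it is an illustration rather than the training code of SODNet. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with non-zero area, and the names `giou` and `giou_loss` are hypothetical.

```python
def giou(box_p, box_g):
    """GIoU between a predicted box and a ground-truth box, both (x1, y1, x2, y2)."""
    # Order the predicted coordinates so that x1 <= x2 and y1 <= y2.
    xp1, xp2 = min(box_p[0], box_p[2]), max(box_p[0], box_p[2])
    yp1, yp2 = min(box_p[1], box_p[3]), max(box_p[1], box_p[3])
    xg1, yg1, xg2, yg2 = box_g

    area_p = (xp2 - xp1) * (yp2 - yp1)
    area_g = (xg2 - xg1) * (yg2 - yg1)

    # Intersection I (zero when the boxes do not overlap).
    iw = min(xp2, xg2) - max(xp1, xg1)
    ih = min(yp2, yg2) - max(yp1, yg1)
    inter = iw * ih if iw > 0 and ih > 0 else 0.0

    union = area_p + area_g - inter                      # U = A_p + A_g - I
    # Area of the smallest box enclosing both boxes.
    area_c = (max(xp2, xg2) - min(xp1, xg1)) * (max(yp2, yg2) - min(yp1, yg1))
    return inter / union - (area_c - union) / area_c


def giou_loss(box_p, box_g):
    """Position loss for one box pair: 1 - GIoU, in the range [0, 2)."""
    return 1.0 - giou(box_p, box_g)


print(giou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # 0.0 for a perfect match
print(giou((0, 0, 1, 1), (2, 0, 3, 1)))        # negative for disjoint boxes
```

Unlike plain IoU, the value stays informative when the boxes do not overlap: the enclosing-box term grows with the gap between them, so the loss still provides a gradient direction, which matters for very small objects whose predicted boxes often miss the ground truth entirely.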
The classification loss L cls is the error between the predicted category of the object and the corresponding true category [3], which is calculated as follows.
where $p_{ij}^{c}$ is the predicted category of the $j$-th anchor in the $i$-th cell, and $\hat{p}_{ij}^{c}$ is the true category. Therefore, the total loss $L$ of SODNet is obtained as follows.

$$L = L_{GIoU} + L_{conf} + L_{cls} \tag{19}$$

In the training stage, SODNet continuously optimizes the loss $L$ and updates the network weights through backpropagation until the model converges. In the testing stage, SODNet does not perform backpropagation. It directly performs post-processing operations, such as confidence threshold screening and non-maximum suppression processing, on the classification and positioning results obtained by the predictor to obtain the final detection results.
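The post-processing applied in the testing stage, confidence threshold screening followed by non-maximum suppression, can be sketched as below. This is a generic greedy NMS, not the authors' code: the threshold values 0.25 and 0.45, the per-class suppression, and the names `iou` and `postprocess` are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = iw * ih if iw > 0 and ih > 0 else 0.0
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, conf_thres=0.25, iou_thres=0.45):
    """Filter (box, score, cls) detections by confidence, then apply greedy NMS.

    Assumption: a box is suppressed only by a higher-scoring kept box of the
    same class whose IoU with it exceeds iou_thres.
    """
    candidates = sorted((d for d in detections if d[1] >= conf_thres),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for box, score, cls in candidates:
        if all(cls != k_cls or iou(box, k_box) <= iou_thres
               for k_box, _, k_cls in kept):
            kept.append((box, score, cls))
    return kept


dets = [((0, 0, 10, 10), 0.9, 0),     # kept: highest score
        ((1, 1, 11, 11), 0.8, 0),     # suppressed: IoU with the first box is about 0.68
        ((50, 50, 60, 60), 0.7, 0),   # kept: no overlap
        ((0, 0, 10, 10), 0.1, 0)]     # dropped by the confidence threshold
print([score for _, score, _ in postprocess(dets)])   # [0.9, 0.7]
```

For densely packed small objects the IoU threshold is a sensitive choice: a value that is too low merges neighbouring objects, so it would normally be tuned per dataset.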

Datasets and Evaluation Metrics
TinyPerson. TinyPerson [14] is a small object benchmark dataset containing a high number of small objects. All the images were collected from real-world scenes by unmanned aerial vehicles (UAVs). TinyPerson contains 1610 images with 72,651 labeled frames, of which 794 and 816 are used as training and testing images, respectively. The objects in TinyPerson are very small. According to the area occupied by each object, the objects are divided into tiny1 (area ≤ 8 × 8), tiny2 (8 × 8 < area ≤ 12 × 12), tiny3 (12 × 12 < area ≤ 20 × 20), tiny (area ≤ 20 × 20) consisting of tiny1, tiny2 , and tiny3, small (20 × 20 < area ≤ 32 × 32), and non-small (area > 32 × 32) objects. As shown in Figure 1, their corresponding proportions are 25.2%, 21.4%, 24.4%, 71%, 14.0%, and 15.0%, respectively. The object proportion of the interval area ≤ 32 × 32 in the TinyPerson dataset is 85%. This means that the TinyPerson dataset can be used to evaluate the small object detection performance of the proposed model. Therefore, both real-time testing experiments and an ablation study were carried out on this dataset.
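The size intervals above translate directly into a lookup on the box area. The sketch below uses only the thresholds stated in the text; the function names `size_interval` and `is_tiny` are hypothetical, and the box width and height are assumed to be given in pixels.

```python
from collections import Counter


def size_interval(width, height):
    """Return the TinyPerson size interval for a box of the given pixel size."""
    area = width * height
    if area <= 8 * 8:
        return "tiny1"
    if area <= 12 * 12:
        return "tiny2"
    if area <= 20 * 20:
        return "tiny3"
    if area <= 32 * 32:
        return "small"
    return "non-small"


def is_tiny(width, height):
    """The 'tiny' interval is the union of tiny1, tiny2, and tiny3."""
    return width * height <= 20 * 20


boxes = [(6, 9), (10, 12), (18, 20), (25, 30), (40, 60)]
print(Counter(size_interval(w, h) for w, h in boxes))
# Counter({'tiny1': 1, 'tiny2': 1, 'tiny3': 1, 'small': 1, 'non-small': 1})
```

Applying such a function to every annotation in a dataset is how proportions like those reported above would be tallied.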
According to Tiny Benchmark [14], average precision (AP) and miss rate (MR) are used to evaluate the performance of object detection. AP, a widely used evaluation indicator in object detection, reflects the precision and recall of the detection results; a higher AP indicates a better detector. MR is usually used on pedestrian datasets and reflects the rate of missed objects; a lower MR indicates a better detector. The comparative experiments were carried out over five intervals of small and tiny objects: tiny1, tiny2, tiny3, tiny, and small. A detailed analysis is provided. The threshold of intersection over union (IoU) was set to 0.25, 0.5, and 0.75, and both MR tiny and AP tiny at IoU = 0.5 were used as the main indicators to evaluate the small object detection performance on the TinyPerson dataset [14]. IoU = 0.5 means that a detection is counted as correct when the IoU between the predicted bounding box and the ground-truth is greater than or equal to 0.5 [34]. The objects in the TinyPerson dataset are quite small, so the detector performance drops considerably when the IoU threshold exceeds 0.5. Therefore, only three IoU values, 0.25, 0.5, and 0.75, were selected to evaluate the detection performance of small objects on the TinyPerson dataset, instead of an IoU interval.
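A short numerical illustration may help explain why strict IoU thresholds are so demanding for tiny objects. The sketch below applies the same 2-pixel localization error to an 8 × 8 box and to a 64 × 64 box and checks the three thresholds used above. The box sizes, the shift, and all function names are illustrative assumptions, not values from the experiments.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def correct_at(pred, gt, thresholds=(0.25, 0.5, 0.75)):
    """For each threshold, report whether the detection counts as correct."""
    value = iou_xyxy(pred, gt)
    return {t: value >= t for t in thresholds}


def shifted_square(side, shift):
    """A square box of the given side, displaced by `shift` pixels in x and y."""
    return (shift, shift, side + shift, side + shift)


# Same 2-pixel error, very different outcome depending on object size.
tiny_hits = correct_at(shifted_square(8, 2), (0, 0, 8, 8))
large_hits = correct_at(shifted_square(64, 2), (0, 0, 64, 64))
```

With these inputs the tiny box passes only the 0.25 threshold, while the large box passes all three.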
Tsinghua-Tencent 100K. Tsinghua-Tencent 100K [15] is a large-scale traffic sign benchmark dataset, which contains 100,000 high-resolution (2048 × 2048) images and 30,000 traffic sign instances. According to the area occupied by each object, Tsinghua-Tencent 100K divides objects into small objects (area ≤ 32 × 32 pixels), medium objects (32 × 32 < area ≤ 96 × 96 pixels), and large objects (area > 96 × 96 pixels). This original division is applied in the following experiments on the Tsinghua-Tencent 100K dataset [15]. The proportions of small, medium, and large objects are 42%, 50%, and 8%, respectively. Since small and medium objects are dominant in this dataset, it is also a good benchmark for evaluating the performance of small object detection.
According to the protocols in Tsinghua-Tencent 100K [15], classes with fewer than 100 instances are ignored. A total of 45 classes were finally selected for evaluation, and accuracy and recall were used as evaluation indicators. The F1 score was also used as an evaluation indicator. In the Tsinghua-Tencent 100K experiments, a detection was considered successful when the IoU between the predicted bounding box and the ground-truth was greater than or equal to 0.5 [34].
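The relationship between these indicators can be sketched as follows. The example assumes that "accuracy" is computed as precision, i.e., the fraction of detections that are correct, and that F1 is the usual harmonic mean of precision and recall; the counts and the function name are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts.

    tp: correct detections (IoU >= 0.5 with a ground-truth box)
    fp: detections without a matching ground-truth box
    fn: ground-truth boxes that were not detected
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```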
UAVDT. UAVDT [16] is a large-scale, challenging benchmark dataset. It contains about 80,000 annotated frames and supports three basic computer vision tasks (object detection, single-object tracking, and multiple-object tracking). For object detection, the UAVDT dataset has three categories of objects (car, truck, bus) and contains 23,829 training images and 16,580 testing images with a resolution of 1024 × 540. UAVDT also adopts the object size division used in MS COCO [10], which is the same as that of Tsinghua-Tencent 100K. According to Figure 1, the proportion of objects in the small interval (area ≤ 32 × 32) of the UAVDT dataset is 61.5%. Therefore, this dataset is also a good benchmark for evaluating the performance of small object detection.
The indicators used in MS COCO [10], including AP

Implementation Details
The aspect ratio of people in most of the TinyPerson images varies considerably. Therefore, following the approach used in [14], the original images were segmented into overlapping sub-images during training and inference. In the comparative experiments, the original images in TinyPerson were resized to 640 × 640 for training and testing. Kaiming normal [53] was used to initialize the network. The experiments on all four datasets used the default parameters in [9] for training. The network was trained for 300 epochs with an initial learning rate of 0.01, and the warm-up strategy [41] was used to adjust the learning rate. Stochastic gradient descent (SGD) with a weight decay of 0.0005 and a momentum of 0.937 was used to train the entire network.
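The segmentation into overlapping sub-images can be sketched as a sliding window over the image. The window size of 640 and the overlap of 128 pixels below are assumptions chosen for illustration, as are the function names; the text only states that overlapping sub-images were used, following [14].

```python
def tile_origins(length, window, overlap):
    """Start offsets of overlapping windows that cover [0, length)."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window size")
    if window >= length:
        return [0]
    stride = window - overlap
    origins = list(range(0, length - window, stride))
    # The last window is placed flush with the image border.
    origins.append(length - window)
    return origins


def tile_boxes(width, height, window=640, overlap=128):
    """Return sub-image boxes (x1, y1, x2, y2) covering the whole image."""
    boxes = []
    for y in tile_origins(height, window, overlap):
        for x in tile_origins(width, window, overlap):
            boxes.append((x, y, min(x + window, width), min(y + window, height)))
    return boxes


# A hypothetical 1000 x 700 image split into 640 x 640 sub-images.
tiles = tile_boxes(1000, 700)
```

Detections from each sub-image would then be shifted back by the tile origin and merged, for example with non-maximum suppression, before evaluation.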
According to the settings used in [15,33,34], the image size was adjusted to 1600 × 1600 for training and testing in the Tsinghua-Tencent 100K experiments. In the UAVDT experiments, the original image resolution of 1024 × 540 was used for training and testing. Both the Tsinghua-Tencent 100K and UAVDT experiments used models pre-trained on the MS COCO dataset to initialize the network. In the MS COCO experiments, an image resolution of 640 × 640 was used for training and testing.
For both TinyPerson and UAVDT datasets, one GPU was used for training, and the batch size was 32. For the Tsinghua-Tencent 100K dataset, due to the large image resolution, four GPUs were used for training, and the batch size was 20. For the MS COCO dataset, due to the high amount of data, four GPUs were used for training, and the batch size was 64. For all four datasets, only one GPU was used in testing.

Experiment Preparation
TinyPerson. The proposed method was compared with state-of-the-art single-stage and two-stage object detection methods. Tables 2 and 3 show the detailed experimental results on the TinyPerson test dataset. Although some SOTA detectors (such as Libra RCNN [5] and Grid RCNN [7]) performed well on MS COCO [10] or PASCAL VOC [11], they did not achieve good results on small object datasets. A potential reason is that the targets in the TinyPerson dataset are too small, which causes the performance of these detectors to decrease considerably. The proposed method uses YOLOv5s [9] as the baseline. Although YOLOv5s achieved good results, the proposed method still improves the core indicators MR tiny [0.5] and AP tiny [0.5] by 2.68% and 5.94%, respectively. Among the methods specialized for small object detection, such as FS-SSD512 [49], FRCNN-FPN-SM [14], and FRCNN-FPN-SM S − δ [47], the best one is RetinaNet-SM S − δ [47]; the proposed method outperformed it, improving the core indicators MR tiny [0.5] and AP tiny [0.5] by 3.7% and 2.99%, respectively. On the small interval, the indicators MR small [0.5] and AP small [0.5] of the proposed method fell short of the corresponding results of Scaled-YOLOv4-CSP [8] and FRCNN-FPN-SM [14] by 0.36% and 0.74%, respectively, but the proposed method still performed better than the other methods. Compared with the baseline, the indicators MR small [0.5] and AP small [0.5] of the proposed method were improved by 0.72% and 1.99%, respectively. The results confirm that the proposed method pays more attention to small objects and improves their recognition. Therefore, the proposed method significantly improves the small object detection performance and achieves a better performance than the state-of-the-art methods.
Tsinghua-Tencent 100K. According to the experimental results shown in Table 4, the proposed method significantly improves the small object detection performance of the baseline. Table 4 shows in detail the experimental results of the proposed SODNet and other state-of-the-art methods on the Tsinghua-Tencent 100K test dataset. Since objects in the large interval are greater than 96 × 96 pixels and this paper focuses on evaluating recognition performance on small objects, the large interval is not evaluated. The object size range of the overall interval in Table 4 is area ≤ 400 × 400, and the test results in this interval are used to comprehensively evaluate the detection performance. According to Table 4, the proposed method achieves a performance similar to that of the state-of-the-art method proposed by Noh et al. [15] with a higher real-time performance. The method proposed by Noh et al. [15] uses the two-stage detector FRCNN [2] as the benchmark model. As the source code is not available, we were unable to reproduce Noh's method. In addition, compared with the baseline, the proposed method improved the F1 scores on the small, medium, and overall classes by 1.3%, 1%, and 0.8%, respectively. The performance in all three classes improved to varying degrees, and the improvement on the small class was the largest.
Table 4. Performance comparison with the state-of-the-art models on the Tsinghua-Tencent 100K test dataset. Some entries in this table are missing because some models [15,33,45,50] only provide part of the related data. The best results are marked in bold.
UAVDT. According to the experimental results shown in Table 5, the proposed method achieved a state-of-the-art performance on the UAVDT [16] dataset.
In this dataset, the indicator AP [0.5,0.95] was used to evaluate the results for all the targets in the dataset, and the indicator AP small [0.5,0.95] was used to evaluate the results for the targets within the size interval area ≤ 32 × 32. Specifically, the results of the first four rows in Table 5 were calculated with the indicators used in MS COCO [10] from the experimental results provided by Du et al. [16]. According to Table 5, compared with the best performance achieved by ClusDet [52], the proposed method improves the main evaluation indicator AP , the proposed method in this paper shows improvements of 2.8% and 2.1%, respectively, compared to ClusDet and baseline. This verifies the improvement of the proposed method for small objects. In addition, the indicator AP [0.75] means that a detection is counted as correct when the IoU between the predicted bounding box and the ground truth is greater than or equal to 0.75, so it places strict requirements on positioning accuracy. The values 0.5 and 0.75 refer to an overlap ratio between the predicted box and the actual box of 50% and 75%, respectively; a higher value indicates a higher overlap ratio. For the indicator AP [0.75], the proposed method shows improvements of 5.5% and 5.6% over ClusDet and the baseline, respectively. This also verifies that the proposed method improves the accuracy of small object positioning. The main reason for this improvement is that the proposed ASPConv and FMF modules optimize the spatial information of small objects. This confirms that the proposed method can effectively improve the spatial information of features, thereby improving the positioning accuracy of objects.
AP [0.5] and AP [0.75] represent that the detection is correct when the IoU between bounding boxes and ground-truth is greater than or equal to 0.5 and 0.75, respectively.
According to Table 6, the network of Noh et al. [34] focused on small objects and achieved good results on the Tsinghua-Tencent 100K dataset but did not achieve good results on the MS COCO dataset. On the small object interval, the result obtained by SODNet is 3.9% higher than that of Noh et al. [34]. In addition, according to the experimental results of FRCNN-FPN [27] in Table 6 and the FPS testing results in Table 7, the proposed method achieved a detection accuracy similar to that of FRCNN-FPN, while its FPS is about three times higher than that of FRCNN-FPN. Therefore, the experimental results on the MS COCO dataset also confirm that the proposed method can effectively improve the accuracy of small object detection while maintaining real-time performance.

Real-Time Comparison
Not all the comparative methods shown in Tables 2 and 3 have public source code. Therefore, the efficiency comparison covers only the methods with public source code and the proposed SODNet. Table 7 shows the FPS of each model. The proposed SODNet has the second-highest FPS, which is considerably higher than that of the other six comparative models. According to the real-time criterion mentioned in [2], the FPS needs to be greater than or equal to 30. Therefore, the proposed SODNet achieved a high real-time performance.
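As a rough illustration of how an FPS figure and the 30 FPS criterion can be checked, the sketch below times an arbitrary inference callable over a list of inputs. The function names, the warm-up count, and the dummy workload are assumptions; the measurement protocol actually used for Table 7 is not described in the text.

```python
import time


def measure_fps(infer, inputs, warmup=2):
    """Return frames per second for `infer` applied to each item in `inputs`."""
    # Warm-up runs are excluded from timing (caches, lazy initialization).
    for item in inputs[:warmup]:
        infer(item)
    start = time.perf_counter()
    for item in inputs:
        infer(item)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return len(inputs) / max(elapsed, 1e-9)


def is_real_time(fps, threshold=30.0):
    """Real-time criterion used in the text: FPS >= 30."""
    return fps >= threshold


# Dummy workload standing in for a detector's forward pass.
fps = measure_fps(lambda frame: sum(frame), [list(range(1000))] * 50)
```

On a GPU, the device would also need to be synchronized before reading the timer; that step is omitted here because it depends on the framework in use.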
According to Tables 2, 3 and 7, the proposed method only adds a low computational cost to the baseline, but significantly enhances the original baseline performance. According to Table 7, the proposed method is about four times faster than the compared two-stage detectors, such as [14,27], and about three times faster than the compared single-stage detectors, such as [4,8,49], under the same input size.

Ablation Study
Since the proportion of small objects in the (area ≤ 32 × 32) interval of the TinyPerson and UAVDT datasets reached 85% and 61.5%, respectively, the ablation experiments of the ASPConv and FMF modules on the TinyPerson and UAVDT test datasets are discussed in this section. As shown in Table 8, with both modules, the core indicators MR tiny [0.5] and AP tiny [0.5] were improved by 2.68% and 5.94%, respectively. Therefore, the combination of the two modules yields a significant performance improvement. As shown in Figure 2, the FMF module fuses the feature map C1 from the ASPConv module with multi-scale feature maps that carry rich semantic and spatial information, so the model performance is considerably improved. According to Table 8 35% and 4.34%, respectively, by using the FMF module only. Since the FMF module effectively integrates spatial information that is conducive to the detection of small objects, this improvement is reasonable. As shown in Table 8, the FPS is reduced by 10 compared with the baseline after adding the ASPConv module. The number of convolutions in ASPConv is significantly higher than that of the focus module [9], so the computation time increases. After adding the FMF module, the FPS is increased by 4 compared with the baseline. Since SODNet uses the FMF module to replace the original PANet [28] and removes the bottom-up path enhancement, the FPS is slightly improved. When both the ASPConv and FMF modules are added, SODNet reduces the FPS by only 7 compared with the baseline, while the detection accuracy of small objects is considerably improved. In addition, the ASPConv module improves the baseline by 1 ), respectively. The corresponding FPS is only reduced by 4. In the ablation experiments, although the ASPConv module sacrifices some real-time performance, it enriches the spatial information in the feature maps $C_i$, $i = 2, 3, 4$, and $P_j$, $j = 1, 2, 3, 4$.
The ASPConv module can effectively improve the detection accuracy of small objects when it works with the FMF module. According to the ablation study, when the two proposed modules work together, they can achieve a better improvement in small object detection than any single module.

Qualitative Results
Figure 6 shows some selected testing results on the TinyPerson, Tsinghua-Tencent 100K, UAVDT, and MS COCO test sets. For each pair of images, the detection results of the baseline and the proposed SODNet are shown on the left-hand and right-hand sides, respectively. For TinyPerson and UAVDT, the magnified sub-image in the green frame shows the bounding boxes predicted by SODNet. For Tsinghua-Tencent 100K, the magnified sub-image in the red frame shows the ground-truth, and the magnified sub-image in the blue frame shows the object boxes predicted by SODNet. For MS COCO, the comparative experiments were performed on the MS COCO test-dev dataset. Since there is no ground-truth on the MS COCO test-dev dataset, all the rectangular boxes in Figure 6j-l are bounding boxes predicted by SODNet. Compared with the baseline, the proposed method achieves a better detection performance on small and dense objects. In Tsinghua-Tencent 100K, the proposed method also detected some objects that exist but are unmarked, which can be regarded as reasonable examples of false positives.
Figure 6. Detection results on the TinyPerson (shown in subfigures (a-c)), Tsinghua-Tencent 100K (shown in subfigures (d-f)), UAVDT (shown in subfigures (g-i)) and MS COCO (shown in subfigures (j-l)) test datasets.

Conclusions
The proposed method is applied to the single-stage detector YOLOv5s to address the issues of small object detection. First, an adaptive spatial parallel convolution module (ASPConv) is proposed to extract the multi-scale local context information of small objects and enhance their spatial information. Second, a fast multi-scale fusion module (FMF) is designed, which effectively integrates the high-resolution feature maps with rich spatial information output from the ASPConv module. The low-resolution feature maps with rich semantic information are efficiently mapped to high-resolution space, and multi-scale feature map fusion is performed to generate high-resolution feature maps with rich spatial and semantic information that are conducive to small object detection. In addition, according to the ablation study results shown in Table 8, the two modules can be effectively integrated to achieve fast and accurate detection. The experimental results on the TinyPerson, Tsinghua-Tencent 100K, UAVDT, and MS COCO benchmark datasets confirm that the proposed method efficiently and significantly improves the detection performance of small objects, and the corresponding results are highly competitive. In the TinyPerson experiments, compared with the baseline, the proposed method improves AP tiny [0.5] by 5.94% and achieves 91 FPS on a single Nvidia Tesla P100. Therefore, the proposed SODNet can effectively enhance the detection performance of small objects while running in real time. Consequently, the proposed method can be transferred to many small object detection scenes, such as UAV search-and-rescue and intelligent driving. In future research, the optimization and applications of the proposed method will be further explored in more fields.