Article

DTA-Head: Dynamic Task Alignment Head for Regression and Classification in Small Object Detection

Kaiqi Ye, Qi Li, Yunfeng Yan, Xianbo Wang and Donglian Qi *
1 The Ocean College, Zhejiang University, Zhoushan 316021, China
2 The College of Electrical Engineering, Zhejiang University, Hangzhou 310058, China
3 The Hainan Institute, Zhejiang University, Sanya 572025, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9060; https://doi.org/10.3390/app15169060
Submission received: 26 June 2025 / Revised: 9 August 2025 / Accepted: 14 August 2025 / Published: 17 August 2025

Abstract

Detecting small targets poses significant challenges due to their limited feature information and the aggregation of features in deep feature maps. Existing single-stage detectors handle classification and regression separately, leading to inconsistent predictions that may be filtered out by Non-Maximum Suppression (NMS). To address these issues, we propose the Dynamic Task Alignment (DTA) Head, a novel approach comprising a regression branch and a classification branch. The regression branch computes offsets and masks for feature alignment via Deformable ConvNets v2 (DCNv2), while the classification branch enhances feature interaction through dynamic selection; a task-decomposition module separates the shared features for each branch. Additionally, we introduce Diverse-Scale Channel-Specific Convolution (DSCSC), which applies convolutions of diverse kernel sizes to specific channel groups and exchanges information across channels. Our method achieves an AP@.5 of 30.9% on the TinyPerson dataset, a 3.3% improvement over the original model’s 27.6%, outperforming other common models.

1. Introduction

In the contemporary field of object detection, detection models have achieved remarkable success in identifying large targets, whereas progress in detecting small targets has been comparatively limited. For instance, CenterNet [1] reports an Average Precision (AP) of only 26.6% for small targets on the COCO test dataset, which is significantly lower than the 47.1% for medium targets and 57.7% for large targets. The limited pixel representation of small targets makes them difficult to recognize after multiple downsampling layers. When small targets are far outnumbered by large targets, detection models tend to prioritize learning the characteristics of larger targets, and smaller targets may be neglected. This imbalance between positive and negative samples stems in part from the inherent difficulty of detecting small targets; in addition, small targets typically carry limited feature information, which makes it hard for models to extract essential features from such sparse data. Consequently, models are more prone to errors when detecting small targets.
As illustrated in Figure 1, our proposed DTA-Head enhances the feature interaction between the classification and regression branches (the classifier predicts the presence of an object, and the regressor estimates its bounding-box coordinates), thereby facilitating dynamic alignment and improving the consistency of the predicted bounding boxes. The baseline models (DynamicHead-M, Gold-YOLO-M, HGNetV2-M, and YOLOv8-M) tend to omit adjacent small targets, leading to a significant number of missed detections. Since small targets occupy only a few pixels, even a minor spatial error in the regressor can cause a significant drop in the Intersection over Union (IoU), while the classifier may still output a high-confidence positive score, leading to a mismatch between the two branches. Figure 2 quantitatively illustrates how fragile the IoU of small targets is: a two-pixel shift causes the IoU of the small target to drop from 100% to 20%, while that of the large target only drops to 63.3%. This observation confirms that small targets are highly susceptible to positional noise, which further motivates our DCNv2-based regression branch: it learns offsets and masks to refine the sampling grids, dynamically minimizing such IoU degradation.
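As a quick sanity check of this sensitivity, the short script below computes the IoU between an axis-aligned box and a copy of itself shifted horizontally by two pixels. It is a minimal sketch: the exact box sizes used in Figure 2 are not stated, so the widths here are illustrative (a 3-pixel-wide box happens to reproduce the 20% value for a purely horizontal shift).

```python
def iou_after_shift(w: float, h: float, dx: float = 2.0, dy: float = 0.0) -> float:
    """IoU between a w x h box and the same box translated by (dx, dy) pixels."""
    inter_w = max(0.0, w - abs(dx))
    inter_h = max(0.0, h - abs(dy))
    inter = inter_w * inter_h
    union = 2.0 * w * h - inter          # area1 + area2 - intersection
    return inter / union

# Illustrative sizes (pixels): tiny person-like boxes vs. a larger object.
for w, h in [(3, 8), (6, 16), (12, 32), (48, 96)]:
    print(f"{w:>2} x {h:<2} box, 2-px shift -> IoU = {iou_after_shift(w, h):.3f}")
```

The relative IoU loss shrinks rapidly as the box grows, which is exactly the asymmetry that penalizes small targets under NMS and IoU-based assignment.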
In summary, we investigate the challenges associated with deep feature aggregation and the insufficient exchange of inter-channel information during the convolution process in the context of small object detection. Two enhancement strategies are proposed to address these issues:
  • Dynamic Task Alignment Head: This innovative detection head enables the dynamic alignment of classification and regression branches by adjusting the predicted bounding boxes that display substantial discrepancies between these branches. This modification seeks to improve the consistency of the model’s predictions across various tasks, which is essential for enhancing the reliability of small object detection.
  • Diverse-Scale Channel-Specific Convolution: This convolution methodology not only decreases the overall parameter count of the model but also promotes a more efficient exchange of information among feature channels. The expected result of this improved feature integration is the production of more comprehensive representations, which are advantageous for the detection of small targets.
As shown in Figure 3, the DTA-Head and the DSCSC approach achieved the best detection performance on the TinyPerson [2] dataset. The DTA-Head effectively mitigates misalignment issues in classification and regression branches, particularly in contexts where there are substantial errors in the detection box for small targets. This improvement can be attributed to the dynamic convolution and feature selection mechanisms incorporated within the detection head, which enable the model to selectively extract essential feature information and address the challenges associated with accurately identifying small targets. Moreover, the DSCSC method is highly effective in extracting multi-scale feature information across channels, thereby significantly enhancing detection performance for small targets. Additionally, the overall parameter count of the model is reduced in comparison to the original model, resulting in a more streamlined and efficient architecture.

2. Related Work

2.1. Multi-Scale Learning

Compared to standard targets, small targets present significant challenges due to the limited number of pixels available for feature extraction. As the depth of the network increases, both the feature and positional information about small targets tend to be progressively diminished. Consequently, the integration of shallow-level feature information with deep-level semantic information is crucial for achieving accurate detection of small targets.
Classic networks in object detection, such as Fast R-CNN [3], Faster R-CNN [4], SPPNet [5], and R-FCN [6], predominantly depend on the final layers of deep neural networks for prediction tasks. However, the detection of small targets within deep feature maps presents a significant challenge due to the degradation of spatial and detailed feature information. In deep neural networks, shallow layers exhibit a smaller receptive field, reduced semantic information, and a lack of contextual information; nevertheless, they are proficient in capturing detailed spatial features. Building upon this concept, Liu et al. [7] introduced a multi-scale object detection algorithm known as the Single-Shot Multibox Detector (SSD), which leverages shallower feature maps for the detection of small targets and deeper feature maps for larger targets. Cai [8] and colleagues have proposed a unified multi-scale deep convolutional neural network to tackle the issue of sparse information associated with small targets, a challenge that traditional networks often encounter. This network employs deconvolutional layers to enhance the resolution of feature maps, thereby significantly improving the detection capabilities for small targets while simultaneously reducing memory and computational costs. Bell et al. [9] introduced the Inside–Outside Network (ION), a methodology designed to extract features from specified regions of interest within feature maps of varying scales. The ION model synthesizes these multi-scale features to enhance detection efficacy. To conserve computational resources and enhance feature integration, Lin [10] and colleagues combined singular feature mapping, pyramidal feature hierarchies, and comprehensive features to establish the Feature Pyramid Network (FPN). The FPN, a widely adopted multi-scale architecture, incorporates a bottom-up and top-down network design that enhances features through the fusion of adjacent layers.
Based on the FPN framework, Liang et al. [11] developed a deep feature pyramid network that enhances the semantic representation of small targets through a feature pyramid structure with lateral connections. This network is further augmented by specially designed anchor boxes and loss functions for training, which aim to accelerate the detection speed of small targets.

2.2. Anchor-Free Mechanism

Current anchor-box configurations often struggle to establish an optimal balance between the recall rate for small targets and the computational resources necessary for their detection. This imbalance results in a marked disparity in the availability of positive samples for small versus large targets, causing models to prioritize the detection of larger targets while neglecting smaller ones.
Fully Convolutional One-Stage (FCOS) [12] introduces a pixel-level predictive methodology that obviates the necessity for intricate calculations associated with anchor boxes, such as overlap computations during the training phase, thereby conserving memory resources. Adaptive Training Sample Selection (ATSS) [13] facilitates the dynamic selection of positive and negative samples based on the statistical characteristics of the targets. FoveaBox [14] concurrently predicts the probability of each valid location being the center of a target and the dimensions of the corresponding target, thereby providing both category confidence and size information for bounding-box transformation. Feature-Selective Anchor-Free (FSAF) [15] enables each instance to select the most appropriate feature layer for network enhancement, moving away from anchor constraints in favor of an anchor-free encoding strategy. Soft Anchor Point Detection (SAPD) [16] introduces soft-weighted anchor points for sample weighting across various positions, along with soft-selected pyramid levels that facilitate the distribution of samples across multiple resolutions through weighted aggregation and refines the loss function to eliminate the manual designation of anchor boxes, enabling the network to autonomously select the most suitable anchor for real object matching. CenterNet [1] advances a robust bottom-up object detection technique by identifying each target as a set of key points, thereby capturing and discerning global information within targets of diverse geometric configurations. ObjectBox [17] designates only the central position of the target as a positive sample, treating all targets uniformly across various feature levels, regardless of their size or shape. The authors redefine the regression target as the distances from the two corners of the central unit to the four edges of the bounding box and introduce a customized IoU loss to manage frames of varying magnitudes effectively.

2.3. Attention Mechanism

In the field of neural networks, the attention mechanism introduces an additional component that selectively emphasizes specific aspects of the input data or assigns varying importance weights to different elements. This functionality enables the network to filter and prioritize critical information from large datasets efficiently.
Squeeze and Excitation (SE) [18] is designed to enhance model performance by incorporating channel-specific attention weighting. This attention mechanism entails evaluating the significance of each channel through learned weights, which are subsequently applied to the feature maps of the channels. The primary aim of this process is to amplify the contribution of relevant features while diminishing the influence of those that are deemed irrelevant. The Convolutional Block Attention Module (CBAM) [19] utilizes a simple yet effective attention mechanism based on a feedforward convolutional neural network, which functions across two distinct dimensions of the feature map: the channel and spatial axes. This mechanism generates an attention map that is then multiplicatively combined with the input feature map, facilitating adaptive feature refinement. In contrast, Efficient Channel Attention (ECA) [20] replaces the fully connected layer utilized in squeeze-and-excitation networks with a 1 × 1 convolutional kernel for processing. This modification leads to a reduction in the number of parameters within the model, thereby improving its overall efficiency. Proponents of ECA argue that convolutional operations are particularly effective in capturing cross-channel information, which diminishes the necessity of incorporating all channel information. Consequently, the fully connected layer is replaced by a 1 × 1 convolutional layer. Furthermore, Coordinate Attention (CA) [21] disaggregates channel attention into two distinct one-dimensional feature-encoding processes that aggregate features across multiple directions. This approach facilitates the capture of long-range dependencies along one spatial dimension while preserving accurate positional information along another. As a result, the generated feature maps are encoded independently, yielding a pair of directionally sensitive and position-aware feature maps. These maps can be synergistically integrated with the input feature map to improve the representation of the target. Cai [22] introduced a novel feature extraction backbone network, termed the Poly Kernel Inception Network (PKINet), specifically designed for remote sensing target detection. In contrast to existing methodologies that depend on large kernels or dilated convolutions to increase the receptive field, PKINet employs multiple deep convolutional kernels of varying sizes arranged in parallel, without gaps, to effectively capture dense texture features across diverse receptive fields. These texture features are systematically combined along the channel dimension to aggregate local contextual information. Additionally, to integrate long-range contextual details, the author proposed the Context Anchor Attention (CAA) mechanism, which leverages global average pooling and one-dimensional strip convolution to assess the correlation between distant pixels, thereby enhancing the features within the central region.

3. Methodology

3.1. Dynamic Task Alignment Head

The YOLO series of models is widely recognized for its efficient architecture and rapid detection capabilities, and our methodology is built on the YOLOv8 detection framework. As shown in Figure 4, YOLOv8 employs a backbone–FPN–head architectural paradigm. To enhance efficiency and minimize the parameter count, the model integrates two shared 3 × 3 convolutions with group normalization across the feature layers ($P_3$, $P_4$, and $P_5$). This implementation reduces the number of parameters and has also been shown to improve the classification and regression capability of the detection head within the FCOS framework [12], thereby yielding superior performance in object detection tasks. Gold-YOLO [23] uses a gather-and-distribute neck to fuse multi-scale features into one shared representation for all heads. In contrast, our work keeps the neck intact, splits the head into task-specific branches, and employs DCNv2 offsets to realign classification and regression predictions explicitly.
$$F_{union} = conv_{GN1}\left(conv_{GN2}(F)\right) + F \tag{1}$$
The feature map ($F$) is extracted from the backbone through multi-layer feature fusion of layers $P_3$, $P_4$, and $P_5$; the feature maps from different layers carry information from distinct receptive fields. The two convolutional layers, $conv_{GN1}$ and $conv_{GN2}$, are shared across layers and use group normalization with a 3 × 3 convolutional kernel. $F_{union}$ refers to the union feature produced through the residual connection. These newly integrated features are subsequently forwarded to the DTA-Head for detection.
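A minimal PyTorch sketch of this shared-convolution step is given below. It is an illustration rather than the authors' implementation: the module name, the channel width of 256, and the GroupNorm group count are assumptions, and the pyramid levels are taken to share the same channel width so that the weights can be reused.

```python
import torch
import torch.nn as nn

class SharedConvGN(nn.Module):
    """Two shared 3x3 conv + GroupNorm layers with a residual connection (Eq. (1)),
    applied with the same weights to every pyramid level."""

    def __init__(self, ch: int, num_groups: int = 16):
        super().__init__()
        self.conv_gn1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                      nn.GroupNorm(num_groups, ch))
        self.conv_gn2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                      nn.GroupNorm(num_groups, ch))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # F_union = conv_GN1(conv_GN2(F)) + F
        return self.conv_gn1(self.conv_gn2(f)) + f

shared = SharedConvGN(ch=256)
p3, p4, p5 = (torch.randn(1, 256, s, s) for s in (80, 40, 20))
f_union = [shared(p) for p in (p3, p4, p5)]   # the same weights are reused per level
```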
Dynamic Head [24] also unifies multiple heads by attention, but it operates on the whole feature tensor with three orthogonal attentions (scale, spatial, and task) applied sequentially on the same feature map. In contrast, our DTA-Head separates classification and regression at the very beginning using a task-decomposition module, then dynamically re-aligns the two task-specific predictions with a learnable offset-and-mask mechanism (DCNv2). As illustrated in Figure 5, union features are obtained through a task-decomposition module, resulting in a regression prediction map designated as A and a classification prediction map designated as B. Additionally, two supplementary feature maps, labeled as C and D, are generated along the branch lines to enhance and align the predictions of A and B. Moreover, to address variations in target scales identified by each detection head, a scale layer is utilized to adjust the feature dimensions dynamically.
In the regression branch, a generator produces the mask and offset required by DCNv2 [25]. This generator is a convolutional layer with a 3 × 3 kernel. Because the subsequent DCNv2 layer also employs a 3 × 3 kernel, the mask has 9 channels, and since the offset operates along both the X and Y axes for each of the 9 sampling points, it has 18 channels; consequently, the generator outputs 27 channels in total. These offsets and masks enable adaptive modification of the regression boxes at each position, selecting the most accurate predicted boxes in the surrounding area. This capability allows the model to dynamically identify the features that need to be extracted from the feature tensor while suppressing irrelevant ones. The specific formula is expressed as follows:
$$C_{M/O} = G(F_{union}) \tag{2}$$
$C_M$ represents the mask used in DCNv2, which modulates the convolutional operation to accommodate variations in the feature map, while $C_O$ denotes the offset, a set of displacement values that deform the sampling grid of the convolutional kernel. $G$ stands for the generator. The authors of Deformable DETR [26] employed deformable convolution to let the model autonomously identify and prioritize essential features, effectively addressing the slow convergence and suboptimal small-target performance of the original DETR model. We integrate this mechanism with the regression prediction map (A) produced by the task-decomposition module, yielding an enhanced regression prediction map. In the classification branch, a sequence of convolutional operations is performed, starting with a 1 × 1 kernel followed by a 3 × 3 kernel; a ReLU activation is applied after the first convolution and a sigmoid after the second. The classification prediction boxes are then dynamically adjusted based on features extracted from the joint feature maps. This helps the model capture the spatial alignment of the prediction boxes for both tasks, thereby improving detection accuracy. The formula is expressed as follows:
$$D = \sigma\left(conv_2\left(\delta\left(conv_1(F_{union})\right)\right)\right) \tag{3}$$
$D$ denotes the new feature map obtained from the classification branch, $conv_1$ refers to a convolutional layer with a 1 × 1 kernel, $\delta$ represents the ReLU activation function, $conv_2$ indicates a convolutional layer with a 3 × 3 kernel, and $\sigma$ signifies the sigmoid activation function.
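The sketch below shows how the two branches described above could be wired together in PyTorch. It is a simplified approximation under stated assumptions, not the authors' implementation: torchvision's deform_conv2d stands in for DCNv2, the classification map is re-weighted by element-wise multiplication with $D$, and the class name and channel width are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DTABranches(nn.Module):
    """Simplified regression/classification alignment branches of the DTA-Head."""

    def __init__(self, ch: int):
        super().__init__()
        # Generator G: one 3x3 conv producing 18 offset + 9 mask channels (Eq. (2)).
        self.generator = nn.Conv2d(ch, 27, kernel_size=3, padding=1)
        # Weight of the deformable conv applied to the regression map A.
        self.reg_weight = nn.Parameter(torch.randn(ch, ch, 3, 3) * 0.01)
        # Classification alignment map D (Eq. (3)): 1x1 conv -> ReLU -> 3x3 conv -> sigmoid.
        self.cls_align = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, f_union, reg_map, cls_map):
        off_mask = self.generator(f_union)
        offset, mask = off_mask[:, :18], off_mask[:, 18:].sigmoid()
        # Regression branch: deformable sampling re-aligns the regression map A.
        reg_aligned = deform_conv2d(reg_map, offset, self.reg_weight,
                                    padding=1, mask=mask)
        # Classification branch: re-weight the classification map B with D (assumed multiplicative).
        cls_aligned = cls_map * self.cls_align(f_union)
        return reg_aligned, cls_aligned

head = DTABranches(ch=256)
f = torch.randn(1, 256, 40, 40)
reg, cls = head(f, f, f)   # placeholder maps; both outputs have shape (1, 256, 40, 40)
```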
As illustrated in Figure 6, the task-decomposition module begins by applying adaptive average pooling to the joint features ($F_{union}$) to obtain the pooled features ($F_{avg}$). This is followed by two convolution–activation operations, each using a convolutional layer with a 1 × 1 kernel: the ReLU activation function is applied after the first convolution, whereas the sigmoid function is employed after the second to compute the layer attention weights ($L$).
$$L = \sigma\left(conv_2\left(\delta\left(conv_1\left(P_{avg}(F_{union})\right)\right)\right)\right) \tag{4}$$
The convolutional layers ($conv_1$ and $conv_2$) both employ a 1 × 1 kernel, $\delta$ denotes the ReLU activation function, and $\sigma$ represents the sigmoid activation function. $P_{avg}$ indicates the adaptive average pooling layer, and $L$ denotes the layer attention weights. These attention weights are then used to perform a weighted convolution: the layer attention weights are reshaped so that they can be combined with the convolutional kernel weights.
$$W = L \times W_{conv} \tag{5}$$
$L$ denotes the layer attention weights determined by Equation (4), $W_{conv}$ represents the convolutional weights, and $W$ signifies the new convolutional weights obtained by weighting $W_{conv}$.
The joint features, $F_{union}$, are reshaped to match the dimensions of the weighted convolutional kernel; a matrix multiplication between the weighted kernel and the reshaped joint features then produces a new feature map. This feature map subsequently undergoes group normalization and an activation function, yielding the processed predictive feature map.
$$F_p = S\left(GN(W \times F_{union})\right) \tag{6}$$
$F_{union}$ denotes the joint feature, and $W$ indicates the computed convolutional weight obtained from Equation (5). $GN$ refers to group normalization, $S$ represents the SiLU activation function, and $F_p$ signifies the resultant predictive feature map.
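A compact sketch of the task-decomposition module (Equations (4)–(6)) is shown below. It is one plausible interpretation under stated assumptions: the joint feature is taken to be a stack of num_layers channel blocks, the layer attention $L$ re-weights the corresponding blocks of a 1 × 1 kernel, and the GroupNorm group count and channel widths are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskDecomposition(nn.Module):
    """Layer-attention-weighted 1x1 convolution producing a task-specific map F_p."""

    def __init__(self, ch: int, num_layers: int = 2, num_groups: int = 16):
        super().__init__()
        self.ch, self.num_layers = ch, num_layers
        # Eq. (4): adaptive avg-pool -> 1x1 conv -> ReLU -> 1x1 conv -> sigmoid.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(num_layers * ch, ch, 1), nn.ReLU(),
            nn.Conv2d(ch, num_layers, 1), nn.Sigmoid())
        # W_conv: a 1x1 kernel over the stacked (num_layers * ch) input channels.
        self.w_conv = nn.Parameter(torch.randn(ch, num_layers * ch, 1, 1) * 0.01)
        self.gn = nn.GroupNorm(num_groups, ch)

    def forward(self, f_union: torch.Tensor) -> torch.Tensor:
        b = f_union.shape[0]
        layer_attn = self.attn(f_union)                       # L: (B, num_layers, 1, 1)
        # Eq. (5): W = L x W_conv, broadcast over the per-layer channel blocks.
        w = self.w_conv.view(1, self.ch, self.num_layers, self.ch, 1, 1) \
            * layer_attn.view(b, 1, self.num_layers, 1, 1, 1)
        w = w.view(b * self.ch, self.num_layers * self.ch, 1, 1)
        # Eq. (6): a grouped conv realizes the per-sample weighted 1x1 convolution.
        out = F.conv2d(f_union.view(1, -1, *f_union.shape[2:]), w, groups=b)
        out = out.view(b, self.ch, *f_union.shape[2:])
        return F.silu(self.gn(out))                           # F_p

td = TaskDecomposition(ch=128, num_layers=2)
f_union = torch.randn(2, 2 * 128, 40, 40)   # stacked features from the shared convs
f_p = td(f_union)                            # (2, 128, 40, 40)
```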

3.2. Diverse-Scale Channel-Specific Convolution

The traditional methodology employed in deep convolutional operations typically extracts features on a per-channel basis. This approach may be insufficient to facilitate the inter-channel interactions required for a comprehensive feature representation. Furthermore, a convolutional kernel of a single size is weaker at extracting features from objects of varying scales.
MobileNet [27] optimizes the model and decreases computational demands by employing depthwise separable convolutions, which partition the conventional process into depthwise and pointwise convolutions. GhostNet [28] generates “Ghost” feature maps that efficiently capture critical information through a series of linear transformations applied to a set of original feature maps; it suggests that not all feature maps require convolutional derivation and that some can be generated by cheaper operations. We introduce a strategy that segments the channels during the convolutional process and applies different convolutional kernel sizes to different groups of channels, thereby facilitating the extraction of feature information. As illustrated in Figure 7, the channels are divided into four groups, and convolutions with kernel sizes of 1 × 1, 3 × 3, 5 × 5, and 7 × 7 are applied to the respective groups. Subsequently, the features from all groups are combined through a 1 × 1 convolution. This methodology effectively leverages the benefits of DSCSC, enhancing information exchange among channels and improving the model’s capacity to detect features across multiple scales.
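The following PyTorch sketch shows one plausible realization of DSCSC. It is a sketch under stated assumptions: the class and argument names are illustrative, and the input channel count is assumed to be divisible by four (the paper does not specify how leftover channels would be handled).

```python
import torch
import torch.nn as nn

class DSCSC(nn.Module):
    """Diverse-Scale Channel-Specific Convolution: split channels into four groups,
    convolve each group with a different kernel size, then fuse with a 1x1 conv."""

    def __init__(self, ch: int, kernels=(1, 3, 5, 7)):
        super().__init__()
        assert ch % len(kernels) == 0, "channel count must split evenly"
        g = ch // len(kernels)
        self.branches = nn.ModuleList(
            [nn.Conv2d(g, g, kernel_size=k, padding=k // 2) for k in kernels])
        self.fuse = nn.Conv2d(ch, ch, kernel_size=1)   # inter-channel exchange

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.chunk(x, len(self.branches), dim=1)
        out = torch.cat([conv(c) for conv, c in zip(self.branches, chunks)], dim=1)
        return self.fuse(out)

m = DSCSC(ch=64)
y = m(torch.randn(1, 64, 32, 32))   # shape preserved: (1, 64, 32, 32)
```

Because each branch only sees a quarter of the channels, the larger 5 × 5 and 7 × 7 kernels remain cheap, which is the source of the parameter savings analyzed below.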
DSCSC not only improves the detection of small targets but also decreases the number of model parameters. Consider a standard 3 × 3 convolution with input and output channel counts both equal to $C$: it requires $C \times 3 \times 3 \times C$ kernel weights, giving a parameter count of
$$C \times 3 \times 3 \times C = 9C^2 \tag{7}$$
We instead divide the convolution into four channel groups that employ kernel sizes of 1 × 1, 3 × 3, 5 × 5, and 7 × 7, respectively, giving a combined parameter count of
$$\sum_{k \in \{1,3,5,7\}} \frac{C}{4} \times k \times k \times \frac{C}{4} = 5.25C^2 \tag{8}$$
Subsequently, a 1 × 1 convolution is adopted to facilitate the exchange of feature information across channels, adding
$$C \times 1 \times 1 \times C = C^2 \tag{9}$$
The total parameter count is therefore $5.25C^2 + C^2 = 6.25C^2$, approximately 30% fewer than the $9C^2$ demanded by a standard 3 × 3 convolution, delivering multi-scale feature extraction at a substantially lower cost.
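The arithmetic above can be double-checked by counting the weights of concrete modules. The snippet below is illustrative (channel width 64; biases are omitted to match Equations (7)–(9)) and mirrors the grouping of the DSCSC sketch above.

```python
import torch.nn as nn

def weight_count(module: nn.Module) -> int:
    # Count only kernel weights (no biases), to match Equations (7)-(9).
    return sum(p.numel() for name, p in module.named_parameters() if name.endswith("weight"))

C = 64
standard = nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)

g = C // 4
dscsc = nn.ModuleList(
    [nn.Conv2d(g, g, k, padding=k // 2, bias=False) for k in (1, 3, 5, 7)]
    + [nn.Conv2d(C, C, 1, bias=False)])               # 1x1 fusion conv

print(weight_count(standard), 9 * C * C)              # 36864 36864
print(weight_count(dscsc), int(6.25 * C * C))         # 25600 25600
```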

4. Experiment

4.1. TinyPerson Dataset Object Detection

The TinyPerson dataset was specifically designed to address maritime rapid rescue scenarios, focusing on the detection of small targets. It consists of 1610 annotated images and 759 unannotated images, all obtained from a common video dataset, leading to a total of 72,651 annotations. Significantly, the TinyPerson dataset distinguishes itself from other datasets due to its substantial number of small target instances, which are distributed more evenly across a range of sizes. This unique characteristic poses challenges, in addition to offering opportunities for the enhancement of detection algorithms within this field.

4.2. Evaluation Metrics and Implementation Details

We adopt the YOLOv8 architecture as the foundational network for our model, using a training batch size of 32 and a total of 300 training epochs. We employ the standard evaluation metrics, AP@.5 and AP@[.5,.95], to assess performance, and we also report Params and FLOPs to compare model size and computational cost. Both the initial and final learning rates are set to 0.01, SGD is used as the optimizer, and neither data augmentation nor pre-training is applied. All models were trained on four NVIDIA RTX 3090 GPUs.
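For orientation, a training call along the following lines would express these settings, assuming the Ultralytics YOLOv8 training interface; the model YAML (which would have to register the DTA-Head and DSCSC modules) and the dataset YAML are hypothetical placeholders, not files released with the paper.

```python
from ultralytics import YOLO

# Hypothetical model definition registering the custom DTA-Head and DSCSC modules.
model = YOLO("yolov8m-dta-dscsc.yaml")

model.train(
    data="tinyperson.yaml",     # placeholder dataset config
    epochs=300, batch=32,
    optimizer="SGD",
    lr0=0.01, lrf=1.0,          # final LR = lr0 * lrf = 0.01 under this interface
    pretrained=False,
    # Disable the default augmentations to match the "no augmentation" setting.
    mosaic=0.0, mixup=0.0, fliplr=0.0, hsv_h=0.0, hsv_s=0.0, hsv_v=0.0,
    device="0,1,2,3",           # four RTX 3090 GPUs
)
```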

4.3. Experimental Results

Table 1 shows that equipping YOLOv8-M with our DTA-Head improves AP@.5 from 27.6% to 30.7% (+3.1%) while reducing Params by 2.69 M (25.84 M → 23.15 M). Adding DSCSC yields another +0.2% AP@.5 and a further parameter reduction of 1.47 M, demonstrating that it not only enhances the features of small targets but also compresses the model. The trend holds across N/S/M sizes: the full model consistently outperforms its YOLOv8 counterpart by 2.3–3.3% AP@.5 with fewer parameters.
Table 2 compares our approach with recent state-of-the-art detectors on the TinyPerson dataset. DTA+DSCSC-M reaches 30.9% AP@.5, surpassing DynamicHead-M (25.9%), Gold-YOLO-M (28.1%), HGNetV2-M (26.2%), and FADC-M (28.0%) by margins of 5.0%, 2.8%, 4.7%, and 2.9%, respectively. The smaller variants (N/S) exhibit similar gaps, for example, +1.2% over FADC-N and +2.8% over Gold-YOLO-S. These improvements confirm that DTA-Head and DSCSC are especially beneficial for tiny targets.
The two tables jointly reveal that our two lightweight modules, DTA-Head and DSCSC, yield a clear and consistent edge on the TinyPerson dataset: within the YOLO family, they raise AP@.5 while trimming parameters, and against recent state-of-the-art competitors, they achieve the best result in each size group while operating at a markedly smaller model size. These improvements stem from reducing the negative impact of spatial misalignment and from providing sufficiently rich multi-scale features for small targets.

4.4. Supplementary Experiment

We performed experiments to evaluate the effective receptive field (ERF) [38] of the proposed model. The darker area in Figure 8a represents pixels with contribution scores above 50%. Our model achieves the largest dark region, which indicates that DSCSC successfully enlarges the ERF without introducing irrelevant background. A quantitative evaluation of various models is presented in Table 3. For our method, accumulating 99% of the total contribution score requires a rectangle covering 92.3% of the input area, indicating that a substantial proportion of pixels contribute meaningfully to the final prediction. Compared with other approaches, our method exhibits a larger receptive field. HGNetV2 [33] enlarges receptive fields by stacking large-kernel and dilated convolutions within each stage to create a single, shared representation for all tasks, whereas our method keeps the backbone unchanged, decouples classification and regression into separate heads, and uses DCNv2 offsets to realign their predictions.
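The r-versus-t metric of Table 3 can be approximated as sketched below, assuming (following the ERF analysis of [38]) that the contribution map aggregates the absolute gradient of the centre output activation with respect to the input and that the rectangle is a square centred on the input; this is an illustration of the evaluation procedure, not the authors' script.

```python
import numpy as np

def min_area_ratio(contrib: np.ndarray, threshold: float) -> float:
    """Smallest centred-square area ratio whose contribution sum exceeds
    `threshold` (expressed as a fraction of the total contribution)."""
    contrib = contrib / contrib.sum()
    h, w = contrib.shape
    cy, cx = h // 2, w // 2
    for r in range(1, max(h, w) // 2 + 1):
        window = contrib[max(cy - r, 0):cy + r, max(cx - r, 0):cx + r]
        if window.sum() >= threshold:
            return window.size / contrib.size
    return 1.0

# Toy contribution map: a broad Gaussian mimics a large, smooth ERF.
y, x = np.mgrid[-1:1:256j, -1:1:256j]
contrib = np.exp(-(x**2 + y**2) / 0.5)
for t in (0.2, 0.3, 0.5, 0.99):
    print(f"t = {t:.0%}: r = {min_area_ratio(contrib, t):.1%}")
```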
Furthermore, we conducted heatmap experiments [39] to compare our model with other models. Figure 8b shows heatmaps of densely clustered small persons. The heatmaps of competing methods (DynamicHead, YOLOv8-M, etc.) are either dispersed or focused on the background. In contrast, the heatmap of our DTA+DSCSC-M concentrates precisely on the person pixels, demonstrating that DTA-Head, together with DSCSC, steers the network toward the correct targets.
To thoroughly assess the impact of the convolution kernel sizes within DSCSC, we performed an ablation study, summarized in Table 4. The configuration K = [1, 3, 5, 7] delivers the highest AP@.5 on the TinyPerson dataset while keeping Params and FLOPs comparatively low. Specifically, this setting improves AP@.5 by 1.5% over K = [1, 3, 3, 5] and by 0.4% over K = [5, 7, 7, 9]. Moreover, compared with the next-best configuration, our chosen configuration reduces Params by 4.01 M and FLOPs by 8.5 G, demonstrating an excellent balance between accuracy and efficiency.
We also conducted extensive experiments on the VisDrone-2019 validation set. Table 5 summarizes the results: our DTA + DSCSC-M detector achieves 45.5% AP@.5 and 28.0% AP@[.5,.95], outperforming the strongest competing method (Gold-YOLO-M) by 1.7% and 1.4%, respectively. This is accomplished with only 21.15 M Params, which is 5.55 M fewer than Gold-YOLO-M, demonstrating that our gains are achieved without incurring additional computational cost. The AP@.5 of our method surpasses that of the lightweight HGNetV2-M by 4.3% while adding merely 2.76 M Params. These consistent improvements confirm the generalization capability of our DTA-Head and DSCSC modules beyond the TinyPerson dataset.

5. Conclusions

Detecting small targets with deep convolutional networks remains a considerable challenge, primarily because the progressive stacking of convolutional layers degrades feature information, which particularly hampers the accurate identification of small targets. Our proposed DTA-Head utilizes DCNv2 to facilitate dynamic alignment in the regression branch and dynamic feature selection in the classification branch. To address the limited interaction among channels within the convolutional module during feature extraction, we introduced DSCSC, which divides and reorganizes channels during the convolution process and applies convolutions with different kernel sizes to the resulting groups. This technique enhances the extraction of diverse features, improving the convolutional module’s ability to capture features across different target scales and facilitating information exchange among channels. This work can contribute to the field of small object detection.

Author Contributions

Conceptualization, D.Q.; Methodology, K.Y.; Validation, Q.L.; Resources, D.Q.; Writing—review & editing, Y.Y. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the National Natural Science Foundation of China (Grant Nos. U1909201, 62101490, 6212780029, 2022C01056, 52467024, and 62476242), the State Grid Corporation of China Technology Project (Grant No. 5700-202019487A-0-0-00), the National Natural Science Foundation of Zhejiang Province (Grant No. LQ21F030017), the Research Startup Funding from Hainan Institute of Zhejiang University (Grant No. 0210-6602-A12203), and the Sanya Science and Technology Innovation Project (Grant No. 2022KJCX47).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that this study received funding from State Grid Corporation of China. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

References

  1. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  2. Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale match for tiny person detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 1257–1265. [Google Scholar]
  3. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2016, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  6. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. arXiv 2016, arXiv:1605.06409. [Google Scholar]
  7. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  8. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 354–370. [Google Scholar]
  9. Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2874–2883. [Google Scholar]
  10. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  11. Liang, Z.; Shao, J.; Zhang, D.; Gao, L. Small object detection using deep feature pyramid networks. In Proceedings of the Advances in Multimedia Information Processing–PCM 2018: 19th Pacific-Rim Conference on Multimedia, Hefei, China, 21–22 September 2018; Proceedings, Part III 19. Springer: Berlin/Heidelberg, Germany, 2018; pp. 554–564. [Google Scholar]
  12. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933. [Google Scholar] [CrossRef] [PubMed]
  13. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
  14. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. Foveabox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398. [Google Scholar] [CrossRef]
  15. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 840–849. [Google Scholar]
  16. Zhu, C.; Chen, F.; Shen, Z.; Savvides, M. Soft anchor-point object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 91–107. [Google Scholar]
  17. Zand, M.; Etemad, A.; Greenspan, M. Objectbox: From centers to boxes for anchor-free object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 390–406. [Google Scholar]
  18. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  19. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  20. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  21. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  22. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27706–27716. [Google Scholar]
  23. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. arXiv 2023, arXiv:2309.11331. [Google Scholar]
  24. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
  25. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
  26. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  27. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  28. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  29. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  30. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  31. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 5694–5703. [Google Scholar]
  32. Fan, Q.; Huang, H.; Chen, M.; Liu, H.; He, R. Rmt: Retentive networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 5641–5651. [Google Scholar]
  33. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 16965–16974. [Google Scholar]
  34. Chen, L.; Gu, L.; Zheng, D.; Fu, Y. Frequency-Adaptive Dilated Convolution for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 3414–3425. [Google Scholar]
  35. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 15909–15920. [Google Scholar]
  36. Han, K.; Wang, Y.; Guo, J.; Wu, E. ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 15751–15761. [Google Scholar]
  37. Zheng, M.; Sun, L.; Dong, J.; Pan, J. SMFANet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In Proceedings of the European Conference on Computer Vision, Seattle, WA, USA, 16–24 June 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 359–375. [Google Scholar]
  38. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  39. Draelos, R.L.; Carin, L. Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks. arXiv 2020, arXiv:2011.08891. [Google Scholar]
Figure 1. In the comparison of small target detection performance on the TinyPerson dataset, other detection methods demonstrate limitations in distinguishing clustered small targets. In contrast, our DTA-Head exhibits a notable ability to effectively identify these clustered small targets.
Figure 2. When the detection box moves by two pixels, the IoU for the small target decreases from 100% to 20%, while for large targets, it declines to 63.3%. As the size of the large target increases and the size of the small target decreases, this disparity is expected to further intensify.
Figure 3. A comparison of Average precision at a threshold of 0.5 (AP@.5) on the TinyPerson dataset utilizing various methodologies. Notably, we attained a score of 30.9% while maintaining a relatively limited number of parameters.
Figure 4. Model framework diagram illustrating the process by which features extracted by the backbone are initially processed through two shared convolutional layers. Subsequently, these processed features are integrated with the original features to generate joint features. Ultimately, the joint features are input into the DTA-Head for the purposes of classification and regression detection.
Figure 5. The joint features are initially processed within the task-decomposition module to produce prediction maps relevant to both classification and regression tasks. Simultaneously, these joint features generate feature maps that aid in aligning the prediction maps. A scale layer is utilized to adjust the various feature layers to accommodate the differing scales of the targets.
Figure 6. The architecture of the task-decomposition module, ultimately resulting in the feature map used for prediction.
Figure 7. In the proposed DSCSC architecture, the convolutional channels are categorized into four distinct groups, each employing convolutional kernels of varying dimensions: specifically, 1 × 1, 3 × 3, 5 × 5, and 7 × 7. Subsequently, the feature information obtained from each group is integrated using a 1 × 1 convolutional kernel.
Figure 8. (a) Comparison of effective receptive fields (ERFs) across various methodologies reveals that a more widely distributed dark area corresponds to a larger ERF. Our approach exhibits a notable proficiency in achieving larger ERFs. (b) In conducting a heatmap analysis of densely clustered small targets within the TinyPerson dataset, our approach emphasizes the regions where these small targets are primarily situated.
Table 1. A comparative analysis of the YOLO-series models utilizing the TinyPerson test dataset. (Designations of N, S, and M correspond to the model sizes).
Method | AP@.5 (%) | AP@[.5,.95] (%) | Params (M) | FLOPs (G)
YOLOv5-N | 23.8 | 7.42 | 2.50 | 7.1
YOLOv6-N [29] | 22.3 | 6.85 | 4.23 | 11.8
YOLOv8-N | 24.1 | 7.39 | 3.00 | 8.1
YOLOv9-T [30] | 21.9 | 6.80 | 1.97 | 7.6
YOLOv8+DTA-N | 26.1 | 8.06 | 2.24 | 8.6
YOLOv8+DTA+DSCSC-N | 26.4 | 8.17 | 2.14 | 8.4
YOLOv5-S | 26.5 | 8.13 | 9.11 | 23.8
YOLOv6-S [29] | 25.1 | 8.12 | 16.30 | 44.0
YOLOv8-S | 26.7 | 8.59 | 11.13 | 28.4
YOLOv9-S [30] | 26.8 | 8.55 | 7.17 | 26.7
YOLOv8+DTA-S | 29.0 | 9.16 | 8.88 | 33.0
YOLOv8+DTA+DSCSC-S | 29.9 | 9.48 | 8.47 | 32.2
YOLOv5-M | 27.7 | 8.97 | 25.05 | 64.0
YOLOv6-M [29] | 19.8 | 6.12 | 51.98 | 161.1
YOLOv8-M | 27.6 | 9.05 | 25.84 | 78.7
YOLOv9-M [30] | 27.3 | 8.71 | 20.16 | 77.0
YOLOv8+DTA-M | 30.7 | 9.45 | 23.15 | 98.3
YOLOv8+DTA+DSCSC-M | 30.9 | 9.68 | 21.68 | 95.2
Table 2. Comparison with the most recent mainstream methods applied to the TinyPerson test dataset.
Method | Reference | Input Size | AP@.5 (%) | AP@[.5,.95] (%) | Params (M) | FLOPs (G)
DynamicHead-N [24] | CVPR 2021 | 640 | 23.2 | 7.40 | 3.49 | 9.6
Gold-YOLO-N [23] | NeurIPS 2023 | 640 | 24.2 | 7.38 | 5.98 | 10.2
PKINet-N [22] | CVPR 2024 | 640 | 23.1 | 7.42 | 6.06 | 25.1
StarNet-N [31] | CVPR 2024 | 640 | 20.7 | 6.70 | 2.21 | 6.5
RMT-N [32] | CVPR 2024 | 640 | 24.3 | 7.96 | 14.83 | 43.2
HGNetV2-N [33] | CVPR 2024 | 640 | 23.6 | 7.33 | 2.35 | 6.9
FADC-N [34] | CVPR 2024 | 640 | 25.2 | 7.86 | 3.02 | 8.0
RepViT-N [35] | CVPR 2024 | 640 | 23.1 | 7.26 | 2.28 | 6.3
ParameterNet-N [36] | CVPR 2024 | 640 | 24.2 | 7.64 | 4.43 | 6.9
SMFANet-N [37] | ECCV 2024 | 640 | 23.2 | 7.19 | 2.67 | 7.3
DTA+DSCSC-N (Ours) | - | 640 | 26.4 | 8.17 | 2.14 | 8.4
DynamicHead-S [24] | CVPR 2021 | 640 | 25.9 | 8.58 | 10.85 | 28.1
Gold-YOLO-S [23] | NeurIPS 2023 | 640 | 27.1 | 8.68 | 13.61 | 29.9
PKINet-S [22] | CVPR 2024 | 640 | 23.4 | 7.49 | 10.46 | 36.1
StarNet-S [31] | CVPR 2024 | 640 | 22.4 | 7.31 | 6.54 | 17.3
RMT-S [32] | CVPR 2024 | 640 | 25.3 | 8.30 | 19.39 | 54.4
HGNetV2-S [33] | CVPR 2024 | 640 | 25.5 | 8.18 | 8.47 | 23.3
FADC-S [34] | CVPR 2024 | 640 | 26.6 | 8.52 | 11.16 | 28.0
RepViT-S [35] | CVPR 2024 | 640 | 26.4 | 8.31 | 8.24 | 21.5
ParameterNet-S [36] | CVPR 2024 | 640 | 26.8 | 8.69 | 16.80 | 23.7
SMFANet-S [37] | ECCV 2024 | 640 | 26.8 | 8.64 | 9.59 | 25.0
DTA+DSCSC-S (Ours) | - | 640 | 29.9 | 9.48 | 8.47 | 32.2
DynamicHead-M [24] | CVPR 2021 | 640 | 25.9 | 8.40 | 24.71 | 75.2
Gold-YOLO-M [23] | NeurIPS 2023 | 640 | 28.1 | 9.14 | 26.69 | 76.7
PKINet-M [22] | CVPR 2024 | 640 | 23.9 | 7.64 | 18.36 | 59.9
StarNet-M [31] | CVPR 2024 | 640 | 23.2 | 7.33 | 14.41 | 41.1
RMT-M [32] | CVPR 2024 | 640 | 26.4 | 8.54 | 05.89 | 78.4
HGNetV2-M [33] | CVPR 2024 | 640 | 26.2 | 8.64 | 18.39 | 57.9
FADC-M [34] | CVPR 2024 | 640 | 28.0 | 9.09 | 25.92 | 77.6
RepViT-M [35] | CVPR 2024 | 640 | 27.8 | 8.96 | 16.34 | 49.3
ParameterNet-M [36] | CVPR 2024 | 640 | 28.3 | 9.08 | 44.40 | 59.4
SMFANet-M [37] | ECCV 2024 | 640 | 28.2 | 9.19 | 21.38 | 65.6
DTA+DSCSC-M (Ours) | - | 640 | 30.9 | 9.68 | 21.68 | 95.2
Table 3. Quantitative analysis of the ERF: the minimum area ratio (r) of a rectangle required to reach a contribution score exceeding a predetermined threshold (t). A larger r suggests a smoother distribution of high-contribution pixels and, hence, a larger ERF.
Method | t = 20% | t = 30% | t = 50% | t = 99%
DynamicHead-M [24] | 3.8% | 6.2% | 12.4% | 61.3%
Gold-YOLO-M [23] | 5.3% | 8.7% | 17.4% | 88.2%
PKINet-M [22] | 2.1% | 4.4% | 12.4% | 89.4%
StarNet-M [31] | 0.8% | 2.2% | 8.5% | 88.8%
RMT-M [32] | 1.1% | 2.6% | 9.1% | 77.9%
HGNetV2-M [33] | 3.5% | 6.2% | 13.7% | 84.7%
FADC-M [34] | 4.4% | 7.1% | 14.2% | 86.4%
RepViT-M [35] | 4.3% | 7.0% | 14.2% | 73.0%
ParameterNet-M [36] | 4.9% | 7.8% | 15.6% | 87.6%
SMFANet-M [37] | 4.9% | 7.8% | 15.9% | 83.6%
YOLOv8-M | 5.0% | 8.2% | 16.4% | 84.7%
DTA+DSCSC-M (Ours) | 6.6% | 10.9% | 21.0% | 92.3%
Table 4. Ablation study on different DSCSC kernel combinations. K denotes the ordered sequence of four convolutional kernel sizes used in DSCSC. The [1, 3, 5, 7] configuration achieves the best AP@.5 of 30.9%.
K | AP@.5 (%) | AP@[.5,.95] (%) | Params (M) | FLOPs (G)
[1, 3, 3, 5] | 29.4 | 9.58 | 20.35 | 92.3
[3, 5, 5, 7] | 29.6 | 9.56 | 22.49 | 96.9
[5, 7, 7, 9] | 30.5 | 9.62 | 25.69 | 103.7
[3, 5, 7, 9] | 29.7 | 9.49 | 24.36 | 100.8
[1, 3, 5, 7] | 30.9 | 9.68 | 21.68 | 95.2
Table 5. Comparison on VisDrone-2019. Our method achieves the top detection performance.
Method | AP@.5 | AP@[.5,.95] | Params (M)
DynamicHead-M [24] | 0.434 | 0.266 | 24.72
Gold-YOLO-M [23] | 0.438 | 0.266 | 26.70
HGNetV2-M [33] | 0.412 | 0.249 | 18.39
FADC-M [34] | 0.433 | 0.263 | 25.92
YOLOv8-M | 0.432 | 0.262 | 25.85
DTA+DSCSC-M (Ours) | 0.455 | 0.280 | 21.15
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
