1. Introduction
Object detection, as one of the critical tasks in computer vision, has received extensive attention. It is increasingly important in many real-world applications, such as autonomous driving, monitoring systems, human–computer interaction, medical diagnosis, smart agriculture, and retail analysis. Its broad applicability underscores its importance, driving research efforts to refine and advance detection technology.
Early object detection methods relied mainly on hand-crafted components to extract features until the emergence and widespread adoption of convolutional neural networks (CNNs) [1,2,3,4] brought a paradigm shift. CNN-based frameworks such as Faster R-CNN [2], YOLO [1,5], and Mask R-CNN [6] not only surpass traditional techniques in accuracy but also exhibit unprecedented efficiency, enabling real-time detection. Although these models are groundbreaking and have received extensive attention and substantial improvement, they still struggle to capture complex object relationships and long-range dependencies.
The Transformer [7] has achieved great success in natural language processing (NLP) and has also been extended to computer vision [8,9]. With the promise of handling complex spatial relationships and feature dependencies, models such as the DEtection TRansformer (DETR) [8] and its derivatives [10,11,12,13,14] set new benchmarks by leveraging the Transformer's inherent self-attention mechanism [7], opening up what is now known as the DETR paradigm for object detection architectures. However, we argue that even in these state-of-the-art architectures there is an implicit challenge: content and location information are constantly intermixed in the attention computation.
As mentioned earlier, the widely used DETR-paradigm methods [8,10,11,12,13,14] usually treat object detection as a single task. They integrate localization and classification when querying the feature map, so that detection inside the model is always a mixed localization-and-classification task, as shown in Figure 1a,b.
Achieving precise localization and detailed classification jointly is more difficult than solving either simple, single task on its own. Furthermore, the richness of content attributes, such as complex textures, patterns, and color gradients, is often obscured. Recognizing these potential pitfalls, our work introduces a fine-grained multi-task approach to object detection that distinguishes between the localization and classification tasks, which can be regarded as learning location information and learning content information, respectively, as shown in Figure 1c. This ensures that tasks in different dimensions are optimized separately, simplifying the object detection task within the model without affecting the other dimensions.
Inspired by Conditional DETR [11] and DAB-DETR [10], our approach starts from the joint learning stage to ensure a robust balance between localization and classification. Subsequently, a specialized content learning mechanism queries the content information of objects in the feature map and can identify subtle object attributes, thus providing a more comprehensive representation. At the same time, the position learning component adjusts the positions of interest in the feature map, allowing the model to be fine-tuned in stages to ensure accuracy.
Our main contributions include the following:
- (1) We propose a pioneering multi-task object detection framework composed of simple subtasks. We separate the content-query and location-query tasks and jointly optimize the overall detection task together with the subtasks, while the subtasks do not interfere with each other.
- (2) We design task-specific loss functions and an iterative training method.
- (3) Comprehensive evaluations on leading benchmarks confirm our model's strengths in accuracy, object understanding, and scalability, while increasing the interpretability of the model's internal components.
The remainder of this paper delves into the technical details, empirical validation, and a comprehensive discussion, showing how our approach casts contemporary object detection in a new light.
3. Proposed Method
We propose an approach, compatible with DETR-like models, that divides the object detection task into multiple subtasks: content learning related to object information and classification (involving the content knowledge of multiple targets), position learning related to object localization (involving the positioning of multiple objects), and joint learning of position and content.
Based on the DAB-DETR model, we design and implement our tasks, facilitating the concurrent execution of the above functions within an integrated framework. In the following sections, we first outline the overall pipeline of our method, explain the joint learning of localization and classification in object detection, describe the structural design of content learning, detail the scheme of position learning, and finally explain the model's loss function and training strategy.
3.1. Overall Structure of the Proposed Model
As shown in
Figure 2, the architecture of our proposed method is decoder-centric, where object detection is treated as multi-task learning. The decoder consists of three sub-modules related to content learning, position learning, and joint learning of content and position. Each sub-module is displayed with the same color in
Figure 2.
Content learning: For a more in-depth study, we separate the classification task in object detection into a content learning subtask, shown as the self-attention-C and cross-attention-C blocks in Figure 2. Our content learning mechanism is specifically tailored to emphasize the complex textures, patterns, and color gradients of objects. Unlike traditional models, our design emphasizes content attributes for object detection, enhancing the model's classification accuracy.
Position learning: Similarly, we separate the positioning task in object detection into a position learning subtask, shown as the self-attention-P and cross-attention-P blocks in Figure 2. Position learning focuses on areas of the feature map that contain objects while ignoring their specific categories. Incorporating this subtask alongside the base layer fine-tunes the model's ability to accurately predict the spatial location of objects in an image.
Joint learning of content and position: The core of our method is the object detection decoder of the DETR-like model. Object queries are used to query the feature map and learn the information required for the detection task, shown as the self-attention-D and cross-attention-D blocks in Figure 2.
In the DETR-paradigm architecture, the decoder's localization and classification query functions are always integrated. In traditional object detection paradigms, merging these tasks often results in a precision or recall trade-off, and the combined detection task is more complicated than a single localization or classification task. Therefore, by using DETR's decoder as the base layer of our model and the localization and classification decoders as auxiliary layers, we can treat localization and classification as separate but intertwined tasks. This approach ensures that the model understands the fundamental properties of object detection from the beginning and makes the model more interpretable.
The synergy of these three learning processes forms a cohesive end-to-end training workflow that follows a progression from simple to complex, ensuring that every aspect of object detection is addressed with precision and depth. The following subsections delve into the complexities and design philosophies underpinning each stage.
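For illustration only, the following PyTorch-style sketch shows how one decoder layer with the three branches could be organized; the module names, feed-forward design, and other details are assumptions for exposition rather than our exact implementation.

```python
# Illustrative sketch of one decoder layer with the three branches described above:
# detection (D), content (C), and position (P). Names are assumptions for exposition.
import torch.nn as nn

class TriBranchDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        # One (self-attention, cross-attention, FFN) set per branch, as in Figure 2.
        self.branches = nn.ModuleDict({
            name: nn.ModuleDict({
                "self_attn": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "cross_attn": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(d_model, d_model * 4), nn.ReLU(),
                                     nn.Linear(d_model * 4, d_model)),
            })
            for name in ("det", "content", "position")
        })

    def run_branch(self, name, queries, memory):
        m = self.branches[name]
        q, _ = m["self_attn"](queries, queries, queries)   # queries exchange information
        q, _ = m["cross_attn"](q, memory, memory)          # queries attend to image features
        return q + m["ffn"](q)

    def forward(self, det_q, content_q, position_q, memory):
        # memory: encoder output, shape (batch, H*W, d_model)
        return (self.run_branch("det", det_q, memory),
                self.run_branch("content", content_q, memory),
                self.run_branch("position", position_q, memory))
```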
3.2. Content Learning Stage
In the development of object detection tasks, a richer understanding of image content becomes crucial. Our proposed content learning stage explicitly addresses this requirement. Existing detection methods [8,10,11,12,14] focus on the spatial representation of objects, with no clear distinction between classification and localization within DETR-like models, so the complex details of objects (including their unique features, textures, and relationships) are often not fully explored. Our approach aims to fill this gap by deploying a content query process.
Attention is the key mechanism of the Transformer [7]. It operates on queries (Q), keys (K), and values (V) and is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of the queries and keys.
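For concreteness, the scaled dot-product attention above can be written in a few lines of tensor code; the shapes in the example are arbitrary.

```python
# Minimal sketch of the scaled dot-product attention defined above,
# written with plain tensor ops for clarity.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q: (batch, n_queries, d_k), k/v: (batch, n_keys, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, n_queries, n_keys)
    weights = F.softmax(scores, dim=-1)                 # attention distribution over keys
    return weights @ v                                  # weighted sum of values

# Example: 2 images, 300 queries attending to 1064 encoder tokens of dimension 256.
q = torch.randn(2, 300, 256)
k = torch.randn(2, 1064, 256)
v = torch.randn(2, 1064, 256)
out = scaled_dot_product_attention(q, k, v)   # (2, 300, 256)
```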
As shown in Figure 2, the content learning stage is similar to the primary object detection stage: it also uses a self-attention layer and a cross-attention layer, which we define as $\mathrm{SelfAttn}_C$ and $\mathrm{CrossAttn}_C$. This stage separates the classification part of object detection into a standalone content learning task. Its core goal is not only to identify objects in the image but to conduct more in-depth analysis and provide a detailed understanding of their attributes (excluding their specific locations).
To achieve this, the encoder processes the image, producing high-dimensional image features, denoted as $F$. At the same time, we deploy a set of learnable queries. Unlike traditional queries, these queries are mapped exclusively to the content space and are therefore called content queries $Q_c$. Their main functionality revolves around extracting the intricate details of an object independent of its spatial positioning.
However, to ensure that content queries can effectively focus on image content during the cross-attention stage, we introduce the concept of "fake location ground truth" $\hat{Q}_p$. It acts as a guide, providing auxiliary location information for content queries; crucially, it remains irrelevant to training optimization, as gradients through it are blocked. At this stage, by combining the content queries with the fake location ground truth, we treat the latter as actual bounding box parameters. This strategy creates a scenario in which the model extracts the underlying content details, especially object categories, under the assumption that the location is known.
This process can be represented as follows:

$$\tilde{Q}_c = \mathrm{SelfAttn}_C(Q_c), \qquad O_c = \mathrm{CrossAttn}_C\big(\tilde{Q}_c, \hat{Q}_p, F\big),$$

where $\mathrm{SelfAttn}_C$ and $\mathrm{CrossAttn}_C$ denote the self-attention and cross-attention layers of the content decoder, and $\hat{Q}_p$ denotes the fake position queries extracted from the position queries $Q_p$. Therefore, the output of this stage is interpreted as the object category within a known region.
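A hedged sketch of this stage is given below; the additive way of injecting the fake location guide and the module names are illustrative assumptions, since only the overall mechanism is specified above.

```python
# Sketch of the content learning stage: content queries Q_c are guided by "fake location
# ground truth" derived from the position queries with gradients blocked, then query the
# image features to predict per-query class logits. Names are illustrative assumptions.
import torch.nn as nn

class ContentLearningStage(nn.Module):
    def __init__(self, d_model=256, n_heads=8, num_classes=80):
        super().__init__()
        self.self_attn_c = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_c = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.class_head = nn.Linear(d_model, num_classes)

    def forward(self, content_q, position_q, memory):
        # "Fake location ground truth": positional guide with gradients blocked.
        fake_pos = position_q.detach()
        q, _ = self.self_attn_c(content_q, content_q, content_q)
        # The positional guide is injected only as auxiliary information (additive here).
        q, _ = self.cross_attn_c(q + fake_pos, memory, memory)
        return self.class_head(q)   # class logits: "category in a known region"
```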
3.3. Position Learning Stage
While content recognition is crucial to fully understanding objects, accurate object localization remains the cornerstone of any effective detection system, and obtaining more accurate location information is also a highlight of the improvements proposed by other DETR-like models [10,11]. The position learning stage aims to improve the accuracy of spatial recognition and localization of objects in images, focusing on understanding where objects are located. As shown in Figure 2, and similar to the content learning stage, our position learning stage follows a structure that includes self-attention and cross-attention mechanisms. However, its core difference from the content learning stage is that it focuses exclusively on the spatial properties of objects, without any interference from content details.
The inputs to this stage are the high-dimensional image features $F$ from the encoder and a set of learnable position queries $Q_p$, which aim to find and refine the spatial coordinates of objects within the image. To further guide these position queries, we introduce the concept of "fake content ground truth" $\hat{Q}_c$ as auxiliary category information. However, like its counterpart in the content learning stage, $\hat{Q}_c$ remains irrelevant to training optimization and only serves as a guide.
Essentially, by mixing $Q_p$ with $\hat{Q}_c$, the latter is treated as the actual category label. Therefore, this stage effectively localizes multiple known objects by querying their locations in the image features under the assumption that specific objects are known to exist in the image.
Similarly, for these position queries to perform optimally, they undergo a self-attention mechanism, which we define as $\mathrm{SelfAttn}_P$, that allows each query to adaptively adjust its focus based on insights gleaned from the other queries. This adaptive refinement helps reduce overlap and redundancy.
The next step involves the cross-attention mechanism, defined as $\mathrm{CrossAttn}_P$. Here, the refined position queries are fused with the fake content labels $\hat{Q}_c$ and then interact with the image features $F$, allowing each query to focus on a specific spatial region of the image. This interaction brings sharper spatial focus, ensuring higher accuracy in predicting bounding boxes.
After the cross-attention stage, similar to the object detection stage, the position queries are transformed to produce predictions of object spatial boundaries, represented as bounding box coordinates.
The procedural steps can be summarized as follows:

$$\tilde{Q}_p = \mathrm{SelfAttn}_P(Q_p), \qquad O_p = \mathrm{CrossAttn}_P\big(\tilde{Q}_p, \hat{Q}_c, F\big).$$

Therefore, the output of this stage is used to locate the regions of the image features where objects exist.
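Analogously, the position learning stage can be sketched as follows; again, the additive fusion with the detached content guide and the box-head design are illustrative assumptions.

```python
# Sketch of the position learning stage, mirroring the content stage: position queries
# Q_p are guided by detached ("fake") content queries and regress bounding boxes.
import torch.nn as nn

class PositionLearningStage(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn_p = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_p = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 3-layer MLP box head predicting (cx, cy, w, h), as is common in DETR-like models.
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4),
        )

    def forward(self, position_q, content_q, memory):
        fake_content = content_q.detach()          # auxiliary category guide, no gradient
        q, _ = self.self_attn_p(position_q, position_q, position_q)
        q, _ = self.cross_attn_p(q + fake_content, memory, memory)
        return self.box_head(q).sigmoid()          # normalized box coordinates per query
```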
3.4. Joint Content and Position Learning Stage
We describe object detection as a joint learning process of localization and classification. This stage builds on the decoder structure of other DETR-like models [10,11,14], employing a collaborative learning approach to localize and classify objects in images. We define it as decoder-D, which includes the self-attention-D and cross-attention-D blocks shown in Figure 2.
The encoder first processes the input image and generates a set of high-dimensional image features, denoted as $F$. Meanwhile, the decoder takes several predefined object queries as input, each potentially corresponding to an object in the image. These queries are abstract representations that are converted into a localization prediction (a bounding box) and a classification prediction (an object category).
Self-attention mechanism: Object queries undergo a self-attention mechanism before interacting with the image features. This step correlates the queries with one another, discouraging predictions that overlap or collapse onto each other in the feature space. One can also view this as letting the queries communicate, creating an environment in which they interpret the scene together.
Cross-attention mechanism: After self-attention, cross-attention is computed between the output queries and the image features $F$. This process allows each query to focus on a specific area of the image, attending to particular details, textures, patterns, and spatial features. The result is a refined set of features for each query, tuned to its respective focus area and prepared for prediction.
Prediction phase: Once the features are enriched, each query undergoes a series of transformations to predict the location and class of its corresponding potential object. Locations are bounding boxes with coordinates, and classifications are probability distributions over predefined object categories. During the inference phase, our network structurally uses the detection output to produce its predictions. However, the detection output utilizes a fusion mechanism to integrate information from the content output and the position output: the content output provides class information about detected objects, while the position output provides spatial information about where these objects are. The detection output then combines these two pieces of information to produce the final confidence score and bounding box for each detection. This fusion ensures that, at inference time, the model can exploit the complete information learned through the separated content and position learning to improve detection accuracy.
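The inference-time fusion can be sketched as follows; the simple averaging scheme is an assumption for illustration, as the text above specifies only that class information from the content output and spatial information from the position output are combined with the detection output.

```python
# Sketch of fusing the three outputs at inference time. The averaging is an assumption;
# only the fact that the outputs are combined is stated in the text.
import torch

def fuse_outputs(det_logits, det_boxes, content_logits, pos_boxes):
    # det_logits / content_logits: (batch, n_queries, n_classes)
    # det_boxes / pos_boxes:       (batch, n_queries, 4) in normalized (cx, cy, w, h)
    class_prob = (det_logits.sigmoid() + content_logits.sigmoid()) / 2
    boxes = (det_boxes + pos_boxes) / 2
    scores, labels = class_prob.max(dim=-1)     # final confidence and class per query
    return scores, labels, boxes
```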
This collaborative learning mechanism ensures that with each iteration, the model can better recognize the exact location and category of objects in the image. Over time, through backpropagation and optimization routines, learnable object queries fine-tune themselves to match real-world objects, ensuring accurate object detection.
This stage is the standard decoder process of DETR-like models. During training, the queries used for localization and classification are fused and participate jointly in the computation to accomplish the basic object detection task. The query usage and decoding mechanisms in this process lay the foundation for introducing the more detailed subtasks described above.
3.5. Training Strategy and Loss Computation
We adopt an end-to-end training strategy, meaning that all components (object detection, content learning, and position learning) are trained simultaneously, enabling the queries to detect objects in images while enhancing object location and category awareness. In the content learning stage, the model focuses on the image content to ensure that the content queries capture the content information of objects, while in the position learning stage, the model focuses on the possible object regions in the image, ensuring that the position queries accurately capture the position information of the foreground regions.
Considering that our model has to handle multiple tasks, we use a multi-task loss function to train it. Following the approach of DETR-like models, we first find the best bipartite match between the predictions produced by the detection queries and the ground truth and design the loss function accordingly. According to our task design, the loss function mainly consists of an object detection loss, a content loss, and a position loss, where the detection loss itself includes localization and classification terms. The three outputs of the model correspond to the object queries, content queries, and position queries; we optimize each output according to the corresponding loss and adjust the proportion of each loss to obtain the best performance. Specifically, the localization loss combines the $\ell_1$ loss and the GIoU loss to quantify the difference between predicted and ground truth bounding boxes, and the classification loss uses the focal loss. The overall loss function is therefore formed as follows:

$$\mathcal{L} = \lambda_{det}\,\mathcal{L}_{det} + \lambda_{cont}\,\mathcal{L}_{cont} + \lambda_{pos}\,\mathcal{L}_{pos},$$

where $\lambda_{det}$, $\lambda_{cont}$, and $\lambda_{pos}$ are the weights of each loss, used to adjust its importance in the overall loss. The choice of these weights is usually tuned based on performance on the validation dataset; after experimental verification, we set $\lambda_{det}$, $\lambda_{cont}$, and $\lambda_{pos}$ to 2, 2, and 1, respectively. For the specific hyperparameters used in detection, we follow the settings of DAB-DETR [10].
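The weighted combination can be expressed compactly as follows; the sub-losses are placeholders that would each be computed after bipartite matching, as described above.

```python
# Sketch of the weighted multi-task loss. Each sub-loss is a placeholder: in practice
# the box losses combine L1 and GIoU and the classification loss is focal loss,
# all computed after Hungarian (bipartite) matching.
import torch

def total_loss(loss_det, loss_content, loss_position,
               w_det=2.0, w_content=2.0, w_position=1.0):
    # Weights follow the values reported in the text (2, 2, 1).
    return w_det * loss_det + w_content * loss_content + w_position * loss_position

# Example usage with dummy scalar losses:
l = total_loss(torch.tensor(1.3), torch.tensor(0.8), torch.tensor(0.5))
```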
4. Experimental Results
4.1. Datasets
We evaluate our method on the COCO 2017 object detection dataset. MS-COCO contains roughly 164K images spanning 80 categories, divided into train2017 with 118K images, val2017 with 5K images, and test2017 with 41K images. Following common practice [15], we report the standard mean average precision (AP) results on the COCO validation set under different IoU thresholds and object scales.
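For completeness, this evaluation protocol can be reproduced with the standard pycocotools API; the file paths below are placeholders for illustration.

```python
# Standard COCO-style evaluation, reporting AP at multiple IoU thresholds and object
# scales (AP, AP50, AP75, AP_S, AP_M, AP_L). File paths are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")    # ground-truth annotations
coco_dt = coco_gt.loadRes("our_model_detections.json")  # predictions in COCO result format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints the standard AP / AR table
```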
4.2. Implementation Details
Following the approach of DAB-DETR [10], we utilize various ResNet [24] models, pre-trained on ImageNet, as our backbone. Regarding hyperparameters, we use the same values as DAB-DETR's configuration, employing a 6-layer Transformer encoder and a 6-layer Transformer decoder with a hidden dimension of 256.
Our decoder includes three self-attention layers and three cross-attention layers as a block, corresponding to object detection, content learning, and position learning, as shown in
Figure 2.
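For reference, the hyperparameters above can be collected in a small configuration object; the field names are our own, and the number of queries and attention heads are assumptions following common DAB-DETR defaults rather than values stated in this section.

```python
# Hedged configuration sketch matching the implementation details above.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    backbone: str = "resnet50"      # ImageNet-pretrained ResNet backbone
    hidden_dim: int = 256           # Transformer hidden dimension
    enc_layers: int = 6             # encoder layers
    dec_layers: int = 6             # decoder layers, each with the three attention branches
    num_queries: int = 300          # object/content/position queries (assumed default)
    n_heads: int = 8                # attention heads (assumed default)

cfg = ModelConfig()
```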
4.3. Object Detection Performance Comparison
Our proposed approach is plug-and-play. Therefore, we insert our proposed method into several popular DETR-like models and compare the results to evaluate its effectiveness on the object detection task.
Our goal is to observe and validate the effectiveness of our model in improving the performance of object detection tasks and the role of its separated components in explaining the object detection task during model decoding.
Table 1 shows the comparison results between various DETR-like models after using our method. For a fair comparison, we use the same parameter settings as each method, and the difference only exists in whether to insert our proposed subtask module.
Our approach yields superior performance for each model; for example, on DAB-DETR it achieves a 0.6% improvement in average precision (AP) over the baseline. Specifically, for small objects, our method attains a 0.9% improvement, indicating that our approach offers improved detection efficacy on small objects.
This kind of comparison validates our approach’s effectiveness and points out areas of potential improvement that simplify the model’s task internally. Through this rigorous analysis, we can identify the strengths of our model and areas where further refinements could be beneficial.
4.4. Ablation Experiments
We conduct a series of component ablation experiments to gain insight into the contribution of individual components in our proposed method to the performance.
Baseline model: We first need to determine the baseline model, which is trained without any of our specific added components, that is, the original model. For example, if we conduct a series of ablation implementations based on DAB-DETR, the baseline model is the DAB-DETR model. This provides a reference point that allows us to evaluate the performance improvement of each component.
Single-component and combined additions:
Baseline plus content learning: this configuration adds only the content learning stage to the baseline model.
Baseline plus position learning: this configuration adds only the position learning stage to the baseline model.
Baseline plus content and position learning: this configuration lets us observe the performance changes when the two main components are present simultaneously and thus evaluate their synergy.
We trained and evaluated the model in each configuration, recording key performance metrics, as shown in
Table 2.
The results show that the content and position learning stages are necessary for performance improvement, where position learning contributes most significantly. Through these ablation experiments, we confirm the effectiveness and necessity of the component design of our method.
4.5. Attention Visualization of Decoders
To understand more intuitively how the different decoders in our model decode the feature maps, we followed the attention visualization methods of DETR-like models [10,25] and conducted a series of attention map visualization experiments. This method registers a hook on the decoder layers to capture their attention weights and then visualizes them for each detected object. From the visualization, we can analyze which part of the image each of the decoder's object queries is looking at: the regions a query attends to receive higher attention weights, which the model then uses to predict specific bounding boxes and classes. We use this visualization to observe which parts of the image the queries of our different decoders attend to.
Cross-attention visualization of DAB-DETR: We first show cross-attention maps in the original DAB-DETR model, as shown in
Figure 3a. This provides a baseline for comparison, revealing the focus of the decoder’s attention without any specific improvement component.
Cross-attention visualization of content learning: We then analyzed the cross-attention weights in the content learning stage. As shown in Figure 3b, for the content queries, the model mainly focuses on the core part of each object in the image to capture the object content, and it ignores the background.
Cross-attention visualization of position learning: In the visualization of the position learning stage, as shown in Figure 3c, the model's attention is more focused on the boundary area of the object, with relatively less attention paid to the object's interior. This indicates that the model successfully learns the location information of objects at this stage.
Cross-attention map visualization of joint content and position learning: Figure 3d shows the decoder attention map when the content learning and position learning stages are combined, revealing the synergistic effect when these two components work together and their impact on the attention focus and detection performance of the original model's detection queries.
Object detection visualization: We also show the object detection results predicted by the decoder, outlined by blue rectangles, as shown in Figure 3e. This provides an intuitive display of the results. At the same time, the queries that produce these predictions correspond to the preceding attention maps; for example, for the query that predicts 'sports ball', the area of focus is shown in the first attention map of the first four rows.
From the above visualizations, we can see that although all decoders use the cross-attention mechanism, they attend to different parts of the image at different learning stages. This further shows that the attention distributions of the three decoders we designed differ when handling their respective tasks and are consistent with their design goals.
4.6. Experimental Results on Small Objects
In real-life scenes, detecting small objects in images is challenging because the number of pixels is limited, details are often difficult to capture, and there are higher requirements for image understanding.
The proposed model learns and understands image content from multiple aspects. The performance comparison in Table 1 shows that, compared with the baseline, one of the achievements of our proposed model is improved detection of small objects on the COCO dataset.
Figure 4 shows the comparison results of our model and baseline methods on images containing small objects from the COCO dataset.
The detection results in Figure 4 show that our model provides better detection and localization results for small objects on the COCO dataset, whereas the baseline method [10] either misses or inaccurately locates some of these objects, such as those in the red boxes in Figure 4b. Compared with the baseline results, our method better detects the small objects in the red box areas without being specifically trained or fine-tuned on a small object dataset. Detecting small objects usually requires a better understanding of image details [23], so these results suggest that our model has a better understanding of the content and details of the image. We attribute this to our auxiliary, independent learning of, and emphasis on, the content and location information of the image; the separate content learning and position learning stages of our model (described in Section 3.2 and Section 3.3) enhance its ability to capture the complex details of objects.
Our results on the COCO dataset show the promise of our improved method for enhancing small object detection, which we attribute to its multi-level, deeper understanding of image details. While we did not train on a dedicated small object dataset, nor was the model specifically designed to generalize across a variety of datasets, our results demonstrate its potential in the context in which it was trained. Future work could study its adaptability and efficacy on datasets dedicated to small object detection or investigate design changes for better generalization.
5. Conclusions
In this study, we propose an innovative multi-component object detection technique that seamlessly integrates a joint localization-and-classification learning stage, a content learning stage, and a position learning stage within the DETR framework. Our approach divides object detection into three distinct stages. Starting with the joint learning of localization and classification lays a strong foundation for the subsequent in-depth exploration of content and accurate learning of object locations. We introduce a decoder structure designed explicitly for understanding object content information, using an independent content learning mechanism that enables the model to capture objects' complex details and characteristics meticulously. Furthermore, our position learning architecture emphasizes capturing precise object locations, ensuring that the model can identify object locations in multi-faceted scenarios. Through testing on standard benchmarks, our proposed method consistently exhibits strong performance in object detection tasks, outperforming established baselines. We believe our approach reveals potential improvements in object detection and demonstrates the effectiveness of this refined research direction.
However, we acknowledge that our method, like all research, has limitations. First, although our model performs well on benchmarks, its generalization to diverse datasets and real-world scenarios needs further exploration. Second, there is the issue of computational efficiency when processing datasets, especially large-scale ones, because our multi-stage learning mechanism can be resource-intensive. Moving forward, we aim to address these limitations by optimizing our model's computational architecture to enhance its efficiency. We will also investigate applying our method to broader datasets and real-world scenarios to ensure robustness and versatility. Additionally, we plan to explore integrating unsupervised learning techniques to improve the model's performance in less controlled environments, paving the way for more generalizable object detection systems. In conclusion, our research focuses on more refined and detailed object detection strategies, providing new perspectives for improving accuracy and reliability in this field.