Pairwise CNN-Transformer Features for Human–Object Interaction Detection

Human–object interaction (HOI) detection aims to localize and recognize the relationship between humans and objects, which helps computers understand high-level semantics. In HOI detection, two-stage and one-stage methods have distinct advantages and disadvantages. The two-stage methods can obtain high-quality human–object pair features based on object detection but lack contextual information. The one-stage transformer-based methods can model good global features but cannot benefit from object detection. The ideal model should have the advantages of both methods. Therefore, we propose the Pairwise Convolutional neural network (CNN)-Transformer (PCT), a simple and effective two-stage method. The model both fully utilizes the object detector and has rich contextual information. Specifically, we obtain pairwise CNN features from the CNN backbone. These features are fused with pairwise transformer features to enhance the pairwise representations. The enhanced representations are superior to using CNN and transformer features individually. In addition, the global features of the transformer provide valuable contextual cues. We fairly compare the performance of pairwise CNN and pairwise transformer features in HOI detection. The experimental results show that the previously neglected CNN features still have a significant edge. Compared to state-of-the-art methods, our model achieves competitive results on the HICO-DET and V-COCO datasets.


Introduction
Human-object interaction (HOI) detection aims to recognize interactions between humans and objects in a scene. This task requires the precise localization of human-object pairs and the identification of the associated actions. HOI detection can facilitate many downstream tasks, such as anomaly detection [1], image understanding [2], and action recognition [3]. HOI detection methods [4] can be categorized into two-stage methods [5][6][7], traditional one-stage methods [8][9][10], and end-to-end one-stage methods [11,12]. Two-stage methods leverage an object detector to obtain confidence scores for humans and objects, subsequently filtering out low-scoring detections while retaining high-quality instances. However, these methods face challenges in capturing robust contextual cues. Traditional one-stage methods improve inference speed by parallelizing object detection and interaction prediction. However, matching human-object pairs with interactions requires complex strategies. Inspired by DEtection TRansformer (DETR) [13], end-to-end one-stage methods maintain a concise architecture and avoid the matching process between human-object pairs and interactions. These methods leverage global features from the transformer [14], providing the model with rich contextual information. However, they cannot acquire prior knowledge from object detection. Based on the analysis above, our goal is to design a unified model that combines the advantages of both methods: it can take advantage of the object detector while also having rich contextual information.
Determining how to use the object detector effectively. We note that the feature pyramid network (FPN) [15] and HRNet [16] successfully enhance feature representations by fusing low-resolution features with high-resolution features. Inspired by this, we explore using an object detector to extract multiple types of features, aiming to enhance the feature representations.
Current one-stage methods [11,17] outperform two-stage methods [18,19] across the board, so many studies ignore the potential of two-stage methods. Recently, UPT [20] analyzed why one-stage methods are superior to two-stage methods: this is mainly due to the use of more efficient object detectors and the representational capabilities of transformers. The two-stage method SCG [21] achieved a significant performance improvement by replacing Faster R-CNN [22] with DETR [13] as the object detector; its performance is close to that of the one-stage method QPIC [11]. Inspired by UPT, our model selects DETR as the object detector. DETR inspired a series of one-stage transformer-based methods, and these methods use both a convolutional neural network (CNN) backbone and a transformer architecture [14]. However, existing research focuses on using the transformer features to predict HOIs [11,17,[23][24][25] while ignoring the CNN features. Thus, the intuitive scheme is to fuse CNN and transformer features.
Determining how to obtain contextual cues. The interaction head of two-stage methods [20,26] uses a transformer encoder structure, where the input is only the human-object pair features. One-stage methods [11,17] use a transformer decoder structure, where the inputs are global features and learnable queries. In two-stage methods, the simple concatenation of human and object features lacks sufficient contextual information. However, the global features from the transformer have proven effective in one-stage methods. Therefore, we propose introducing these global features to provide more explicit contextual cues. In addition, introducing global features allows us to visualize the attention of the model, facilitating future research.
To achieve the above goals, we propose the Pairwise CNN-Transformer (PCT), a simple and intuitive two-stage detector that utilizes pairwise CNN and pairwise transformer features. Figure 1 illustrates our PCT pipeline:

1. We extract features of human and object instances from both the CNN backbone and the transformer architecture.

2. The instance features are fused and paired to form enhanced human-object pair features.

3. The enhanced features are fed into the interaction head with global features to predict actions.

Determining why previous work did not compare transformer features and CNN features. Since one-stage transformer-based methods [11,17] use a set of queries to predict HOIs, the number of learnable action queries is fixed. Two-stage methods [5,6,20] generate human-object pairs by pairwise matching, so the number of human-object pairs is variable. These different selection strategies for human-object pairs make it challenging to compare the performance of CNN and transformer features fairly. Although UPT [20] has the conditions to compare them, it does not investigate this issue. To fairly compare their performance in HOI detection, we construct three variant models of UPT. Experimental results show that CNN features outperform transformer features in all three sets of comparative experiments.
In summary, our contributions are as follows:

1. We propose fusing the CNN and transformer features of instances to enhance the feature representations. These enhanced representations effectively improve the precision of HOI detection while introducing little computational cost.

2. We propose utilizing global features from the transformer to enrich contextual information. Through cross-attention, human-object pair features can aggregate valuable information from these global features.

3. We fairly compare the performance of pairwise CNN and pairwise transformer features in HOI detection. The results show that CNN features outperform transformer features.

4. Our model achieves competitive performance on two commonly used HOI datasets, HICO-DET and V-COCO.
We propose a simple and efficient two-stage method, PCT. Compared with the recent two-stage method UPT, our method performs better in terms of accuracy, trainable parameters, and inference time. In addition, two-stage methods have a shorter training time and a smaller training memory footprint than one-stage methods. The code is available at https://github.com/hutuo1213/PCT (accessed on 25 January 2024).

Object Detection
The object detection task requires locating and identifying objects. This task is the cornerstone of understanding high-level semantics. In recent years, object detection has rapidly evolved from two-stage methods, such as Faster R-CNN [22], to end-to-end methods, such as DETR [13]. Traditional HOI detection methods acquire accurate human and object instances through object detection. Both object detection and HOI detection involve localization and recognition tasks, which makes it possible to migrate object detection strategies to HOI detection. In summary, object detection continues to drive the development of HOI detection. Our method employs the more advanced DETR as the object detector, which consists of a CNN backbone and a transformer network. DETR simplifies the detection process by transforming object detection into a set prediction problem.

Feature Fusion
Effective feature fusion enhances the expressive power of features. FPN [15] and HRNet [16] fuse low-resolution features with high-resolution features to enhance them. MSFIN [27] obtains rich information by fusing features from different scales. Kansizoglou et al. [28] explore training strategies for multi-modal fusion. We observe that explicit methods for feature fusion have not been proposed in HOI detection. Therefore, we explore how to fuse features to enhance their representations.

Human-Object Interaction Detection
HOI detection can be divided into two-stage and one-stage methods. Two-stage methods first perform object detection and then interaction prediction, while one-stage methods perform object detection and interaction prediction simultaneously. HO-RCNN [5], a typical two-stage CNN-based method, proposes a multi-stream architecture to exploit spatial information: the network uses an object detector to obtain the spatial layout of human-object pairs and constructs spatial features from it. Subsequent studies have constructed different feature priors based on object detection, including visual features of human-object pairs [7], word embeddings of objects [24,29,30], human pose features [19], and body part features [31,32]. iCAN [6] uses an attention module to selectively aggregate HOI cues. MLCNet [19] combines almost all available feature priors to predict HOIs. In addition, graph neural networks have been used to reason about relationships between humans and objects [18,30]. One-stage CNN-based HOI methods [8][9][10] directly predict interactions, increasing the speed of model inference. However, they must match human-object pairs with interactions, resulting in a more complex design.
Inspired by DETR [13], one-stage transformer-based methods are growing in popularity. Traditional two-stage methods use pairwise matching of humans and objects to predict HOIs, whereas one-stage transformer-based methods directly predict HOIs using a set of queries. QPIC [11] modifies DETR for HOI detection tasks. HOTR [12] uses interaction representations to infer relevant human-object pairs. DOQ [24] performs knowledge distillation based on QPIC. CDN [17] avoids the challenges of multi-task learning by cascading object detection and action classification modules. GEN-VLKT [25] employs the powerful language-image model CLIP [33] to guide feature learning. These one-stage methods utilize the transformer's global features to obtain good performance in HOI detection, which suggests that the global features contain rich interaction cues.
In our work, we combine the advantages of the two-stage and one-stage methods. Like UPT [20] and STIP [34], our method is a two-stage transformer-based method. However, our research focuses on improving the expressiveness of features, while UPT and STIP focus more on model design.

Method
The HOI detection task is to accurately locate humans and objects in a scene and identify the action categories among them. The principle of our method is based on the Information Gain (IG) in information theory. Specifically:

Gain(P, F) = H(P) − H(P|F)

where P is the prediction set of the model, and F is the interaction features. Gain(P, F) denotes the IG corresponding to the performance in HOI detection. H(P) and H(P|F) denote the entropy and the conditional entropy, respectively. H(•) corresponds to the network model. We can reduce the prediction uncertainty (H(P|F) decreases) by enhancing the expression of the interaction feature F, ultimately obtaining a larger IG. Section 3.1 introduces our PCT framework. Sections 3.2 and 3.3 describe the PCT's two critical components. Section 3.4 presents the training and inference details of the PCT.

Overview
Figure 2 shows our PCT framework. The weights of DETR are frozen throughout the entire training process. In the DETR [13] part, the image i ∈ R^{3×H×W} is input to the CNN backbone to generate feature maps z ∈ R^{2048×(H/32)×(W/32)}. After adding positional encoding, the feature map z performs self-attention (transformer encoder) to model global features G. Subsequently, the global features G and learnable object queries q ∈ R^{100×256} perform cross-attention (transformer decoder) to generate human and object instances f ∈ R^{100×256}. These instances are decoded into object categories and bounding boxes through a multilayer perceptron (MLP). After filtering out instances with low confidence, the retained bounding boxes are used to extract human and object instances from the CNN backbone. Then, human and object instances are pairwise matched to generate pairwise CNN features and pairwise transformer features, and these CNN and transformer features are added to obtain human-object pair features. In the interaction head part, the human-object pairs and global features perform cross-attention (transformer decoder layer) to aggregate contextual information. Next, the human-object pair features perform self-attention (transformer encoder layer) to refine them [20]. The refined features are decoded into action categories by an MLP.
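As a sanity check on the shapes above, the following pure-Python sketch (our own illustrative code, not from the PCT implementation; the function name `pct_shapes` is invented) traces how an H × W image maps to the feature-map, global-feature, and instance tensors:

```python
def pct_shapes(H, W, n_queries=100, d_model=256):
    """Trace the tensor shapes flowing through the DETR part, as described above."""
    # CNN backbone: a 3 x H x W image yields a 2048 x H/32 x W/32 feature map
    fmap = (2048, H // 32, W // 32)
    # The encoder flattens the spatial grid into (H/32 * W/32) tokens of dim d_model
    global_feats = (fmap[1] * fmap[2], d_model)
    # The decoder's n_queries object queries cross-attend to the global features
    instances = (n_queries, d_model)
    return fmap, global_feats, instances

fmap, g, inst = pct_shapes(800, 1216)
# fmap: (2048, 25, 38), g: (950, 256), inst: (100, 256)
```

The instance features f and the global features G are then reused downstream: f is paired and fused with CNN features, while G serves as the memory for cross-attention in the interaction head.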

CNN Features for Human-Object Pairs
Despite the success of the transformer in DETR, it has certain limitations in capturing local details. CNNs are more suitable than transformers for capturing local information. Therefore, we fuse CNN and transformer features to improve the expressiveness of the features.
First, we discuss how DETR works [35]. Figure 3a shows the original image. Figure 3b shows the encoder self-attention at a reference point; we observe that the encoder can separate different object instances. Figure 3c shows the decoder attention of an object query; the decoder focuses on features at the edges of objects, such as legs and heads, to better distinguish different objects. Then, we consider how to achieve our goal. Object queries activate the corresponding object regions in the image, so the final representations contain object features (transformer features). In addition, the CNN features of objects can be obtained from the region of interest (ROI) in the CNN backbone. Therefore, we can easily align and fuse the CNN and transformer features of instances.
Specifically, according to the detection results of DETR, we first use non-maximum suppression and a score threshold to retain high-scoring instances. These instances build a set {d_i}_{i=1}^{n}, where s_i is the confidence score, c_i ∈ K is the object class, K is the set of object categories, and f_i ∈ R^{1×256} is a human or object instance. The transformer features set {f_i}_{i=1}^{n} comprises n1 humans (f_{T-h} ∈ R^{n1×256}) and n2 objects (f_{T-o} ∈ R^{n2×256}). Next, we map the bounding boxes b to the last layer of the ResNet [36] backbone to obtain the ROI. Subsequently, global average pooling is applied to these regions to generate CNN features. Finally, the dimensions of the CNN and transformer features are aligned. Specifically:

f_{Ins} = GAP(ROI(z, b))

f_{C-h} = Relu(LN_1(f_{Ins} W_1)), f_{C-o} = Relu(LN_2(f_{Ins} W_2))

where z ∈ R^{2048×(H/32)×(W/32)} denotes the last-layer feature map of the ResNet backbone, b ∈ R^{n×4} denotes the object bounding boxes, ROI denotes the region of interest operation, and GAP denotes the global average pooling operation. f_{Ins} ∈ R^{n×2048} represents instance features of both humans and objects. W_1, W_2 ∈ R^{2048×256} denote the linear layers, LN_1, LN_2 denote layer normalization, and Relu denotes the ReLU function. f_{C-h} ∈ R^{n1×256} and f_{C-o} ∈ R^{n2×256} denote the CNN features of the humans and objects, respectively. Next, as shown in Figure 4, we pairwise match human and object instances by concatenation; n1 humans and n2 objects generate m = n1 × (n1 + n2 − 1) human-object pairs. Then, the pairwise CNN features and pairwise transformer features are added element-wise. Specifically:

Z = (f_{C-h} ⊕ f_{C-o}) + (f_{T-h} ⊕ f_{T-o})

where ⊕ denotes concatenation and + denotes element-wise addition. Z ∈ R^{m×512} denotes the human-object pair features, where m denotes the number of pairs.
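The feature construction above can be sketched in pure Python with toy dimensions. This is our own illustrative code, not the paper's implementation: `roi_gap` stands in for the ROI + GAP step (omitting the linear projection and layer normalization), and `pair_and_fuse` enumerates the m = n1 × (n1 + n2 − 1) pairs, concatenating features within each modality and adding the two modalities element-wise:

```python
def roi_gap(fmap, box):
    """Global average pooling over a region of interest.
    fmap: list of C channel grids (each a list of rows);
    box: (x1, y1, x2, y2) in feature-map coordinates (image coords / stride 32)."""
    x1, y1, x2, y2 = box
    pooled = []
    for ch in fmap:
        cells = [ch[y][x] for y in range(y1, y2) for x in range(x1, x2)]
        pooled.append(sum(cells) / len(cells))
    return pooled  # one C-dimensional CNN feature per instance

def pair_and_fuse(cnn_h, cnn_o, tr_h, tr_o):
    """Pair each human with every other instance, then add the pairwise CNN
    features and pairwise transformer features element-wise."""
    partners = [("h", i) for i in range(len(cnn_h))] + [("o", i) for i in range(len(cnn_o))]
    pairs = []
    for hi in range(len(cnn_h)):
        for kind, j in partners:
            if kind == "h" and j == hi:
                continue  # a human is not paired with itself
            cnn_p = cnn_h[j] if kind == "h" else cnn_o[j]
            tr_p = tr_h[j] if kind == "h" else tr_o[j]
            cnn_pair = cnn_h[hi] + cnn_p              # concatenation (⊕)
            tr_pair = tr_h[hi] + tr_p
            pairs.append([a + b for a, b in zip(cnn_pair, tr_pair)])  # element-wise +
    return pairs  # m = n1 * (n1 + n2 - 1) fused pair features
```

With n1 = 2 humans and n2 = 1 object, `pair_and_fuse` returns 2 × (2 + 1 − 1) = 4 pairs, matching the formula above.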

Interaction Head
We argue that simply concatenating humans and objects does not provide contextual clues. Given that previous studies often use global features to provide interaction cues, we introduce global features in the interaction head. In Section 4.6, we visualize the decoder attention of the interaction head; human-object pairs are effectively localized and highlighted when global features are applied. Specifically, the global features G are extracted from the transformer encoder of DETR [13]. We use a standard transformer decoder layer that performs cross-attention between the human-object pair features and the global features, as follows:

output = Dec(Z, G)

where Dec denotes the transformer decoder layer, and output ∈ R^{m×512} denotes the output features.
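As a rough illustration of this cross-attention step, the following toy single-head scaled dot-product attention (our own sketch; a real transformer decoder layer adds learned projections, multiple heads, residual connections, LayerNorm, and a feed-forward network) shows how each human-object pair feature, acting as a query, aggregates context from the global-feature tokens:

```python
import math

def cross_attention(queries, memory):
    """Single-head scaled dot-product cross-attention in pure Python.
    queries: human-object pair features (m x d); memory: global features (t x d)."""
    d = len(queries[0])
    out = []
    for q in queries:
        # attention logits: similarity of this pair to each global token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in memory]
        m = max(scores)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over global-feature tokens
        # weighted sum of the global tokens (values)
        out.append([sum(w * v[j] for w, v in zip(weights, memory))
                    for j in range(len(memory[0]))])
    return out
```

A query that is more similar to one global token receives a larger softmax weight for it, so the output leans toward the image regions relevant to that human-object pair.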
Inspired by UPT [20], we use a transformer encoder layer to refine the features of human-object pairs. Specifically:

output′ = Enc(output)

where Enc denotes the transformer encoder layer. Finally, we decode the output into action classification scores. Specifically:

s_k = (output′ W)_k

where s_k denotes the unnormalized score of the k-th possible action of the human-object pair, and W denotes the linear layer. For HICO-DET [5], we use W ∈ R^{512×117}; for V-COCO [37], we use W ∈ R^{512×24}.

Training
Following UPT, we use focal loss [38] with a sigmoid layer to alleviate the imbalance between positive and negative samples and maintain numerical stability [20]. The specific formula is as follows:

ŷ = s_h · s_o · σ(s_k)

FL(ŷ, y) = −β(1 − ŷ)^γ log(ŷ) if y = 1, and −(1 − β)ŷ^γ log(1 − ŷ) if y = 0

where s_h and s_o are the confidence scores of the human and object, respectively. σ is the sigmoid function, σ(s_k) ∈ [0, 1] is the normalized action score, and ŷ ∈ [0, 1] is the final action score. y ∈ {0, 1} is the binary ground-truth label. β ∈ [0, 1] and γ ∈ R+ are hyperparameters.
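A minimal sketch of a binary focal loss of this kind on a single action logit (our own illustrative code with invented defaults for β and γ; production implementations such as UPT's compute the loss directly on logits for numerical stability, whereas this toy version applies the sigmoid explicitly):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss(s_k, y, s_h=1.0, s_o=1.0, beta=0.5, gamma=2.0):
    """Binary focal loss on one action logit s_k, modulated by the detection
    scores s_h and s_o. beta balances positives vs. negatives; gamma
    down-weights easy, already well-classified examples."""
    y_hat = s_h * s_o * sigmoid(s_k)  # final action score
    if y == 1:
        return -beta * (1 - y_hat) ** gamma * math.log(y_hat)
    return -(1 - beta) * y_hat ** gamma * math.log(1 - y_hat)
```

With γ = 0 and β = 0.5 this reduces to half the standard binary cross-entropy, which is one way to sanity-check an implementation.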

Inference
Based on previous research [5], we exclude invalid combinations of actions and objects, such as holding an airplane. Following SCG [21], we implement an instance suppression strategy [20,31] (as shown in Figure 5) to avoid excessive object scores dominating the final HOI scores. Specifically, the detection scores are raised to a power of λ = 2.8, and the final HOI scores are calculated as follows:

s_k^{HOI} = (s_h · s_o)^λ · σ(s_k)
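The scoring rule above can be sketched as follows (our own illustrative code; raising detection scores in [0, 1] to a power λ > 1 shrinks low-confidence detections much more than near-saturated ones, so the action score is not drowned out by the object scores):

```python
import math

def hoi_score(s_h, s_o, s_k, lam=2.8):
    """Final HOI score: suppressed detection scores times the sigmoid action score.
    s_h, s_o: human/object detection confidences in [0, 1]; s_k: action logit."""
    sigma = 1.0 / (1.0 + math.exp(-s_k))  # normalized action score
    return (s_h * s_o) ** lam * sigma
```

For two candidate pairs with identical detections, the ranking is then decided entirely by the action logit; the λ exponent only rescales pairs with weaker detections downward.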

Experiments
Section 4.1 details our experiment settings. Section 4.2 compares the results of our method with state-of-the-art methods. Section 4.3 presents our model's ablation study and comparative study. Section 4.4 presents the computational cost study. Section 4.5 compares the expressive power of CNN features and transformer features. Section 4.6 presents qualitative results and limitations. Unless otherwise specified, the models in this section use the ResNet-50 backbone and report their performance on the HICO-DET [5] dataset.

Experiment Settings

Datasets
The HICO-DET dataset contains 37,633 valid training images and 9546 valid testing images, covering 80 objects, 117 actions, and 600 interactions. The V-COCO [37] dataset contains 4969 valid training images and 4532 valid testing images, covering 29 actions. Following UPT [20], we report results for 24 actions in the V-COCO dataset.

Evaluation Metrics
We report the mean Average Precision (mAP) on the HICO-DET and V-COCO datasets. A prediction is considered a true positive if the intersection over union (IoU) of both the human and object boxes with the ground-truth boxes is higher than 0.5 and the HOI class is correct. For HICO-DET, we evaluate mAP under the default and known-object settings. Each setting contains three HOI sets: full (600 HOIs), rare (138 HOIs with fewer than 10 training instances), and non-rare (462 HOIs with 10 or more training instances). For V-COCO, we evaluate the mAP of Scenario 1 and Scenario 2. For cases without role annotations, Scenario 1 requires the role prediction to be null, while Scenario 2 ignores the role prediction.
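The true-positive criterion can be sketched with boxes in (x1, y1, x2, y2) format (our own illustrative helper functions, not the official evaluation code):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_h, pred_o, gt_h, gt_o, correct_class):
    """Both boxes must overlap their ground truth with IoU > 0.5,
    and the predicted HOI class must be correct."""
    return correct_class and iou(pred_h, gt_h) > 0.5 and iou(pred_o, gt_o) > 0.5
```

For example, two 2 × 2 boxes offset diagonally by one pixel have an IoU of 1/7, well below the 0.5 threshold, so such a localization would not count as a true positive.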

Implementation Details
For HICO-DET, we use the DETR model [20] fine-tuned on HICO-DET. For V-COCO, we use the DETR model [20] trained from scratch on MS COCO [39]. Our data augmentation techniques include random cropping, image scaling, and color jittering [11,13]. We collect instances with scores greater than 0.2 and keep a maximum of 15 humans and 15 objects. The hyperparameters of the focal loss are the same as in UPT [20]. Our model is trained for 20 epochs using the AdamW [40] optimizer. The initial learning rate is 10^−4, and the learning rate is reduced to 10^−5 at the 10th epoch. The training uses an NVIDIA Tesla V100 device with a batch size of 16.
As shown in Table 3, without instance suppression (λ = 1), PCT with ResNet-101 outperforms PCT with ResNet-50 by 0.38 mAP. As λ gradually increases to 2.8, the performance gap between these two models decreases to 0.16 mAP. The instance suppression strategy proposed by TIN [31] also reduces the gap. These experimental results show that instance suppression strategies narrow the performance gap between the two models.

We show the results of the ablation study in Table 4. The interaction head of the baseline model is a transformer encoder layer. Compared to the baseline, adding a transformer decoder layer improves performance by about 2 mAP, adding the CNN features of human-object pairs improves it by about 3.3 mAP, and the complete PCT model improves it by about 4 mAP. In addition, removing the encoder layer from the complete model degrades performance by about 0.8 mAP.

Following UPT [20], we show the effect of model components on the interaction scores. Since we freeze the object detector, different interaction heads use the same object detection results. We divide the negative samples according to the baseline model: an interaction score less than 0.05 defines an easy negative sample; otherwise, it is a hard negative sample. In Table 5, we explore the impact of each component on positive and negative samples by controlling variables. Table 5 shows that adding CNN features increases the average score of positive samples by 0.0167 and decreases the average score of hard negative samples by 0.0152, widening the gap between positive and negative sample scores. Adding the transformer decoder layer improves the average score of positive samples by 0.0127 but has almost no effect on hard negative samples. Therefore, both of our proposed components are effective. Finally, our complete model improves the average score of positive samples by 0.0332 and reduces the average score of hard negative samples by 0.0273, further enlarging this gap. This indicates that the two components are complementary.

Contrast Study
We show the results of different interaction heads in Table 6. Our interaction head first uses a transformer decoder layer and then a transformer encoder layer (D.E.+E.N., 33.63 mAP). The interaction head with the opposite order of submodules (E.N.+D.E., 33.41 mAP) performs about 0.2 mAP worse than ours. We speculate that in this arrangement the decoder layer must both perform attention and generate the HOI representations, which is challenging to balance; in addition, the design is inflexible due to the long migration distance of the global features. The interaction head composed of two transformer encoder layers (E.N.+E.N., 33.15 mAP) is about 0.5 mAP lower than ours, which indicates that the global features provide effective HOI clues. Although the interaction head consisting of two transformer decoder layers (D.E.+D.E., 33.63 mAP) matches ours, the performance does not improve further.

Computational Cost Study
Our network structure is straightforward, but the feature extraction method may increase the model's complexity. To clarify the complexity of our model, Table 7 compares the computational cost of our method with classical one-stage and two-stage methods. In addition, Table 8 shows the computational cost of our key components. All experiments were performed on an NVIDIA GeForce GTX 1080 with an average input image size of 3 × 887 × 1055. Table 7 shows that our model (13.02 fps) lags behind only the one-stage method QPIC (13.78 fps) in inference speed and outperforms the other HOI detectors. Both UPT and our method are two-stage transformer-based methods. Compared to UPT (13.24 M), our model (8.60 M) has fewer trainable parameters and a higher FPS, and is only slightly higher than UPT in terms of FLOPs. As shown in Figure 6, although UPT has lower FLOPs than ours, it has more submodules, which affects the inference speed. Table 8 shows that our key components add a small number of parameters and cause a slight FPS reduction. The increase in FLOPs is mainly caused by the decoder layer, which performs cross-attention between the global features and the human-object pair features.

Comparing CNN and Transformer Features
To fairly compare the performance of CNN and transformer features in HOI detection, we designed three variant models based on UPT. These models follow the structure shown in Figure 7, where the backbone CNN and the cooperative layer (Coop.) can be replaced. The modified encoder (M.E.) adds spatial information compared to the vanilla encoder (E.N.). Table 9 shows that CNN features outperform transformer features in all variant models. When the vanilla transformer encoder layer (E.N.) is used as the cooperative layer, CNN features (31.16 mAP) outperform transformer features (30.39 mAP). We attribute this to the better local detail and stability of CNN features. Notably, when the cooperative layer uses the modified transformer encoder layer (M.E.), CNN features (32.17 mAP) further improve performance, while transformer features (30.22 mAP) perform unfavorably. We attribute this to the modified transformer encoder providing spatial information that the transformer features cannot effectively exploit. When we replace the backbone network with the more powerful ResNet-101, improvements are achieved using both CNN and transformer features. Moreover, transformers show strong adaptability [20,26] and expressiveness.

Figure 8 shows how the PCT model works qualitatively. We visualize the attention at the transformer decoder layer in the interaction head. Figure 8c shows the attention using only transformer features. We observe that the related human-object pair and its nearby human-object pairs are focused on, and we speculate that the global features provide implicit spatial clues. The model attends to neighboring human-object pairs because self-attention is applied between human-object pairs. Figure 8d shows the attention of the complete PCT model. Using the fused CNN and transformer features allows the model to focus more fully on the relevant areas of human-object pairs. We attribute this to the CNN effectively capturing local features and detailed information.

Qualitative analysis of HOI detection. Figure 9 demonstrates that our model can accurately identify HOI categories. However, our model does not perform well when the image is strongly occluded (as shown in Figure 9b) because the object score affects the final interaction score. Furthermore, we show several failure cases of the model. In Figure 10a, the false detection is caused by spatial relations similar to those of the straddling action. In Figure 10b, false detection occurs due to strong occlusion. In Figure 10c, the interaction score is too low due to fewer training samples. In Figure 10d, failing to detect the correct object leads to wrong predictions. In Figure 10e, semantic ambiguity is caused by ambiguous interaction actions. These qualitative analyses provide valuable clues for model optimization.

Conclusions
In this paper, we combine the advantages of two-stage and one-stage methods. Our PCT extracts CNN features through an object detector to improve the feature representation. Meanwhile, it uses the transformer's global features to provide contextual information. Our method achieves competitive performance with a simple and efficient architecture. We also fairly compare the performance of CNN and transformer features in HOI detection; the results show that CNN features are not inferior to transformer features. Our work may advance the study of feature fusion and lightweight networks for HOI detection. In addition, we speculate that the features generated by the object detector are more suitable for object classification than for interaction classification. In future work, we plan to adopt a universal visual pre-trained model to provide robust and high-quality features. Our method relies on human-object pair features to predict interactions, so it cannot predict scenarios that do not involve objects (such as running or falling). In addition, incorrect classification of objects by the object detector leads to wrong HOI predictions.

Figure 1 .
Figure 1. Our PCT pipeline. PCT is a two-stage transformer-based HOI detector. We use blue and green to represent humans and objects, respectively.

Figure 2 .
Figure 2. Our PCT framework comprises an object detector (DETR) and an interaction head. First, DETR detects humans and objects in the image, excluding instances with low confidence. Next, the retained humans and objects form human-object pairs through pairwise matching. Finally, the features of these human-object pairs and the global features are fed into the interaction head to predict action categories.

Figure 3 .
Figure 3. (a) Image with reference point. (b) Encoder self-attention for a reference point. (c) Decoder attention for an object query [35].

Figure 4 .
Figure 4. The process of pairwise matching between humans and objects.

Figure 5 .
Figure 5. The green curve indicates no instance suppression. The red and blue curves are the instance suppression strategies of TIN [31] and SCG [21], respectively. We follow the SCG strategy.

Figure 6 .
Figure 6. Comparison between UPT and our model.

Figure 7 .
Figure 7. Framework of the UPT variants. The position before the pairwise operations is called the cooperative layer.

Figure 8 .
Figure 8. Effect of pairwise CNN features on attention. Please zoom in to view.

Figure 9 .
Figure 9. Successful cases. We filter out human-object pairs with scores lower than 0.2. Please zoom in to view.

Table 1 .
Performance (mAP × 100) comparison on the HICO-DET dataset. The highest result for each setting is shown in bold.

Table 2 .
Performance (mAP × 100) comparison on the V-COCO dataset. The highest result for each setting is shown in bold.

Table 3 .
Results of implementing the instance suppression strategies.

Table 4 .
Ablation studies of our PCT components. The interaction head of the baseline model uses a standard encoder layer.

Table 5 .
Change in average interaction score when a component is added to the reference network. According to the prediction results of the baseline model, the samples are categorized as positive, easy negatives, and hard negatives. The number of samples in each category is in parentheses.

Table 6 .
Comparison of results for different interaction heads. The acronym E.N. stands for encoder layer, and D.E. stands for decoder layer. The order of the acronyms indicates the order of the submodules.

Table 7 .
Computational cost comparison. DETR is the object detector in our method. The number of trainable parameters is shown in parentheses.

Table 8 .
Costs of our model components. The Params column shows the number of trainable parameters. Changes in the number of parameters are shown in parentheses. All methods use ResNet-50 as the backbone network.

Table 9 .
Comparison of the results of CNN features and transformer features. The acronym E.N. stands for the standard encoder layer, and M.E. stands for the modified encoder layer. In addition, the cooperative layer does not use the dropout layer.