Multi-Resolution and Semantic-Aware Bidirectional Adapter for Multi-Scale Object Detection

Introduction
Object detection, a crucial task in computer vision, entails the classification and localization of pertinent objects within an image. As convolutional neural networks (CNNs) and vision transformers have experienced significant advancements, object detection methods have made considerable progress, contributing to the enhancement of recognition performance across diverse visual tasks. Numerous methods [1][2][3][4] have been proposed to enhance performance from various perspectives, demonstrating remarkable results on popular benchmarks like MS-COCO [5].
The object detection task involves predicting objects in natural, real-world scenes, which encompass objects of varying scales. Nevertheless, scale variation poses a challenging dilemma that hampers the performance of detection methods. Several studies [6,7] have confirmed the sensitivity of CNNs to object scale and image resolution. Moreover, following a series of pooling and convolution operations on the input image, the features lose a noticeable amount of information, particularly pertaining to the fine details of small objects. Furthermore, there is an imbalance of information across different levels. High-level features encompass semantic information, albeit lacking in spatial details, while low-level features preserve detailed information but grapple with capturing semantic context. These issues have emerged as bottlenecks for contemporary detection algorithms.
The implementation of multi-level feature integration serves as an effective strategy to mitigate these issues. For instance, FPN [8] employed a top-down feature integration method to combine features at different scales. Nonetheless, the input features of FPN are directly extracted from the backbone network; they may have already lost original information during the inference process. Additionally, some studies [9] have validated the significance of semantic information in the high-level features. However, there exists insufficient exploration and utilization of the high-level features in FPN. Additionally, the merging approach incorporates a rigid fusion of two features, neglecting the variability in features across different levels. Relying solely on high-level features to direct low-level features leads to an absence of low-level spatial information in the high-level features. Moreover, the direct fusion process from top to bottom may dilute the semantic information within the high-level features. This suggests an insufficient interaction among multi-level features. As depicted in Figure 1, the baseline network (FPN) in the left column struggles to effectively address multi-scale object challenges, resulting in numerous false positive cases. This is primarily due to the underutilization of features and the absence of precise object representations. The MSBA-based results exhibit a significant reduction in false positives and a qualitative performance enhancement. AP_S, AP_M, and AP_L denote the AP of small, medium, and large objects. In the bottom row, a similar trend is observed for Mask R-CNN, where our approach (right) consistently outperforms the baseline (left). AP^bbox and AP_S^bbox pertain to detection performance and bbox AP for small objects. AP^mask and AP_L^mask correspond to instance segmentation performance and mask AP for large objects.
To mitigate the constraints of current approaches, we introduce a novel and potent framework, termed a multi-resolution and semantic-aware bidirectional adapter for multi-scale object detection (MSBA). More precisely, this framework comprises three sequential components: multi-resolution cascaded fusion (MCF), a semantic-aware refinement transformer (SRT), and bidirectional fine-grained interaction (BFI). Respectively, these three components target the input, enhancement, and interaction aspects of the feature integration process through a coarse-to-fine strategy. The MCF component receives inputs in the form of multi-stage features and multi-resolution images from the backbone. It then adaptively extracts suitable multi-level features tailored to distinct object instances through a cascaded fusion strategy involving multiple receptive fields. Additionally, SRT is introduced to enhance the multi-scale semantic representation by refining both detailed and global semantic information while minimizing computational costs. SRT is designed with a semantic association strategy and employs multi-branch attention to effectively integrate semantic information across diverse scales. Moreover, to achieve a versatile and effective feature interaction, we introduce BFI, a mechanism for establishing a bidirectional flow of information. The bottom-up interaction is intended to furnish spatial guidance transitioning from low-level to high-level layers, fostering interaction across multiple levels. By leveraging intricate spatial information from low-level layers, high-level layers can effectively identify salient regions and provide enhanced semantic information with greater accuracy. Conversely, the top-down interaction is employed to establish semantic enhancement from high-level layers to low-level layers. Building upon the copious semantic information in the high-level layers, low-level layers can exhibit a comprehensive comprehension of object instances. In conclusion, the introduced
coarse-to-fine process allows for the attainment of a more potent representation of objects across multiple scales.
Thorough experiments are carried out to validate the efficacy of the proposed approach. The introduced MSBA serves as a plug-and-play framework that seamlessly integrates with diverse backbones and detectors. On the MS-COCO dataset, our method consistently outperforms state-of-the-art methods, achieving superior performance across different backbones and detectors, without any additional bells and whistles. As depicted in Figure 1, our detection results, presented in the second column, demonstrate superiority in accurately detecting multi-scale objects. In summary, this study offers the following key contributions:

• To mitigate the challenge of scale variation, we introduce a novel multi-resolution and semantic-aware bidirectional adapter for multi-scale object detection, referred to as MSBA. It alleviates the scale-variant issue by addressing the input, refinement, and interaction facets of feature integration.

• Our proposition, MSBA, is composed of multi-resolution cascaded fusion (MCF), a semantic-aware refinement transformer (SRT), and bidirectional fine-grained interaction (BFI). SRT is dedicated to refining the multi-scale semantic representation, while BFI is employed to foster ample interaction across various levels. Importantly, all these modules are pluggable.

• The proposed method is rigorously evaluated on the widely used MS-COCO dataset, demonstrating its superiority over state-of-the-art approaches. Thorough ablation experiments are conducted to confirm the efficacy of the proposed modules within the MSBA framework.

Related Work
Object detection is a fundamental task in computer vision that finds wide application in other visual fields, including remote sensing [10,11] and self-driving [12,13] technologies. It involves identifying and classifying objects of interest within an image. Object detection has made remarkable advancements in terms of accuracy and speed, thanks to convolution-based and transformer-based algorithms.

Object Detection
In the field of object detection, when it comes to convolution-based network architectures, most detectors can be organized into two types: two-stage detectors [14][15][16][17] and one-stage detectors [18][19][20][21]. Two-stage detectors can achieve better performance at the cost of longer computation time, while one-stage detectors show superiority in speed with inferior accuracy. In terms of the representation of the object, detectors can be divided into anchor-based and anchor-free detectors. Anchor-based [16,20] methods employ a multitude of anchor boxes to classify and locate objects, while anchor-free methods [22][23][24] utilize key points (e.g., center or corner points) for detection rather than relying on intricate manual design and hyperparameter settings. ATSS [25] has been proposed as a flexible label assignment method to narrow the discrepancy between anchor-free and anchor-based approaches. Recently, transformer-based methods [4,[26][27][28][29][30] have made significant advancements. DETR [4] is the first end-to-end detector based on transformer blocks that achieves comparable performance at a high computation cost. Subsequently, deformable DETR [26] was proposed to enhance performance while mitigating computation costs through the use of deformable attention strategies. Additionally, Sparse R-CNN [27] employs sparse boxes to accomplish multi-stage refinement using a combination of self-attention modules and iterative structures. MCCL [31] is introduced to apply a novel training-time technique for reducing calibration errors. NEAL [32] is dedicated to training an attentive CNN model without the introduction of additional network structures. PROB [33] presents a novel probabilistic framework for objectness estimation within the context of open-world object detection.

Approaches for Scale Variation
Scale variation in object instances poses a significant challenge in object detection, hindering the improvement of detection accuracy. Singh et al. introduce SNIP [6] and SNIPER [34] as solutions to address this issue. These methods acknowledge the sensitivity of CNNs to scale and advocate for detecting objects within a specified scale range. Consequently, a scale normalization training scheme is devised to facilitate the detection of objects at varying scales. These concepts have been widely adopted to acquire multi-scale information. However, SNIP exhibits high complexity, limiting its suitability for certain practical applications. FPN [8] introduces a novel feature pyramid architecture to address scale variation by merging adjacent layers from top to bottom. It has achieved significant advancements and serves as a fundamental structure in many detectors. However, there is still room for performance improvement. PANet [35] is subsequently proposed to enhance FPN by introducing a new bottom-up structure that shortens information propagation. Moreover, FPG [36] stacks multi-pathway pyramids to enrich feature representations. DSIC [37] utilizes a gating mechanism to dynamically control the flow of data, enabling the automatic selection of different connection styles based on input samples. Furthermore, to address scale variation, PML [38] designs an enhanced loss function by modeling the likelihood function. HRViT [39] combines high-resolution multi-branch architectures with vision transformers (ViTs). MViTv2 [40] includes residual pooling connections and decomposed relative positional embeddings. In contrast to the aforementioned methods, our approach incorporates both multi-stage features and multi-resolution images as suitable inputs, employing a cascaded fusion strategy. Furthermore, the proposed MSBA highlights the roles of different layers and maximizes information exchange between high-level and low-level layers to enhance feature representations.

Vision Transformer
The application of transformers in diverse visual tasks has made significant advancements. ViT [41] employs a standard transformer backbone for image classification, but this approach incurs significant computational overhead. Subsequently, a series of studies were conducted to enhance ViT. For instance, T2T-ViT [42] divides the image into overlapping patches as tokens, enhancing token interactions. TNT [43] investigates both patch-level and pixel-level representations using nested transformers. Additionally, CPVT [44] introduces implicit conditional position encodings that depend on the local context of the input token. Notably, the Swin transformer [45] introduces a hierarchical approach that incorporates multi-level features and window-based attention. Moreover, the application of the transformer to other vision tasks has achieved remarkable progress, such as video captioning [46,47], vision-language navigation [48,49], and visual voice cloning [50,51]. These excellent works have witnessed the milestone success of the vision transformer. Furthermore, numerous endeavors [52][53][54] have been dedicated to leveraging the strengths of both the CNN and the transformer, resulting in improved performance while reducing computational overhead. However, the majority of the aforementioned studies concentrate on enhancing the attention mechanism within individual feature states, disregarding the variations among features across different receptive fields. Conversely, our transformer-based approach can amalgamate global and local semantic information within high-level features, owing to the proposed effective attention mechanism. Furthermore, our proposed method places greater emphasis on exploring interactions among diverse receptive fields and accentuates the reusability of features to enhance their representational capacity.

Foundation
The overview of the proposed MSBA is illustrated in Figure 2. As depicted in Figure 2a, MCF comprises two feature information streams. {C̃_2, C̃_3, C̃_4, C̃_5} indicate the features derived from the multi-resolution input image, processed through multiple convolutions to capture sufficient coarse-grained information. {C_2, C_3, C_4, C_5} represent features from distinct stages of the single-resolution image produced by the backbone network. In Figure 2b, to ensure consistent notation within the same module, we employ {M_2, M_3, M_4, M_5} in BFI to denote features derived from MCF's output. SRT concentrates on enhancing the multi-scale semantic representation in the high-level feature, specifically targeting C_5. Additionally, BFI encompasses pixel-level filter interaction (PLI) and channel-wise prompt interaction (CWI). The output of PLI is denoted as {M'_2, M'_3, M'_4, M'_5}, where M'_2 remains unchanged (M_2) without any further operations. Similarly, {P_2, P_3, P_4, P_5} mirrors {M'_2, M'_3, M'_4, M'_5} and represents features resulting from PLI's output. Additionally, {P'_2, P'_3, P'_4, P'_5} signify features enriched with meticulous semantic prompt information, primed for predictions.
The matching gate functions as a controller, aiming to mitigate inconsistencies and redundancy arising from rigorous interaction between two features. It dynamically modulates the fusion process in response to the present input. In detail, given input features X, Y ∈ R^{c×h×w}, the matching gate G(·) can be described as:

G(X, Y) = F_mul(α_fine, X) + Y,

in which α_fine ∈ R^{c×1×1} represents the control matrix of X and F_mul denotes the Hadamard product. α_fine can be obtained from the switch (S) in the matching gate as:

α_fine = σ(O(X)),

where O(·) represents operations such as 3 × 3 convolution and pooling, and σ(·) signifies a nonlinear activation function, implemented as Tanh in our method. The matching gate adeptly fosters complementarity between the two features.
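As a concrete illustration, the gating described above can be sketched in PyTorch. The exact composition of O(·) is an assumption (the text only states it contains operations such as 3 × 3 convolution and pooling, with Tanh as σ(·)), so treat this as a minimal sketch rather than the paper's implementation:

```python
import torch
import torch.nn as nn


class MatchingGate(nn.Module):
    """Minimal sketch of the matching gate G(X, Y).

    O(.) is assumed to be a 3x3 conv followed by global average pooling,
    yielding the c x 1 x 1 control matrix alpha_fine; sigma(.) is Tanh.
    """

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)  # produce a c x 1 x 1 matrix

    def forward(self, x, y):
        # alpha_fine = sigma(O(X)), broadcast over spatial dimensions
        alpha = torch.tanh(self.pool(self.conv(x)))
        # G(X, Y): Hadamard product with X, then fusion with Y
        return alpha * x + y


gate = MatchingGate(64)
out = gate(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```

Because α_fine lies in (−1, 1), the gate can both amplify and suppress channels of X before the fusion, rather than merely re-weighting them positively.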

Multi-Resolution Cascaded Fusion
FPN employs a single-resolution image as its input to create a feature pyramid.It can partially mitigate the challenge of scale variation.However, this approach is limited since a single-resolution image can only offer a restricted amount of object information within a specific scale.Using high-resolution images as input can be advantageous for detecting small objects, yet it might lead to relatively lower performance in detecting larger objects.Conversely, utilizing low-resolution images as input may lead to subpar performance in detecting small objects.Consequently, employing a single-resolution image as input might not suffice for effectively detecting objects across various scales.
Hence, the inclusion of a multi-scale image input is crucial for detectors to gather a broader spectrum of object information across different resolutions. This observation motivates our introduction of the multi-resolution cascaded fusion, which integrates multi-resolution data into the network architecture, as illustrated in Figure 2a. Initially, the input image undergoes both backbone processing and direct downsampling to align with the sizes of C_i ∈ {C_2, C_3, C_4, C_5} from the backbone, yielding Cds_i ∈ {Cds_2, Cds_3, Cds_4, Cds_5}. Following this, the downsampled multi-resolution images undergo a sequence of convolution, batch normalization, and activation operations, culminating in the creation of corresponding features imbued with both coarse-grained spatial details and semantic insights. Furthermore, we employ a matching gate to adaptively manage the fusion process between the generated multi-resolution features and the multi-stage features derived from the backbone. This procedure can be described as:

C̃_i = Ψ_i(Cds_i),  C'_i = G(C̃_i, C_i),

Here, Cds_i refers to the input image downsampled to align with the spatial dimensions of C_i, with i representing the feature level index from the backbone. Ψ_i(·) represents a sequence of operations, including a 3 × 3 Conv, BN, and ReLU, to produce semantic features. Subsequently, we leverage C̃_i to merge with the corresponding C_i using the matching gate G(·), thereby generating a more effective feature. Additionally, we formulate a multi-receptive-field cascaded fusion strategy to extract multi-scale spatial information from the lower-level features. The entire procedure can be expressed as follows:

M_2 = C'_2,  M_i = R_i(C'_i + D(M_{i−1})), i ∈ {3, 4, 5},

where R_i signifies the convolution operator applied with different dilation rates and D(·) denotes downsampling to match spatial sizes. M_i corresponds to the input for the subsequent stage, enriched with ample coarse-grained and multi-scale spatial information. Notably, M_2 is derived from the matching gate without the incorporation of dilated convolution. Generally, our multi-resolution cascaded
fusion supplies diverse resolution information.The proposed MCF is advantageous for object instances of varying scales.Additionally, we employ a matching gate as a controller to dynamically regulate the interaction process between multi-resolution images and the multi-stage features of the backbone.This adaptively controlled process aids in avoiding the inclusion of unnecessary information.Furthermore, the proposed multi-receptive-field cascaded fusion strategy contributes to the extraction of ample multi-scale spatial information for the high-level features.The resulting features consequently achieve a more comprehensive representation of different scales.
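The data flow of MCF can be sketched as follows. This is a structural sketch under stated assumptions: the matching gate is simplified to an addition, the dilation rates follow the {1, 3, 6} setting reported in the ablations, and the module/parameter names are ours, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MCF(nn.Module):
    """Sketch of multi-resolution cascaded fusion for levels i = 2..5."""

    def __init__(self, channels, dilations=(1, 3, 6)):
        super().__init__()
        # Psi_i(.): 3x3 Conv + BN + ReLU applied to the downsampled image
        self.psi = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for _ in range(4))
        # R_i(.): dilated 3x3 convs for levels 3..5
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations)

    def forward(self, image, feats):
        outs, prev = [], None
        for i, c in enumerate(feats):
            ds = F.interpolate(image, size=tuple(c.shape[-2:]))  # Cds_i
            fused = self.psi[i](ds) + c  # matching gate simplified to a sum
            if i == 0:
                m = fused                # M_2: no dilated convolution
            else:
                # cascade the previous level, resized to the current scale
                m = self.dilated[i - 1](
                    fused + F.interpolate(prev, size=tuple(c.shape[-2:])))
            outs.append(m)
            prev = m
        return outs


mcf = MCF(channels=64)
image = torch.randn(1, 3, 256, 256)
feats = [torch.randn(1, 64, 256 // s, 256 // s) for s in (4, 8, 16, 32)]
outs = mcf(image, feats)
```

The sketch preserves the two key properties of MCF: every level sees coarse-grained image information at its own resolution, and levels 3 to 5 inherit multi-receptive-field context from the level below.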

Semantic-Aware Refinement Transformer
Based on earlier investigations [9,55], it is evident that the semantic message contained in the high-level features significantly contributes to mitigating scale variations. However, in conventional approaches, there is a lack of distinction between different levels. Common methods merely employ high-level features to provide semantic information in their original states. Moreover, the transformer is designed to capture long-range semantic messages due to its self-attention mechanism. Nevertheless, directly applying the transformer to high-level features may disregard the variations in features across diverse representation situations. Thus, we propose the SRT encoder to enhance the comprehensive semantic representation of high-level features across different feature states. This enhancement facilitates the acquisition of multi-scale semantic global information by high-level features.
As illustrated in Figure 3, we employ SRT on C_5 to augment the semantic information. The entire process of SRT can be elucidated as follows:

X' = X + Attn_SRT(LN(X + PE(X))),  X_out = X' + FFN(LN(X')),

where LN denotes the layer normalization operation, PE introduces the position embedding for the feature, and the FFN serves to enhance the non-linearity of these features. Attn_SRT signifies the novel SRT attention mechanism, enabling the query of the original feature to probe long-range semantic relationships across various feature states. Furthermore, sufficient semantic information can be integrated effectively through the SRT attention mechanism. The process can be delineated as:

Attn_SRT = Concat(head_1, …, head_h) W_O,  head_j = Attn(q_1, [k_2; k_3], [v_2; v_3]),

The term q_1 represents the query extracted from the original feature. The keys k_2, k_3, along with the values v_2, v_3, signify the keys and values obtained by processing the corresponding features with average and max pooling operations. The processed features become more expressive with a small spatial size. h denotes the number of attention heads. Following this, q_1 engages in interactions with the other keys to amplify the semantic representation of the high-level feature under various representation states. The mechanism Attn is employed to calculate token-wise correlations among the features. Details can be formulated as follows:

Attn(q, k, v) = softmax(q k^T / √d_k) v,

where q, k, and v represent the query, key, and value, respectively, and d_k denotes the number of feature channels. Our proposed approach employs the initial query to compute correlations with other keys sourced from diverse sections of the feature. This process enables the sufficient extraction of semantic information from the high-level feature. In summary, our proposed SRT comprehensively investigates the semantic information across different states of the high-level feature. This facilitates the refinement and enhancement of multi-scale semantic details through long-range relationship interactions. Moreover, the computational cost remains minimal due to the small spatial size of the high-level feature.
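The core of SRT attention, a query from the original tokens attending over avg- and max-pooled keys/values, can be sketched in a few lines of PyTorch. This single-head sketch uses a shared projection for keys and values and our own projection names (w_q, w_kv), both simplifying assumptions:

```python
import math
import torch
import torch.nn.functional as F


def attn(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    weights = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return weights @ v


def srt_attention(x, w_q, w_kv):
    """Single-head sketch of SRT attention on a feature map x (b, c, h, w).

    The query comes from the original tokens; keys and values come from
    avg- and max-pooled copies concatenated along the token axis.
    """
    tokens = x.flatten(2).transpose(1, 2)  # (b, hw, c): source of q_1
    pooled_avg = F.avg_pool2d(x, 2).flatten(2).transpose(1, 2)
    pooled_max = F.max_pool2d(x, 2).flatten(2).transpose(1, 2)
    kv = torch.cat([pooled_avg, pooled_max], dim=1)  # [k_2; k_3] / [v_2; v_3]
    q = tokens @ w_q
    k = kv @ w_kv
    v = kv @ w_kv  # shared k/v projection: a simplification
    return attn(q, k, v)  # (b, hw, c)


x = torch.randn(1, 8, 4, 4)
w = torch.randn(8, 8)
out = srt_attention(x, w, w)
```

Note how the pooled branches shrink the key/value token count by 4x per branch, which is why the extra attention paths stay cheap on the small C_5 map.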

Bidirectional Fine-Grained Interaction
While acquiring the appropriate input for the merging process, a more effective interaction of features among various levels becomes essential. In a typical feature pyramid, a top-down pathway connects features from high to low levels in a progressive manner. Low-level features are enriched with semantic information from higher levels, which proves advantageous for classification tasks. Nevertheless, detection demands sufficient information pertinent to both classification and regression, which poses a challenge due to the differing information needs of these tasks. The regression task mandates precise object contours and detailed information from high-resolution levels. Additionally, the classification task necessitates ample semantic information from low-resolution levels. However, the FPN scheme is not fully harnessed, resulting in the underutilization of high-resolution information from lower levels. The integration of numerous object contours and detailed information does not occur as effectively as anticipated. Furthermore, the semantic information gradually diminishes along the top-down path.
Building upon the aforementioned knowledge, we introduce bidirectional fine-grained interaction to address the challenge of underutilizing multi-scale features and to foster interplay across distinct levels. Initially, we recognize that a straightforward bottom-up path could potentially introduce additional noise in lower levels. Therefore, we devise a pixel-level filter (PLF), depicted in Figure 2b, which centers on salient locations and dynamically sieves out extraneous pixel-level information based on the current feature's characteristics. Moreover, high-level features often lack location-specific information. As a solution, we introduce a bottom-up scheme where low-level features employ the pixel-level filter to guide high-level features towards object-specific locations.
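A minimal PyTorch sketch of such a filter, using the Tanh and non-negativity operations detailed in the following paragraph, is shown below; the layer layout is an assumption for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelLevelFilter(nn.Module):
    """Sketch of the pixel-level filter (PLF).

    Tanh maps responses to (-1, 1) and the subsequent max(0, .) suppresses
    non-salient pixels; a 1x1 conv stands in for Phi(.).
    """

    def __init__(self, channels):
        super().__init__()
        self.phi = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, m_i):
        # W_i = max(0, Tanh(Phi(M_i)))
        return F.relu(torch.tanh(self.phi(m_i)))


plf = PixelLevelFilter(32)
w = plf(torch.randn(1, 32, 16, 16))
```

Because every negative response is zeroed, the output acts as a soft spatial mask in [0, 1) that can gate an adjacent level without injecting background noise.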
The pixel-level filter comprises two primary components: the identification of salient locations with the removal of superfluous pixel-level information, and the provision of fine-grained location guidance. The initial component, referred to as the pixel-level filter, can be outlined as follows:

W_i = Max(0, Tanh(Φ(M_i))),

where Tanh(·) is the tanh activation that transforms the operation into an encoded feature vector ranging over (−1, 1); Φ(·) refers to a 1 × 1 conv operation; and Max ensures non-negativity. W_i is the output of PLF that denotes the filter result of M_i. The pixel-level filter effectively removes superfluous information by suppressing values below 0 and dynamically emphasizes the salient region. In the subsequent part, the adjacent layer M_{i+1} is guided by the filter results W_i from preceding layers, facilitating focus on the desired region:

M'_{i+1} = M_{i+1} + D(F_mul(W_i, Φ(M_i))),

where Φ(·) is a convolution operator applied to M_i with the intention of obtaining a focused region through a learning strategy, and D(·) downsamples the result to match the spatial size of M_{i+1}. M'_{i+1} signifies the output of the interaction. It is obtained by matching M_{i+1} with the prominent information derived from preceding layers. M'_2 remains unchanged, equivalent to M_2. Upon acquiring features enriched with accurate object contour and detailed information, we incorporate the concept of the channel-wise prompt to facilitate the propagation of semantic information. As shown in Figure 2c, the channel-wise prompt is devoted to adaptively extracting the semantic prompt map of the feature at the channel level. Then, we utilize the semantic prompt map of higher levels to instruct the adjacent layer, which can heighten the semantic perception of objects. The detailed process can be articulated as:

R_i = σ(Φ(avg(M'_i) + max(M'_i))),

where R_i denotes the semantic prompt map of high-level features, and avg and max represent the average pooling and max pooling operation blocks. Then, P_{i−1} learns the semantic knowledge according to the prompt map. The process can be written as:

P_{i−1} = F_mul(R_i, M'_{i−1}) + M'_{i−1},

The proposed bidirectional fine-grained interaction takes full advantage of
multi-scale features. During the bidirectional interaction process, both semantic and spatial information can be effectively complemented among different levels. The low-level layers, which possess high-resolution information, effectively capture salient location information via the pixel-level filter. This information is then utilized to establish a bottom-up information flow, which aids in enhancing the essential location information of objects within high-level layers. Conversely, the high-level layers, abundant in semantic information, contribute significant semantic prompts through the channel-wise prompt. The prominent semantic prompt can be effectively transmitted to the low-level layers with minimal loss. BFI promotes adequate interaction among different levels with abundant multi-scale information.

Experiments
Dataset and Evaluation Metrics. Training is conducted on the MS-COCO train2017 split, while ablation experiments and comparable results are generated using val2017. The performance assessment utilizes standard COCO-style average precision (AP) metrics, incorporating intersection over union (IoU) thresholds ranging from 0.5 to 0.95. AP_S, AP_M, and AP_L represent the AP of small, medium, and large objects. Moreover, AP^b and AP^m denote the AP of the bounding box and mask in the instance segmentation task.
Implementation Details. To maintain fairness of experimental comparison, all experiments are conducted using PyTorch [56] and mmdetection [57]. In our configuration, input images are resized so that their shorter side measures 800 pixels. We train detectors with 8 Nvidia V100 GPUs (2 images per GPU) for 12 epochs. The initial learning rate is 0.02, and it is reduced by a factor of 0.1 after the 8th and 11th epochs, respectively. The backbones utilized in our experiments are publicly available and have been pretrained on ImageNet [58]. The training process incorporates linear warm-up during the initial stage. All remaining hyperparameters remain consistent with the configurations provided by mmdetection. Unless stated otherwise, all baseline methods incorporate FPN, and the ablation studies utilize Faster R-CNN based on ResNet-50.
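The stated schedule corresponds to a standard 1x recipe; it could be expressed as an mmdetection-style config fragment roughly as follows (field names follow common mmdetection 2.x conventions and exact warm-up values are illustrative assumptions):

```python
# Sketch of an mmdetection-style schedule matching the stated settings.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
lr_config = dict(
    policy='step',
    warmup='linear',      # linear warm-up during the initial stage
    warmup_iters=500,     # illustrative value
    warmup_ratio=0.001,   # illustrative value
    step=[8, 11])         # decay lr by 0.1x after epochs 8 and 11
runner = dict(type='EpochBasedRunner', max_epochs=12)
data = dict(samples_per_gpu=2)  # 8 GPUs x 2 images = effective batch 16
```

The linear-scaling convention (lr 0.02 for a batch of 16) matches the paper's setup of 8 GPUs with 2 images each.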

Ablation Studies
Ablation Studies on Three Components
To assess the significance of the components within MSBA, we progressively integrate the three modules into the model. For all our ablation studies, the baseline method employed is Faster R-CNN with FPN, based on ResNet-50. As indicated in Table 1, MCF enhances the baseline method by 1.2 AP, owing to the utilization of diverse-resolution images and a cascaded dilated convolution fusion strategy. Multi-resolution images encompass ample spatial object information, while the cascaded method provides diverse receptive field messages. MCF effectively furnishes adequate information for objects of varying scales: small, medium, and large. SRT contributes a 1.3 AP enhancement to the baseline method by refining long-range relationships within high-level features. The most substantial contribution to the superior performance stems from the enhancements in AP_L (+2.9 AP), facilitated by ample semantic information. The findings suggest a deficiency in semantic information within the high-level features of the baseline method. SRT rectifies this shortfall by refining semantic information and enhancing feature representation in the high-level layer. BFI boosts detection performance by 1.4 AP, with a noteworthy improvement in AP_S. Evidently, robust interaction across various levels is conducive to mitigating scale variations. Furthermore, the fine-grained messages proficiently enhance detail and contour information across multi-scale features.
Combining any two of these components results in significantly improved performance compared to the baseline method, underscoring the efficacy of their synergistic interaction. For instance, the simultaneous integration of MCF and SRT yields an AP of 39.0, surpassing the enhancement achieved by either module individually. Furthermore, the incorporation of all three components with the baseline method results in an AP of 39.5. These ablation results substantiate the efficacy of the three individual components and their combined configurations, affirming their mutual complementarity.

Ablation Studies of Various Dilation Rates
Table 2 presents the experimental results from various implementations of MCF. To validate the efficacy of MCF, we employed distinct dilation rates. Employing narrower dilation rates such as {1, 2, 3} and {2, 3, 4} yields constrained enhancements owing to the insufficiency of spatial information. Conversely, when employing dilation rates of {3, 6, 12}, the performance fails to improve as anticipated. This suggests that the substantial disparity among the three dilation rates might result in incongruous receptive information. The more favorable outcome underscores the suitability of the configuration {1, 3, 6}, which effectively provides ample pragmatic information for multi-level features.
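For intuition, the single-layer receptive field of a dilated 3 × 3 convolution is d(k − 1) + 1, which makes the disparity between the compared settings easy to quantify:

```python
def dilated_rf(kernel=3, dilation=1):
    # Receptive field of one dilated conv layer: d * (k - 1) + 1
    return dilation * (kernel - 1) + 1


# Receptive fields for the dilation settings compared in Table 2
settings = {rates: tuple(dilated_rf(3, d) for d in rates)
            for rates in [(1, 2, 3), (2, 3, 4), (1, 3, 6), (3, 6, 12)]}
```

Under this measure, {1, 3, 6} spans 3, 7, and 13 pixels, a wider spread than {1, 2, 3} or {2, 3, 4}, while {3, 6, 12} jumps from 7 to 25 pixels between branches, consistent with the incongruous receptive information observed above.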

Ablation Studies of Different Fusion Styles
Subsequently, we delve into the fusion techniques employed for combining two features within the MCF. The experiments are performed using distinct fusion styles within the matching gate. Initially, we employ the product operation on the two features to derive the fused feature. Subsequently, we sum the two features in another experiment for comparison purposes. As shown in Table 3, the summation operation applied to feature fusion yields superior performance, effectively preserving ample spatial and semantic information from both features.

Ablation Studies of the Effect of Individual Component in BFI
In this section, we undertake comparative experiments to ascertain the efficacy of the individual components within BFI. We employ the two directional structures to facilitate interaction independently. As shown in Table 4, both components enhance the performance of the baseline method. Furthermore, the outcomes reveal the superiority of combining both methods. The PLF and CWP are complementary and only partially overlapping, leading to enhanced performance when combined. We subsequently undertake relevant experiments to validate the significance of the interaction order between the two structures within BFI. The experiment is conducted by interchanging the positions of CWP and PLF. As shown in Table 5, the sequence of PLF followed by CWP surpasses the alternative. When PLF follows CWP, it may introduce more noise and background information into the high-level features. In contrast, when PLF precedes CWP, it effectively mitigates the aforementioned issues owing to the influence of semantic guidance.

Performance Comparison
To ascertain the efficacy and superiority of our approach, we perform comprehensive experiments on both object detection and instance segmentation tasks. Furthermore, we re-implement the baseline methods in mmdetection to ensure equitable comparisons; in general, the resulting performances surpass those reported in the original papers. Additionally, we apply our proposed approach across multiple backbones and detectors, employing extended training schedules and additional techniques to demonstrate its generalizability.

Object Detection
As shown in Table 6, detectors equipped with MSBA consistently achieve substantial improvements over conventional methods, for both single-stage and multi-stage detectors. MSBA improves RetinaNet and Faster R-CNN with ResNet-50 by 1.5 and 2.1 points, respectively. Leveraging the ample coarse-grained information at lower levels, multi-stage detectors exhibit a more pronounced accuracy gain. Moreover, when combined with diverse backbones and more sophisticated detectors, our approach attains superior results, attributable to the reinforced multi-scale representation. Additionally, as depicted in Figure 4, MSBA captures substantial spatial information through ample interaction, mitigating both false and missed detections.

Instance Segmentation
We also conduct comprehensive experiments to confirm the superiority and generalizability of MSBA on instance segmentation. As shown in Table 7, our approach significantly enhances performance on both detection and instance segmentation, with substantial gains over various strong models. Equipped with MSBA, Mask R-CNN with ResNet-101 achieves 41.7 AP on detection and 37.3 AP on segmentation. Despite the complexity of powerful methods like HTC, MSBA still yields a notable improvement of 1.6 points in detection AP and 1.4 points in instance segmentation AP with ResNet-50. Furthermore, MSBA achieves superior performance on large objects in both tasks, owing to substantial interaction and rich semantic information at higher levels. In addition, as shown in Figure 5, MSBA captures global semantic information, enabling accurate classification predictions and maintaining segmentation completeness.

Comparison on Transformer-Based Method
We further substantiate the generalizability of MSBA on transformer-based methods. As indicated in Table 8, we conduct relevant experiments covering both single-stage and two-stage detectors on both tasks. MSBA yields improvements of 1.2 and 0.9 points on the detection task when applied to PVT-Tiny and Swin-Tiny, respectively. Moreover, even under the same techniques, such as extended training schedules and multi-scale training, MSBA remains effective and superior with the stronger Swin-Small backbone, delivering a 0.5-point gain over the baseline. Owing to the extensive multi-scale representation facilitated by MSBA, the improvement on small objects in the detection task is particularly notable.

Comparison with State-of-the-Art Methods
We evaluate MSBA on more expressive methods with longer training schedules and various tricks, and compare it with other state-of-the-art object detection approaches. To ensure equitable comparisons, we re-implement the corresponding baseline models with FPN in mmdetection. As shown in Table 9, MSBA consistently attains notable improvements, even with more potent backbones, both CNN-based and transformer-based. MSBA achieves 42.1 AP and 43.0 AP with ResNeXt101-32×4d and ResNeXt101-64×4d as the feature extractors of Faster R-CNN, respectively, an improvement of 0.9 points over the FPN counterparts. When applied to transformer-based detectors under identical training schedules and strategies, the consistently superior performance underscores the applicability of MSBA across detector architectures. Additionally, we assess our approach on stronger models, namely HTC with a 20-epoch training schedule and Mask R-CNN with a 36-epoch training schedule, which leads to gains of 0.8 and 0.5 points in detection AP for ResNeXt101-32×4d and Swin-Small, respectively. Consequently, our approach yields substantial improvements across diverse public backbones and distinct tasks, evidencing MSBA's generalization capacity and robustness.

Error Analyses
Subsequently, we conduct error analyses to further substantiate the effectiveness of our approach. As illustrated in Figure 6, we randomly select four categories covering objects of diverse scales. Our approach outperforms the baseline across various thresholds. When localization errors are disregarded, MSBA surpasses the baseline, attributable to its more accurate classification information. Furthermore, when excluding confusions with similar classes from the same supercategory and with different classes, our method exhibits noteworthy gains over the baseline, underscoring MSBA's superior localization accuracy.

Conclusions
In this paper, we introduce MSBA, a novel and effective multi-resolution and semantic-aware bidirectional adapter that enhances multi-scale object detection through adaptive feature integration. MSBA decomposes the complete integration process into three stages, dedicated to supplying appropriate input, refined enhancement, and comprehensive interaction, respectively. The three corresponding components of MSBA, namely multi-resolution cascaded fusion (MCF), the semantic-aware refinement transformer (SRT), and bidirectional fine-grained interaction (BFI), are devised to address these stages. Equipped with these three simple yet potent components, MSBA adapts to both two-stage and single-stage detectors, yielding substantial improvements over the baseline on the challenging MS COCO dataset.

Figure 1 .
Figure 1. Visual comparison of results. The top row displays the detection outcomes of Faster R-CNN using FPN (left) and MSBA (right). The MSBA-based results exhibit a significant reduction in false positives and a qualitative performance enhancement. AP_S, AP_M, and AP_L denote the AP of small, medium, and large objects. In the bottom row, a similar trend is observed for Mask R-CNN, where our approach (right) consistently outperforms the baseline (left). AP^bbox and AP_S^bbox pertain to overall detection performance and bbox AP for small objects. AP^mask and AP_L^mask correspond to instance segmentation performance and mask AP for large objects.

Figure 2 .
Figure 2. The overall architecture of MSBA. There are three components: multi-resolution cascaded fusion (MCF), semantic-aware refinement transformer (SRT), and bidirectional fine-grained interaction (BFI). MCF performs an adaptive fusion of multi-receptive-field and multi-resolution features, providing ample multi-scale information. Subsequently, SRT refines the features by amplifying long-range semantic information. Moreover, BFI ensures robust interaction by establishing two opposing directions of guidance for features containing fine-grained information. The pixel-level filter establishes a bottom-up pathway to convey spatial information from high-resolution levels. Concurrently, the channel-wise prompt guides low-level semantic information via the top-down structure.
Dataset and Evaluation Metrics. Our experiments utilize the MS COCO dataset, a publicly available and reputable benchmark comprising 80 distinct object categories. It consists of 115k images for training (train2017) and 5k images for validation (val2017).

Figure 4 .
Figure 4. Example pairs of object detection results. (Top row) The outcomes are obtained using Faster R-CNN with FPN. (Bottom row) In contrast to Faster R-CNN with FPN, our MSBA method markedly enhances the localization capability of multi-scale objects through substantial interaction across diverse levels, as illustrated qualitatively.

Figure 5 .
Figure 5. Example pairs of instance segmentation results. (Top row) The results are from Mask R-CNN with FPN. (Bottom row) Our MSBA method significantly enhances the instance classification performance and effectively mitigates duplicate bounding boxes within densely populated regions, as demonstrated qualitatively.

Figure 6 .
Figure 6. The analyses of four categories: the results in the first row correspond to the baseline, while those in the second row correspond to MSBA.

Table 2 .
Comparison of different dilation rates in MCF on COCO val2017.

Table 3 .
Comparison of fusion styles in the matching gate of MCF on COCO val2017.

Table 4 .
Comparison of the effect of each component in BFI on COCO val2017. PLF: pixel-level filter; CWP: channel-wise prompt.

Table 5 .
Comparison of interaction orders in BFI on COCO val2017.

Table 6 .
Object Detection: Performance comparisons with typical detectors based on FPN. "MSBA" represents our proposed adapter. "√" denotes methods equipped with MSBA.

Table 7 .
Instance Segmentation: Performance comparisons with powerful instance segmentation methods. All baseline approaches incorporate FPN. The † denotes models trained with longer training schedules.

Table 8 .
Comparison with transformer-based backbones on object detection: Performance comparisons paired with Mask R-CNN. The baseline methods are integrated with FPN. The † represents models trained with extra tricks such as multi-scale crop and a longer training schedule.

Table 9 .
Comparisons with the state of the art: The symbol "*" signifies our re-implemented results in mmdetection. "Schedule" refers to the learning schedule of the respective method. The † symbol indicates models trained with additional tricks, such as multi-scale training.