Abstract
Dense image prediction tasks require both strong semantic category information and precise boundary delineation in order to be effectively applied to downstream applications. However, existing networks typically fuse deep coarse features with adjacent fine features directly through upsampling. Such a straightforward upsampling strategy not only blurs boundaries due to the loss of high-frequency information, but also amplifies intra-class conflicts caused by high-frequency interference within the same object. To address these issues, this paper proposes an Adaptive High–Low Frequency Collaborative Auxiliary Feature Alignment Network (AHLFNet), which consists of an Adaptive Low-Frequency Multi-Kernel Smoothing Unit (ALFU), a Gate-Controlled Selector (GCS), and an Adaptive High-Frequency Edge Enhancement Unit (AHFU). The ALFU suppresses high-frequency components within objects, mitigating interference during upsampling and thereby reducing intra-class conflicts. The GCS adaptively chooses suitable convolutional kernels based on the size of similar regions to produce accurate upsampled features. The AHFU preserves high-frequency details from low-level features, enabling more refined boundary delineation. Extensive experiments demonstrate that the proposed network achieves state-of-the-art performance across various downstream tasks.
1. Introduction
Dense image prediction plays a crucial role in computer vision and has been widely applied to tasks such as segmentation [,,], classification [,], and object detection []. These tasks further serve as the foundation for real-world scene understanding in applications including autonomous driving, smart agriculture, and intelligent agents. In such applications, it is essential to assign each pixel to a known category in order to leverage semantic information for classification, while precise boundary information is required for accurate localization. To achieve feature representations that simultaneously capture richer semantic information and finer boundary details, popular models [,] commonly adopt downsampling-based feature fusion techniques []. Feature fusion refers to the process of combining shallow, high-resolution detail features with deep, low-resolution coarse features. Since feature maps at different levels have mismatched spatial resolutions, deep coarse features are typically upsampled to the same size as shallow detail features using methods such as nearest-neighbor or bilinear interpolation, after which they are either added or concatenated.
However, nearest-neighbor and bilinear interpolation introduce two critical issues that severely affect downstream tasks: intra-class conflict and edge drift. Intra-class conflict refers to inconsistencies in feature representation within the same object during interpolation or feature fusion, caused by variations in illumination, shape, or structural differences. For example, the skin region of a human face exhibits smooth textures and gradual color changes, while the eyes contain strong edges and high-contrast details. Similarly, car wheels and windows often exhibit pronounced intra-class conflict. Such conflicts within the same category result in disorganized feature representations after interpolation, ultimately weakening the model’s discriminative ability. As shown in Figure 1, edge drift refers to the phenomenon where object boundary features become excessively blurred or shifted due to the smoothing effects of simple interpolation methods such as nearest-neighbor or bilinear interpolation, resulting in boundary information that is inconsistent with the actual object contour. This misalignment significantly degrades the performance of downstream tasks that rely on precise boundaries, such as segmentation and detection. FPN [] has pointed out that traditional feature fusion, due to its sampling mechanism, fails to effectively handle intra-class conflict. Similarly, SAPA [] and FADE [] demonstrate that simple interpolation algorithms tend to oversmooth features, resulting in edge drift. Furthermore, high-frequency noise within objects leads to nonsmooth fluctuations that disrupt intra-class consistency []. In addition, SSFA [] shows that the absence of high-frequency information in boundary regions causes semantic contour displacement errors. The design philosophy of this paper is inspired by a form of functional symmetry: just as a complete visual perception system must simultaneously handle the two complementary aspects of semantic consistency and boundary precision, our network adopts a symmetrical dual-pathway architecture. Accordingly, we propose the Adaptive High–Low Frequency Collaborative Auxiliary Feature Alignment Network (AHLFNet), a novel method for enhancing feature fusion. This network comprises three key components: an Adaptive Low-Frequency Multi-Kernel Smoothing Unit (ALFU), a Gate-Controlled Selector (GCS), and an Adaptive High-Frequency Edge Enhancement Unit (AHFU). The ALFU selectively applies different convolution kernels depending on the size of similar regions to suppress high-frequency components within objects, thereby reducing intra-class conflict. During upsampling, this module maintains smoothness while overcoming the limitation of fixed kernel sizes, which cannot balance global smoothness and local detail preservation. Specifically, large-scale feature inconsistencies can be alleviated by emphasizing large kernels to smooth and upsample coarse high-level features, thereby reducing pixel-level differences and mitigating feature inconsistency. In contrast, fine-grained boundaries require small kernels to preserve details and avoid blurring. The GCS adaptively determines kernel size based on regional similarity. It conducts local similarity analysis using three dilated convolutions with different dilation rates. If the similarity among the three kernels is consistent, the large kernel is selected. If the similarities of the small and medium kernels are consistent, the medium kernel is chosen.
If the small kernel differs significantly from the other two, or in other ambiguous cases, the small kernel is selected. This mechanism ensures that, during upsampling, sampled pixels exhibit higher similarity with their neighboring pixels, thereby reducing intra-class conflict. As a result, the enhanced similarity and discriminability of intra-class features make this mechanism particularly effective for dense prediction tasks. Although the ALFU and GCS effectively reconstruct upsampled high-level features with low intra-class conflict and clear boundaries, fine-grained edge details in low-level features remain difficult to fully recover in high-level representations during downsampling. According to the Nyquist–Shannon sampling theorem, any frequency component above the Nyquist frequency is irreversibly lost during downsampling. To address this, we introduce the AHFU, which compensates for boundary details beyond the Nyquist frequency that cannot be restored through upsampling. By capturing low-level high-frequency details lost during downsampling, this unit significantly improves boundary localization accuracy. Together, these three modules reconstruct fused features that are both semantically consistent and boundary-sharp. The main contributions of this study can be summarized as follows:
Figure 1.
Edge shifts among three different upsampling methods. Compared with the other two methods, our approach achieves better suppression of edge drift.
- We propose a novel frequency-centric design paradigm. Motivated by the complementary roles of frequency components in images—where low frequencies govern semantic consistency and high frequencies dictate boundary precision—we introduce a principled Adaptive High–Low Frequency Collaborative framework. This framework is specifically designed to address the root causes of intra-class conflict and edge drift in dense prediction tasks, ensuring both strong semantic discriminability and fine boundary precision.
- Under this paradigm, we instantiate the AHLFNet which integrates three key components: (1) an Adaptive Low-frequency Unit (ALFU), which smooths object interiors to suppress high-frequency interference and reduce intra-class conflict; (2) a Gate-Controlled Selector (GCS), which adaptively determines the kernel size for smoothing through similarity analysis to ensure feature consistency; and (3) an Adaptive High-frequency Unit (AHFU), which compensates for boundary details lost during downsampling to improve edge localization accuracy.
- Extensive experiments demonstrate that the proposed method yields significant performance improvements in object detection, instance segmentation, and semantic segmentation. Consistent gains are achieved across benchmark datasets including COCO, ADE20K, and Cityscapes, validating the effectiveness and robustness of our approach.
2. Related Work
2.1. Feature Fusion and Upsampling Methods
Feature fusion aims to effectively combine high-resolution shallow features with low-resolution deep features, thereby preserving both fine-grained local details and abstract semantic representations []. In most existing frameworks, deep features are first upsampled to match the spatial resolution of shallow features, after which they are fused through weighted summation or concatenation. Traditional upsampling methods, such as nearest-neighbor and bilinear interpolation, rely on fixed manually designed convolution kernels. As a result, they often suffer from over-smoothing and edge drift when applied to complex semantic scenes. To overcome these limitations, researchers have proposed a variety of improved feature fusion strategies. Methods such as deconvolution [], pixel shuffle [], and DUpsampling [] allow convolution kernel weights to be learned during training. However, once training is complete, the kernels remain fixed, lacking adaptability to spatial variations. To address this issue, several studies have explored dynamic upsampling mechanisms. For instance, LHC [] jointly learns both kernel shapes and weights, while unpooling [] leverages the indices from max pooling for spatial reconstruction, reflecting location-dependent dynamics. CARAFE [] employs a context-aware mechanism to dynamically generate kernel weights, enabling adaptive local feature reassembly. These methods share a common characteristic: they adopt data-dependent, predictable kernels whose parameters are generated by sub-networks, thereby providing greater expressive flexibility for feature upsampling. Similarly, GUM [] predicts guidance offsets to adjust upsampling coordinates, aligning high- and low-level features. AlignSeg [] further extends this idea by introducing predicted sampling offsets for multi-resolution feature alignment and contextual modeling. Mask2Former [] employs a deformable attention module that combines spatial offsets with adaptive attention weights to achieve more efficient multi-scale fusion.
In summary, existing feature fusion approaches during upsampling can be broadly categorized into two paradigms: kernel learnability and sampling-coordinate optimization. The former enhances representational capacity by learning convolution kernels but generally lacks spatial adaptability; the latter improves feature alignment via offsets or attention, yet still struggles to reduce intra-class conflict and to preserve precise boundary information. A comparative analysis with several representative methods clarifies these distinctions. For instance, while CARAFE achieves general content-aware upsampling through predicted kernels, our proposed GCS introduces similarity-driven kernel selection specifically designed for low-frequency smoothing, establishing a dedicated mechanism that fundamentally differs in both purpose and implementation for handling intra-class conflicts. Similarly, whereas adaptive filtering techniques are typically applied in general image processing, our ALFU implements task-specific adaptive multi-kernel smoothing for feature consistency in dense prediction tasks. The integration of three kernel scales with the GCS mechanism represents a significant architectural advancement. The proposed Adaptive High–Low Frequency Collaborative Auxiliary Feature Alignment Network approaches the problem from a frequency-domain perspective, combining low-frequency dynamic smoothing with high-frequency edge enhancement. This integrated design aims to further alleviate both intra-class inconsistency and boundary blurring in feature fusion.
2.2. Intra-Class Conflict and Boundary Issues
In dense prediction tasks such as semantic segmentation and object detection, intra-class conflict and boundary accuracy remain critical bottlenecks to performance. Prior studies [] have shown that although deep features possess strong semantic representation capabilities, they often exhibit intra-class conflict within the same category due to variations in illumination, texture irregularities, or structural complexity. Meanwhile, commonly used interpolation-based upsampling methods tend to oversmooth feature representations, leading to semantic misalignment along object boundaries [] and consequently reducing the model’s ability to distinguish fine-grained structures. To mitigate intra-class conflict, one line of research introduces regularization or contrastive learning techniques to bring the feature representations of samples from the same class closer together. For example, clustering regularization and center loss explicitly constrain intra-class samples toward their category centers, thereby enhancing compactness [,]. Another line of work leverages channel attention or feature re-weighting mechanisms, enabling the model to emphasize category-discriminative regions while suppressing redundant features [,]. However, these methods mainly focus on optimizing global representations and lack effective handling of intra-class conflict in boundary regions. To alleviate edge drift, multi-scale feature fusion and high-resolution branch architectures [] improve boundary quality to some extent. Nevertheless, these approaches rely on fixed-scale fusion or interpolation operations, which often lead to blurred fine-grained boundaries. Other methods explicitly impose semantic contour constraints through boundary-aware loss functions or edge enhancement modules. However, such approaches are largely limited to task-level optimization and seldom incorporate feature similarity in analyzing edge drift. Different from the above methods, our work introduces a cosine-similarity-based analytical framework.
2.3. Applications of Frequency-Domain Methods in Visual Tasks
Frequency-domain analysis has long been a fundamental tool in the field of image processing [], and in recent years, it has also demonstrated significant theoretical and practical value in deep learning. For example, frequency-domain methods have been applied to study the optimization characteristics [] and generalization ability [] of deep neural networks (DNNs). OSBN [] and DFPF [] observed that DNNs tend to learn low-frequency patterns first during training, a phenomenon known as spectral bias or the frequency principle. AdaBlur [] introduced content-aware low-pass filtering to suppress aliasing effects during downsampling, while FLC [] further confirmed that frequency aliasing degrades model robustness. ISID [] showed that low-pass filtering effectively suppresses high-frequency interference caused by image noise. FreD [] specifically pointed out that high-frequency perturbations significantly aggravate intra-class conflict. In terms of methodological innovation, researchers have explored diverse applications of frequency-domain techniques. DHP [] and FcaNet [] utilized discrete cosine transform coefficients to enhance channel attention mechanisms. AFFT [] applied the classical convolution theorem to DNNs, demonstrating that adaptive frequency filters can serve as efficient global feature integrators. It has been shown that intra-class conflict arises from disturbances in high-frequency components, whereas edge drift is caused by the loss of high-frequency information. To address these issues, we propose an innovative design that employs an ALFU and a GCS to reduce feature conflicts, while leveraging an AHFU to strengthen useful high-frequency details and sharpen object boundaries.
Existing frequency-based methods typically process frequency components in isolation or sequentially. Such serial or segregated architectures inevitably lead to the loss or distortion of one frequency component while processing the other. In contrast, our framework establishes a synchronous dual-path processing mechanism, where the ALFU+GCS pathway is dedicated to low-frequency purification, and the AHFU pathway focuses on high-frequency preservation. Both paths operate simultaneously and are fused through our enhanced fusion strategy.
3. Method
We first present the overall improvements in feature fusion. Specifically, features from different stages are initially combined through basic fusion, and the resulting fused features are then fed into three distinct components for further enhanced fusion, as illustrated in Figure 2. We then provide a detailed explanation of the implementation principles of the AHLFNet. To facilitate readability, we have listed the symbols used in the formulas and their corresponding meanings in Table 1.
Figure 2.
Implementation process of the Adaptive High–Low Frequency Collaborative Auxiliary Feature Alignment Network. The upper part above the dashed line illustrates the basic fusion, while the lower part below the dashed line depicts the enhanced fusion. ALFU, AHFU, and GCS denote the Adaptive Low-Frequency Multi-Kernel Smoothing Unit, the Adaptive High-Frequency Edge Enhancement Unit, and the Gate-Controlled Selector, respectively. PS represents PixelShuffle, and Proj indicates projection. “*” in a circle indicates element-wise multiplication, while ⊕ indicates element-wise addition.
Table 1.
Symbols and Their Meanings.
3.1. Overall Improvement Through Feature Fusion
Traditional feature fusion methods typically concatenate or add features from the l-th layer with upsampled features from the (l + 1)-th layer, where upsampling is generally implemented using bilinear interpolation or nearest-neighbor methods. To optimize this process, we first construct a preliminary fusion approach through channel adjustment, which can be mathematically formulated as follows:
Here, the two inputs denote the l-th layer feature extracted from the backbone and the feature to be fused at the l-th layer, respectively; UpS represents the upsampling operation, and the result is the compressed fused feature. Proj refers to channel adjustment, which can be implemented using a 1 × 1 convolution. Although the preliminary design in Equations (1) and (2) addresses the channel dimension matching problem, it suffers from two inherent limitations: (1) it can still exacerbate intra-class conflict, as the fusion process does not suppress high-frequency interference within the feature maps; (2) it remains ineffective in mitigating edge drift, since valuable low-level details are not explicitly enhanced during fusion. These limitations motivate our proposed enhanced fusion mechanism presented below. Enhanced fusion is achieved by collaboratively integrating the outputs of a dual-path processing strategy: the l-th layer feature undergoes high-frequency edge enhancement to yield an edge-enhanced feature, while simultaneously the (l + 1)-th layer feature is processed through adaptive low-frequency smoothing and upsampling to produce a smoothed upsampled feature. This process is mathematically formulated as follows:
Here, AL (the Adaptive Low-frequency Multi-kernel Smoothing Unit) performs adaptive low-frequency smoothing on high-level features to mitigate intra-class conflict by enhancing consistency in intra-class feature similarity. Meanwhile, AH (the Adaptive High-frequency Edge Enhancement Unit) focuses on amplifying high-frequency boundary details in low-level features, thereby effectively suppressing edge drift. The notation (i, j) denotes spatial coordinates in the feature map. Through this high–low frequency division-and-conquer strategy [], our collaborative design simultaneously addresses the two fundamental challenges of intra-class conflict and edge drift.
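To make the two-stage fusion concrete, the following PyTorch-style sketch illustrates the basic fusion (Proj, UpS, and element-wise addition) followed by the enhanced dual-path fusion. The class name TwoStageFusion, the argument names, and the identity-style stand-ins for the two units are illustrative assumptions, not the released implementation; the ALFU and AHFU are treated as opaque callables and sketched separately in Section 3.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageFusion(nn.Module):
    """Sketch of the basic fusion (Proj + UpS + add) followed by the enhanced dual-path fusion."""
    def __init__(self, low_ch, high_ch, fused_ch, alfu, ahfu):
        super().__init__()
        # Proj: 1x1 convolutions for channel adjustment of both inputs
        self.proj_low = nn.Conv2d(low_ch, fused_ch, kernel_size=1)
        self.proj_high = nn.Conv2d(high_ch, fused_ch, kernel_size=1)
        self.alfu = alfu   # adaptive low-frequency smoothing + 2x upsampling (Section 3.2)
        self.ahfu = ahfu   # adaptive high-frequency edge enhancement (Section 3.2)

    def forward(self, x_l, x_l1):
        # Basic fusion: upsample the (l+1)-th feature (UpS) and add it to the l-th feature
        up = F.interpolate(self.proj_high(x_l1), scale_factor=2, mode="nearest")
        fused = self.proj_low(x_l) + up
        # Enhanced fusion: smooth the high-level path, sharpen the low-level path, then add
        low_freq = self.alfu(fused, x_l1)   # smoothed, upsampled high-level branch
        high_freq = self.ahfu(fused, x_l)   # edge-enhanced low-level branch
        return low_freq + high_freq

# Usage with identity-style stand-ins for the two units (placeholders, not the real modules)
alfu = lambda fused, x_high: fused
ahfu = lambda fused, x_low: fused
net = TwoStageFusion(low_ch=64, high_ch=128, fused_ch=64, alfu=alfu, ahfu=ahfu)
out = net(torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32))   # -> (1, 64, 64, 64)
```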
3.2. AHLFNet
Adaptive Low-Frequency Multi-Kernel Smoothing Unit (ALFU). The design objective of the ALFU is to perform adaptive low-frequency smoothing on high-level features during the upsampling process [], thereby enhancing intra-class feature consistency. As illustrated in Figure 3a, the workflow can be decomposed into the following four steps. (1) Weight Generation: The unit takes the preliminary fused feature as its input. First, the input is processed through a convolutional layer to generate spatially variable base weights. Then, a kernel-wise softmax operation is applied to each kernel scale (3, 5, 7) within these base weights, producing independent, normalized weight tensors for each scale. This operation ensures the non-negativity and normalization (i.e., summation to one) of all kernel coefficients. (2) Weight Reorganization and Assignment: The generated weight tensor is reorganized along the channel dimension and evenly divided into four groups, indexed by g ∈ {1, 2, 3, 4}. Each group corresponds to a spatially varying low-pass filter kernel for a specific sub-pixel location in the upsampled output. (3) Multi-Scale Filtering and Fusion: The high-level feature is convolved with the filter kernels of the three scales (k = 3, 5, 7) belonging to the same group g, yielding one filtered feature per scale. The Gate-Controlled Selector then assigns weights to the outputs of each scale, and a weighted summation is performed to obtain the final intermediate feature for that group. (4) Sub-Pixel Reorganization and Output: The intermediate features of the four groups are spatially rearranged according to sub-pixel rules, ultimately producing the 2× upsampled smoothed feature. Through this design, the ALFU integrates feature upsampling and adaptive smoothing into a single operation, efficiently enhancing feature consistency. This process can be formally expressed as
where the filtering at each position is performed over the neighborhood of scale k. The ALFU can adaptively predict spatially varying filtering kernels based on the feature content, effectively enhancing feature consistency.
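The following is a simplified PyTorch-style sketch of the ALFU workflow described above (weight generation, kernel-wise softmax per scale, four sub-pixel groups, GCS-weighted multi-scale filtering, and sub-pixel reorganization). The helper and class names, the assumption that the guidance feature shares the coarse resolution of the high-level feature, and the exact wiring of the GCS gates are illustrative choices rather than the paper’s exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pixelwise_filter(x, kernels, k):
    """Filter x with per-pixel kernels: x (B,C,H,W), kernels (B,k*k,H,W), softmax-normalized."""
    b, c, h, w = x.shape
    patches = F.unfold(x, kernel_size=k, padding=k // 2)      # (B, C*k*k, H*W)
    patches = patches.view(b, c, k * k, h, w)
    return (patches * kernels.unsqueeze(1)).sum(dim=2)        # weighted sum over the neighborhood

class ALFUSketch(nn.Module):
    """Illustrative ALFU: per-pixel multi-kernel smoothing of the high-level feature + 2x upsampling."""
    def __init__(self, guide_ch, scales=(3, 5, 7), groups=4):
        super().__init__()
        self.scales, self.groups = scales, groups
        # Weight generation: one k*k kernel per scale and per sub-pixel group at every position
        self.weight_gen = nn.Conv2d(guide_ch, groups * sum(k * k for k in scales), 3, padding=1)

    def forward(self, guide, x_high, gcs_gates):
        # guide: guidance feature at the coarse resolution; gcs_gates: (B, 3, H, W) scale weights
        raw = self.weight_gen(guide)
        outs, idx = [], 0
        for g in range(self.groups):                           # one group per sub-pixel location
            acc = 0.0
            for s, k in enumerate(self.scales):
                w = F.softmax(raw[:, idx:idx + k * k], dim=1)  # kernel-wise softmax (sums to one)
                idx += k * k
                acc = acc + gcs_gates[:, s:s + 1] * pixelwise_filter(x_high, w, k)
            outs.append(acc)
        # Sub-pixel reorganization of the four groups -> 2x upsampled smoothed feature
        stacked = torch.stack(outs, dim=2).flatten(1, 2)       # (B, C*4, H, W)
        return F.pixel_shuffle(stacked, upscale_factor=2)      # (B, C, 2H, 2W)

# Usage sketch: guide and x_high share the coarse resolution here (a simplifying assumption)
alfu = ALFUSketch(guide_ch=64)
x_high, guide = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
gates = torch.softmax(torch.randn(2, 3, 16, 16), dim=1)
y = alfu(guide, x_high, gates)                                 # -> (2, 64, 32, 32)
```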
Figure 3.
(a) Details of the Adaptive Low-Frequency Multi-Kernel Smoothing Unit. (b) Details of the Gate-Controlled Selector. (c) Details of the Adaptive High-Frequency Edge Enhancement Unit. Here, HFEE denotes High-Frequency Edge Enhancement, and E represents the identity matrix.
Gate-Controlled Selector (GCS). Although fixed-kernel low-frequency smoothing units can reduce intra-class conflicts through feature smoothing, they face significant limitations when dealing with large inconsistent regions and fine boundaries. Large kernels can improve consistency across broad regions but tend to degrade fine details, whereas small kernels preserve details but struggle to correct widespread inconsistencies. This necessitates the adaptive selection of appropriate kernels. To address this, we propose the Gate-Controlled Selector, which is based on a similarity-driven principle. As illustrated in Figure 3b, for each pixel, three similarity scores, indexed by t ∈ {small (3), middle (5), large (7)}, are computed, representing the average similarity of the pixel with its eight neighboring pixels under three different dilation rates (1, 2, 3). The similarities are then converted into soft decisions in the range [0, 1], which are used to construct unnormalized weights. The three weights are generated as follows: if the average similarities of the small and medium neighborhoods are consistent, the weight of the medium kernel approaches 1 while the other two approach 0, resulting in the selection of the medium kernel. Other cases are handled similarly, ensuring that one weight is close to 1 while the other two are near 0, thereby selecting the appropriate kernel. Finally, the weights are normalized over t ∈ {3, 5, 7}. This process can be formally expressed as
where ‘Sim’ denotes cosine similarity, ‘cen’ denotes the center pixel, and ‘nei’ represents the neighboring pixels. The soft decisions are produced by an activation function with a hyperparameter set to 0.3, together with a learnable threshold parameter initialized to 0.7. During training, this threshold is optimized end-to-end alongside the rest of the network through backpropagation, enabling the model to adaptively learn optimal decision boundaries for kernel selection across diverse scenes and object scales.
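A minimal sketch of the similarity-driven gating is given below. The conversion of similarities into soft decisions via a sigmoid with the learnable threshold (initialized to 0.7) and the 0.3 hyperparameter used as a temperature-like scaling, as well as the product-form decision logic, are assumptions consistent with the described behavior rather than the exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def avg_neighbor_similarity(x, dilation):
    """Mean cosine similarity between each pixel and its 8 neighbors at the given dilation rate."""
    b, c, h, w = x.shape
    patches = F.unfold(x, kernel_size=3, dilation=dilation, padding=dilation)
    patches = patches.view(b, c, 9, h, w)
    center = patches[:, :, 4:5].expand_as(patches)             # the pixel itself, broadcast to all slots
    sims = F.cosine_similarity(patches, center, dim=1)         # (B, 9, H, W)
    neighbors = torch.cat([sims[:, :4], sims[:, 5:]], dim=1)   # drop the self-similarity term
    return neighbors.mean(dim=1, keepdim=True)                 # (B, 1, H, W)

class GCSSketch(nn.Module):
    """Illustrative gate: pick small/medium/large kernels from similarity at dilations 1/2/3."""
    def __init__(self, tau_init=0.7, temperature=0.3):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(tau_init))        # learnable threshold, initialized to 0.7
        self.temperature = temperature                          # hyperparameter from the text (0.3)

    def forward(self, x_high):
        sims = [avg_neighbor_similarity(x_high, d) for d in (1, 2, 3)]
        # Soft decisions in [0, 1]: "is this neighborhood consistent at this scale?"
        d_s, d_m, d_l = [torch.sigmoid((s - self.tau) / self.temperature) for s in sims]
        # Decision logic sketch: all scales consistent -> large kernel;
        # small and medium consistent -> medium kernel; otherwise -> small kernel
        w_large = d_s * d_m * d_l
        w_mid = d_s * d_m * (1 - d_l)
        w_small = 1 - w_large - w_mid
        gates = torch.cat([w_small, w_mid, w_large], dim=1)     # (B, 3, H, W)
        return gates / gates.sum(dim=1, keepdim=True).clamp_min(1e-6)

# Usage: feed the resulting gates to the ALFU's multi-scale fusion step
gates = GCSSketch()(torch.randn(2, 64, 16, 16))                 # -> (2, 3, 16, 16), summing to 1
```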
Adaptive High-Frequency Edge Enhancement Unit (AHFU). Although the ALFU and the GCS can effectively reconstruct upsampled high-level features with low intra-class conflict and clear boundaries, fine-grained edge information lost in low-level features during downsampling is difficult to fully restore in high-level representations. To address this limitation, as shown in Figure 3c, we design the AHFU to recover detailed edge information lost during downsampling. Specifically, this unit takes the initial fused feature as input and predicts spatially varying high-frequency kernels. To preserve fine edge details, the unit employs only 3 × 3 convolutional kernels, a softmax layer, and a filter inversion operation. In implementation, low-frequency kernels are first obtained via a kernel-level softmax and then inverted [] to produce high-pass kernels; here E represents a 3 × 3 identity kernel, in which the center element is set to 1 while all eight surrounding elements are set to 0. Finally, the high-frequency kernels are applied and combined with residual addition to produce the enhanced output. The residual structure ensures the preservation of original feature information while enhancing high-frequency details. This process can be expressed as
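The sketch below illustrates the AHFU mechanism: a kernel-level softmax produces per-pixel low-pass kernels, which are inverted against the identity kernel E to obtain high-pass kernels and then applied with a residual addition. Module and argument names are illustrative, and the guidance feature is assumed to share the resolution of the low-level feature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pixelwise_filter(x, kernels, k):
    """Filter x with per-pixel kernels (same helper as in the ALFU sketch)."""
    b, c, h, w = x.shape
    patches = F.unfold(x, kernel_size=k, padding=k // 2).view(b, c, k * k, h, w)
    return (patches * kernels.unsqueeze(1)).sum(dim=2)

class AHFUSketch(nn.Module):
    """Illustrative AHFU: predicted low-pass kernels inverted to high-pass, plus a residual."""
    def __init__(self, guide_ch, k=3):
        super().__init__()
        self.k = k
        self.kernel_gen = nn.Conv2d(guide_ch, k * k, 3, padding=1)   # per-pixel kernel logits

    def forward(self, guide, x_low):
        low_pass = F.softmax(self.kernel_gen(guide), dim=1)          # kernel-level softmax
        # Identity kernel E: center element 1, all eight neighbors 0 (flattened index k*k // 2)
        identity = torch.zeros_like(low_pass)
        identity[:, self.k * self.k // 2] = 1.0
        high_pass = identity - low_pass                               # filter inversion: E - low-pass
        edges = pixelwise_filter(x_low, high_pass, self.k)            # extracted high-frequency detail
        return x_low + edges                                          # residual addition preserves content

# Usage sketch: guide and x_low share the low-level (fine) resolution here
ahfu = AHFUSketch(guide_ch=64)
out = ahfu(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))   # -> (2, 64, 32, 32)
```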
We visualize the four-stage output features obtained with AHLFNet and with bilinear interpolation upsampling, using ResNet50 as the backbone, in Figure 4. The figure clearly shows that in the first two stages, AHLFNet produces features with sharper boundaries and more accurate localization. The last two stages, however, encode higher-level semantic information that is less amenable to direct visual interpretation.
Figure 4.
The first row displays features from the ResNet50 backbone, the second row shows features with added FPN (using nearest-neighbor interpolation for upsampling), and the third row presents features from AHLFNet. Our method demonstrates superior visualization results.
4. Experiment
We first evaluate the generalizability of the proposed Adaptive High–Low Frequency Collaborative Auxiliary Feature Alignment Network (AHLFNet) across three representative dense prediction tasks: semantic segmentation, object detection, and instance segmentation. Subsequently, we conduct ablation studies on the three proposed modules to thoroughly validate their effectiveness. The evaluation metrics are defined as follows. IoU denotes the ratio of the intersection between the predicted region and the ground truth to their union, and mIoU is the mean IoU averaged over all categories. APb denotes average precision computed at the bounding-box level, while APm denotes average precision computed from the IoU between predicted and ground-truth segmentation masks.
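For reference, a minimal NumPy sketch of the IoU and mIoU computation on label maps is shown below; it follows the standard definitions rather than any framework-specific evaluation code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU = |pred ∩ gt| / |pred ∪ gt|; mIoU averages it over the classes present."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                        # skip classes absent from both prediction and label
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example: two 2x2 label maps with 3 classes
pred = np.array([[0, 1], [1, 2]])
gt = np.array([[0, 1], [2, 2]])
print(mean_iou(pred, gt, num_classes=3))     # (1/1 + 1/2 + 1/2) / 3 ≈ 0.667
```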
4.1. Semantic Segmentation
Semantic segmentation aims to predict the correct category label for each pixel while ensuring effective aggregation of pixels belonging to the same class. Existing methods typically adopt progressive upsampling and multi-level feature fusion [,], highlighting the critical role of feature fusion in segmentation. Motivated by this, we select semantic segmentation as a task to evaluate the effectiveness of the AHLFNet. This task demands low intra-class conflict and precise boundary localization, and the network’s optimized feature alignment and fusion help reduce intra-class conflict and mitigate edge drift, thereby improving overall segmentation performance.
Experimental Settings. We evaluate our method on mainstream segmentation datasets, including Cityscapes, ADE20K [], and COCO-Stuff []. In implementation, the proposed modules are integrated into baseline models such as SegFormer [], Mask2Former [], and SegNeXt [], while keeping their original training configurations. We use the AdamW [] optimizer with poly learning-rate decay, combined with standard data augmentation strategies including random flipping, scaling, and cropping. The batch size is set to 8 for Cityscapes and 16 for the other datasets. Training runs for 160 K iterations on Cityscapes and ADE20K, and 80 K iterations on COCO-Stuff. Three AHLFNet modules are inserted into the multi-scale feature fusion stages of SegFormer and Mask2Former, while two are inserted into SegNeXt, to comprehensively validate the effectiveness of the proposed method.
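As an illustration of the training schedule, the sketch below shows AdamW combined with a poly learning-rate decay as typically configured in segmentation frameworks; the base learning rate, weight decay, and decay power are placeholders and not the exact values used in our experiments.

```python
import torch

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Polynomial ('poly') learning-rate decay: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# base_lr, weight_decay, and power are placeholders for illustration only
base_lr, max_iters = 1e-4, 160_000
model = torch.nn.Conv2d(3, 19, kernel_size=1)            # stand-in for a segmentation model
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.01)

for it in range(max_iters):
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(base_lr, it, max_iters)    # update the learning rate before each step
    # forward pass, loss computation, loss.backward(), and optimizer.step() would go here
    break                                                 # loop body omitted in this sketch
```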
Experimental Results. Comparison with strong baselines: As shown in Table 2, SegFormer-B1 is used as the baseline segmentation model, upon which various state-of-the-art methods are compared on the ADE20K dataset. The proposed AHLFNet achieves the best improvement, with a gain of 2.9% mIoU.
Table 2.
Comparison of various state-of-the-art methods on the ADE20K dataset, using SegFormer-B1 as the base segmentation model.
Integration with advanced methods: As reported in Table 3, when combined with Mask2Former using a ResNet-50 backbone on the Cityscapes dataset, the AHLFNet achieves a 1.1% mIoU improvement, outperforming IFA, FaPN, and the original Mask2Former with offset-based fusion. Replacing the backbone with ResNet-101 further demonstrates the consistent effectiveness of the method.
Table 3.
Performance comparison on the Cityscapes using Mask2Former as the base segmentation model, with the proposed method integrated, against various segmentation methods.
Generalization across multiple mainstream models: Table 4 reports experiments integrating the AHLFNet with a variety of mainstream architectures, covering both convolutional networks and Transformer-based frameworks. UPerNet [] uses an FPN structure, while SegFormer and SegNeXt rely on feature concatenation for fusion. Despite the clear architectural differences, the AHLFNet consistently enhances performance, demonstrating strong adaptability to diverse model designs.
Table 4.
Performance improvement on the Cityscapes when integrating AHLFNet using single-scale inference across 3 segmentation models. The values in parentheses indicate the improvement over the original model.
Experiments on challenging datasets: Using SegNeXt as the base model, we conduct experiments on multiple challenging datasets. As shown in Table 5, the AHLFNet consistently improves performance across all three datasets. Specifically, it increases mIoU by 0.9%, 2.3%, and 2.1% on Cityscapes, ADE20K, and COCO-Stuff, respectively. Further experiments with SegMAN-L [] achieve optimal results of 84.6%, 53.5%, and 49.1% mIoU across the three datasets.
Table 5.
Evaluation of the generalizability and effectiveness of AHLFNet on the Cityscapes, ADE20K, and COCO-Stuff datasets, using SegNeXt-T and SegMAN-L as the base segmentation models.
4.2. Object Detection
In object detection, models are required to simultaneously address two key challenges: precise localization and accurate classification. Most mainstream detectors are built upon Feature Pyramid Networks (FPNs), making the effectiveness of feature fusion critical to overall performance. High-quality fusion produces semantically consistent and boundary-preserving feature representations, which significantly enhance both localization accuracy and class discrimination.
Experimental Settings. We conduct object detection experiments on the MS COCO [] dataset, using the classical Faster R-CNN framework with a ResNet-50 backbone for evaluation. The experiments are implemented based on the MMDetection [] codebase. A 1× training schedule, corresponding to 12 epochs, is adopted. Only the feature fusion modules within the FPN are replaced with our proposed design, while all other settings remain unchanged.
Experimental Results. As shown in Table 6, the AHLFNet demonstrates superior performance on the COCO dataset, achieving an overall AP improvement of 1.9% and significantly outperforming comparison methods such as CARAFE and FADE. Figure 5 presents the visualization of detection results from three different methods on the COCO dataset, where the AHLFNet results are closest to the ground truth. These experimental results demonstrate the effectiveness of the proposed high–low frequency collaborative feature alignment approach in enhancing object detection performance.
Table 6.
Performance comparison on the MS COCO dataset for Faster R-CNN with a ResNet-50 backbone, integrating various upsampling methods.
Figure 5.
Detection results of three different object detection methods on the COCO dataset. The results of AHLFNet are closest to the ground truth.
4.3. Instance Segmentation
Experimental Settings. We conduct instance segmentation experiments on the MS COCO benchmark dataset, using the classical Mask R-CNN [] framework with a ResNet-50 backbone. The experimental setup is consistent with that of object detection, except that our proposed improvements are applied only to the feature fusion modules within the FPN. The implementation is based on MMDetection. Considering that the FPN structure fuses features at the 4×, 8×, 16×, and 32× downsampling levels, we insert three AHLFNet modules for evaluation.
Experimental Results. As shown in Table 7, the AHLFNet achieves significant improvements on the COCO dataset. Specifically, mask AP (APm) increases by 1.2% and bounding-box AP (APb) by 1.7%, outperforming all compared methods. Figure 6 presents visualizations of three instance segmentation methods, where our method produces superior results. These findings validate the effectiveness of AHLFNet for instance segmentation tasks.
Table 7.
Performance comparison on the MS COCO dataset for Mask R-CNN with a ResNet-50 backbone, integrating various fusion methods. APb and APm denote box AP and mask AP, respectively, and 1× corresponds to 12 epochs.
Figure 6.
Comparison of three different instance segmentation methods. The results of AHLFNet are closest to the ground truth.
4.4. Ablation Studies
Experimental Settings. In this section, we conduct systematic ablation studies to evaluate the contributions of each component in the AHLFNet. We select the high-performance SegNeXt as the baseline model and perform evaluations on the ADE20K dataset.
Module Ablation Results. Using SegNeXt-T as the baseline, Table 8 analyzes the individual and combined effects of the ALFU, the GCS, and the AHFU. When used individually, the ALFU and the AHFU improve mIoU by 0.9% and 0.6%, respectively. Since the GCS serves as guidance for the ALFU, it cannot be used independently, and thus no separate ablation is conducted. When two modules are combined, performance surpasses that of any single module, and the use of all three modules achieves the best performance of 43.4% mIoU (+2.3%), demonstrating the synergistic effect of the complete framework.
Table 8.
Ablation study on ADE20K using SegNeXt-T as the baseline, evaluating the contributions of the ALFU, GCS, and the AHFU.
The impact of the number of groups and layers on performance in ALFU. Table 9 investigates the grouping strategy of the ALFU using SegNeXt-T as the baseline. Regardless of whether the kernel is fixed at 3 × 3, 5 × 5, or 7 × 7, a group number of 4 achieves the best performance. Therefore, this study adopts a group size of 4 for all experiments.
Table 9.
Ablation study on the number of groups for three different kernel scales in the ALFU, using SegNeXt-T as the baseline.
5. Conclusions
In this study, we propose the Adaptive High–Low Frequency Collaborative Auxiliary Feature Alignment Network (AHLFNet) to address intra-class conflict and edge drift in dense image prediction tasks. The network comprises three key components: the ALFU, which smooths high-level features; the GCS, which ensures that pixels with high similarity are sampled within their neighborhoods to prevent increased edge drift and intra-class conflict; and the AHFU, which enhances the high-frequency components of low-level features. The synergistic operation of these three components effectively mitigates intra-class conflict and inaccurate boundary localization. Extensive experiments demonstrate that AHLFNet achieves superior performance across multiple dense prediction tasks, including semantic segmentation, object detection, and instance segmentation.
6. Limitations and Future Work
Limitations in Computational Efficiency: Because AHLFNet includes adaptive frequency processing and multi-scale selection mechanisms, its computational overhead is higher than that of lightweight upsampling methods such as bilinear interpolation.
Although AHLFNet demonstrates excellent performance on standard benchmark datasets, we acknowledge that its performance is somewhat dependent on input image quality. Model performance may degrade significantly when the input is highly degraded by extreme environmental factors, such as severe low-light conditions or dense haze. This limitation highlights the gap between the current work and a fully robust visual perception system. Integrating advanced image enhancement and restoration methods (such as FreqSpatNet [] for low-light conditions or VNDHR [] for dehazing) as preprocessing modules with AHLFNet to form an end-to-end enhancement–understanding pipeline is key to overcoming this limitation and improving the model’s applicability in real-world scenarios. This provides a clear direction for our future research.
Author Contributions
Conceptualization, C.Y.; methodology, C.Y.; software, C.Y.; validation, C.Y.; formal analysis, C.Y.; investigation, J.L.; resources, J.L.; data curation, J.L.; writing—original draft preparation, C.Y.; writing—review and editing, J.L.; visualization, C.Y.; supervision, J.L.; project administration, J.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The code is available at https://www.github.com/yuecg/AHLFNet (accessed on 4 November 2025).
Acknowledgments
The authors gratefully acknowledge the visualization support provided by Qichen Wang from Qilu University of Technology.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Roh, W.; Jung, H.; Nam, G.; Lee, D.I.; Park, H.; Yoon, S.H.; Joo, J.; Kim, S. Insightful Instance Features for 3D Instance Segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 14057–14067. [Google Scholar]
- Shi, J.; Chen, G.; Chen, Y. Enhanced boundary perception and streamlined instance segmentation. Sci. Rep. 2025, 15, 23612. [Google Scholar] [CrossRef] [PubMed]
- Bär, A.; Houlsby, N.; Dehghani, M.; Kumar, M. Frozen feature augmentation for few-shot image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16046–16057. [Google Scholar]
- Maxwell, B.A.; Singhania, S.; Patel, A.; Kumar, R.; Fryling, H.; Li, S.; Sun, H.; He, P.; Li, Z. Logarithmic lenses: Exploring log rgb data for image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 17470–17479. [Google Scholar]
- Zhang, S.; Ni, Y.; Du, J.; Xue, Y.; Torr, P.; Koniusz, P.; van den Hengel, A. Open-World Objectness Modeling Unifies Novel Object Detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 30332–30342. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Li, T.; Li, C.; Lyu, J.; Pei, H.; Zhang, B.; Jin, T.; Ji, R. DAMamba: Vision State Space Model with Dynamic Adaptive Scan. arXiv 2025, arXiv:2502.12627. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Lu, H.; Liu, W.; Ye, Z.; Fu, H.; Liu, Y.; Cao, Z. SAPA: Similarity-aware point affiliation for feature upsampling. Adv. Neural Inf. Process. Syst. 2022, 35, 20889–20901. [Google Scholar]
- Lu, H.; Liu, W.; Fu, H.; Cao, Z. FADE: Fusing the assets of decoder and encoder for task-agnostic upsampling. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 231–247. [Google Scholar]
- Luo, C.; Lin, Q.; Xie, W.; Wu, B.; Xie, J.; Shen, L. Frequency-driven imperceptible adversarial attack on semantic similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15315–15324. [Google Scholar]
- Chen, L.; Gu, L.; Fu, Y. When semantic segmentation meets frequency aliasing. arXiv 2024, arXiv:2403.09065. [Google Scholar] [CrossRef]
- Masum, M.H.R. Visualizing and Understanding Convolutional Networks. In Proceedings of the Computer Vision–ECCV 2014, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
- Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
- Tian, Z.; He, T.; Shen, C.; Yan, Y. Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3126–3135. [Google Scholar]
- Zhao, R.; Wu, Z.; Zhang, Q. Learnable Heterogeneous Convolution: Learning both topology and strength. Neural Netw. 2021, 141, 270–280. [Google Scholar] [CrossRef] [PubMed]
- Wu, D.; Guo, Z.; Li, A.; Yu, C.; Sang, N.; Gao, C. Semantic segmentation via pixel-to-center similarity calculation. CAAI Trans. Intell. Technol. 2024, 9, 87–100. [Google Scholar] [CrossRef]
- Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe++: Unified content-aware reassembly of features. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4674–4687. [Google Scholar] [CrossRef] [PubMed]
- Mazzini, D. Guided upsampling network for real-time semantic segmentation. arXiv 2018, arXiv:1807.07466. [Google Scholar] [CrossRef]
- Huang, Z.; Wei, Y.; Wang, X.; Liu, W.; Huang, T.S.; Shi, H. Alignseg: Feature-aligned segmentation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 550–557. [Google Scholar] [CrossRef] [PubMed]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Chang, S.; Wang, P.; Luo, H.; Wang, F.; Shou, M.Z. Revisiting vision transformer from the view of path ensemble. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 19889–19899. [Google Scholar]
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1857–1866. [Google Scholar]
- Zhang, Z.; Zhang, X.; Peng, C.; Xue, X.; Sun, J. Exfuse: Enhancing feature fusion for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 269–284. [Google Scholar]
- Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6399–6408. [Google Scholar]
- Pitas, I. Digital Image Processing Algorithms and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2000. [Google Scholar]
- Yin, D.; Gontijo Lopes, R.; Shlens, J.; Cubuk, E.D.; Gilmer, J. A fourier perspective on model robustness in computer vision. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 13276–13286. [Google Scholar]
- Wang, H.; Wu, X.; Huang, Z.; Xing, E.P. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8684–8694. [Google Scholar]
- Rahaman, N.; Baratin, A.; Arpit, D.; Draxler, F.; Lin, M.; Hamprecht, F.; Bengio, Y.; Courville, A. On the spectral bias of neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: Cambridge, MA, USA, 2019; pp. 5301–5310. [Google Scholar]
- Xu, Z.J.; Zhou, H. Deep frequency principle towards understanding why deeper learning is faster. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 19–21 May 2021; Volume 35, pp. 10541–10550. [Google Scholar]
- Zou, X.; Xiao, F.; Yu, Z.; Li, Y.; Lee, Y.J. Delving deeper into anti-aliasing in convnets. Int. J. Comput. Vis. 2023, 131, 67–81. [Google Scholar] [CrossRef]
- Grabinski, J.; Jung, S.; Keuper, J.; Keuper, M. Frequencylowcut pooling-plug and play against catastrophic overfitting. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 36–57. [Google Scholar]
- Chen, L.; Fu, Y.; Wei, K.; Zheng, D.; Heide, F. Instance segmentation in the dark. Int. J. Comput. Vis. 2023, 131, 2198–2218. [Google Scholar] [CrossRef]
- Magid, S.A.; Zhang, Y.; Wei, D.; Jang, W.D.; Lin, Z.; Fu, Y.; Pfister, H. Dynamic high-pass filtering and multi-spectral attention for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4288–4297. [Google Scholar]
- Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 783–792. [Google Scholar]
- Huang, Z.; Zhang, Z.; Lan, C.; Zha, Z.J.; Lu, Y.; Guo, B. Adaptive frequency filters as efficient global token mixers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6049–6059. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
- Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
- Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 633–641. [Google Scholar]
- Caesar, H.; Uijlings, J.; Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1209–1218. [Google Scholar]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
- Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. Segnext: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
- Lu, H.; Dai, Y.; Shen, C.; Xu, S. Index networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 242–255. [Google Scholar] [CrossRef] [PubMed]
- Dai, Y.; Lu, H.; Shen, C. Learning affinity-aware upsampling for deep image matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6841–6850. [Google Scholar]
- Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6027–6037. [Google Scholar]
- Hu, H.; Chen, Y.; Xu, J.; Borse, S.; Cai, H.; Porikli, F.; Wang, X. Learning implicit feature alignment function for semantic segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 487–505. [Google Scholar]
- Huang, S.; Lu, Z.; Cheng, R.; He, C. FaPN: Feature-aligned pyramid network for dense image prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 864–873. [Google Scholar]
- Fu, Y.; Lou, M.; Yu, Y. SegMAN: Omni-scale context modeling with state space models and local attention for semantic segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 19077–19087. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar] [CrossRef]
- Guan, Y.; Liu, M.; Chen, X.; Wang, X.; Luan, X. FreqSpatNet: Frequency and Spatial Dual-Domain Collaborative Learning for Low-Light Image Enhancement. Electronics 2025, 14, 2220. [Google Scholar] [CrossRef]
- Liu, Y.; Wang, X.; Hu, E.; Wang, A.; Shiri, B.; Lin, W. VNDHR: Variational single nighttime image Dehazing for enhancing visibility in intelligent transportation systems via hybrid regularization. IEEE Trans. Intell. Transp. Syst. 2025, 26, 10189–10203. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).