Article

ERZA-DETR: A Deep Learning-Based Detection Transformer with Enhanced Relational-Zone Aggregation for WCE Lesion Detection

School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou 213168, China
*
Author to whom correspondence should be addressed.
Algorithms 2026, 19(4), 268; https://doi.org/10.3390/a19040268
Submission received: 8 February 2026 / Revised: 19 March 2026 / Accepted: 26 March 2026 / Published: 1 April 2026

Abstract

Wireless capsule endoscopy (WCE) plays a vital role in non-invasive screening of small intestinal lesions. However, the automated detection of lesions remains challenging due to low contrast, uneven illumination, and severe visual variability across images. Existing convolutional detectors rely heavily on manually designed anchors and post-processing, while end-to-end detection transformers developed for natural images exhibit limited adaptability to the complex texture and spectral characteristics of WCE data. To overcome these limitations, this study proposes a deep learning-based detection transformer with enhanced relational-zone aggregation, termed ERZA-DETR, specifically tailored for WCE lesion detection. The framework integrates three complementary modules: a Dual-Band Adaptive Fourier Spectral module (DBFS) that recalibrates frequency responses to suppress illumination artifacts and highlight lesion boundaries; a Fused Dual-scale Gated Convolutional module (FD-gConv) that selectively fuses multi-scale texture features; and a Graph-Linked Embedding at Semantic Scales module (GLES) that preserves local topological relationships through coordinate-gated aggregation. Experimental evaluations on the SEE-AI small intestine dataset demonstrate that ERZA-DETR achieves a 3.2% improvement in mAP@50 and a 12.4% reduction in parameters compared with RT-DETRv2, striking a superior balance between detection accuracy, computational efficiency, and clinical applicability.

1. Introduction

Wireless capsule endoscopy (WCE) has become an indispensable non-invasive tool for comprehensive examination of the small intestine, enabling visualization of regions that are difficult to access using conventional endoscopy [1,2]. By eliminating sedation and reducing patient discomfort, WCE has been widely adopted for screening and diagnosis of small bowel diseases, including obscure gastrointestinal bleeding, inflammatory disorders, and early-stage tumors. However, a single WCE examination typically produces tens of thousands of frames, among which lesion-containing images constitute only a small fraction. Manual review is therefore extremely time-consuming and highly susceptible to fatigue-induced oversight, motivating the development of AI-based computer-aided detection systems to improve efficiency and diagnostic reliability [3,4,5].
Despite substantial progress in deep learning-based WCE analysis, automated lesion detection remains particularly challenging due to the intrinsic characteristics of capsule endoscopy images. In contrast to natural images, WCE data exhibit severe illumination variability, specular reflections, fluid interference, motion blur, and low contrast between lesions and surrounding mucosa [6]. These factors collectively give rise to a noisy-data scenario, in which subtle pathological patterns are readily masked by background interference. As a consequence, detectors that perform well on conventional benchmarks often fail to generalize effectively to WCE scenarios, especially when targeting small or visually ambiguous lesions.
Existing detection paradigms suffer from inherent limitations when applied to WCE imagery. Detectors based on convolutional neural networks (CNN) depend on fixed receptive fields and hierarchical downsampling, which inevitably weaken high-frequency boundary cues and cause small lesions to vanish in deeper feature representations. This progressive feature compression also makes it difficult to preserve fine-grained lesion topology under severe illumination variation. While detectors based on transformers introduce global context modeling and end-to-end prediction, their high computational cost and reliance on pretraining with large-scale natural images hinder adaptability to the complex texture distributions and spectral characteristics of WCE data, and they still provide limited mechanisms to explicitly suppress illumination-related interference during representation learning.
In addition to spatial modeling, the temporal continuity of capsule endoscopy videos has also motivated attempts to exploit inter-frame dependencies for automated analysis. Recent studies investigate temporal feature aggregation across adjacent frames or spatiotemporal modeling of capsule videos to improve the stability of lesion recognition [7,8]. However, practical deployment remains constrained by the extreme length of WCE sequences, the scarcity of frame-level temporal annotations, and the intermittent visibility of lesions caused by rapid capsule motion. Consequently, most current computer-aided detection systems for WCE still rely primarily on frame-level spatial analysis, leaving the suppression of illumination artifacts and the preservation of fine-grained lesion structures largely dependent on spatial representation learning.
These observations expose two specific technical gaps that critically affect WCE lesion detection performance. First, non-uniform illumination and low-frequency background interference dominate WCE images, masking subtle lesion boundaries and degrading feature discriminability. This challenge calls for frequency-aware modeling capable of selectively suppressing illumination artifacts while enhancing diagnostically relevant high-frequency cues. Second, small and semantic lesions tend to be overwhelmed during progressive downsampling and cross-scale fusion, leading to the loss of local structural evidence. Addressing this issue requires semantic-scale modeling that preserves spatial topology and reinforces local relationships without introducing prohibitive computational overhead.
To address these challenges, we propose ERZA-DETR, an enhanced relational-zone aggregation detection transformer tailored for WCE lesion detection. Built upon RT-DETRv2, ERZA-DETR introduces three lightweight, complementary modules that explicitly target the aforementioned gaps: the Dual-Band Adaptive Fourier Spectral module (DBFS), the Fused Dual-scale Gated Convolutional module (FD-gConv), and the Graph-Linked Embedding at Semantic Scales module (GLES).
Our main contributions are summarized as follows:
  • We propose ERZA-DETR, a detection transformer framework for wireless capsule endoscopy (WCE) lesion detection that incorporates frequency-domain enhancement and structure-aware feature modeling. The proposed architecture is designed to better cope with the complex texture and spectral characteristics of WCE imagery.
  • We introduce three complementary modules to enhance representation learning within the detection transformer. The Dual-Band Adaptive Fourier Spectral module (DBFS) performs adaptive frequency-domain modulation to refine spectral responses under complex illumination conditions. The Fused Dual-scale Gated Convolution module (FD-gConv) improves multi-scale feature aggregation for small lesion representation. The Graph-Linked Embedding at Semantic Scales module (GLES) models structural relationships on semantic feature maps through coordinate-gated graph aggregation.
  • Through collaborative integration of these modules into the ERZA-DETR framework, we form an efficient detection architecture that not only enhances lesion representation and sensitivity to subtle, low-contrast lesions but also maintains real-time inference performance, thus providing a clinically viable solution for automated WCE lesion detection.
  • Extensive experiments on the SEE-AI WCE dataset validate the effectiveness of the proposed design. ERZA-DETR demonstrates superior detection capability compared with strong baselines and recent detectors, particularly in challenging scenarios involving small or visually subtle lesions.

2. Related Work

2.1. Object Detection in WCE

Deep learning-based object detection has been widely explored for WCE lesion analysis, with convolutional detectors forming the dominant paradigm. One-stage detectors such as the YOLO series achieve favorable speed-accuracy trade-offs [9], a characteristic that has enabled their extensive application to WCE datasets. These models are typically adapted to the characteristics of gastrointestinal imagery through task-specific architectural modifications. For instance, Ye et al. proposed TSD-YOLO, which introduces a Tiny Detection Layer to preserve shallow features for micro-lesions and integrates a parameter-free SimAM attention module to enhance lesion representation, thereby improving localization performance under low-contrast conditions [10]. Successive iterations of the YOLO family, most notably YOLOv12, further advance detection capability by improving multi-scale prediction strategies and integrating attention-centric feature aggregation modules [11].
Despite these improvements, one-stage detectors for WCE still rely on manually designed anchors and heuristic post-processing such as non-maximum suppression, which hinder end-to-end optimization and degrade localization accuracy for small or low-contrast lesions commonly observed in WCE images.
Two-stage detectors, including Faster R-CNN [12] and Mask R-CNN [13], improve detection accuracy through proposal refinement and region-wise feature pooling, but incur substantial computational overhead, limiting their suitability for real-time or large-scale clinical deployment. This limitation is particularly evident for micro-lesions in WCE images, which may occupy only a few pixels and exhibit weak texture contrast under severe illumination variation; detecting them requires generating a large number of tiny candidate boxes and performing fine-grained feature processing, which further exacerbates the computational burden and reduces practicality. To improve multi-pathology detection performance, Vieira et al. proposed MI-RCNN, an enhanced framework based on Mask R-CNN and PANet [14] that propagates low-level feature information to the mask subnet and performs multi-scale feature fusion, enabling simultaneous detection and segmentation of several WCE pathologies such as bleeding, angioectasias, polyps, and inflammatory lesions [15]. However, the fixed receptive field and aggressive downsampling inherent to CNN-based architectures often cause fine-grained structural and boundary details to be lost in deeper layers, which is particularly detrimental for detecting subtle mucosal abnormalities with weak texture contrast and ambiguous boundaries.
To alleviate these limitations, attention mechanisms and regional contextual modeling have been increasingly integrated into CNN architectures for WCE analysis. For example, Alam et al. proposed RAt-CapsNet, which combines volumetric attention-based feature compression with regional correlation modeling to capture pixel-level relationships between lesions and surrounding tissues [16]. Similarly, Jain et al. introduced WCENet, a two-stage framework that integrates attention-based CNN classification with segmentation refinement for lesion localization [17]. These designs enhance the robustness of lesion representation under complex mucosal patterns and illumination variation. However, since they still rely primarily on local convolutional receptive fields, their ability to capture long-range contextual dependencies and global structural relationships in gastrointestinal images remains limited. This limitation motivates the transition toward transformer-based architectures, which explicitly capture global context for more accurate lesion localization.

2.2. Transformer-Based Detection and Hybrid Frameworks

Transformer-based detectors, exemplified by DETR [18] and its variants, introduce global self-attention and set-based prediction to eliminate heuristic components in traditional detection pipelines. Subsequent improvements such as Deformable DETR enhance small-object detection by attending to sparse but informative spatial locations [19]. Recent real-time hybrid frameworks, including RT-DETR [20] and RT-DETRv2 [21], further improve efficiency through lightweight encoder–decoder designs, multi-scale feature alignment, and simplified training strategies, making them attractive for time-sensitive clinical applications. Inspired by these advances, transformer-based architectures have also begun to be explored for wireless capsule endoscopy analysis. For instance, Hosain et al. introduced a transformer-based framework for gastrointestinal disorder detection, demonstrating that global attention mechanisms can effectively capture contextual dependencies within capsule endoscopy images and improve disease recognition performance [22]. More recently, Habe et al. further confirmed the superiority of RT-DETR variants in achieving real-time, high-precision lesion localization within continuous WCE streams, highlighting the potential of transformer architectures for improving lesion localization accuracy in continuous capsule endoscopy streams [23].
Despite these advances, most transformer-based detectors are primarily developed and optimized for natural images and lack domain-specific adaptations for WCE data. In particular, illumination variability, texture heterogeneity, and subtle lesion morphology are insufficiently modeled by spatial-domain attention alone. Uniform downsampling and scale-agnostic fusion strategies may discard critical fine-grained topology, leading to imprecise boundary delineation for micro-lesions. While emerging sequence modeling approaches such as state space models have demonstrated efficiency in long-range dependency modeling [24,25], their limited spatial inductive bias and immature support for dense localization tasks constrain their applicability to fine-grained lesion detection. These limitations motivate the exploration of hybrid architectures that incorporate domain-aware feature modeling beyond purely spatial representations.

2.3. Frequency Domain and Graph Learning in Vision

Frequency-domain modeling has gained increasing attention for its ability to decouple illumination components and emphasize high-frequency structural cues, which are particularly relevant for medical images with subtle texture variations. Methods leveraging Fourier or cosine transforms have demonstrated effectiveness in enhancing boundary information and suppressing low-frequency interference in dense prediction tasks by recalibrating spectral responses, such as DCT-based attention mechanisms in MADGNet [26] and Fourier-domain adaptive filtering in FDConv [27]. However, purely frequency-based filtering may inadvertently amplify irrelevant noise or artifacts, necessitating adaptive mechanisms that can dynamically fuse spatial and frequency representations.
Complementarily, graph-based learning has shown promise in modeling relational and structural dependencies for visual recognition. Graph-based and higher-order relational frameworks can explicitly encode spatial and semantic interactions among local regions, thereby enhancing contextual inference and improving lesion localization in complex surroundings. Existing graph reasoning methods typically deal with high-level features or query features. For instance, CIGAR [28] builds cross-modal relationship graphs on high-level features to enhance domain-adaptive object detection, while EGTR [29] extracts relationship graphs from transformer attention maps to generate scene graphs, which enhance the global semantics but often sacrifice spatial details. More recent efforts explore higher-order relational modeling through hypergraph structures, such as HyperACE in YOLOv13 [30], to strengthen global context aggregation. Nevertheless, these designs are computationally expensive when applied to dense feature maps and may suppress weak but informative local signals. Recent studies suggest that minimally downsampled feature maps preserve crucial spatial and boundary information for small-object detection. For example, HS-FPN [31] enriches early-stage features with high-frequency perception, while RevBiFPN [32] demonstrates that iterative bidirectional fusion can concentrate multi-scale information into compact representations. Lightweight and localized graph modeling strategies applied at these stages can enrich relational context while maintaining structural integrity. Together, these observations highlight the potential of adaptive spatial–frequency fusion and scale-aware, locally constrained graph reasoning as complementary tools for robust lesion detection in WCE imagery.

3. Materials and Methods

The framework architecture of ERZA-DETR is shown in Figure 1. The model is built on the RT-DETRv2 architecture and integrates three lightweight modules, namely the Dual-Band Adaptive Fourier Spectral module (DBFS), the Fused Dual-scale Gated Convolutional module (FD-gConv), and the Graph-Linked Embedding at Semantic Scales module (GLES), which enhance frequency-domain modeling, multi-scale feature fusion, and structural perception, respectively. These modules are embedded in the encoding and fusion stages, effectively enhancing the detection of fine-grained lesions while maintaining a relatively low computational cost.

3.1. Dual-Band Adaptive Fourier Spectral Module (DBFS)

To enhance spatial attention with complementary frequency information, we propose a dual-path mechanism termed DBFS, where a lightweight spectral branch operates in parallel with the multi-head self-attention (MHSA). Specifically, the spectral branch adaptively modulates frequency bands and generates spectral cues, which are injected as a residual representation into the attention pathway. As shown in Figure 2, this cooperative design forms the DBFS module and delivers an enhanced token representation.
Given an input token feature map $x \in \mathbb{R}^{B \times N \times C}$, where $B$ is the batch size, $N = H \times W$ denotes the number of spatial tokens, $H$ and $W$ are the height and width of the feature map, and $C$ is the feature dimension, we first compute the standard attention output:
$$\mathrm{src}_{\mathrm{attn}} = \mathrm{MHSA}(x) \in \mathbb{R}^{B \times N \times C}$$
To generate the frequency-enhanced branch, we permute and reshape $x$ into a 2D layout and apply a 2D Fourier transform:
$$x_{\mathrm{freq}} = \mathrm{FFT2}\big(\mathrm{reshape}\big(\mathrm{permute}(x, (0, 2, 1))\big)\big) \in \mathbb{C}^{B \times C \times H \times W}$$
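As a shape-level illustration, the token-to-grid conversion and forward transform can be sketched in NumPy; the tensor sizes below are arbitrary placeholders, not values from the paper:

```python
import numpy as np

# Hypothetical sizes for illustration only.
B, C, H, W = 2, 8, 16, 16
N = H * W

x = np.random.randn(B, N, C)                      # token features (B, N, C)
x_2d = x.transpose(0, 2, 1).reshape(B, C, H, W)   # permute(0, 2, 1) then reshape
x_freq = np.fft.fft2(x_2d)                        # 2D FFT over the last two axes

assert x_freq.shape == (B, C, H, W)
assert np.iscomplexobj(x_freq)
```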
Complementary masks derived from a fixed geometric partition of the frequency plane are employed to separate low and high frequencies. Let $(u_0, v_0) = (H/2, W/2)$ be the frequency-center indices and let $r \in \mathbb{Z}^{+}$ denote the half-side length of the square low-frequency region. To allow the frequency partitioning strategy to generalize across feature maps of different sizes, $r$ is designed as a scale-aware fixed parameter proportional to the input spatial resolution, rather than a learnable parameter, restricting the low-frequency region to a compact central area of the frequency plane; learnable frequency boundaries were found to cause unstable spectral partitioning during training. The specific value of $r$ is therefore fixed and determined by sensitivity analysis, detailed in the experimental section on Frequency Radius Sensitivity Analysis. Based on this definition of $r$, the binary low-frequency mask $M_{\mathrm{low}} \in \{0, 1\}^{H \times W}$ is defined as:
$$M_{\mathrm{low}}(u, v) = \begin{cases} 1, & \text{if } u_0 - r \le u \le u_0 + r - 1 \text{ and } v_0 - r \le v \le v_0 + r - 1, \\ 0, & \text{otherwise}, \end{cases}$$
where $(u, v)$ are the frequency coordinates. The high-frequency mask $M_{\mathrm{high}} \in \{0, 1\}^{H \times W}$ is then obtained as the complement of $M_{\mathrm{low}}$:
$$M_{\mathrm{high}}(u, v) = 1 - M_{\mathrm{low}}(u, v).$$
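A minimal NumPy sketch of the mask construction, assuming 0-indexed frequency coordinates consistent with the definition above (the helper name and grid size are illustrative):

```python
import numpy as np

def lowpass_mask(H, W, r):
    """Binary square low-frequency mask centred at (H // 2, W // 2)."""
    u0, v0 = H // 2, W // 2
    M = np.zeros((H, W), dtype=np.uint8)
    # Slice covers rows u0 - r .. u0 + r - 1 and columns v0 - r .. v0 + r - 1.
    M[u0 - r:u0 + r, v0 - r:v0 + r] = 1
    return M

M_low = lowpass_mask(16, 16, r=3)
M_high = 1 - M_low                     # complement mask
assert int(M_low.sum()) == (2 * 3) ** 2  # a 2r x 2r low-frequency square
```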
To enhance the spectral center, the low-frequency branch applies a learnable complex-valued spectral filter $W_{\mathrm{low}} \in \mathbb{C}^{C \times 2r \times 2r}$ to the masked spectrum:
$$x_{\mathrm{low}} = M_{\mathrm{low}} \cdot W_{\mathrm{low}} \cdot x_{\mathrm{freq}}.$$
In contrast, the high-frequency branch applies a real-valued channel-wise scaling factor $\gamma_{\mathrm{high}} \in \mathbb{R}^{C \times 1 \times 1}$ to selectively emphasize fine-grained details in the masked spectrum:
$$x_{\mathrm{high}} = M_{\mathrm{high}} \cdot \gamma_{\mathrm{high}} \cdot x_{\mathrm{freq}}.$$
The outputs of the two branches are subsequently summed in the spectral domain and transformed back into token format:
$$x_{\mathrm{freq\_enh}} = \mathrm{permute}\big(\mathrm{reshape}\big(\mathrm{IFFT2}(x_{\mathrm{low}} + x_{\mathrm{high}})\big), (0, 2, 1)\big) \in \mathbb{R}^{B \times N \times C}.$$
Finally, the frequency-enhanced features are injected into the attention output via residual fusion:
$$\mathrm{src} = \mathrm{Dropout}(\mathrm{src}_{\mathrm{attn}}) + x_{\mathrm{freq\_enh}}$$
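The spectral branch and residual fusion can be sketched in PyTorch as follows. This is an illustrative approximation rather than the authors' implementation: the class name is invented, $W_{\mathrm{low}}$ is stored densely over the full frequency plane instead of as a $C \times 2r \times 2r$ filter, the real part is taken after the inverse transform, and a plain dropout on the input stands in for the MHSA output:

```python
import torch

class DBFSSpectralBranch(torch.nn.Module):
    """Sketch of the DBFS spectral branch (names and storage are assumptions)."""

    def __init__(self, C, H, W, r):
        super().__init__()
        self.H, self.W = H, W
        u0, v0 = H // 2, W // 2
        mask = torch.zeros(H, W)
        mask[u0 - r:u0 + r, v0 - r:v0 + r] = 1.0   # square low-frequency region
        self.register_buffer("m_low", mask)
        # Learnable complex filter, stored dense over (C, H, W) for simplicity.
        self.w_low = torch.nn.Parameter(torch.ones(C, H, W, dtype=torch.cfloat))
        # Real-valued channel-wise high-frequency scaling factor.
        self.gamma_high = torch.nn.Parameter(torch.ones(C, 1, 1))

    def forward(self, x):                           # x: (B, N, C) tokens
        B, N, C = x.shape
        x2d = x.permute(0, 2, 1).reshape(B, C, self.H, self.W)
        xf = torch.fft.fft2(x2d)
        x_low = self.m_low * self.w_low * xf              # spectral, low band
        x_high = (1 - self.m_low) * self.gamma_high * xf  # spectral, high band
        enh = torch.fft.ifft2(x_low + x_high).real        # back to spatial domain
        return enh.reshape(B, C, N).permute(0, 2, 1)      # tokens (B, N, C)

branch = DBFSSpectralBranch(C=8, H=16, W=16, r=3)
x = torch.randn(2, 256, 8)                  # x stands in for the MHSA output here
src = torch.nn.functional.dropout(x, p=0.1) + branch(x)   # residual fusion
```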
To further substantiate the frequency-domain modulation effect of DBFS, we visualize the energy distribution of frequency components in feature maps before and after DBFS processing, as shown in Figure 3. Low-frequency components encode global background illumination variations, and DBFS suppresses their energy through adaptive modulation of $W_{\mathrm{low}}$ to reduce illumination-induced noise. Meanwhile, high-frequency components are closely related to lesion edge details, and their relative energy is preserved or enhanced by $\gamma_{\mathrm{high}}$. This directly validates the core design assumption of DBFS, which is to suppress illumination artifacts while preserving fine-grained pathological details in the Fourier domain.
To quantitatively analyze the dual-band modulation behavior of DBFS, we compute the mean signed frequency response over concentric square frequency bands derived from the real part of the Fourier spectrum, as shown in Figure 4.
In the low-frequency bands (indices 1–4), the unprocessed features exhibit pronounced response fluctuations, oscillating between positive and negative values with magnitudes of 0.1091, 0.1375, 0.1042, and 0.0784, and peaking in magnitude at band index 2. Such instability indicates that the baseline features contain biased low-frequency responses that distort the global structural representation. After applying DBFS, these responses become significantly more stable, with magnitudes confined to the narrower interval between 0.0219 and 0.0721. This stabilization suggests that the learnable spectral filter $W_{\mathrm{low}}$ effectively regularizes low-frequency components, reducing response oscillations while preserving the global structural information encoded in the feature maps. For the high-frequency bands (indices 5–10), the baseline features exhibit relatively strong responses with magnitudes ranging from 0.0452 to 0.1178, indicating amplified high-frequency components that are often associated with background perturbations, illumination artifacts, or texture noise. After DBFS processing, these responses are consistently compressed, with magnitudes between 0.0023 and 0.0325. This behavior confirms that the channel-wise scaling factor $\gamma_{\mathrm{high}}$ selectively attenuates redundant high-frequency responses while retaining the discriminative fine-grained patterns necessary for pathological feature representation. Overall, the reduced response magnitude and improved stability across frequency bands demonstrate that DBFS performs controlled dual-band frequency modulation, simultaneously stabilizing low-frequency structural information and compressing redundant high-frequency responses. This balanced spectral regulation suppresses artifact-driven activations while preserving diagnostically relevant structures and textures, which is particularly beneficial for detecting subtle lesions in wireless capsule endoscopy imagery.
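The band-wise statistic described above can be reproduced with a short NumPy sketch; the number of bands, the equal-width band edges, and the use of the shifted spectrum are assumptions of this illustration:

```python
import numpy as np

def band_mean_response(feat, n_bands=10):
    """Mean real-part Fourier response over concentric square frequency bands.

    feat: (C, H, W) feature map. Band edges are equal-width (an assumption).
    """
    C, H, W = feat.shape
    # Centre the spectrum so low frequencies sit in the middle of the plane.
    spec = np.fft.fftshift(np.fft.fft2(feat), axes=(-2, -1)).real
    u0, v0 = H // 2, W // 2
    edges = np.linspace(0, min(u0, v0), n_bands + 1).astype(int)
    uu, vv = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    radius = np.maximum(np.abs(uu - u0), np.abs(vv - v0))  # square "rings"
    means = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = (radius >= lo) & (radius < max(hi, lo + 1))
        means.append(spec[:, band].mean())
    return np.array(means)

resp = band_mean_response(np.random.randn(8, 64, 64))
assert resp.shape == (10,)
```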
We further visualize intermediate activations to elucidate the functional role and effectiveness of the proposed DBFS branch, as shown in Figure 5. The input features to the DBFS branch primarily encode coarse, color-driven global patterns and exhibit limited sensitivity to lesion-specific structures, which is consistent with the inherent limitation of early-to-middle encoder features that prioritize general semantic information over fine-grained pathological details.
As illustrated in Figure 5b, the standard spatial-attention map tends to produce dispersed responses and may activate irrelevant regions, leading to incorrect lesion localization in the corresponding overlay. In contrast, the DBFS-refined features in Figure 5c exhibit more concentrated and structurally consistent activations, which align better with the true lesion regions. Notably, a lesion that is incorrectly identified by the baseline attention map is correctly localized after DBFS refinement, demonstrating the effectiveness of the proposed module in suppressing spurious activations and enhancing discriminative lesion representations.
In contrast, after DBFS refinement, the activation map becomes remarkably discriminative, with stronger, spatially contiguous responses concentrated precisely around lesion regions while diffuse background signals and illumination-induced artifacts are effectively suppressed, directly validating the core design goal of the DBFS branch—enhancing feature selectivity for clinically meaningful targets. Notably, the DBFS-refined activation map shares coarse-scale structural consistency with the map from the standard spatial-attention pathway, which indicates both methods can capture major lesion locations. However, a critical fine-grained distinction emerges: DBFS minimizes false activations induced by lighting inconsistencies and attenuates blurred background responses that obscure small or low-contrast lesions, thus improving the fidelity of subtle pathological cues. This advantage underscores its unique value as a frequency-aware complement to encoder representations—spatial attention focuses on spatial dependencies, while DBFS targets frequency-domain noise and fine-structure enhancement. Their combination thus yields more robust and clinically actionable features.
In this way, frequency-domain enhancement is employed as a lightweight auxiliary feature information source without altering the original Transformer structure, thereby enhancing the robustness towards fine-grained textures and illumination artifacts.

3.2. Fused Dual-Scale Gated Convolutional Module (FD-gConv)

The cross-scale feature fusion (CCFF) stage in real-time detectors plays a crucial role in reconciling multi-scale information. However, conventional residual blocks often fuse spatial and channel cues in an uncoordinated way, directly combining raw input features with transformed ones without explicit modulation. This indiscriminate fusion can either suppress informative details or fail to filter redundant activations. To address this issue, we introduce FD-gConv, as shown in Figure 6, a filter-driven gating block designed to refine intermediate features by explicitly decoupling spatial evidence extraction and semantic filtering, while maintaining lightweight complexity. The block integrates three core components: a value pathway for spatial encoding, a gate pathway for semantic filtering, and a fusion mechanism that incorporates residual connections to ensure stable feature propagation. Unlike conventional residual designs, our approach resolves the "uncoordinated fusion" problem by inserting selective filtering mechanisms before residual integration, thus retaining the optimization stability of residual connections while eliminating their indiscriminate fusion behavior.

3.2.1. Value Pathway with Depthwise Convolution

The value pathway generates spatially enhanced features using depthwise convolutions. Given an input feature map $x \in \mathbb{R}^{B \times C \times H \times W}$, the value representation $v$ is computed as $v = \mathrm{DWConv}_{k \times k}(\mathrm{Conv}_{1 \times 1}(x))$. Here, the $1 \times 1$ convolution retains the original channel dimension $C$, and the subsequent depthwise convolution $\mathrm{DWConv}_{k \times k}$ operates on each channel independently to capture spatial structures. This design preserves localized details such as lesion boundaries with minimal computational overhead, emphasizing per-channel spatial acuity rather than inter-channel mixing to produce a clean spatial evidence map.

3.2.2. Gate Pathway with Channel Modulation

The gate pathway performs semantic filtering through learnable modulation, incorporating a channel-wise balancing mechanism to prevent over-suppression. It transforms the input $x$ into a semantic mask $g$ through a sequence of operations. First, a base gate is generated via $g_0 = \mathrm{Conv}_{1 \times 1}(x)$, where the $1 \times 1$ convolution aggregates cross-channel dependencies to support semantic discrimination between lesions and background; its output dimension is kept at $C$ to achieve strict feature alignment with the value pathway. Next, a per-channel learnable modulator $m \in \mathbb{R}^{1 \times C \times 1 \times 1}$ refines channel-specific responses to produce a calibrated gate $g_2 = g_0 \odot m$. An adaptive balance factor $\lambda = \sigma(\alpha)$, with $\alpha \in \mathbb{R}^{1 \times C \times 1 \times 1}$ a trainable parameter tensor, regulates the fusion of $g_2$ and $g_0$ into $g_{\mathrm{mix}} = (1 - \lambda) \cdot g_2 + \lambda \cdot g_0$. This balance allows the network to emphasize channels that benefit from modulation while preserving unmodulated pathways for others. The final gate $g = \sigma(g_{\mathrm{mix}})$ yields a normalized semantic mask that enables the gate pathway to selectively filter redundant activations, thereby preventing over-suppression and stabilizing stacked usage within CCFF.

3.2.3. Residual Feature Fusion

The fusion mechanism combines outputs from the value and gate pathways while maintaining residual stability. The spatially enhanced features $v$ are modulated by the semantic mask $g$ via element-wise multiplication $v \odot g$, applying the gate pathway's filtering effect. The result is then projected through a $1 \times 1$ convolution to align with the original input's channel dimension $C$, producing $v_{\mathrm{proj}} = \mathrm{Conv}_{1 \times 1}(v \odot g) \in \mathbb{R}^{B \times C \times H \times W}$. Residual integration finally combines $v_{\mathrm{proj}}$ with a bypass pathway from the original input $x$. This design ensures that even when the gate pathway filters aggressively, essential contextual cues propagate stably through the residual connection, preventing over-suppression of weak lesion signals while maintaining optimization stability through the persistent gradient path.
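Putting the three components together, a minimal PyTorch sketch of the block might look as follows; the class and attribute names and the kernel size $k = 3$ are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class FDgConv(nn.Module):
    """Sketch of FD-gConv: value pathway, gate pathway, residual fusion."""

    def __init__(self, C, k=3):
        super().__init__()
        # Value pathway: 1x1 conv then depthwise k x k conv (groups=C).
        self.v_proj = nn.Conv2d(C, C, 1)
        self.v_dw = nn.Conv2d(C, C, k, padding=k // 2, groups=C)
        # Gate pathway: 1x1 conv, per-channel modulator m, balance parameter alpha.
        self.g_proj = nn.Conv2d(C, C, 1)
        self.m = nn.Parameter(torch.ones(1, C, 1, 1))
        self.alpha = nn.Parameter(torch.zeros(1, C, 1, 1))
        self.out_proj = nn.Conv2d(C, C, 1)

    def forward(self, x):                       # x: (B, C, H, W)
        v = self.v_dw(self.v_proj(x))           # spatial evidence map
        g0 = self.g_proj(x)                     # base gate
        g2 = g0 * self.m                        # channel-calibrated gate
        lam = torch.sigmoid(self.alpha)         # adaptive balance factor
        g = torch.sigmoid((1 - lam) * g2 + lam * g0)  # normalized semantic mask
        return x + self.out_proj(v * g)         # residual fusion with bypass x

block = FDgConv(C=16)
y = block(torch.randn(2, 16, 32, 32))
assert y.shape == (2, 16, 32, 32)
```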

3.3. Graph-Linked Embedding at Semantic Scales Module (GLES)

Modern real-time detectors often retain a lowest-resolution feature map after successive downsampling and cross-scale fusion, which predominantly encodes high-level semantic information with limited explicit spatial structure. Most existing methods like FPN [33] or PANet utilize this low-resolution feature map as an output for fusion or decoding but rarely model its internal spatial dependencies. To address this gap with minimal computational overhead, we propose GLES, a localized graph-enhanced module designed to improve spatial awareness and structural consistency of low-resolution embeddings. As shown in Figure 7, GLES operates in three consecutive stages: graph topology construction, coordinate-aware modulation, and structure-guided aggregation.

3.3.1. Graph Topology Construction

Let $F \in \mathbb{R}^{B \times C \times H \times W}$ denote the input feature map to the GLES module, and let $(x_i, y_i) \in [-1, 1]^2$ denote the normalized spatial coordinates associated with each graph node; these coordinates are used exclusively for spatial indexing and are unrelated to the feature notation of previous sections. We interpret each spatial location as a node in a 2D grid graph $G = (V, E)$, where each node $v_i \in V$ corresponds to normalized coordinates $(x_i, y_i)$. For each $v_i$, we define a local $3 \times 3$ neighborhood, including the center node itself:
$$\mathcal{N}(v_i) = \{ v_j \mid \| p_i - p_j \|_{\infty} \le 1 \}$$
where $p_i$ denotes the grid index of node $v_i$.
This approach constructs a sparse, locally connected graph over each patch, preserving translation-invariant structure while avoiding global complexity. Unlike classical GCN [34] or GAT [35], which use static or globally computed graphs, GLES leverages implicit spatial priors and connects only spatially adjacent nodes, enabling lightweight batch-wise computation. Each edge $(v_i, v_j) \in E$ represents an undirected connection between the center node $v_i$ and its spatial neighbor $v_j$. This deliberate restriction to local neighborhoods sets GLES apart from conventional graph reasoning modules, which are typically designed for long-range dependency modeling on high-level features. Excessive global aggregation may oversmooth weak lesion responses at semantic scales; by intentionally restricting graph connectivity to local neighborhoods, GLES preserves subtle but spatially coherent lesion evidence through localized semantic interactions.
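Because the nodes lie on a regular grid, the $3 \times 3$ neighborhood gathering implied by this construction needs no explicit edge list; a single unfold suffices (a sketch with our own function name):

```python
import torch
import torch.nn.functional as F

def local_neighborhoods(feat: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Gather the k x k neighborhood (center included) of every spatial
    location, i.e. all nodes v_j with ||p_i - p_j||_inf <= k // 2.
    Input (B, C, H, W) -> output (B, C, k*k, H*W); borders are zero-padded."""
    B, C, H, W = feat.shape
    patches = F.unfold(feat, kernel_size=k, padding=k // 2)  # (B, C*k*k, H*W)
    return patches.view(B, C, k * k, H * W)
```

Each output slot along the third dimension corresponds to one neighbor of the 3 × 3 patch in row-major order, with the center node at index 4.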

3.3.2. Coordinate-Aware Node Modulation

Inspired by the established importance of spatial coordinates in visual representation learning [36], we develop a position-aware feature modulation mechanism. Building on the principle that normalized coordinates provide critical geometric context [37], our method incorporates explicit 2D positional information through a learnable gating scheme.
Specifically, a fixed normalized coordinate grid $P \in \mathbb{R}^{H \times W \times 2}$ is flattened and transformed via a linear projection followed by sigmoid activation, generating a spatial gate $G = \sigma(\mathrm{Linear}(P))$. The generated gate then performs position-conditioned feature modulation on the flattened node features, $\tilde{F} = G \odot F_{\mathrm{flat}}$. Here, the gating vector $G$ acts as a position-dependent mask, adaptively amplifying or suppressing node activations to achieve location-aware feature refinement.
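A minimal version of this gating scheme, assuming a Linear layer that maps the 2-D coordinates to $C$ channels (our naming, not the authors' code), could look like:

```python
import torch
import torch.nn as nn

class CoordinateGate(nn.Module):
    """Sketch of coordinate-aware node modulation: a fixed normalized grid
    in [-1, 1]^2 is projected by Linear + sigmoid into a per-node gate G,
    which multiplies the flattened node features F_flat."""

    def __init__(self, channels: int):
        super().__init__()
        self.linear = nn.Linear(2, channels)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        B, C, H, W = feat.shape
        ys = torch.linspace(-1.0, 1.0, H, device=feat.device)
        xs = torch.linspace(-1.0, 1.0, W, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([gx, gy], dim=-1).reshape(H * W, 2)  # fixed grid P
        gate = torch.sigmoid(self.linear(coords))                 # G, (H*W, C)
        flat = feat.flatten(2).transpose(1, 2)                    # F_flat, (B, H*W, C)
        return gate * flat                                        # position-gated nodes
```

The grid is recomputed on the fly here for clarity; in practice it can be registered once as a buffer since it is fixed for a given feature resolution.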

3.3.3. Structure-Guided Graph Aggregation

We propose a dynamic graph aggregation method that leverages geometric relationships to aggregate the modulated node features $\tilde{F}$, capturing contextual information within the graph structure. The aggregation mechanism is driven solely by spatial relationships between nodes.
For each central node $v_i$, we retrieve the features of its local neighbors $\{\tilde{f}_j\}_{j \in \mathcal{N}(i)}$. Edge weights are derived exclusively from relative position offsets:
$$\Delta p_{ij} = p_j - p_i, \qquad w_{ij} = \mathrm{MLP}(\Delta p_{ij})$$
These geometric edge weights $w_{ij}$ are normalized via softmax over the neighborhood to produce attention scores $\alpha_{ij}$. The node update is then computed as a geometrically weighted sum:
$$f_i^{\mathrm{graph}} = \sum_{v_j \in \mathcal{N}(v_i)} \alpha_{ij} \cdot \tilde{f}_j$$
The graph-enhanced features $F_{\mathrm{graph}}$ obtained via aggregation are concatenated with the original features $F$ along the channel dimension, projected through a $1 \times 1$ convolution, and added back via a residual connection. This operation strengthens the spatial structural priors of low-resolution layers, thereby improving the representation of tiny or low-contrast lesions.
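Because the edge weights depend only on the nine relative offsets of the 3 × 3 neighborhood, they are shared across all interior nodes; a sketch of the aggregation step (the hidden size and all names are our assumptions) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricAggregation(nn.Module):
    """Sketch of structure-guided aggregation: an MLP maps each relative
    offset delta_p to a scalar edge weight, softmax over the 3x3
    neighborhood yields attention scores alpha, and neighbor features
    are mixed by the resulting weighted sum."""

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        B, C, H, W = feat.shape
        # the nine (dy, dx) offsets of a 3x3 neighborhood, center included
        offsets = torch.tensor(
            [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)],
            dtype=feat.dtype, device=feat.device)                # (9, 2)
        alpha = F.softmax(self.mlp(offsets).squeeze(-1), dim=0)  # (9,)
        patches = F.unfold(feat, kernel_size=3, padding=1)       # neighbors
        patches = patches.view(B, C, 9, H * W)
        out = (alpha.view(1, 1, 9, 1) * patches).sum(dim=2)      # weighted sum
        return out.view(B, C, H, W)
```

Concatenation with $F$, the $1 \times 1$ projection, and the residual addition then follow as described in the text.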

4. Experiments

4.1. Dataset

We evaluate ERZA-DETR on the publicly available SEE-AI dataset [38], which contains 18,481 small-bowel capsule endoscopy images captured using the PillCam™ SB3 system. Each image is labeled with bounding boxes corresponding to the following twelve lesion categories: angiodysplasia, erosion, stenosis, lymphangiectasia, lymph follicle, SMT, polyp-like, bleeding, diverticulum, erythema, foreign body, and vein. The dataset exhibits challenging imaging conditions, such as non-uniform illumination, motion blur, and subtle lesion boundaries. To prevent cross-subset leakage of the same lesion instance, we adopt an image-level stratified split rather than lesion-level allocation. The dataset is divided into training, validation and test sets using an 8:1:1 ratio while preserving the per-class image distribution. The number of lesion-containing images for each category is summarized in Figure 8.

4.2. Evaluation Metrics

Performance is reported using the standard COCO evaluation metrics, including mean Average Precision (mAP) and mAP at IoU threshold 0.5 (mAP@50). Additionally, we report the number of parameters and GFLOPs to assess the computational efficiency of our approach.
For a given confidence threshold $t$, let $TP(t)$, $FP(t)$, and $FN(t)$ denote the number of true positives, false positives, and false negatives, respectively. The corresponding precision and recall are defined as
$$\mathrm{Precision}(t) = \frac{TP(t)}{TP(t) + FP(t)}, \qquad \mathrm{Recall}(t) = \frac{TP(t)}{TP(t) + FN(t)},$$
and varying $t$ yields a precision–recall curve whose area gives the Average Precision (AP) at a specific IoU threshold.
Following the COCO protocol, the overall performance is summarized using mean Average Precision (mAP), defined as
$$\mathrm{mAP} = \frac{1}{10} \sum_{\mathrm{IoU} \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} AP(\mathrm{IoU})$$
To further assess detection quality under a loose matching criterion, we additionally report
$$\mathrm{mAP@50} = AP(\mathrm{IoU} = 0.50)$$
We also report the model size in terms of the total number of trainable parameters (Params). Let $W$ denote the set of all trainable weights, giving
$$\mathrm{Params} = |W|$$
Finally, the computational cost is measured in GFLOPs, defined as
$$\mathrm{GFLOPs} = \frac{\text{total FLOPs for one forward pass}}{10^9}$$
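For clarity, the metric definitions above reduce to a few lines of plain Python (a sketch; the function names are ours):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision and recall at a fixed confidence threshold t."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def coco_map(ap_by_iou: dict) -> float:
    """Average AP over the ten COCO IoU thresholds 0.50:0.05:0.95;
    ap_by_iou maps each threshold (rounded to 2 decimals) to its AP."""
    thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
    return sum(ap_by_iou[t] for t in thresholds) / len(thresholds)
```

Under this convention, mAP@50 is simply `ap_by_iou[0.5]`, the first term of the average.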
Together, these metrics jointly characterize the accuracy and computational efficiency of the proposed detector.

4.3. Implementation Details

All experiments were implemented using PyTorch 2.0.1 with Python 3.8 on an Ubuntu 18.04.5 server. The hardware platform consists of an Intel® Xeon® Gold 5218R CPU @ 2.10 GHz and an NVIDIA GeForce RTX 3090 GPU with 24 GB GDDR6X memory. Our model is built upon RT-DETRv2 with a ResNet-50 backbone. Detailed hyperparameter settings during training are presented in Table 1. All experiments employ identical training protocols and data augmentation strategies to ensure fair comparison.

5. Results and Discussion

5.1. Frequency Radius Sensitivity Analysis

The frequency radius r is a critical hyperparameter in the DBFS module, as it determines the cutoff between low- and high-frequency components in the Fourier domain. To assess the robustness of DBFS with respect to this parameter and to justify the adopted design choice, we conduct a sensitivity analysis by evaluating multiple fixed values of r.
In our experimental setup, all input images are resized to $640 \times 640$, a standard resolution that balances computational efficiency and detail preservation. The backbone network (ResNet-50) extracts multi-scale features, where the deepest semantic-level feature map (P5) has a spatial resolution of $20 \times 20$ with a $32\times$ downsampling stride. This $20 \times 20$ feature map is directly fed into the DBFS module for frequency-domain processing. Accordingly, we test a series of fixed configurations with $r \in \{2, 3, 4, 5, 6\}$, which correspond to progressively expanding central low-frequency regions on the frequency plane. All other training protocols and hyperparameters are kept identical to ensure fair comparison.
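To make the role of $r$ concrete, the centered binary mask over the shifted spectrum can be sketched as follows; the half-open boundary convention is our assumption, chosen so that $r = 4$ on a $20 \times 20$ plane yields an $8 \times 8$ central region:

```python
import torch

def low_frequency_mask(h: int, w: int, r: int) -> torch.Tensor:
    """Binary mask over an fftshift-ed spectrum selecting a centered
    2r x 2r low-frequency square (boundary convention -r <= u < r)."""
    u = torch.arange(h).view(-1, 1) - h // 2  # vertical frequency index
    v = torch.arange(w).view(1, -1) - w // 2  # horizontal frequency index
    return ((u >= -r) & (u < r) & (v >= -r) & (v < r)).float()

def split_bands(feat: torch.Tensor, r: int):
    """Split (B, C, H, W) features into low- and high-frequency
    components via FFT, centered masking, and inverse FFT."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    mask = low_frequency_mask(feat.shape[-2], feat.shape[-1], r).to(feat.device)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return low, feat - low
```

Smaller $r$ shrinks the low-frequency square and pushes illumination components into the high-frequency branch, which is exactly the failure mode observed below for $r \in \{2, 3\}$.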
As shown in Figure 9, the detection performance exhibits a clear dependency on the choice of the frequency radius r. When r is set to smaller values (e.g., r = 2 or r = 3 ), the low-frequency region becomes overly compact, causing a portion of illumination-related low-frequency components to leak into the high-frequency branch. This weakens the effectiveness of illumination suppression and results in reduced discriminability for low-contrast lesion regions. In contrast, when r is excessively large (e.g., r = 5 or r = 6 ), the low-frequency region occupies a disproportionately large area of the frequency plane, leading to the suppression of diagnostically relevant mid- and high-frequency details, such as lesion boundaries and fine structural cues.
Beyond fixed configurations, we also investigated the possibility of treating r as a learnable parameter. However, this design introduces intrinsic optimization difficulties. Specifically, r defines the discrete spatial extent of the low-frequency region and therefore must take integer values to preserve a valid frequency partition. Allowing r to be continuous results in fractional frequency boundaries, which are incompatible with the binary masking operation used for frequency separation. Although discretization strategies such as rounding or clamping can be applied, they introduce non-differentiable operations that hinder effective gradient propagation and lead to unstable optimization behavior in practice.
Based on the above analysis, the best overall performance is achieved when r = 4 , which corresponds to an 8 × 8 central low-frequency region for a 20 × 20 feature map. It is worth noting that the frequency partition is defined in a normalized frequency coordinate space relative to the feature map size. Therefore, the radius parameter does not directly depend on absolute image resolution, which improves its robustness across different input settings. This setting provides a balanced separation between illumination-dominated low-frequency components and lesion-relevant high-frequency details, resulting in optimal detection accuracy. Overall, these results validate the use of a scale-aware but fixed frequency radius and demonstrate that DBFS is not overly sensitive within a reasonable range of r, while still benefiting from an appropriate cutoff selection.

5.2. Clinical Imbalance Analysis

The SEE-AI dataset exhibits a pronounced long-tailed distribution, which presents a substantial challenge for reliable lesion detection. Rare categories such as diverticulum contain only 40 images across the entire dataset (training, validation, and test sets combined), in contrast to frequent categories such as erosion with 3512 training samples. Such imbalance typically leads to noticeable performance disparities across categories.
To preserve clinically realistic data characteristics, we intentionally do not apply explicit class rebalancing strategies (e.g., oversampling or class-aware sampling). In real-world WCE scenarios, lesion occurrence naturally follows a long-tailed distribution, and models are required to generalize under such open-set conditions rather than relying on artificially balanced data. Instead, a standard data augmentation pipeline is adopted, including random photometric distortion, random zoom-out, random IoU-based cropping, and horizontal flipping. These augmentations improve data diversity while maintaining the original distribution, enabling a more realistic evaluation of model robustness.
To further evaluate the discriminative ability of ERZA-DETR across different lesion categories, we conduct per-category bounding box evaluation on the SEE-AI test set. Per-category results are reported using mAP@50, which better reflects localization performance under small-object and ambiguous-boundary conditions. The results are summarized in Table 2.
As shown in Table 2, ERZA-DETR achieves an mAP@50 of 0.509 for diverticulum, while higher scores are obtained for most other lesion categories. For example, stenosis reaches 0.932 mAP@50, whereas several frequent categories such as erosion, polyp-like lesions, and lymph follicle achieve mAP values between 0.660 and 0.789. Despite the severe data imbalance, ERZA-DETR consistently improves the detection performance across all lesion categories compared with the RT-DETR and RT-DETRv2 baselines. In particular, the mAP@50 for diverticulum increases from 0.473 (RT-DETRv2) to 0.509, and similar gains are observed across other lesion types.
Beyond sample scarcity, the difficulty of detecting rare lesions is also closely related to their visual characteristics in WCE images. Rare lesions often occupy very limited spatial regions and exhibit weak or ambiguous visual cues, which makes them easily confused with surrounding anatomical structures. Under such conditions, detectors relying purely on local appearance features may produce incomplete or poorly separated feature representations, leading to missed detections or low-confidence predictions.
Unlike general object detection tasks, WCE lesions exhibit relatively stable spectral and topological patterns. Based on this observation, ERZA-DETR alleviates long-tailed learning challenges by introducing structural priors rather than relying on data rebalancing. Specifically, the DBFS module enhances frequency-domain consistency by emphasizing intrinsic geometric patterns of lesions. For example, diverticulum typically presents concave structures and boundary-related intensity transitions, which correspond to relatively stable spectral responses. By recalibrating these frequency components, the model becomes less dependent on sample quantity and more sensitive to structure-consistent cues.
In parallel, the GLES module captures graph-based topological relationships between lesion regions and surrounding tissues. This enables implicit knowledge transfer between frequent and rare categories through shared anatomical patterns (e.g., intestinal wall textures), thereby improving representation quality for underrepresented classes. These components collectively shift the learning paradigm from data-intensive statistical fitting toward structure-prior-guided representation learning, which is inherently more robust to long-tailed distributions commonly observed in WCE datasets.
Qualitative analysis further supports this observation. As shown in Figure 10, a diverticulum lesion with subtle visual appearance and limited spatial extent is completely missed by the baseline RT-DETRv2 detector. In contrast, ERZA-DETR successfully identifies two diverticulum instances in the same frame, with confidence scores of 0.60 and 0.78, respectively. This case demonstrates that the proposed structure-aware representation can effectively capture multiple subtle rare lesions simultaneously, improving sensitivity to small and low-contrast abnormalities.
Despite these improvements, extremely scarce categories (e.g., diverticulum with only 40 images) remain inherently challenging. In this work, we focus on evaluating model robustness under clinically realistic conditions without introducing additional balancing strategies. Future work will explore dedicated approaches for rare lesion enhancement, such as generative data augmentation based on GAN or diffusion models, to further improve performance on severely underrepresented classes. Overall, these results indicate that ERZA-DETR improves detection robustness under long-tailed distributions by shifting from sample-driven fitting toward structure-prior-guided representation learning, while maintaining a clinically realistic evaluation setting without explicit class rebalancing or external supervision.

5.3. Comparison with State-of-the-Art Methods

Table 3 presents a comprehensive comparison between ERZA-DETR and several representative convolution-based and transformer-based object detectors on the SEE-AI dataset. The results demonstrate that ERZA-DETR achieves competitive performance while maintaining favorable computational efficiency.
Compared with conventional convolutional detectors such as Faster R-CNN and SSD [39], ERZA-DETR exhibits substantially improved detection accuracy. Specifically, Faster R-CNN achieves an mAP of 0.316 and an mAP@50 of 0.565, whereas ERZA-DETR reaches 0.454 and 0.755, respectively. This improvement indicates the advantage of transformer-based architectures in modeling long-range contextual dependencies, which is particularly beneficial for identifying subtle lesion structures in WCE imagery. This improvement can be further attributed to the DBFS module, which enhances frequency-domain representations by suppressing low-frequency illumination interference and amplifying high-frequency lesion boundaries, thereby improving sensitivity to low-contrast lesions.
When compared with recent YOLO-series detectors, ERZA-DETR also demonstrates competitive performance. Although YOLOv13-L achieves a higher overall mAP of 0.474, ERZA-DETR attains the highest mAP@50 (0.755) among all evaluated methods, indicating stronger localization capability. Accurate localization is particularly important in WCE lesion detection, where pathological regions often exhibit ambiguous boundaries and small spatial extent. Moreover, the improved localization stability under challenging illumination conditions can be attributed to the spectral recalibration of DBFS and the adaptive filtering mechanism of FD-gConv, which together mitigate the impact of uneven lighting and background noise that commonly occur in WCE imagery.
More importantly, ERZA-DETR improves both accuracy and efficiency relative to its baseline RT-DETRv2. Specifically, ERZA-DETR increases mAP from 0.434 to 0.454 (+2.0%) and mAP@50 from 0.723 to 0.755 (+3.2%). At the same time, the parameter count is reduced from 42.7 M to 37.4 M and the computational cost decreases from 137.7 GFLOPs to 110.5 GFLOPs, indicating a more compact and efficient architecture.
In addition, ERZA-DETR converges within 150 training epochs, which is significantly fewer than those required by several CNN-based detectors in Table 3. For example, Faster R-CNN and SSD require 500 epochs to achieve convergence. This faster convergence further reduces overall training time and computational cost, which is advantageous for practical deployment and large-scale clinical training scenarios.
Overall, these results demonstrate that ERZA-DETR achieves a favorable balance between detection accuracy, model complexity, and training efficiency, making it well suited for WCE lesion detection tasks.

5.4. Ablation Studies and Analysis

To analyze the contribution of each component in ERZA-DETR, we conduct an ablation study on the SEE-AI dataset using RT-DETRv2 as the baseline. The baseline achieves 43.4% mAP and 72.3% mAP@50. Since ERZA-DETR introduces three complementary modules, namely DBFS, FD-gConv, and GLES, we evaluate their effectiveness by enabling them individually and in different combinations. The results are summarized in Table 4.
As shown in Table 4, each module independently improves the baseline performance, demonstrating its effectiveness for WCE lesion detection. Specifically, introducing DBFS alone increases mAP@50 by 1.1% compared with the baseline, indicating that the dedicated frequency-domain branch introduces complementary cues that sharpen lesion boundaries and enhance discriminative capacity. Integrating FD-gConv alone yields the largest single-module improvement with a 1.8% gain in mAP@50. Its content adaptive gating emphasizes informative channels and spatial responses while suppressing non-discriminative activations, which improves the quality of fused multi-scale features. Introducing GLES alone produces a 1.2% gain in mAP@50 by applying graph-based local relation modeling at semantic scales, aggregating neighborhood context to reinforce small-lesion evidence.
Pairwise combinations further amplify these effects in ways that match their functional roles. The integration of DBFS with FD-gConv raises mAP@50 by 2.2% over the baseline, as frequency-aware cues are progressively refined by the gating stage, retaining salient patterns while attenuating noise. The combination of FD-gConv and GLES yields a 2.4% improvement in mAP@50. Features filtered by the gate are subsequently aggregated by the graph module along learned spatial relations, reinforcing coherent lesion structures and reducing isolated artifacts. The combination of DBFS and GLES provides a smaller 1.9% increase, which aligns with the absence of an intermediate filtering stage. Without early suppression, frequency cues are aggregated together with noise, limiting the net benefit.
Integrating all three modules delivers the highest overall performance. The model reaches 0.454 mAP and 0.755 mAP@50, which corresponds to gains of 2.0% and 3.2% over the baseline. This progression underscores a clear functional complementarity in which DBFS supplies frequency-aware evidence, FD-gConv adaptively filters and refines this information, and GLES consolidates the distilled representations through relational aggregation. The resulting sequential enhancement process yields stable and cumulative improvements in detection accuracy.
To further illustrate these improvements, qualitative detection comparisons are presented in Figure 11. Compared with the baseline RT-DETRv2, ERZA-DETR produces more accurate lesion localization, fewer false positives, and improved recognition of small or low-contrast lesions. Several representative cases show that lesions missed or incorrectly localized by the baseline are correctly identified by ERZA-DETR, while spurious detections in background regions are effectively suppressed. These qualitative observations are consistent with the quantitative gains observed in the ablation experiments.

6. Conclusions

This paper addresses the challenges of wireless capsule endoscopy (WCE) lesion detection, where non-uniform illumination, low contrast, and the small size of lesions significantly hinder reliable recognition.
To tackle these issues, we propose ERZA-DETR, an enhanced detection transformer that integrates three complementary modules: a Dual-Band Adaptive Fourier Spectral module (DBFS) for frequency-domain recalibration, a Fused Dual-scale Gated Convolutional module (FD-gConv) for adaptive multi-scale feature filtering, and a Graph-Linked Embedding at Semantic Scales module (GLES) for structure-aware relational modeling. Unlike conventional approaches that rely solely on spatial representations, the proposed framework explicitly incorporates spectral and structural cues to improve feature discriminability.
Experimental results on the SEE-AI dataset demonstrate that ERZA-DETR achieves an mAP of 0.454 and an mAP@50 of 0.755, outperforming the RT-DETRv2 baseline while reducing both parameter count and computational cost. In addition, the model converges within 150 epochs, indicating improved training efficiency. These results confirm that the proposed design provides a favorable balance between accuracy and efficiency for WCE lesion detection.
Although the evaluation is conducted on a single dataset, the SEE-AI dataset contains diverse lesion categories and complex visual conditions, providing a representative benchmark for WCE analysis. Future work will focus on extending the framework to exploit temporal information in full-length capsule videos, as well as validating its generalization across multi-center datasets. Further investigation into clinically oriented optimization strategies, such as uncertainty estimation and decision-aware thresholding, may also enhance its applicability in real-world diagnostic workflows.

Author Contributions

Conceptualization, S.Y. and H.M.; methodology, S.Y. and H.M.; software, H.M.; validation, S.Y., H.M., Z.Z. and L.L.; formal analysis, S.Y. and H.M.; investigation, S.Y., H.M., Z.Z. and L.L.; resources, S.Y. and H.M.; data curation, H.M.; writing—original draft preparation, S.Y. and H.M.; writing—review and editing, S.Y., H.M., Z.Z. and L.L.; visualization, S.Y. and H.M.; supervision, S.Y. and H.M.; project administration, S.Y. and H.M.; funding acquisition, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset that supports the findings of this study is openly available at: https://www.kaggle.com/datasets/capsuleyolo/kyucapsule (accessed on 24 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Iddan, G.; Meron, G.; Glukhovsky, A.; Swain, P. Wireless capsule endoscopy. Nature 2000, 405, 417. [Google Scholar] [CrossRef] [PubMed]
  2. Pennazio, M.; Rondonotti, E.; Despott, E.J.; Dray, X.; Keuchel, M.; Moreels, T.; Sanders, D.S.; Spada, C.; Carretero, C.; Valdivia, P.C.; et al. Small-bowel capsule endoscopy and device-assisted enteroscopy for diagnosis and treatment of small-bowel disorders: European Society of Gastrointestinal Endoscopy (ESGE) Guideline–Update 2022. Endoscopy 2023, 55, 58–95. [Google Scholar] [CrossRef] [PubMed]
  3. Zha, B.; Cai, A.; Wang, G. Diagnostic Accuracy of Artificial Intelligence in Endoscopy: Umbrella Review. JMIR Med. Inform. 2024, 12, e56361. [Google Scholar] [CrossRef] [PubMed]
  4. Trasolini, R.; Byrne, M.F. Artificial intelligence and deep learning for small bowel capsule endoscopy. Dig. Endosc. 2021, 33, 290–297. [Google Scholar] [CrossRef]
  5. Cao, Q.; Deng, R.; Pan, Y.; Liu, R.; Chen, Y.; Gong, G.; Zou, J.; Yang, H.; Han, D. Robotic wireless capsule endoscopy: Recent advances and upcoming technologies. Nat. Commun. 2024, 15, 4597. [Google Scholar] [CrossRef]
  6. Tontini, G.E.; Vecchi, M.; Neurath, M.F.; Neumann, H. Advanced endoscopic imaging techniques in Crohn’s disease. J. Crohn’s Colitis 2014, 8, 261–269. [Google Scholar] [CrossRef]
  7. Son, G.; Eo, T.; An, J.; Oh, D.J.; Shin, Y.; Rha, H.; Kim, Y.J.; Lim, Y.J.; Hwang, D. Small bowel detection for wireless capsule endoscopy using convolutional neural networks with temporal filtering. Diagnostics 2022, 12, 1858. [Google Scholar] [CrossRef]
  8. Zhao, X.; Fang, C.; Gao, F.; Fan, D.J.; Lin, X.; Li, G. Deep transformers for fast small intestine grounding in capsule endoscope video. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI); IEEE: Piscataway, NJ, USA, 2021; pp. 150–154. [Google Scholar]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
  10. Ye, S.; Meng, Q.; Zhang, S.; Wang, H. Multi-Scale Feature Fusion Network Model for Wireless Capsule Endoscopic Intestinal Lesion Detection. Comput. Mater. Contin. 2025, 82, 2043–2059. [Google Scholar] [CrossRef]
  11. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  13. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 2961–2969. [Google Scholar]
  14. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 8759–8768. [Google Scholar]
  15. Vieira, P.M.; Freitas, N.R.; Lima, V.B.; Costa, D.; Rolanda, C.; Lima, C.S. Multi-pathology detection and lesion localization in WCE videos by using the instance segmentation approach. Artif. Intell. Med. 2021, 119, 102141. [Google Scholar] [CrossRef]
  16. Alam, M.J.; Rashid, R.B.; Fattah, S.A.; Saquib, M. Rat-capsnet: A deep learning network utilizing attention and regional information for abnormality detection in wireless capsule endoscopy. IEEE J. Transl. Eng. Health Med. 2022, 10, 1–8. [Google Scholar] [CrossRef] [PubMed]
  17. Jain, S.; Seal, A.; Ojha, A.; Yazidi, A.; Bures, J.; Tacheci, I.; Krejcar, O. A deep CNN model for anomaly detection and localization in wireless capsule endoscopy images. Comput. Biol. Med. 2021, 137, 104789. [Google Scholar] [CrossRef] [PubMed]
  18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  19. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar] [CrossRef]
  20. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974. [Google Scholar]
  21. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar] [CrossRef]
  22. Hosain, A.S.; Islam, M.; Mehedi, M.H.K.; Kabir, I.E.; Khan, Z.T. Gastrointestinal disorder detection with a transformer based approach. In Proceedings of the 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON); IEEE: Piscataway, NJ, USA, 2022; pp. 280–285. [Google Scholar]
  23. Habe, T.T.; Haataja, K.; Toivanen, P. Precision enhancement in wireless capsule endoscopy: A novel transformer-based approach for real-time video object detection. Front. Artif. Intell. 2025, 8, 1529814. [Google Scholar] [CrossRef]
  24. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  25. Dao, T.; Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv 2024, arXiv:2405.21060. [Google Scholar] [CrossRef]
  26. Nam, J.H.; Syazwany, N.S.; Kim, S.J.; Lee, S.C. Modality-agnostic domain generalizable medical image segmentation by multi-frequency in multi-scale attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 11480–11491. [Google Scholar]
  27. Chen, L.; Gu, L.; Li, L.; Yan, C.; Fu, Y. Frequency Dynamic Convolution for Dense Image Prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference; IEEE: Piscataway, NJ, USA, 2025; pp. 30178–30188. [Google Scholar]
  28. Liu, Y.; Wang, J.; Huang, C.; Wang, Y.; Xu, Y. Cigar: Cross-modality graph reasoning for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 23776–23786. [Google Scholar]
  29. Im, J.; Nam, J.; Park, N.; Lee, H.; Park, S. Egtr: Extracting graph from transformer for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 24229–24238. [Google Scholar]
  30. Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733. [Google Scholar] [CrossRef]
  31. Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; He, J.; Ji, B.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence; IEEE: Piscataway, NJ, USA, 2025; Volume 39, pp. 6896–6904. [Google Scholar]
  32. Chiley, V.; Thangarasa, V.; Gupta, A.; Samar, A.; Hestness, J.; DeCoste, D. RevBiFPN: The fully reversible bidirectional feature pyramid network. Proc. Mach. Learn. Syst. 2023, 5, 625–645. [Google Scholar]
  33. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 2117–2125. [Google Scholar]
  34. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar] [CrossRef]
  35. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar] [CrossRef]
  36. Chen, Y.; You, J.; He, J.; Lin, Y.; Peng, Y.; Wu, C.; Zhu, Y. SP-GNN: Learning structure and position information from graphs. Neural Netw. 2023, 161, 505–514. [Google Scholar] [CrossRef]
  37. You, J.; Ying, R.; Leskovec, J. Position-aware graph neural networks. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 7134–7143. [Google Scholar]
  38. Yokote, A.; Umeno, J.; Kawasaki, K.; Fujioka, S.; Fuyuno, Y.; Matsuno, Y.; Yoshida, Y.; Imazu, N.; Miyazono, S.; Moriyama, T.; et al. Small bowel capsule endoscopy examination and open access database with artificial intelligence: The SEE-artificial intelligence project. DEN Open 2024, 4, e258. [Google Scholar] [CrossRef]
39. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
Figure 1. Overall architecture of ERZA-DETR, which integrates Dual-Band Adaptive Fourier Spectral module (DBFS) for frequency cues, Fused Dual-scale Gated Convolutional module (FD-gConv) for selective multi-scale fusion, and Graph-Linked Embedding at Semantic Scales module (GLES) for low-resolution structural reasoning on top of RT-DETRv2.
Figure 2. DBFS module: token features are transformed via FFT, split into low/high-frequency bands with learnable modulation, and fused back through residual addition to enhance detail while suppressing illumination noise.
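The band-splitting step described in the caption can be sketched compactly. The code below is an illustration, not the authors' implementation: it assumes the spectrum is already centred (FFT-shifted) and given as a 2-D list, uses the Chebyshev-distance square mask that also underlies Figure 4, and stands in for the learnable per-band modulation with two fixed scalars `alpha_low` and `alpha_high` (hypothetical names).

```python
def square_band_masks(h, w, r):
    """Complementary low/high-frequency masks for an h x w centred spectrum.

    The low band is the central square whose Chebyshev distance from the
    frequency centre is less than r; everything else is the high band.
    """
    cy, cx = h // 2, w // 2
    low = [[1.0 if max(abs(y - cy), abs(x - cx)) < r else 0.0
            for x in range(w)] for y in range(h)]
    high = [[1.0 - low[y][x] for x in range(w)] for y in range(h)]
    return low, high


def dual_band_modulate(spectrum, r, alpha_low, alpha_high):
    """Scale the two frequency bands of a centred spectrum by separate factors."""
    h, w = len(spectrum), len(spectrum[0])
    low, high = square_band_masks(h, w, r)
    return [[spectrum[y][x] * (alpha_low * low[y][x] + alpha_high * high[y][x])
             for x in range(w)] for y in range(h)]
```

In the real module the spectrum comes from a 2-D FFT of the token features and the two scales are learned; choosing them differently per band reproduces the kind of band-wise reweighting shown in Figure 4.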
Figure 3. Visualization of frequency energy distribution before and after DBFS processing. The white square indicates the central low-frequency region, while the surrounding area corresponds to high-frequency components. The seismic colormap is used, where darker colors denote lower energy and brighter colors denote higher energy. (a) Frequency spectrum before DBFS processing. (b) Frequency spectrum after DBFS processing.
Figure 4. Square-band frequency response of feature maps before and after DBFS. The x-axis denotes the square frequency band index defined by Chebyshev distance from the frequency center. The y-axis represents the normalized mean signed real-frequency response within each band. DBFS significantly stabilizes low-frequency responses and compresses high-frequency responses, demonstrating its dual-band modulation capability.
Figure 5. Visualization of intermediate activations. Activation maps are shown using the viridis colormap, where brighter colors indicate stronger responses. (a) Input features (top) and corresponding input image (bottom). (b) Standard spatial-attention map (top) and its overlay on the input image (bottom). (c) DBFS-refined features (top) and corresponding overlay (bottom). Red boxes indicate detected lesion regions, and blue text denotes the predicted category and confidence score.
Figure 6. FD-gConv module: a value pathway captures spatial evidence and a gate pathway performs channel-wise modulation, with their outputs fused through residual projection for stable and selective multi-scale feature integration.
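Per channel, the dual-pathway fusion in the caption reduces to a sigmoid-gated product added back to a residual. A minimal sketch under that reading follows; the function name `fd_gconv_fuse` and the flat per-channel-list representation are illustrative assumptions, not the paper's code.

```python
import math


def sigmoid(x):
    """Logistic function used for the gate pathway."""
    return 1.0 / (1.0 + math.exp(-x))


def fd_gconv_fuse(value_feat, gate_logits, residual):
    """Channel-wise gated fusion: the value path is modulated by a sigmoid
    gate and added to the residual path. All arguments are per-channel lists
    of equal length."""
    return [r + v * sigmoid(g)
            for v, g, r in zip(value_feat, gate_logits, residual)]
```

A gate logit of 0 half-opens a channel, while a strongly positive logit passes the value path through almost unchanged, which is what makes the fusion selective rather than a fixed sum.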
Figure 7. GLES module: each location forms a 3 × 3 local graph, features are modulated by a coordinate gate, neighbors are aggregated with edge weights derived from relative offsets, and the result is fused with the residual after a 1 × 1 projection to enhance structural cues on the lowest-resolution feature map.
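The coordinate-gated 3 × 3 aggregation can be illustrated on a single-channel map. In this sketch the offset-to-edge-weight mapping is passed in as a precomputed dict and the gate as a fixed map, whereas the actual GLES module derives the edge weights from relative offsets and learns the gate; all names here are hypothetical.

```python
def gles_aggregate(feat, edge_w, gate):
    """Aggregate each location's 3x3 neighbourhood on a 2-D scalar map.

    feat   : H x W list of lists (single channel for simplicity)
    edge_w : dict mapping relative offset (dy, dx) -> edge weight
    gate   : H x W multiplicative coordinate gate in [0, 1]
    Returns the gated, neighbour-aggregated map with a residual connection.
    """
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            agg = 0.0
            for (dy, dx), wgt in edge_w.items():
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:  # drop out-of-bounds neighbours
                    agg += wgt * feat[ny][nx]
            out[y][x] = feat[y][x] + gate[y][x] * agg  # residual fusion
    return out
```

Border locations simply aggregate fewer neighbours; in the full module a 1 × 1 projection follows this step before the residual fusion.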
Figure 8. Image-level distribution of lesion categories in SEE-AI. Each bar counts the number of images containing at least one lesion of the corresponding category. Since a single image may contain multiple lesion categories, the counts do not sum to the dataset total.
Figure 9. Sensitivity analysis of the frequency radius r in the DBFS module. Detection performance is evaluated using mAP@50 under different fixed values of r, where r controls the spatial extent of the central low-frequency region relative to the feature map height H. The results show a clear performance peak at r = 4 , indicating an optimal balance between low-frequency illumination suppression and high-frequency lesion detail preservation.
Figure 10. Qualitative comparison on a rare diverticulum case under severe class imbalance. (a) Detection result of RT-DETRv2, where no diverticulum instance is detected, indicating a complete miss under weak appearance conditions. (b) Detection result of ERZA-DETR, which successfully identifies two diverticulum instances with confidence scores of 0.60 and 0.78, respectively.
Figure 11. Visual detection comparison of lesion localization and recognition performance between two models on the SEE-AI dataset. (a) Visual detection results of RT-DETRv2 on the SEE-AI dataset. (b) Visual detection results of ERZA-DETR on the SEE-AI dataset.
Table 1. Training hyperparameter configuration.

| Hyperparameter | Value |
| --- | --- |
| Input size | 640 × 640 |
| Batch size | 16 |
| Training epochs | 150 |
| Optimizer | AdamW (β1 = 0.9, β2 = 0.999) |
| Initial learning rate | 0.0001 |
| Weight decay | 0.0001 |
| Data augmentation | Random photometric distortion, random zoom-out, random IoU crop, random horizontal flip |
Table 2. Per-category detection performance (mAP@50) on the SEE-AI dataset.

| Model | Angiodysplasia | Erosion | Stenosis | Lymphangiectasia | Lymph follicle | SMT | Polyp-like | Bleeding | Erythema | Diverticulum | Foreign body | Vein |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RT-DETR | 0.708 | 0.718 | 0.904 | 0.770 | 0.554 | 0.690 | 0.753 | 0.726 | 0.698 | 0.469 | 0.724 | 0.849 |
| RT-DETRv2 | 0.713 | 0.723 | 0.908 | 0.775 | 0.560 | 0.695 | 0.758 | 0.731 | 0.704 | 0.473 | 0.730 | 0.852 |
| ERZA-DETR (Ours) | 0.742 | 0.755 | 0.932 | 0.806 | 0.660 | 0.725 | 0.789 | 0.762 | 0.733 | 0.509 | 0.766 | 0.873 |
Table 3. Performance of models on the SEE-AI dataset.

| Model | Params (M) | GFLOPs | Epochs | mAP | mAP@50 |
| --- | --- | --- | --- | --- | --- |
| Faster R-CNN [12] | 41.2 | 127.0 | 500 | 0.316 | 0.565 |
| SSD [39] | 26.3 | 342.8 | 500 | 0.293 | 0.532 |
| YOLOv8-L | 43.7 | 165.2 | 300 | 0.416 | 0.727 |
| YOLOv11-L | 25.3 | 86.9 | 300 | 0.421 | 0.742 |
| YOLOv12-L [11] | 26.4 | 88.9 | 300 | 0.454 | 0.729 |
| YOLOv13-L [30] | 27.6 | 89.0 | 300 | 0.474 | 0.743 |
| DETR [18] | 40.0 | 187.1 | 500 | 0.359 | 0.617 |
| Deformable DETR [19] | 40.2 | 172.5 | 150 | 0.385 | 0.648 |
| RT-DETR-R50 [20] | 42.7 | 137.7 | 150 | 0.422 | 0.718 |
| RT-DETRv2-R50 [21] | 42.7 | 137.7 | 150 | 0.434 | 0.723 |
| ERZA-DETR (Ours) | 37.4 | 110.5 | 150 | 0.454 | 0.755 |
Table 4. Ablation study results on the SEE-AI dataset.

| DBFS | FD-gConv | GLES | mAP | mAP@50 |
| --- | --- | --- | --- | --- |
| - | - | - | 0.434 | 0.723 |
| √ | - | - | 0.437 | 0.734 |
| - | √ | - | 0.438 | 0.741 |
| - | - | √ | 0.437 | 0.735 |
| √ | √ | - | 0.440 | 0.745 |
| √ | - | √ | 0.439 | 0.742 |
| - | √ | √ | 0.440 | 0.747 |
| √ | √ | √ | 0.454 | 0.755 |

√ indicates the module is enabled.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ye, S.; Ma, H.; Zhang, Z.; Li, L. ERZA-DETR: A Deep Learning-Based Detection Transformer with Enhanced Relational-Zone Aggregation for WCE Lesion Detection. Algorithms 2026, 19, 268. https://doi.org/10.3390/a19040268

