3.1. Overall Architecture
Multi-class endoscopic disease detection in real clinical applications faces three key challenges. First, large amounts of non-pathological content cause feature redundancy: normal mucosal tissue, anatomical structures, and imaging artifacts dominate the visual field, while diagnostically relevant lesion regions are sparsely distributed. Second, severe illumination inconsistency exists across imaging modalities: white-light endoscopy (WLE), narrow-band imaging (NBI), and chromoendoscopy (CE) make identical lesions appear visually very different, while chromatic shifts, intensity variations, and specular reflections on mucosal surfaces alter feature representations. Third, extreme scale variability with blurry boundaries spans disease stages, from sub-millimeter early dysplastic patches to centimeter-scale late-stage lesions whose unclear boundaries challenge standard detection architectures.
Contemporary computer vision techniques have proposed separate solutions for these challenges: sparse attention schemes for computational efficiency, multi-scale feature pyramids for scale variation, and dual-stream architectures for feature discrimination. However, their direct application to endoscopic disease detection encounters domain-specific challenges. Cross-modal endoscopic datasets have constraints such as limited annotation density relative to natural images, extreme class imbalance with severe under-representation of rare but clinically critical classes like early cancer and high-grade dysplasia, and multi-modal imaging protocols that introduce systematic appearance variations beyond natural scene illumination changes.
To address these challenges through domain-specific adaptation, we designed Endo-DET, integrating three synergistic components. Adaptive Lesion-Discriminative Filtering (ALDF) achieves lesion-focused token selection through sparse simplex projection. Global–Local Illumination Modulation Neck (GLIM-Neck) enables illumination-aware multi-scale fusion through four cooperative mechanisms. Lesion-aware Unified Calibration and Illumination-robust Discrimination (LUCID) achieves feature refinement through reciprocal gating between detail and context streams.
The overall framework appears in Figure 1. Given an endoscopic image, the architecture includes four main components. The backbone extracts hierarchical features at different resolutions. The ALDF-enhanced Transformer encoder provides lesion-focused sparse attention with lower computational cost. GLIM-Neck integrates illumination-aware global context through pyramid aggregation, Transformer refinement, hierarchical channel allocation, and adaptive local–global injection. LUCID calibrates features using dual-stream reciprocal modulation.
The framework builds on the DEIM architecture, leveraging Dense O2O matching to mitigate the sparse supervision problem inherent to medical image datasets. We detail the design rationale and mathematical formulation of each component below.
3.2. Adaptive Lesion-Discriminative Filtering Module (ALDF)
Standard Transformer encoders in detection architectures employ dense global self-attention with quadratic $O(N^2)$ computational complexity in the number of spatial tokens, treating all token positions uniformly. Endoscopic disease detection, however, presents fundamentally different spatial distributions compared to natural object detection. Diagnostically relevant lesion regions occupy sparse, localized areas while non-pathological content, including normal mucosa, anatomical landmarks, imaging artifacts, and surgical instruments, dominates most spatial positions. Dense attention mechanisms allocate substantial computation to background regions lacking diagnostic information, diluting feature saliency from sparse lesion patterns and failing to prioritize the token interactions that are most relevant to medical diagnosis.
To address this spatial redundancy while maintaining detection sensitivity, we adapt sparse attention through grouped semantic modeling and learnable simplex projection, enabling a selective focus on lesion-discriminative token subsets.
ALDF integrates three cooperative mechanisms: grouped channel decomposition for simultaneous multi-scale semantic modeling, hybrid convolutional projection for local spatial context injection, and learnable sparse simplex projection for adaptive top-$k$ token filtering. The detailed architecture appears in Figure 2, showing the complete computational flow from input feature transformation through attention score computation to sparse token aggregation.
Given input features augmented with learnable positional encoding, we decompose the channel dimension into G semantic groups to enable the parallel processing of features at different semantic abstraction levels. The channel-to-group transformation assigns each group its own subset of channels, with each group capturing feature patterns at a different semantic scale. This grouped decomposition facilitates cross-scale token interaction within a unified attention space while reducing per-group computational cost.
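To make the grouped decomposition concrete, the following minimal NumPy sketch splits a feature map's channels into G parallel groups; the tensor sizes are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def group_channels(x, G):
    """Split a (C, H, W) feature map into G semantic groups of C/G channels each."""
    C, H, W = x.shape
    assert C % G == 0, "channel count must be divisible by the group count"
    # (G, C/G, H, W): each group processes its own channel subset in parallel
    return x.reshape(G, C // G, H, W)

x = np.random.randn(64, 8, 8)   # toy feature map: 64 channels, 8x8 spatial grid
groups = group_channels(x, G=4)
print(groups.shape)             # (4, 16, 8, 8)
```

Because each group attends over $C/G$ channels instead of $C$, the per-group projection cost shrinks accordingly, which is the efficiency argument made above.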
Query, Key, and Value projections are generated through a hybrid convolutional architecture combining 3D grouped convolution with depthwise separable convolution: the grouped convolution operates on the grouped channels, the depthwise separable convolution aggregates local context, and the output is split into three equal components to form Q, K, and V. Grouped features are then flattened into token sequences, where each token simultaneously encodes spatial position and semantic group membership.
To stabilize attention computation and prevent gradient saturation, we apply $\ell_2$ normalization to the Query and Key representations with learnable temperature modulation, where ⊙ denotes element-wise multiplication and the temperature is a learnable parameter. The normalized attention scores S quantify semantic affinity between all token pairs while maintaining numerical stability through bounded magnitudes.
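The normalization-and-temperature step can be sketched in NumPy as follows; the token count, embedding width, and temperature value are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def normalized_attention_scores(Q, K, tau=1.0):
    """Cosine-similarity attention scores with a temperature parameter tau.

    L2-normalizing Q and K bounds each raw score to [-1, 1] before
    temperature scaling, which prevents gradient saturation from large
    dot products.
    """
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    return tau * (Qn @ Kn.T)    # (N, N) affinity matrix, entries in [-tau, tau]

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(6, 32)), rng.normal(size=(6, 32))
S = normalized_attention_scores(Q, K, tau=2.0)
assert np.all(np.abs(S) <= 2.0 + 1e-9)   # scores are bounded by the temperature
```

The boundedness shown by the final assertion is exactly the "numerical stability through bounded magnitudes" property claimed above.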
To achieve computational efficiency without sacrificing detection sensitivity, we introduce learnable sparse simplex projection. For each query token, we construct sparse attention patterns by retaining only the top-$k$ key tokens exhibiting the highest semantic affinity. Given $N$ tokens, the number of retained tokens is $k = \lceil \rho N \rceil$, where $\rho$ is the retention ratio that directly controls the computational complexity ratio between sparse and dense attention. We set $\rho = 0.8$ in all experiments, yielding a 20% reduction in attention computation. This relatively conservative pruning reflects a domain-specific consideration: in endoscopic images, lesion regions are sparsely distributed, but their spatial extent varies substantially, from sub-millimeter dysplastic patches to centimeter-scale polyps. Aggressive pruning (small $\rho$) risks discarding tokens at lesion peripheries where boundary-discriminative information resides, directly harming localization precision. Conversely, $\rho \to 1$ approaches dense attention with a negligible efficiency gain. At $\rho = 0.8$, the top-$k$ selection provides sufficient spatial coverage to capture the full extent of a lesion, including boundary regions, consistent with the clinical priority of minimizing false negatives during medical screening. This selection is formalized through a binary mask over the attention scores.
The mask retains, via an indicator function, the indices of the $k$ largest affinity scores for each query. Sparse attention weights are then computed through masked softmax normalization: unselected positions are mapped to negative infinity (using an all-ones matrix on the mask complement), ensuring zero contribution in the softmax operation, with ⊙ denoting element-wise multiplication. This formulation achieves hard attention selection while maintaining differentiability through the softmax operation.
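A minimal NumPy sketch of the top-$k$ masked softmax, assuming $k = \lceil \rho N \rceil$ and an illustrative 5-token sequence; the real module operates on grouped, positionally encoded tokens.

```python
import numpy as np

def sparse_topk_attention(S, rho=0.8):
    """Masked-softmax attention keeping only the top-k keys per query.

    k = ceil(rho * N); unselected positions are set to -inf so they
    contribute exactly zero after the softmax, while selected positions
    keep ordinary softmax gradients.
    """
    N = S.shape[-1]
    k = int(np.ceil(rho * N))
    # indices of the k largest affinities in each query row
    topk = np.argsort(S, axis=-1)[:, -k:]
    mask = np.zeros_like(S, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=-1)
    masked = np.where(mask, S, -np.inf)
    # numerically stable softmax over the surviving entries
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

S = np.random.randn(5, 5)
A = sparse_topk_attention(S, rho=0.8)   # keeps ceil(0.8 * 5) = 4 keys per query
assert np.allclose(A.sum(axis=-1), 1.0)
assert np.all((A > 0).sum(axis=-1) == 4)
```

Each attention row remains a valid probability distribution (the simplex constraint) while exactly $N - k$ entries are forced to zero, which is the hard-selection behavior described above.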
Output features are obtained through attention-weighted value aggregation, projection, and residual connection, where a projection convolution maps the aggregated features back to the original channel configuration and the token sequence is reshaped into spatial feature map format.
As a core encoder component, ALDF improves lesion discrimination by learning sparse attention patterns while suppressing computational redundancy in background regions. Sparse simplex projection reduces attention complexity from $O(N^2)$ to $O(\rho N^2)$, achieving efficiency gains whenever $\rho < 1$. Grouped channel decomposition provides additional computational savings, with projection FLOPs proportional to $k^2 C^2 / G$, where $k$ denotes kernel size, $C$ the channel count, and $G$ the group count.
Detection sensitivity takes priority over computational pruning, since medical screening applications require high recall to minimize false negatives in rare disease detection. The learnable temperature parameter and retention ratio allow attention patterns to adapt to endoscopic lesion characteristics. These design choices distinguish ALDF from generic sparse attention mechanisms developed for natural image understanding.
3.3. Global–Local Illumination Modulation Neck (GLIM-Neck)
Cross-modal endoscopic imaging protocols introduce severe illumination inconsistency that fundamentally challenges standard multi-scale feature fusion architectures. White-light endoscopy captures broadband visible spectrum information. Narrow-band imaging selectively filters wavelengths to enhance vascular patterns. Chromoendoscopy applies topical dyes that significantly alter tissue chromaticity. Standard feature pyramid networks propagate entangled illumination-structure representations across scales, where specular reflections on moist mucosal surfaces, shadows from camera position changes, and directional anatomical folds create spatially heterogeneous illumination distributions contaminating hierarchical features.
Existing illumination normalization methods apply spatially uniform correction, risking either over-processing diagnostically relevant color variations at lesion boundaries, where subtle chromaticity encodes tissue pathology, or under-processing background artifacts where strong reflections persist. To address this limitation through content-adaptive processing, we designed GLIM-Neck to implement four cooperative mechanisms for illumination-aware hierarchical feature integration.
GLIM-Neck comprises four cascaded components. The Global Context Pyramid Aggregator (GCPA) constructs lesion-aware multi-scale context through adaptive pooling and feature concatenation. Global Context Transformer Head (GCT-Head) refines aggregated context through self-attention to learn illumination-invariant representations. Hierarchical Asymmetric Channel Allocation (HACA) decomposes refined context into pyramid level-specific guidance with differentiated channel capacity. Adaptive Local-Global Injection (ALGI) enables learnable data-driven fusion between local structural features and global illumination context.
The complete architecture appears in Figure 3 and Figure 4, depicting information flow from multi-resolution feature aggregation through Transformer refinement to adaptive hierarchical injection.
For the main processing branch receiving multi-resolution features from the fine (stride 8), middle (stride 16), and coarse (stride 32) levels, GCPA performs resolution unification through adaptive pooling followed by channel-wise aggregation: each level passes through a projection convolution, is adaptively average-pooled to a common target resolution, and is then concatenated along the channel dimension. The integration of the lesion-discriminative filtered features from ALDF ensures that the aggregated global context encodes both global illumination statistics and spatial priors regarding diagnostically salient regions, extending traditional pyramid fusion with lesion awareness.
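A dependency-light NumPy sketch of the pooling-and-concatenation step, assuming a 640×640 input (so strides 8/16/32 give 80/40/20 feature maps) and 64 channels per level after projection; both values are illustrative, and the simple box average stands in for learned adaptive pooling.

```python
import numpy as np

def adaptive_avg_pool(x, out_h, out_w):
    """Average-pool a (C, H, W) map to (C, out_h, out_w); assumes divisible sizes."""
    C, H, W = x.shape
    return x.reshape(C, out_h, H // out_h, out_w, W // out_w).mean(axis=(2, 4))

def gcpa_aggregate(feats, target=(20, 20)):
    """Pool each pyramid level to a shared resolution and concatenate channels."""
    pooled = [adaptive_avg_pool(f, *target) for f in feats]
    return np.concatenate(pooled, axis=0)   # (sum of per-level channels, 20, 20)

# toy pyramid levels at strides 8, 16, and 32
s8  = np.random.randn(64, 80, 80)
s16 = np.random.randn(64, 40, 40)
s32 = np.random.randn(64, 20, 20)
ctx = gcpa_aggregate([s8, s16, s32])
print(ctx.shape)   # (192, 20, 20)
```

Pooling everything to the coarsest grid before concatenation is what lets a single tensor carry statistics from all three resolutions into the Transformer refinement stage.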
GCT-Head refines the aggregated global context through a lightweight Transformer encoder comprising stacked blocks with 4-head self-attention. Each block applies layer normalization, multi-head self-attention enabling global information aggregation, and a depth-enhanced feed-forward network introducing local inductive bias.
Refined global context effectively captures spatially correlated patterns, where neighboring regions typically share similar illumination while specular highlights and cast shadows manifest as localized discontinuities. Since the aggregated context integrates ALDF-enhanced features encoding lesion spatial priors, the refined representation G distinguishes intensity variations attributable to illumination artifacts from tissue reflectance differences attributable to pathological changes.
HACA implements pyramid-level-specific channel capacity allocation based on the observation that different resolutions serve different detection objectives. High-resolution levels (stride 8) handle precise boundary localization and small-lesion detection, requiring local detail preservation. Low-resolution levels (stride 32) support large-lesion detection and global semantic understanding, requiring comprehensive contextual information.
HACA decomposes the refined global context G through learned projection and channel-wise splitting, where K denotes the pyramid level count and each level k receives its own channel allocation. In this work, K = 2, corresponding to the stride-16 and stride-32 feature levels, with a linear scaling allocation in which each level's channel budget grows proportionally with its depth. This asymmetric allocation reflects the information density asymmetry across pyramid levels. The stride-32 level encodes global semantic context over large receptive fields, where illumination trends, lesion category semantics, and inter-class discrimination require richer channel capacity. The stride-16 level primarily relays spatially precise boundary information, which is inherently lower-dimensional and can be adequately represented with fewer channels. The linear scaling rule provides a simple heuristic: each successive pyramid level receives proportionally more channels, avoiding both the under-parameterization of semantically rich coarse levels and the over-parameterization of spatially precise fine levels. A learned channel projection produces the aggregated guidance, which is then split along the channel dimension according to these ratios.
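The linear scaling rule can be illustrated with a small Python helper; the total channel budget of 384 is an assumed example value, not the paper's configuration.

```python
def haca_allocation(total_channels, K=2):
    """Linear-scaling channel split: level k (1-indexed) gets weight proportional to k.

    Coarser (deeper) pyramid levels therefore receive more channels
    than finer ones, matching the asymmetry argued for in the text.
    """
    weights = list(range(1, K + 1))
    s = sum(weights)
    alloc = [total_channels * w // s for w in weights]
    alloc[-1] += total_channels - sum(alloc)   # hand the rounding remainder to the last level
    return alloc

print(haca_allocation(384, K=2))   # [128, 256]: stride-16 gets 1/3, stride-32 gets 2/3
```

With K = 2 the rule reduces to a fixed 1:2 split, so the coarse level always receives twice the guidance capacity of the fine level regardless of the total budget.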
ALGI enables content-adaptive local–global integration through learned fusion coefficients at each pyramid level ℓ. Given local features and pyramid-specific guidance from HACA, a dual-branch architecture processes local and global information in parallel: both branches apply projection convolutions, resolution alignment is achieved through adaptive pooling or bilinear interpolation, and a hard sigmoid activation gates the local branch.
The adaptive fusion mechanism balances the local and global branches through learned parameters. A learnable scalar parameter controls the fusion coefficient $\alpha$ through a sigmoid mapping, supporting continuous interpolation between branch contributions. The final output achieves soft switching through a convex combination of the two branches, where ⊙ denotes element-wise multiplication.
This formulation exhibits clear interpretability. When $\alpha \to 1$, fusion emphasizes the gated local features, preserving boundary details and fine textures, making it suitable for regions with stable illumination and strong local discriminability. When $\alpha \to 0$, fusion favors the global context, achieving stronger illumination normalization, making it suitable for regions with severe illumination artifacts requiring global semantic compensation. The parameter undergoes end-to-end optimization, enabling the network to discover task-optimal fusion strategies adapted to specific pyramid levels and input content characteristics, achieving a dynamic balance between detail fidelity and illumination robustness.
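A minimal NumPy sketch of the convex fusion, assuming the convention that the sigmoid-mapped coefficient weights the gated local branch; the raw parameter value and tensor shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def algi_fuse(local, global_ctx, theta):
    """Convex combination of local and global branches.

    alpha = sigmoid(theta) is a learned scalar; larger alpha keeps local
    detail, smaller alpha leans on the illumination-normalizing global
    context.
    """
    alpha = sigmoid(theta)
    return alpha * local + (1.0 - alpha) * global_ctx

local      = np.full((2, 4, 4), 1.0)   # toy local branch output
global_ctx = np.full((2, 4, 4), 3.0)   # toy global branch output
out = algi_fuse(local, global_ctx, theta=0.0)   # alpha = 0.5: exact midpoint
print(out[0, 0, 0])                             # 2.0
```

Because the coefficient is produced by a sigmoid, the output is always a convex blend, never an extrapolation, of the two branches.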
3.4. Lesion-Aware Unified Calibration and Illumination-Robust Discrimination Module (LUCID)
After lesion-focused encoding and illumination-aware hierarchical fusion, feature representations may still contain spurious activations from non-diagnostic patterns: surgical instruments such as biopsy forceps and suction catheters; imaging artifacts such as bubbles, mucus deposits, and fluid residue; specular reflections on moist mucosal surfaces; and repetitive anatomical structures such as mucosal folds and vascular patterns. These interfering patterns exhibit local textural similarity to pathological lesions while differing fundamentally in semantic context and spatial distribution characteristics.
Existing single-path feature refinement architectures struggle to simultaneously capture boundary-sensitive local details, which are crucial for precise localization, and discriminative global semantic context, which is crucial for distinguishing genuine lesions from artifacts. To address this through complementary processing pathways, we designed LUCID to implement dual-stream reciprocal modulation with progressive channel refinement for unified lesion-aware calibration.
LUCID comprises three sequentially operating mechanisms. Channel-wise stream splitting separates input features into specialized detail and context processing paths. Reciprocal gating enables bidirectional cross-stream modulation, where context generates spatial attention guiding detail stream’s spatial focus while detail generates channel attention guiding the context stream’s channel selection. Progressive channel sculpting further refines representations through spatial gating units, suppressing non-diagnostic response patterns.
The complete architecture appears in Figure 5, depicting information flow from channel splitting through reciprocal modulation to progressive refinement.
Given input features, we perform channel-wise splitting into complementary detail and context streams at a fixed channel ratio.
The detail stream constructs boundary-sensitive texture representations through cascaded convolutions with progressively expanding receptive fields, followed by a projection convolution and ReLU activation.
The context stream preserves high-level semantic abstraction and illumination-related trends with minimal spatial distortion, using a projection convolution that maintains semantic integrity while adjusting the channel dimension.
The reciprocal gating mechanism enables bidirectional information flow between the streams. The context stream generates a spatial attention map encoding per-pixel importance derived from global semantic understanding: a convolution produces the spatial gate, followed by batch normalization and a sigmoid activation that maps values into $(0, 1)$.
The detail stream generates a channel attention vector encoding channel-wise reliability derived from local texture analysis: a depthwise separable convolution captures local patterns, global average pooling aggregates spatial statistics, and a sigmoid activation produces the gate. These complementary attentions modulate the respective streams in reciprocal fashion, where ⊙ denotes element-wise multiplication with the channel attention broadcast along the spatial dimensions.
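The reciprocal gating idea can be sketched as follows; the channel-mean and global-average-pooling reductions stand in for the learned convolutional gate generators, and equal channel counts for the two streams are an assumption made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reciprocal_gate(detail, context):
    """Cross-stream modulation: context -> spatial gate, detail -> channel gate.

    Stand-ins for the learned layers: the spatial gate is a sigmoid over
    the context stream's channel mean (one weight per pixel), and the
    channel gate is a sigmoid over the detail stream's global average
    pool (one weight per channel), broadcast over space.
    """
    spatial_gate = sigmoid(context.mean(axis=0, keepdims=True))  # (1, H, W)
    channel_gate = sigmoid(detail.mean(axis=(1, 2)))             # (C,) per channel
    detail_mod  = detail * spatial_gate                # context guides *where* to look
    context_mod = context * channel_gate[:, None, None]  # detail guides *which* channels
    return detail_mod, context_mod

d = np.random.randn(8, 16, 16)   # detail stream features
c = np.random.randn(8, 16, 16)   # context stream features
dm, cm = reciprocal_gate(d, c)
print(dm.shape, cm.shape)        # (8, 16, 16) (8, 16, 16)
```

The key property illustrated is the direction of information flow: each stream is modulated by an attention signal computed from the *other* stream, not from itself.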
The modulated dual-stream features are fused through concatenation along the channel dimension, followed by a projection convolution and a residual connection.
Progressive channel sculpting further refines channel-wise responses to suppress residual non-diagnostic patterns. Intermediate features undergo layer normalization and channel expansion through convolution, followed by splitting into a main path X and a gating path V.
The spatial gating unit applies a depthwise convolution to V to capture local spatial context, a Gaussian Error Linear Unit (GELU) provides smooth nonlinearity, and ⊙ denotes element-wise multiplicative gating between the two paths.
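A NumPy sketch of the spatial gating unit, where a fixed 3×3 per-channel box filter stands in for the learned depthwise convolution and the tanh approximation replaces the exact GELU; shapes are illustrative.

```python
import numpy as np

def gelu(x):
    """tanh approximation of the Gaussian Error Linear Unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def spatial_gating_unit(h):
    """Split expanded features into main path X and gating path V, then gate.

    The learned depthwise convolution on V is replaced here by a 3x3
    box filter per channel to keep the sketch dependency-free.
    """
    C = h.shape[0]
    X, V = h[: C // 2], h[C // 2:]
    # per-channel 3x3 local averaging with "same" edge padding
    Vp = np.pad(V, ((0, 0), (1, 1), (1, 1)), mode="edge")
    Vc = sum(Vp[:, i:i + V.shape[1], j:j + V.shape[2]]
             for i in range(3) for j in range(3)) / 9.0
    return X * gelu(Vc)   # element-wise multiplicative gating

h = np.random.randn(16, 8, 8)   # expanded features: 16 channels, 8x8 grid
out = spatial_gating_unit(h)
print(out.shape)                # (8, 8, 8): half the channels survive the split
```

The gate halves the channel count by construction, which is why the preceding expansion convolution is needed to keep the module's input and output widths compatible.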
The final output integrates an output projection convolution, dropout regularization applied during training, and a residual connection.